Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 21-40 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yokokawa (US 2022/0163800).
As per claim 21 Yokokawa discloses: A computer-implemented method, comprising:
accessing an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.};
generating a vision-based pose 1108 estimation {figure 11} for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9 [0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein.};
generating a map-based pose 1112 & 1114 estimation {figure 11} for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 { [0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and
generating a final pose 1118 estimation {figure 11} for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation { [0045] The key NNs 1110 produce both two dimensional (2D) and 1D heatmaps 1112, 1114, from which keypoints 1116 are derived for altering the pose of a template hand 1118 according to the keypoints 1116. Model parameters are learned by optimizing min E(θ).}.
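For illustration only (not part of Yokokawa's disclosure or the claim mapping): the following minimal Python sketch shows one common way 2D keypoint heatmaps, such as the heatmaps 1112 quoted above, may be decoded into keypoint coordinates by taking the per-map argmax. The function name decode_keypoints and the use of NumPy are assumptions made solely for illustration.

import numpy as np

def decode_keypoints(heatmaps):
    # heatmaps: array of shape (K, H, W); one confidence map per keypoint.
    # Returns an array of shape (K, 2) holding the (x, y) pixel coordinates of each peak.
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)  # index of the peak in each map
    ys, xs = np.unravel_index(flat_idx, (H, W))        # convert flat index to row/column
    return np.stack([xs, ys], axis=1)

# Example: 21 keypoints decoded from 64x64 confidence maps
keypoints = decode_keypoints(np.random.rand(21, 64, 64))
print(keypoints.shape)  # (21, 2)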
As per claim 22 Yokokawa discloses: The method of claim 21, further comprising:
accessing a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map { [0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0,{H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.
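For illustration only (not drawn from Yokokawa): the sketch below builds K confidence heatmaps of size W0×H0 in the manner described in the passage quoted from [0047], with each map Hk peaking at the location of the kth keypoint. The Gaussian spread parameter sigma is an assumed value used solely for illustration.

import numpy as np

def make_heatmaps(keypoints, w0, h0, sigma=2.0):
    # keypoints: array of shape (K, 2) giving (x, y) locations.
    # Returns K confidence maps of shape (K, h0, w0), each peaking at its keypoint.
    xs = np.arange(w0)[None, None, :]        # (1, 1, W0) column coordinates
    ys = np.arange(h0)[None, :, None]        # (1, H0, 1) row coordinates
    kx = keypoints[:, 0][:, None, None]      # (K, 1, 1) keypoint x positions
    ky = keypoints[:, 1][:, None, None]      # (K, 1, 1) keypoint y positions
    d2 = (xs - kx) ** 2 + (ys - ky) ** 2     # squared distance to keypoint k at each pixel
    return np.exp(-d2 / (2.0 * sigma ** 2))  # confidence in [0, 1], maximal at the keypoint

# Example: two keypoints rendered into 64x64 confidence maps
heatmaps = make_heatmaps(np.array([[10.0, 12.0], [30.0, 40.0]]), w0=64, h0=64)
print(heatmaps.shape)  # (2, 64, 64)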
As per claim 23 Yokokawa discloses: The method of claim 21, wherein the map-based pose 1112 & 1114 estimation is generated using Simultaneous Localization and Mapping (SLAM) engine { [0047] “Efficient Object Localization Using Convolutional Networks”, Tompson et al., arXiv:1411.4280v3 (June, 2015) describes such an approach in which heatmaps are generated by running an image through multiple resolution banks in parallel to simultaneously capture features at a variety of scales. The output is a discrete heatmap instead of continuous regression. A heatmap predicts the probability of the joint occurring at each pixel. A multi-resolution CNN architecture (coarse heatmap model) is used to implement a sliding window detector to produce a coarse heatmap output. This is but one example heatmap technique that may be used. Note: SLAM has not been set forth with any specificity to distinguish over the applied prior art}.
As per claim 24 Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is based at least in part upon sensor data from a sensor associated with the handheld device 212 & 504-904 { [0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218.}.
As per claim 25 Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is based at least in part upon metadata associated with the image { [0034] The controller 212 may include one or more processors 220 configured to send signals from the sensors 216, 218 and control key 214 to other components using one or more network interfaces 222. The controller may also include one or more position sensors 223 such as inertial sensors, global positioning satellite sensors, accelerometers, magnetometers, gyroscopes, and combinations thereof.}.
As per claim 26 Yokokawa discloses: The method of claim 21, wherein at least one of the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 { [0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.
As per claim 27 Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is generated using a machine-learning model { [0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.
As per claim 28 Yokokawa discloses: The method of claim 27, wherein the machine-learning model comprises a neural network { [0044] The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.
As per claim 29 Yokokawa discloses: The method of claim 21, wherein the final pose 1118 estimation is used as user input { [0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.
As per claim 30 Yokokawa discloses: A system, comprising:
one or more processors 24; and
one or more memories 28 coupled to at least one of the one or more processors 24, wherein the one or more memories 28 comprise computer-readable program instructions, which when executed by at least one of the one or more processors 24, cause the system to:
access an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.};
generate a vision-based pose 1108 estimation for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9};
generate a map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 { [0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and
generate a final pose 1118 estimation for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation {[0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein. The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.
As per claim 31 Yokokawa discloses: The system of claim 30, wherein the instructions, which when executed by at least one of the one or more processors 24, cause the system to:
access a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map { [0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0,{H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.
As per claim 32 Yokokawa discloses: The system of claim 30, wherein the vision-based pose 1108 estimation is based at least in part upon sensor data from a sensor associated with the handheld device 212 & 504-904 { [0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218.}.
As per claim 33 Yokokawa discloses: The system of claim 30, wherein at least one of the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 { [0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.
As per claim 34 Yokokawa discloses: The system of claim 30, wherein the vision-based pose 1108 estimation is generated using a machine-learning model { [0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.
As per claim 35 Yokokawa discloses: The system of claim 30, wherein the final pose 1118 estimation is used as user input { [0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.
As per claim 36 Yokokawa discloses: A non-transitory computer-readable storage medium including computer- readable instructions embodied therein, which when executed by one or more processors 24, cause a computer system to:
access an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.};
generate a vision-based pose 1108 estimation for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9};
generate a map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 { [0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and
generate a final pose 1118 estimation for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation {[0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein. The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.
As per claim 37 Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the instructions, which when executed by the one or more processors 24, cause the computer system to:
access a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map { [0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0,{H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.
As per claim 38 Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 { [0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.
As per claim 39 Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the vision-based pose 1108 estimation is generated using a machine-learning model { [0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.
As per claim 40 Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the final pose 1118 estimation is used as user input { [0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.
Response to Arguments
Applicant's arguments filed October 8, 2025, have been fully considered, but they are not persuasive. Applicant asserts in the paragraph bridging pages 3 and 4 the following:
Applicant asserts “cameras 208” of “HMD 200”, as recited by Yokokawa, cannot reasonably be understood to teach or suggest “one or more images captured using a second camera associated with the handheld device,” as recited by Claim 21. Therefore, independent Claim 21 is believed to be in condition for allowance. Reconsideration and withdrawal of the rejections are respectfully requested.
Camera 208 does capture the images as claimed, as shown in figure 3 and described in [0036] of the applied prior art, which discloses: “An example of such logic is illustrated in FIG. 3 and may be executed by any processor or combination of processors shown herein. Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.”
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID D DAVIS whose telephone number is (571)272-7572. The examiner can normally be reached Monday - Friday, 8 a.m. - 4 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ke Xiao, can be reached at 571-272-7776. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID D DAVIS/Primary Examiner, Art Unit 2627
DDD