Prosecution Insights
Last updated: April 19, 2026
Application No. 18/486,980

POSE ESTIMATION FOR A HANDHELD DEVICE

Status: Final Rejection (§102)
Filed: Oct 13, 2023
Examiner: DAVIS, DAVID DONALD
Art Unit: 2627
Tech Center: 2600 — Communications
Assignee: Meta Platforms Technologies, LLC
OA Round: 3 (Final)

Grant Probability: 70% (Favorable); 79% with interview
Expected OA Rounds: 4-5
Expected Time to Grant: 3y 2m

Examiner Intelligence

Career Allow Rate: 70%, above average (631 granted / 900 resolved; +8.1% vs TC avg)
Interview Lift: +9.1% (moderate lift, measured over resolved cases with interview)
Typical Timeline: 3y 2m average prosecution
Career History: 941 total applications across all art units; 41 currently pending
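The headline figures above fit a simple percentage-point calculation. The dashboard's exact methodology is not published; the sketch below only shows that the displayed numbers (631 granted of 900 resolved, a 70% base rate, and a +9.1 point interview lift) are mutually consistent.

```python
# Illustrative reconstruction of the headline examiner statistics.
# The dashboard's actual methodology is not disclosed; this merely
# checks that the displayed figures are internally consistent.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allow rate as a percentage, rounded to one decimal."""
    return round(100.0 * granted / resolved, 1)

def with_interview(base_pct: float, lift_pct_points: float) -> float:
    """Grant probability after adding the interview lift (in percentage points)."""
    return round(base_pct + lift_pct_points, 1)

base = allow_rate(631, 900)            # 631 of 900 resolved -> 70.1%
projected = with_interview(70.0, 9.1)  # displayed base 70% + 9.1pp -> 79.1%
```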

Statute-Specific Performance

§101: 1.2% (-38.8% vs TC avg)
§102: 40.8% (+0.8% vs TC avg)
§103: 41.6% (+1.6% vs TC avg)
§112: 10.6% (-29.4% vs TC avg)
Deltas are relative to a Tech Center average estimate • Based on career data from 900 resolved cases
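As a sanity check on the statute panel, the "vs TC avg" deltas can be inverted to recover the implied Tech Center average per statute. The percentages are taken from the panel; the subtraction is the only logic being illustrated, and the underlying methodology is an assumption.

```python
# Recover the implied Tech Center average for each statute from the
# examiner's statute-specific rates and the displayed deltas.
# Figures are from the panel above; the arithmetic is illustrative only.

examiner = {"101": 1.2, "102": 40.8, "103": 41.6, "112": 10.6}
deltas   = {"101": -38.8, "102": 0.8, "103": 1.6, "112": -29.4}

tc_avg = {s: round(examiner[s] - deltas[s], 1) for s in examiner}
# Every statute implies the same Tech Center average, 40.0%,
# suggesting the panel compares against a single pooled estimate.
```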

Office Action (§102)
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 21-40 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yokokawa (US 2022/0163800).

As per claim 21, Yokokawa discloses: A computer-implemented method, comprising: accessing an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.}; generating a vision-based pose 1108 estimation {figure 11} for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9; [0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein.}; generating a map-based pose 1112 & 1114 estimation {figure 11} for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 {[0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and generating a final pose 1118 estimation {figure 11} for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation {[0045] The key NNs 1110 produce both two dimensional (2D) and 1D heatmaps 1112, 1114, from which keypoints 1116 are derived for altering the pose of a template hand 1118 according to the keypoints 1116. Model parameters are learned by optimizing min E(θ).}.

As per claim 22, Yokokawa discloses: The method of claim 21, further comprising: accessing a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map {[0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0, {H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.
As per claim 23, Yokokawa discloses: The method of claim 21, wherein the map-based pose 1112 & 1114 estimation is generated using Simultaneous Localization and Mapping (SLAM) engine {[0047] “Efficient Object Localization Using Convolutional Networks”, Tompson et al., arXiv:1411.4280v3 (June, 2015) describes such an approach in which heatmaps are generated by running an image through multiple resolution banks in parallel to simultaneously capture features at a variety of scales. The output is a discrete heatmap instead of continuous regression. A heatmap predicts the probability of the joint occurring at each pixel. A multi-resolution CNN architecture (coarse heatmap model) is used to implement a sliding window detector to produce a coarse heatmap output. This is but one example heatmap technique that may be used. Note: SLAM has not been set forth with any specificity to distinguish over the applied prior art}.

As per claim 24, Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is based at least in part upon sensor data from a sensor associated with the handheld device 212 & 504-904 {[0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218.}.

As per claim 25, Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is based at least in part upon metadata associated with the image {[0034] The controller 212 may include one or more processors 220 configured to send signals from the sensors 216, 218 and control key 214 to other components using one or more network interfaces 222. The controller may also include one or more position sensors 223 such as inertial sensors, global positioning satellite sensors, accelerometers, magnetometers, gyroscopes, and combinations thereof.}.

As per claim 26, Yokokawa discloses: The method of claim 21, wherein at least one of the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 {[0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.

As per claim 27, Yokokawa discloses: The method of claim 21, wherein the vision-based pose 1108 estimation is generated using a machine-learning model {[0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.

As per claim 28, Yokokawa discloses: The method of claim 27, wherein the machine-learning model comprises a neural network {[0044] The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.

As per claim 29, Yokokawa discloses: The method of claim 21, wherein the final pose 1118 estimation is used as user input {[0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.
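The fusion step recited in claim 21 (a vision-based pose estimate and a map-based pose estimate combined into a final pose) is not spelled out as an algorithm in the quoted passages. Purely as an editorial illustration, the sketch below uses a hypothetical confidence-weighted blend of the translation component of two 6DoF estimates; real systems would also fuse orientation (typically via quaternion interpolation), which is omitted for brevity. All names here are invented for the sketch, not taken from either document.

```python
# Hypothetical sketch of the claimed fusion: blend a vision-based and a
# map-based pose estimate into a final pose. Neither the application nor
# Yokokawa discloses the actual rule; this confidence-weighted average
# of the translation component is for illustration only.

from dataclasses import dataclass

@dataclass
class PoseEstimate:
    x: float
    y: float
    z: float
    confidence: float  # 0..1, e.g. the estimator's own score

def fuse(vision: PoseEstimate, map_based: PoseEstimate) -> PoseEstimate:
    """Confidence-weighted blend of two translation estimates."""
    w = vision.confidence + map_based.confidence
    blend = lambda a, b: (a * vision.confidence + b * map_based.confidence) / w
    return PoseEstimate(
        blend(vision.x, map_based.x),
        blend(vision.y, map_based.y),
        blend(vision.z, map_based.z),
        confidence=max(vision.confidence, map_based.confidence),
    )
```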
As per claim 30, Yokokawa discloses: A system, comprising: one or more processors 24; and one or more memories 28 coupled to at least one of the one or more processors 24, wherein the one or more memories 28 comprise computer-readable program instructions, which when executed by at least one of the one or more processors 24, cause the system to: access an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.}; generate a vision-based pose 1108 estimation for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9}; generate a map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 {[0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and generate a final pose 1118 estimation for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation {[0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein. The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.

As per claim 31, Yokokawa discloses: The system of claim 30, wherein the instructions, which when executed by at least one of the one or more processors 24, cause the system to: access a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map {[0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0, {H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.

As per claim 32, Yokokawa discloses: The system of claim 30, wherein the vision-based pose 1108 estimation is based at least in part upon sensor data from a sensor associated with the handheld device 212 & 504-904 {[0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218.}.

As per claim 33, Yokokawa discloses: The system of claim 30, wherein at least one of the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 {[0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.

As per claim 34, Yokokawa discloses: The system of claim 30, wherein the vision-based pose 1108 estimation is generated using a machine-learning model {[0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.

As per claim 35, Yokokawa discloses: The system of claim 30, wherein the final pose 1118 estimation is used as user input {[0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.

As per claim 36, Yokokawa discloses: A non-transitory computer-readable storage medium including computer-readable instructions embodied therein, which when executed by one or more processors 24, cause a computer system to: access an image comprising information associated with a hand 502-902 of a user or a handheld device 212 & 504-904, wherein the image is captured by a first camera 208 associated with a headset 200 {[0033] The HMD 200 also may include one or more outward-oriented cameras 208 for imaging objects such as hands of the wearer of the HMD 200.}; generate a vision-based pose 1108 estimation for the handheld device 212 & 504-904 by processing the image {figure 3:306 & figures 5-9}; generate a map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 based at least in part upon one or more images captured using a second camera 208 associated with the handheld device 212 & 504-904 {[0036] Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.}; and generate a final pose 1118 estimation for the handheld device 212 & 504-904 based on the vision-based pose 1108 estimation and the map-based pose 1112 & 1114 estimation {[0044] FIG. 11 illustrates an example ML module or engine 1100 that may be used in which initial hand detection 1102 need not be used. Instead, as described previously left and right keypoint estimation stages 1104, 1106 (details of the right stage 1104 only shown for clarity) may receive multiple images 1108 of a hand holding a controller, cropped if desired and up-res′d using super-resolution if desired according to principles discussed elsewhere herein. The images 1108 may be processed through key neural networks 1110, such as but not limited to convolutional neural networks (CNN).}.

As per claim 37, Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the instructions, which when executed by the one or more processors 24, cause the computer system to: access a first set of images of the one or more images captured using the second camera 208 to build a map of a portion of an environment, wherein the map-based pose 1112 & 1114 estimation for the handheld device 212 & 504-904 is based at least in part upon the map {[0047] With respect to the example heat map technique discussed herein, in one non-limiting implementation, K heatmaps of size W0×H0, {H1, H2, . . . , Hk} may be estimated, where each heatmap Hk indicates the location confidence of the kth keypoint of the virtual hand to be rendered. (K keypoints in total).}.
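The heatmap technique quoted from Yokokawa's [0047] estimates K heatmaps of size W0×H0, where heatmap Hk gives the per-pixel location confidence of the kth keypoint; each keypoint is then recovered as the highest-confidence pixel. A minimal stdlib-only sketch of that argmax step (the function name and list-of-lists representation are editorial assumptions, not from the reference):

```python
# Minimal sketch of keypoint extraction from confidence heatmaps, per the
# technique quoted from Yokokawa [0047]: K heatmaps of size W0 x H0, where
# heatmap k holds the per-pixel confidence of keypoint k. Each keypoint
# is recovered as the argmax pixel of its heatmap.

def keypoints_from_heatmaps(heatmaps):
    """Return [(row, col), ...]: the argmax location of each heatmap."""
    points = []
    for hm in heatmaps:  # hm is a 2D list: H0 rows of W0 confidences
        best = max(
            ((r, c) for r in range(len(hm)) for c in range(len(hm[0]))),
            key=lambda rc: hm[rc[0]][rc[1]],
        )
        points.append(best)
    return points
```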
As per claim 38, Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the vision-based pose 1108 estimation, the map-based pose 1112 & 1114 estimation, or the final pose 1118 estimation comprises information for six degrees of freedom (6DoF) associated with the handheld device 212 & 504-904 {[0003] Tracking a hand based on sensors on a controller can yield “dead zones” for parts of the hands that are not located near a sensor and for parts of the hand such as the thumb that can assume a wide degree of freedom of movement. Note: it is seen that the handheld device has six degrees of freedom}.

As per claim 39, Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the vision-based pose 1108 estimation is generated using a machine-learning model {[0041] Indeed, FIG. 3 illustrates that the cropped region of the controller with hand may be input to a ML module at block 308, with corresponding touch signals from the controller sensors 216, 218 generated at the same time the image was generated being input to the ML module at block 310. The ML module uses both the sensor signals and controller/hand image to output at block 312 a virtual image of a complete hand in the same pose as it is in grasping the controller in the cropped region generated at block 306. The virtual image is presented on a display such as the HMD 200 at block 314.}.

As per claim 40, Yokokawa discloses: The non-transitory computer-readable storage medium of claim 36, wherein the final pose 1118 estimation is used as user input {[0038] Proceeding to block 308, the cropped image can be analyzed to determine the pose of the hand based on both the image and the signals from the controller sensors 216, 218. & Figure 11}.

Response to Arguments

Applicant's arguments filed October 8, 2025 have been fully considered but they are not persuasive.
Applicant asserts in the paragraph bridging pages 3 and 4 the following: Applicant asserts “cameras 208” of “HMD 200”, as recited by Yokokawa, cannot be reasonably understood to teach or suggest “one or more images captured using a second camera associated with the handheld device,” as recited by Claim 21. Therefore, independent Claim 21 is believed to be in condition for allowance. Reconsideration and withdrawal of the rejections are respectfully requested.

Camera 208 does capture the images as claimed, as shown in figure 3 and described in [0036] of the applied prior art, which discloses: “An example of such logic is illustrated in FIG. 3 and may be executed by any processor or combination of processors shown herein. Commencing at block 300 an image is received from, e.g., the camera 208 of a controller 212 that may be held by a human hand.”

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID D DAVIS whose telephone number is (571) 272-7572. The examiner can normally be reached Monday - Friday, 8 a.m. - 4 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Ke Xiao, can be reached at 571-272-7776. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DAVID D DAVIS/
Primary Examiner, Art Unit 2627

Prosecution Timeline

Oct 13, 2023: Application Filed
Aug 21, 2024: Non-Final Rejection — §102
Nov 08, 2024: Interview Requested
Nov 20, 2024: Applicant Interview (Telephonic)
Nov 21, 2024: Examiner Interview Summary
Nov 26, 2024: Response Filed
Apr 14, 2025: Request for Continued Examination
Apr 16, 2025: Response after Non-Final Action
Jul 09, 2025: Non-Final Rejection — §102
Oct 03, 2025: Applicant Interview (Telephonic)
Oct 03, 2025: Examiner Interview Summary
Oct 08, 2025: Response Filed
Jan 12, 2026: Final Rejection — §102
Apr 07, 2026: Applicant Interview (Telephonic)
Apr 07, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602106: Ambience-Driven User Experience (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602128: Display Device Having Pixel Drive Circuits and Sensor Drive Circuits (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602121: Touch Device for Passive Resonant Stylus, Driving Method for the Same and Touch System (granted Apr 14, 2026; 2y 5m to grant)
Patent 12596265: Aiming Device with a Diffractive Optical Element and Reflective Image Combiner (granted Apr 07, 2026; 2y 5m to grant)
Patent 12592178: Display Device Including an Electrostatic Discharge Circuit for Discharging Static Electricity (granted Mar 31, 2026; 2y 5m to grant)
Based on this examiner's 5 most recent grants; study what changed in each case to get it allowed.


Prosecution Projections

Expected OA Rounds: 4-5
Grant Probability: 70% (79% with interview, +9.1%)
Median Time to Grant: 3y 2m
PTA Risk: High

Based on 900 resolved cases by this examiner. Grant probability derived from career allow rate.
