Prosecution Insights
Last updated: April 19, 2026
Application No. 18/179,784

DEPTH ESTIMATION FROM RGB IMAGES

Non-Final OA (§102, §103, §112)
Filed: Mar 07, 2023
Examiner: SORRIN, AARON JOSEPH
Art Unit: 2672
Tech Center: 2600 — Communications
Assignee: Snap Inc.
OA Round: 3 (Non-Final)
Grant Probability: 74% (Favorable)
OA Rounds: 3-4
To Grant: 3y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 74% — above average (46 granted / 62 resolved; +12.2% vs TC avg)
Interview Lift: +50.6% — strong (resolved cases with interview)
Avg Prosecution: 3y 5m typical timeline; 22 applications currently pending
Total Applications: 84 career applications across all art units
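The headline numbers in this panel are simple ratios over the examiner's career counts (the projections section below notes that grant probability is derived from the career allow rate). The following is a minimal sketch of that arithmetic, assuming the "vs TC avg" delta is a difference in percentage points; the dashboard does not state its exact methodology, so the implied Tech Center average is an inference, not a reported figure.

granted, resolved = 46, 62
career_allow_rate = granted / resolved            # 0.7419... -> displayed as 74%
implied_tc_average = career_allow_rate - 0.122    # from "+12.2% vs TC avg" (assumption)
print(f"allow rate {career_allow_rate:.1%}, implied TC average {implied_tc_average:.1%}")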

Statute-Specific Performance

§101: 20.4% (-19.6% vs TC avg)
§103: 35.6% (-4.4% vs TC avg)
§102: 14.1% (-25.9% vs TC avg)
§112: 29.3% (-10.7% vs TC avg)
Tech Center average is an estimate • Based on career data from 62 resolved cases

Office Action

§102 §103 §112
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Response to Arguments Applicant's arguments filed 2/4/26 have been fully considered but they are not persuasive. In response to applicant's arguments on Pages 7-8 of Remarks that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., “the respective depth value being a depth from a specified landmark of the hand,”) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). On Pages 7-8 of Remarks, Applicant argues that the cited references do not expressly disclose “a respective depth value for each foreground pixel of the image data corresponding to the hand”. Ge does disclose this limitation. According to Figure 6 and Paragraph 69, Ge estimates depth data for vertices of the hand, wherein the vertices inherently are composed of a set of respective pixels. These pixels are mapped to the foreground pixels. Depth data is estimated for the hand, thus the hand is processed as a foreground of the image. Accordingly, Ge estimates respective depth values for foreground pixels of image data of the hand, as claimed. Note that “each foreground pixel” lacks antecedent basis and is interpreted in accordance with the 35 U.S.C. 112(b) rejection below. Applicant’s argument on Page 8 of Remarks that claims 2-7, 9-14, and 16-20 are allowable as dependent on claims 1, 8, and 15 is not persuasive because the independent claims are not allowable for the reasons outlined above. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims 1, 8, and 15 recite the limitation “each foreground pixel”. There is insufficient antecedent basis for this limitation in the claim. This is being interpreted as a plurality or set of foreground pixels. Claims 2-7, 9-14, and 16-20 are rejected as dependent on the above claims. The following is a quotation of 35 U.S.C. 112(d): (d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers. The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph: Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C.
112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers. Claims 3, 10, and 17 are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. The independent claims on which these claims depend already describe image data that inherently comprises pixels, and respective depth values for pixels. See the following limitation: “estimated depth data comprising a respective depth value for each foreground pixel of the image data corresponding to the hand” in the independent claims. Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements. Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1, 3, 8, 10, 15, and 17 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ge (US 20200184721 A1). Regarding claim 1, Ge teaches, “A computer-implemented method comprising: capturing, by one or more processors using a camera of an eXtended Reality (XR) system, image data of a hand in a real-world scene;” (Ge, Paragraph 16, “The disclosed embodiments improve the efficiency of using the electronic device by applying machine learning techniques, including a graph CNN, to generate a 3D hand mesh for presentation in a VR or AR application. The 3D hand mesh is generated directly from a single RGB image depicting a real-world hand and represents the pose (e.g., joint locations) and shape (e.g., surface features) of the hand that is depicted in the RGB image. The disclosed embodiments generate the 3D hand mesh without also obtaining a depth map of the real-world hand. Specifically, according to the disclosed techniques, image features of a single RGB image are extracted by one or more machine learning techniques and then graph convolutions (e.g., using a graph CNN) are applied hierarchically with upsampling and nonlinear activations to generate 3D hand mesh vertices. According to the disclosed embodiments, the generated 3D hand mesh can better represent highly variable 3D hand shapes (e.g., surface features) and their local details. 
This enables a user device with a simple RGB camera (without a depth camera) to accurately and quickly render an animated 3D hand model of the real-world user's hand within the VR or AR application, allowing the user to interact with the VR or AR content in a more realistic environment.“) “generating, by the one or more processors, using the image data and a depth estimation model trained using synthetic 2D image data, estimated depth data comprising a respective depth value for each foreground pixel of the image data corresponding to the hand, the estimated depth data defining a 3D point cloud of the hand in the real-world scene using the image data and a depth estimation model trained using synthetic 2D image data;” (Ge, Figure 6 element 603 describes the estimation of depth data of the real hand (3D hand mesh). The depth estimation model trained using synthetic 2D image data is disclosed in Figure 7 and Paragraph 83, “At operation 701, the hand shape and pose estimation system 124 obtains a first plurality of input images that include synthetic representations of a hand. For example, machine learning techniques network 410 initially receives synthetic hand training image data 402. An illustrative synthetic hand training image data 801 and its corresponding output is shown in a first row 810 of FIG. 8.” Further, see Paragraph 69 for description of 3D coordinates of vertices in the hand mesh: “An embodiment of the graph CNN module 418 is shown in FIG. 5. Particularly, the graph CNN module 418 generates 3D coordinates of vertices in the hand mesh and estimates the 3D hand pose from the mesh. In this way, the graph CNN module 418 models, based on features extracted by other machine learning technique modules of FIG. 5, a post of a hand depicted in a monocular image by adjusting skeletal joint positions of a 3D hand mesh and also models a shape of the hand in the monocular image by adjusting blend shape values of the 3D hand mesh representing surface features of the hand depicted in the monocular image. The resulting 3D hand mesh is then generated for display.” The vertices of the hand (wherein the hand amounts to the foreground), for which respective depth values are estimated, inherently contain pixels. These pixels are mapped to the foreground pixels for which depth values are estimated. Accordingly, Ge discloses respective depth values for the foreground pixels of the image corresponding to the hand.) “generating, by the one or more processors, an XR effect using the estimated depth data and the image data; and providing, by the one or more processors, the XR effect to a user in a user interface.” (Ge, Figure 6, elements 604 and 605). Regarding claim 3, Ge teaches, “The computer-implemented method of claim 1,” “wherein the image data of the hand comprises a set of pixels, and wherein the estimated depth data comprises a respective depth for each pixel of the set of pixels.” (Ge, Paragraph 16, “The disclosed embodiments improve the efficiency of using the electronic device by applying machine learning techniques, including a graph CNN, to generate a 3D hand mesh for presentation in a VR or AR application. The 3D hand mesh is generated directly from a single RGB image depicting a real-world hand and represents the pose (e.g., joint locations) and shape (e.g., surface features) of the hand that is depicted in the RGB image. The disclosed embodiments generate the 3D hand mesh without also obtaining a depth map of the real-world hand. 
Specifically, according to the disclosed techniques, image features of a single RGB image are extracted by one or more machine learning techniques and then graph convolutions (e.g., using a graph CNN) are applied hierarchically with upsampling and nonlinear activations to generate 3D hand mesh vertices. According to the disclosed embodiments, the generated 3D hand mesh can better represent highly variable 3D hand shapes (e.g., surface features) and their local details. This enables a user device with a simple RGB camera (without a depth camera) to accurately and quickly render an animated 3D hand model of the real-world user's hand within the VR or AR application, allowing the user to interact with the VR or AR content in a more realistic environment.” The input RGB image inherently has a set of pixels, which are converted to the 3D hand mesh (depth data).) Regarding claims 8, 10, 15, and 17, claims 8 and 10 recite a system, with one or more processors and a memory, with elements corresponding to the steps recited in claims 1 and 3. Therefore, the recited elements of these claims are mapped to the analogous steps in the corresponding method claims. Claims 15 and 17 recite a non-transitory machine-readable storage medium storing a program with instructions corresponding to the steps recited in claims 1 and 3. Therefore, the recited programming instructions are mapped to the analogous steps in the corresponding method claims. Additionally, Ge teaches a system with one or more processors and a memory, and a non-transitory machine-readable storage medium (Ge, Paragraph 109, ““MACHINE-READABLE MEDIUM” in this context refers to a component, device, or other tangible media able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Erasable Programmable Read-Only Memory (EEPROM)) and/or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., code) for execution by a machine, such that the instructions, when executed by one or more processors of the machine, cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.”) Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 2, 9, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ge in view of Xia (CN107992858A) Regarding claim 2, Ge teaches, “The computer-implemented method of claim 1,” Ge does not expressly disclose, “wherein generating the estimated depth data comprises: determining, by the one or more processors, cropping boundary data using the image data and a detection model; and cropping, by the one or more processors, the image data using the cropping boundary data.” Xia discloses “wherein generating the estimated depth data comprises: determining, by the one or more processors, cropping boundary data using the image data and a detection model; and cropping, by the one or more processors, the image data using the cropping boundary data.” (Xia, 2nd to last Paragraph of Page 2, and second Paragraph of page 3, “wherein the three-dimensional gesture estimation method of real-time, firstly as input color image, using the most advanced detector to locate the hand for each hand image in the input, hand detector generating a likelihood estimation value and the hand bounding box coordinates, pre-training network running OpenPose (body language recognition system) to estimate the two-dimensional position of the hand joint to generate hand 2D joint position of the hotspot graph, finally, nonlinear minimization of hand fitting the 3D model to the estimated 2D joint position. recovering the 3D hand gesture.”; “wherein, said 2D joint estimation according to the detected boundary frame cropping the image, the 2D fed to the key point detector, because training only on the left hand key point detector, passes along the Y-axis mirror image processing right, the output of the detector is 21 hotspot graph, corresponding to the 20 key points (four) per finger and a wrist point estimate.”) It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to use the detection model of Xia to generate the bounding box and perform the cropping of the images of Ge using the processor of Ge. The motivation for doing so would have been to isolate the hand from an image which contains non-hand objects to avoid inadvertent image processing on the non-hand objects. Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Ge with the above bounding box generation and cropping of Xia to obtain the invention of claim 2. Regarding claims 9 and 16, claim 9 recites a system, with one or more processors and a memory, with elements corresponding to the steps recited in claim 2. Therefore, the recited elements of this claim are mapped to the analogous steps in the corresponding method claim. 
Claim 16 recites a non-transitory machine-readable storage medium storing a program with instructions corresponding to the steps recited in claim 2. Therefore, the recited programming instructions are mapped to the analogous steps in the corresponding method claim. The rationale and motivation to combine the Ge and Xia references apply here. Additionally, Ge teaches a system with one or more processors and a memory, and a non-transitory machine-readable storage medium (Ge, Paragraph 109, ““MACHINE-READABLE MEDIUM” in this context refers to a component, device, or other tangible media able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Erasable Programmable Read-Only Memory (EEPROM)) and/or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., code) for execution by a machine, such that the instructions, when executed by one or more processors of the machine, cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.”) Claim(s) 4-7, 11-14, and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ge in view of Guay (US20210374993A1) further in view of Hushchyn (US20180300927A1) Regarding claim 4, Ge teaches, “The computer-implemented method of claim 1,” While Ge teaches that training the depth estimation model comprises receiving 3D data of a measured hand;” (Ge, Paragraph 41, “In some implementations, the second plurality of images, stored in synthetic and real hand training images 209, includes hand RGB images with the corresponding depth images of each RGB image captured using a RGB-D camera (e.g., an RGB camera fitted with a depth sensor or depth camera.)”); and Ge teaches training “the depth estimation model using the synthetic 2D image data and the target depth data”, (Ge, Paragraph 86, “At operation 704, the hand shape and pose estimation system 124 trains the first and second machine learning techniques together with a graph CNN based on the first plurality of input images. For example, for a given image from the synthetic hand training image data 402, ground truth hand mesh information is obtained and provided to hand mesh loss module 420. The graph CNN module 418 estimates a hand mesh for the given image and is trained to minimize the loss computed by hand mesh loss module 420. The machine learning techniques network 410 are trained to minimize the loss together in accordance with Equation 17.”), Ge does not expressly disclose using the 3D data of the measured hand to generate both the synthetic 2D image data and target depth data that are used for training, and the use of a second processor. 
Guay teaches the generation of 3D model data using measured 3D data (Guay, Paragraph 43, “As shown, the inference subsystem 104 inputs the input image 188 into the pose estimation model 160 and, in response, the pose estimation model 160 outputs the estimated 3D pose 190. The estimated 3D pose 190 is an estimate of the 3D pose of the costumed person 122(2) as depicted in the input image 188. The estimated 3D pose 190 specifies explicitly or implicitly, in any technically feasible fashion, and without limitation, a set of 3D positions associated with a set of joints. In some embodiments, the set of 3D positions estimates positions and orientations associated with a set of physical joints of the costumed person 122(2).”; generating synthetic 2D images using the generated 3D model data (Guay, Paragraph 93, “As shown, a method 600 begins a step 602, where the training application 140 acquires the pose dataset 130 and the 3D model 124 of the costumed person 122(1). At step 604, for each of the 3D poses 132 included in the pose dataset 130, the variation engine 240 generates any number of the image recipes 262 based on the synthesis variation list 134. At step 606, the synthesis/rendering engine 270 generates the synthetic images 278 based on the image recipes 262 and the 3D model 124.”); and generating target depth data by pairing depths with the synthetic 2D images using the synthetic 2D image data and the 3D model data; (Guay, Paragraphs 94 and 98, “At step 608, the synthesis/rendering engine 270 generates the synthetic training dataset 280 based on the synthetic images 278 and the associated 3D poses 132. At step 610, the training application 140 trains the machine-learning model 290 based on the synthetic training dataset 280 to generate the pose estimation model 160. At step 612, the training application 140 transmits the pose estimation model 160 to the inference subsystem 104.”; “During a training phase, the variation engine generates multiple variants based on the list of synthesis variations. Each variant specifies a different combination of body changes, lighting, camera viewpoint, and texture. For each of the 3D poses, the variation engine generates multiple image recipes, where each image recipe specifies the 3D pose and a different variant. For each of the image recipes, the synthesis/rendering engine optionally modifies the 3D model as per any specified body changes and then fits the associated 3D pose to either the 3D model or the modified 3D model to generate a posed 3D model. For each of the image recipes, the synthesis/rendering engine renders the associated posed 3D model based on the specified lighting, camera viewport, and texture to generate a synthetic image. For each synthetic image, the synthesis/rendering engine generates a training item that includes the synthetic image and the associated 3D pose. Together, the training items form a synthetic training dataset. 
Subsequently, the training application trains the machine-learning model based on the synthetic training database to generate the pose estimation model.”) Hushchyn teaches using a second processor for training (Hushchyn, Paragraph 11, “In some embodiments, the training of the at least one face detection regressor is performed by at least one second processor; and where the at least one second processor is distinct from the at least one first processor of the portable electronic device and is remotely located from the portable electronic device.”) It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to generate the synthetic images of Ge from the 3D hand images of Ge using the above described process of Guay, including 3D model generation from the 3D image and synthetic image generation from the 3D model. It would have further been obvious to incorporate these synthetic images paired with associated 3D pose information, performed by Guay above, as the training data of Ge. Additionally, it would have been obvious to use a second processor, taught by Hushchyn, for the training method of Ge in view of Guay. The motivation for incorporating Guay would have been to generate more accurate training data using actual images, rather than fully computer-generated images, and further train with the more accurate training data for a more accurate depth estimation model. The motivation for incorporating Hushchyn would have been to enable parallel processing for simultaneous training and model generation. Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Ge with the above teachings of Guay and Hushchyn to obtain the above limitations. Accordingly, Ge in view of Guay further in view of Hushchyn fully disclose “receiving, by second one or more processors, 3D data of a measured hand; generating, by the second one or more processors, 3D model data of the measured hand using the 3D data; generating, by the second one or more processors, synthetic 2D image data comprising one or more synthetic 2D images using the 3D model data; generating, by the second one or more processors, target depth data comprising one or more sets of depths paired to the one or more synthetic 2D images using the synthetic 2D image data and the 3D model data; training, by the second one or more processors, the depth estimation model using the synthetic 2D image data and the target depth data.” Regarding claim 5, Ge in view of Guay further in view of Hushchyn teach “The computer-implemented method of claim 4,” “wherein generating the synthetic 2D image data comprises using camera and lighting parameter data.” (Ge, Paragraph 40, In some implementations, the first plurality of images (also referred to as synthetic images), stored in synthetic and real hand training images 209, provides the labels of both 3D hand joint locations and full 3D hand meshes. A 3D hand model is generated, rigged with joints, and then photorealistic textures are applied on the 3D hand model as well as natural lighting using high-dynamic range (HDR) images. The variations of the hand are modeled by creating blend shapes with different shapes and ratios, and then random weights are applied to the blend shapes. 
Hand poses from 500 common hand gestures and 1000 unique camera viewpoints are created and captured in the first plurality of images. To simulate real-world diversity, 30 lightings and five skin colors are used. The hand is rendered using global illumination. In some implementations, the first plurality of images includes 375,000 hand RGB images with large variations. In some embodiments, only a portion (e.g., 315,000) of the first plurality of images are used in the first training phase to train the machine learning techniques. During training or before, each rendered hand in the first plurality of images is cropped from the image and blended with a randomly selected background image (e.g., a city image, a living room image, or any other suitable image obtained randomly or pseudo-randomly from a background image server(s)).”) Regarding claim 6, Ge in view of Guay further in view of Hushchyn teach “The computer-implemented method of claim 5,” Ge in view of Guay further in view of Hushchyn do not expressly disclose “wherein the camera and lighting parameter data comprise randomized values.” Guay further discloses, “wherein the camera and lighting parameter data comprise randomized values.” (Guay, Paragraph 73, “In some alternate embodiments, the training application 140 generates the synthetic training dataset 280 in any technically feasible fashion based on the 3D model 124 and any number of 3D poses 132. In the same or other alternate embodiments, the training application 140 ensures that the synthetic training dataset 280 represents variations in any number and/or types of depicted objects (e.g., the costumed person 122) and/or conditions in any technically feasible fashion. For instance, in some alternate embodiments, the training application 140 directly and/or indirectly performs any number of domain randomization operations to create variations in body shape, lighting, camera viewpoint, and texture across the synthetic images 278 and therefore across the synthetic training dataset 280.”) It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to perform the randomization of camera and lighting parameters, further taught by Guay, into the synthetic image generation of Ge in view of Guay further in view of Hushchyn. The motivation for doing so would have been to train the model with a greater variety of training images for enhanced performance in a greater variety of real world circumstances. Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Ge in view of Guay further in view of Hushchyn with the above additional teaching of Guay to achieve the invention of claim 6. 
Regarding claim 7, Ge in view of Guay further in view of Hushchyn teach “The computer-implemented method of claim 4,” “wherein training the depth estimation model further comprises: determining, by the second one or more processors, cropping boundary data using the synthetic 2D image data and a detection model; and cropping, by the second one or more processors, the synthetic 2D image data using the cropping boundary data.” (Ge, Paragraph 40, “In some implementations, the first plurality of images (also referred to as synthetic images), stored in synthetic and real hand training images 209, provides the labels of both 3D hand joint locations and full 3D hand meshes. A 3D hand model is generated, rigged with joints, and then photorealistic textures are applied on the 3D hand model as well as natural lighting using high-dynamic range (HDR) images. The variations of the hand are modeled by creating blend shapes with different shapes and ratios, and then random weights are applied to the blend shapes. Hand poses from 500 common hand gestures and 1000 unique camera viewpoints are created and captured in the first plurality of images. To simulate real-world diversity, 30 lightings and five skin colors are used. The hand is rendered using global illumination. In some implementations, the first plurality of images includes 375,000 hand RGB images with large variations. In some embodiments, only a portion (e.g., 315,000) of the first plurality of images are used in the first training phase to train the machine learning techniques. During training or before, each rendered hand in the first plurality of images is cropped from the image and blended with a randomly selected background image (e.g., a city image, a living room image, or any other suitable image obtained randomly or pseudo-randomly from a background image server(s)”; Additionally, note that as the references are combined in claim 4, the entire training process performed on the second processor. Therefore, the elements of claim 7, which are part of the training process, are performed on the second processor.) Regarding claims 11-14, and 18-20, claims 11-14 recite a system, with one or more processors and a memory, with elements corresponding to the steps recited in claims 4-7. Therefore, the recited elements of these claims are mapped to the analogous steps in the corresponding method claims. Claims 18-20 recite a non-transitory machine-readable storage medium storing a program with instructions corresponding to the steps recited in claims 4, 5, and 7. Therefore, the recited programming instructions are mapped to the analogous steps in the corresponding method claims. The rationale and motivation to combine the Ge, Guay, and Hushchyn references apply here. Additionally, Ge teaches a system with one or more processors and a memory, and a non-transitory machine-readable storage medium (Ge, Paragraph 109, ““MACHINE-READABLE MEDIUM” in this context refers to a component, device, or other tangible media able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Erasable Programmable Read-Only Memory (EEPROM)) and/or any suitable combination thereof. 
The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., code) for execution by a machine, such that the instructions, when executed by one or more processors of the machine, cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.”) Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Dong (CN 112233161 A) teaches a method for determining depth of a hand in an image using a depth detection neural network, wherein depth data for every pixel of the hand is estimated. Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON JOSEPH SORRIN whose telephone number is (703)756-1565. The examiner can normally be reached Monday - Friday 9am - 5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /AARON JOSEPH SORRIN/Examiner, Art Unit 2672 /SUMATI LEFKOWITZ/Supervisory Patent Examiner, Art Unit 2672
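For orientation, the disputed claim 1 and claim 2 limitations describe a concrete pipeline: crop the hand using cropping boundary data from a detection model, estimate a respective depth value for each foreground pixel of the RGB image data, and treat those depths as a 3D point cloud behind an XR effect. The sketch below only illustrates that reading under a pinhole-camera assumption; it is not Ge's graph-CNN mesh method and not the applicant's implementation, and the detector, depth model output, and hand mask here are hypothetical stand-ins.

import numpy as np

def crop_with_detector(image: np.ndarray, detect_hand) -> tuple[np.ndarray, tuple[int, int, int, int]]:
    """Claim 2 style step: a detection model returns cropping boundary data
    (a bounding box), and the image data is cropped to that box."""
    x0, y0, x1, y1 = detect_hand(image)          # hypothetical detector callback
    return image[y0:y1, x0:x1], (x0, y0, x1, y1)

def depth_to_point_cloud(depth: np.ndarray, mask: np.ndarray,
                         fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project per-pixel depth for foreground (hand) pixels into a 3D
    point cloud using a pinhole camera model (claim 1's estimated depth data
    defining a 3D point cloud)."""
    v, u = np.nonzero(mask)                      # foreground pixel coordinates
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)           # (N, 3) points in camera space

if __name__ == "__main__":
    # Toy end-to-end run with stand-in "models" so the sketch executes as written.
    rgb = np.random.rand(480, 640, 3).astype(np.float32)
    detect_hand = lambda img: (200, 120, 440, 360)             # stub detection model
    crop, box = crop_with_detector(rgb, detect_hand)
    depth = np.full(crop.shape[:2], 0.45, dtype=np.float32)    # stub depth-model output (meters)
    mask = np.ones(crop.shape[:2], dtype=bool)                 # stub foreground (hand) mask
    cloud = depth_to_point_cloud(depth, mask, fx=600.0, fy=600.0,
                                 cx=crop.shape[1] / 2, cy=crop.shape[0] / 2)
    print(cloud.shape)   # (N, 3) hand point cloud, ready for an XR effect renderer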
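Claims 4-6, addressed with Guay and Hushchyn, recite generating synthetic 2D image data from a 3D model of a measured hand, pairing it with target depth data, and randomizing camera and lighting parameters. The sketch below shows only the shape of such a data-generation loop; the renderer callbacks, parameter ranges, and names are hypothetical, and nothing here reflects Ge's, Guay's, or the applicant's actual training code (which, per the Hushchyn combination, could also run on a separate second processor).

import random
from dataclasses import dataclass

@dataclass
class RenderParams:
    camera_azimuth_deg: float
    camera_elevation_deg: float
    camera_distance_m: float
    light_intensity: float
    light_azimuth_deg: float

def sample_render_params(rng: random.Random) -> RenderParams:
    """Claims 5-6 style step: camera and lighting parameter data with
    randomized values (domain randomization) for each synthetic rendering."""
    return RenderParams(
        camera_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_elevation_deg=rng.uniform(-30.0, 60.0),
        camera_distance_m=rng.uniform(0.3, 0.8),
        light_intensity=rng.uniform(0.5, 1.5),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
    )

def build_training_pairs(hand_model, render_rgb, render_depth, n_samples: int, seed: int = 0):
    """Claim 4 style step: from a 3D model of a measured hand, generate synthetic
    2D images paired with target depth maps for supervising a depth estimator."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        params = sample_render_params(rng)
        rgb = render_rgb(hand_model, params)      # hypothetical renderer: synthetic 2D image
        depth = render_depth(hand_model, params)  # same pose/camera: per-pixel target depth
        pairs.append((rgb, depth))
    return pairs

if __name__ == "__main__":
    # Stand-in model and renderers so the sketch runs; a real pipeline would use
    # a mesh of the measured hand and a graphics renderer.
    pairs = build_training_pairs(
        hand_model=None,
        render_rgb=lambda m, p: [[0.0]],
        render_depth=lambda m, p: [[p.camera_distance_m]],
        n_samples=4,
    )
    print(len(pairs), "synthetic (image, depth) training pairs")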

Prosecution Timeline

Mar 07, 2023
Application Filed
May 19, 2025
Non-Final Rejection — §102, §103, §112
Aug 28, 2025
Response Filed
Oct 31, 2025
Final Rejection — §102, §103, §112
Feb 04, 2026
Request for Continued Examination
Feb 14, 2026
Response after Non-Final Action
Feb 23, 2026
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592054
LOW-LIGHT VIDEO PROCESSING METHOD, DEVICE AND STORAGE MEDIUM
2y 5m to grant • Granted Mar 31, 2026
Patent 12586245
ROBUST LIDAR-TO-CAMERA SENSOR ALIGNMENT
2y 5m to grant • Granted Mar 24, 2026
Patent 12566954
SOLVING MULTIPLE TASKS SIMULTANEOUSLY USING CAPSULE NEURAL NETWORKS
2y 5m to grant • Granted Mar 03, 2026
Patent 12555394
IMAGE PROCESSING APPARATUS, METHOD, AND STORAGE MEDIUM FOR GENERATING DATA BASED ON A CAPTURED IMAGE
2y 5m to grant • Granted Feb 17, 2026
Patent 12547658
RETRIEVING DIGITAL IMAGES IN RESPONSE TO SEARCH QUERIES FOR SEARCH-DRIVEN IMAGE EDITING
2y 5m to grant • Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 74%
With Interview (+50.6%): 99%
Median Time to Grant: 3y 5m
PTA Risk: High
Based on 62 resolved cases by this examiner. Grant probability derived from career allow rate.
