Prosecution Insights
Last updated: April 19, 2026
Application No. 17/165,701

GENERATION OF MOVING THREE DIMENSIONAL MODELS USING MOTION TRANSFER

Status: Non-Final OA (§103)
Filed: Feb 02, 2021
Examiner: GILLIARD, DELOMIA L
Art Unit: 2661
Tech Center: 2600 — Communications
Assignee: Nvidia Corporation
OA Round: 4 (Non-Final)
Grant Probability: 90% (Favorable)
Expected OA Rounds: 4-5
Time to Grant: 2y 2m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 90% (976 granted / 1089 resolved), +27.6% vs TC avg (above average)
Interview Lift: +10.2% (moderate, roughly +10%), based on resolved cases with vs. without an interview
Typical Timeline: 2y 2m average prosecution; 12 applications currently pending
Career History: 1101 total applications across all art units
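For readers who want to sanity-check these headline figures, here is a minimal arithmetic sketch. It assumes the dashboard computes the allow rate as granted divided by resolved, applies the interview lift multiplicatively, and treats the "vs TC avg" figure as a percentage-point delta; the tool's actual derivation is not published here.

```python
# Hypothetical reconstruction of the examiner-intelligence figures above
# (allow rate = granted / resolved; interview lift applied multiplicatively;
# "vs TC avg" assumed to be a percentage-point delta).
granted, resolved = 976, 1089

allow_rate = granted / resolved          # ~0.896, displayed as 90%
print(f"Career allow rate:  {allow_rate:.1%}")

tc_delta_points = 0.276                  # +27.6% vs TC avg
print(f"Implied TC average: {allow_rate - tc_delta_points:.1%}")   # ~62.0%

interview_lift = 0.102                   # +10.2% relative lift with an interview
with_interview = min(allow_rate * (1 + interview_lift), 1.0)
print(f"With interview:     {with_interview:.1%}")                 # ~98.8%, displayed as 99%
```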

Statute-Specific Performance

§101: 10.0% (-30.0% vs TC avg)
§103: 48.8% (+8.8% vs TC avg)
§102: 15.5% (-24.5% vs TC avg)
§112: 11.3% (-28.7% vs TC avg)
Tech Center average estimates shown for comparison • Based on career data from 1089 resolved cases
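As a consistency check on the statute-level numbers, the sketch below back-solves the Tech Center baseline implied by each row, assuming each "vs TC avg" figure is a simple percentage-point difference (an assumption; the tool's exact definition is not stated). Every statute implies the same 40.0% baseline, which suggests a single Tech Center-wide estimate rather than per-statute averages.

```python
# Hypothetical consistency check: back-solve the implied Tech Center average
# for each statute, assuming delta = examiner_rate - tc_average (in points).
rates = {"§101": 10.0, "§103": 48.8, "§102": 15.5, "§112": 11.3}       # examiner, %
deltas = {"§101": -30.0, "§103": 8.8, "§102": -24.5, "§112": -28.7}    # vs TC avg, points

for statute, rate in rates.items():
    implied_tc_avg = rate - deltas[statute]
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_avg:.1f}%")
# All four rows back-solve to a 40.0% baseline, consistent with one TC-wide estimate.
```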

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

Claims 1-32 are pending.

Response to Arguments

Applicant’s arguments, see Remarks, filed October 10, 2025, with respect to the rejection(s) of claim(s) 1 under 35 USC 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of CN 111583399 A to Bao et al., hereinafter "Bao", and Pose Estimation For A Partially Observable Human Body From RGB-D Cameras to Dib et al., hereinafter "Dib".

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-5, 7, 9, 15, 17, 25, 30 and 31 is/are rejected under 35 U.S.C. 103 as being unpatentable over CN 111583399 A to Bao et al., hereinafter "Bao", in view of Pose Estimation For A Partially Observable Human Body From RGB-D Cameras to Dib et al., hereinafter "Dib".

Claim 1.
A processor comprising one or more circuits to use one or more neural networks to generate a three-dimensional model of a first object oriented according to a first pose based, at least in part, on: Bao [0011] performing convolution processing on the spliced training image features by using a convolution neural network to generate a training output image, Bao [0007] determining the identity parameter of the first facial image, the pose parameter of the second facial image, and the expression parameter of the third facial image by means of three-dimensional reconstruction includes: performing three-dimensional reconstruction on the face image aiming at each face image in the first face image, the second face image and the third face image to obtain a three-dimensional face corresponding to the face image; … a first image of the first object oriented according to a second pose; Bao [0007] determining the identity parameter of the first facial image, the pose parameter of the second facial image, and the expression parameter of the third facial image by means of three-dimensional reconstruction includes: performing three-dimensional reconstruction on the face image aiming at each face image in the first face image, … Bao [0002] Various processes may be performed on the face image to achieve different effects. For example, by giving an original face and a reference face, a virtual face which is consistent with identity information of the reference face and retains effects of light, posture, … Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, in the embodiment of the present application, a process of face editing may be implemented based on three different face images. By respectively extracting the attitude parameters of the attitude image, the expression parameters of the expression image and the identity parameters of the identity image, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. The expression parameter is used to indicate an expression state of the face of the person in the image, for example, the expression parameter may indicate an action of five sense organs of the person in the image, such as smiling, laughing, crying, opening the eyes, closing the eyes, and the like. The pose parameters may be used to indicate the position and orientation of the person's face in the image, etc. In this virtual result, the face has a pose corresponding to the pose image, an expression corresponding to the expression image, and an identity corresponding to the identity image. The pose parameters, the expression parameters and the identity parameters may be obtained by processing the face image in a three-dimensional reconstruction mode. 
Bao [0003] According to an aspect of the present application, there is provided an image processing method including: determining an identity parameter of the first facial image, an attitude parameter of the second facial image and an expression parameter of the third facial image by using a three-dimensional reconstruction mode; generating a two-dimensional fused rendered facial image based on the identity parameter of the first facial image, the pose parameter of the second facial image and the expression parameter of the third facial image and a second image of a second object oriented according to the first pose. Bao [0007] determining the identity parameter of the first facial image, the pose parameter of the second facial image, and the expression parameter of the third facial image by means of three-dimensional reconstruction includes: performing three-dimensional reconstruction on the face image aiming at each face image in the first face image, the second face image and the third face image to obtain a three-dimensional face corresponding to the face image; determining the identity parameter and the expression parameter of the face image based on the three-dimensional face; Bao [0003] According to an aspect of the present application, there is provided an image processing method including: determining an identity parameter of the first facial image, an attitude parameter of the second facial image and an expression parameter of the third facial image by using a three-dimensional reconstruction mode; generating a two-dimensional fused rendered facial image based on the identity parameter of the first facial image, the pose parameter of the second facial image and the expression parameter of the third facial image Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, in the embodiment of the present application, a process of face editing may be implemented based on three different face images. By respectively extracting the attitude parameters of the attitude image, the expression parameters of the expression image and the identity parameters of the identity image, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. The expression parameter is used to indicate an expression state of the face of the person in the image, for example, the expression parameter may indicate an action of five sense organs of the person in the image, such as smiling, laughing, crying, opening the eyes, closing the eyes, and the like. The pose parameters may be used to indicate the position and orientation of the person's face in the image, etc. In this virtual result, the face has a pose corresponding to the pose image, an expression corresponding to the expression image, and an identity corresponding to the identity image. The pose parameters, the expression parameters and the identity parameters may be obtained by processing the face image in a three-dimensional reconstruction mode. Bao fails to explicitly teach the three-dimensional model is a three-dimensional occupancy RGB field. Dib, in the same field of human pose estimation in image data, teaches wherein the three-dimensional model is a three-dimensional occupancy RGB field. 
Dib [Introduction] To cope with all these problems, we propose a model based pose estimation method for depth cameras composed of three stages: The first one uses a probabilistic occupancy grid updated with a Hidden Markov Model (HMM) to maintain an up-to-date background and to extract moving persons from a dynamic scene… The last contribution of this paper is the proposal of a rich dataset of synchronized RGB-D camera and motion capture data for a person Dib [A. Occupancy Grid and HMM model] … maintain an up-to-date background and extract moving subjects from the scene. This is done by using a 3D occupancy grid… This observation z is obtained from the depth image of the RGB-D sensor. See also Fig. 2.

Bao teaches using pose parameters to indicate the position and orientation of the face of the person in the image. Thus, at the time of the invention, it would have been obvious to one of ordinary skill in the art to combine the teachings of Bao with the teachings of Dib [Abstract] to alleviate the challenges of human pose estimation in realistic world conditions such as foreground extraction, background update and occlusion by scene objects.

Claim 3. Bao further teaches wherein the processor generates a two-dimensional image of the first object in the first pose from a point of view. Bao [0007] teaches, in some embodiments, generating a two-dimensional fused rendered facial image based on the identity parameter of the first facial image…

Claim 4. Bao further teaches wherein: the first object is a human being; Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, …a process of face editing may be implemented based on three different face images…, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. and the processor generates a parametric model of the human being based at least in part on features determined from the first image. Bao [Fig. 2] The pose parameters may be used to indicate the position and orientation of the person's face in the image, etc. ...The pose parameters, the expression parameters and the identity parameters may be obtained by processing the face image in a three-dimensional reconstruction mode.

Claim 5. Bao further teaches wherein: the first object is a first human being; Bao [Fig. 2] 1st pic the second object is a second human being; Bao [Fig. 2] 2nd pic and the first human being is a different person than the second human being. Bao [Fig. 2] 1st pic and 2nd pic

Claim 7. Bao further teaches wherein the one or more neural networks is trained using at least a pair of image frames from a segment of video. Bao [Fig. 8] In step S802, a training set comprising at least two training face images may be determined. Wherein the training set may be a mixed set of image data and video data. Where image data covers more identities and video data covers more facial expressions. In step S804, a training identity parameter of the first training face image, a training pose parameter of the second training face image, and a training expression parameter of the third training face image may be determined.

Claim 9. It differs from claim 1 in that it is a computer system comprising one or more processors coupled to computer-readable media storing instructions that, as a result of being executed by the one or more processors, of claim 1.
Therefore claim 9 has been analyzed and reviewed in the same way as claim 1. See the above analysis.

Claim 15. Bao further teaches wherein: the second object is a human being; Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, in the embodiment of the present application, a process of face editing may be implemented based on three different face images. By respectively extracting the attitude parameters of the attitude image, the expression parameters of the expression image and the identity parameters of the identity image, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. Bao [Fig. 2] 1st pic and 2nd pic and the first object is a humanoid character. Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, in the embodiment of the present application, a process of face editing may be implemented based on three different face images. By respectively extracting the attitude parameters of the attitude image, the expression parameters of the expression image and the identity parameters of the identity image, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. Bao [Fig. 2] 1st pic and 2nd pic

Claim 17. It differs from claim 1 in that it is a computer-implemented method performed by the processor of claim 1. Therefore claim 17 has been analyzed and reviewed in the same way as claim 1. See the above analysis.

Claim 25. It differs from claim 1 in that it is a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors of claim 1. Therefore claim 25 has been analyzed and reviewed in the same way as claim 1. See the above analysis.

Claim 30. It differs from claim 5 in that it is a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors of claim 5. Therefore claim 30 has been analyzed and reviewed in the same way as claim 5. See the above analysis.

Claim 31. It differs from claim 4 in that it is a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors of claim 4. Therefore claim 31 has been analyzed and reviewed in the same way as claim 4. See the above analysis.

Claim(s) 6, 8, 13, 14, 16, 18-24, 26-29 and 32 is/are rejected under 35 U.S.C. 103 as being unpatentable over CN 111583399 A to Bao et al., hereinafter "Bao", in view of Pose Estimation For A Partially Observable Human Body From RGB-D Cameras to Dib et al., hereinafter "Dib", and in further view of WO 2021/258386 A1 to Wang et al., hereinafter "Wang".

Claim 6. Bao is silent as to claim 6; however, Wang, in the field of three-dimensional pose estimation in image data, teaches wherein the processor generates a plurality of two-dimensional images of the first object from different points of view. Wang [0028] teaches each image capture device 104, 106, 108, 110 of FIG. 1 generates image data representing the subject 102 (e.g., a human being) . For instance, as shown in FIG.
1, the first image capture device 104 generates a first image 114 including the subject 102 and the second image capture device 106 generates a second image 116 including the subject 102. The first image 114 and the second image 116 are generated by the respective device 104, 106 at substantially the same time (e.g., time-synchronized) . As shown in FIG. 1, the view of the subject 102 in the first image 114 (e.g., a side profile view) is different than the view of subject 102 in the second image data 116 (e.g., a front view) due to the different locations of the first and second image capture devices 104, … Thus, at the time of the invention, it would have been obvious to one of ordinary skill in the art to combine the teachings of Bao with the teachings of Wang [0008] to provide improved systems and methods for human motion transfer for video synthesis.

Claim 8. Bao and Wang further teach wherein the processor: constructs a parametric 3-D model of the first object in the first pose; Bao [Fig. 2] Fig. 2 shows an illustrative principle according to an embodiment of the application. As shown in fig. 2, in the embodiment of the present application, a process of face editing may be implemented based on three different face images. By respectively extracting the attitude parameters of the attitude image, the expression parameters of the expression image and the identity parameters of the identity image, a virtual human face can be generated as the result of face editing. Where the identity parameters are used to uniquely identify a person in the image, for example, the identity parameters may indicate facial features of a human face. The expression parameter is used to indicate an expression state of the face of the person in the image, for example, the expression parameter may indicate an action of five sense organs of the person in the image, such as smiling, laughing, crying, opening the eyes, closing the eyes, and the like. The pose parameters may be used to indicate the position and orientation of the person's face in the image, etc. In this virtual result, the face has a pose corresponding to the pose image, an expression corresponding to the expression image, and an identity corresponding to the identity image. The pose parameters, the expression parameters and the identity parameters may be obtained by processing the face image in a three-dimensional reconstruction mode. and generates the three-dimensional model based at least in part on the parametric 3-D model. Wang [0028] teaches the 3D model generator analyzes each set of synchronized images to predict positions of keypoints, or joints (e.g., elbow, wrist, pelvis) , of the subject 102 in the images and to estimate a 2D pose of the subject based on the keypoints positions.

Claim 13. Wang further teaches wherein the three-dimensional model is a 3-D mesh. Wang [0120] teaches in some known 3D pose estimation techniques, image data generated by multiple cameras is analyzed to estimate a 2D pose of a subject captured in the image data based on joint or keypoint recognition. To generate a 3D graphical model of the subject in the pose, known pose estimation techniques may calculate locations of the joints of the user using the 2D pose data and methods such as triangulation, statistical modeling (e.g., pictorial structural modeling) , and/or 3D geometric modeling techniques (e.g., volumetric voting, factor graph optimization) .

Claim 14. Wang further teaches wherein the first object and the second object represent a same person in different poses.
Wang [0028] teaches each image capture device 104, 106, 108, 110 of FIG. 1 generates image data representing the subject 102 (e.g., a human being) . For instance, as shown in FIG. 1, the first image capture device 104 generates a first image 114 including the subject 102 and the second image capture device 106 generates a second image 116 including the subject 102. The first image 114 and the second image 116 are generated by the respective device 104, 106 at substantially the same time (e.g., time-synchronized) . As shown in FIG. 1, the view of the subject 102 in the first image 114 (e.g., a side profile view) is different than the view of subject 102 in the second image data 116 (e.g., a front view) due to the different locations of the first and second image capture devices 104, 106 in the environment 103. Wang [0049] teaches training is performed using training data. In examples disclosed herein, the training data originates from previously generated 2D and/or 3D images that include subject (s) in different pose (s) . Wang [0056] teaches in the example of FIG. 2, … The training data 230 is labeled with (X, Y) joint or keypoint positions for each relevant keypoint (e.g., joint) of the subject (s) in each pose relative to a coordinate system for each image in the training data. The first neural network trainer 226 trains the neural network implemented by the neural network processor 222 using the training data 230. Based on the positions of the keypoints for subject (s) performing different poses … Claim 16. Wang further teaches wherein the three-dimensional model is based at least in part on a plurality of images of the first object. Wang [0028] teaches As disclosed herein, the 3D model generator 136 extracts images (e.g., video frames) from the image data feeds generated by each of the image capture devices 104, 106, 108, 110 and time-synchronizes the images obtained from each device 104, 106, 108, 110. The 3D model generator analyzes each set of synchronized images to predict positions of keypoints, or joints (e.g., elbow, wrist, pelvis) , of the subject 102 in the images and to estimate a 2D pose of the subject based on the keypoints positions. Claim 18. Wang further teaches further comprising: receiving information that specifies a point of view; and generating, from the three-dimensional model, a 2-D image of the first object from the point of view. Wang [0075] teaches the example 3D pose calculator 218 of FIG. 2 uses the neural-network trained joint depth offset model 258 to predict 3D joint coordinates for the keypoints and to generate an initial set of 3D graphical model (s) or skeleton (s) based on the predicted 3D joint coordinates. The flow diagram 500 of FIG. 5 illustrates the 3D pose regression from the normalized 2D skeleton data. As shown in FIG. 5, the joint depth offset model is generated as a result of training of the neural network 248 (block 502) . The 3D pose calculator 218 uses the normalized 2D skeleton data and the joint depth offset model D to generate predicted scaled 3D skeleton data (e.g., via orthographic projection) . A predicted 3D skeleton is generated for each image capture device view, which is used to generate a final 3D skeleton in the world coordinate system defined by one of the image capture device views. 
Wang [0077] teaches as a result of the regression analysis to obtain the 3D joint coordinates from the normalized 2D skeleton data and the joint depth offset model 258, the 3D pose calculator 218 creates initial 3D skeleton data for each image in the synchronized set of images. The initial 3D skeleton data represents a result of the regression analysis from the 2D skeleton data corresponding to a respective view of the image capture devices 104, 106, 108, 110. Claim 19. Wang further teaches further comprising generating, from the three-dimensional model, a plurality of 2-D images of the first object from a corresponding plurality of points of view. Wang [0077] teaches as a result of the regression analysis to obtain the 3D joint coordinates from the normalized 2D skeleton data and the joint depth offset model 258, the 3D pose calculator 218 creates initial 3D skeleton data for each image in the synchronized set of images. The initial 3D skeleton data represents a result of the regression analysis from the 2D skeleton data corresponding to a respective view of the image capture devices 104, 106, 108, 110. Claim 20. Wang further teaches wherein the one or more neural networks are trained by at least training the one or more neural networks to produce a parametric model of the first object from an image of the first object. Wang [0049] teaches training is performed using training data. In examples disclosed herein, the training data originates from previously generated 2D and/or 3D images that include subject (s) in different pose (s) . Wang [0056] teaches in the example of FIG. 2, the training data 230 can include images including subject (s) in various pose (s) generated for purposes of training. In some examples, the training data includes the image data streams 200, 202, 204, 206 generated by the image capture devices (s) 104, 106, 108, 110. The training data 230 is labeled with (X, Y) joint or keypoint positions for each relevant keypoint (e.g., joint) of the subject (s) in each pose relative to a coordinate system for each image in the training data. The first neural network trainer 226 trains the neural network implemented by the neural network processor 222 using the training data 230. Based on the positions of the keypoints for subject (s) performing different poses in the training data 230, the first neural network trainer 226 trains the neural network 222 to identify (e.g., predict) the positions of the keypoints of the respective subjects in the synchronized images (e.g., the images 114, 116, 118, 120) generated by the image capture devices 104, 106, 108, 110. Wang [0057] teaches a keypoint prediction model 234 is generated as a result of the neural network training. Claim 21. Wang further teaches wherein the one or more neural networks are trained by at least training the one or more neural networks to produce a parametric model of the first object from an image of the first object and an image of the first object according to a different pose. Wang [0049] teaches training is performed using training data. In examples disclosed herein, the training data originates from previously generated 2D and/or 3D images that include subject (s) in different pose (s) . Wang [0056] teaches in the example of FIG. 2, the training data 230 can include images including subject (s) in various pose (s) generated for purposes of training. In some examples, the training data includes the image data streams 200, 202, 204, 206 generated by the image capture devices (s) 104, 106, 108, 110. 
The training data 230 is labeled with (X, Y) joint or keypoint positions for each relevant keypoint (e.g., joint) of the subject (s) in each pose relative to a coordinate system for each image in the training data. The first neural network trainer 226 trains the neural network implemented by the neural network processor 222 using the training data 230. Based on the positions of the keypoints for subject (s) performing different poses in the training data 230, the first neural network trainer 226 trains the neural network 222 to identify (e.g., predict) the positions of the keypoints of the respective subjects in the synchronized images (e.g., the images 114, 116, 118, 120) generated by the image capture devices 104, 106, 108, 110. Wang [0057] teaches a keypoint prediction model 234 is generated as a result of the neural network training. Claim 22. Wang further teaches wherein the one or more neural networks are trained by at least training the one or more neural networks using two images from a segment of video of the first object. Wang [0028] teaches the 3D model generator 136 extracts images (e.g., video frames) from the image data feeds generated by each of the image capture devices 104, 106, 108, 110 and time-synchronizes the images obtained from each device 104, 106, 108, 110. The 3D model generator analyzes each set of synchronized images to predict positions of keypoints, or joints (e.g., elbow, wrist, pelvis) , of the subject 102 in the images and to estimate a 2D pose of the subject based on the keypoints positions. The 3D model generator 136 can recognize the position of the keypoints in the image data based on keypoint recognition model (s) generated via neural network training. Wang [0069] teaches referring again to FIG. 2, the example system 100 includes a second computing system 246 to train a neural network to map the 2D positions (e.g., (X, Y) coordinates) of the joints or keypoints of the subject to 3D positions (e.g., (X, Y, Z) coordinates) . The example second computing system 246 includes a second neural network processor 248. In examples disclosed herein, the second neural network processor 248 implements a second neural network. Wang [0070] teaches the example second computing system 246 of FIG. 2 includes a second neural network trainer 250. The example second neural network trainer 250 of FIG. 2 performs training of the neural network implemented by the second neural network processor 248. In some examples disclosed herein, training is performed using a stochastic gradient descent algorithm. However, other approaches to training a neural network may additionally or alternatively be used. Claim 23. Wang further teaches wherein the three- dimensional model is generated from a human parametric model. Wang [0021] teaches pose estimation determines a pose (e.g., a position and orientation) of a subject (e.g., a human) or an object using image data. The image data is analyzed to, for example, identify positions of the subject’s joints (e.g., an elbow, a knee, an ankle) in the image data that indicate the subject’s pose. The pose information obtained from the image data can be used to analyze characteristics of the subject’s body during performance of an activity such as a sport. Wang [0125] teaches the 3D pose calculator 218 performs a neural network regression analysis to generate initial 3D skeleton data for the subject (block 1408) . 
For example, the 3D pose calculator 218 uses the normalized 2D skeleton data 244, 700, 702, 704, 706, the joint depth offset model 258 generated by the training of the neural network 248 (e.g., as disclosed in the flowchart of FIG. 13) , and a 3D skeleton template to predict the 3D joint coordinates ( (X, Y, Z) coordinates) of the keypoints of the subject in the images (e.g., as disclosed in connection with the flow diagrams 400, 500 of FIGS. 4 and 5) . Claim 24. Wang further teaches wherein the three- dimensional model is generated by applying, to a parametric model, two dimensional features determined from the first image. Wang [0058] teaches the keypoint prediction model 234 is executed by the 2D pose detector 216 of the 3D model generator 136 of FIG. 2. In particular, the 2D pose detector 216 executes the keypoint prediction model 234 for each synchronized image (e.g., the images 114, 116, 118, 120 of FIG. 1) to detect the positions (e.g., (X, Y) coordinate positions) of the keypoints of the subject (s) in the images and to generate 2D skeleton (s) (e.g., where k is an index identifying a particular keypoint of the subject 102 up to a total number of J keypoints, and p represents the (X, Y) coordinate position for the corresponding keypoint) of the subject (s) for each image. [0075-0077] Claim 26. It differs from claim 8 in that it is a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors of claim 8. Therefore claim 26 has been analyzed and reviewed in the same way as claim 8. See the above analysis. Claim 27. Wang further teaches wherein the one or more neural networks is trained, based at least in part, on a 2-D image loss produced by providing the one or more neural networks with a pair of images from a segment of video. Wang [0028] teaches the 3D model generator 136 extracts images (e.g., video frames) from the image data feeds generated by each of the image capture devices 104, 106, 108, 110 and time-synchronizes the images obtained from each device 104, 106, 108, 110. The 3D model generator analyzes each set of synchronized images to predict positions of keypoints, or joints (e.g., elbow, wrist, pelvis) , of the subject 102 in the images and to estimate a 2D pose of the subject based on the keypoints positions. The 3D model generator 136 can recognize the position of the keypoints in the image data based on keypoint recognition model (s) generated via neural network training. Wang [0069] teaches referring again to FIG. 2, the example system 100 includes a second computing system 246 to train a neural network to map the 2D positions (e.g., (X, Y) coordinates) of the joints or keypoints of the subject to 3D positions (e.g., (X, Y, Z) coordinates) . The example second computing system 246 includes a second neural network processor 248. In examples disclosed herein, the second neural network processor 248 implements a second neural network. Wang [0070] teaches the example second computing system 246 of FIG. 2 includes a second neural network trainer 250. The example second neural network trainer 250 of FIG. 2 performs training of the neural network implemented by the second neural network processor 248. In some examples disclosed herein, training is performed using a stochastic gradient descent algorithm. However, other approaches to training a neural network may additionally or alternatively be used. Claim 28. 
Wang further teaches wherein the one or more processors generate a segment of video of the first object from a shifting point of view. Wang [0028] teaches each image capture device 104, 106, 108, 110 of FIG. 1 generates image data representing the subject 102 (e.g., a human being) . For instance, as shown in FIG. 1, the first image capture device 104 generates a first image 114 including the subject 102 and the second image capture device 106 generates a second image 116 including the subject 102. The first image 114 and the second image 116 are generated by the respective device 104, 106 at substantially the same time (e.g., time-synchronized) . As shown in FIG. 1, the view of the subject 102 in the first image 114 (e.g., a side profile view) is different than the view of subject 102 in the second image data 116 (e.g., a front view) due to the different locations of the first and second image capture devices 104, 106 in the environment 103. Similarly, in FIG. 1, the third image capture device 108 generates a third image 118 of the subject 102 and the fourth image capture device 108 generates a fourth image 120 of the subject 102. As shown in FIG. 1, the views of the subject 102 captured in each of the images 114, 116, 118, 120 differ based on the different field of views of the image capture devices 104, 106, 108, 110. Wang [0029] teaches In the example of FIG. 1, each of the image capture devices 104, 106, 108, 110 capture images of the subject 102 over time to generate image data streams (e.g., sequences or series of images including, for instance, video frame, still images, etc. ) . The image capture devices are time-synchronized such that the image data generated by each image capture device captures the subject in the same pose, but from a different angle based on the position and/or orientation of the image capture device. Thus, the images captured by the image capture devices 104, 106, 108, 110 can be used to track movement of the subject 102 over time and, thus, changes in the poses of the subject 102. As disclosed herein, the image data generated by each of the image capture devices 102, 104, 106, 108 (where the image data generated by each device include the respective images 114, 116, 118, 120) is used to identify (e.g., predict) a pose of the subject 102 at a particular time corresponding to the time at which the respective image capture devices 104, 106, 108, 110 captured the images 114, 116, 118, 120 including the subject 102. The image data is used to generate a 3D graphical model of the subject 102 in the pose. Wang [0038] teaches each of the image capture devices 104, 106, 108, 110 generates image data, where the image data includes a sequence or series of images (e.g., video frames, still images) of the subject 102 captured over time. As illustrated in FIG. 2, the example 3D model generator 136 receives a first image data stream 200 from the first image capture device 104 of FIG. 1, a second image data stream 202 from the second image capture device 106, a third image stream data 204 from the third image capture device 108, and a fourth image data stream 206 from the fourth image capture device 110. The image data streams 200, 202, 204, 206 can be stored in a database 208. In some examples, the 3D model generator 136 includes the database 208. Claim 29. Wang further teaches wherein the three-dimensional model is a three-dimensional point field. 
Wang [0125] teaches the 3D pose calculator 218 uses the normalized 2D skeleton data 244, 700, 702, 704, 706, the joint depth offset model 258 generated by the training of the neural network 248 (e.g., as disclosed in the flowchart of FIG. 13) , and a 3D skeleton template to predict the 3D joint coordinates ( (X, Y, Z) coordinates) of the keypoints of the subject in the images (e.g., as disclosed in connection with the flow diagrams 400, 500 of FIGS. 4 and 5) .

Claim 32. It differs from claim 6 in that it is a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors of claim 6. Therefore claim 32 has been analyzed and reviewed in the same way as claim 6. See the above analysis.

Claim(s) 10-12 is/are rejected under 35 U.S.C. 103 as being unpatentable over CN 111583399 A to Bao et al., hereinafter "Bao", in view of Pose Estimation For A Partially Observable Human Body From RGB-D Cameras to Dib et al., hereinafter "Dib", and in further view of WO 2021/258386 A1 to Wang et al., hereinafter "Wang", and US 2023/0070008 A1 to Kulon et al., hereinafter "Kulon".

Claim 10. While Bao is silent on claim 10, Wang and Kulon, both in the field of using neural networks for pose estimation of an object, teach wherein the computer system: determines a set of pose parameters from the second image; Wang [0072] teaches in the example of FIG. 2, the training data 254 includes previously generated 3D human pose datasets. The 3D human pose datasets are labeled with (X, Y, Z) joint positions relative to a 3D coordinate system for each image in the training data 254. determines a set of shape parameters from the first image; Kulon [0017] teaches the three-dimensional parametric model of the object may be a mesh model parametrised by parameters comprising one or more shape parameters and/or one or more pose parameters. The parameters of the three-dimensional parametric model of the object may further comprise one or more camera parameters. The corresponding object keypoints in the three-dimensional parametric model may be identified from the mesh model using a sparse regression model. and generates a parametric model of the first object based at least in part on the set of pose parameters and the set of shape parameters. Kulon [0117] teaches the three-dimensional parametric hand model may be defined by a set of parameters comprising shape parameters, β, and pose parameters, θ. The parameters may also comprise scale and position parameters of a camera viewing the mesh. The model uses these parameters to generate a three-dimensional hand mesh via a differentiable function. An example of such a model is the MANO model, though other parametric hand models may alternatively be used. Thus, at the time of the invention, it would have been obvious to one of ordinary skill in the art to combine the teachings of Bao with the teachings of Wang [0008] to provide improved systems and methods for human motion transfer for video synthesis, and with the teachings of Kulon [0001] for the benefit of generating three-dimensional models from two-dimensional images.

Claim 11. Wang further teaches wherein the computer system: generates a 2-D feature map from the first image; Wang [0113] teaches FIG. 12 is a flowchart representative of example machine readable instructions 1200 that, when executed by the example first computing system 220 of FIG.
2, cause the example first computing system 220 to train a neural network to identify the position of keypoints or joints of a subject in 2D image data. The example instructions 1200 of FIG. 12, when executed by the first computing system 220 of FIG. 2, result in a neural network and/or a model thereof, that can be distributed to other computing systems, such as the 2D pose detector 216 of FIG. 2. Examiner interprets keypoints to be features. Wang [0074-0077] Wang [0117] FIG. 13 is a flowchart representative of example machine readable instructions 1300 that, when executed by the example second computing system 246 of FIG. 2, cause the example second computing system 246 to train a neural network to learn a mapping from normalized 2D skeleton data including 2D joint coordinates to 3D joint coordinates using a depth offset map, where the depth offset map indicates a depth offset from a joint to a root joint (e.g., the pelvis joint) … Wang [0120] teaches the example training controller 252 instructs the neural network trainer 250 to perform training of the neural network 248 using the training data 254 (block 1306) . In the example of FIG. 13, the training is based on supervised learning. As a result of the training, the joint depth offset model 258 is generated (block 1308) . Based on the joint depth offset model 258, the neural network is trained to learn the mapping between the joint coordinates in the normalized 2D skeleton data and the depth offset for the joints to predict 3D joint positions. The joint depth offset model 258 can be stored in the database 256 for access by the 3D pose calculator 218 of the 3D model generator 136. and the three-dimensional model is based at least in part on the 2-D feature map and the parametric model. Wang [0116] teaches the example training controller 228 instructs the neural network trainer 226 to perform training of the neural network 222 using the training data 230 (block 1206) . In the example of FIG. 12, the training is based on supervised learning. As a result of the training, the keypoint prediction model 234 is generated (block 1208) . Based on the keypoint prediction model 234, the neural network is trained to predict 2D (X, Y) positions of keypoints of a subject in image data. The keypoint prediction model 234 can be stored in the database 236 for access by the 2D pose detector 216 of the 3D model generator 136.

Claim 12. Wang further teaches wherein the computer system: generates a 3-D feature map from the parametric model; and the three-dimensional model is based at least in part on the 3-D feature map and the 2-D feature map. Wang [0125] teaches the 3D pose calculator 218 performs a neural network regression analysis to generate initial 3D skeleton data for the subject (block 1408) . For example, the 3D pose calculator 218 uses the normalized 2D skeleton data 244, 700, 702, 704, 706, the joint depth offset model 258 generated by the training of the neural network 248 (e.g., as disclosed in the flowchart of FIG. 13) , and a 3D skeleton template to predict the 3D joint coordinates ( (X, Y, Z) coordinates) of the keypoints of the subject in the images (e.g., as disclosed in connection with the flow diagrams 400, 500 of FIGS. 4 and 5) . The 3D pose calculator 218 generates initial 3D skeleton data 265, 800 of FIGS. 2 and 8 based on the regression analysis. Examiner interprets keypoints to be features.
Wang [0074-0077] Wang [0117], [0120]

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Villecco, can be reached on (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DELOMIA L GILLIARD/
Primary Examiner, Art Unit 2661

Prosecution Timeline

Feb 02, 2021: Application Filed
Sep 07, 2023: Non-Final Rejection — §103
Jan 05, 2024: Interview Requested
Jan 11, 2024: Examiner Interview Summary
Jan 11, 2024: Applicant Interview (Telephonic)
Mar 12, 2024: Response Filed
Jun 22, 2024: Non-Final Rejection — §103
Jan 02, 2025: Response Filed
Apr 05, 2025: Non-Final Rejection — §103
Jul 11, 2025: Interview Requested
Jul 17, 2025: Applicant Interview (Telephonic)
Jul 17, 2025: Examiner Interview Summary
Oct 10, 2025: Response Filed
Jan 18, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602805
DATA TRANSMISSION THROTTLING AND DATA QUALITY UPDATING FOR A SLAM DEVICE
Granted Apr 14, 2026 • 2y 5m to grant
Patent 12602932
SYSTEMS AND METHODS FOR MONITORING USERS EXITING A VEHICLE
Granted Apr 14, 2026 • 2y 5m to grant
Patent 12602796
SYSTEM, DEVICE, AND METHODS FOR DETECTING AND OBTAINING INFORMATION ON OBJECTS IN A VEHICLE
Granted Apr 14, 2026 • 2y 5m to grant
Patent 12602952
IMAGE-BASED AUTOMATED ERGONOMIC RISK ROOT CAUSE AND SOLUTION IDENTIFICATION SYSTEM AND METHOD
Granted Apr 14, 2026 • 2y 5m to grant
Patent 12602895
MACHINE LEARNING-BASED DOCUMENT SPLITTING AND LABELING IN AN ELECTRONIC DOCUMENT SYSTEM
Granted Apr 14, 2026 • 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 4-5
Grant Probability: 90%
With Interview: 99% (+10.2% lift)
Median Time to Grant: 2y 2m
PTA Risk: High
Based on 1089 resolved cases by this examiner. Grant probability derived from career allow rate.
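Two of these projections can be checked against figures elsewhere in this report. The sketch below assumes the interview lift is relative (multiplicative) rather than additive, and measures pendency from the filing date to the report's "Last updated" date; both are assumptions about how the dashboard computes its numbers.

```python
# Hypothetical check of the projection figures above.
from datetime import date

# Grant probability and interview lift (lift assumed multiplicative).
base = 976 / 1089                                   # career allow rate, ~89.6% -> 90%
with_interview = min(base * 1.102, 1.0)             # ~98.8% -> displayed as 99%
print(f"Grant probability: {base:.0%}, with interview: {with_interview:.0%}")

# Pendency vs. the examiner's 2y 2m median time to grant.
filed, as_of = date(2021, 2, 2), date(2026, 4, 19)  # filing date and "Last updated" date
months = (as_of.year - filed.year) * 12 + (as_of.month - filed.month)
print(f"Pendency so far: {months // 12}y {months % 12}m vs. median 2y 2m")
# Pendency (5y 2m) already far exceeds the median, and pendency beyond three years
# from filing implicates Patent Term Adjustment under 35 U.S.C. 154(b) ("B" delay),
# which is consistent with the "High" PTA risk flag.
```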

Free tier: 3 strategy analyses per month