Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Notice to Applicants
This communication is in response to the amendment filed on 1/6/2026.
Claims 1-14 are pending.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-14 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5, 7 and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over WANG et al. (CN 113762129A) (hereafter, "WANG") in view of HSU (U.S. Publication No. 2021/0201502) and further in view of MEHTA et al. (NPL, "XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera") (hereafter, "MEHTA").
Regarding claim 1, WANG teaches a data processing method, comprising: acquiring an object pose detection result corresponding to an object in an image frame and ([0008] The input device captures an image stream of at least 25 frames/second for smoother results, which is then fed into the 2D human pose estimation module; [0013] Figure 2 shows a general 2D human pose estimation system. The system takes as input an image or a sequence of images (0201). They are then passed to a pose estimator (0202) which generates a set of 2D joints (0203) for each person in the input image) a part pose detection result corresponding to a first object part of the object in the image frame ([0008] The estimation module then outputs a set of 2D locations for each joint of each person in the image being processed. For each joint, a triplet (x, y, score) is generated, where x and y are the horizontal and vertical positions of the joint, respectively, and the score determines the confidence of the joint detection, ranging from 0 to 1 (1 means complete confidence); [0016] FIG4 shows a human joint model (0400). The model consists of 21 joints, where each joint is a 3D vector), wherein at least one object part of the object is missing from the object pose detection result, and ([0110] The pose with missing joints in frame 1 is shown in Figure 11; [0121] In frame 2, joints ... are missing. The pose with missing joints in frame 2 is shown in Figure 16) the first object part is one or more parts of the object; and ([0121] the image of frame 2 is fed into the pose estimation system and produces the detected joints).
WANG does not expressly teach performing interpolation processing on the at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object, wherein the standard pose comprises a pre-constructed default pose of the object, and the global pose is used for controlling a computer to realize a service function corresponding to the global pose.
However, HSU teaches performing interpolation processing on the at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object ([0036] The prediction model further predicts joint positions of occluded limbs according to the joint positions of unoccluded limbs. This technology is based on the body pose hypothesis (that is, the relative relationship (such as a distance and an angle) between each joint position and the remaining joint positions) learned by the AI from the MoCap motion database to predict the positions where the remaining joints are most likely to appear in the image according to every joint position. Therefore, when the joint positions of parts of the limbs are occluded, the joint positions of occluded limbs could be predicted according to the relative relationship between the joints of the unoccluded limbs and the remaining joints, so as to detect the positions of the occluded limbs. The purpose is to make the predicted positions conform to the body pose hypothesis), wherein the standard pose comprises a pre-constructed default pose of the object, and ([0036] MoCap motion database; [0041] The joint information includes relevant information (for example, three-dimensional coordinates of all joint positions of the object in the training image) of the limbs of the whole body).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the method and device of WANG to incorporate the step/system of obtaining the positions of occluded limbs by predicting their locations from the unoccluded joints based on a motion capture (MoCap) database, as taught by HSU.
The suggestion/motivation for doing so would have been to improve the accuracy of predicting the motion of the target object ([0032] the image capturing device 420 may be further combined with a depth camera to improve the accuracy of predicting the motion of the target object 410). Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
The combination of WANG and HSU does not expressly teach the global pose is used for controlling a computer to realize a service function corresponding to the global pose.
However, MEHTA teaches the global pose (Page 4, line 3-4, Fig. 2. Overview: ... Stage II is a compact fully-connected network that runs in parallel for each detected person, and reconstructs the complete 3D pose, including occluded joints, by leveraging global (full body) context; Page 5, right col., line 13-16, Stage II, which we discuss in Section 4.2, uses a lightweight fully connected neural network that ‘decodes’ the input from the previous stage into a full 3D pose, i.e. root-relative 3D joint positions for visible and occluded joints, per individual) is used for controlling a computer to realize a service function corresponding to the global pose (Page 2, right col., line 10-13 from the bottom, we fit a model-based skeleton to the 3D and 2D predictions in order to satisfy kinematic constraints and reconcile the 2D and 3D predictions across time. This produces temporally stable predictions, with skeletal angle estimates, which can readily drive virtual characters; Page 4, right col., line 9-15, our approach works … yielding skeletal joint angles and camera relative positioning of the subject, which can be readily be used to control animated characters in a virtual environment. Our approach predicts the complete body pose even under significant person-object occlusions; Page 10, right col., line 29-31, results in temporally smooth joint-angle estimates which can readily be used to drive virtual characters; Fig. 7).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the method and device of WANG and HSU to incorporate the step/system of using the complete 3D pose to control animated characters in a virtual environment, as taught by MEHTA.
The suggestion/motivation for doing so would have been to enhance the prediction of body-part locations for 3D human pose estimation (Page 14, left col., line 5-8, Stage III predictions show a significant improvement for limb joints such as elbows, wrists, knees, and ankles over Stage II; Page 15, right col., in conclusion section, One of the key components of our system is a new CNN architecture that uses selective long and short range skip connections to improve the information flow and have a significantly smaller memory footprint, allowing for a drastically faster network without compromising accuracy). Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine WANG and HSU with MEHTA to obtain the invention as specified in claim 1.
Regarding claim 2, the combination of WANG, HSU and MEHTA teaches all the limitations of claim 1 above. WANG teaches wherein the acquiring an object pose detection result and ([0013] Figure 2 shows a general 2D human pose estimation system … They are then passed to a pose estimator (0202) which generates a set of 2D joints (0203) for each person in the input image) a part pose detection result comprises ([0008] The estimation module then outputs a set of 2D locations for each joint of each person in the image being processed. For each joint, a triplet (x, y, score) is generated, where x and y are the horizontal and vertical positions of the joint, respectively, and the score determines the confidence of the joint detection; [0016] The model consists of 21 joints, where each joint is a 3D vector): inputting the image frame into an object detection model; acquiring the object pose detection result by the object detection model ([0008] The input device captures an image stream of at least 25 frames/second for smoother results, which is then fed into the 2D human pose estimation module; [0013] Figure 2 shows a general 2D human pose estimation system. The system takes as input an image or a sequence of images (0201). They are then passed to a pose estimator (0202) which generates a set of 2D joints (0203) for each person in the input image).
WANG does not expressly teach inputting the image frame into a part detection model; and acquiring the part pose detection result through the part detection model.
However, MEHTA teaches inputting the image frame into a part detection model; and (Page 5, right col., line 1-2, Stage I uses a convolutional neural network to process the complete input frame, jointly handling all subjects in the scene; Page 5, left col., line 3-4 from the bottom, The input to our method is a live video feed, i.e., a stream of monocular color frames showing a multi-person scene) acquiring the part pose detection result through the part detection model (Page 2, right col., line 8-13, Since Stage I handles the already complex task of parsing the image for body parts, as well as associating the body parts to identities, our key insight with regards to the pose formulation is to have Stage I only consider body joints for which direct image evidence is available, i.e., joints that are themselves visible).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the method and device of WANG to incorporate the step/system of acquiring body joints by inputting the image frame into a network, as taught by MEHTA.
The motivation for this combination has been stated in the rejection of claim 1 above.
Regarding claim 3, the combination of WANG, HSU and MEHTA teaches all the limitations of claim 2 above. WANG teaches wherein the inputting the image frame into an object detection model and acquiring the object pose detection result by the object detection model comprises ([0008] The input device captures an image stream of at least 25 frames/second for smoother results, which is then fed into the 2D human pose estimation module; [0013] Figure 2 shows a general 2D human pose estimation system. The system takes as input an image or a sequence of images (0201). They are then passed to a pose estimator (0202) which generates a set of 2D joints (0203) for each person in the input image): inputting the image frame into the object detection model ([0008] The input device captures an image stream of at least 25 frames/second for smoother results, which is then fed into the 2D human pose estimation module).
WANG does not expressly teach acquiring an object pose feature corresponding to the object by the object detection model; recognizing a first classification result corresponding to the object pose feature, wherein the first classification result is used for characterizing an object part class corresponding to key points of the object; generating a first activation map according to the first classification result and an object convolutional feature of the image frame outputted by the object detection model; acquiring a pixel average value corresponding to the first activation map; determining a positioning result of the key points of the object in the image frame according to the pixel average value; and determining the object pose detection result according to the object part class and the positioning result.
However, MEHTA teaches acquiring an object pose feature corresponding to the object by the object detection model (Page 1, left col., line 9-11, The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals; Page 4, Fig. 2 shows the "2D branch" obtaining the pose feature); recognizing a first classification result corresponding to the object pose feature (Page 2, right col., line 8-10, Since Stage I handles the already complex task of parsing the image for body parts, as well as associating the body parts to identities), wherein the first classification result is used for characterizing an object part class corresponding to key points of the object (Page 5, right col., line 31-32, Our algorithm and pose representation applies to any CNN architecture suitable for keypoint prediction; Page 10, right col., line 1-2, Evaluation of 2D key point detections of the complete Stage I of our system (both 2D and 3D branches trained); Fig. 2 shows the "J+2J" block generating J maps that indicate at each pixel whether a joint type exists or not); generating a first activation map according to the first classification result and (Page 6, left col., line 13-15, each map represents the per-pixel confidence of the presence of body joint type j jointly for all subjects in the scene; Page 7, left col., 2nd para., For each individual k, the 2D pose P2D_k, joint confidences {c_j,k} (j = 1, ..., J), and 3D pose encodings {l_j,k} (j = 1, ..., J) at the visible joints are extracted and input to Stage II (Sec. 4.2); FIG. 3 shows S_k, which is an activation map) an object convolutional feature of the image frame outputted by the object detection model (Page 2, right col., line 3-7, Our pose formulation uses two deep neural network stages that perform local (per body joint) and global (all body joints) reasoning, respectively. Stage I is fully convolutional and jointly reasons about the 2D and 3D pose for all the subjects in the scene at once; Page 1, left col., line 9-11, The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals; Page 4, Fig. 2 shows the "2D branch" and "3D branch" obtaining the convolutional features); acquiring a pixel average value corresponding to the first activation map (Page 7, left col., 2nd para., For each individual k, the 2D pose P2D_k, joint confidences {c_j,k} (j = 1, ..., J), and 3D pose encodings {l_j,k} (j = 1, ..., J) at the visible joints are extracted and input to Stage II (Sec. 4.2). Stage II uses a fully-connected decoding network that leverages the full body context that is available to it, to give the complete 3D pose with the occluded joints filled in; FIG. 3 shows S_k containing the individual's 2D joint locations with joint detection confidences and 3D pose encodings, and "L" (the output of the 3D pose encodings) can be considered a pixel average value); determining a positioning result of the key points of the object in the image frame according to the pixel average value (Page 7, right col., line 1-3, Stage II uses a lightweight fully-connected network to predict the root-relative 3D joint positions {P3D_k} (k = 1, ..., K) for each individual considered visible after Stage I); and determining the object pose detection result according to the object part class and the positioning result (Fig. 4; Page 7, right col., line 5-7, For each individual k, at each detected joint location, we extract the 1×1×(3·J) 3D pose encoding vector l_j,k; Page 5, right col., line 31-32, Our algorithm and pose representation applies to any CNN architecture suitable for keypoint prediction; Page 10, right col., line 1-2, Evaluation of 2D key point detections of the complete Stage I of our system (both 2D and 3D branches trained)).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the method and device of WANG to incorporate the step/system of acquiring body joints by inputting the image frame into a network and determining 3D joint positions of the object in the image frame by using the average value from the activation map, as taught by MEHTA.
The motivation for this combination has been stated in the rejection of claim 1 above.
Regarding claim 5, the combination of WANG, HSU and MEHTA teaches all the limitations of claim 2 above. MEHTA teaches wherein the inputting the image frame into a part detection model and acquiring the part pose detection result through the part detection model comprises: inputting the image frame into the part detection model; detecting a first object part of the object in the part detection model (Page 5, right col., line 1-2, Stage I uses a convolutional neural network to process the complete input frame, jointly handling all subjects in the scene; Page 5, left col., line 3-4 from the bottom, The input to our method is a live video feed, i.e., a stream of monocular color frames showing a multi-person scene; Page 2, right col., line 8-13, Since Stage I handles the already complex task of parsing the image for body parts, as well as associating the body parts to identities, our key insight with regards to the pose formulation is to have Stage I only consider body joints for which direct image evidence is available, i.e., joints that are themselves visible); in a case that the first object part is detected in the image frame, acquiring an area image containing the first object part from the image frame; acquiring part key point positions corresponding to the first object part according to the area image; determining a part pose detection result corresponding to the image frame based on the part key point positions (Page 6, Fig. 3. Input to Stage II: S_k, for each detected individual k, is comprised of the individual's 2D joint locations P2D_k, the associated joint detection confidence values C extracted from the 2D branch output, and the respective 3D pose encodings {l_j,k} (j = 1, ..., J) extracted from the output of the 3D branch. Refer to Section 4 for details); and in a case that the first object part is not detected in the image frame, determining that the part pose detection result corresponding to the image frame is a null value (Page 7, right col., line 10-11, If the joint is not visible, we instead concatenate zero vectors of appropriate dimensions (see Figure 3)).
Regarding claim 7, the combination of WANG, HSU and MEHTA teaches all the limitations of claim 1 above. WANG teaches wherein the performing interpolation processing further comprises: acquiring the standard pose associated with the object ([0006] we use a recursive method to recover the missing joints from the known joint information in the frames), and determining a first key point quantity corresponding to the standard pose, and a second key point quantity corresponding to the object pose detection result; in a case that the first key point quantity is greater than the second key point quantity ([0106] The pose estimator detects most of the joints, but a few joint localizations are lost, and the missing joints are recovered using the known joint information), performing interpolation processing on the at least one object part missing from the object pose detection result according to the standard pose to obtain a first candidate object pose; and performing interpolation processing on the object part associated with the first object part in the first candidate object pose according to the part pose detection result to obtain a global pose corresponding to the object ([0106] The pose estimator detects most of the joints, but a few joint localizations are lost, and the missing joints are recovered using the known joint information; [0006] we use a recursive method to recover the missing joints from the known joint information in the frames. The recovery phase is the basis of our invention to further improve the quality of the final pose; [0060] Figure 12 shows the pose recovery and updated joints for the first frame of the image sequence. The lost joints ... are recovered and updated; [0064] FIG16 shows the posture obtained by joint detection in the second frame of the image sequence. In this frame, all joints except joints J0, J2, and J10 are detected; [0065] Figure 17 shows the pose recovery and updated joints for the second frame of the image sequence. The lost joints J0, J2, and J10 are recovered and updated).
Regarding claim 11, the combination of WANG, HSU and MEHTA teaches all the limitations of claim 1 above. MEHTA teaches further comprising: constructing a virtual object associated with the object (Page 11, right col., line 7-10, Character Animation: Since we reconstruct temporally coherent joint angles and our camera-relative subject localization estimates are stable, the output of our system can readily be employed to animate virtual avatars as shown in Figure 7), and controlling a pose of the virtual object (Page 2, right col., line 10-13 from the bottom, we fit a model-based skeleton to the 3D and 2D predictions in order to satisfy kinematic constraints and reconcile the 2D and 3D predictions across time. This produces temporally stable predictions, with skeletal angle estimates, which can readily drive virtual characters; Page 4, right col., line 9-15, our approach works … yielding skeletal joint angles and camera relative positioning of the subject, which can be readily be used to control animated characters in a virtual environment. Our approach predicts the complete body pose even under significant person-object occlusions; Page 10, right col., line 29-31, results in temporally smooth joint-angle estimates which can readily be used to drive virtual characters; Fig. 7) according to the global pose (Page 4, line 3-4, Fig. 2. Overview: ... Stage II is a compact fully-connected network that runs in parallel for each detected person, and reconstructs the complete 3D pose, including occluded joints, by leveraging global (full body) context; Page 5, right col., line 13-16, Stage II, which we discuss in Section 4.2, uses a lightweight fully connected neural network that ‘decodes’ the input from the previous stage into a full 3D pose, i.e. root-relative 3D joint positions for visible and occluded joints, per individual).
With respect to claim 12, arguments analogous to those presented for claim 1 are applicable.
With respect to claim 13, arguments analogous to those presented for claim 1 are applicable.
With respect to claim 14, arguments analogous to those presented for claim 1 are applicable.
Allowable Subject Matter
Claims 4, 6 and 8-10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL C. CHANG whose telephone number is (571)270-1277. The examiner can normally be reached Monday-Thursday and alternate Fridays, 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Chan S. Park, can be reached at (571) 272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL C CHANG/
Examiner, Art Unit 2669

/CHAN S PARK/
Supervisory Patent Examiner, Art Unit 2669