DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 08/25/2025 has been entered. Applicant’s amendments to claims 7 and 9 have overcome each and every objection previously set forth in the Non-Final Office Action mailed 06/18/2025. Claims 1-5, 7-18, and 20-22 remain pending in the application, with claims 6 and 19 having been cancelled.
Response to Arguments
Applicant argues on pg. 18 of the Remarks filed 08/25/2025 that “Zhang does not determine, in ‘the aligned depth image,’ at least one nearest neighbor of ‘the position of the target human joint point’” and that Zhang “does not consider, for human each joint, a plurality of depth values from ‘the aligned depth image.’” Examiner agrees; however, the aforementioned claim limitations are taught by the combination of Zhang in view of Potje (as applied to claims 1 and 6, respectively, in the Non-Final Office Action mailed 06/18/2025). Zhang determines a depth field value in physical space (Zhang, “the three-dimensional distance” at the beginning of para 97 on pg. 20) based on a distance between two three-dimensional values (Zhang, pg. 20, para 97: “the three-dimensional distance between the position of the target human joint point obtained and the corresponding human joint point observed based on the depth image.” Emphasis added.). Para 97 of Zhang further explains that one of the three-dimensional values, determined for each keypoint, or joint, is computed using a calculation based on the position of the camera (Zhang, pg. 20, para 99: “pixel coordinates to camera coordinates”). Potje teaches a nearest neighbor calculation, wherein a value associated with a keypoint is compared with multiple other values to determine the smallest distance (Potje, 1st paragraph on pg. 9, teaching the plurality of values for each keypoint). In the combination of Zhang and Potje, instead of utilizing the camera position to determine the 3D location of each reference keypoint in the depth image, a nearest neighbor calculation between multiple other similar values and the target human joint point is utilized (further description of the rationale to combine teachings is provided below). Thus, Zhang in view of Potje teaches amended claim 1, as further described in the claim rejection below.
Applicant argues on pg. 19 of the Remarks filed 08/25/2025, with regard to amended claim 1, that Potje fails to make up for the shortcomings of Zhang. Examiner respectfully disagrees. As described above, Zhang in view of Potje teaches the limitations of claim 1. Although Potje determines a distance in a feature space, not a 3D physical space, Potje is relied upon solely to teach the use of a nearest neighbor calculation to determine a distance comparison between a keypoint and other similar values (see explanation above). Zhang is still relied upon to teach the keypoints and locations in a 3D physical space. See MPEP 2145(IV): “One cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.” Thus, Zhang in view of Potje teaches amended claim 1, as further described in the claim rejection provided below.
Applicant argues in the 1st paragraph on pg. 20 of the Remarks filed 08/25/2025, that “a “person of ordinary skill in the art having common sense at the time of the invention would not have reasonably looked to [a second reference] to solve a problem already solved by [a first reference].” Ex parte Rinkevich, PTAB, 2007-1317 (Decided May 29, 2007), p. 9.” Additionally, Applicant argues in the 3rd-4th paragraphs on pg. 20 of the Remarks filed 08/25/2025, that a person of ordinary skill in the art would have no reason to look to Potje since this problem is already solved by Zhang, and further that the proposed modification of Zhang with the teachings of Potje lacks the “rational underpinning [required] to support the legal conclusion of obviousness”. As described in the Non-Final Office Action mailed 06/18/2025, the motivation to combine the teachings of Zhang and Potje relies on the KSR decision; see rationale (B) in MPEP 2143(I). A simple substitution of one known element for another requires: (1) a finding that the prior art contained a device (method, product, etc.) which differed from the claimed device by the substitution of some components (step, element, etc.) with other components; (2) a finding that the substituted components and their functions were known in the art; (3) a finding that one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable; and (4) whatever additional findings based on the Graham factual inquiries may be necessary, in view of the facts of the case under consideration, to explain a conclusion of obviousness (MPEP 2143(I)(B)). Because the rejection meets the requirements above (see rejection below), the proposed modification of Zhang in view of Potje is maintained.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-5, 7, 16-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (CN Patent No. 113610889 A), hereinafter Zhang, in view of Potje et al. (Potje, G., Martins, R., Cadar, F., & Nascimento, E. R., Learning geodesic-aware local features from RGB-D images, 2022, Computer Vision and Image Understanding, 219, 103409.), hereinafter Potje.
Regarding claim 1, Zhang teaches a computer-implemented method (Zhang, pg. 27, para 129: “the present invention implements all or part of the processes in the abovementioned embodiment method, and can also be completed by instructing related hardware through a computer program”) comprising:
determining, for each respective keypoint of a plurality of keypoints (Zhang, pg. 11, para 62: “human joint points”), a corresponding initial three-dimensional (3D) position of the respective keypoint (Zhang, pg. 11, para 63: “three-dimensional coordinate information of the human joint points corresponding to the two-dimensional coordinate information of each of the human joint points”), wherein the plurality of keypoints represents a corresponding plurality of predetermined body locations of a body of an actor (Zhang, pg. 11, para 62: “human joint points is the position coordinate of the human joint point of the target object in the above color image”; pg. 16, para 83: “the 15 human joints are preferably the head, neck, mid-hip, left shoulder…”);
receiving sensor data representing the body of the actor and comprising depth data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”);
determining, for each respective keypoint of the plurality of keypoints, a corresponding depth field value based on a distance between (i) the corresponding initial 3D position and (ii) the corresponding reference 3D position (Zhang, pixel coordinates to camera coordinates, pg. 19, para 97: “based on the 15 human joints identified in the above color image, the depth corresponding to each human joint can be obtained in the aligned depth image…the conversion formula from pixel coordinates to camera coordinates, i.e., the above formula (1), can be used to obtain the observation coordinates of 15 human joints in the camera coordinate system”);
determining, based on the corresponding depth field value of each respective keypoint of the plurality of keypoints, a loss value (Zhang, depth field values are reflected in the three-dimensional joint point loss function, on which the summed loss function is based; pg. 21, para 101: “the above-mentioned loss functions are combined and summed to obtain the value of the sum of the loss functions”); and
determining, for each respective keypoint of the plurality of keypoints, a corresponding updated 3D position of the respective keypoint based on the loss value (Zhang, pg. 17, para 88: “Step S402, iteratively fitting the above-mentioned human body joints based on the above-mentioned loss function to obtain the position information of the target human body joints”).
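For illustration only, the distance computation mapped to Zhang above can be sketched as follows. The sketch assumes a standard pinhole camera model for the conversion from pixel coordinates to camera coordinates; Zhang's formula (1) is not reproduced in this action, so the function names and the intrinsic parameters (fx, fy, cx, cy) are hypothetical and are not taken from Zhang.

```python
import math


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth reading into camera coordinates.

    Assumption: a standard pinhole model with focal lengths (fx, fy) and
    principal point (cx, cy). This is not Zhang's formula (1).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_field_value(initial_position, reference_position):
    """Euclidean distance between an initial and a reference 3D position."""
    return math.dist(initial_position, reference_position)


# Hypothetical joint observed at pixel (820, 240) with a depth of 2.0 units.
observed = pixel_to_camera(820, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
estimate = (0.0, 0.0, 2.0)  # hypothetical initial 3D position of the same joint
value = depth_field_value(estimate, observed)
```

Under these assumptions, one such distance per joint would feed the joint point loss term that the iterative fitting step minimizes.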
While Zhang teaches the determining of a physical 3D distance between two 3D positions, representing a spatial relationship (Zhang, see pg. 20, para 97: “the three-dimensional distance between the position of the target human joint point obtained and the corresponding human joint point observed based on the depth image.”), Zhang fails to explicitly teach wherein the corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data, and thus fails to teach further: determining, for each respective keypoint of the plurality of keypoints, a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a physical distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data; and selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is spatially closest to the corresponding initial 3D position.
However, Potje teaches the detection of keypoints in images (Potje, first paragraph in section 4.3 on pg. 8: “the task of matching the keypoints consists in, for each keypoint in the reference set, find its correspondence in the other set in case the keypoint is visible in both sets”), wherein a corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data (Potje, first paragraph on pg. 9: “we matched the keypoint descriptors from pairs of images using brute force nearest neighbor search, i.e., for each descriptor in the reference set, we compare it with all descriptors in the target set and consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”);
determining, for each respective keypoint of the plurality of keypoints (Taught by Zhang, depth field value calculated for 15 joint points, pg. 19, para 97), a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a physical distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data (Potje, Euclidean distance between keypoints, see combination with Zhang described further below, first paragraph on pg. 9); and
selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is spatially closest to the corresponding initial 3D position (Potje, first paragraph on pg. 9: “consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”).
Potje discloses a method for determining a keypoint position by comparing the locations of keypoints in two images and detecting the nearest keypoint, wherein the reference position is the nearest neighbor. Zhang discloses determining a keypoint location based on the camera position. A person of ordinary skill in the art, before the effective filing date of the claimed invention, would have recognized that the method of detecting keypoints based on camera position, taught by Zhang, could have been substituted with the method of detecting keypoints based on their nearest neighbor reference position, taught by Potje, because both serve the purpose of detecting the corresponding keypoint based on a reference position and the initial keypoint. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the corresponding reference 3D position of Potje for the corresponding reference 3D position of Zhang according to known methods to yield the predictable result of determining the nearest keypoint to the detected initial keypoint.
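As a minimal sketch of the brute-force nearest neighbor selection relied upon in the combination, the following computes a plurality of difference values for one keypoint and selects the spatially closest 3D point. The function name, argument names, and sample coordinates are hypothetical; the code is an illustration of the general technique and is not taken from Zhang or Potje.

```python
import math


def select_nearest_neighbor(initial_position, depth_points):
    """Return (nearest point, its distance, all distances) for one keypoint.

    Brute force: the initial 3D position is compared with every 3D point of
    the depth data, and the point with the smallest Euclidean distance wins.
    """
    if not depth_points:
        raise ValueError("depth_points must contain at least one 3D point")
    differences = [math.dist(initial_position, point) for point in depth_points]
    best = min(range(len(differences)), key=differences.__getitem__)
    return depth_points[best], differences[best], differences


# Hypothetical depth points and a hypothetical initial keypoint position.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
nearest, distance, all_distances = select_nearest_neighbor((0.0, 0.0, 2.5), points)
```

Repeating the call once per keypoint yields one reference 3D position per keypoint.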
Regarding claim 3 (dependent on claim 1), Zhang in view of Potje teaches further comprising:
determining, for each respective keypoint of the plurality of keypoints, a corresponding preceding 3D position of the respective keypoint, wherein the corresponding preceding 3D position is associated with a first time (Zhang, target human joints for consecutive frames, used in the smoothing loss function, see pg. 21, para 100; each consecutive frame is processed sequentially, which would include frames before the initial position, see pg. 24, para 114), and wherein each of the corresponding initial 3D position and the corresponding updated 3D position is associated with a second time that is subsequent to the first time (Zhang, the 3D position at this time is iteratively updated to create the updated 3D position, see pg. 17, para 88); and
determining, for each respective keypoint of the plurality of keypoints, a corresponding position difference value based on (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding preceding 3D position of the respective keypoint (Zhang, pg. 21, para 100: L2 smoothing loss between target human joints in consecutive frames), wherein the loss value is further based on the corresponding position difference value of each respective keypoint of the plurality of keypoints (Zhang, pg. 21, para 101: the summed loss function includes the smoothness function, L2 loss).
Regarding claim 4 (dependent on claim 3), Zhang in view of Potje teaches further comprising:
determining, for each respective keypoint of the plurality of keypoints, a corresponding predicted 3D position of the respective keypoint (Zhang, position of the joint points after fitting with the loss function, see pg. 17, para 88) by propagating the corresponding preceding 3D position of the respective keypoint from the first time to the second time based on a tracked motion of the respective keypoint (Zhang, fitting the model based on movement between frames, pg. 12, para 67: “a smoothing loss function can be set in the preset loss function to ensure that the three-dimensional human body model fitted by the upper and lower frames is as smooth as possible. By calculating the L2 loss of the joint points in the three-dimensional human body model fitted by the upper and lower frames, the visual effect may be affected by large joint point position jumps between frames”), wherein each of the corresponding initial 3D position and the corresponding updated 3D position represents the predicted 3D position updated based on the sensor data (Zhang, processing is done on sensor data, the video stream frames, pg. 12, para 67: “plurality of consecutive frames of synchronized and aligned color images and depth images”; the updated 3D position is a predicted position of sensor data that is iteratively fitted based on the loss value), and wherein the corresponding position difference value is based on a distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding predicted 3D position of the respective keypoint (Zhang, pg. 21, para 100: L2 smoothing loss between target human joints in consecutive frames, as referenced in claim 3).
Regarding claim 5 (dependent on claim 3), Zhang in view of Potje teaches wherein the loss value is based on a weighted sum (Zhang, pg. 21, para 101: “The combined summation of the above loss functions may be performed by direct addition or by summing them according to weight distribution”; Zhang teaches a weighted sum of multiple loss functions, including the GM loss, which represents the depth field values, and the L2 loss, which represents the smoothing loss, or the position difference values – (i) and (ii) below) of (i) a square of the corresponding depth field value of each respective keypoint of the plurality of keypoints and (ii) a square of the corresponding position difference value of each respective keypoint of the plurality of keypoints (Zhang, pg. 21, para 100: “L2 loss”).
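The weighted sum recited in claim 5 can be sketched as below, for illustration only. The sketch uses plain squared terms as recited in the claim rather than the specific loss functions of Zhang, and the function name, weights, and sample values are hypothetical.

```python
def weighted_squared_loss(depth_field_values, position_differences,
                          depth_weight=1.0, smooth_weight=1.0):
    """Weighted sum of squared per-keypoint terms.

    (i) squares of the depth field values and (ii) squares of the position
    difference values, each group scaled by its own (assumed) weight.
    """
    depth_term = sum(d * d for d in depth_field_values)
    smooth_term = sum(p * p for p in position_differences)
    return depth_weight * depth_term + smooth_weight * smooth_term


# Hypothetical values for two keypoints' depth terms and one smoothing term.
loss = weighted_squared_loss([1.0, 2.0], [3.0], depth_weight=0.5, smooth_weight=2.0)
```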
Regarding claim 7 (dependent on claim 1), Zhang in view of Potje teaches wherein the sensor data comprises image data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”), and wherein determining the corresponding reference 3D position comprises: determining, based on the image data, a mask that indicates a portion of the image data occupied by the actor (Zhang, human body segmentation, pg. 10, para 56: “based on the above depth image, respectively obtains three-dimensional coordinate information of human body joints corresponding to each of the above two-dimensional coordinate information of human body joints, and human body segmentation depth areas corresponding to each of the above human body segmentation areas”); and selecting the at least one nearest neighbor (nearest neighbor selection taught by Potje, see Zhang in view of Potje combination in claim 1) by selecting, from the depth data, at least one 3D point that is positioned within the mask (Potje, selection of a matching keypoint that must be in the target object’s mask, pg. 11, first paragraph: “Since the original images contain more than the object itself, we used a mask to detect SIFT keypoints only in the object region for all images in the database”).
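A minimal sketch of restricting the nearest neighbor selection to points inside a mask follows, for illustration only. It assumes each 3D point of the depth data is paired with the pixel (u, v) it came from and that the mask is a row-major grid of booleans; these data layouts and all names are hypothetical and are not taken from Zhang or Potje.

```python
import math


def nearest_neighbor_in_mask(initial_position, depth_points, pixels, mask):
    """Select the closest 3D point whose source pixel lies inside the mask.

    depth_points[i] is assumed to originate from pixels[i] == (u, v), and
    mask[v][u] is True where the image is occupied by the actor.
    Returns None when no point falls inside the mask.
    """
    candidates = [
        point
        for point, (u, v) in zip(depth_points, pixels)
        if mask[v][u]
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda point: math.dist(initial_position, point))


# Hypothetical 2x2 mask; the globally closest point sits outside the mask.
mask = [[False, True],
        [True, False]]
pixels = [(0, 0), (1, 0), (0, 1)]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.5, 1.0)]
chosen = nearest_neighbor_in_mask((0.0, 0.0, 1.1), points, pixels, mask)
```

In this example the point at pixel (0, 0) is closest overall but is excluded by the mask, so the next closest masked point is selected.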
Regarding claim 16 (dependent on claim 1), Zhang in view of Potje teaches wherein:
the sensor data represents a corresponding body of each of a plurality of actors (Zhang, pg. 22, para 107: “the above-mentioned color image and depth image may include multiple target objects”);
determining the corresponding initial 3D position of the respective keypoint comprises determining, for each respective actor of the plurality of actors, an actor-specific initial 3D position for each respective keypoint of a plurality of keypoints associated with the respective actor;
determining the corresponding reference 3D position of the respective keypoint comprises determining, for each respective actor of the plurality of actors, a corresponding actor-specific reference 3D position for each respective keypoint of a plurality of keypoints associated with the respective actor;
determining the corresponding depth field value comprises determining, for each respective actor of the plurality of actors, a corresponding actor-specific depth field value for each respective keypoint of the plurality of keypoints associated with the respective actor;
determining the loss value comprises determining, for each respective actor of the plurality of actors, an actor-specific loss value based on the corresponding actor-specific depth field value of each respective keypoint of the plurality of keypoints associated with the respective actor; and
determining the corresponding updated 3D position comprises determining, for each respective actor of the plurality of actors, an actor-specific updated 3D position of each respective keypoint of the plurality of keypoints associated with the respective actor based on the corresponding actor-specific loss value (Zhang, the limitations of claim 16 are taught in claim 1 for an actor, the target object; Zhang teaches wherein each actor is a separate model. Separate models would have different depth values, 3D positions, etc., and therefore different loss values as well, pg. 22, para 107: “In this embodiment, a specific description is given by taking the existence of one target object as an example. When there are multiple target objects, the device in this embodiment can be used to reconstruct the three-dimensional human body model for each target object separately”).
Regarding claim 17, Zhang teaches a system comprising:
a processor (Zhang, pg. 25, para 116: “When the human body three-dimensional model acquisition program is executed by a processor, the steps of any one of the above-mentioned human body three-dimensional model acquisition methods are implemented”);
a non-transitory computer-readable medium having stored thereon instructions (Zhang, pg. 27, para 129: computer program code and computer-readable medium) that, when executed by the processor, cause the processor to perform operations comprising:
determining, for each respective keypoint of a plurality of keypoints (Zhang, pg. 11, para 62: “human joint points”), a corresponding initial three-dimensional (3D) position of the respective keypoint (Zhang, pg. 11, para 63: “three-dimensional coordinate information of the human joint points corresponding to the two-dimensional coordinate information of each of the human joint points”), wherein the plurality of keypoints represents a corresponding plurality of predetermined body locations of a body of an actor (Zhang, pg. 11, para 62: “human joint points is the position coordinate of the human joint point of the target object in the above color image”; pg. 16, para 83: “the 15 human joints are preferably the head, neck, mid-hip, left shoulder…”);
receiving sensor data representing the body of the actor and comprising depth data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”);
determining, for each respective keypoint of the plurality of keypoints, a corresponding depth field value based on a distance between (i) the corresponding initial 3D position and (ii) the corresponding reference 3D position (Zhang, pixel coordinates to camera coordinates, pg. 19, para 97: “based on the 15 human joints identified in the above color image, the depth corresponding to each human joint can be obtained in the aligned depth image…the conversion formula from pixel coordinates to camera coordinates, i.e., the above formula (1), can be used to obtain the observation coordinates of 15 human joints in the camera coordinate system”);
determining, based on the corresponding depth field value of each respective keypoint of the plurality of keypoints, a loss value (Zhang, depth field values are reflected in the three-dimensional joint point loss function, on which the summed loss function is based; pg. 21, para 101: “the above-mentioned loss functions are combined and summed to obtain the value of the sum of the loss functions”); and
determining, for each respective keypoint of the plurality of keypoints, a corresponding updated 3D position of the respective keypoint based on the loss value (Zhang, pg. 17, para 88: “Step S402, iteratively fitting the above-mentioned human body joints based on the above-mentioned loss function to obtain the position information of the target human body joints”).
While Zhang teaches the determining of a physical 3D distance between two 3D positions, representing a spatial relationship (Zhang, see pg. 20, para 97: “the three-dimensional distance between the position of the target human joint point obtained and the corresponding human joint point observed based on the depth image.”), Zhang fails to explicitly teach wherein the corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data, and thus fails to teach further: determining, for each respective keypoint of the plurality of keypoints, a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a physical distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data; and selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is spatially closest to the corresponding initial 3D position.
However, Potje teaches the detection of keypoints in images (Potje, first paragraph in section 4.3 on pg. 8: “the task of matching the keypoints consists in, for each keypoint in the reference set, find its correspondence in the other set in case the keypoint is visible in both sets”), wherein a corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data (Potje, first paragraph on pg. 9: “we matched the keypoint descriptors from pairs of images using brute force nearest neighbor search, i.e., for each descriptor in the reference set, we compare it with all descriptors in the target set and consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”);
determining, for each respective keypoint of the plurality of keypoints (Taught by Zhang, depth field value calculated for 15 joint points, pg. 19, para 97), a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a physical distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data (Potje, Euclidean distance between keypoints, see combination with Zhang described further below, first paragraph on pg. 9); and
selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is spatially closest to the corresponding initial 3D position (Potje, first paragraph on pg. 9: “consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”).
Potje discloses a method for determining a keypoint position by comparing the locations of keypoints in two images and detecting the nearest keypoint, wherein the reference position is the nearest neighbor. Zhang discloses determining a keypoint location based on the camera position. A person of ordinary skill in the art, before the effective filing date of the claimed invention, would have recognized that the method of detecting keypoints based on camera position, taught by Zhang, could have been substituted with the method of detecting keypoints based on their nearest neighbor reference position, taught by Potje, because both serve the purpose of detecting the corresponding keypoint based on a reference position and the initial keypoint. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the corresponding reference 3D position of Potje for the corresponding reference 3D position of Zhang according to known methods to yield the predictable result of determining the nearest keypoint to the detected initial keypoint.
Regarding claim 18 (dependent on claim 17), Zhang in view of Potje teaches wherein the operations further comprise:
determining, for each respective keypoint of the plurality of keypoints, a corresponding preceding 3D position of the respective keypoint, wherein the corresponding preceding 3D position is associated with a first time (Zhang, target human joints for consecutive frames, used in the smoothing loss function, see pg. 21, para 100; each consecutive frame is processed sequentially, which would include frames before the initial position, see pg. 24, para 114), and wherein each of the corresponding initial 3D position and the corresponding updated 3D position is associated with a second time that is subsequent to the first time (Zhang, the 3D position at this time is iteratively updated to create the updated 3D position, see pg. 17, para 88); and
determining, for each respective keypoint of the plurality of keypoints, a corresponding position difference value based on (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding preceding 3D position of the respective keypoint (Zhang, pg. 21, para 100: L2 smoothing loss between target human joints in consecutive frames), wherein the loss value is further based on the corresponding position difference value of each respective keypoint of the plurality of keypoints (Zhang, pg. 21, para 101: the summed loss function includes the smoothness function, L2 loss).
Regarding claim 20, Zhang teaches a non-transitory computer-readable medium having stored thereon instructions (Zhang, pg. 27, para 129: computer program code and computer-readable medium) that, when executed by a computing device (Zhang, pg. 27, para 127: “computer software and electronic hardware”), cause the computing device to perform operations comprising:
determining, for each respective keypoint of a plurality of keypoints (Zhang, pg. 11, para 62: “human joint points”), a corresponding initial three-dimensional (3D) position of the respective keypoint (Zhang, pg. 11, para 63: “three-dimensional coordinate information of the human joint points corresponding to the two-dimensional coordinate information of each of the human joint points”), wherein the plurality of keypoints represents a corresponding plurality of predetermined body locations of a body of an actor (Zhang, pg. 11, para 62: “human joint points is the position coordinate of the human joint point of the target object in the above color image”; pg. 16, para 83: “the 15 human joints are preferably the head, neck, mid-hip, left shoulder…”);
receiving sensor data representing the body of the actor and comprising depth data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”);
determining, for each respective keypoint of the plurality of keypoints, a corresponding depth field value based on a distance between (i) the corresponding initial 3D position and (ii) the corresponding reference 3D position (Zhang, pixel coordinates to camera coordinates, pg. 19, para 97: “based on the 15 human joints identified in the above color image, the depth corresponding to each human joint can be obtained in the aligned depth image…the conversion formula from pixel coordinates to camera coordinates, i.e., the above formula (1), can be used to obtain the observation coordinates of 15 human joints in the camera coordinate system”);
determining, based on the corresponding depth field value of each respective keypoint of the plurality of keypoints, a loss value (Zhang, depth field values are reflected in the three-dimensional joint point loss function, on which the summed loss function is based; pg. 21, para 101: “the above-mentioned loss functions are combined and summed to obtain the value of the sum of the loss functions”); and
determining, for each respective keypoint of the plurality of keypoints, a corresponding updated 3D position of the respective keypoint based on the loss value (Zhang, pg. 17, para 88: “Step S402, iteratively fitting the above-mentioned human body joints based on the above-mentioned loss function to obtain the position information of the target human body joints”).
While Zhang teaches the determining of a physical 3D distance between two 3D positions, representing a spatial relationship (Zhang, see pg. 20, para 97: “the three-dimensional distance between the position of the target human joint point obtained and the corresponding human joint point observed based on the depth image.”), Zhang fails to explicitly teach wherein the corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data, and thus fails to teach further: determining, for each respective keypoint of the plurality of keypoints, a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a physical distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data; and selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is spatially closest to the corresponding initial 3D position.
However, Potje teaches the detection of keypoints in images (Potje, first paragraph in section 4.3 on pg. 8: “the task of matching the keypoints consists in, for each keypoint in the reference set, find its correspondence in the other set in case the keypoint is visible in both sets”), wherein a corresponding reference 3D position is based on at least one nearest neighbor of the corresponding initial 3D position in the depth data (Potje, first paragraph on pg. 9: “we matched the keypoint descriptors from pairs of images using brute force nearest neighbor search, i.e., for each descriptor in the reference set, we compare it with all descriptors in the target set and consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”);
determining, for each respective keypoint of the plurality of keypoints (Taught by Zhang, depth field value calculated for 15 joint points, pg. 19, para 97), a corresponding plurality of difference values, wherein each respective difference value of the corresponding plurality of difference values represents an estimate of a distance between (i) the corresponding initial 3D position of the respective keypoint and (ii) the corresponding 3D position of each of a plurality of 3D points of the depth data (Potje, Euclidean distance between keypoints, see combination with Zhang described further below, first paragraph on pg. 9); and
selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, the at least one nearest neighbor that is closest to the corresponding initial 3D position (Potje, first paragraph on pg. 9: “consider the descriptor with the smallest Euclidean distance (the nearest neighbor) as its corresponding descriptor”).
Potje discloses a method for determining a keypoint position by comparing the locations of keypoints in two images and detecting the nearest keypoint, wherein the reference position is the nearest neighbor. Zhang discloses determining a keypoint location based on the camera position. A person of ordinary skill in the art, before the effective filing date of the claimed invention, would have recognized that the method of detecting keypoints based on camera position, taught by Zhang, could have been substituted with the method of detecting keypoints based on their nearest neighbor reference position, taught by Potje, because both serve the purpose of detecting the corresponding keypoint based on a reference position and the initial keypoint. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the corresponding reference 3D position of Potje for the corresponding reference 3D position of Zhang according to known methods to yield the predictable result of determining the nearest keypoint to the detected initial keypoint.
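As a minimal sketch of the combination described above, the following illustrates a brute-force nearest-neighbor selection of a reference 3D position from depth points, followed by a distance-based loss. The function names, the sample coordinates, and the squared-distance form of the loss are assumptions chosen for illustration; they are not taken from Zhang or Potje.

```python
import math

def nearest_neighbor(initial_xyz, depth_points):
    """Brute-force search: return (closest depth point, distance to it)."""
    best_point, best_dist = None, math.inf
    for point in depth_points:
        # One "difference value" per depth point: Euclidean distance in 3D.
        dist = math.dist(initial_xyz, point)
        if dist < best_dist:
            best_point, best_dist = point, dist
    return best_point, best_dist

def depth_field_loss(initial_keypoints, depth_points):
    """Sum of squared keypoint-to-nearest-neighbor distances (assumed loss form)."""
    total = 0.0
    for xyz in initial_keypoints:
        _, dist = nearest_neighbor(xyz, depth_points)
        total += dist ** 2
    return total

# Hypothetical depth points and initial keypoint positions, in meters.
cloud = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0), (0.0, 1.0, 2.0)]
keypoints = [(0.1, 0.0, 1.0), (0.0, 0.9, 2.0)]

ref, d = nearest_neighbor(keypoints[0], cloud)
loss = depth_field_loss(keypoints, cloud)
```

Here each keypoint is compared against every depth point, and the smallest distance identifies the reference position used for the depth field value.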
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Luo et al. (U.S. Patent Application Publication No. 2021/0279456 A1), hereinafter Luo.
Regarding claim 2 (dependent on claim 1), Zhang in view of Potje teaches corresponding updated 3D position of each respective keypoint of the plurality of keypoints (Zhang, see claim 1 rejection), but fails to teach further comprising: causing a robotic device to interact with the actor based on a pose represented by the corresponding updated 3D position. However, Luo teaches a human pose estimation method (Luo, Abstract) comprising: causing a robotic device to interact with the actor based on a pose represented by the corresponding updated 3D position (Luo, para 116: “The processing chip can recognize a human body pose in the image of the region based on the pose recognition model, and control the intelligent robot to make a preset response when the recognized human body pose is a preset pose. For example, when the recognized human body pose is a waving pose, an intelligent robot is controlled to make a greeting action”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the robotic device of Luo with the method of Zhang in order to utilize the human pose estimation to allow humans to interact with robots (Luo, para 116: “An example in which the user interacts with an intelligent robot as an example. An image acquisition apparatus and a processing chip are configured in the intelligent robot”).
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Chen et al. (Chen, H., Feng, R., Wu, S., Xu, H., Zhou, F., & Liu, Z., 2D Human Pose Estimation: A Survey, 2022, arXiv preprint arXiv:2204.07370.), hereinafter Chen.
Regarding claim 21 (dependent on claim 1), Zhang in view of Potje teaches wherein the sensor data comprises image data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”), but fails to teach wherein the method further comprises: determining, for each respective keypoint of the plurality of keypoints, a corresponding visibility value based on a visibility of the respective keypoint in the image data, wherein the loss value is further based on the corresponding visibility value (Zhang recognizes the potential of points not being visible, but doesn’t teach a value representing visibility; see mention of the problem of self-occluded joints on pg. 19, para 97).
However, Chen teaches human pose estimation methods disclosing determining, for each respective keypoint of the plurality of keypoints, a corresponding visibility value (Chen, top left Equation 1 on pg. 11: “vj...visibility for joint j”) based on a visibility of the respective keypoint in the image data (Chen, joint, j, is the keypoint, see paragraph under Equation 2, left column on pg. 11), wherein the loss value is further based on the corresponding visibility value (Chen, vj is part of the cost function, Equation 1, top left on pg. 11). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the visibility value of Chen with the method of Zhang in order to account for whether or not a keypoint is occluded when calculating the loss value (Chen, “Most of existing video datasets only label the joint visibility to indicate that whether a joint is occluded”, joint visibility indicates whether the joint is occluded).
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, Chen, and Cai et al. (WO Patent No. 2022241583 A1), hereinafter Cai.
Regarding claim 8 (dependent on claim 21), Zhang in view of Potje and Chen fails to teach wherein determining the corresponding visibility value comprises: determining, for each respective keypoint of the plurality of keypoints, a corresponding confidence value associated with detection of the respective keypoint within the image data; and determining, for each respective keypoint of the plurality of keypoints, the corresponding visibility value based on comparing the corresponding confidence value to a threshold confidence value.
However, Cai teaches motion capture of human body keypoints (Cai, pg. 1-2, para 3) wherein determining the corresponding visibility value comprises:
determining, for each respective keypoint of the plurality of keypoints, a corresponding confidence value associated with detection of the respective keypoint within the image data (Cai, pg. 24, para 164: confidence of key points); and
determining, for each respective keypoint of the plurality of keypoints, the corresponding visibility value based on comparing the corresponding confidence value to a threshold confidence value (Cai, comparison to threshold, pg. 24, para 164: “the confidence of some key points detected in all viewing angles is lower than the preset threshold T, it is considered that some key points of the human body are blocked and are in the blind spot of the viewing angle”; values are different for visible keypoints, pg. 25, para 169: “During regression, only the parameters ω of key points with high confidence are optimized, and the occluded key points maintain the original parameters”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the confidence threshold of Cai with the method of Zhang in order to distinguish between visible and occluded key points, and avoid creating loss values based on occluded values (Cai, occluded points are identified and regression parameters are unchanged, see previous citation).
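As a minimal sketch of the confidence-threshold comparison discussed above, the following assigns a binary visibility value per keypoint and excludes occluded keypoints from a summed loss. The function names, the binary 0/1 encoding, and the sample values are assumptions for illustration and are not taken from Cai or Chen.

```python
def visibility_from_confidence(confidences, threshold):
    """Return 1 for keypoints at or above the threshold, 0 for those below it."""
    return [1 if c >= threshold else 0 for c in confidences]

def visibility_weighted_loss(per_keypoint_losses, visibility):
    """Sum per-keypoint losses, skipping keypoints marked as not visible."""
    return sum(v * loss for v, loss in zip(visibility, per_keypoint_losses))

# Hypothetical detector confidences and per-keypoint loss terms.
confidences = [0.9, 0.2, 0.6]
losses = [0.1, 5.0, 0.3]

visibility = visibility_from_confidence(confidences, threshold=0.5)
total = visibility_weighted_loss(losses, visibility)
```

In this sketch the second keypoint falls below the threshold, so its large loss term does not contribute to the total.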
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, Chen, and Qin (CN Patent No. 112528770 A).
Regarding claim 9 (dependent on claim 21), Zhang in view of Potje and Chen teaches determining, based on the image data, a mask that indicates a portion of the image data occupied by the actor (Zhang, human body segmentation, pg. 10, para 56: “based on the above depth image, respectively obtains three-dimensional coordinate information of human body joints corresponding to each of the above two-dimensional coordinate information of human body joints, and human body segmentation depth areas corresponding to each of the above human body segmentation areas”);
determining, for each respective keypoint of the plurality of keypoints, a corresponding position of the respective keypoint relative to the mask (Zhang, pg. 18, para 93: “Furthermore, the parameterized human body model and the preset loss function are used to iteratively fit the target human body joints and the point clouds corresponding to each point in each human body segmentation depth area to obtain the human body three-dimensional model.”); and determining, for each respective keypoint of the plurality of keypoints, the corresponding visibility value (visibility values taught by Chen, see Zhang in view of Chen combination in claim 21), but fails to teach wherein determining the corresponding visibility value comprises the mask, and therefore fails to teach determining the corresponding visibility value based on the corresponding position of the respective keypoint relative to the mask.
However, Qin teaches determining the visibility value of human body keypoints (Qin, pg. 3, para 8: “The human body ROI area is subjected to feature extraction through the human body mixture model to obtain the human body key point features and image attribute features, and the visibility of the human body key point features is output through the human body mixture model”) and determining the corresponding visibility value based on the corresponding position of the respective keypoint relative to the mask (Qin, the human body ROI area is a mask, see citation above, pg. 24, para 165: “Being occluded by a person is mainly determined by the number of key points of the target person included in the segmented foreground area of the occluding person. Specifically, we first determine whether the two human body frames intersect, and then calculate the number of points where the 14 key points of the target human body fall within the foreground of the occluded human body mesh segmentation”; the position of the potentially occluding human body frame relative to the target human body ROI determines whether there is intersection or not, and thus detection of occluded points is based on their position relative to the target human body ROI). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the visibility detection of Qin with the method of Zhang in order to determine when keypoints are likely to be occluded based on the human in the image (Qin, pg. 24, para 165: “Generally speaking, invisible key points fall on the segmentation foreground with higher confidence and are given greater weight. When the number of invisible key points exceeds a certain threshold, the target body is judged to be occluded by a person”).
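As a minimal sketch of a mask-relative test of the kind discussed above, the following checks whether each keypoint's pixel position falls inside a binary segmentation mask and counts how many do. The grid representation of the mask, the (row, column) keypoint convention, and all names are assumptions for illustration; Qin's exact procedure is not reproduced here.

```python
def inside_mask(mask, row, col):
    """True when (row, col) lies within the grid and on a foreground pixel."""
    in_bounds = 0 <= row < len(mask) and 0 <= col < len(mask[0])
    return in_bounds and mask[row][col] == 1

def count_keypoints_in_mask(mask, keypoints):
    """Count keypoints whose pixel position falls on the mask foreground."""
    return sum(1 for row, col in keypoints if inside_mask(mask, row, col))

# Hypothetical 3x4 foreground mask (1 = foreground) and keypoint pixel positions.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
keypoints = [(1, 1), (0, 0), (5, 5)]

flags = [inside_mask(mask, r, c) for r, c in keypoints]
count = count_keypoints_in_mask(mask, keypoints)
```

The resulting count could then be compared against a threshold, in the manner of the occlusion judgment quoted from Qin.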
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, Chen, and Sun et al. (Sun, S., Liu, R., Du, Q., & Sun, S., Selective embedding with gated fusion for 6D object pose estimation, 2020, Neural Processing Letters, 51(3), 2417-2436.), hereinafter Sun.
Regarding claim 10 (dependent on claim 21), Zhang in view of Potje and Chen teaches wherein determining the corresponding visibility value comprises: determining, for each respective keypoint of the plurality of keypoints, a corresponding depth value associated with the respective keypoint within the depth data (Zhang, pg. 19, para 97: “corresponding human joint point observed based on the depth image”); and determining, for each respective keypoint of the plurality of keypoints, the corresponding visibility value (visibility values taught by Chen, see Zhang in view of Chen combination in claim 21); but fails to teach determining the corresponding visibility value based on comparing a depth of the corresponding initial 3D position of the respective keypoint to the corresponding depth value associated with the respective keypoint within the depth data.
However, Sun teaches the pose estimation of occluded objects (Sun, pg. 2417, Abstract) and determining visibility (Sun, occlusion of objects, last paragraph on pg. 2428 continued on pg. 2429) based on comparing a depth of the corresponding initial 3D position of a respective object to the corresponding depth value associated with the respective object within the depth data (Sun, pg. 2429: “If there is occlusion, the projected depth pixel value should be significantly different from the measured depth value. Suppose d(p) is the depth value of the projected pixel p and d̂(p) is the depth value of the measured pixel p. If |d(p) − d̂(p)| > h, then p is considered invisible”, Sun compares the depth value based on the 2D image with a measured depth value to determine if pixels are occluded).
Sun discloses a method for determining if an object is visible based on a comparison of depth data. Chen discloses determining if an object is visible, but does not disclose a specific method for making the determination. A person of ordinary skill in the art, before the effective filing date of the claimed invention, would have recognized that the method of determining visibility, taught by Chen, could have been substituted with the respective method of Sun because both serve the purpose of determining if pixels are occluded using image and depth data. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the comparison of depth data of Sun for the visibility method of Chen according to known methods to yield the predictable result of determining whether a keypoint pixel is occluded based on its depth data.
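The depth comparison quoted from Sun, |d(p) − d̂(p)| > h, can be sketched as follows. The function name, the use of meters, and the sample threshold are assumptions for illustration.

```python
def is_occluded(projected_depth, measured_depth, h):
    """Apply the quoted test: occluded when the two depths differ by more than h."""
    return abs(projected_depth - measured_depth) > h

# Hypothetical depths in meters with an assumed threshold h = 0.05.
close_match = is_occluded(1.50, 1.52, h=0.05)
large_gap = is_occluded(1.50, 1.10, h=0.05)
```

A keypoint whose estimated depth closely matches the measured depth is treated as visible, while a large gap suggests that another surface lies in front of it.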
Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Yu (CN Patent No. 112348899 A).
Regarding claim 11 (dependent on claim 1), Zhang in view of Potje teaches wherein the sensor data comprises image data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”), and wherein the method further comprises:
determining, for each respective keypoint of the plurality of keypoints, a corresponding detected image position representing a detection of the respective keypoint within the image data (Zhang, pg. 19, para 96: “the corresponding human joint point obtained in the plane”), wherein determining the corresponding updated 3D position of the respective keypoint comprises (Zhang, pg. 19, para 95-96, the reprojection position loss, described below, is utilized to determine the iteratively fitted updated 3D position points):
selecting the corresponding updated 3D position such that a corresponding pixel difference value of each respective keypoint of the plurality of keypoints (Zhang, reprojection position loss, pg. 19, para 96: “The GM (Geman-McClure) loss of the two-dimensional pixel position and the corresponding human joint point position of the two-dimensional graphic output for identifying the human joint points in the color image is calculated as the above-mentioned reprojection loss function”) does not exceed a threshold value (Zhang, pg. 21, para 101, the loss functions are summed and compared to a threshold), wherein the corresponding pixel difference value represents, for each respective keypoint of the plurality of keypoints (Zhang, pg. 19, para 96: “15 target human joint points”), a difference between (i) the corresponding detected image position of the respective keypoint (Zhang, pg. 19, para 96: “the corresponding human joint point obtained in the plane”) and (ii) a corresponding projected image position of the respective keypoint (Zhang, pg. 19, para 96: “target human joint point after being projected onto a two-dimensional plane (color image plane)”), and wherein the corresponding projected image position represents, for each respective keypoint of the plurality of keypoints, a projection of the corresponding updated 3D position of the respective keypoint onto the image data (Zhang, pg. 19, para 96: “the 15 target human joint points obtained are projected onto the color image plane”).
Zhang in view of Potje fails to teach wherein the threshold value is a threshold pixel difference value. However, Yu teaches the calculation of reprojection error for an estimated three-dimensional point (Yu, pg. 12, para 63: “Therefore, the difference between the calculated value of the imaging point and the true value of the imaging point is the reprojection error”) wherein the difference is compared to a threshold pixel difference value (Yu, pg. 15-16, para 82-86: “constructing an error function based on the reprojection error obtained from the point set…determining whether the error value calculated by the error function is less than a second preset threshold value”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the threshold of Yu with the method of Zhang in order to determine the effect of the projection loss individually and minimize the error when predicting 3D point positions (Yu, pg. 16, para 84: “the error minimization function is to iteratively optimize the calibration parameters to minimize the error, thereby obtaining the optimal calibration parameters”), as opposed to considering all loss values combined, as taught by Zhang.
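Zhang's reprojection term is quoted above as a Geman-McClure (GM) loss on 2D pixel positions. As a minimal sketch, one common form of the GM robust function is shown below; the scale parameter `sigma`, the function name, and the sample residuals are assumptions, and Zhang's exact scaling is not reproduced in this action.

```python
def geman_mcclure(residual, sigma):
    """One common GM form: r^2 / (r^2 + sigma^2), bounded between 0 and 1."""
    r2 = residual * residual
    return r2 / (r2 + sigma * sigma)

# Hypothetical pixel residuals between detected and projected joint positions.
no_error = geman_mcclure(0.0, sigma=1.0)
unit_error = geman_mcclure(1.0, sigma=1.0)
outlier = geman_mcclure(1000.0, sigma=1.0)
```

Because the function saturates for large residuals, a single badly detected joint has limited influence on the summed loss.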
Regarding claim 12 (dependent on claim 1), Zhang in view of Potje teaches wherein the sensor data comprises image data (Zhang, pg. 11, para 59: “Acquire a color image and a depth image”), and wherein the method further comprises:
determining, for each respective keypoint of the plurality of keypoints, a corresponding detected image position representing a detection of the respective keypoint within the image data (Zhang, pg. 19, para 96: “the corresponding human joint point obtained in the plane”), wherein determining the corresponding updated 3D position of the respective keypoint comprises (Zhang, pg. 19, para 95-96, the reprojection position loss, described below, is utilized to determine the iteratively fitted updated 3D position points):
determining, for each respective keypoint of the plurality of keypoints, a candidate 3D position of the respective keypoint based on the loss value (Zhang, human body joint points are candidates before possible iteration, pg. 21, para 101: “If the value of the sum of the loss functions is not within the preset threshold range, the target point cloud of the human body joint points and the corresponding human body segmentation depth area is continuously iterated to obtain a new human body three-dimensional model until the value of the sum of the loss functions is within the preset threshold range”);
determining, for each respective keypoint of the plurality of keypoints, a corresponding projected image position representing a projection of the candidate 3D position of the respective keypoint onto the image data (Zhang, pg. 19, para 96: “target human joint point after being projected onto a two-dimensional plane (color image plane)”);
determining, for each respective keypoint of the plurality of keypoints (Zhang, pg. 19, para 96: “15 target human joint points”), a corresponding pixel difference value (Zhang, reprojection position loss, pg. 19, para 96: “The GM (Geman-McClure) loss of the two-dimensional pixel position and the corresponding human joint point position of the two-dimensional graphic output for identifying the human joint points in the color image is calculated as the above-mentioned reprojection loss function”) based on a difference between (i) the corresponding detected image position of the respective keypoint (Zhang, pg. 19, para 96: “the corresponding human joint point obtained in the plane”) and (ii) the corresponding projected image position of the respective keypoint (Zhang, pg. 19, para 96: “target human joint point after being projected onto a two-dimensional plane (color image plane)”);
when the corresponding pixel difference value of each respective keypoint of the plurality of keypoints does not exceed a threshold value (Zhang, pg. 21, para 101, the loss functions are summed and compared to a threshold), selecting the candidate 3D position as the corresponding updated 3D position (Zhang, position iteration would end, see para 101 citation above); and
when the corresponding pixel difference value of at least one keypoint of the plurality of keypoints exceeds the threshold value, determining, for one or more keypoints of the plurality of keypoints, another candidate 3D position based on the loss value (Zhang, human body joint points are continuously iterated until the loss function threshold is in range, see para 101 citation above).
Zhang in view of Potje fails to teach wherein the threshold value is a threshold pixel difference value. However, Yu teaches the calculation of reprojection error for an estimated three-dimensional point (Yu, pg. 12, para 63: “Therefore, the difference between the calculated value of the imaging point and the true value of the imaging point is the reprojection error”) wherein the difference is compared to a threshold pixel difference value (Yu, pg. 15-16, para 82-86: “constructing an error function based on the reprojection error obtained from the point set…determining whether the error value calculated by the error function is less than a second preset threshold value”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the threshold of Yu with the method of Zhang in order to determine the effect of the projection loss individually and minimize the error when predicting 3D point positions (Yu, pg. 16, para 84: “the error minimization function is to iteratively optimize the calibration parameters to minimize the error, thereby obtaining the optimal calibration parameters”), as opposed to considering all loss values combined, as taught by Zhang.
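As a minimal sketch of the candidate-selection loop recited in claim 12, the following projects each candidate 3D position onto the image with a pinhole model, measures the pixel difference to the detected position, and accepts the first candidate set whose differences all stay within a pixel threshold. The pinhole intrinsics, the pre-computed list of candidate sets standing in for an optimizer, and all names are assumptions for illustration.

```python
import math

def project(xyz, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = xyz
    return (fx * x / z + cx, fy * y / z + cy)

def pixel_differences(candidates, detections, intrinsics):
    """Pixel distance between each projected candidate and its detection."""
    return [
        math.dist(project(p, *intrinsics), uv)
        for p, uv in zip(candidates, detections)
    ]

def select_updated_positions(candidate_sets, detections, intrinsics, max_pixel_diff):
    """Return the first candidate set with every pixel difference within the threshold."""
    for candidates in candidate_sets:
        diffs = pixel_differences(candidates, detections, intrinsics)
        if all(d <= max_pixel_diff for d in diffs):
            return candidates
    return None  # no candidate set satisfied the threshold

# Hypothetical intrinsics (fx, fy, cx, cy), one detected keypoint, two candidate sets.
intrinsics = (500.0, 500.0, 320.0, 240.0)
detections = [(370.0, 240.0)]
candidate_sets = [
    [(0.3, 0.0, 1.0)],  # projects far from the detection, so it is rejected
    [(0.1, 0.0, 1.0)],  # projects onto the detection, so it is accepted
]

selected = select_updated_positions(candidate_sets, detections, intrinsics, 2.0)
```

In practice the successive candidates would come from an iterative fit driven by the loss value rather than from a fixed list.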
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Wang et al. (Wang, C., Wang, Y., Lin, Z., Yuille, A. L., & Gao, W., Robust estimation of 3d human poses from a single image, 2014, In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2361-2368.), hereinafter Wang.
Regarding claim 13 (dependent on claim 1), Zhang in view of Potje fails to teach wherein the plurality of keypoints are interconnected to define a plurality of limbs of the actor, and wherein determining the corresponding updated 3D position of the respective keypoint comprises: selecting the corresponding updated 3D position such that a corresponding limb length of each respective limb of the plurality of limbs is between (i) a maximum limb length corresponding to the respective limb and (ii) a minimum limb length corresponding to the respective limb, wherein the corresponding limb length of each respective limb of the plurality of limbs is determined based on the corresponding updated 3D positions of keypoints that define the respective limb.
However, Wang teaches 3D human pose estimation (Wang, Abstract on pg. 2369) wherein the plurality of keypoints are interconnected to define a plurality of limbs of the actor (Wang, Figure 2 on pg. 2371), and wherein determining the corresponding updated 3D position of the respective keypoint comprises: selecting the corresponding updated 3D position such that a corresponding limb length of each respective limb of the plurality of limbs is between (i) a maximum limb length corresponding to the respective limb and (ii) a minimum limb length corresponding to the respective limb (Wang, pg. 2369 Abstract: “We enforce limb length constraints to eliminate anthropomorphically implausible skeletons”; section 3.1.3 on pg. 2371: “We require that the eight limb lengths of a 3D pose comply with certain proportions”; limb lengths for each respective limb are calculated as a proportion of the right lower leg, see section 3.1.3). The combination of Zhang in further view of Wang teaches wherein the corresponding limb length of each respective limb of the plurality of limbs is determined based on the corresponding updated 3D positions of keypoints that define the respective limb (Zhang teaches calculating the updated 3D positions of keypoints). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the limb constraints of Wang with the method of Zhang in order to determine which 3D human poses are anthropometrically implausible (Wang, Conclusion on pg. 2374: “It is robust to inaccurate 2D pose estimations by using a sparse basis representation, anthropometric constraints…”).
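As a minimal sketch of a limb-length check of the kind recited in claim 13, the following computes each limb length from the 3D positions of its two keypoints and tests it against minimum and maximum bounds. The keypoint names, limb definitions, and bound values are assumptions for illustration; Wang expresses its constraints as proportions rather than the fixed bounds used here.

```python
import math

def limb_lengths(positions, limbs):
    """Length of each limb from the 3D positions of its two end keypoints."""
    return {
        name: math.dist(positions[a], positions[b])
        for name, (a, b) in limbs.items()
    }

def limbs_within_bounds(positions, limbs, bounds):
    """True when every limb length lies between its minimum and maximum."""
    lengths = limb_lengths(positions, limbs)
    return all(bounds[name][0] <= length <= bounds[name][1]
               for name, length in lengths.items())

# Hypothetical limb definitions and (min, max) bounds in meters.
limbs = {"upper_arm": ("shoulder", "elbow"), "forearm": ("elbow", "wrist")}
bounds = {"upper_arm": (0.2, 0.4), "forearm": (0.2, 0.35)}

plausible = {"shoulder": (0.0, 0.0, 0.0), "elbow": (0.3, 0.0, 0.0), "wrist": (0.3, 0.25, 0.0)}
stretched = {"shoulder": (0.0, 0.0, 0.0), "elbow": (0.3, 0.0, 0.0), "wrist": (0.3, 0.9, 0.0)}

plausible_ok = limbs_within_bounds(plausible, limbs, bounds)
stretched_ok = limbs_within_bounds(stretched, limbs, bounds)
```

A pose that fails the check would be discarded or refitted, consistent with the elimination of implausible skeletons quoted from Wang.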
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, Wang, and Zhu et al. (U.S. Patent No. 9,165,199 B2), hereinafter Zhu.
Regarding claim 14 (dependent on claim 1), Zhang in view of Potje teaches wherein determining the corresponding updated 3D position of the respective keypoint comprises:
determining, for each respective keypoint of the plurality of keypoints, a candidate 3D position of the respective keypoint based on the loss value (Zhang, human body joint points are candidates before possible iteration, pg. 21, para 101);
Zhang in view of Potje fails to teach wherein the plurality of keypoints are interconnected to define a plurality of limbs of the actor, and determining, for each respective limb of the plurality of limbs, a corresponding limb length based on the candidate 3D positions of keypoints that define the respective limb; when the corresponding limb length of each respective limb of the plurality of limbs is between (i) a maximum limb length corresponding to the respective limb and (ii) a minimum limb length corresponding to the respective limb, selecting the candidate 3D position as the corresponding updated 3D position; and when the corresponding limb length of at least one limb of the plurality of limbs is not between (i) a maximum limb length corresponding to the at least one limb and (ii) a minimum limb length corresponding to the at least one limb, determining, for one or more keypoints of the plurality of keypoints, another candidate 3D position based on the loss value.
However, Wang teaches 3D human pose estimation (Wang, Abstract on pg. 2369) wherein the plurality of keypoints are interconnected to define a plurality of limbs of the actor (Wang, Figure 2 on pg. 2371), and
determining, for each respective limb of the plurality of limbs, a corresponding limb length based on the candidate 3D positions of keypoints that define the respective limb (Wang, pg. 2369 Abstract: “We enforce limb length constraints to eliminate anthropomorphically implausible skeletons”; section 3.1.3 on pg. 2371: “We require that the eight limb lengths of a 3D pose comply with certain proportions”; min/max limb lengths for each respective limb are calculated as a proportion of the right lower leg, see section 3.1.3);
when the corresponding limb length of each respective limb of the plurality of limbs is between (i) a maximum limb length corresponding to the respective limb and (ii) a minimum limb length corresponding to the respective limb, selecting the candidate 3D position as the corresponding updated 3D position (Wang, limb length constraints are used to estimate the 3D pose, section 3.1. on pg. 2371: “We also enforce L1-norm regularization on the basis coefficients α and eight limb length constraints on the inferred 3D pose”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the limb constraints of Wang with the method of Zhang in order to determine which 3D human poses are anthropometrically implausible (Wang, Conclusion on pg. 2374: “It is robust to inaccurate 2D pose estimations by using a sparse basis representation, anthropometric constraints…”).
While Wang teaches limb length constraints, Zhu teaches when the corresponding limb length of at least one limb of the plurality of limbs is not between limb lengths corresponding to the at least one limb, determining, for one or more keypoints of the plurality of keypoints, another candidate 3D position (Zhu, wrong candidates are eliminated, col 10, ln 55-60: “End-points must have distance values larger than a minimum value which is determined from anthropometric data. The pose estimation system 100 uses this constraint to eliminate spurious artifacts (i.e., wrong limb candidates)”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the elimination of inaccurate 3D position points of Zhu with the method of Zhang in order to avoid 3D position points with inaccurate limbs (Zhu, spurious artifacts and wrong limb candidates are eliminated, see prior citation).
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Kiciroglu et al. (Kiciroglu, S., Wang, W., Salzmann, M., & Fua, P., Long term motion prediction using keyposes, Sept 2022, In 2022 International Conference on 3D Vision (3DV), pp. 12-21, IEEE.), hereinafter Kiciroglu.
Regarding claim 15 (dependent on claim 1), Zhang in view of Potje fails to teach further comprising: determining, for each respective keypoint of the plurality of keypoints, a tracked motion based on the corresponding updated 3D position of the respective keypoint and a preceding 3D position of the respective keypoint; and determining, for each respective keypoint of the plurality of keypoints and based on the tracked motion thereof, a subsequent 3D position of the respective keypoint by propagating the corresponding updated 3D position of the respective keypoint to a subsequent time that corresponds to the subsequent 3D position.
However, Kiciroglu teaches a method for human motion prediction (Kiciroglu, pg. 12, Abstract) and determining a tracked motion based on the corresponding updated 3D position of the respective keypoint and a preceding 3D position of the respective keypoint (Kiciroglu, input motion and past sequences, pg. 13-14, section 3), wherein, for each respective keypoint of the plurality of keypoints (Kiciroglu, “P” includes the total number of joints, see first paragraph in section 3 on pg. 13) and based on the tracked motion thereof (Kiciroglu, pg. 13-14, section 3: “our keypose-to-keypose framework takes as input a motion P_{-M:0} defined by its keyposes K_{-I_1:0}, where I_1 << M is the number of keyposes in the past sequence”), a subsequent 3D position of the respective keypoint is determined (Kiciroglu, pg. 13-14, section 3: “We then predict K_{1:I_2}, where I_2 << N is the number of keyposes in the future sequence”) by propagating the corresponding updated 3D position of the respective keypoint to a subsequent time that corresponds to the subsequent 3D position (Kiciroglu, pg. 13-14, section 3: subsequent keyposes are inferred based on preceding keypoint data, “keyposes encode the important poses in a sequence P_{1:T}, such that the remaining poses can be obtained by linear interpolation between subsequent keyposes”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the subsequent 3D position prediction of Kiciroglu with the method of Zhang in order to infer future human poses (Kiciroglu, applicable for robot interaction, pg. 12, Abstract: “Long term human motion prediction is essential in safety-critical applications such as human-robot interaction and autonomous driving”).
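For illustration only, the two steps recited in claim 15 (deriving a tracked motion from the preceding and updated 3D positions of a keypoint, then propagating the updated position to a subsequent time) can be sketched with a constant-velocity assumption. The constant-velocity model is swapped in purely for simplicity; it is not the keypose-based prediction of Kiciroglu, and the function names and sample values are hypothetical.

```python
def tracked_motion(prev_pos, curr_pos, dt):
    """Per-axis velocity of one keypoint between two observations dt seconds apart."""
    return tuple((c - p) / dt for p, c in zip(prev_pos, curr_pos))

def propagate(curr_pos, velocity, horizon):
    """Extrapolate the current position forward by `horizon` seconds at constant velocity."""
    return tuple(c + v * horizon for c, v in zip(curr_pos, velocity))

# Assumed sample data: one keypoint observed 0.5 s apart (metres).
prev = (0.0, 1.0, 2.0)   # preceding 3D position
curr = (0.5, 1.0, 1.0)   # updated 3D position

vel = tracked_motion(prev, curr, dt=0.5)   # (1.0, 0.0, -2.0) m/s
nxt = propagate(curr, vel, horizon=0.5)    # (1.0, 1.0, 0.0)
```

Applying the same two calls to every keypoint yields a subsequent 3D position for each keypoint of the pose.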
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Potje, and Novel et al. (U.S. Patent Application Publication No. 2020/0371214 A1), hereinafter Novel.
Regarding claim 22 (dependent on claim 1), Zhang in view of Potje fails to teach wherein determining the corresponding reference 3D position comprises: selecting, for each respective keypoint of the plurality of keypoints, from the plurality of 3D points of the depth data, and based on the corresponding plurality of difference values, a plurality of nearest neighbors that are spatially closest to the corresponding initial 3D position; and determining a center of mass of the plurality of nearest neighbors.
However, Novel teaches selecting, for certain keypoints (Novel, centroids of circular cluster, see below) and based on the corresponding plurality of difference values, a plurality of nearest neighbors that are spatially closest to the corresponding initial 3D position (Novel, 50 nearest neighbors, see below); and determining a center of mass of the plurality of nearest neighbors (Novel, para 42: “To compute the centroid for a particular point in the circular cluster, the scanner position process 118 may first determine the particular point's k, e.g., 50, nearest neighbors of the point cloud (e.g., not in the circular cluster). The scanner position process 118 may then determine the average of the x, y, and z, coordinates for the k, e.g., 50, nearest neighbors to compute the centroid for the particular point (local centroid)”; the centroid is the center of mass for circular clusters, Novel, para 8: “The process may determine that the cluster is a circular cluster if, for example, the density value of the centroid is greater than, by some factor (e.g., 11 times), the density value of the point determined to have the maximum density value of the cluster.”).
Zhang in view of Potje teaches performing the disclosed method steps for each keypoint. Novel teaches a method for determining the center of mass in certain areas of point cloud data based on nearest neighbor points and density values. These calculations allow the system to determine information about the objects in relationship to the environment, for example, where objects are located relative to the camera (Novel, para 15). Novel teaches a known technique of determining the center of mass of a 3D data point. A person having ordinary skill in the art, before the effective filing date of the claimed invention, could have applied the known technique, as taught by Novel, in the same way to the method of Zhang and achieved predictable results of obtaining mass data about the plurality of human keypoints to further inform estimations of human pose movement, thereby improving accuracy of the pose detection system.
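For illustration only, the computation described in the quoted passage of Novel, para 42 (find the k nearest neighbors of a point in the point cloud, then average their x, y, and z coordinates to obtain a local centroid) can be sketched as follows. This is a brute-force sketch under assumed inputs, not the scanner position process of Novel; the function names, the small sample cloud, and the choice of k = 3 (rather than the example value of 50) are hypothetical.

```python
import math

def k_nearest(query, points, k):
    """Return the k points of the cloud that are spatially closest to `query`."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]

def centroid(points):
    """Average the x, y, and z coordinates of the given 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

# Assumed sample point cloud (metres) with one distant outlier.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
query = (0.2, 0.2, 0.0)   # stands in for an initial 3D keypoint position

neighbors = k_nearest(query, cloud, k=3)   # the outlier is excluded
center = centroid(neighbors)               # approximately (0.333, 0.333, 0.0)
```

A full sort is used here for readability; for large point clouds a spatial index such as a k-d tree would typically replace it.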
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Choi et al. (Choi, O., Park, M. G., & Hwang, Y. (2020). Iterative K-closest point algorithms for colored point cloud registration. Sensors, 20(18), 5331.) teaches a method designed to minimize a probabilistic cost based on the color-supported soft matching of points in a point cloud to their K-closest points in the other point cloud (abstract).
D’Eusanio et al. (D'Eusanio, A., Pini, S., Borghi, G., Vezzani, R., & Cucchiara, R. (2021, January). Refinet: 3d human pose refinement with depth maps. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 2320-2327). IEEE.) teaches a method to refine body joint points based on nearby points (pg. 2323, section D: “The third module of the framework firstly converts the depth map into a point cloud (using the camera calibration parameters K). Then, it refines the body joints exploiting the 3D information of the point cloud by sampling the points in the neighborhood of each joint location”).
Li et al. (CN Patent No. 110572655 A) teaches a method for adjusting point cloud distances based on nearest neighbor points (para 12: “a point cloud attribute encoding method is provided, comprising: determining the K nearest neighbors of the current point based on the spatial distance between the current point and the points in the point cloud; determining a distance adjustment parameter for calculating the weight of the nearest neighbors; and performing entropy encoding based on the distance adjustment parameter”).
Jung et al. (Jung, H. Y., Suh, Y., Moon, G., & Lee, K. M. (2016, September). A sequential approach to 3d human pose estimation: Separation of localization and identification of body joints. In European Conference on Computer Vision (pp. 747-761). Cham: Springer International Publishing.) teaches a method for determining joint keypoints based on a reference and nearest neighbor points (pg. 754-755; pg. 754, section 5: “the distance between a set of joints…and pose…can be measured by rigidly aligning the exemplar pose to the localized points”; pg. 755, section 5: “This performance implies that the intermediate representation of joint configuration is adequately discriminate to find the reliable nearest neighbor”).
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EMMA E DRYDEN whose telephone number is (571) 272-1179. The examiner can normally be reached M-F 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ANDREW BEE, can be reached at (571) 270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EMMA E DRYDEN/ Examiner, Art Unit 2677
/ANDREW W BEE/ Supervisory Patent Examiner, Art Unit 2677