Prosecution Insights
Last updated: April 19, 2026
Application No. 17/352,022

Systems and Methods to Automatically Determine Human-Object Interactions in Images

Status: Final Rejection (§103)
Filed: Jun 18, 2021
Examiner: PATEL, PINALBEN V
Art Unit: 2673
Tech Center: 2600 — Communications
Assignee: Huawei Technologies Co., Ltd.
OA Round: 4 (Final)
Grant Probability: 89% (Favorable)
Expected OA Rounds: 5-6
Time to Grant: 2y 6m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 89% — above average (484 granted / 545 resolved; +26.8% vs TC avg)
Interview Lift: +9.9% (moderate, roughly +10%), comparing resolved cases with an interview vs. without
Typical Timeline: 2y 6m average prosecution; 23 applications currently pending
Career History: 568 total applications across all art units
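
As a sanity check on these headline figures, the sketch below recomputes them, assuming (our assumption, not the tool's documented methodology) that the 99% "with interview" number is simply the career allow rate plus the reported interview lift:

```python
# Back-of-envelope check of the Examiner Intelligence figures above.
# The additive-lift assumption is ours; the tool's exact formula is not published.
granted, resolved = 484, 545

career_allow_rate = granted / resolved   # 0.888... -> displayed as 89%
interview_lift = 0.099                   # reported +9.9% lift with an interview

with_interview = career_allow_rate + interview_lift
print(f"career allow rate: {career_allow_rate:.1%}")  # 88.8%
print(f"with interview:    {with_interview:.1%}")     # 98.7%, displayed as 99%
```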

Statute-Specific Performance

§101: 9.1% (-30.9% vs TC avg)
§103: 59.9% (+19.9% vs TC avg)
§102: 5.9% (-34.1% vs TC avg)
§112: 14.9% (-25.1% vs TC avg)
Deltas are measured against the Tech Center average estimate (the black line in the original chart) • Based on career data from 545 resolved cases
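
One way to read the deltas: each is the examiner's statute-specific rate minus the Tech Center average estimate. Working backwards from the four rows, every delta implies the same 40.0% baseline, which is presumably where the chart's average line was drawn (a minimal check; the variable names are ours):

```python
# Recover the implied TC-average baseline for each statute: rate - delta.
# All four rows yield 40.0%, consistent with a single TC-average estimate.
rates  = {"§101": 9.1, "§103": 59.9, "§102": 5.9, "§112": 14.9}    # percent
deltas = {"§101": -30.9, "§103": 19.9, "§102": -34.1, "§112": -25.1}

for statute, rate in rates.items():
    print(f"{statute}: implied TC avg = {rate - deltas[statute]:.1f}%")
```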

Office Action

§103
DETAILED ACTION

With regard to the amendment filed 12/10/2025, Claims 1-20 are pending. Claim 1 is amended.

Response to Arguments

Applicant’s arguments with respect to claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

Claims 15-20 are interpreted to invoke 35 U.S.C. 112(f). The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “a pose detector”, “a human feature extractor”, “an object feature extractor”, “a relation network”, and “an HOI contextual and reasoning module” in claim 15. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. The instant specification in paragraphs [0124]-[0126] discloses computer hardware processor(s) and equivalents thereof.

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (US Pub No. 20190180090 A1, as provided) in view of Chakravarty et al. (US Pub No. 20200097724 A1).
Regarding Claim 1, Jiang discloses A method for detecting human-object interaction (HOI) in an image, the method comprising: receiving the image; (Jiang, [0005-0007], [0023], discloses provide methods and structures that enable image and/or video processing as may be performed using a computing device, a communication device, a method, and other platforms that includes targeted spatial-temporal joint human and object feature learning for each individual HOIA as a context and that also includes context (HOIA) dependent human tracking for contextual spatial-temporal human feature extraction; image is received) detecting at least one human in the image; (Jiang, [0025], discloses joint human (body) tracking and object tracking are performed in accordance with joint HOIA learning. For example, activity detection and recognition of one or more humans and one or more objects is performed in accordance with joint monitoring. Specific individuals and specific activities may be identified and logged over time; human is detected in image) detecting at least one object in the image; (Jiang, [0025], discloses joint human (body) tracking and object tracking are performed in accordance with joint HOIA learning. For example, activity detection and recognition of one or more humans and one or more objects is performed in accordance with joint monitoring. Specific individuals and specific activities may be identified and logged over time; object is detected in image) creating one or more proposals, each of the one or more proposals comprising a human of the at least one human and an object of the at least one object; (Jiang, [0005], [0024], [0027-0028], Fig. 1, discloses a computing device that includes a communication interface configured to interface and communicate with a communication network, memory that stores operational instructions, and processing circuitry coupled to the communication interface and to the memory. The processing circuitry is configured to execute the operational instructions to perform various functions, operations, processes, etc. The computing device is configured to process a video frame of a video segment on a per-frame basis and based on joint human-object interactive activity (HOIA) to generate a per-frame pairwise human-object interactive (HOI) feature based on a plurality of candidate HOI pairs. The computing device is also configured to process the per-frame pairwise HOI feature to identify a valid HOI pair among the plurality of candidate HOI pairs and track the valid HOI pair through subsequent frames of the video segment to generate a contextual spatial-temporal feature for the valid HOI pair to be used in activity detection; an embodiment 100 of a joint human-object interactive activity (HOIA) learning as may be performed by a computing device, a communication device, and/or method. A novel approach is presented herein by which human body tracking is associated with object tracking in accordance with joint human-object interactive activity (HOIA) learning and processing. Various aspects, embodiments, and/or examples, and their equivalents, presented herein operate by performing joint processing of one or more humans and one or more objects. Such processing is based on HOIA detection based on joint human and object detection and tracking. 
Such operations include targeted spatial-temporal joint human and object feature learning for each individual HOIA as a context and also include context (HOIA) dependent human tracking for contextual spatial-temporal human feature extraction. Such operations also include context (HOIA) dependent human object interaction (HOI) tracking for contextual spatial-temporal HOI feature extraction. This provides for effective HOIA detection by considering both spatial-temporal human feature and spatial-temporal HOI feature through implicit HOI relationship modeling. Also, one or more pre-trained human and object detectors may be used to transfer knowledge learned from large scale human and object detection data to the current activity detection task; HOI is tracked and human object interaction is determined using feature extractions of human and object) and determining whether an HOI exists in each of the one or more proposals. (Jiang, [0027-0028], discloses computing device is also configured to process the per-frame pairwise HOI feature to identify a valid HOI pair among the plurality of candidate HOI pairs. In some examples, a per-frame pairwise HOI feature includes 3-dimensional (3-D) data associated with a person and an object within a frame (e.g., the 3-D data associated with location of the person and the 3-D data associated with location of the object). In some examples, this information is provided as output from processing of the frame such as in accordance with that as performed by a neural network appropriately tailored to do so. Such 3-D related information may be associated a feature map as also described herein. Also, in some particular examples, a per-frame pairwise HOI feature is associated with a particular person per-frame pairwise HOI pair (e.g., a pair that includes a particular person and a particular object); the computing device is also configured to track the valid HOI pair through subsequent frames of the video segment to generate a contextual spatial-temporal feature for the valid HOI pair to be used in activity detection. For example, with respect to a person and an object associated with the valid HOI pair, the corresponding per-frame pairwise HOI feature is tracked through multiple frames to generate the contextual spatial-temporal feature. As an example, the 3-D data associated with location of the person and the 3-D data associated with location of the object are tracked, as being a valid HOI pair, through multiple frames to generate the contextual spatial-temporal feature. An example of a contextual spatial-temporal feature includes the data (e.g., the 3-D data) associated with person and an object based on performing activity through multiple frames. Additional examples, embodiments, and details are described below; HOI (Human and object interaction) pair and interactions is determined). Jiang does not explicitly disclose creating one or more proposals comprising a human of the at least one human, an object of the at least one object, after detecting the at least one human and the at least one object, the human pose feature information being used to classify the one or more proposals based at least in part on the extracted human pose feature information. 
Chakravarty discloses creating one or more proposals comprising a human of the at least one human, an object of the at least one object, after detecting the at least one human and the at least one object, the human pose feature information being used to classify the one or more proposals based at least in part on the extracted human pose feature information. (Chakravarty, [0040-0041], [0089], The object-identification system further recognizes each object on the support surface or involved in a handling activity. Object recognition serves to identify the type of object detected and tracked (e.g., a package from a certain carrier, a jar of pickles, a microscope). Such object recognition may involve human interaction to initially identify or to confirm, correct, or fine tune the recognition of a given object. The object-identification system employs machine-learning techniques to improve its object recognition capabilities. Recognition of a given object can facilitate the tracking of the object while the object is in the holding area, serving to confirm the presence or movement of the object; upon occasion, the sensor module 102 will capture an image for which object recognition falls below a threshold, namely, the object-identification system is unable to recognize an object in the image. Despite being unable to recognize the object (at least initially), the object-identification system can still track the object, namely, its initial placement and any subsequent location within the holding area, based on visual characteristics of the object. The unidentifiable image is retained for purposes of later retraining of the DNN 112 so that the DNN will become able to recognize a previously unrecognizable object when that object is present in subsequently processed images. Human interaction with the object-identification system, through voice recognition, gesture recognition, or keyboard input, can specifically identify an object in an unidentifiable image, giving the image a proper label. An example of gesture recognition is a person holding up three fingers to identify the object as type number 3, where the object-identification system has stored the association of a three-finger gesture with a specific object (e.g., three fingers correspond to a microscope). After an object in the previously unidentifiable image becomes recognized, with the help of the human input, the image and associated proper label are stored in an image database 122. The object-identification system 100 uses these stored images and labels to retrain the deep neural network 112. By retraining the deep neural network with previously unidentifiable images, now made identifiable by human-provided information, the neural network 112 increasingly grows “smarter”. Over time, the probability of the neural network recognizing objects in later captured images approaches one hundred percent; features of human and object interaction (HOI) are extracted, and the gesture (pose) is determined from the extracted HOI pose features, such as a person holding up three fingers (which is also the existence of an HOI) to identify the object as type number 3, wherein the three-finger gesture is associated with a specific object, a microscope, and is identified as an HOI; the gesture is the pose of fingers holding an object in a specific manner, and classifying the pose would be recognizing the gesture of holding). Jiang discloses the claimed invention except for determining the HOI by detecting the human and the object existing and interacting in an image.
Chakravarty teaches that it is known to use the pose information of HOI to further classify the type of HOI. It would have been obvious to one having ordinary skill in the art at the time the invention was made to use the modification in Jiang that detects human and object interaction (HOI) existence, as taught by Chakravarty, in order to improve the accuracy of HOI detection and classification for applications including the gaming industry.

Regarding Claim 3, The combination of Jiang and Chakravarty further discloses wherein the determining comprises: extracting human feature information from the one or more proposals; extracting object appearance feature information from the one or more proposals; and determining whether an HOI exists between the human and the object based at least in part on the human feature information and the object feature information. (Jiang, [0005-0007], [0024], [0027-0028], discloses computing device is also configured to process the per-frame pairwise HOI feature to identify a valid HOI pair among the plurality of candidate HOI pairs. In some examples, a per-frame pairwise HOI feature includes 3-dimensional (3-D) data associated with a person and an object within a frame (e.g., the 3-D data associated with location of the person and the 3-D data associated with location of the object). In some examples, this information is provided as output from processing of the frame such as in accordance with that as performed by a neural network appropriately tailored to do so. Such 3-D related information may be associated a feature map as also described herein. Also, in some particular examples, a per-frame pairwise HOI feature is associated with a particular person per-frame pairwise HOI pair (e.g., a pair that includes a particular person and a particular object); the computing device is also configured to track the valid HOI pair through subsequent frames of the video segment to generate a contextual spatial-temporal feature for the valid HOI pair to be used in activity detection. For example, with respect to a person and an object associated with the valid HOI pair, the corresponding per-frame pairwise HOI feature is tracked through multiple frames to generate the contextual spatial-temporal feature. As an example, the 3-D data associated with location of the person and the 3-D data associated with location of the object are tracked, as being a valid HOI pair, through multiple frames to generate the contextual spatial-temporal feature. An example of a contextual spatial-temporal feature includes the data (e.g., the 3-D data) associated with person and an object based on performing activity through multiple frames. Additional examples, embodiments, and details are described below; HOI (Human and object interaction) pair and interactions are determined).

Regarding Claim 4, The combination of Jiang and Chakravarty further discloses extracting spatial feature information from the one or more proposals, the spatial feature information related to the human and the object; wherein the HOI determination is further based on the spatial feature information. (Jiang, [0005], [0023-0024], [0028], discloses a computing device that includes a communication interface configured to interface and communicate with a communication network, memory that stores operational instructions, and processing circuitry coupled to the communication interface and to the memory. The processing circuitry is configured to execute the operational instructions to perform various functions, operations, processes, etc.
The computing device is configured to process a video frame of a video segment on a per-frame basis and based on joint human-object interactive activity (HOIA) to generate a per-frame pairwise human-object interactive (HOI) feature based on a plurality of candidate HOI pairs. The computing device is also configured to process the per-frame pairwise HOI feature to identify a valid HOI pair among the plurality of candidate HOI pairs and track the valid HOI pair through subsequent frames of the video segment to generate a contextual spatial-temporal feature for the valid HOI pair to be used in activity detection; methods and structures that enable image and/or video processing as may be performed using a computing device, a communication device, a method, and other platforms that includes targeted spatial-temporal joint human and object feature learning for each individual HOIA as a context and that also includes context (HOIA) dependent human tracking for contextual spatial-temporal human feature extraction. Such image and/or video processing also includes context (HOIA) dependent human object interaction (HOI) tracking for contextual spatial-temporal HOI feature extraction. This provides for effective HOIA detection by considering both spatial-temporal human feature and spatial-temporal HOI feature through implicit HOI relationship modeling. Also, one or more pre-trained human and object detectors may be used to transfer knowledge learned from large scale human and object detection data to the current activity detection task; spatio-temporal features are tracked to determine HOI pair).

Claims 2 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. as modified by Chakravarty et al. and further in view of Nawhal et al. (US Pub No. 20200302231 A1, as provided).

Regarding Claim 2, The combination of Jiang and Chakravarty does not explicitly disclose generating a mask for each proposal of the one or more proposals in which an HOI is determined to exist, the mask generated based on the determined HOI. Nawhal discloses generating a mask for each proposal of the one or more proposals in which an HOI is determined to exist, the mask generated based on the determined HOI. (Nawhal, [0219-0222], discloses mask contains ones in regions (either rectangular bounding boxes or segmentation masks) corresponding to the objects (non-person classes) detected using MaskRCNN and zeros for other regions. Intuitively, this helps ensure the generator learns to map the action and object embeddings to relevant visual content in the HOI video; to evaluate the generator's capability to synthesize the right human-object interactions, Applicants provide a background frame as described above. This background frame can be selected from either the test set or training set, and can be suitable or unsuitable for the target action-object composition. To capture these possibilities, we design two different generation scenarios; the input context frame I is the masked first frame of a video from the test set corresponding to the target action-object composition; I is the masked first frame of a video from the training set which depicts an object other than the target object. The original action in this video could be same or different than the target action. Refer to Table 1 to see the contrast between the two scenarios; mask is generated to distinguish human object interactions (HOI) from background information).
The combination of Jiang and Chakravarty discloses the claimed invention except for the mask to highlight the regions of interest of the human and object interaction. Nawhal teaches that it is known to mask the human and object interactions recognized in an image. It would have been obvious to one having ordinary skill in the art at the time the invention was made to use the masking of recognized HOI in images as taught by Nawhal in order to improve the visual representation of the HOI in the recognized image.

Regarding Claim 5, The combination of Jiang, Chakravarty and Nawhal further discloses editing the image based on the generated mask. (Nawhal, [0219-0222], discloses mask contains ones in regions (either rectangular bounding boxes or segmentation masks) corresponding to the objects (non-person classes) detected using MaskRCNN and zeros for other regions. Intuitively, this helps ensure the generator learns to map the action and object embeddings to relevant visual content in the HOI video; to evaluate the generator's capability to synthesize the right human-object interactions, Applicants provide a background frame as described above. This background frame can be selected from either the test set or training set, and can be suitable or unsuitable for the target action-object composition. To capture these possibilities, we design two different generation scenarios; the input context frame I is the masked first frame of a video from the test set corresponding to the target action-object composition; I is the masked first frame of a video from the training set which depicts an object other than the target object. The original action in this video could be same or different than the target action. Refer to Table 1 to see the contrast between the two scenarios; mask is generated to distinguish human object interactions (HOI) from background information).

Allowable Subject Matter

Claims 6-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PINALBEN V PATEL whose telephone number is (571)270-5872. The examiner can normally be reached M-F: 10am - 8pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chineyere Wills-Burns, can be reached on (571)272-9752. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Pinalben Patel/
Examiner, Art Unit 2671
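
To make the claim mappings above easier to follow, here is a minimal Python sketch of the pipeline as recited in claims 1, 2 and 5: detect humans and objects, pair them into proposals, classify each proposal using extracted pose features, and build a binary mask over the interacting regions. Every function, class and parameter name below is a hypothetical stand-in; this is not the applicant's, Jiang's, Chakravarty's, or Nawhal's actual implementation.

```python
# Hypothetical sketch of the claimed HOI flow; detector and classifier models
# are supplied by the caller, so no specific library API is assumed.
from dataclasses import dataclass
from itertools import product
import numpy as np

@dataclass
class Detection:
    box: tuple   # (x1, y1, x2, y2) in pixels
    label: str

def detect_hoi(image, detect_humans, detect_objects,
               extract_pose, classify_interaction, threshold=0.5):
    """Claim-1-style flow: detect humans and objects, form one proposal per
    human-object pair, then decide whether an HOI exists in each proposal,
    using extracted human pose features in the classification."""
    humans = detect_humans(image)      # detect at least one human
    objects = detect_objects(image)    # detect at least one object
    interactions = []
    for human, obj in product(humans, objects):   # one proposal per pair
        pose_features = extract_pose(image, human.box)
        verb, score = classify_interaction(image, human, obj, pose_features)
        if score >= threshold:         # does an HOI exist in this proposal?
            interactions.append((human, obj, verb, score))
    return interactions

def hoi_mask(image_shape, human, obj):
    """Claim-2/5-style binary mask: ones inside the interacting human's and
    object's boxes, zeros elsewhere (per-pixel segmentation masks, as in the
    Nawhal citation, would be analogous); the mask can then drive editing."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    for det in (human, obj):
        x1, y1, x2, y2 = (int(v) for v in det.box)
        mask[y1:y2, x1:x2] = 1
    return mask

# Toy usage with stub models standing in for real detectors/classifiers:
img = np.zeros((480, 640, 3))
found = detect_hoi(
    img,
    detect_humans=lambda im: [Detection((10, 10, 100, 200), "person")],
    detect_objects=lambda im: [Detection((90, 120, 160, 180), "cup")],
    extract_pose=lambda im, box: np.zeros(17 * 2),  # e.g. 17 keypoints (x, y)
    classify_interaction=lambda im, h, o, pose: ("hold", 0.9),
)
print(found and hoi_mask(img.shape, found[0][0], found[0][1]).sum())
```

Note that the pairwise proposal step scales with (number of humans) × (number of objects), which is consistent with Jiang's step of pruning the candidate HOI pairs down to "valid" pairs before tracking them across frames.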

Prosecution Timeline

Jun 18, 2021
Application Filed
Jan 24, 2024
Non-Final Rejection — §103
Apr 29, 2024
Response Filed
Aug 08, 2024
Response after Non-Final Action
May 01, 2025
Final Rejection — §103
Jul 04, 2025
Response after Non-Final Action
Aug 26, 2025
Request for Continued Examination
Aug 27, 2025
Response after Non-Final Action
Sep 15, 2025
Non-Final Rejection — §103
Dec 10, 2025
Response Filed
Mar 24, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602824
SUBSTRATE TREATING APPARATUS AND SUBSTRATE TREATING METHOD
Granted Apr 14, 2026 · 2y 5m to grant

Patent 12596437
Monitoring System and Method Having Gesture Detection
Granted Apr 07, 2026 · 2y 5m to grant

Patent 12597235
INFORMATION PROCESSING APPARATUS, LEARNING METHOD, RECOGNITION METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM
Granted Apr 07, 2026 · 2y 5m to grant

Patent 12586215
VEHICLE POSE
Granted Mar 24, 2026 · 2y 5m to grant

Patent 12586217
VISION SENSOR, OPERATING METHOD OF VISION SENSOR, AND IMAGE PROCESSING DEVICE INCLUDING THE VISION SENSOR
Granted Mar 24, 2026 · 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 89%
With Interview: 99% (+9.9%)
Median Time to Grant: 2y 6m
PTA Risk: High

Based on 545 resolved cases by this examiner. Grant probability derived from career allow rate.
