Prosecution Insights
Last updated: April 19, 2026
Application No. 18/934,771

VIDEO CONTENT PROCESSING BASED ON FACIAL RECOGNITION AND POSE TRACKING MODELING

Current Status: Final Rejection (§103)
Filed: Nov 01, 2024
Examiner: YANG, NIENRU
Art Unit: 2484
Tech Center: 2400 (Computer Networks)
Assignee: Shure Acquisition Holdings Inc.
OA Round: 2 (Final)
Grant Probability: 72% (Favorable)
Expected OA Rounds: 3-4
Projected Time to Grant: 2y 9m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 72% (287 granted / 399 resolved), +13.9% vs Tech Center average (above average)
Interview Lift: +28.7% higher allowance for resolved cases with an interview vs. without (strong)
Typical Timeline: 2y 9m average prosecution; 30 applications currently pending
Career History: 429 total applications across all art units
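The headline figures in this card are simple ratios of the displayed counts. Below is a minimal sketch of that arithmetic, assuming the dashboard's apparent definitions (allow rate = granted / resolved; pending = total applications - resolved); the variable names are illustrative, and the implied Tech Center average is back-calculated from the displayed "+13.9%" delta rather than taken from any source.

```python
# Arithmetic behind the "Examiner Intelligence" card (assumed definitions,
# using only the values displayed above).
granted = 287             # applications granted by this examiner
resolved = 399            # resolved cases (grants plus abandonments)
total_applications = 429  # career total across all art units

allow_rate = granted / resolved                # 0.719 -> displayed as 72%
pending = total_applications - resolved        # 429 - 399 = 30 currently pending
implied_tc_average = allow_rate - 0.139        # back-calculated from "+13.9% vs TC avg"

print(f"Career allow rate: {allow_rate:.1%}")           # Career allow rate: 71.9%
print(f"Currently pending: {pending}")                  # Currently pending: 30
print(f"Implied TC average: {implied_tc_average:.1%}")  # Implied TC average: 58.0%
```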

Statute-Specific Performance

§101: 5.6% (-34.4% vs TC avg)
§103: 73.6% (+33.6% vs TC avg)
§102: 6.5% (-33.5% vs TC avg)
§112: 7.8% (-32.2% vs TC avg)
Tech Center average is an estimate. Based on career data from 399 resolved cases.
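The "vs TC avg" deltas appear to be plain differences between the examiner's per-statute rate and the Tech Center average estimate. A minimal sketch under that assumption follows; the baseline implied by every displayed delta works out to 40.0%, which is back-calculated here, not independently sourced.

```python
# Per-statute rates and deltas as displayed, assuming delta = examiner rate - TC average.
examiner_rates = {"§101": 5.6, "§103": 73.6, "§102": 6.5, "§112": 7.8}  # percent
deltas = {"§101": -34.4, "§103": 33.6, "§102": -33.5, "§112": -32.2}    # percentage points vs TC avg

for statute, rate in examiner_rates.items():
    implied_tc_average = rate - deltas[statute]  # works out to 40.0 for each statute shown
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_average:.1f}%")
```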

Office Action

Rejection basis: §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Preliminary Remarks

This is a reply to the amendment filed on 02/02/2026, in which claims 1, 8, 13, and 20 are amended. Claims 1-20 remain pending in the present application, with claims 1, 13, and 20 being independent claims. When making claim amendments, the applicant is encouraged to consider the references in their entireties, including those portions that have not been cited by the examiner and their equivalents, as they may most broadly and appropriately apply to any particular anticipated claim amendments.

Response to Arguments

Regarding the 35 U.S.C. §101 rejection of claim 20, Applicants have amended the claims to add the limitation "non-transitory" to the claim, rendering the rejection moot. Therefore, the outstanding 35 U.S.C. §101 rejection of claim 20 is withdrawn. Applicant's arguments filed on 02/02/2026 with respect to amended claims 1, 13, and 20 have been considered but are moot in view of the new ground(s) of rejection.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 6-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ng et al. (US 20200320278 A1, hereinafter referred to as “Ng”) in view of Bhat et al. (US 11582519 B1, hereinafter referred to as “Bhat”), and further in view of Sharma et al. (US 20230195224 A1, hereinafter referred to as “Sharma”).
Regarding claim 1, Ng discloses an apparatus comprising at least one processor (see Ng, paragraph [0019]: “one or more processors”) and a memory storing instructions that are operable, when executed by the at least one processor, to cause the apparatus to (see Ng, paragraph [0019]: “The memory stores instructions that, when executed by the one or more processors, cause the system to”): receive video data captured by at least one video capture device located within a video environment (see Ng, paragraph [0039]: “face-detection-and-tracking subsystem 112 is configured to receive the captured video images, such as captured high-resolution video images via bus 102, perform CNN-based face-detection and face-pose-estimation operations on the received video images using joint face-detection and face-pose-estimation module 116 to detect faces within each video image and generate face-pose-estimations for each detected face”); extract an image feature set from the video data (see Ng, paragraph [0085]: “a feature extraction operation is performed on that best pose face to extract a predetermined image feature from the face image”); input the facial feature set to a pose tracking model to generate a pose tracking feature set for the facial identifier (see Ng, paragraph [0047]: “When a person is moving in a video, the person's head/face can have different orientations, i.e., different poses in different video images. Estimating the pose of each detected face allows for keeping track of the pose change of each face through the sequence of video frames”); and augment the facial feature set with the pose tracking feature set to generate an augmented feature set for the facial identifier (see Ng, paragraph [0050]: “In face-detection-and-tracking subsystem 200, face-pose-estimation module 208 is followed by best-pose-selection module 210 configured to determine and update the “best pose” for each tracked person from a sequence of pose-estimations associated with a sequence of detected faces of the tracked person in a sequence of video frames. In some embodiments, the best pose is defined as a face pose closest to the frontal view (i.e., with the smallest overall head rotations). As can be seen in FIG. 2, best-pose-selection module 210 can be coupled to face tracking module 212 to receive face tracking information. Hence, best-pose-selection module 210 can keep track of each tracked person as the pose of this person is continuously estimated at face pose estimation module 208 and the best pose of this person is continuously updated at best-pose-selection module”).

Regarding claim 1, Ng discloses all the claimed limitations with the exception of: input the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment; generate location information for the facial identifier based at least in part on the augmented feature set; and output the location information for the facial identifier.

Bhat, from the same or similar fields of endeavor, discloses input the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment (Bhat, Column 28, line 63 to Column 29, line 2: “the video synthesis system 720 may also receive the source frame A 702, and identify and/or localize the source person within the source frame A 702 via the source person identifier 728. For example, the source person identifier 728 may execute any suitable face recognition algorithm and determine an identity of the source person”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Bhat with the teachings as in Ng. The motivation for doing so would be to ensure that the system has the ability to use the video synthesis system disclosed in Bhat to receive the source frame and to identify and/or localize the source person within the source frame via the source person identifier, wherein the source person identifier may execute any suitable face recognition algorithm and determine an identity of the source person, thus inputting the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment, in order to utilize one or more facial recognition techniques with respect to the video data received from the one or more video capture devices to identify one or more faces in the video data so that a particular target of interest, such as a specific person, can be identified.

Regarding claim 1, the combination teachings of Ng and Bhat disclose all the claimed limitations with the exception of: generate location information for the facial identifier based at least in part on the augmented feature set; and output the location information for the facial identifier.

Sharma, from the same or similar fields of endeavor, discloses generate location information for the facial identifier based at least in part on the augmented feature set (see Sharma, paragraphs [0062]-[0064]: “Augment operation 608 augments data based on the image grid … the augment operation 608 may include a positional augmentation based on image mirroring … Generate operation 610 generates a multi-dimensional feature vector based on the image grid … The generate operation 612 may use the facial image associated with the image grid as received by the receive operation 604. The generate operation 612 generates head pose information based on facial landmarks in the facial image. In aspects, the generation 612 may identify the facial landmarks by extracting two-dimensional coordinates of points on the face (e.g., the corners of eyes, the tip of the nose, corners of the mouse, a tip of the chin, and the like) and three-dimensional locations of these points”); and output the location information for the facial identifier (see Sharma, paragraphs [0062]-[0066]: “an output from the first fully connected neural network may be a vector with 128 dimensions. The second vector may be 64 dimensions. Similarly, the generate operation 614 may use a third fully connected neural network to determine a vector with two-dimensions as an eye-gaze location in the X-Y coordinates. In aspects, the generate operation 614 may proceed to the transmit operation 518 to transmit the eye-gaze location to one or more controllers in the system”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Sharma with the teachings as in Ng and Bhat.
The motivation for doing so would be to ensure that the system has the ability to use the computer-implemented method for predicting a location of an eye gaze of an operator disclosed in Sharma to augment data based on the image grid; to use the facial image associated with the image grid to generate a multi-dimensional feature vector based on the image grid; to identify the facial landmarks by extracting two-dimensional coordinates of points on the face and three-dimensional locations of these points; to determine a vector with two dimensions as an eye-gaze location in the X-Y coordinates; and to output the eye-gaze location to one or more controllers in the system, thus generating location information for the facial identifier based at least in part on the augmented feature set and outputting the location information for the facial identifier, in order to identify the location information of the target of interest so that it is possible to enable tracking across multiple video capture devices.

Regarding claim 2, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: modify video framing of the at least one video capture device based at least in part on the location information (see Ng, paragraphs [0047]-[0048]: “The outputs from face-pose-estimation module 208 can be used by best-pose-selection module 210 to update the best pose for each tracked person as that person moves through the sequence of video frames…In one technique, face pose is estimated based on the locations of some facial landmarks, such as eyes, nose, and mouth, e.g., by computing distances of these facial landmarks from the frontal view”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 3, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: generate input data for a machine learning model based at least in part on the location information (see Ng, paragraph [0076]: “face detection module 706 can be implemented with a DL-based MTCNN architecture ... the third stage of the MTCNN uses a more powerful CNN to further decide whether each input window is a face or not. If it is determined to be so, the locations of five facial landmarks are also estimated. The MTCNN architecture is generally more suitable for implementations on resource-limited embedded vision systems compared to the cascaded CNN framework. Other than using the MTCNN architecture, face detection module 706 can also be implemented with other known or later developed CNN-based face-detection architectures and techniques without departing from the scope of the described technology. Face detection module 706 generates a set of detected faces 716 and the corresponding bounding box locations”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 6, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: select a video capture device in the video environment for outputting a video stream associated with the facial identifier based at least in part on the location information (see Ng, paragraph [0043]: “a given video image of captured video 202 is first received by motion detection module 204. In some embodiments, it is assumed that a human face captured in video image 202 is associated with a motion, which begins when a person first enters the field of view of the camera and ends when that same person exits the field of view of the camera or being obscured by another person or an object. Hence, to reduce the computational complexity of face-detection-and-tracking subsystem 200, motion detection module 204 can be used to preprocess each video frame to locate and identify those areas within each video frame which are associated with motions”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 7, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: generate a three-dimensional (3D) model of the video environment based at least in part on the location information (Bhat, Column 15, lines 44-60: “FIG. 3 illustrates an example technique for generating facial and/or body parameters of a three-dimensional model of a particular person, in accordance with various embodiments. In diagram 300 of FIG. 3, a person 302 is depicted. The person 302 may correspond to a localized portion of the person 302 captured within a particular frame (e.g., a first frame, such a frame 106 of FIG. 1) of a particular shot (e.g., shot 104) of a sequence of shots of a video file (e.g., video file 102). As described further herein, a video synthesis system (e.g., video synthesis system 310, which may be similar to any video synthesis system described herein) may generate a data structure that corresponds to (e.g., defines) the pose 304 for the body of the person 302. Then, based at least in part on the pose 304, the video synthesis system 310 may generate a 3D model 306 (e.g., a 3DMM) for the body of the person 302 that incorporates both the pose and shape of the body of the person 302”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 8, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: input the augmented feature set to a facial similarity model to determine an accuracy metric score for the augmented facial feature set (see Ng, paragraph [0086]: “if all computed similarity values between the newly extracted feature and the stored features are below a predetermined threshold, the best pose face is considered to be associated with a unique face, which is then transmitted to the server and the associated extracted feature is stored into the feature buffer or storage”). The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 9, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: compare the augmented feature set to a predetermined representation of the target of interest via normalized correlation matching to generate a similarity score for the augmented feature set (see Ng, paragraph [0064]: “compare the search block against the image patch at a given search location within the search window by computing a similarity score between the search block and the compared image patch”); and output the location information based at least in part on the similarity score (see Ng, paragraph [0063]: “process 600 identifies the same detected face in the unprocessed video frame at a search location where the best match between the search block and the corresponding image patch is found”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 10, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: input the augmented feature set to a Kalman filter model to provide a movement prediction in the video environment for the target of interest (see Ng, paragraph [0065]: “the prediction of the movement can include either a linear prediction or a non-linear prediction. In linear prediction, the trajectory and speed of the movement can be predicted. In non-linear prediction, a Kalman filter approach can be applied”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 11, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: update a list of tracked faces for respective video frames in the video data based at least in part on the location information (see Ng, paragraph [0062]: “face tracking module 212 is configured to determine the location of each tracked face within an unprocessed video frame (e.g., Frame 2) immediately following a processed frame 504 (e.g., Frame 1) based on the determined location of the tracked face in the processed frame (e.g., Frame 1)”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 12, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the facial recognition model includes a multi-task cascaded convolutional neural network (MTCNN) configured for facial recognition and a transfer learning model configured for facial recognition (see Ng, paragraph [0016]: “the face-detection statistical model includes a CNN face-detection module, and the CNN face-detection module further includes a multitask-cascaded-CNN (MTCNN)”). The motivation for combining the references has been discussed in claim 1 above.

Claim 13 is rejected for the same reasons as discussed in claim 1 above. Claim 14 is rejected for the same reasons as discussed in claim 2 above. Claim 15 is rejected for the same reasons as discussed in claim 3 above. Claim 18 is rejected for the same reasons as discussed in claim 6 above. Claim 19 is rejected for the same reasons as discussed in claim 7 above. Claim 20 is rejected for the same reasons as discussed in claim 1 above.
In addition, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose a computer program product, stored on a non-transitory computer readable storage medium (see Ng, paragraph [0100]: “the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or non-transitory processor-readable storage medium”).

Claims 4-5 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Ng, Bhat, and Sharma as applied to claim 1, and further in view of Kalinli (US 20120259638 A1, hereinafter referred to as “Kalinli”).

Regarding claim 4, the combination teachings of Ng, Bhat, and Sharma as discussed above disclose all the claimed limitations with the exception of the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: steer a microphone array beam for an audio capture device in the video environment based at least in part on the location information.

Kalinli, from the same or similar fields of endeavor, discloses the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: steer a microphone array beam for an audio capture device in the video environment based at least in part on the location information (see Kalinli, paragraph [0023]: “The microphone array can be used to steer and extract the sound only coming from the sound source located by the camera in the field of view”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Kalinli with the teachings as in Ng, Bhat, and Sharma. The motivation for doing so would be to ensure that the system has the ability to use the system and method for determining relevance of input speech disclosed in Kalinli to use the microphone array to steer and extract the sound only coming from the sound source located by the camera in the field of view, and to use a source separation algorithm with a priori information of the relevant user's location to extract relevant speech from the input to the microphone array, thus steering a microphone array beam for an audio capture device in the video environment based at least in part on the location information and performing source separation for audio data related to the video data based at least in part on the location information, in order to modify video framing of the capture device so that an optimal video capture device can be selected in the video environment.

Regarding claim 5, the combination teachings of Ng, Bhat, Sharma, and Kalinli as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: perform source separation for audio data related to the video data based at least in part on the location information (see Kalinli, paragraph [0023]: “The processor 113 can implement a source separation algorithm with a priori information of the relevant user's location to extract relevant speech from the input to the microphone array”). The motivation for combining the references has been discussed in claim 4 above.

Claim 16 is rejected for the same reasons as discussed in claim 4 above. Claim 17 is rejected for the same reasons as discussed in claim 5 above.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIENRU YANG whose telephone number is (571)272-4212. The examiner can normally be reached Monday-Friday 10AM-6PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, THAI TRAN, can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

NIENRU YANG
Examiner, Art Unit 2484

/NIENRU YANG/
Examiner, Art Unit 2484

/THAI Q TRAN/
Supervisory Patent Examiner, Art Unit 2484
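For readers skimming the §103 analysis above, the data flow recited in claim 1 (image features into a facial recognition model, the resulting facial feature set into a pose tracking model, and the two feature sets combined into an augmented set used to generate location information) can be summarized in a short sketch. This is purely illustrative pseudo-structure built from the claim language quoted in the rejection; the function names, the list-based feature representation, and the normalized-correlation helper are assumptions, not the applicant's or the cited references' implementations.

```python
import math

def normalized_correlation(a, b):
    # Illustrative similarity metric in the spirit of the "normalized correlation
    # matching" recited in claim 9; not the applicant's actual formulation.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0

def process_frame(frame, extract_features, face_recognition_model, pose_tracking_model, locate):
    # Claim 1 pipeline as characterized in the rejection; every model here is a
    # stand-in callable supplied by the caller.
    image_features = extract_features(frame)                           # extract an image feature set
    facial_features, face_id = face_recognition_model(image_features)  # facial feature set + facial identifier
    pose_features = pose_tracking_model(facial_features)               # pose tracking feature set
    augmented = list(facial_features) + list(pose_features)            # augmented feature set
    location = locate(augmented)                                       # location information for the identifier
    return face_id, location
```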

Prosecution Timeline

Nov 01, 2024
Application Filed
Oct 27, 2025
Non-Final Rejection — §103
Jan 20, 2026
Applicant Interview (Telephonic)
Jan 20, 2026
Examiner Interview Summary
Feb 02, 2026
Response Filed
Mar 06, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12604024
REPRODUCTION DEVICE, REPRODUCTION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Apr 14, 2026
Patent 12592259
SYSTEMS AND METHODS TO EDIT VIDEOS TO REMOVE AND/OR CONCEAL AUDIBLE COMMANDS
2y 5m to grant • Granted Mar 31, 2026
Patent 12586609
USING AUDIO ANCHOR POINTS TO SYNCHRONIZE RECORDINGS
2y 5m to grant • Granted Mar 24, 2026
Patent 12581030
REPRODUCTION DEVICE, REPRODUCTION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Mar 17, 2026
Patent 12556720
LEARNED VIDEO COMPRESSION AND CONNECTORS FOR MULTIPLE MACHINE TASKS
2y 5m to grant • Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 72%
Grant Probability with Interview: 99% (+28.7%)
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 399 resolved cases by this examiner. Grant probability derived from career allow rate.
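The interview-related figures are internally consistent if "+28.7%" is read as the gap between the allow rate for this examiner's resolved cases that had an interview (99%) and those that did not. A minimal sketch of that reading, assuming that definition; the without-interview rate is back-calculated here and is not shown anywhere on the page.

```python
# Displayed values from the projection card.
overall_allow = 287 / 399   # 71.9%, shown as the 72% grant probability
with_interview = 0.99       # displayed "With Interview" probability
interview_lift = 0.287      # displayed "+28.7%" lift

# Assumed reading: lift = (allow rate with interview) - (allow rate without interview).
without_interview = with_interview - interview_lift  # ~70.3%, back-calculated

assert without_interview <= overall_allow <= with_interview  # the overall rate sits between the two
print(f"Implied allow rate without an interview: {without_interview:.1%}")
```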
