Prosecution Insights
Last updated: April 19, 2026
Application No. 18/934,771

VIDEO CONTENT PROCESSING BASED ON FACIAL RECOGNITION AND POSE TRACKING MODELING

Current Status: Final Rejection (§103)
Filed: Nov 01, 2024
Examiner: YANG, NIENRU
Art Unit: 2484
Tech Center: 2400 (Computer Networks)
Assignee: Shure Acquisition Holdings Inc.
OA Round: 2 (Final)
Grant Probability: 72% (Favorable)
Expected OA Rounds: 3-4
Projected Time to Grant: 2y 9m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 72% (287 granted / 399 resolved), +13.9% vs Tech Center average (above average)
Interview Lift: +28.7% higher allowance for resolved cases with an interview vs. without (strong)
Typical Timeline: 2y 9m average prosecution; 30 applications currently pending
Career History: 429 total applications across all art units
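The headline figures in this card are simple ratios of the displayed counts. Below is a minimal sketch of that arithmetic, assuming the dashboard's apparent definitions (allow rate = granted / resolved; pending = total applications - resolved); the variable names are illustrative, and the implied Tech Center average is back-calculated from the displayed "+13.9%" delta rather than taken from any source.

```python
# Arithmetic behind the "Examiner Intelligence" card (assumed definitions,
# using only the values displayed above).
granted = 287             # applications granted by this examiner
resolved = 399            # resolved cases (grants plus abandonments)
total_applications = 429  # career total across all art units

allow_rate = granted / resolved                # 0.719 -> displayed as 72%
pending = total_applications - resolved        # 429 - 399 = 30 currently pending
implied_tc_average = allow_rate - 0.139        # back-calculated from "+13.9% vs TC avg"

print(f"Career allow rate: {allow_rate:.1%}")           # Career allow rate: 71.9%
print(f"Currently pending: {pending}")                  # Currently pending: 30
print(f"Implied TC average: {implied_tc_average:.1%}")  # Implied TC average: 58.0%
```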

Statute-Specific Performance

§101: 5.6% (-34.4% vs TC avg)
§103: 73.6% (+33.6% vs TC avg)
§102: 6.5% (-33.5% vs TC avg)
§112: 7.8% (-32.2% vs TC avg)
Tech Center average is an estimate. Based on career data from 399 resolved cases.
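The "vs TC avg" deltas appear to be plain differences between the examiner's per-statute rate and the Tech Center average estimate. A minimal sketch under that assumption follows; the baseline implied by every displayed delta works out to 40.0%, which is back-calculated here, not independently sourced.

```python
# Per-statute rates and deltas as displayed, assuming delta = examiner rate - TC average.
examiner_rates = {"§101": 5.6, "§103": 73.6, "§102": 6.5, "§112": 7.8}  # percent
deltas = {"§101": -34.4, "§103": 33.6, "§102": -33.5, "§112": -32.2}    # percentage points vs TC avg

for statute, rate in examiner_rates.items():
    implied_tc_average = rate - deltas[statute]  # works out to 40.0 for each statute shown
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_average:.1f}%")
```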

Office Action

Rejection basis: §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Preliminary Remarks

This is a reply to the amendment filed on 02/02/2026, in which claims 1, 8, 13, and 20 are amended. Claims 1-20 remain pending in the present application, with claims 1, 13, and 20 being independent claims. When making claim amendments, the applicant is encouraged to consider the references in their entireties, including those portions that have not been cited by the examiner and their equivalents, as they may most broadly and appropriately apply to any particular anticipated claim amendments.

Response to Arguments

Regarding the 35 U.S.C. §101 rejection of claim 20, Applicants have amended the claims to add the limitation "non-transitory" to the claim, rendering the rejection moot. Therefore, the outstanding 35 U.S.C. §101 rejection of claim 20 is withdrawn. Applicant's arguments filed on 02/02/2026 with respect to amended claims 1, 13, and 20 have been considered but are moot in view of the new ground(s) of rejection.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 6-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ng et al. (US 20200320278 A1, hereinafter referred to as “Ng”) in view of Bhat et al. (US 11582519 B1, hereinafter referred to as “Bhat”), and further in view of Sharma et al. (US 20230195224 A1, hereinafter referred to as “Sharma”).
Regarding claim 1, Ng discloses an apparatus comprising at least one processor (see Ng, paragraph [0019]: “one or more processors”) and a memory storing instructions that are operable, when executed by the at least one processor, to cause the apparatus to (see Ng, paragraph [0019]: “The memory stores instructions that, when executed by the one or more processors, cause the system to”): receive video data captured by at least one video capture device located within a video environment (see Ng, paragraph [0039]: “face-detection-and-tracking subsystem 112 is configured to receive the captured video images, such as captured high-resolution video images via bus 102, perform CNN-based face-detection and face-pose-estimation operations on the received video images using joint face-detection and face-pose-estimation module 116 to detect faces within each video image and generate face-pose-estimations for each detected face”); extract an image feature set from the video data (see Ng, paragraph [0085]: “a feature extraction operation is performed on that best pose face to extract a predetermined image feature from the face image”); input the facial feature set to a pose tracking model to generate a pose tracking feature set for the facial identifier (see Ng, paragraph [0047]: “When a person is moving in a video, the person's head/face can have different orientations, i.e., different poses in different video images. Estimating the pose of each detected face allows for keeping track of the pose change of each face through the sequence of video frames”); and augment the facial feature set with the pose tracking feature set to generate an augmented feature set for the facial identifier (see Ng, paragraph [0050]: “In face-detection-and-tracking subsystem 200, face-pose-estimation module 208 is followed by best-pose-selection module 210 configured to determine and update the “best pose” for each tracked person from a sequence of pose-estimations associated with a sequence of detected faces of the tracked person in a sequence of video frames. In some embodiments, the best pose is defined as a face pose closest to the frontal view (i.e., with the smallest overall head rotations). As can be seen in FIG. 2, best-pose-selection module 210 can be coupled to face tracking module 212 to receive face tracking information. Hence, best-pose-selection module 210 can keep track of each tracked person as the pose of this person is continuously estimated at face pose estimation module 208 and the best pose of this person is continuously updated at best-pose-selection module”).

Regarding claim 1, Ng discloses all the claimed limitations with the exception of: input the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment; generate location information for the facial identifier based at least in part on the augmented feature set; and output the location information for the facial identifier.

Bhat, from the same or similar fields of endeavor, discloses input the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment (Bhat, Column 28, line 63 to Column 29, line 2: “the video synthesis system 720 may also receive the source frame A 702, and identify and/or localize the source person within the source frame A 702 via the source person identifier 728. For example, the source person identifier 728 may execute any suitable face recognition algorithm and determine an identity of the source person”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Bhat with the teachings as in Ng. The motivation for doing so would be to ensure that the system has the ability to use the video synthesis system disclosed in Bhat to receive the source frame and to identify and/or localize the source person within the source frame via the source person identifier, wherein the source person identifier may execute any suitable face recognition algorithm and determine an identity of the source person, thus inputting the image feature set to a facial recognition model to generate a facial feature set for a facial identifier associated with a target of interest in the video environment, in order to utilize one or more facial recognition techniques with respect to the video data received from the one or more video capture devices to identify one or more faces in the video data so that a particular target of interest, such as a specific person, can be identified.

Regarding claim 1, the combination teachings of Ng and Bhat disclose all the claimed limitations with the exception of: generate location information for the facial identifier based at least in part on the augmented feature set; and output the location information for the facial identifier.

Sharma, from the same or similar fields of endeavor, discloses generate location information for the facial identifier based at least in part on the augmented feature set (see Sharma, paragraphs [0062]-[0064]: “Augment operation 608 augments data based on the image grid … the augment operation 608 may include a positional augmentation based on image mirroring … Generate operation 610 generates a multi-dimensional feature vector based on the image grid … The generate operation 612 may use the facial image associated with the image grid as received by the receive operation 604. The generate operation 612 generates head pose information based on facial landmarks in the facial image. In aspects, the generation 612 may identify the facial landmarks by extracting two-dimensional coordinates of points on the face (e.g., the corners of eyes, the tip of the nose, corners of the mouse, a tip of the chin, and the like) and three-dimensional locations of these points”); and output the location information for the facial identifier (see Sharma, paragraphs [0062]-[0066]: “an output from the first fully connected neural network may be a vector with 128 dimensions. The second vector may be 64 dimensions. Similarly, the generate operation 614 may use a third fully connected neural network to determine a vector with two-dimensions as an eye-gaze location in the X-Y coordinates. In aspects, the generate operation 614 may proceed to the transmit operation 518 to transmit the eye-gaze location to one or more controllers in the system”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Sharma with the teachings as in Ng and Bhat.
The motivation for doing so would be to ensure that the system has the ability to use the computer-implemented method for predicting a location of an eye gaze of an operator disclosed in Sharma to augment data based on the image grid; to use the facial image associated with the image grid to generate a multi-dimensional feature vector based on the image grid; to identify the facial landmarks by extracting two-dimensional coordinates of points on the face and three-dimensional locations of these points; to determine a vector with two dimensions as an eye-gaze location in the X-Y coordinates; and to output the eye-gaze location to one or more controllers in the system, thus generating location information for the facial identifier based at least in part on the augmented feature set and outputting the location information for the facial identifier, in order to identify the location information of the target of interest so that it is possible to enable tracking across multiple video capture devices.

Regarding claim 2, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: modify video framing of the at least one video capture device based at least in part on the location information (see Ng, paragraphs [0047]-[0048]: “The outputs from face-pose-estimation module 208 can be used by best-pose-selection module 210 to update the best pose for each tracked person as that person moves through the sequence of video frames…In one technique, face pose is estimated based on the locations of some facial landmarks, such as eyes, nose, and mouth, e.g., by computing distances of these facial landmarks from the frontal view”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 3, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: generate input data for a machine learning model based at least in part on the location information (see Ng, paragraph [0076]: “face detection module 706 can be implemented with a DL-based MTCNN architecture ... the third stage of the MTCNN uses a more powerful CNN to further decide whether each input window is a face or not. If it is determined to be so, the locations of five facial landmarks are also estimated. The MTCNN architecture is generally more suitable for implementations on resource-limited embedded vision systems compared to the cascaded CNN framework. Other than using the MTCNN architecture, face detection module 706 can also be implemented with other known or later developed CNN-based face-detection architectures and techniques without departing from the scope of the described technology. Face detection module 706 generates a set of detected faces 716 and the corresponding bounding box locations”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 6, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: select a video capture device in the video environment for outputting a video stream associated with the facial identifier based at least in part on the location information (see Ng, paragraph [0043]: “a given video image of captured video 202 is first received by motion detection module 204. In some embodiments, it is assumed that a human face captured in video image 202 is associated with a motion, which begins when a person first enters the field of view of the camera and ends when that same person exits the field of view of the camera or being obscured by another person or an object. Hence, to reduce the computational complexity of face-detection-and-tracking subsystem 200, motion detection module 204 can be used to preprocess each video frame to locate and identify those areas within each video frame which are associated with motions”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 7, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: generate a three-dimensional (3D) model of the video environment based at least in part on the location information (Bhat, Column 15, lines 44-60: “FIG. 3 illustrates an example technique for generating facial and/or body parameters of a three-dimensional model of a particular person, in accordance with various embodiments. In diagram 300 of FIG. 3, a person 302 is depicted. The person 302 may correspond to a localized portion of the person 302 captured within a particular frame (e.g., a first frame, such a frame 106 of FIG. 1) of a particular shot (e.g., shot 104) of a sequence of shots of a video file (e.g., video file 102). As described further herein, a video synthesis system (e.g., video synthesis system 310, which may be similar to any video synthesis system described herein) may generate a data structure that corresponds to (e.g., defines) the pose 304 for the body of the person 302. Then, based at least in part on the pose 304, the video synthesis system 310 may generate a 3D model 306 (e.g., a 3DMM) for the body of the person 302 that incorporates both the pose and shape of the body of the person 302”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 8, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: input the augmented feature set to a facial similarity model to determine an accuracy metric score for the augmented facial feature set (see Ng, paragraph [0086]: “if all computed similarity values between the newly extracted feature and the stored features are below a predetermined threshold, the best pose face is considered to be associated with a unique face, which is then transmitted to the server and the associated extracted feature is stored into the feature buffer or storage”). The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 9, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: compare the augmented feature set to a predetermined representation of the target of interest via normalized correlation matching to generate a similarity score for the augmented feature set (see Ng, paragraph [0064]: “compare the search block against the image patch at a given search location within the search window by computing a similarity score between the search block and the compared image patch”); and output the location information based at least in part on the similarity score (see Ng, paragraph [0063]: “process 600 identifies the same detected face in the unprocessed video frame at a search location where the best match between the search block and the corresponding image patch is found”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 10, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: input the augmented feature set to a Kalman filter model to provide a movement prediction in the video environment for the target of interest (see Ng, paragraph [0065]: “the prediction of the movement can include either a linear prediction or a non-linear prediction. In linear prediction, the trajectory and speed of the movement can be predicted. In non-linear prediction, a Kalman filter approach can be applied”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 11, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: update a list of tracked faces for respective video frames in the video data based at least in part on the location information (see Ng, paragraph [0062]: “face tracking module 212 is configured to determine the location of each tracked face within an unprocessed video frame (e.g., Frame 2) immediately following a processed frame 504 (e.g., Frame 1) based on the determined location of the tracked face in the processed frame (e.g., Frame 1)”). The motivation for combining the references has been discussed in claim 1 above.

Regarding claim 12, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose the apparatus of claim 1, wherein the facial recognition model includes a multi-task cascaded convolutional neural network (MTCNN) configured for facial recognition and a transfer learning model configured for facial recognition (see Ng, paragraph [0016]: “the face-detection statistical model includes a CNN face-detection module, and the CNN face-detection module further includes a multitask-cascaded-CNN (MTCNN)”). The motivation for combining the references has been discussed in claim 1 above.

Claim 13 is rejected for the same reasons as discussed in claim 1 above. Claim 14 is rejected for the same reasons as discussed in claim 2 above. Claim 15 is rejected for the same reasons as discussed in claim 3 above. Claim 18 is rejected for the same reasons as discussed in claim 6 above. Claim 19 is rejected for the same reasons as discussed in claim 7 above. Claim 20 is rejected for the same reasons as discussed in claim 1 above.
In addition, the combination teachings of Ng, Bhat, and Sharma as discussed above also disclose a computer program product, stored on a non-transitory computer readable storage medium (see Ng, paragraph [0100]: “the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or non-transitory processor-readable storage medium”).

Claims 4-5 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Ng, Bhat, and Sharma as applied to claim 1, and further in view of Kalinli (US 20120259638 A1, hereinafter referred to as “Kalinli”).

Regarding claim 4, the combination teachings of Ng, Bhat, and Sharma as discussed above disclose all the claimed limitations with the exception of the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: steer a microphone array beam for an audio capture device in the video environment based at least in part on the location information.

Kalinli, from the same or similar fields of endeavor, discloses the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: steer a microphone array beam for an audio capture device in the video environment based at least in part on the location information (see Kalinli, paragraph [0023]: “The microphone array can be used to steer and extract the sound only coming from the sound source located by the camera in the field of view”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Kalinli with the teachings as in Ng, Bhat, and Sharma. The motivation for doing so would be to ensure that the system has the ability to use the system and method for determining relevance of input speech disclosed in Kalinli to use the microphone array to steer and extract the sound only coming from the sound source located by the camera in the field of view, and to use a source separation algorithm with a priori information of the relevant user's location to extract relevant speech from the input to the microphone array, thus steering a microphone array beam for an audio capture device in the video environment based at least in part on the location information and performing source separation for audio data related to the video data based at least in part on the location information, in order to modify video framing of the capture device so that an optimal video capture device can be selected in the video environment.

Regarding claim 5, the combination teachings of Ng, Bhat, Sharma, and Kalinli as discussed above also disclose the apparatus of claim 1, wherein the instructions are further operable to cause the apparatus to: perform source separation for audio data related to the video data based at least in part on the location information (see Kalinli, paragraph [0023]: “The processor 113 can implement a source separation algorithm with a priori information of the relevant user's location to extract relevant speech from the input to the microphone array”). The motivation for combining the references has been discussed in claim 4 above.

Claim 16 is rejected for the same reasons as discussed in claim 4 above. Claim 17 is rejected for the same reasons as discussed in claim 5 above.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIENRU YANG whose telephone number is (571)272-4212. The examiner can normally be reached Monday-Friday 10AM-6PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, THAI TRAN, can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

NIENRU YANG
Examiner, Art Unit 2484

/NIENRU YANG/
Examiner, Art Unit 2484

/THAI Q TRAN/
Supervisory Patent Examiner, Art Unit 2484
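For readers skimming the §103 analysis above, the data flow recited in claim 1 (image features into a facial recognition model, the resulting facial feature set into a pose tracking model, and the two feature sets combined into an augmented set used to generate location information) can be summarized in a short sketch. This is purely illustrative pseudo-structure built from the claim language quoted in the rejection; the function names, the list-based feature representation, and the normalized-correlation helper are assumptions, not the applicant's or the cited references' implementations.

```python
import math

def normalized_correlation(a, b):
    # Illustrative similarity metric in the spirit of the "normalized correlation
    # matching" recited in claim 9; not the applicant's actual formulation.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0

def process_frame(frame, extract_features, face_recognition_model, pose_tracking_model, locate):
    # Claim 1 pipeline as characterized in the rejection; every model here is a
    # stand-in callable supplied by the caller.
    image_features = extract_features(frame)                           # extract an image feature set
    facial_features, face_id = face_recognition_model(image_features)  # facial feature set + facial identifier
    pose_features = pose_tracking_model(facial_features)               # pose tracking feature set
    augmented = list(facial_features) + list(pose_features)            # augmented feature set
    location = locate(augmented)                                       # location information for the identifier
    return face_id, location
```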

Prosecution Timeline

Nov 01, 2024
Application Filed
Oct 27, 2025
Non-Final Rejection — §103
Jan 20, 2026
Applicant Interview (Telephonic)
Jan 20, 2026
Examiner Interview Summary
Feb 02, 2026
Response Filed
Mar 06, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12604024
REPRODUCTION DEVICE, REPRODUCTION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Apr 14, 2026
Patent 12592259
SYSTEMS AND METHODS TO EDIT VIDEOS TO REMOVE AND/OR CONCEAL AUDIBLE COMMANDS
2y 5m to grant • Granted Mar 31, 2026
Patent 12586609
USING AUDIO ANCHOR POINTS TO SYNCHRONIZE RECORDINGS
2y 5m to grant • Granted Mar 24, 2026
Patent 12581030
REPRODUCTION DEVICE, REPRODUCTION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Mar 17, 2026
Patent 12556720
LEARNED VIDEO COMPRESSION AND CONNECTORS FOR MULTIPLE MACHINE TASKS
2y 5m to grant • Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 72%
Grant Probability with Interview: 99% (+28.7%)
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 399 resolved cases by this examiner. Grant probability derived from career allow rate.
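The interview-related figures are internally consistent if "+28.7%" is read as the gap between the allow rate for this examiner's resolved cases that had an interview (99%) and those that did not. A minimal sketch of that reading, assuming that definition; the without-interview rate is back-calculated here and is not shown anywhere on the page.

```python
# Displayed values from the projection card.
overall_allow = 287 / 399   # 71.9%, shown as the 72% grant probability
with_interview = 0.99       # displayed "With Interview" probability
interview_lift = 0.287      # displayed "+28.7%" lift

# Assumed reading: lift = (allow rate with interview) - (allow rate without interview).
without_interview = with_interview - interview_lift  # ~70.3%, back-calculated

assert without_interview <= overall_allow <= with_interview  # the overall rate sits between the two
print(f"Implied allow rate without an interview: {without_interview:.1%}")
```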
