DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendments filed on October 1, 2025, have been entered. Applicant amended claims 1, 4, 9, 10, 12, and 17 and added new claims 18 and 19. Claims 1-19 remain pending in the application.
Response to Arguments
Applicant’s arguments filed on June 3, 2025, in response to the Non-Final Office Action dated April 1, 2025, have been fully considered.
Applicant argues, on pages 7-8 of the Remarks, that “Johnson does not describe providing for display a real-time digital twin of the speaker, as required by amended claims 1 and 9.” Applicant further argues that Johnson describes generating a plurality of replacement image frames to be inserted into a video stream, which is not the same as generating a real-time digital twin of the speaker.
In response, the Examiner respectfully disagrees. Johnson discloses that the generated replacement image frames provide a consistent video depiction of the person’s face based on the facial location data augmented by audio data, as explained in Col. 3, lines 57-63, and Col. 8, lines 18-27. Therefore, such a depiction of the person’s face is a representation of a real-time digital twin of the speaker.
Examiner’s Note about the Format of 35 U.S.C. 102/103 Rejections
Generally, the limitations of a claim are reproduced verbatim, with each limitation followed by the examiner’s explanation and citations to the prior art in italics, enclosed in parentheses, (). In the examiner’s explanation, the mapping of the key elements of a limitation to the disclosed elements of the prior art is shown by stating the disclosed element immediately followed by the claimed element in parentheses. Specific quotations from the prior art are delineated with quotation marks, “”. If the primary art fails to teach a limitation, or a part of a limitation, that limitation or part is placed inside double square brackets, [[ ]], for better readability, and appropriate secondary art is applied later to address the deficiency of the primary art.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-6, 9-14, and 17-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Johnson et al. (US Patent No. 11,368,652), hereinafter Johnson.
Regarding claim 1:
Johnson teaches:
A method comprising:
capturing at least one image of a speaker in a video stream (Col. 10, lines 22-25, discloses image frames of first video content (video stream), as stated: “The process of FIG. 8 is initiated at operation 810, at which played image frames included in first video content (e.g., captured video 172 of FIG. 1) are received over one or more networks.” Col. 10, lines 37-39, discloses that the image frames include a face of the person (speaker), as stated: “As described above, the played image frames and the replaced image frame may include a face of a person.”);
determining that a network condition between a client device and a video service node is insufficient when compared to a threshold level (Fig. 4, graph 420, shows monitoring bandwidth (network condition) of the connection, as explained in Col. 7, lines 19-46. Col. 10, lines 62-67, and Col. 11, lines 1-22, disclose that the bandwidth is insufficient when compared to a threshold);
capturing movement data for the speaker based at least in part on the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level (Col. 4, lines 22-24, states: “The captured video 172 may include the person's head movements, lip movements, facial expressions, and the like.” Col. 10, lines 1-7, discloses that the location data includes the person’s head movement, as stated: “In some examples, data from a given frame, as well as auxiliary data (e.g., audio data and/or location data) corresponding to a given frame, may be used to make a prediction regarding the contents of future frames. For example, the auxiliary data may be used to make a prediction about a person's head movement and positions and orientations of facial features (e.g., lips, eyes, nose, etc.)”);
storing a range of physical movements for the speaker based on the movement data for the speaker (Col. 10, lines 51-58, discloses storing the facial features of the person in a location data stream, as stated: “For example, as shown in FIG. 1, user node 162 may receive location data 173 over one or more network(s) 150 via location data stream 103. The location data may indicate locations of facial features of the face of the person within the replaced image frame. For example, as shown in FIG. 5, the location data 173 includes a data set 503, which indicates locations of facial features of a face of the person within a corresponding replaced frame (frame 403)”);
generating a data stream comprising changes to the at least one image of the speaker to recreate a real-time digital twin of the speaker, wherein generating the data stream is based at least in part on the stored range of physical movements for the speaker (Fig. 8, steps 818, 818A, and 818B, discloses generating a replacement image frame recreating the representation of the person based on the location data. Col. 3, lines 57-63, discloses that the replacement image frames are generated based on the facial location data and provide a consistent video depiction of the person’s face (real-time digital twin), as stated: “Thus, by generating the replacement frames based on auxiliary data, such as the location data and/or audio data described above, the techniques described herein may allow high quality video depictions of a person's face to be consistently displayed to users, even during periods of reduced bandwidth and while videoconferencing with large quantities of other participants.” Also see Col. 8, lines 18-27, stating: “The locations of a person's facial features in the replacement frame 413 may be determined based on the locations of the person's facial features in the corresponding replaced frame (frame 403), and these locations may be specified in the data set 503. In particular, the data set 503 may indicate locations of designated points on the person's face that were detected in frame 403, and the person's corresponding facial features may then be rendered at these locations in the replacement frame 413.”);
stopping transmission of video data to the client device and beginning transmission of the data stream; and providing for display, at the client device, the real-time digital twin of the speaker (Col. 13, lines 39-49, discloses that replacement frames are transmitted to be displayed at the user node instead of the video content, as stated: “For example, in some cases, location data 173 (and optionally captured video 172) may be streamed from the user node 161 to a cloud service (or other service or node). The cloud service could then generate the replacement frames 143 using the techniques described above. The cloud service could then send the replacement frames 143 to the user node 162 to be played at the user node 162. In some examples, this strategy could be employed when the transmitting node (e.g., user node 161) has a low bandwidth connection and the receiving node (e.g., user node 162) has a high bandwidth connection.”).
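Examiner’s Note: For illustration only, the sequence of steps mapped above (capturing a reference image, determining that the network condition is insufficient when compared to a threshold, capturing and storing movement data, generating a compact data stream, and switching from video transmission to the data stream) may be sketched in Python as follows. Every name, data structure, and numeric value in this sketch is hypothetical; none is drawn from Johnson, from the claims, or from Applicant’s specification.

    # Hypothetical sketch of the claim 1 flow as mapped above; all names
    # and values are illustrative and are not taken from Johnson.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    BANDWIDTH_THRESHOLD_KBPS = 500.0  # hypothetical "threshold level"

    @dataclass
    class Landmarks:
        points: List[Tuple[int, int]]  # toy stand-in for facial location data

    @dataclass
    class TwinModel:
        reference_image: bytes                          # the captured image of the speaker
        movement_range: List[Landmarks] = field(default_factory=list)

        def store_movement(self, lm: Landmarks) -> None:
            self.movement_range.append(lm)              # store range of physical movements

        def delta_stream(self) -> Landmarks:
            return self.movement_range[-1]              # stream only the latest changes

    def transmit(bandwidth_kbps: float, twin: TwinModel, frame: bytes, lm: Landmarks) -> str:
        if bandwidth_kbps >= BANDWIDTH_THRESHOLD_KBPS:
            return f"video: {len(frame)} bytes"         # normal video transmission
        twin.store_movement(lm)                         # movement data captured on insufficiency
        delta = twin.delta_stream()                     # generate the compact data stream
        return f"twin: {len(delta.points)} landmark points"  # receiver renders the digital twin

    twin = TwinModel(reference_image=b"jpeg-bytes")
    print(transmit(900.0, twin, b"frame", Landmarks([(1, 2)])))  # sufficient bandwidth: video path
    print(transmit(120.0, twin, b"frame", Landmarks([(1, 3)])))  # insufficient: digital-twin path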
As to claim 2, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson further teaches further comprising: analyzing the captured movement data with corresponding audio data received from the speaker, wherein the analysis identifies correlations between the range of physical movements of the speaker and audio characteristics of the speaker, and wherein the generating of the data stream is further based at least in part on audio characteristics of the speaker (Col. 8, lines 52-67, and Col. 9, lines 1-7, disclose correlating audio data with facial expressions of the user).
As to claim 3, the rejection of claim 2 is incorporated. Johnson teaches all the limitations of claim 2 as shown above.
Johnson further teaches further comprising: determining a correlation between physical movements of the speaker and audio characteristics associated with speech patterns, wherein each physical movement corresponds to a speech pattern (see at least Col. 11, line 67, and Col. 12, lines 1-9, stating “The frame generator may then determine positions indicated by the portion of the audio content, which may include lip positions associated with the one or more sounds. The facial features may then be rendered at the positions indicated by the portion of the audio content, such as the lip positions associated with the one or more sounds. As also described above, output from a machine learning model may indicate a correlation between the sounds spoken by the person and the lip positions associated with the one or more sounds.”).
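Examiner’s Note: For illustration only, the correlation between sounds and lip positions discussed above may be sketched as a static lookup table standing in for the output of Johnson’s machine learning model. The phoneme labels and coordinates below are hypothetical and are not drawn from Johnson.

    # Hypothetical stand-in for a learned sound-to-lip-position correlation;
    # the labels and coordinates are illustrative only.
    from typing import Dict, List, Tuple

    LIP_POSITION_BY_SOUND: Dict[str, Tuple[float, float]] = {
        "ah": (0.0, 0.8),  # open mouth
        "ee": (0.6, 0.2),  # spread lips
        "oo": (0.2, 0.4),  # rounded lips
    }

    def lip_positions_for(sounds: List[str]) -> List[Tuple[float, float]]:
        # map each detected sound to the lip position correlated with it
        return [LIP_POSITION_BY_SOUND.get(s, (0.0, 0.0)) for s in sounds]

    print(lip_positions_for(["ah", "oo"]))  # positions at which lips would be rendered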
As to claim 4, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson further teaches wherein the providing for display, at the client device, the real-time digital twin of the speaker comprises: providing a real-time digital twin of a head of the speaker, wherein control circuitry generates vectors describing different portions of the at least one image, including at least one of eyes, a mouth, or a nose; and partially modifying the video data such that only a facial region of the speaker within the video stream is replaced (see Fig. 3, showing a frame of a person’s face. Also see Col. 12, lines 47-66).
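Examiner’s Note: For illustration only, region-limited replacement of the kind discussed above (modifying only the facial region of a frame while leaving the remaining video data untouched) may be sketched as follows, using a toy one-byte-per-pixel frame. The region names, frame dimensions, and pixel values are hypothetical and are not drawn from Johnson.

    # Hypothetical sketch of partial frame modification; only the listed
    # facial regions are overwritten with digital-twin pixels.
    from dataclasses import dataclass
    from typing import List

    FRAME_WIDTH, FRAME_HEIGHT = 8, 8  # toy 8x8 frame, one byte per pixel

    @dataclass
    class Region:
        name: str  # e.g., "eyes", "mouth", "nose"
        x: int
        y: int
        w: int
        h: int

    def replace_facial_regions(frame: bytes, twin_pixels: bytes, regions: List[Region]) -> bytes:
        out = bytearray(frame)  # copy so the non-facial pixels stay intact
        for r in regions:
            for row in range(r.y, r.y + r.h):
                start = row * FRAME_WIDTH + r.x
                out[start:start + r.w] = twin_pixels[start:start + r.w]
        return bytes(out)

    frame = bytes(FRAME_WIDTH * FRAME_HEIGHT)           # all zeros ("video data")
    twin = bytes([255]) * (FRAME_WIDTH * FRAME_HEIGHT)  # all 255s ("twin pixels")
    face = [Region("eyes", 2, 1, 4, 1), Region("mouth", 2, 4, 3, 2)]
    print(replace_facial_regions(frame, twin, face).count(255))  # 10: only facial pixels replaced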
As to claim 5, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson further teaches wherein the capturing the at least one image of the speaker in the video stream comprises: storing a video frame from the video stream in which the speaker is depicted (Col. 5, lines 45-51, states “Specifically, as shown in FIG. 2, a frame 200, which is included in captured video 172, is provided to object recognition components 112. The object recognition components 112 perform an object recognition analysis on frame 200 to identify facial features 210 of a person's face that is shown in frame 200.”).
As to claim 6, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson further teaches wherein the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level comprises: monitoring available bandwidth for the client device; and determining, based at least in part on the monitoring, that the available bandwidth is below a threshold bandwidth (Fig. 4, graph 420, shows monitoring bandwidth (network condition) of the connection, as explained in Col. 7, lines 19-46. Col. 10, lines 62-67, and Col. 11, lines 1-22, disclose that the bandwidth is insufficient when compared to a threshold).
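Examiner’s Note: For illustration only, monitoring available bandwidth and determining that it is below a threshold bandwidth, as mapped above, may be sketched as follows. The class name, window size, and threshold value are hypothetical and are not drawn from Johnson.

    # Hypothetical bandwidth monitor using a sliding window of throughput samples.
    from collections import deque

    class BandwidthMonitor:
        def __init__(self, threshold_kbps: float, window: int = 5):
            self.threshold_kbps = threshold_kbps
            self.samples = deque(maxlen=window)  # recent throughput measurements

        def record(self, kbits_received: float, seconds: float) -> None:
            self.samples.append(kbits_received / seconds)  # kbps sample

        def insufficient(self) -> bool:
            if not self.samples:
                return False  # no data yet; assume the network condition is fine
            return sum(self.samples) / len(self.samples) < self.threshold_kbps

    mon = BandwidthMonitor(threshold_kbps=500.0)
    mon.record(kbits_received=300.0, seconds=1.0)  # one 300 kbps sample
    print(mon.insufficient())                      # True: below the threshold bandwidth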
Regarding claim 9:
Claim 9 is directed towards a system performing the method of claim 1. Accordingly, it is rejected under a similar rationale.
Claim 10 is directed towards a system performing the method of claim 2. Accordingly, it is rejected under a similar rationale.
Claim 11 is directed towards a system performing the method of claim 3. Accordingly, it is rejected under a similar rationale.
Claim 12 is directed towards a system performing the method of claim 4. Accordingly, it is rejected under a similar rationale.
Claim 13 is directed towards a system performing the method of claim 5. Accordingly, it is rejected under a similar rationale.
Claim 14 is directed towards a system performing the method of claim 6. Accordingly, it is rejected under a similar rationale.
Regarding claim 17:
Johnson teaches:
A method comprising:
capturing an image of an object in a video (Col. 10, lines 22-25, discloses image frames of first video content (video), as stated: “The process of FIG. 8 is initiated at operation 810, at which played image frames included in first video content (e.g., captured video 172 of FIG. 1) are received over one or more networks.” Col. 10, lines 37-39, discloses that the image frames include a face of the person (object), as stated: “As described above, the played image frames and the replaced image frame may include a face of a person.”);
accessing a network condition; comparing the accessed network condition to a set threshold (Fig. 4, graph 420, shows monitoring bandwidth (network condition) of the connection, as explained in Col. 7, lines 19-46. Col. 10, lines 62-67, and Col. 11, lines 1-22, disclose that the bandwidth is insufficient when compared to a threshold);
collecting movement data based at least in part on the accessed network condition being below the set threshold; storing a range of movements of the object based on the movement data (Col. 4, lines 22-24, states: “The captured video 172 may include the person's head movements, lip movements, facial expressions, and the like.” Col. 10, lines 1-7, discloses that the location data includes the person’s head movement, as stated: “In some examples, data from a given frame, as well as auxiliary data (e.g., audio data and/or location data) corresponding to a given frame, may be used to make a prediction regarding the contents of future frames. For example, the auxiliary data may be used to make a prediction about a person's head movement and positions and orientations of facial features (e.g., lips, eyes, nose, etc.)”);
generating a data stream from the stored range of movements, wherein the generating the data stream is based at least in part on the stored range of movements of the object (Col. 10, lines 51-58, discloses storing the facial features of the person in a location data stream, as stated: “For example, as shown in FIG. 1, user node 162 may receive location data 173 over one or more network(s) 150 via location data stream 103. The location data may indicate locations of facial features of the face of the person within the replaced image frame. For example, as shown in FIG. 5, the location data 173 includes a data set 503, which indicates locations of facial features of a face of the person within a corresponding replaced frame (frame 403)”. Also see Col. 8, lines 18-27, stating: “The locations of a person's facial features in the replacement frame 413 may be determined based on the locations of the person's facial features in the corresponding replaced frame (frame 403), and these locations may be specified in the data set 503. In particular, the data set 503 may indicate locations of designated points on the person's face that were detected in frame 403, and the person's corresponding facial features may then be rendered at these locations in the replacement frame 413.”);
switching from video stream transmission to data stream transmission; and providing for display a real-time digital twin of the object based at least in part on the data stream from the stored range of movements (Col. 13, lines 39-49, discloses that replacement frames are transmitted to be displayed at the user node instead of the video content, as stated: “For example, in some cases, location data 173 (and optionally captured video 172) may be streamed from the user node 161 to a cloud service (or other service or node). The cloud service could then generate the replacement frames 143 using the techniques described above. The cloud service could then send the replacement frames 143 to the user node 162 to be played at the user node 162. In some examples, this strategy could be employed when the transmitting node (e.g., user node 161) has a low bandwidth connection and the receiving node (e.g., user node 162) has a high bandwidth connection.” Col. 3, lines 57-63, discloses that the replacement image frames are generated based on the facial location data and provide a video depiction of the person’s face (real-time digital twin), as stated: “Thus, by generating the replacement frames based on auxiliary data, such as the location data and/or audio data described above, the techniques described herein may allow high quality video depictions of a person's face to be consistently displayed to users, even during periods of reduced bandwidth and while videoconferencing with large quantities of other participants.”).
As to claim 18, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson further teaches wherein the generated data stream comprises a video stream of the real-time digital twin, and wherein the real-time digital twin comprises a life-like recreation of the speaker (Col. 3, lines 57-63, discloses that the replacement image frames are generated based on the facial location data and provide a consistent video depiction of the person’s face, as stated: “Thus, by generating the replacement frames based on auxiliary data, such as the location data and/or audio data described above, the techniques described herein may allow high quality video depictions of a person's face to be consistently displayed to users, even during periods of reduced bandwidth and while videoconferencing with large quantities of other participants.”).
As to claim 19, the rejection of claim 18 is incorporated. Johnson teaches all the limitations of claim 18 as shown above.
Johnson further teaches wherein the real-time digital twin appears to move in real-time (Col. 3, lines 57-63, discloses that the video depiction of the person’s face is rendered based on real-time facial location and audio data).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Johnson in view of Lohmar et al. (US PGPUB No. 2019/0173935), hereinafter Lohmar.
As to claim 7, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson does not teach wherein the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level comprises: determining that a video buffer depth is below a threshold buffer depth.
Lohmar teaches wherein the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level comprises: determining that a video buffer depth is below a threshold buffer depth (paragraph 0058 discloses switching to low-quality encoding when the buffer size falls below a threshold).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Johnson to incorporate the teaching of Lohmar of switching to low-quality encoding when the buffer size falls below a threshold. One would have been motivated to do so in order to maintain satisfactory QoS parameters for live streaming (see paragraph 0058 of Lohmar).
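Examiner’s Note: For illustration only, a buffer-depth check of the kind described in paragraph 0058 of Lohmar may be sketched as follows. The function name and threshold value are hypothetical and are not drawn from Lohmar.

    # Hypothetical buffer-depth check; names and values are illustrative only.
    BUFFER_DEPTH_THRESHOLD_S = 2.0  # hypothetical threshold buffer depth, in seconds

    def buffer_depth_insufficient(buffered_seconds: float) -> bool:
        # the network condition is deemed insufficient when the playback
        # buffer holds less media than the threshold depth
        return buffered_seconds < BUFFER_DEPTH_THRESHOLD_S

    print(buffer_depth_insufficient(1.2))  # True: switch to the lower-rate stream
    print(buffer_depth_insufficient(4.5))  # False: keep normal video transmission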
Claim 15 is directed towards a system performing the method of claim 7. Accordingly, it is rejected under a similar rationale.
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Johnson in view of Liu et al. (US PGPUB No. 2020/0274641), hereinafter Liu.
As to claim 8, the rejection of claim 1 is incorporated. Johnson teaches all the limitations of claim 1 as shown above.
Johnson does not teach wherein the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level comprises: monitoring latency of the video stream; and determining, based on the monitoring, that the latency exceeds a threshold latency.
Liu teaches wherein the determining that the network condition between the client device and the video service node is insufficient when compared to the threshold level comprises: monitoring latency of the video stream; and determining, based on the monitoring, that the latency exceeds a threshold latency (paragraph 0019 discloses switching to low quality video when the latency exceeds a threshold. Also see Fig. 4).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Johnson to incorporate the teaching of Liu of switching to low-quality video when the latency exceeds a threshold. One would have been motivated to do so in order to prevent QoE degradation in delay-sensitive network-based communication (see paragraphs 0001 and 0020 of Liu).
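Examiner’s Note: For illustration only, a latency check of the kind described in paragraph 0019 of Liu may be sketched as follows. The names and threshold value are hypothetical and are not drawn from Liu.

    # Hypothetical latency check based on send/receive timestamps.
    import time

    LATENCY_THRESHOLD_MS = 150.0  # hypothetical threshold latency

    def latency_exceeds_threshold(send_ts: float, recv_ts: float) -> bool:
        # estimate one-way latency from timestamps and compare to the threshold
        return (recv_ts - send_ts) * 1000.0 > LATENCY_THRESHOLD_MS

    now = time.time()
    print(latency_exceeds_threshold(now - 0.200, now))  # True: 200 ms exceeds the threshold
    print(latency_exceeds_threshold(now - 0.050, now))  # False: 50 ms is acceptable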
Claim 16 is directed towards a system performing the method of claim 8. Accordingly, it is rejected under a similar rationale.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KAMAL M HOSSAIN whose telephone number is (571)270-3070. The examiner can normally be reached 9:30-5:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Follansbee can be reached at (571)272-3964. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
January 5, 2026
/KAMAL M HOSSAIN/ Primary Examiner, Art Unit 2444