Last updated: May 29, 2026

Application No. 18/527,668

TWO-STAGE FRAMEWORK FOR ZERO-SHOT IDENTITY-AGNOSTIC TALKING-HEAD GENERATION

Non-Final OA §102

Filed

Dec 04, 2023

Priority

Jun 16, 2023 — provisional 63/508,852

Examiner

TAHA, AHMED

Art Unit

2613

Tech Center

2600 — Communications

Assignee

Salesforce Inc.

OA Round

2 (Non-Final)

Interview Optional

Interview lift: +60.0%. An interview has already been conducted in this application's prosecution history. This examiner has a 67% grant rate with a +60.0% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 9 resolved cases, 2023–2026

Examiner Intelligence

TAHA, AHMED

Grants 67% — above average

Career Allowance Rate

6 granted / 9 resolved

+4.7% vs TC avg

Strong +60% interview lift

+60.0% Interview Lift (resolved cases with interview vs. without)

Typical timeline

2y 4m

Avg Prosecution

16 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§103

95.8%

+55.8% vs TC avg

§102

4.2%

-35.8% vs TC avg

Tech Center averages are estimates. Based on career data from 9 resolved cases.

Office Action

§102

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This action is in response to the amendment filed on October 30, 2025. Claims 1, 7, and 13 have been amended, and applicant added claims 19 and 20, which were not previously presented. The amended claim limitations have been fully considered but are not persuasive. Claims 1-20 remain rejected in the application.
Response to Arguments
In response to applicant’s argument that Tandon fails to disclose “generating a second audio stream”, the argument has been fully considered but is not persuasive. Tandon explicitly discloses generating an audio stream [Tandon: 0041 “Process 400 converts (420) content from the compressed text transcripts to audio content. In many embodiments, text content is converted to audio content using a TTS system.” “TTS systems are calibrated to produce audio in the voice of the participant who generated the initial encoded content.”]; the disagreement is whether this citation corresponds to a second audio stream. Examiner notes that implementing it with a second identity [Tandon: 0044 “Initialization data may contain a “User ID” that designates the data to a particular participant.”] is inherent: Tandon teaches utilizing IDs for the data that is generated, so assigning the result a second identity is inherent. Claims 1-20 remain rejected in the application.
In response to applicant’s arguments regarding allowing the dependent claims, the arguments have been fully considered but are not persuasive. Because the Examiner maintains the rejection of the independent claims, the rejections of the dependent claims are also maintained. Claims 1-20 remain rejected in the application.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Tandon et al. (U.S. Patent Publication No. 2022/0417291).
Regarding claim 1, Tandon discloses a method for data generation, comprising: inputting a first audio stream and a first text string corresponding to the first audio stream into a machine learning model, wherein the first audio stream and the first text string correspond to a first identity (interpreted as feeding both the Speaker A audio and its transcript into a machine learning model; both pieces of data belong to the same person) [Tandon: 0029 “encoders convert captured content to text transcripts by extracting audio content from the captured content”][Tandon: 0031 “Upon reception of the text transcripts, either communication devices 110 or 130 may reconstruct the captured content based on the text transcripts. In many embodiments, text transcripts are converted back to audio using a text-to-speech (TTS) system accessible through an application programming interface (API).”][Tandon: 0044 “Initialization data may contain a “User ID” that designates the data to a particular participant.”] (Tandon teaches that the encoder first takes Speaker A audio and derives the matching text. Those paired inputs (audio + text) are then consumed by downstream ML components (ASR/TTS, facial-animation model); TTS, or text to speech, is a machine learning model. Tandon further teaches designating a “User ID”, corresponding exactly to the limitation claimed); generating a second audio stream based at least in part on an output of the machine learning model, wherein the second audio stream is associated with a second identity that is different from the first identity (an inherent limitation: the User ID that can be set is expected to be set differently for the two different identities) and mimics the first audio stream (interpreted as the model outputting a new audio track that sounds like person B but mimics the words/timing of person A) [Tandon: 0041 “Process 400 converts (420) content from the compressed text transcripts to audio content. In many embodiments, text content is converted to audio content using a TTS system.” “TTS systems are calibrated to produce audio in the voice of the participant who generated the initial encoded content.”] (Tandon teaches that the decoder’s TTS module (a machine learning model) creates a fresh audio stream from the transcript; calibration lets it reproduce a chosen voice. Because the voice can be selected per participant, this meets the “second identity” while retaining the linguistic content of the first stream); inputting a visual medium that depicts the second identity that is different from the first identity into a talking head generation model (Tandon teaches this in the previous limitation with the first identity; to repeat that with a second identity, please refer to the citations of the previous limitations); and generating a video that displays the second identity speaking the first text string based at least in part on combining the second audio stream with the visual medium that depicts the second identity (interpreted as building a talking-head video of person B saying the same words, by merging B’s face imagery with the cloned audio) [Tandon: 0026 “Further, initialization data can be used to reconstruct a facsimile of a presenter which is synchronized to a realistic reproduction of their voice speaking the transmitted text.”][Tandon: 0028 “the decoder may reconstruct video content by applying the driving video and the reconstructed audio content to a facial animation model.”][Tandon: 0044 “Process 400 reconstructs (460) video content based on the converted audio using the initialization data and a facial animation model.”] (Tandon teaches that the driving video (visual medium of the target participant) plus the TTS-generated audio are fused via the facial-animation model to yield a lip-synced video of the second identity uttering the original text).
Regarding claim 2, Tandon discloses the method of claim 1, further comprising: training the machine learning model based at least in part on a set of audio streams and a set of text strings [Tandon: 0009 “the method further includes training a facial animation model with initialization data, reconstructing video content of the converted audio being spoken using the converted audio and the facial animation model”][Tandon: 0041 “Frequent participants of the communication system may provide more training data of them speaking to calibrate the TTS systems such that audio reconstruction becomes increasingly accurate with more usage.”](Tandon clearly teaches training the machine learning model on initialization data (text strings) and participants speaking (audio stream)), wherein the set of audio streams and the set of text strings correspond to a plurality of identifiers (interpreted as the training data comes from more than one speaker/identity)[Tandon: 0031 “the facial animation model can be trained offline with initialization data of one participant or a group of participants.”].
Regarding claim 3, Tandon discloses the method of claim 1, further comprising: identifying a first set of features associated with the first audio stream [Tandon: 0030 “encoders convert captured content to text transcripts by extracting audio content from the captured content.”] (teaches capturing features from the captured content); identifying a second set of features associated with the first text string [“Expression data can include visual cues of a participant, for example, eyelid movements, eyebrow movements, micro expressions, head movements, facial expressions, gestures, and/or any other visual cue as appropriate to the requirements of specific applications of embodiments of the invention. In various embodiments, the expression data is provided as metadata to the text transcript. In numerous embodiments, the expression data is used to recreate similar visual cues in the reconstructed audio video content.”] (teaches collecting additional features from the participants); and generating a head motion sequence corresponding to the second identity speaking the first text string based at least in part on the first set of features and the second set of features [Tandon: 0044 “In many embodiments, facial animation models can animate faces to produce effects such as but not limited to a lip-syncing effect. Facial animation models may animate any other facial expressions in accordance with embodiments of the invention. In selected embodiments, facial animation models can include lip-sync models. Facial animation models in accordance with numerous embodiments of the invention are trained with initialization data.” “In several embodiments, facial animation models are able to use reconstructed audio content and produce video content of the participant that accurately mimics how the participant's facial features would move when speaking the audio content based on the initialization data.”] (clearly teaches facial animation corresponding to the text string based on the features).
Regarding claim 4, Tandon discloses the method of claim 3, wherein generating the video comprises: generating the video based at least in part on the generated head motion sequence (interpreted as the final talking head video is rendered by feeding the previously produced head-motion data into the video synthesis stage) [Tandon: 0018 “Expression data can include visual cues of a participant, for example, eyelid movements, eyebrow movements, micro expressions, head movements”], [Tandon: 0028 “In many embodiments, the decoder may reconstruct video content by applying the driving video and the reconstructed audio content to a facial animation model”], [Tandon: 0044 “In several embodiments, facial animation models are able to use reconstructed audio content and produce video content of the participant that accurately mimics how the participant's facial features would move when speaking the audio content based on the initialization data.”](clearly teaches generating the video content based on the previously reconstructed content which includes head movement).
Regarding claim 5, Tandon discloses the method of claim 1, further comprising: identifying a set of characteristics associated with the visual medium (interpreted as pulling out descriptive data from the target person’s reference video/images), wherein the set of characteristics include one or more geometric parameters [Tandon: 0028 “The driving video can be agnostic to the content that is being decoded, and can identify facial key points of the participant in accordance with several embodiments of the invention.”] (teaches identifying key points (geometric parameters) associated with the participant) and one or more appearance characteristics associated with a head motion of the second identity [Tandon: 0027 “Expression data can include visual cues of a participant, for example, eyelid movements, eyebrow movements, micro expressions, head movements”] (teaches that the characteristics collected for a participant may include head movement).
Regarding claim 6, Tandon discloses the method of claim 5, wherein generating the video comprises: generating the video based at least in part on combining the set of characteristics with the first audio stream (Tandon: 460; Fig. 4 – “Reconstruct video content based on the converted audio using the initialization data, additional expression data and a facial animation model”)(clearly teaches that the reconstructed video content is based on combining the additional expression data (characteristics) and the first audio stream).
Claims 7 and 13 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 1 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 7 and 13 are rejected for the same reason as claim 1.
Claims 8 and 14 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 2 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 8 and 14 are rejected for the same reason as claim 2.
Claims 9 and 15 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 3 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 9 and 15 are rejected for the same reason as claim 3.
Claims 10 and 16 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 4 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 10 and 16 are rejected for the same reason as claim 4.
Claims 11 and 17 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 5 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 11 and 17 are rejected for the same reason as claim 5.
Claims 12 and 18 are apparatus and non-transitory computer-readable medium claims corresponding to method claim 6 above. Tandon further discloses a processor (Tandon: 610; Fig. 6) and a non-transitory computer-readable medium storing code [Tandon: 0052 “In various embodiments, processor instructions can be stored on a non-transitory machine readable medium”]. Thus, claims 12 and 18 are rejected for the same reason as claim 6.
Regarding claim 19, Tandon discloses the method of claim 1, wherein the visual medium includes at least one of an image or a video of the second identity [Tandon: 0044 “User ID”] (teaches identifying the data collected, which may represent an image or video and is not limited to one-time use, meaning it is inherent or routine to identify multiple different images with different identities).
Claim 20 is an apparatus claim corresponding to method claim 19 without any additional limitations. Thus, claim 20 is rejected for the same reasons as claim 19 above.
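For readers tracking which analysis each rejection relies on, the claim groupings stated above can be summarized as a small lookup. This is only a reading aid under the groupings recited in this action; the names `CLAIM_FAMILIES` and `rejection_basis` are hypothetical and are not part of the record.

```python
# Claim groupings as stated in the rejection: each method claim is paired with
# the counterpart claims that are rejected "for the same reason".
# The structure and names here are illustrative assumptions, not record text.
CLAIM_FAMILIES = {
    1: (7, 13),
    2: (8, 14),
    3: (9, 15),
    4: (10, 16),
    5: (11, 17),
    6: (12, 18),
    19: (20,),
}


def rejection_basis(claim: int) -> int:
    """Return the method claim whose analysis the rejection of `claim` relies on."""
    for method_claim, counterparts in CLAIM_FAMILIES.items():
        if claim == method_claim or claim in counterparts:
            return method_claim
    raise ValueError(f"claim {claim} is not addressed in this action")


if __name__ == "__main__":
    for claim in (1, 13, 16, 20):
        print(f"claim {claim} -> see analysis of claim {rejection_basis(claim)}")
```

A response that amends claim 1, for example, would also need to carry the change into claims 7 and 13, since their rejections rest on the same analysis.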
Conclusion
THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHMED TAHA whose telephone number is (571)272-6805. The examiner can normally be reached 8:30 am - 5 pm, Mon - Fri. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU can be reached at (571)272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. 
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
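The reply periods described in the Conclusion can be worked out with simple calendar arithmetic. The sketch below assumes the mailing date is the Feb 12, 2026 final rejection listed in the prosecution timeline; `add_months` is a hypothetical helper, and the sketch does not handle weekend or federal-holiday carry-over, extension fees, or the advisory-action exception, so any real deadline should be confirmed against the official record.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Assumed mailing date, taken from the prosecution timeline entry for the final rejection.
MAILING_DATE = date(2026, 2, 12)

two_month_mark = add_months(MAILING_DATE, 2)        # first-reply window mentioned in the action
shortened_period_end = add_months(MAILING_DATE, 3)  # THREE MONTHS shortened statutory period
statutory_limit = add_months(MAILING_DATE, 6)       # SIX MONTHS outer limit

if __name__ == "__main__":
    print("Two-month mark:          ", two_month_mark.isoformat())
    print("Shortened period expires:", shortened_period_end.isoformat())
    print("Six-month outer limit:   ", statutory_limit.isoformat())
```

Under that assumed mailing date, the three-month period ends on 2026-05-12 and the six-month outer limit falls on 2026-08-12.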

/AHMED TAHA/Examiner, Art Unit 2613      


/XIAO M WU/Supervisory Patent Examiner, Art Unit 2613


Prosecution Timeline


Jun 30, 2025

Non-Final Rejection mailed — §102

Oct 21, 2025

Applicant Interview (Telephonic)

Oct 22, 2025

Examiner Interview Summary

Oct 30, 2025

Response Filed

Feb 12, 2026

Final Rejection mailed — §102

Apr 09, 2026

Response after Non-Final Action

May 01, 2026

Request for Continued Examination

May 06, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/411,475

Patent 12565101

WINDSHIELD AND VISIBILITY IMPROVEMENTS FOR DRIVERS IN ADVERSE WEATHER AND LIGHTING CONDITIONS

2y 1m to grant Granted Mar 03, 2026

18/143,708

Patent 12561880

AUGMENTED REALITY TATTOO

2y 9m to grant Granted Feb 24, 2026

Study what changed to get past this examiner. Based on 2 most recent grants.


Prosecution Projections

2-3

Expected OA Rounds

67%

Grant Probability

99%

With Interview (+60.0%)

2y 4m (~0m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 9 resolved cases by this examiner. Grant probability derived from career allowance rate.
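The projection figures can be reproduced from the counts shown earlier on this page. The sketch below assumes the interview lift is added in percentage points and that the displayed value is capped at 99%; both are guesses that happen to match the figures shown, not a documented formula, and with only 9 resolved cases the estimate is statistically thin.

```python
# Counts shown on the page: 6 granted out of 9 resolved cases.
GRANTED = 6
RESOLVED = 9

# Interview lift as displayed, treated here as additive percentage points (assumption).
INTERVIEW_LIFT_POINTS = 60.0

# Assumed display cap; the page shows 99% rather than a value above 100%.
DISPLAY_CAP = 99.0

career_allowance_rate = 100.0 * GRANTED / RESOLVED
with_interview = min(career_allowance_rate + INTERVIEW_LIFT_POINTS, DISPLAY_CAP)

if __name__ == "__main__":
    print(f"Grant probability: {career_allowance_rate:.0f}%")
    print(f"With interview:    {with_interview:.0f}%")
```

Run as written, this prints 67% and 99%, matching the Grant Probability and With Interview figures above.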