Prosecution Insights
Last updated: April 19, 2026
Application No. 18/320,129

METHODS FOR DUBBING AUDIO-VIDEO MEDIA FILES

Status: Non-Final Office Action (§103)
Filed: May 18, 2023
Examiner: LEE, JANGWOEN
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Pylon AI Inc.
OA Round: 3 (Non-Final)

Grant Probability: 82% (Favorable)
Expected OA Rounds: 3-4
Median Time to Grant: 2y 11m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 82% (above average; 36 granted / 44 resolved; +19.8% vs TC avg)
Interview Lift: +24.2% (strong; resolved cases with interview vs. without)
Avg Prosecution: 2y 11m typical timeline; 23 applications currently pending
Career History: 67 total applications across all art units
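The headline figures can be reproduced from the counts shown above. A minimal check, assuming the "+19.8% vs TC avg" delta is expressed in percentage points:

```python
# Recompute the headline allow-rate figures from the counts shown above.
granted, resolved = 36, 44

allow_rate = granted / resolved * 100          # 81.8%, displayed as 82%
tc_delta_pts = 19.8                            # "+19.8% vs TC avg", read as percentage points
implied_tc_avg = allow_rate - tc_delta_pts     # ~62% implied Tech Center average

print(f"Career allow rate: {allow_rate:.1f}%")
print(f"Implied TC average: {implied_tc_avg:.1f}%")
```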

Statute-Specific Performance

§101: 26.5% (-13.5% vs TC avg)
§103: 54.6% (+14.6% vs TC avg)
§102: 11.0% (-29.0% vs TC avg)
§112: 4.1% (-35.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 44 resolved cases.
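Assuming each delta is the examiner's rate minus the estimated Tech Center baseline, in percentage points, the baseline can be backed out per statute (it works out to 40.0% in each row):

```python
# Back out the Tech Center average estimate from each statute's rate and delta,
# assuming delta = examiner rate - TC average (percentage points).
stats = {"101": (26.5, -13.5), "103": (54.6, +14.6), "102": (11.0, -29.0), "112": (4.1, -35.9)}

for statute, (rate, delta) in stats.items():
    print(f"§{statute}: implied TC avg = {rate - delta:.1f}%")   # 40.0% in every case
```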

Office Action

Non-Final Rejection — §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/28/2026 has been entered.

Claims 21-39 are pending and have been examined. Claims 21, 36 and 39 are independent. Claims 1-20 are cancelled. Claims 21-39 are new.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21-23, 28-31, 33 and 35-39 are rejected under 35 U.S.C. 103 as being unpatentable over Ingel et al. (U.S. Pub. No. 2020/0213680, hereinafter, Ingel) in view of Gabrys et al. (U.S. Pub. No. 2023/0260502, hereinafter, Gabrys).

Regarding Claim 21, Ingel discloses a computer-implemented method for generating a dubbed audio-video presentation, the method comprising (Ingel, Fig.6, par [109], "…methods and systems for dubbing a media stream with voices generated using artificial intelligence technology..."; par [112], "…The term "media stream" refers to digital data that includes video frames, audio frames, multimedia, or any combination thereof..."):

(a) accessing a learning engine trained to produce synthesized audio based on audio samples representing different speaker vocal characteristics (Ingel, Fig.4B, par [163], "…an artificial neural network (such as recurrent neural network, a long short-term memory neural network, a deep neural network, etc.) may be configured to transform speech and/or textual information and/or other representations of speech…", par [171], "…step 444 may use the trained machine learning model to generate the speech data (or audio data including the speech data) based on the voice profile and/or on the desired voice characteristics and/or on the desired speech characteristics…"; par [164-165], "…step 442 may comprise receiving voice profiles…step 442 may select the voice profiles from a plurality of alternative voice profiles…");

(d) synchronizing the at least one dubbed audio track with frames of the audio-video production to create the dubbed audio-video presentation (Fig.6, par [191], "…artificial dubbing system 100) configured to generate artificial voice for a media stream. In this example, the media stream includes an audio stream and a video stream"; par [195], "…prosody analysis unit 670 may suggests adjustments that should be made to the final dubbed voice, e.g., the speed of speech (based on the length of the resulting audio from the local language voice audio segment compared to the timing mentioned in the transcript file and the next transcript's timing that should not be overlapped, and/or the actual timing of the original voice in the video's audio track, etc.)..."; par [196], "…revoicing unit 680 may merge the new created audio track into the original movie to create revoiced media stream 150").

Ingel discloses the artificial dubbing system for revoicing the media stream to target languages (Ingel, Fig. 4A, par [135], "…media receipt module 402…Transcript processing module 404…Voice profile determination module 406…Voice generation module 408 may generate a revoiced media stream in second language based on the determined voice profile..."), but does not explicitly disclose the voice synthesis (or dubbing) system adapting vocal characteristics of different speakers.

However, Gabrys, in the analogous field of endeavor, discloses (b) transforming at least a portion of a script for an audio-video production into a phonetic representation (Gabrys, Fig.3, par [039], "…The trained TTS component 180 and voice modifier component 190 may be used to convert text data 305 to synthesized speech approximating the target voice… The TTS component 180 may generate synthesized speech in the form of a synthesized spectrogram data 182…"; Fig.1, par [031], "…The multi-speaker dataset 114 may include examples of speech of various different speakers…"); (c) producing, from the phonetic representation, at least one dubbed audio track using the learning engine (Figs.3,7, par [077], "…The voice modifier component 190 may receive the output audio data 790 and modify it to generate modified output audio data 795 (e.g., modified spectrogram data 184) having voice characteristics different from the intrinsic voice(s) of the TTS component 180 and/or approximating a desired target voice, such as a celebrity or particular user of the system 600…").

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the artificial dubbing system of Ingel with a TTS system incorporating the voice modifying model of Gabrys with a reasonable expectation of success to create customized synthesized speech for many different potential target voices when only a limited number of examples are available of the speech whose characteristics are to be imitated (Gabrys, paras [020-021]).
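Purely for orientation, the following is a minimal, hypothetical Python sketch of the four claim-21 steps as the rejection characterizes them. None of the names or stand-in implementations below come from the application or from Ingel, Gabrys, or McCartney.

```python
# Illustrative-only sketch of the claim-21 pipeline as characterized above.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    speaker_profile: str   # voice the segment should be rendered in
    start_s: float         # intended position in the video timeline
    end_s: float

class VoiceEngine:
    """Stand-in for a learning engine trained on speaker voice samples (step (a))."""
    def __init__(self, voice_samples: dict):
        self.voice_samples = voice_samples

    def synthesize(self, phonemes: list, voice: str) -> str:
        # A real model would emit audio; here we just tag the phoneme string.
        return f"<{voice}:{''.join(phonemes)}>"

def to_phonemes(text: str) -> list:
    # Stand-in for grapheme-to-phoneme conversion (step (b)).
    return list(text.lower().replace(" ", "|"))

def dub(segments: list, voice_samples: dict) -> list:
    engine = VoiceEngine(voice_samples)                            # (a) access the engine
    timeline = []
    for seg in segments:
        phonemes = to_phonemes(seg.text)                           # (b) phonetic representation
        audio = engine.synthesize(phonemes, seg.speaker_profile)   # (c) dubbed audio track
        timeline.append((seg.start_s, seg.end_s, audio))           # (d) align with video frames
    return timeline

print(dub([Segment("Hello there", "narrator", 0.0, 1.2)], {"narrator": ["sample.wav"]}))
```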
Regarding Claim 22, Ingel in view of Gabrys discloses the method of claim 21. Gabrys further discloses applying a linguistic or audio effect to the phonetic representation, wherein the effect comprises at least one of: (a) modifying word order to achieve a style of speech; (b) altering pitch or pacing; or (c) changing accent or prosody (Gabrys, Fig.7, par [084], "…the TTS front end 716 may then perform linguistic prosody generation where the phonetic units are annotated with desired prosodic characteristics, also called acoustic features…Such acoustic features may include syllable-level features, word-level features, emotion, speaker, accent, language, pitch, energy, duration…").

Regarding Claim 23, Ingel in view of Gabrys discloses the method of claim 22, further comprising receiving metadata from an existing audio-video file, wherein the metadata indicates at least one of speaker emotion or scene mood, and using the metadata to select the linguistic or audio effect applied in step (d) of claim 21 (Gabrys, par [081], "…The TTS front end 716 may also process other input data 715, such as text tags or text metadata, that may indicate, for example, how specific words should be pronounced"; Ingel, paras [353, 362], "…using the determined voice profile, the translated transcript, and the metadata information to artificially generate a revoiced media stream…The metadata information may include the desired level of intonation, pitch, accent, and more for each individual…").

Regarding Claim 28, Ingel in view of Gabrys discloses the method of claim 21. Ingel further discloses providing at least one user-selectable sharing option that publishes the dubbed audio-video presentation to one or more platforms (Ingel, par [125], "…graphical user interface instructions 256 may include a software program that facilitates user 170 to capture a media stream, select a target language, provide user input, and so on..."; par [116], "…revoiced media stream 150 may be played on a communications device 160. The term "communications device" is intended to include all possible types of devices capable of receiving and playing different types of media streams…").

Regarding Claim 29, Ingel in view of Gabrys discloses the method of claim 21. Ingel further discloses storing the dubbed audio-video presentation on a network-accessible server, thereby allowing the dubbed audio-video presentation to be assessed or edited later by a user (Ingel, Fig.3, par [127], "…an example revoicing unit 130 associated with artificial dubbing system 100 may include server 133 and data structure 136...a communications interface 350 for transmitting revoiced media streams 150 to communications device 160…").

Regarding Claim 30, Ingel in view of Gabrys discloses the method of claim 21. Ingel further discloses converting the dubbed audio-video presentation into multiple rendering formats, each format adapted for a different publication channel or playback (Ingel, Fig.1B, par [111], "…revoicing unit 130 may generate revoiced media streams 150 in different languages to be played by a plurality of communications devices 160 (e.g., 160A, 160B, and 160C) associated with different users…"; par [116], "…revoiced media stream 150 may be played on a communications device 160. The term "communications device" is intended to include all possible types of devices capable of receiving and playing different types of media streams…").
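Claims 22-23 (and claim 31 below) tie the applied effect to metadata such as speaker emotion or scene mood. As a hedged illustration only, a toy mapping from such metadata to prosody adjustments might look like the following; the effect names and values are invented for the example and are not drawn from the application or the cited references.

```python
# Hypothetical metadata-driven effect selection: scene metadata (emotion, mood)
# picks prosody adjustments (pitch shift, speaking rate). Values are illustrative.
EFFECTS = {
    "excited": {"pitch_shift": +2.0, "rate": 1.15},
    "somber":  {"pitch_shift": -1.5, "rate": 0.90},
    "neutral": {"pitch_shift":  0.0, "rate": 1.00},
}

def select_effect(metadata: dict) -> dict:
    """Choose a linguistic/audio effect from speaker-emotion or scene-mood metadata."""
    key = metadata.get("speaker_emotion") or metadata.get("scene_mood") or "neutral"
    return EFFECTS.get(key, EFFECTS["neutral"])

print(select_effect({"scene_mood": "somber"}))   # {'pitch_shift': -1.5, 'rate': 0.9}
```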
Regarding Claim 31, Ingel in view of Gabrys discloses the method of claim 21. Gabrys further discloses analyzing metadata extracted from the audio-video production, wherein the metadata includes at least one of speaker emotion, facial expression, mood indicators (Gabrys, par [081], "…The TTS front end 716 may also process other input data 715, such as text tags or text metadata, that may indicate, for example, how specific words should be pronounced…"), and automatically applying an audio or linguistic transformation corresponding to the metadata (Ingel, paras [353, 362], "…using the determined voice profile, the translated transcript, and the metadata information to artificially generate a revoiced media stream…The metadata information may include the desired level of intonation, pitch, accent, and more for each individual…").

Regarding Claim 33, Ingel in view of Gabrys discloses the method of claim 21. Ingel further discloses hosting at least part of the method as a remote service, wherein users interact with a web-based or network-accessible interface to edit, generate, or store the dubbed audio-video presentation (Ingel, Fig.3, par [128], "…revoicing unit 130 may be configured as a distributed computer system including multiple servers, server farms, clouds, or computers…").

Regarding Claim 35, Ingel in view of Gabrys discloses the method of claim 21. Ingel further discloses encoding the dubbed audio-video presentation into a file format for sharing via a media or social platform (Ingel, Fig.6, par [191], "…the operation of an example system 600 (e.g., artificial dubbing system 100) configured to generate artificial voice for a media stream. In this example, the media stream includes an audio stream and a video stream (e.g., YouTube, Netflix)…").

Claim 36 is a system claim with limitations similar to the limitations of Claim 21 and is rejected under similar rationale. Additionally, Ingel discloses the system comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors (Ingel, Fig.3, par [127], "…a revoicing unit 130 associated with artificial dubbing system 100…"; par [129], "…Processor 310 may be one or more processing devices configured to perform functions of the disclosed methods…"; par [130], "…Memory 320…"). … Rationale for combination is similar to that provided for Claim 21.

Claim 37 is a system claim with limitations similar to the limitations of Claim 24 and is rejected under similar rationale.

Claim 38 is a non-transitory computer-readable medium claim with limitations similar to the limitations of Claim 21 and is rejected under similar rationale. Additionally, Ingel discloses a non-transitory computer-readable medium storing instructions that, when executed by one or more processors (Ingel, par [206], "…The non-transitory computer-readable storage media may store program instructions that when executed by a processing device of the disclosed system…"; par [129], "…Processor 310 may be one or more processing devices configured to perform functions of the disclosed methods…"). … Rationale for combination is similar to that provided for Claim 21.

Claim 39 is a non-transitory computer-readable medium claim with limitations similar to the limitations of Claims 27, 31 and 32 and is rejected under similar rationale.

Claims 24-27, 32 and 34 are rejected under 35 U.S.C. 103 as being unpatentable over Ingel in view of Gabrys further in view of McCartney (U.S. Pub. No. 2023/0039248, hereinafter, McCartney).

Regarding Claim 24, Ingel in view of Gabrys discloses the method of claim 23. Ingel further discloses (a) assign a speaker profile corresponding to a desired vocal characteristic for each segment of the script (Ingel, Fig.6, par [193], "…voice generation unit 655 may generate a second revoiced audio stream 665 in the target language based on target transcript 645. Second revoiced audio stream 665 is artificially generated using the updated voice profile 625, video data 630, and metadata transcript information 650...") and (c) adjust the speaker profile or the linguistic or audio effect of the dubbed audio track (Ingel, par [419], "…such additional controls 2740 may include controls that enable the user to selectively control the voices of one or more individuals in the video (for example, by controlling pitch, intensity, gender, accent, and so forth)…").

Ingel discloses the graphical user interface to provide user input (par [125], "…graphical user interface instructions 256 may include a software program that facilitates user 170 to capture a media stream, select a target language, provide user input, and so on..."), but neither Ingel nor Gabrys explicitly discloses a timeline editor that displays segments of the script and the limitation (b).

However, McCartney, in the analogous field of endeavor, discloses providing a timeline editor that displays segments of the script and enables a user to (McCartney, Fig.1, par [028], "…processing system 102 for performing the methods described herein (i.e., Assisted Translation and Lip Matching for Voice Dubbing)...a frame editing utility 118 for adding or removing frames from a selected video sample..."; par [029], "…the text-to-speech synthesizer 114 will be configured to generate not only a synthesized audio clip comprising synthesized speech corresponding to input text (e.g., a word, sentence, sequence of text), but also an audio spectrogram of the synthesized audio clip, and data regarding the timing (e.g., start and end time, and/or duration) of each phoneme in the synthesized speech"; par [083], "…the processing system may be configured to correlate the voice dubbing and the video clip such that they each begin at the same time"; Fig.11, par [083], "…the text-to-speech synthesizer may be configured to provide a start and end time, and/or a duration, for each spoken word or phoneme of the candidate translation. Likewise, where the voice dubbing is generated by a human actor, the start and end times may be hand-coded (e.g., by a human adapter)..."): (b) preview the synthesized audio in synchronization with frames of the audio-video production (Fig.19, par [132], "…the processing system initially modified the voice dubbing and/or video frame(s), the processing system may be configured to allow a human user to make further edits to the modified voice dubbing and/or video frames (e.g., to fine-tune their timing based on what the user feels looks most realistic)...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the artificial voice dubbing system with the voice modifying model, as taught by Ingel in view of Gabrys, with the audio editor of McCartney for further fine-tuning the timing of the newly recorded voice dubbing, with a reasonable expectation of success, to improve lip matching and to reduce the costs and time associated with voice dubbing (McCartney, Background, par [002]).

Regarding Claim 25, Ingel in view of Gabrys further in view of McCartney discloses the method of claim 24. Gabrys further discloses extracting an existing soundtrack from the audio-video production and separating speech components from background audio to produce a text transcription, wherein the text transcription is used as part of the script for generating the dubbed audio-video presentation (Gabrys, Fig.8, par [100], "…the ASR component 650 may generate transcripts of recorded speech in one or more of the voice dataset 112, the multi-speaker dataset 114, and/or the target voice dataset 116…").

Regarding Claim 26, Ingel in view of Gabrys further in view of McCartney discloses the method of claim 25. Ingel further discloses accessing, in the timeline editor, a stored selection of emotional or style of speech to apply to the text transcription (Ingel, par [125], "…graphical user interface instructions 256 may include a software program that facilitates user 170 to capture a media stream, select a target language, provide user input, and so on..."; par [145], "…database access module 412 may cooperate with database 414 to retrieve voice samples of associated media streams, transcripts, voice profiles, and more..."), and generating synthesized audio that reflects tone and voice (Ingel, Fig.4A, par [135], "…Voice generation module 408 may generate a revoiced media stream based on the determined voice profile…").

Regarding Claim 27, Ingel in view of Gabrys further in view of McCartney discloses the method of claim 26. Ingel further discloses outputting the dubbed audio-video presentation in at least one rendering format for sharing via a media or social platform (Ingel, Fig.6, par [191], "…the operation of an example system 600 (e.g., artificial dubbing system 100) configured to generate artificial voice for a media stream. In this example, the media stream includes an audio stream and a video stream (e.g., YouTube, Netflix)…").
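For the timeline-editor limitations discussed in claims 24-27 (and 32/34 below), a hypothetical segment model with a pre-render preview check is sketched here. It is illustrative only, under assumed data structures, and is not the structure used by the application or by McCartney.

```python
# Hypothetical timeline-editor data model: each script segment carries a speaker
# profile and optional user prompts, and a preview pass flags dubbed audio that
# overruns its slot before rendering against the video.
from dataclasses import dataclass, field

@dataclass
class TimelineSegment:
    text: str
    start_s: float
    end_s: float
    speaker_profile: str = "default"
    prompts: list = field(default_factory=list)   # e.g. "slower pacing", "comedic"

def preview_issues(segments: list, dubbed_durations: list) -> list:
    issues = []
    for seg, dur in zip(segments, dubbed_durations):
        slot = seg.end_s - seg.start_s
        if dur > slot:
            issues.append(f"'{seg.text[:20]}…' runs {dur - slot:.2f}s past its slot")
    return issues

segs = [TimelineSegment("Bonjour tout le monde", 0.0, 1.5, "narrator", ["warm tone"])]
print(preview_issues(segs, dubbed_durations=[1.8]))   # flags a 0.30s overrun
```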
Regarding Claim 32, Ingel in view of Gabrys discloses the method of claim 21. McCartney further discloses generating a preview of the dubbed audio track during editing, enabling playback of any applied speaker profiles or filters before presentation (McCartney, Fig.1, par [028], "…processing system 102 for performing the methods described herein (i.e., Assisted Translation and Lip Matching for Voice Dubbing)...a frame editing utility 118 for adding or removing frames from a selected video sample..."; par [029], "…the text-to-speech synthesizer 114 will be configured to generate not only a synthesized audio clip comprising synthesized speech corresponding to input text (e.g., a word, sentence, sequence of text), but also an audio spectrogram of the synthesized audio clip, and data regarding the timing (e.g., start and end time, and/or duration) of each phoneme in the synthesized speech"; par [083], "…the processing system may be configured to correlate the voice dubbing and the video clip such that they each begin at the same time"; Fig.19, par [132], "…the processing system initially modified the voice dubbing and/or video frame(s), the processing system may be configured to allow a human user to make further edits to the modified voice dubbing and/or video frames (e.g., to fine-tune their timing based on what the user feels looks most realistic)..."). Rationale for combination is similar to that provided for Claim 24.

Regarding Claim 34, Ingel in view of Gabrys discloses the method of claim 21. Ingel discloses the graphical user interface to provide user input (par [125], "…graphical user interface instructions 256 may include a software program that facilitates user 170 to capture a media stream, select a target language, provide user input, and so on..."), but neither Ingel nor Gabrys explicitly discloses a timeline editor that displays segments of the script and the limitation (b). McCartney discloses receiving user-specified prompts through a timeline editor, each prompt instructing an application of specified vocal characteristics, comedic effect, or pacing to a particular portion of the script (McCartney, Fig.1, par [028], "…processing system 102 for performing the methods described herein (i.e., Assisted Translation and Lip Matching for Voice Dubbing)...a frame editing utility 118 for adding or removing frames from a selected video sample..."; par [029], "…the text-to-speech synthesizer 114 will be configured to generate not only a synthesized audio clip comprising synthesized speech corresponding to input text (e.g., a word, sentence, sequence of text), but also an audio spectrogram of the synthesized audio clip, and data regarding the timing (e.g., start and end time, and/or duration) of each phoneme in the synthesized speech"). Rationale for combination is similar to that provided for Claim 24.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE whose telephone number is (703)756-5597. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, BHAVESH MEHTA, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JANGWOEN LEE/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656

Prosecution Timeline

May 18, 2023: Application Filed
May 02, 2025: Non-Final Rejection (§103)
Aug 05, 2025: Response Filed
Nov 05, 2025: Final Rejection (§103)
Jan 28, 2026: Request for Continued Examination
Jan 30, 2026: Response after Non-Final Action
Feb 04, 2026: Non-Final Rejection (§103, current)
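A quick check on the dates above: pendency from filing to the current non-final rejection can be computed directly, assuming plain calendar-day arithmetic.

```python
# Pendency from filing (May 18, 2023) to the current non-final rejection (Feb 04, 2026),
# compared against the examiner's 2y 11m median time to grant.
from datetime import date

filed = date(2023, 5, 18)
current_oa = date(2026, 2, 4)

pendency_days = (current_oa - filed).days
print(f"Pendency to date: {pendency_days} days (~{pendency_days / 30.44:.1f} months)")
# Roughly 2y 8-9m elapsed so far versus the 2y 11m median time to grant.
```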

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597432
HUM NOISE DETECTION AND REMOVAL FOR SPEECH AND MUSIC RECORDINGS
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12586571
EFFICIENT SPEECH TO SPIKES CONVERSION PIPELINE FOR A SPIKING NEURAL NETWORK
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12573381
SPEECH RECOGNITION METHOD AND APPARATUS, STORAGE MEDIUM, AND ELECTRONIC DEVICE
Granted Mar 10, 2026 (2y 5m to grant)
Patent 12567430
METHOD AND DEVICE FOR IMPROVING DIALOGUE INTELLIGIBILITY DURING PLAYBACK OF AUDIO DATA
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12566930
CONDITIONING OF PRODUCTIVITY APPLICATION FILE CONTENT FOR INGESTION BY AN ARTIFICIAL INTELLIGENCE MODEL
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82%
With Interview: 99% (+24.2%)
Median Time to Grant: 2y 11m
PTA Risk: High
Based on 44 resolved cases by this examiner. Grant probability derived from career allow rate.
