Prosecution Insights
Last updated: April 19, 2026
Application No. 18/641,535

Dialogue Localization

Final Rejection §103
Filed: Apr 22, 2024
Examiner: MCLEAN, IAN SCOTT
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Sony Interactive Entertainment Europe Limited
OA Round: 2 (Final)
Grant Probability: 43% (Moderate)
OA Rounds: 3-4
To Grant: 3y 2m
With Interview: 74%

Examiner Intelligence

Career Allow Rate: 43% (grants 43% of resolved cases; 19 granted / 44 resolved; -18.8% vs TC avg)
Interview Lift: +31.0% (strong; resolved cases with interview vs without)
Avg Prosecution: 3y 2m typical timeline; 40 currently pending
Total Applications: 84 across all art units

Statute-Specific Performance

§101: 9.9% (-30.1% vs TC avg)
§103: 60.0% (+20.0% vs TC avg)
§102: 27.2% (-12.8% vs TC avg)
§112: 2.1% (-37.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 44 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

2. Applicant's arguments filed 01/09/2026 have been fully considered but they are not persuasive. Applicant argues that the previously cited art does not overcome the newly added claims 21-39. The examiner respectfully disagrees. In total, the arguments state: “The independent claims have been amended to recite several new, previously unexamined features, including the new features of ‘obtaining translations of the script in several different languages,’ ‘obtaining, for each of the different languages, a text-to-speech output associated with the language,’ and ‘allocating the duration of the longest one of the text-to-speech outputs as the amount of time for the script that is to be spoken in the animated scene.’ Applicant argues that at least these new features are not disclosed or taught by the applied combination of references, and submits that the introduction of these features merits further search and/or further examination. Reconsideration, further examination, and withdrawal of all rejections are requested.”

The Examiner respectfully disagrees. Claims 21-39 do not recite new features that were not already taught by the cited references Mathur (US 11,172,266), Roche (US 10,732,708), Gao (US 12,039,969), Glatt (US 2010/0057232) and Raitt (US 2012/0021828). In total, the proposed combination discloses the amended claims, which still center on the same overall concept of determining a speech-related duration across different language outputs and using that duration to control a corresponding animated presentation.
As already mapped, Mathur teaches determining timing/duration based on spoken content and translated text; Roche teaches generating an animated scene responsive to spoken content; Gao teaches multilingual text-to-speech generation for different output languages; and Glatt teaches selecting the longest duration among multiple candidate audio outputs as the operative time basis. The newly added dependent claims merely elaborate those same concepts by reciting audio recordings, extending shorter outputs, filler words and pauses, animated speaking scenes, video game implementations and their corresponding systems. These additions do not change the basic framework already disclosed in the combination. Claims 21-39 simply recast the same subject matter as disclosed before. Accordingly, the newly added claims 21-39 do not overcome the prior art and are rejected.

Claim Rejections - 35 USC § 103

3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

4. Claims 1, 21-25, 28-34 and 37-39 are rejected under 35 U.S.C. 103 as being unpatentable over Mathur (US 11,172,266), in view of Roche (US 10,732,708), further in view of Gao (US 12,039,969) and further in view of Glatt (US 2010/0057232).

Regarding Claim 1: Mathur discloses a computer-implemented method comprising (Mathur: ¶[0062] discloses a computer-implemented process): determining an amount of time to allocate for a script that is to be spoken in an animated scene (Mathur: ¶[0041] explicitly discloses determining an amount of time based on spoken content audio and text duration), comprising: obtaining the script that is written in an initial language; and obtaining translations of the script in several different languages (Mathur: ¶[0025], ¶[0037] and ¶[0039] disclose obtaining text corresponding to spoken audio in an initial language and translating that text into another language, supporting obtaining translations of the script/text in different languages).

Mathur does not explicitly disclose an animated scene. However, Roche discloses an animated scene (Roche: Col 31:62 – Col 32:28 discloses an animation service that generates animations/scenes responsive to spoken words and real-world scene content.
For example, Roche discloses that the animation service generates the second animation sequences to include “animation gestures toward the object” when the avatar speaks words associated with objects or people in the scene and the avatar looks or points toward that person in the scene; therefore, the system teaches generating an animated scene/animation sequence driven by the words to be spoken and their timing within a scene).

Mathur in view of Roche are combinable because they disclose subject matter that is pertinent to each other. Mathur already teaches determining scripted audio duration; Roche teaches generating and controlling an animated scene in response to spoken content and scene context. It would have been obvious to one of ordinary skill in the art to modify Mathur to include an animated scene whose timing is consistent with the overall spoken script duration across languages. The suggestion for doing this is “there is a never ending cycle of [developers] having to update the VR/AR application,” as disclosed in the background of Roche. In other words, rather than having to reanimate scenes, it is efficient to have a system that can dynamically animate.

Mathur and Roche do not explicitly disclose: obtaining, for each of the different languages, a text-to-speech output associated with the language. However, Gao discloses obtaining, for each of the different languages, a text-to-speech output associated with the language (Gao: ¶[0031]-[0039] receives a first speech signal in a second language and outputs a second speech signal comprising a first language; ¶[0126] discloses taking target-language text and speaker characteristics to output text-to-speech in the second language; ¶[0065] explicitly supports multiple output languages by generating a third text signal comprising a third language and generating a third speech signal.
Additionally, ¶[0235]-[0252] support this clearly: the model is trained to work with multiple languages, and each speaker can have their own speaker vector carrying speaker characteristics. There is a separate language vector for the chosen output language; these are combined and fed into the speech synthesis system, so the same underlying text-to-speech pipeline may be used to output English, Spanish, French, etc., depending on which language vector is selected).

Mathur and Roche in view of Gao are combinable because they are from the same field of endeavor of audio/visual presentation timing; Mathur discloses determining and adjusting caption durations based on audio duration (including translated audio), and Gao discloses generating translated speech audio from text with timing/alignment information for the target language. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Gao's multilingual text-to-speech output. The suggestion/motivation for doing so is “to improve such spoken language translation systems,” as disclosed in Gao's background of invention.

Mathur, Roche and Gao do not explicitly disclose: determining a duration of a longest one of the text-to-speech outputs; and allocating the duration of the longest one of the text-to-speech outputs as the amount of time for the script that is to be spoken in the animated scene. However, Glatt discloses determining a duration of a longest one of the text-to-speech outputs (Glatt: ¶[0042] teaches determining durations among multiple candidate audio outputs and identifying the longest one. Specifically, Glatt in ¶[0070] discloses a database having information on a plurality of different predefined sequences of audio samples, with sequence lengths of, e.g., 15 minutes, 10 minutes, 5 minutes, 20 seconds, etc., therefore expressly disclosing multiple audio outputs with different respective durations.
Finally, Glatt ¶[0095] discloses the ‘longest one’ by disclosing that the longest sequence in the plurality of sequences which has a duration smaller than the determined length is searched); and allocating the duration of the longest one of the text-to-speech outputs as the amount of time for the script that is to be spoken in the animated scene (Glatt: ¶[0044] teaches using the duration of the selected longest output as the operative time allocation for constructing the final timed output. After selecting the longest-duration sequence, the system builds the final timed output; ¶[0095] further discloses using the longest duration as the primary temporal building block).

Mathur, Roche and Gao in view of Glatt are combinable because they are all directed to controlling media timing based on duration information; e.g., Mathur, Roche and Gao together disclose determining durations of language-specific audio segments (original and translated), while Glatt discloses selecting and arranging audio sequences based on their respective durations to match a target overall length using the longest suitable segments first. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Glatt's duration-based selection strategy to the per-language audio durations obtained from Mathur and Gao, so that the required scene length is set to accommodate the longest spoken version while still matching the overall video or scene constraints. The suggestion/motivation for doing so is to “provide an improved concept for audio signal generation,” as disclosed in ¶[0011] of Glatt.
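Stripped of the legal mapping, the claim-1 logic that the references are being combined to reach is a small algorithm: translate the script, synthesize each translation, and allocate the longest synthesized duration as the scene time. A minimal sketch follows; the translate/TTS callables and per-language speaking rates are invented stand-ins for illustration, not any party's actual system:

```python
def allocate_scene_time(script, languages, translate, tts_duration):
    """Return (language, duration) of the longest per-language
    text-to-speech output, as recited in claim 1.

    `translate` and `tts_duration` are caller-supplied callables,
    hypothetical stand-ins for a machine-translation service and a
    text-to-speech engine."""
    durations = {}
    for lang in languages:
        text = translate(script, lang)           # per-language script
        durations[lang] = tts_duration(text, lang)
    longest = max(durations, key=durations.get)  # pick the longest output
    return longest, durations[longest]

# Toy demonstration with fabricated per-language speaking rates
# (illustrative only):
rates = {"en": 15.0, "de": 12.5, "es": 14.0}     # characters per second
lang, seconds = allocate_scene_time(
    "Hello there, adventurer!",
    ["en", "de", "es"],
    translate=lambda s, lang: s,                 # identity "translation"
    tts_duration=lambda s, lang: len(s) / rates[lang],
)
# The slowest language's duration is the one allocated to the scene.
```

The scene length is then fixed to that single duration, so every localized voice track fits inside the same animation timing.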
Regarding Claim 21: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 1, wherein obtaining, for each of the different languages, a text-to-speech output associated with the language comprises generating, for each of the different languages, an audio recording of the script in the language (Gao: ¶[0121]-[0124] discloses generating/synthesizing a speech signal in the target language(s)). Mathur in view of Gao are combinable because they are from the same field of endeavor of audio/visual presentation timing; Mathur discloses determining and adjusting caption durations based on audio duration (including translated audio), and Gao discloses generating translated speech audio from text with timing/alignment information for the target language. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Gao’s multilingual text-to-speech output. The suggestion/motivation for doing so is “to improve such spoken language translation systems” as disclosed in Gao’s background of invention. 
Regarding Claim 22: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 1, further comprising: determining one of the text-to-speech outputs whose duration is shorter than the duration of the longest one of the text-to-speech outputs (Glatt: ¶[0042] discloses selecting or constructing audio so that it matches a target or maximum desired length; Gao: ¶[0220] directly teaches that some language outputs are shorter (i.e., faster) and that the system determines a longest output and aligns the rest, therefore mapping to outputs whose duration is shorter than the duration of the longest one of the text-to-speech outputs); generating an extended script for the text-to-speech output whose duration is shorter than the duration of the longest one of the text-to-speech outputs (Mathur: ¶[0008]-[0009] discloses adjusting speech timing to satisfy synchronization constraints, adding pauses, and matching expected timing for the captioning); and generating an extended text-to-speech output using the generated extended script (Gao: ¶[0220] discloses that when multiple utterances have different durations the shorter ones must be extended (i.e., padded), implying the insertion of one or more pauses). Mathur and Glatt are combinable because they are all directed to controlling media timing based on duration information; e.g., Mathur, Roche and Gao together disclose determining durations of language-specific audio segments (original and translated), while Glatt discloses selecting and arranging audio sequences based on their respective durations to match a target overall length using the longest suitable segments first.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Glatt’s duration detection to determine that one output is longer than another, so that the required scene length is set to accommodate the longest spoken version while still matching the overall video or scene constraints. The suggestion/motivation for doing so is “provide an improved concept for audio signal generation” as disclosed in ¶[0011] of Glatt. Mathur in view of Gao are combinable because they are from the same field of endeavor of audio/visual presentation timing; Mathur discloses determining and adjusting caption durations based on audio duration (including translated audio), and Gao discloses generating translated speech audio from text with timing/alignment information for the target language. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Gao’s multilingual text-to-speech output for explicitly determining a text-to-speech output that is shorter than another. The suggestion/motivation for doing so is “to improve such spoken language translation systems” as disclosed in Gao’s background of invention. 
Regarding Claim 23: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 22, wherein generating the extended script for the text-to-speech output whose duration is shorter than the duration of the longest one of the text-to-speech outputs comprises modifying a script of the corresponding shorter text-to-speech output to include one or more additional elements that extend the script (Mathur: ¶[0008] discloses adjusting speech timing to satisfy synchronization constraints, adding pauses, and matching expected timing for the captioning; Gao: ¶[0042] and ¶[0220] disclose that when multiple utterances have different durations the shorter ones must be extended (i.e., padded), implying the insertion of one or more pauses). Mathur in view of Gao are combinable because they are from the same field of endeavor of audio/visual presentation timing; Mathur discloses determining and adjusting caption durations based on audio duration (including translated audio), and Gao discloses generating translated speech audio from text with timing/alignment information for the target language. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Gao's multilingual text-to-speech output for explicitly determining a text-to-speech output that is shorter than another. The suggestion/motivation for doing so is “to improve such spoken language translation systems,” as disclosed in Gao's background of invention.

Regarding Claim 24: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 23, wherein the one or more additional elements comprises one or more filler words or one or more pauses (Mathur: ¶[0008]-[0009] discloses adjusting speech timing to satisfy synchronization constraints, adding pauses, and matching expected timing for the captioning).
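As mapped, claims 22-24 amount to padding the shorter language outputs up to the allocated (longest) duration by appending pauses or filler to their scripts. A rough sketch of that step, with an invented pause token and an assumed per-pause duration (neither is taken from the cited references):

```python
def extend_script(script, duration, target, pause_token="<pause>",
                  pause_seconds=0.5):
    """Append pause tokens to a script whose spoken duration falls
    short of the allocated target (claims 22-24 as mapped above).
    The token and its 0.5 s cost are illustrative assumptions."""
    while duration < target:
        script += " " + pause_token
        duration += pause_seconds
    return script, duration

# A Spanish output estimated at 1.71 s is padded toward a 1.92 s target.
extended, new_duration = extend_script(
    "Hola, aventurero.", duration=1.71, target=1.92)
```

In a real pipeline the padded script would be re-synthesized, so the pause markers become actual silence (or filler words) in the regenerated audio.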
Regarding Claim 25: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 1, wherein, in response to allocating the duration of the longest one of the text-to-speech outputs, the computer-implemented method further comprises generating the animated scene of an individual speaking the longest one of the text-to-speech outputs (Roche: Col 31:62 – Col 32:28 discloses an animation service that generates animation sequences responsive to spoken words and scene content, including an avatar that speaks and gestures within a scene; therefore, once the duration has been allocated based on the selected longest output, Roche teaches generating the corresponding animated scene of the speaking individual. The duration allocation portion remains taught by Glatt ¶[0042], ¶[0044] and ¶[0095], and the language-specific text-to-speech output remains taught by Gao ¶[0126], ¶[0235]-[0252]). Mathur, Roche, Gao and Glatt are combinable because they all relate to controlling media output based on speech and duration information. Mathur and Gao provide multilingual speech generation and duration-related timing information, Glatt provides selection/allocation based on the longest suitable duration, and Roche provides generation of an animated scene driven by spoken output. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to generate Roche's animated speaking scene using the longest selected spoken output so that the animated scene accommodates the allocated speech duration across languages. The suggestion/motivation for doing so is “Developing a VR/AR application, however, can be challenging and time consuming. For example, developers have to create programming code for receiving and processing data received from different input devices, as well as developing programming code for displaying objects (e.g., physical and/or virtual objects) in the environment.
In addition to processing the input received from the different devices, it can be difficult and time consuming for a developer to create a virtual environment. For instance, the developer has to specify the different graphical models to include the environment, as well as define the layout of the environment. Even after creating an application, however, the input devices that are available typically change. As such, there is a never ending cycle of having to update the VR/AR application,” as disclosed in Roche's background of invention. In other words, having to manually reanimate a system, as opposed to a dynamic, time-based system that can reanimate at will, is time consuming and not efficient.

Regarding Claim 28: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 1, wherein, in response to allocating the duration of the longest one of the text-to-speech outputs, the computer-implemented method further comprises obtaining a voice recording in the several different languages of the longest one of the text-to-speech outputs (Mathur: Fig. 2 discloses obtaining a voice recording performed by the human speaker/actor).

Regarding Claim 29: Claim 29 has been analyzed with respect to claim 1 (see rejection above) and is rejected for the same reasons of obviousness used above. It is noted that Mathur discloses a system comprising one or more computers and one or more storage devices storing instructions to cause the operations of claim 29 at least at Fig. 4.

Regarding Claim 30: Claim 30 has been analyzed with respect to claim 21 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 31: Claim 31 has been analyzed with respect to claim 22 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 32: Claim 32 has been analyzed with respect to claim 23 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 33: Claim 33 has been analyzed with respect to claim 24 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 34: Claim 34 has been analyzed with respect to claim 25 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 37: Claim 37 has been analyzed with respect to claim 28 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 38: Claim 38 has been analyzed with respect to claim 1 (see rejection above) and is rejected for the same reasons of obviousness used above. It is noted that Mathur discloses one or more non-transitory computer storage media encoded with computer program instructions that when executed by one or more computers cause the one or more computers to perform the operations of claim 38 at least at ¶[0063].

Regarding Claim 39: Claim 39 has been analyzed with respect to claim 21 (see rejection above) and is rejected for the same reasons of obviousness used above.

5. Claims 26-27 and 35-36 are rejected under 35 U.S.C. 103 as being unpatentable over Mathur, in view of Roche, further in view of Gao, further in view of Glatt and further in view of Raitt (US 2012/0021828).

Regarding Claim 26: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 25, except wherein generating the animated scene of an individual speaking the longest one of the text-to-speech outputs comprises generating the animated scene of an individual speaking the longest one of the text-to-speech outputs in a video game.
However, Raitt discloses generating the animated scene of an individual speaking the longest one of the text-to-speech outputs in a video game (Raitt: ¶[0026]-[0040], and in particular ¶[0027]-[0031], discloses that the system records multi-dimensional video game world data for a video game sequence and feeds it back to generate images of that video game scene). Mathur in view of Roche, Gao and Glatt, and further in view of Raitt, are combinable because they are all directed to computer-implemented processing of audio and timing information for presentation in multimedia scenes; e.g., Mathur discloses determining and adjusting durations for language-specific audio/closed captioning in scenes, Roche discloses animating a scene for a virtual reality user, Gao discloses cross-language text-to-speech generation with timing alignment, Glatt discloses constructing audio of a selected overall length from segments, and Raitt discloses controlling and animating a virtual avatar/game-style animated scene. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to place the duration-driven presentation framework of Mathur, Roche, Gao and Glatt within a video game. The suggestion/motivation for doing so is Raitt's recognition that “Moreover, where the computer model has different proportions to that of the actor, the captured data might result in unacceptable artifacts due to recording intersections of data, or the like,” as disclosed in ¶[0005] of Raitt. In other words, the user experience is improved by having a dynamic animation service that more accurately depicts the scene within a video game setting.
Regarding Claim 27: The proposed combination of Mathur, Roche, Gao and Glatt further discloses the computer-implemented method of claim 1, except wherein determining an amount of time to allocate for a script that is to be spoken in an animated scene comprises determining the amount of time to allocate for the script that is to be spoken in the animated scene at run time in a video game (Raitt: ¶[0036]-[0040], ¶[0055]-[0058] and ¶[0109]-[0120] disclose generating and editing animation within a video game environment during execution of the game, including recording, modifying and replaying game-world animation data as part of the running game sequence, therefore teaching runtime determination/use of animation timing in a video game context). Mathur, Roche, Gao and Glatt in view of Raitt are combinable because the former disclose determining durations of language-specific spoken outputs and allocating time based on those durations, while Raitt teaches performing animation and scene generation within an executing video game environment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to perform the already-taught duration allocation in Raitt's runtime video game environment so that the amount of time allocated for the script is determined during execution of the animated video game scene. The suggestion/motivation for doing so is Raitt's teaching of an integrated video game and editing system that allows animation data to be recorded, modified and fed back into the video game for display of the game sequence in ¶[0040].

Regarding Claim 35: Claim 35 has been analyzed with respect to claim 26 (see rejection above) and is rejected for the same reasons of obviousness used above.

Regarding Claim 36: Claim 36 has been analyzed with respect to claim 27 (see rejection above) and is rejected for the same reasons of obviousness used above.

Conclusion

THIS ACTION IS MADE FINAL.
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703) 756-4599. The examiner can normally be reached Monday - Friday 8:00-5:00 EST, off every 2nd Friday. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Hai Phan, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/IAN SCOTT MCLEAN/
Examiner, Art Unit 2654

/HAI PHAN/
Supervisory Patent Examiner, Art Unit 2654

Prosecution Timeline

Apr 22, 2024: Application Filed
Jun 05, 2025: Response after Non-Final Action
Nov 27, 2025: Non-Final Rejection — §103
Jan 09, 2026: Response Filed
Mar 11, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602553: SPEECH TRANSLATION METHOD, DEVICE, AND STORAGE MEDIUM (granted Apr 14, 2026; 2y 5m to grant)
Patent 12494199: VOICE INTERACTION METHOD AND ELECTRONIC DEVICE (granted Dec 09, 2025; 2y 5m to grant)
Patent 12443805: Systems and Methods for Multilingual Data Processing and Arrangement on a Multilingual User Interface (granted Oct 14, 2025; 2y 5m to grant)
Patent 12437144: Content Recommendation Method and User Terminal (granted Oct 07, 2025; 2y 5m to grant)
Patent 12400644: DYNAMIC LANGUAGE MODEL UPDATES WITH BOOSTING (granted Aug 26, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
43%
Grant Probability
74%
With Interview (+31.0%)
3y 2m
Median Time to Grant
Moderate
PTA Risk
Based on 44 resolved cases by this examiner. Grant probability derived from career allow rate.
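The projection figures above appear to be straightforward arithmetic on the examiner's career numbers. Assuming the with-interview figure is simply the career allow rate plus the observed interview lift (an assumption about how this report derives it), the page's percentages reproduce as:

```python
# Numbers taken from this report; the additive model is an assumption.
granted, resolved = 19, 44
allow_rate = granted / resolved            # career allow rate
interview_lift = 0.31                      # observed lift with interview
with_interview = allow_rate + interview_lift

print(round(allow_rate * 100))       # matches the 43% shown
print(round(with_interview * 100))   # matches the 74% shown
```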
