Prosecution Insights
Last updated: April 19, 2026
Application No. 18/839,503

SIMULTANEOUS INTERPRETATION DEVICE, SIMULTANEOUS INTERPRETATION SYSTEM, SIMULTANEOUS INTERPRETATION PROCESSING METHOD, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM

Non-Final OA §103
Filed: Aug 19, 2024
Examiner: GAY, SONIA L
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: National Institute Of Information And Communications Technology
OA Round: 1 (Non-Final)

Grant Probability: 82% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
Grant Probability With Interview: 93%

Examiner Intelligence

Career Allow Rate: 82% — above average (701 granted / 855 resolved; +20.0% vs TC avg)
Interview Lift: +11.4% (moderate lift; resolved cases with vs. without interview)
Typical Timeline: 3y 0m avg prosecution; 33 applications currently pending
Career History: 888 total applications across all art units
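The headline figures above are simple ratios over the examiner's resolved cases. A minimal sketch of the arithmetic in Python, assuming the counts shown above; the displayed percentages are rounded, so the +11.4% lift implies an unrounded with-interview rate of roughly 93.4%:

# Career allow rate from the resolved-case counts shown above.
granted = 701
resolved = 855
career_allow_rate = granted / resolved            # ~0.820, displayed as 82%

# Interview lift is the with-interview grant rate minus the overall rate.
# 93% is the rounded display value; 93.4% is the rate implied by the +11.4% lift.
with_interview_rate = 0.934
interview_lift = with_interview_rate - career_allow_rate

print(f"Career allow rate: {career_allow_rate:.1%}")   # 82.0%
print(f"Interview lift:    {interview_lift:+.1%}")     # +11.4%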

Statute-Specific Performance

§101: 10.2% (-29.8% vs TC avg)
§103: 50.6% (+10.6% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 13.9% (-26.1% vs TC avg)
Deltas shown against an estimated Tech Center average • Based on career data from 855 resolved cases
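Each delta can be read back into an implied Tech Center average by simple subtraction; a minimal sketch, assuming each delta is just the examiner's rate minus the TC average (all figures are the rounded values shown above):

# Implied Tech Center averages, assuming delta = examiner rate - TC average.
examiner_rate = {"101": 10.2, "103": 50.6, "102": 11.9, "112": 13.9}   # percent
delta_vs_tc   = {"101": -29.8, "103": 10.6, "102": -28.1, "112": -26.1}

for statute, rate in examiner_rate.items():
    tc_avg = rate - delta_vs_tc[statute]
    print(f"§{statute}: examiner {rate:.1f}% vs implied TC avg {tc_avg:.1f}%")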

Office Action

§103
DETAILED ACTION This action is in response to the initial filing of application no. 18/839503 on 08/19/2024. Claims 1 – 6 are still pending in this application, with claims 1, 5 and 6 being independent. Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. 
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) in claims 1- 4 is/are: a speech processing unit that performs speech recognition processing; a segment processing unit that obtains sentence data; a speaker prediction processing unit that predicts a speaker; and a machine translation processing unit that performs machine translation. Furthermore, such claim limitation(s) in claims 2 and 3 is/are: a video clip processing unit that obtains a clip video stream; a speaker detection processing unit that extracts a face image; an audio encoder that performs audio encoding processing on the audio signal included in the clip video stream; a face encoder that performs face encoding; and a speaker identification processing unit that identifies a speaker. Moreover, such claim limitations in claim 3 is/are: a data storage unit that stores a speaker identifier. Additionally, such claim limitations in claim 4 is/are: a display processing device that input speaker identification data. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof: a single chip with semiconductor device including an integrated circuit, or a dedicated circuit (pg. 46 lines 15 – pg. 47 line 5 of the originally filed specification); a hardware programmed with software to perform the functions recited in claims 1 - 4 ( Figures 4 and 7 which show the speaker prediction processing algorithm performed in claims 1 – 4 and pg.46 lines 6 – pg. 48 line 8 of the originally filed specification). If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 1, 5 and 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Federico et al. (US 11,545,134) in view of Fu et al. 
(US 11,423,911) (“Fu”) and further in view of Shehzad et al. (US 2023/0215440) (“Shehzad”). For claim 1, Federico discloses a simultaneous interpretation device (Abstract), comprising: a speech recognition processing unit (speech recognizer, Fig.3, 301, 303, 307) that performs speech recognition processing on a video stream (A/V stream, Fig.1, 105) including time information (The A/V streams comprise time-based indexed video chunks., column 3 lines 14 – 36), an audio signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) and a video signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) to obtain word sequence data corresponding to the audio signal (column 6 lines 40 – 46); a segment processing unit (speech diarizator, Fig.3, 305) that obtains sentence data (utterances that are correlated with a speaker, Fig.4, 403 and 405; column 8 lines 40 - 51) by performing segment processing (column 6 lines 36 – 38), and obtains time range data that specifies a time range of the sentence data (Fig.4, 401 and 403; column 8 lines 40 – 51); a speaker prediction processing unit (speaker modeler, Fig.3, 311) that predicts a speaker who speaks in a period specified by time range data (Utterances of the same speaker are used by the speaker modeler to select a target speaker. The utterances are associated with time range data., Fig.4, 401,405 and Fig.5, 507; column 7 lines 15 – 26; column 8 lines 40 – 58); and a machine translation processing unit (machine translator, Fig.3, 309) that performs machine translation processing on sentence data to obtain machine translation processing result data corresponding to the sentence data (column 7 lines 43 – 60). Yet, Federico fails to teach the following: the word sequence data includes time information on when each word in the word sequence was uttered; the segment processing unit obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered; and the speaker prediction processing unit predicts the speaker based on the video stream and the time range data. However, Fu discloses a system and method for generating a context-aware transcription (Abstract), comprising the following: an ASR unit (Fig.1, 108 and Fig.6, 614) comprising transcription generating functionality (column 5 lines 50 – column 6 line 6; column 21 lines 55 - 60) further generates word sequence data (transcript comprising English words) which includes time information on when each word in the word sequence was uttered (The English words are timestamped and/or indexed with time to synchronize the audio and the text., column 9 lines 46 – column 10 line 14; column 21 lines 65 – column 22 lines 7; column 23 lines 14 – 23) and the ASR unit comprises segmentation functionality (column 5 lines 50 – column 6 line 6; column 23 lines 14 - 17) to obtain sentence data (segments of synchronized text associated with a speaker), which is segmented word sequence data (The synchronized text is segmented into different segments, wherein the synchronized text comprises a sequence of English words., column 10 lines 5 – 36), and time range data (start time and end time) that specifies a time range in which the word sequence included in the sentence data was uttered (column 10 lines 36 – 44; column 23 lines 14 – 34). 
Additionally, Shehzad further discloses a system and method for verifying a speaker in an audio-visual segment (Abstract), comprising the following: one or more unlabeled speakers are identified in an audio-visual segment using ASR techniques (Fig.4A, 310 and 320; [0026] [0027] [0037] [0038]); time data (moments in time) associated with each of the unlabeled speakers in the audio-visual segment is identified (Fig.3, 330; [0027] [0038]); audio data representative of a speech signal and visual data representative of facial images respectively from the audio-visual segment are extracted based on the time data (Fig.3, 340; [0028] [0038]); the extracted audio and visual data are embedded (The extracted audio data is transformed into a speaker speech space using a first pre-trained neural network. The extracted video data is transformed into a speaker face space using a second pre-trained neural network, Fig.3, 350 and 360; [0028] [0029] [0039] [0040]); and a speaker is predicted based on the embeddings (Fig.4B, 370, 380 and 390; [0030 – 0032] [0041 – 0043]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Federico’s invention in the same way that Fu’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speech recognition unit (Federico, transcription functionality of an ASR system, column 6 lines 44 - 47) further generates word sequence data which includes time information on when each word in the word sequence was uttered; and the segment processing unit (Federico, segmentation functionality of an ASR system, column 6 lines 44 - 47) further obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered.

Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico and Fu in the same way that Shehzad’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speaker prediction processing unit further predicts the speaker based on video, e.g. the video stream, and time, e.g. time range data.
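As mapped above, independent claim 1 recites a pipeline of timestamped speech recognition, sentence segmentation with time range data, speaker prediction from the video stream and time range, and machine translation. A minimal sketch of that data flow, with every type and function name hypothetical (the recognition output is supplied as input rather than computed):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TimedWord:                    # "word sequence data" with per-word time information
    text: str
    start: float                    # seconds into the video stream
    end: float

@dataclass
class Sentence:                     # "sentence data": segmented word sequence data
    words: List[TimedWord]

    def time_range(self) -> Tuple[float, float]:   # "time range data"
        return self.words[0].start, self.words[-1].end

def segment(words: List[TimedWord]) -> List[Sentence]:
    """Naive segment processing: split the word sequence at sentence-final punctuation."""
    sentences, current = [], []
    for w in words:
        current.append(w)
        if w.text.endswith((".", "?", "!")):
            sentences.append(Sentence(current))
            current = []
    if current:
        sentences.append(Sentence(current))
    return sentences

def predict_speaker(video_stream, time_range: Tuple[float, float]) -> str:
    """Placeholder speaker prediction from the video stream over the given time range."""
    return "speaker_1"

def machine_translate(sentence: Sentence) -> str:
    """Placeholder machine translation of the segmented sentence."""
    return "[translated] " + " ".join(w.text for w in sentence.words)

def interpret(video_stream, timed_words: List[TimedWord]):
    """Pipeline: timestamped recognition output -> segmentation -> speaker prediction -> translation."""
    out = []
    for sentence in segment(timed_words):
        speaker = predict_speaker(video_stream, sentence.time_range())
        out.append((speaker, machine_translate(sentence)))
    return out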
For claims 5 and 6, Federico discloses a non-transitory computer readable storage medium storing a program for causing a computer to execute the simultaneous interpretation processing method (Abstract; column 26 lines 20 – 56), comprising: a speech recognition processing step that performs speech recognition processing on a video stream (A/V stream, Fig.1, 105) including time information (The A/V streams comprise time-based indexed video chunks., column 3 lines 14 – 36), an audio signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) and a video signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) to obtain word sequence data corresponding to the audio signal (column 6 lines 40 – 46); a segment processing step that obtains sentence data (utterances that are correlated with a speaker, Fig.4, 403 and 405; column 8 lines 40 - 51) by performing segment processing (column 6 lines 36 – 38), and obtains time range data that specifies a time range of the sentence data (Fig.4, 401 and 403; column 8 lines 40 – 51); a speaker prediction processing step that predicts a speaker who speaks in a period specified by time range data (Utterances of the same speaker are used by the speaker modeler to select a target speaker. The utterances are associated with time range data., Fig.4, 401,405 and Fig.5, 507; column 7 lines 15 – 26; column 8 lines 40 – 58); and a machine translation processing step) that performs machine translation processing on sentence data to obtain machine translation processing result data corresponding to the sentence data (column 7 lines 43 – 60). Yet, Federico fails to teach the following: the word sequence data includes time information on when each word in the word sequence was uttered; the segment processing step obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered; and the speaker prediction processing step predicts the speaker based on the video stream and the time range data. However, Fu discloses a system and method for generating a context-aware transcription (Abstract), comprising the following: an ASR unit (Fig.1, 108 and Fig.6, 614) comprising transcription generating functionality (column 5 lines 50 – column 6 line 6; column 21 lines 55 - 60) further generates word sequence data (transcript comprising English words) which includes time information on when each word in the word sequence was uttered (The English words are timestamped and/or indexed with time to synchronize the audio and the text., column 9 lines 46 – column 10 line 14; column 21 lines 65 – column 22 lines 7; column 23 lines 14 – 23) and the ASR unit comprises segmentation functionality (column 5 lines 50 – column 6 line 6; column 23 lines 14 - 17) to obtain sentence data (segments of synchronized text associated with a speaker), which is segmented word sequence data ( The synchronized text is segmented into different segments, wherein the synchronized text comprises a sequence of English words., column 10 lines 5 – 36), and time range data (start time and end time) that specifies a time range in which the word sequence included in the sentence data was uttered (column 10 lines 36 – 44; column 23 lines 14 – 34). 
Additionally, Shehzad further discloses a system and method for verifying a speaker in an audio-visual segment (Abstract), comprising the following: one or more unlabeled speakers are identified in an audio-visual segment using ASR techniques (Fig.4A, 310 and 320; [0026] [0027] [0037] [0038]); time data (moments in time) associated with each of the unlabeled speakers in the audio-visual segment is identified (Fig.3, 330; [0027] [0038]); audio data representative of a speech signal and visual data representative of facial images respectively from the audio-visual segment are extracted based on the time data (Fig.3, 340; [0028] [0038]); the extracted audio data and visual data are embedded (The extracted audio data is transformed into a speaker speech space using a first pre-trained neural network. The extracted video data is transformed into a speaker face space using a second pre-trained neural network, Fig.3, 350 and 360; [0028] [0029] [0039] [0040]); and a speaker is predicted based on the embeddings (Fig.4B, 370, 380 and 390; [0030 – 0032] [0041 – 0043]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Federico’s invention in the same way that Fu’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speech recognition processing step (Federico, transcription functionality of an ASR system, column 6 lines 44 - 47) further generates word sequence data which includes time information on when each word in the word sequence was uttered; and the segment processing unit (Federico, segmentation functionality of an ASR system, column 6 lines 44 - 47) further obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered.

Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico and Fu in the same way that Shehzad’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speaker prediction processing unit further predicts the speaker based on video, e.g. the video stream, and time, e.g. time range data.

Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Federico et al. (US 11,545,134) in view of Fu et al. (US 11,423,911) (“Fu”), and further in view of Shehzad et al. (US 2023/0215440) (“Shehzad”) and further in view of Ahn et al. (US 2019/0132549) (“Ahn”).
For claim 2, the combination of Federico, Fu and Shehzad further discloses, wherein the speaker prediction processing unit (Federico, column 7 lines 15 - 27) (Shehzad, processing subsystem, Fig.3, 105; [0033] [0034]) includes: a video clip processing unit (Shehzad, input processing model and information extraction module, Fig.3, 120 and 130) that obtains a clip video stream, which is data for a period specified by the time range data, from the video stream (Shehzad, The information extraction module extracts visual data representative of facial images respectively from the audio-visual segment based on one or more moments in time associated with each of the one or more unlabeled speakers., [0035] [0038]); an audio encoder (Shehzad, first pre-trained neural network of the input transformation module, Fig.3, 140) that performs audio encoding processing on the audio signal included in the clip video stream to obtain audio embedding representation data that is embedding representation data corresponding to the audio signal ([0035] [0039]); a face encoder (Shehzad, second pre-trained neural network of the input transformation module, Fig.3, 140) that performs face encoding processing on the image data forming the face image region of the speaker to obtain face embedding representation data that is embedding representation data corresponding to the face image region of the speaker ([0040]); and a speaker identification processing unit (Shehzad, third neural network model of the input transformation module and speaker identification module, Fig.3, 140 and 150) that identifies a speaker who uttered the speech reproduced by the audio signal included in the clip video stream, based on the audio embedding representation data and the face embedding representation data. ([0035] [0041 – 0043]).

Yet, the combination of Federico, Fu and Shehzad fails to teach the following: a speaker detection processing unit that extracts a face image region of a speaker from a frame image formed by the clip video stream. However, Ahn discloses a system and method for performing facial recognition (Abstract), comprising the following: an extractor (Fig.3, 130 and 131 and Fig. 5; [0060] [0062] [0067]) comprising a frame selection module (Fig.5, 131a; [0067]) and face detection module (Fig.5, 131b; [0067]); the frame selection module selects a frame from a plurality of frames included in a video image ([0008 – 0010] [0068]); and the face detection module detects and extracts a facial region of a person from the frame image ([0069 – 0073]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico, Fu and Shehzad in the same way that Ahn’s invention has been improved to achieve the following predictable results for the purpose of improving an automated dubbing service that generates translated versions of multimedia files at reduced costs by using facial recognition to select a naturally sounding speech to replace originally spoken audio, where the naturally sounding speech is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6 and column 7 lines 15 - 27): the device further comprises a separate face, e.g. speaker, detection processing unit, that extracts a face image region of a person, e.g. speaker, from a frame image formed by a video image, e.g. clip video stream.

Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Federico et al. (US 11,545,134) in view of Fu et al. (US 11,423,911) (“Fu”), and further in view of Shehzad et al. (US 2023/0215440) (“Shehzad”), and further in view of Ahn et al. (US 2019/0132549) (“Ahn”) and further in view of Aley-Raz et al. (US 2010/0131273) (“Aley”).

For claim 3, the combination of Federico, Fu, Shehzad and Ahn further discloses: a data storage unit (Shehzad, audio embedding storage repository and visual embedding storage repository, Fig.2, 145 and 146) that stores a speaker identifier (names/labels) that identifies the speaker ([0030] [0031]), and stores the audio embedding representation data and the face embedding representation data that are linked to the speaker identifier ([0031]), wherein the speaker identification processing unit performs best matching processing using (1) the audio embedding representation data obtained by the audio encoder and the face embedding representation data obtained by the face encoder ([0030]), and (2) the audio embedding representation data and the face embedding representation data that have been stored in the data storage unit ([0030] [0031]).

Yet, the combination of Federico, Fu, Shehzad and Ahn fails to teach the following: when a similarity score indicating a degree of similarity between the above two data sets in the best matching processing is greater than a predetermined value, the speaker identification processing unit identifies a speaker identified by the speaker identifier corresponding to the audio embedding representation data and the face embedding representation data stored in the data storage unit, which have been used for the matching processing in the best matching processing, as the speaker who uttered the speech reproduced by the audio signal included in the clip video stream. However, Aley discloses a system and method for performing speaker recognition (Abstract), comprising the following: a confidence level or degree of match is referred to as a similarity score ([0043]); and a speaker is identified based on a comparison of the similarity score with a threshold value (broadly interpreted as similarity score equaling or exceeding a threshold value) ([0043]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico, Fu, Shehzad and Ahn in the same way that Aley’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the confidence level estimated for the matching result generated by the speaker identification processing unit (Shehzad, [0030] [0032]) is further a similarity score indicating a degree of similarity between the above two data sets in the best matching processing; and when the similarity score is compared to a predetermined (threshold) value, e.g. greater than the threshold value, the speaker identification processing unit identifies a speaker, e.g.
the speaker identified by the speaker identifier corresponding to the audio embedding representation data and the face embedding representation data stored in the data storage unit, which have been used for the matching processing in the best matching processing, as the speaker who uttered the speech reproduced by the audio signal included in the clip video stream. Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Federico et al. (US 11,545,134) in view of Fu et al. (US 11,423,911) (“Fu”) and further in view of Shehzad et al. (US 2023/0215440) (“Shehzad”) and further in view of Dubinsky et al. (US 2020/0211565) (“Dubinsky”). For claim 4, Federico discloses a simultaneous interpretation device (Abstract), comprising: a speech recognition processing unit (speech recognizer, Fig.3, 301, 303, 307) that performs speech recognition processing on a video stream (A/V stream, Fig.1, 105) including time information (The A/V streams comprise time-based indexed video chunks., column 3 lines 14 – 36), an audio signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) and a video signal (The A/V stream comprises audio signals., column 3 lines 14 – 36) to obtain word sequence data corresponding to the audio signal (column 6 lines 40 – 46); a segment processing unit (speech diarizator, Fig.3, 305) that obtains sentence data (utterances that are correlated with a speaker, Fig.4, 403 and 405; column 8 lines 40 - 51) by performing segment processing (column 6 lines 36 – 38), and obtains time range data that specifies a time range of the sentence data (Fig.4, 401 and 403; column 8 lines 40 – 51); a speaker prediction processing unit (speaker modeler, Fig.3, 311) that predicts a speaker who speaks in a period specified by time range data (Utterances of the same speaker are used by the speaker modeler to select a target speaker. The utterances are associated with time range data., Fig.4, 401,405 and Fig.5, 507; column 7 lines 15 – 26; column 8 lines 40 – 58); a machine translation processing unit (machine translator, Fig.3, 309) that performs machine translation processing on sentence data to obtain machine translation processing result data corresponding to the sentence data (column 7 lines 43 – 60); and a display processing device (Fig.1, 131) that displays a dubbed multimedia file (column 5 lines 7 – 23). Yet, Federico fails to teach the following: the word sequence data includes time information on when each word in the word sequence was uttered; the segment processing unit obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered; the speaker prediction processing unit predicts the speaker based on the video stream and the time range data; and a display processing device that inputs speaker identification data, which is data for identifying a speaker who uttered speech reproduced by the audio signal included in the video stream, obtained by the simultaneous interpretation device, and the machine translation processing result data corresponding to the sentence data obtained by the machine translation processing unit of the simultaneous interpretation device, and generates display data for displaying the speaker identification data and the machine translation processing result data in one or more predetermined image areas of a screen of a display device. 
However, Fu discloses a system and method for generating a context-aware transcription (Abstract), comprising the following: an ASR unit (Fig.1, 108 and Fig.6, 614) comprising transcription generating functionality (column 5 lines 50 – column 6 line 6; column 21 lines 55 - 60) further generates word sequence data (transcript comprising English words) which includes time information on when each word in the word sequence was uttered (The English words are timestamped and/or indexed with time to synchronize the audio and the text., column 9 lines 46 – column 10 line 14; column 21 lines 65 – column 22 lines 7; column 23 lines 14 – 23) and the ASR unit comprises segmentation functionality (column 5 lines 50 – column 6 line 6; column 23 lines 14 - 17) to obtain sentence data (segments of synchronized text associated with a speaker), which is segmented word sequence data ( The synchronized text is segmented into different segments, wherein the synchronized text comprises a sequence of English words., column 10 lines 5 – 36), and time range data (start time and end time) that specifies a time range in which the word sequence included in the sentence data was uttered (column 10 lines 36 – 44; column 23 lines 14 – 34). Additionally, Shehzad further discloses a system and method for verifying a speaker in an audio-visual segment (Abstract), comprising the following: one or more unlabeled speakers are identified in an audio-visual segment using ASR techniques (Fig.4A, 310 and 320; [0026] [0027] [0037] [0038]); time data (moments in time) associated with each of the unlabeled speakers in the audio-visual segment is identified (Fig.3, 330; [0027] [0038]); audio data representative of speech signal and visual data representative of facial images respectively from the audio-visual segment are extracted based on the time data (Fig.3, 340; [0028] [0038]); the extracted audio data and visual data is embedded (The extracted audio data is transformed into a speaker speech space using a first pre-trained neural network. The extracted video data is transformed into a speaker face space using a second pre-trained neural network, Fig.3, 350 and 360; [0028] [0029] [0039] [0040]); and a speaker is predicted based on the embeddings (Fig.4B, 370, 380 and 390; [0030 – 0032] [0041 – 0043]). Moreover, Dubinsky discloses a system and method for generating a dubbed video source (Abstract), comprising the following: a video program containing both audio and video is transmitted to a transcription service ([0009] [0010]); the audio is transcribed and translated into a target language ([0011 – 0014]); multiple language dubbings are simultaneously produced for all translated scripts ([0015 – 0017]); the on- screen placement of subtitles comprising translated text is determined ([0018] [0019]), wherein the translated text comprises speaker identifier ([0014]); and the video program comprising the translated text and dubbed audio is transmitted back to a source ([0009] [0020]). 
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Federico’s invention in the same way that Fu’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speech recognition unit (Federico, transcription functionality of an ASR system, column 6 lines 44 - 47) further generates word sequence data which includes time information on when each word in the word sequence was uttered; and the segment processing unit (Federico, segmentation functionality of an ASR system, column 6 lines 44 - 47) further obtains the sentence data, which is segmented word sequence data, by performing segment processing on the word sequence data, and obtains the time range data that specifies a time range in which the word sequence included in the sentence data was uttered.

Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico and Fu in the same way that Shehzad’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs, where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the speaker prediction processing unit further predicts the speaker based on video, e.g. the video stream, and time, e.g. time range data.

Furthermore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Federico, Fu and Shehzad in the same way that Dubinsky’s invention has been improved to achieve the following predictable results for the purpose of providing an automated dubbing service that generates translated versions of multimedia files at reduced costs for a variety of users (e.g. hearing impaired when the original file is missing closed captioning information) (Federico, column 5 lines 47 – 55), where the original spoken audio is replaced by naturally sounding speech that is both synchronized and acoustically similar to the original (Federico, column 2 lines 54 – column 3 line 6): the dubbed multimedia file further comprises machine translation processing results data (translated text) and speaker identification data (identifiers) as closed captioning information; and the display processing device inputs the speaker identification data, which is data for identifying a speaker who uttered speech reproduced by the audio signal included in the video stream, obtained by the simultaneous interpretation device, and the machine translation processing result data corresponding to the sentence data obtained by the machine translation processing unit of the simultaneous interpretation device, and generates display data for displaying the speaker identification data and the machine translation processing result data as closed captioning information in one or more predetermined image areas of a screen of a display device.
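Claims 2 and 3, as mapped above, add audio and face embedding based speaker identification gated by a similarity threshold, and claim 4 adds display of the speaker identifier alongside the translation result in a predetermined image area. A minimal combined sketch, with cosine similarity standing in for whatever scoring the cited references actually apply and every name hypothetical:

import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def identify_speaker(
    audio_emb: List[float],
    face_emb: List[float],
    stored: Dict[str, Tuple[List[float], List[float]]],  # speaker id -> (audio emb, face emb)
    threshold: float = 0.8,                               # "predetermined value"
) -> Optional[str]:
    """Best matching: pick the stored speaker whose embeddings score highest,
    and accept the match only when the similarity score exceeds the threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, (a, f) in stored.items():
        score = (cosine(audio_emb, a) + cosine(face_emb, f)) / 2
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score > threshold else None

@dataclass
class CaptionEntry:                  # display data for one predetermined image area
    speaker_id: str                  # speaker identification data
    translated_text: str             # machine translation processing result data
    area: str = "bottom-caption-band"

def build_display_data(speaker_id: Optional[str], translated_text: str) -> CaptionEntry:
    """Pair the speaker identification data with the translation result for display."""
    return CaptionEntry(speaker_id or "unknown speaker", translated_text)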
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SONIA L GAY/
Primary Examiner, Art Unit 2657

Prosecution Timeline

Aug 19, 2024: Application Filed
Feb 21, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602617
DATA MANUFACTURING FRAMEWORKS FOR SYNTHESIZING SYNTHETIC TRAINING DATA TO FACILITATE TRAINING A NATURAL LANGUAGE TO LOGICAL FORM MODEL
2y 5m to grant • Granted Apr 14, 2026
Patent 12602408
STREAMING OF NATURAL LANGUAGE (NL) BASED OUTPUT GENERATED USING A LARGE LANGUAGE MODEL (LLM) TO REDUCE LATENCY IN RENDERING THEREOF
2y 5m to grant • Granted Apr 14, 2026
Patent 12602539
PROACTIVE ASSISTANCE VIA A CASCADE OF LLMS
2y 5m to grant • Granted Apr 14, 2026
Patent 12596708
SYSTEMS AND METHODS FOR AUTOMATED CODE GENERATION FOR CALCULATION BASED ON ASSOCIATED FORMAL SPECIFICATIONS
2y 5m to grant • Granted Apr 07, 2026
Patent 12591604
INTELLIGENT ASSISTANT
2y 5m to grant • Granted Mar 31, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 82%
With Interview: 93% (+11.4%)
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 855 resolved cases by this examiner. Grant probability derived from career allow rate.
