Prosecution Insights
Last updated: April 19, 2026
Application No. 18/029,060

SPEECH PROCESSING DEVICE AND OPERATION METHOD THEREOF

Status: Final Rejection (§103)

Filed: Mar 28, 2023
Examiner: LE, THUYKHANH
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Amosense Co. Ltd.
OA Round: 4 (Final)

Grant Probability: 78% (Favorable)
Expected OA Rounds: 5-6
Median Time to Grant: 2y 9m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 78% — above average (307 granted / 393 resolved; +16.1% vs TC avg)
Interview Lift: +37.1% higher allowance among resolved cases with an interview
Typical Timeline: 2y 9m average prosecution; 19 applications currently pending
Career History: 412 total applications across all art units

Statute-Specific Performance

Statute   Rate     vs TC avg
§101      18.6%    -21.4%
§103      41.8%    +1.8%
§102      20.1%    -19.9%
§112      10.1%    -29.9%

Tech Center averages are estimates. Based on career data from 393 resolved cases.
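The panel figures above are simple arithmetic over case counts. A minimal sketch of that arithmetic follows; only 307/393 and the displayed percentages come from this report, while the TC baseline and the interview/no-interview splits are illustrative assumptions (the per-statute deltas are the same subtraction against a per-statute TC average).

```python
# Minimal sketch reproducing the examiner-panel arithmetic from raw counts.
# Only 307/393 and the displayed percentages come from the report; the TC
# baseline and the interview/no-interview splits are illustrative assumptions.

def pct(n: int, d: int) -> float:
    """Share of n in d, as a percentage rounded to one decimal place."""
    return round(100.0 * n / d, 1)

granted, resolved = 307, 393
allow_rate = pct(granted, resolved)            # 78.1 -> shown as 78%

TC_AVG_ALLOW = 62.0                            # assumed TC 2600 baseline
vs_tc = round(allow_rate - TC_AVG_ALLOW, 1)    # +16.1 vs TC avg

# Interview lift = allowance rate with an interview minus the rate without.
# Hypothetical subgroup counts, chosen to land on the displayed +37.1%.
lift = round(pct(99, 100) - pct(557, 900), 1)  # 99.0 - 61.9 = 37.1

print(f"allow rate {allow_rate}% ({vs_tc:+}% vs TC avg), interview lift {lift:+}%")
```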

Office Action

Rejection basis: §103

DETAILED ACTION

Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments/Amendments

2. With respect to 112(a), the amended claims 1 and 9 overcome 112(a); thus, the 112(a) rejection has been withdrawn. With respect to the Allowable Subject Matter indicated in the previous office action, the amendment in claims 1 and 9 changes the scope of the limitations indicated as Allowable Subject Matter. Thus, the indication of Allowable Subject Matter is withdrawn.

Claim Rejections - 35 USC § 103

3. The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

4. Claims 1-3, 6-10, and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Nakadai et al. (US 2015/0154957 A1) in view of Kamatani et al. (US 2016/0085747 A1).

With respect to Claim 1, Nakadai et al. disclose a voice processing device comprising: a voice receiving circuit configured to receive voice signals related to voices pronounced by speakers (Nakadai et al. [0060] The sound collecting unit 11 records sound signals of N (where N is an integer greater than 1, for example, 8) channels and transmits the recorded sound signals of N channels to the sound signal acquiring unit 12. The sound collecting unit 11 includes N microphones 101-1 to 101-N receiving. See paragraph [0227]); a voice processing circuit configured to: generate separated voice signals related to voices by performing voice source separation of the voice signals based on voice source positions of the voices (Nakadai et al. Fig. 16, element 21 sound source localizing unit, element 22 sound source separating unit; [0062] In case of sound signals from the plurality of speakers, the speech recognizing unit 13 distinguishes the speakers and recognizes the speech details for each distinguished speaker; [0134] The sound source localizing unit 21 estimates an azimuth of a sound source on the basis of an input signal input from the sound signal acquiring unit 12 and outputs azimuth information indicating the estimated azimuth and sound signals of N channels to the sound source separating unit 22. The azimuth estimated by the sound source localizing unit 21 is, for example, a direction in the horizontal plane with respect to the direction of a predetermined microphone out of the N microphones from the point of the center of gravity of the positions of the N microphones of the sound collecting unit 11. For example, the sound source localizing unit 21 estimates the azimuth using a generalized singular-value decomposition-multiple signal classification (GSVD-MUSIC) method; [0137] the sound source separating unit 22 may calculate a sound feature quantity for each sound signal of N channels and may separate the sound signals into the sound signals by speakers on the basis of the calculated sound feature quantity and the azimuth information input from the sound source localizing unit 21), and generate translation results for the voices by using the separated voice signals (Nakadai et al. [0158] The language displayed in an image presented to each speaker may be based on a language selected in advance from a menu. For example, when the speaker Sp1 selects Japanese as the language from the menu, the translation unit 24 may translate the speech uttered in French by another speaker and may display the translation result in the first character presentation image 322C. Accordingly, even when another speaker utters speech in French, English, or Chinese, the conversation support apparatus 1A may display the speech pieces of other speakers in Japanese in the fourth character presentation image 352C in FIG. 18); a memory (Nakadai et al. [0227] The "computer-readable recording medium" may include a medium that temporarily holds a program for a predetermined time, like a volatile memory (RAM) in a computer system serving as a server or a client in a case where the program is transmitted via a network such as the Internet or a communication circuit such as a telephone circuit); and an output circuit configured to output the translation results for the voices (Nakadai et al. [0112] The images 524A to 524C of the characters obtained by recognizing the speech of the second speaker Sp2 are displayed in the first character presentation image 522. As shown in FIG. 14, the images 524A to 524C are sequentially displayed from the deep side to the near side of the image display unit 15 in the first speaker Sp1. The images 534A to 534D of the characters obtained by recognizing the speech of the first speaker Sp1 are displayed in the second character presentation image 532. As shown in FIG. 14, the images 534A to 534D are sequentially displayed from the deep side to the near side of the image display unit 15 in the second speaker Sp2. In FIG. 14, the uttering order is, for example, as follows: image 534A, image 524A, image 534B, image 524B, image 534C, image 524C, and image 534D. See paragraphs [0111] and [0140] and Fig. 14).

Nakadai et al. fail to explicitly teach wherein an output order of the translation results is determined based on pronouncing time points order of the voices such that the translation results are output sequentially in the pronouncing time points order in response to the speakers pronouncing the voices.

However, Kamatani et al. teach wherein an output order of the translation results is determined based on pronouncing time points order of the voices such that the translation results are output sequentially in the pronouncing time points order in response to the speakers pronouncing the voices (Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

Nakadai et al. and Kamatani et al. are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of separating the sound sources based on the sound source positions as taught by Nakadai et al., using the teaching of determining the start/end of the utterances from the plurality of speakers in the conversation as taught by Kamatani et al., for the benefit of displaying the translated texts of the utterances chronologically (Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

With respect to Claim 2, Nakadai et al. in view of Kamatani et al. teach wherein the translation results include the voice signals related to voices obtained by translating the voices (Nakadai et al. [0140] In this case, the translation unit 24 translates the speech details so that the images 534A to 534D displayed in the second character presentation image 532 are translated from Japanese in which the first speaker Sp1 utters speech to English which is the language of the second speaker Sp2 and are then displayed. The translation unit 24 translates the speech details so that the images 524A to 524C displayed in the first character presentation image 522 are translated from English in which the second speaker Sp2 utters speech to Japanese which is the language of the first speaker Sp1 and are then displayed) OR text data related to texts obtained by translating the texts corresponding to the voices.

With respect to Claim 3, Nakadai et al. in view of Kamatani et al. teach comprising a plurality of microphones disposed to form an array (Nakadai et al. [0147] When a microphone array is constituted by the microphones 101-1 to 101-N of the sound collecting unit 11, a speaker may not input or select information indicating that the corresponding speaker utters speech to the conversation support apparatus 1A at the time of uttering speech. In this case, the conversation support apparatus 1A can separate the speech into speech pieces by speakers using the microphone array), wherein the plurality of microphones are configured to generate the voice signals in response to the voices (Nakadai et al. Fig. 16, element 11 Sound Collecting Unit with microphone array 101-1 to 101-N, element 12 Sound Signal Acquiring Unit).

With respect to Claim 6, Nakadai et al. in view of Kamatani et al. teach wherein the voice processing circuit is configured to: determine the source languages for translating the voices related to the separated voice signals and the target languages with reference to the source language information corresponding to the voice source positions of the separated voice signals stored in the memory and the target language information (Nakadai et al. [0137] the sound source separating unit 22 may calculate a sound feature quantity for each sound signal of N channels and may separate the sound signals into the sound signals by speakers on the basis of the calculated sound feature quantity and the azimuth information input from the sound source localizing unit 21; [0138] The language information detecting unit 23 detects a language of each speaker using a known method for each sound signal by speakers input from the sound source separating unit 22. The language information detecting unit 23 outputs information indicating the detected language for each speaker and the sound signals by speakers and the azimuth information input from the sound source separating unit 22 to the speech recognizing unit 13A. The language information detecting unit 23 detects the language of each speaker with reference to, for example, a language database. The language database may be included in the conversation support apparatus 1A or may be connected thereto via a wired or wireless network), and generate the translation results by translating languages of the voices from the source languages to the target languages (Nakadai et al. [0140] The translation unit 24 translates the speech details if necessary on the basis of the speech details, the information indicating the speakers, and the information indicating a language for each speaker which are input from the speech recognizing unit 13A, adds or replaces information indicating the translated speech details to or for the information input from the speech recognizing unit 13A, and outputs the resultant to the image processing unit 14. Specifically, an example where two speakers of the first speaker Sp1 and the second speaker Sp2 are present as the speakers, the language of the first speaker Sp1 is Japanese, and the language of the second speaker Sp2 is English will be described below with reference to FIG. 14. In this case, the translation unit 24 translates the speech details so that the images 534A to 534D displayed in the second character presentation image 532 are translated from Japanese in which the first speaker Sp1 utters speech to English which is the language of the second speaker Sp2 and are then displayed. The translation unit 24 translates the speech details so that the images 524A to 524C displayed in the first character presentation image 522 are translated from English in which the second speaker Sp2 utters speech to Japanese which is the language of the first speaker Sp1 and are then displayed).

With respect to Claim 7, Nakadai et al. in view of Kamatani et al. teach wherein the voice processing circuit is configured to: judge pronouncing time points of the voices pronounced by the speakers based on the voice signals, and determine an output order of the translation results so that the output order of the translation results and a pronouncing order of the voices are the same, and wherein the output circuit is configured to output the translation results in accordance with the determined output order (Nakadai et al. [0111] FIG. 14 is a diagram showing an image which is displayed on the image display unit 15 after the first speaker Sp1 utters speech four times and the second speaker Sp2 utters speech three times; [0112] The images 524A to 524C of the characters obtained by recognizing the speech of the second speaker Sp2 are displayed in the first character presentation image 522. As shown in FIG. 14, the images 524A to 524C are sequentially displayed from the deep side to the near side of the image display unit 15 in the first speaker Sp1. The images 534A to 534D of the characters obtained by recognizing the speech of the first speaker Sp1 are displayed in the second character presentation image 532. As shown in FIG. 14, the images 534A to 534D are sequentially displayed from the deep side to the near side of the image display unit 15 in the second speaker Sp2. In FIG. 14, the uttering order is, for example, as follows: image 534A, image 524A, image 534B, image 524B, image 534C, image 524C, and image 534D; Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

With respect to Claim 8, Nakadai et al. in view of Kamatani et al. teach wherein the voice processing circuit is configured to generate a first translation result for a first voice pronounced at a first time point and a second translation result for a second voice pronounced at a second time point after the first time point (Nakadai et al. [0111] FIG. 14 is a diagram showing an image which is displayed on the image display unit 15 after the first speaker Sp1 utters speech four times and the second speaker Sp2 utters speech three times; [0112] The images 524A to 524C of the characters obtained by recognizing the speech of the second speaker Sp2 are displayed in the first character presentation image 522. As shown in FIG. 14, the images 524A to 524C are sequentially displayed from the deep side to the near side of the image display unit 15 in the first speaker Sp1. The images 534A to 534D of the characters obtained by recognizing the speech of the first speaker Sp1 are displayed in the second character presentation image 532. As shown in FIG. 14, the images 534A to 534D are sequentially displayed from the deep side to the near side of the image display unit 15 in the second speaker Sp2. In FIG. 14, the uttering order is, for example, as follows: image 534A, image 524A, image 534B, image 524B, image 534C, image 524C, and image 534D), and wherein the first translation result is output prior to the second translation result (Nakadai et al. paragraphs [0111] and [0112] and Fig. 14; Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

With respect to Claim 9, Nakadai et al. disclose an operating method of a voice processing device, the operating method comprising: receiving voice signals related to voices pronounced by speakers (Nakadai et al. [0060] The sound collecting unit 11 records sound signals of N (where N is an integer greater than 1, for example, 8) channels and transmits the recorded sound signals of N channels to the sound signal acquiring unit 12. The sound collecting unit 11 includes N microphones 101-1 to 101-N receiving); generating separated voice signals related to voices by performing voice source separation of the voice signals based on voice source positions of the voices (Nakadai et al. Fig. 16, element 21 sound source localizing unit, element 22 sound source separating unit; [0062] In case of sound signals from the plurality of speakers, the speech recognizing unit 13 distinguishes the speakers and recognizes the speech details for each distinguished speaker; [0134] The sound source localizing unit 21 estimates an azimuth of a sound source on the basis of an input signal input from the sound signal acquiring unit 12 and outputs azimuth information indicating the estimated azimuth and sound signals of N channels to the sound source separating unit 22. The azimuth estimated by the sound source localizing unit 21 is, for example, a direction in the horizontal plane with respect to the direction of a predetermined microphone out of the N microphones from the point of the center of gravity of the positions of the N microphones of the sound collecting unit 11. For example, the sound source localizing unit 21 estimates the azimuth using a generalized singular-value decomposition-multiple signal classification (GSVD-MUSIC) method; [0137] the sound source separating unit 22 may calculate a sound feature quantity for each sound signal of N channels and may separate the sound signals into the sound signals by speakers on the basis of the calculated sound feature quantity and the azimuth information input from the sound source localizing unit 21); generating translation results for the voices by using the separated voice signals (Nakadai et al. [0158] The language displayed in an image presented to each speaker may be based on a language selected in advance from a menu. For example, when the speaker Sp1 selects Japanese as the language from the menu, the translation unit 24 may translate the speech uttered in French by another speaker and may display the translation result in the first character presentation image 322C. Accordingly, even when another speaker utters speech in French, English, or Chinese, the conversation support apparatus 1A may display the speech pieces of other speakers in Japanese in the fourth character presentation image 352C in FIG. 18); and outputting the translation results for the voices (Nakadai et al. [0140] The translation unit 24 translates the speech details if necessary on the basis of the speech details, the information indicating the speakers, and the information indicating a language for each speaker which are input from the speech recognizing unit 13A, adds or replaces information indicating the translated speech details to or for the information input from the speech recognizing unit 13A, and outputs the resultant to the image processing unit 14. Specifically, an example where two speakers of the first speaker Sp1 and the second speaker Sp2 are present as the speakers, the language of the first speaker Sp1 is Japanese, and the language of the second speaker Sp2 is English will be described below with reference to FIG. 14. In this case, the translation unit 24 translates the speech details so that the images 534A to 534D displayed in the second character presentation image 532 are translated from Japanese in which the first speaker Sp1 utters speech to English which is the language of the second speaker Sp2 and are then displayed. The translation unit 24 translates the speech details so that the images 524A to 524C displayed in the first character presentation image 522 are translated from English in which the second speaker Sp2 utters speech to Japanese which is the language of the first speaker Sp1 and are then displayed).

Nakadai et al. fail to explicitly teach wherein the outputting of the translation results includes: determining an output order of the translation results such that the translation results are output sequentially based on pronouncing time points order of the voices in response to the speakers pronouncing the voices; and outputting the translation results in accordance with the determined output order.

However, Kamatani et al. teach wherein the outputting of the translation results includes: determining an output order of the translation results such that the translation results are output sequentially based on pronouncing time points order of the voices in response to the speakers pronouncing the voices (Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301); and outputting the translation results in accordance with the determined output order (Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

Nakadai et al. and Kamatani et al. are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of separating the sound sources based on the sound source positions as taught by Nakadai et al., using the teaching of determining the start/end of the utterances from the plurality of speakers in the conversation as taught by Kamatani et al., for the benefit of displaying the translated texts of the utterances chronologically (Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

With respect to Claim 10, Nakadai et al. in view of Kamatani et al. teach wherein the translation results include the voice signals related to voices obtained by translating the voices (Nakadai et al. [0140] In this case, the translation unit 24 translates the speech details so that the images 534A to 534D displayed in the second character presentation image 532 are translated from Japanese in which the first speaker Sp1 utters speech to English which is the language of the second speaker Sp2 and are then displayed. The translation unit 24 translates the speech details so that the images 524A to 524C displayed in the first character presentation image 522 are translated from English in which the second speaker Sp2 utters speech to Japanese which is the language of the first speaker Sp1 and are then displayed) OR text data related to texts obtained by translating the texts corresponding to the voices.

With respect to Claim 12, Nakadai et al. in view of Kamatani et al. teach wherein the generating of the translation results comprises: determining the source languages for translating the voices related to the separated voice signals and the target languages with reference to the source language information corresponding to the voice source positions of the stored separated voice signals and the target language information (Nakadai et al. [0137] the sound source separating unit 22 may calculate a sound feature quantity for each sound signal of N channels and may separate the sound signals into the sound signals by speakers on the basis of the calculated sound feature quantity and the azimuth information input from the sound source localizing unit 21; [0138] The language information detecting unit 23 detects a language of each speaker using a known method for each sound signal by speakers input from the sound source separating unit 22. The language information detecting unit 23 outputs information indicating the detected language for each speaker and the sound signals by speakers and the azimuth information input from the sound source separating unit 22 to the speech recognizing unit 13A. The language information detecting unit 23 detects the language of each speaker with reference to, for example, a language database. The language database may be included in the conversation support apparatus 1A or may be connected thereto via a wired or wireless network); and generating the translation results by translating languages of the voices from the source languages to the target languages (Nakadai et al. [0140] The translation unit 24 translates the speech details if necessary on the basis of the speech details, the information indicating the speakers, and the information indicating a language for each speaker which are input from the speech recognizing unit 13A, adds or replaces information indicating the translated speech details to or for the information input from the speech recognizing unit 13A, and outputs the resultant to the image processing unit 14. Specifically, an example where two speakers of the first speaker Sp1 and the second speaker Sp2 are present as the speakers, the language of the first speaker Sp1 is Japanese, and the language of the second speaker Sp2 is English will be described below with reference to FIG. 14. In this case, the translation unit 24 translates the speech details so that the images 534A to 534D displayed in the second character presentation image 532 are translated from Japanese in which the first speaker Sp1 utters speech to English which is the language of the second speaker Sp2 and are then displayed. The translation unit 24 translates the speech details so that the images 524A to 524C displayed in the first character presentation image 522 are translated from English in which the second speaker Sp2 utters speech to Japanese which is the language of the first speaker Sp1 and are then displayed).

With respect to Claim 13, Nakadai et al. in view of Kamatani et al. teach wherein the determining of the output order comprises: judging pronouncing time points of the voices pronounced by the speakers based on the voice signals (Nakadai et al. [0111] FIG. 14 is a diagram showing an image which is displayed on the image display unit 15 after the first speaker Sp1 utters speech four times and the second speaker Sp2 utters speech three times; [0112] The images 524A to 524C of the characters obtained by recognizing the speech of the second speaker Sp2 are displayed in the first character presentation image 522. As shown in FIG. 14, the images 524A to 524C are sequentially displayed from the deep side to the near side of the image display unit 15 in the first speaker Sp1. The images 534A to 534D of the characters obtained by recognizing the speech of the first speaker Sp1 are displayed in the second character presentation image 532. As shown in FIG. 14, the images 534A to 534D are sequentially displayed from the deep side to the near side of the image display unit 15 in the second speaker Sp2. In FIG. 14, the uttering order is, for example, as follows: image 534A, image 524A, image 534B, image 524B, image 534C, image 524C, and image 534D); and determining an output order of the translation results so that the output order of the translation results and a pronouncing order of the voices are the same (Nakadai et al. paragraphs [0111] and [0112] and Fig. 14; Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

With respect to Claim 14, Nakadai et al. in view of Kamatani et al. teach wherein the generating of the translation results comprises: generating a first translation result for a first voice pronounced at a first time point (Nakadai et al. [0111] FIG. 14 is a diagram showing an image which is displayed on the image display unit 15 after the first speaker Sp1 utters speech four times and the second speaker Sp2 utters speech three times; [0112] The images 524A to 524C of the characters obtained by recognizing the speech of the second speaker Sp2 are displayed in the first character presentation image 522. As shown in FIG. 14, the images 524A to 524C are sequentially displayed from the deep side to the near side of the image display unit 15 in the first speaker Sp1. The images 534A to 534D of the characters obtained by recognizing the speech of the first speaker Sp1 are displayed in the second character presentation image 532. As shown in FIG. 14, the images 534A to 534D are sequentially displayed from the deep side to the near side of the image display unit 15 in the second speaker Sp2. In FIG. 14, the uttering order is, for example, as follows: image 534A, image 524A, image 534B, image 524B, image 534C, image 524C, and image 534D); generating a second translation result for a second voice pronounced at a second time point after the first time point (Nakadai et al. paragraphs [0111] and [0112]), and wherein the outputting of the translation results includes outputting the first translation result prior to the second translation result (Nakadai et al. paragraphs [0111] and [0112]; Kamatani et al. [0070] Since the end of the utterance 301 is earlier than the start of the utterance 303, and the speaker of the utterance 301 is different from that of the utterance 303, the translated text of the utterance 303 is displayed immediately after the translated text of the utterance 301).

5. Claims 4-5 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Nakadai et al. (US 2015/0154957 A1) in view of Kamatani et al. (US 2016/0085747 A1) and Adsumilli (US 9,749,738 B1).

With respect to Claim 4, Nakadai et al. in view of Kamatani et al. teach all the limitations of Claim 3, upon which Claim 4 depends. Nakadai et al. in view of Kamatani et al. fail to explicitly teach wherein the voice processing circuit is configured to: judge the voice source positions of the respective voices based on a time delay among a plurality of voice signals generated from the plurality of microphones, and generate the separated voice signals based on the judged voice source positions.

However, Adsumilli teaches wherein the voice processing circuit is configured to: judge the voice source positions of the respective voices based on a time delay among a plurality of voice signals generated from the plurality of microphones (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position), and generate the separated voice signals based on the judged voice source positions (Adsumilli col. 10, lines 1-28: "the audio source separation module 232 may receive source information about the number of expected source signals, the audio characteristics of the source signals, or the position of the audio sources" to "separate signals into estimated source signals").

Nakadai et al., Kamatani et al., and Adsumilli are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of separating the sound sources based on the sound source positions as taught by Nakadai et al., using the teaching of determining the start/end of the utterances from the plurality of speakers in the conversation as taught by Kamatani et al. for the benefit of displaying the translated texts of the utterances chronologically, and using the teaching of delay as taught by Adsumilli for the benefit of estimating sound source position (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position).

With respect to Claim 5, Nakadai et al. in view of Kamatani et al. teach all the limitations of Claim 3, upon which Claim 5 depends. Nakadai et al. in view of Kamatani et al. fail to explicitly teach wherein the voice processing circuit is configured to: generate voice source position information representing the voice source positions of the voices based on a time delay among a plurality of voice signals generated from the plurality of microphones, and match and store, in the memory, the voice source position information for the voices with the separated voice signals for the voices.

However, Adsumilli teaches wherein the voice processing circuit is configured to: generate voice source position information representing the voice source positions of the voices based on a time delay among a plurality of voice signals generated from the plurality of microphones (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position), and match and store, in the memory, the voice source position information for the voices with the separated voice signals for the voices (Adsumilli col. 15, lines 51-57: "The set of audio source signals and their associated time-varying positions may compose a spatial audio scene," which may be provided "to other modules or devices to allow them to synthesize audio from the spatial audio scene"; sending the results to other modules or subsystems for further processing is considered "storing," since other modules or subsystems would have to hold/store the data for further processing).

Nakadai et al., Kamatani et al., and Adsumilli are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of separating the sound sources based on the sound source positions as taught by Nakadai et al., using the teaching of determining the start/end of the utterances from the plurality of speakers in the conversation as taught by Kamatani et al. for the benefit of displaying the translated texts of the utterances chronologically, and using the teaching of delay as taught by Adsumilli for the benefit of estimating sound source position and separating the sound sources based on the sound source positions (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position; col. 15, lines 51-57: "The set of audio source signals and their associated time-varying positions may compose a spatial audio scene," which may be provided "to other modules or devices to allow them to synthesize audio from the spatial audio scene").

With respect to Claim 11, Nakadai et al. in view of Kamatani et al. teach all the limitations of Claim 9, upon which Claim 11 depends. Nakadai et al. in view of Kamatani et al. fail to explicitly teach wherein the generating of the separated voice signals comprises: judging the voice source positions of the respective voices based on a time delay among a plurality of voice signals generated from the plurality of microphones; and generating the separated voice signals based on the judged voice source positions.

However, Adsumilli teaches wherein the generating of the separated voice signals comprises: judging the voice source positions of the respective voices based on a time delay among a plurality of voice signals generated from the plurality of microphones (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position); and generating the separated voice signals based on the judged voice source positions (Adsumilli col. 10, lines 1-28: "the audio source separation module 232 may receive source information about the number of expected source signals, the audio characteristics of the source signals, or the position of the audio sources" to "separate signals into estimated source signals").

Nakadai et al., Kamatani et al., and Adsumilli are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of separating the sound sources based on the sound source positions as taught by Nakadai et al., using the teaching of determining the start/end of the utterances from the plurality of speakers in the conversation as taught by Kamatani et al. for the benefit of displaying the translated texts of the utterances chronologically, and using the teaching of delay as taught by Adsumilli for the benefit of estimating sound source position (Adsumilli col. 11, lines 11-54: gain and delays are used to estimate sound source position).

Conclusion

6. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See the attached PTO-892.
a. Aue et al. (US 2015/0347399 A1). In this reference, Aue et al. disclose a method and a system for generating, separately from the translation of the source user's speech, a further translation of the target user's speech in the source language to be transmitted to the source user.
b. Murthy et al. (US 2016/0350286 A1). In this reference, Murthy et al. disclose a method and a system for translating different languages in the vehicle.
c. Ochiai et al. (US 2023/0067132 A1). In this reference, Ochiai et al. disclose a method and a system for extracting a separated signal from a mixed speech signal by a beamformer.

7. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

8. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE, whose telephone number is (571) 272-6429. The examiner can normally be reached Mon-Fri, 9am-5pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew C. Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THUYKHANH LE/
Primary Examiner, Art Unit 2655
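The combination turns on one limitation: translation results must be output in the order the voices were pronounced, even if translations finish out of order. A minimal sketch of that behavior follows; the names and data are illustrative, not code from the application or the cited references.

```python
# Minimal sketch of the disputed limitation: output translation results
# in the order the voices were pronounced. Names and data are illustrative;
# this is not code from the application or the cited references.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    start_time: float   # pronouncing time point, seconds from session start
    translation: str    # translation result for the separated voice signal

def ordered_translations(utterances: list[Utterance]) -> list[str]:
    """Return translation results sorted by pronouncing time point."""
    return [u.translation for u in sorted(utterances, key=lambda u: u.start_time)]

# The second speaker's utterance arrives (and finishes translating) first,
# but the output order still follows the pronouncing order.
log = [
    Utterance("Sp2", 3.2, "Nice to meet you."),
    Utterance("Sp1", 1.5, "Hello."),
]
print(ordered_translations(log))  # ['Hello.', 'Nice to meet you.']
```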
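Claims 4-5 and 11 add judging voice source positions from a time delay among microphone signals. The sketch below shows the generic cross-correlation time-difference-of-arrival (TDOA) estimate that this kind of limitation describes; it is a textbook approach under assumed parameters, not Adsumilli's implementation.

```python
# Generic TDOA sketch: estimate the delay between two microphone channels
# by cross-correlation, then convert it to a far-field arrival angle.
# Illustrates the claimed technique in general; this is not the method of
# Adsumilli (US 9,749,738 B1), and all parameters are assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def estimate_delay(sig_a: np.ndarray, sig_b: np.ndarray, fs: float) -> float:
    """Delay of sig_b relative to sig_a, in seconds, via cross-correlation."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)  # positive: sig_b lags sig_a
    return lag / fs

def arrival_angle(delay_s: float, mic_spacing_m: float) -> float:
    """Far-field arrival angle (degrees from broadside) for a 2-mic array."""
    sin_theta = np.clip(delay_s * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Synthetic check: a pulse reaching mic B five samples after mic A.
fs = 16_000.0
a = np.zeros(256); a[100] = 1.0
b = np.zeros(256); b[105] = 1.0
d = estimate_delay(a, b, fs)                       # ~3.125e-4 s
print(round(d * 1e3, 3), "ms,", round(arrival_angle(d, 0.2), 1), "deg")
```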

Prosecution Timeline

Mar 28, 2023: Application Filed
Mar 08, 2025: Non-Final Rejection — §103
Jun 13, 2025: Response Filed
Jul 11, 2025: Final Rejection — §103
Oct 15, 2025: Request for Continued Examination
Oct 16, 2025: Response after Non-Final Action
Oct 31, 2025: Non-Final Rejection — §103
Jan 15, 2026: Examiner Interview Summary
Jan 15, 2026: Applicant Interview (Telephonic)
Jan 22, 2026: Response Filed
Feb 27, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597413 — ELECTRONIC DEVICE AND CONTROL METHOD THEREOF
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12592218 — COMMUNICATION DEVICE, COMMUNICATION METHOD, AND NON-TRANSITORY STORAGE MEDIUM
Granted Mar 31, 2026 (2y 5m to grant)

Patent 12592239 — ACTIVE VOICE LIVENESS DETECTION SYSTEM
Granted Mar 31, 2026 (2y 5m to grant)

Patent 12586577 — AUTOMATIC SPEECH RECOGNITION USING MULTIPLE LANGUAGE MODELS
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12579365 — INFORMATION ACQUISITION METHOD AND APPARATUS, DEVICE, AND MEDIUM
Granted Mar 17, 2026 (2y 5m to grant)
Based on this examiner's 5 most recent grants; reviewing what changed in those prosecutions can show what persuades this examiner.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 78% (99% with interview, +37.1%)
Median Time to Grant: 2y 9m
PTA Risk: High

Based on 393 resolved cases by this examiner. Grant probability derived from career allow rate.
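The report does not state how the with-interview figure is combined with the base rate. One plausible reading, shown below as a hedged sketch, is an additive lift capped near certainty; both the additive model and the 99% cap are assumptions, not the report's disclosed formula.

```python
# Hedged sketch: one plausible way the "with interview" projection could be
# derived from the panel's own numbers. The additive model and the 99% cap
# are assumptions; the report does not disclose its formula.
def with_interview(base_pct: float, lift_pct: float, cap_pct: float = 99.0) -> float:
    """Base grant probability plus interview lift, capped below certainty."""
    return min(base_pct + lift_pct, cap_pct)

print(with_interview(78.0, 37.1))  # 99.0, matching the displayed figure
```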

Free tier: 3 strategy analyses per month