Detailed Action
This communication is in response to the Arguments and Amendments filed on 10/01/2025. Claims 1-20 are pending and have been examined. Claims 1, 10, and 19 are independent. No amendments to the independent claims have been made; dependent Claims 4 and 13 have been amended. Claims 1-20 are rejected.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.
Response to Amendment
The Applicants have not amended the independent claims.
The Applicants have amended the dependent claims to include “decoder; and wherein the updating the speech normalizer comprises updating the speech normalizer without an intermediate step of generating text transcriptions associated with the first discrete speech units and the second discrete speech units.”
Regarding the 35 U.S.C. § 101 rejection, Applicant notes that if the claim recites a judicial exception, the claim requires further analysis under Prong Two. In Prong Two, Examiners evaluate whether the claim recites additional elements that integrate the exception into a practical application of that exception. If the recited exception is integrated into a practical application, then the claim is eligible at Prong Two of Step 2A, and the claim passes muster under Section 101. In this regard, under the updated guidance, the "directed to" inquiry under Step 2A turns on whether the alleged abstract idea is "integrated into a practical application."
Moreover, the revised Step 2A specifically excludes consideration of whether the additional elements represent well-known, routine, conventional activity; Examiners should give weight to all additional elements, whether or not they are well-known or conventional, when evaluating whether a judicial exception has been integrated into a practical application. (See page 15 of the October 2019 Update: Subject Matter Eligibility.)
Examiner notes the claims do not contain additional elements that integrate the exception into a practical application of that exception.
Applicant notes that, under the October 2019 Update to Subject Matter Eligibility Guidance, the alleged abstract idea allegedly pertaining to the claims is indeed integrated into a practical application. A showing that the claimed invention provides "an improvement in the functioning of a computer or an improvement to another technology or technical field" demonstrates that the claims are integrated into a practical application. (See pages 11-12 of the October 2019 Update: Subject Matter Eligibility Guidance.)
Examiner notes the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of a processor, memory, and storage amount to a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed towards insignificant extra-solution activity. The claims are not patent eligible.
Applicant notes that Applicant's originally-filed specification describes that non-limiting exemplary aspects relate to "speech processing." (See paragraph [0002] of the specification.)
MPEP § 2106, which guides the Examiner in such matters, indicates the following. "Limitations the courts have found indicative that an additional element (or combination of elements) may have integrated the exception into a practical application include:
- An improvement in the functioning of a computer, or an improvement to other
technology or technical field, as discussed in MPEP §§ 2106.04(d)(1) and 2106.05(a)".
(emphasis added). The Applicant's application indeed discloses exemplary non-limiting embodiments that provide an improvement in the functioning of a computer, an improvement to technology or an improvement to a technical field. Applicant points out that the specification need only "describe the invention such that the improvement would be apparent to one of ordinary skill in the art," which is the case here as pointed out below.
Examiner notes the claims do not contain additional elements sufficient to integrate the judicial exception into a practical application; a generic computer component cannot provide an inventive concept.
Applicant notes that independent claim 1 recites, inter alia, "generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein the plurality of first discrete speech units are associated with a speech cluster."
Examiner notes the claim limitations do not specify how the speech learning model is trained or how it differs from generic speech learning models.
Applicant further notes that claim 1 recites "accessing one or more second utterances of the content by one or more second speakers different from the first speaker," "training a speech normalizer by: processing the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units," and "updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances." Applicant notes that at least paragraph [0014] of Applicant's originally-filed specification describes "[t]o use real-world speech data to train the textless speech-to-speech translation model, one challenge may be that the utterances of the same content by different people could sound different (e.g., their sound wave may look different due to accent). To address this issue, a normalizer may be trained and then used as a pre-processing step to clean the training speech data so that the speech signals would look roughly the same when different people utter the same content. The normalizer may use self-supervised discrete representations from a reference speaker's speech and finetune a pre-trained speech encoder with paired audio from multiple speakers and the reference speaker to remove the variations, while maintaining the content."
Examiner notes the claims do not specify how the trained speech normalizer is used in real-world application or how it differs from generic trained speech normalizers.
Applicant notes that, additionally, at least paragraph [0016] of Applicant's originally-filed specification describes exemplary non-limiting aspects in which "[t]he embodiments disclosed herein present a textless speech-to-speech translation (S2ST) system that may translate speech from one language into another language and may be built without the need of any text data. Different from existing work in the literature, the embodiments disclosed herein tackle the challenge in modeling multi-speaker target speech and train the systems with real world speech-to-speech translation (S2ST) data. The key to our approach may comprise a self-supervised unit-based speech normalization technique, which may finetune a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content.
Examiner notes the claims do not provide limitations as to how the model is trained or the normalizer is used to provide the speech translation.
Applicant notes that the quoted passage continues: "With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on a first experimental multilingual S2ST dataset, compared to a baseline trained on un-normalized speech target. The embodiments disclosed herein also incorporate automatically mined speech-to-speech translation (S2ST) data and show an additional 2.0 BLEU gain. To our knowledge, the embodiments disclosed herein may be the first to establish a textless speech-to-speech translation (S2ST) technique that may be trained with real world data and may work for multiple language pairs." (emphasis added).
Examiner notes these attributes are not described in the claim limitations.
Applicant notes that, in addition, paragraph [0020] of Applicant's originally-filed specification describes "[t]o tackle the challenge of modeling real target speech where there are multiple speakers with various accents, speaking styles and recording conditions, etc., the embodiments disclosed herein propose a speech normalization technique that finetunes a self-supervised pre-trained model for speech with a limited amount of parallel multiple-to-single speaker speech. Experiments on four language pairs show that when trained with the normalized target speech obtained from a speech normalizer trained with 10-min parallel data, the performance of a textless S2ST model can be improved by 3.2 BLEU points on average compared with a baseline with un-normalized target speech."
Applicant concludes that the claimed invention thus provides several technical benefits and technical improvements regarding speech technology as well as speech processing, and that providing accurate and finetuned speech content that reduces undesirable/unwanted variations in speech provides technical solutions to technical problems.
Examiner notes the claims do not provide these limitations.
Regarding the 35 U.S.C. § 101 rejection, Applicant's arguments and amendments do not overcome the rejection.
Regarding the 35 U.S.C. § 103 rejections, Applicant notes that Biadsy and Chun, taken individually or in combination, fail to teach or suggest a method comprising, inter alia, "generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein the plurality of first discrete speech units are associated with a speech cluster," "accessing one or more second utterances of the content by one or more second speakers different from the first speaker," and "updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances," as recited in independent claim 1.
Examiner notes that Biadsy and Chun do teach these limitations.
Applicant notes further that the Office indicates that Chun, at Col. 5:14-27, discloses "updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances." (See page 10 of the Office Action.) Applicant argues, however, that Chun does not.
Examiner notes Chun does teach this limitation. In this passage of Chun, the estimation of a speaker transform and the training process, which includes normalizing with respect to the speaker characteristics of the particular speaker, imply that the normalizer is not static and is updated.
Applicant notes, however, that speech data 120 in a first language being converted to speech data 150 in a second language generally based on a speech model 135 for the second language, the model being obtained by a training process 130 using speech data 125 in the second language from a plurality of speakers generally speaking the second language, as at best described by Chun alone or in combination, does not teach or suggest "updating the speech normalizer by using the plurality of first discrete speech units" "from the first utterance" "of a content by a first speaker" "as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances" "of the content by one or more speakers different from the first speaker," as recited in independent claim 1.
Examiner notes that Chun uses the first speaker as the optimization target: Chun uses the speaker embedding, generates speech in the second language using the voiceprint, and compares the voiceprint back to the original.
Regarding the 35 U.S.C. § 103 rejections, Applicant's arguments and amendments do not overcome the prior art rejections.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding independent Claim 1, the claim recites
“1. A method comprising: accessing a first utterance of a content by a first speaker;” This relates to a human using natural language understanding and the auditory system and/or the human mind to access an utterance.
“generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein the plurality of first discrete speech units are associated with a speech cluster;” This relates to a human using natural language understanding to generate speech units associated with a speech cluster using the human mind and/or pen and paper.
“accessing one or more second utterances of the content by one or more second speakers different from the first speaker;” This relates to a human using natural language understanding and the auditory system and/or the human mind to access an utterance.
“and training a speech normalizer by: processing the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units;” This relates to a human using human speech to generate a plurality of discrete speech units.
“and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances.” This relates to a human updating speech, using normalization techniques such as accent, volume, or tuning adjustments, with the first speech as a target.
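For orientation only, the recited training flow can be sketched in code. The following Python sketch is illustrative, not a representation of Applicant's actual implementation; the helpers speech_learning_model, cluster_to_units, and normalizer are hypothetical stand-ins, and the claim does not specify a loss function or how the first- and second-speaker unit sequences are aligned (a sequence-level loss would be needed in practice).

```python
# Illustrative sketch of the training flow recited in claim 1 (hypothetical
# helpers; not Applicant's implementation). Assumes unit sequences of equal
# length; real systems would need an alignment or sequence-level loss.
import torch
import torch.nn.functional as F

def train_speech_normalizer(first_utterance, second_utterances,
                            speech_learning_model, cluster_to_units,
                            normalizer, optimizer):
    # Generate first discrete speech units from the first speaker's utterance:
    # model representations, then cluster assignment (the "speech cluster").
    with torch.no_grad():
        reps = speech_learning_model(first_utterance)   # (T, D) features
        target_units = cluster_to_units(reps)           # (T,) discrete unit ids

    # Train the speech normalizer: process each second utterance to produce
    # second discrete speech units, then update the normalizer using the
    # first speaker's units as the optimization target.
    for utt in second_utterances:
        logits = normalizer(utt)                        # (T, n_units)
        loss = F.cross_entropy(logits, target_units)    # first units as target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```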
Regarding independent Claim 10, claim 10 is a CRM claim which recites limitations similar to that of claim 1 and is rejected under the same rationale.
Regarding independent Claim 19, claim 19 is a System claim which recites limitations similar to that of claim 1 and is rejected under the same rationale.
This judicial exception is not integrated into a practical application. In particular, claims 10 and 19 recite the additional elements of a “processor,” “memory,” and “computer-readable medium,” as per the independent claims. For example, paragraph [0063] of the as-filed specification describes a generic computer: “In particular embodiments, computer system 600 includes a processor 602, memory 604, storage 606, an input/output (I/O) interface 608, a communication interface 610, and a bus 612.” Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of a processor, memory, and storage amount to a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed towards insignificant extra-solution activity. The claims are not patent eligible.
With respect to claim 2, the claim recites “The method of Claim 1, wherein the generating the plurality of first discrete speech units comprises: generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and applying one or more clustering algorithms to the plurality of intermediate representations.” This relates to a human using natural language understanding and human speech to generate intermediate representations and to apply clustering, an applied mathematical concept, using the human mind and/or pen and paper. No additional limitations are present.
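For illustration of the two steps recited in claim 2 (intermediate representations, then clustering into discrete units), a minimal Python sketch follows; the feature array is assumed to come from some pretrained speech model, and the cluster count of 100 is an arbitrary illustrative choice, not a value from the claims or the cited art.

```python
# Hedged sketch: assign each frame-level representation a discrete unit id by
# k-means clustering. In practice the clusterer is typically fit once on a
# large corpus of representations and then reused; fitting per utterance here
# keeps the example self-contained.
import numpy as np
from sklearn.cluster import KMeans

def discretize(representations: np.ndarray, n_units: int = 100) -> np.ndarray:
    """representations: (n_frames, dim) intermediate representations."""
    km = KMeans(n_clusters=n_units, n_init=10).fit(representations)
    return km.predict(representations)  # (n_frames,) discrete speech units
```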
With respect to claim 3, the claim recites “The method of Claim 1, further comprising: reducing one or more repeating first content units from the plurality of first content units.” This relates to a human reducing a unit using speech or pen and paper. No additional limitations are present.
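The reduction of repeating units recited in claim 3 amounts to collapsing consecutive duplicate unit ids. A minimal, self-contained Python illustration (not drawn from the cited art):

```python
# Collapse runs of repeated discrete unit ids, e.g. [5, 5, 9, 9, 9, 5] -> [5, 9, 5].
from itertools import groupby

def reduce_repeats(units):
    return [unit for unit, _run in groupby(units)]

assert reduce_repeats([5, 5, 9, 9, 9, 5, 2]) == [5, 9, 5, 2]
```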
With respect to claim 4, the claim recites “The method of Claim 1, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder; and wherein the updating the speech normalizer comprises updating the speech normalizer without an intermediate step of generating text transcriptions associated with the first discrete speech units and the second discrete speech units.” This relates to a human using natural language understanding to learn and finetune speech and voice to mimic a speaker. No additional limitations are present.
With respect to claim 5, the claim recites “The method of Claim 1, further comprising: accessing a third utterance by a third speaker;” This relates to a human accessing an utterance using memory or auditory systems. “and processing the third utterance using the trained speech normalizer to generate a plurality of normalized speech units.” This relates to a human speaking. No additional limitations are present.
With respect to claim 6, the claim recites “The method of Claim 5, further comprising: anonymizing the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units.” This relates to a human removing an accent, tone, or pitch to anonymize speech. No additional limitations are present.
With respect to claim 7, the claim recites “The method of Claim 5, further comprising: denoising the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units.” This relates to a human performing natural denoising in the human mind to filter out unwanted sounds. No additional limitations are present.
With respect to claim 8, the claim recites “The method of Claim 5, further comprising: removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units.” This relates to a human pausing the human voice. No additional limitations are present.
With respect to claim 9, the claim recites “The method of Claim 1, further comprising: processing a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units;” This relates to a human generating a plurality of normalized target speech units. “and training a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.” This relates to a human learning natural speech processing and translation in the human mind. No additional limitations are present.
Claim 11 is a computer-readable medium claim with limitations similar to the limitations of Claim 2 and is rejected under similar rationale.
Claim 12 is a computer-readable medium claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Claim 13 is a computer-readable medium claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
With respect to claim 16 the claim cites 16. The computer-readable medium of Claim 5, wherein the instructions, when executed, further cause: denoising the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units. This relates to a human performing natural denoising in the human mind to filter out unwanted sounds. No additional limitations present.
With respect to claim 17, the claim cites 17. The computer-readable medium of Claim 5, wherein the instructions, when executed, further cause: removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units. This relates to a human pausing speech. No additional limitation is present.
Claim 18 is a computer-readable medium claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.
Claim 20 is a system claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 9-13, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Biadsy (US Patent Application Publication No. US 2022/0122579 A1) in view of Chun (US Patent No. 9,922,641 B1).
Regarding independent Claim 1, Biadsy teaches
1. A method comprising: accessing a first utterance of a content by a first speaker; (see Biadsy, Figure 1, “British User 104” providing the first utterance, shown as “can I make an appointment for tomorrow 108”) (see Biadsy [0029] “The speech to speech conversion server 112 provides the audio data of synthesized utterances 138 and the audio data of utterances 134 to the model trainer 140. The model trainer 140 trains the model 124 using machine learning techniques….”) training a speech normalizer by: (see Biadsy, Figure 1, “Model Trainer 140” training the “Model 124”) generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein the plurality of first discrete speech units are associated with a speech cluster; (see Biadsy [0006] “The source speech can be from any speaker or accent, and may contain complex prosodic patterns” (examiner interprets speech cluster as “complex prosodic patterns”)) (see Biadsy [0029] “The speech to speech conversion server 112 provides the audio data of synthesized utterances 138 and the audio data of utterances 134 to the model trainer 140. The model trainer 140 trains the model 124 using machine learning techniques….”) (see Biadsy [0024] “The spectrogram decoder 130 generates five frames of audio data 106 of the synthesized utterance 114 that includes the same words or parts of words as the five frames of audio data, but with a different voice than the user 104.” (examiner interprets discrete speech units as “parts of words”))
Biadsy does not specifically teach accessing one or more second utterances of the content by one or more second speakers different from the first speaker; However, Chun does teach this limitation (see Chun (4:56-5:4) “(14) In some implementations, the universal speech model can be used in cross-lingual speaker adaptation for multi-lingual speech synthesis. This is schematically shown in FIG. 1 using an example where speech data 120 in a first language is converted to speech data 150 in a second language such that the speech data 150 in the second language includes speaker characteristics of a first speaker from whom the speech data 120 originates. The speech data 150 in the second language is synthesized using a speaker independent speech model 135 for the second language. The speech model 135 can be, for example, a HMM based speech model. The speech model 135 for the second language can be obtained via a training process 130 using speech data 125 in the second language from a plurality of speakers speaking the second language.”) and processing the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and (see Chun (2:4-24) “(7) The universal speech model can include a Gaussian mixture model that represents a plurality of speakers speaking one or more languages. The universal speech model can include a plurality of speech parameters estimated based on speech from the plurality of speakers. The speaker-independent speech synthesis model can include a plurality of hidden Markov models (HMMs). A training engine can be configured to train the plurality of HMMs. The plurality of HMMs can be trained by normalizing speech data from a second speaker speaking the second language, and by using a second speaker transform that represents speaker characteristics of the second speaker. The second speaker transform can be estimated from the speech data of the second speaker, using the universal speech model. Transcription data can be generated from the input speech data by a speech recognition engine. The transcription data can be translated by a translation engine from the first language to the second language. The speech in the second language can be generated based on the translated data. Text data can be accessed in the second language and the speech can be generated based on the accessed text data.”) updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances. (see Chun (5:14-27) “(16) In order to obtain a speaker-independent speech model 135, the speech data 125 from the plurality of speakers can be normalized with respect to suitable speaker transforms. For example, the speaker characteristics of a particular speaker speaking the second language can be analyzed to estimate a speaker transform 132 such that the estimated speaker transform 132, when applied to speech parameters from the universal speech model 105, produces the speaker characteristics of the particular speaker. The training process 130 can include normalizing the corresponding speech data 125 with the estimated speaker transform 132. This way, the training process 130 can be performed using speaker-independent speech data to obtain the speaker-independent speech model 135 for the second language.”) (see Chun (3:65-4:21) “(11) FIG. 1 shows a schematic diagram 100 representing an example of cross-lingual speaker adaptation for multi-lingual speech synthesis. The scheme represented in FIG. 1 can include a universal speech model 105 along with various speaker transforms. The speaker transforms represent speaker characteristics associated with various speakers of different languages. The universal speech model 105 can be obtained, for example, through a training process 110 based on speech data 115 from multiple speakers in multiple languages. In some implementations, training speech data from a large number of speakers (for example, hundreds or thousands of speakers) can be used in the training process 110 to obtain the universal speech model. The universal speech model (which may also be referred to as a universal background model) 105 can be represented, for example, using a Gaussian mixture model (GMM) or a hidden Markov model (HMM). In some implementations, information on phonemes (examiner interprets units as “phonemes”) from different languages and transcription can be avoided by representing the universal speech model as a GMM. The universal speech model can also be represented as other models, including, for example, hidden semi-Markov models (HSMM), higher order Markov models, segment models, or other acoustic models.”)
Biadsy and Chun are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Biadsy to incorporate the teachings of Chun to include accessing one or more second utterances of the content by one or more second speakers different from the first speaker, processing the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units, and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target associated with the plurality of second discrete speech units associated with the one or more second utterances. Doing so allows the speech synthesizer to speak with different voice characteristics, as recognized by Chun in (2:25-40).
As to Claim 2, Biadsy in view of Chun teaches: 2. The method of Claim 1,
Furthermore, Biadsy teaches wherein the generating the plurality of first discrete speech units comprises: (see Biadsy [0091] “… The processor 602 can process instructions for execution within the computing device 600”) generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and (see Biadsy [0083] “The system uses the decoder network comprised of an autoregressive RNN to predict the output spectrogram from the encoded input sequence one frame at a time. The prediction from the previous decoder time step is first passed through a small pre-net containing two fully connected layers of 256 ReLU units, which may help to learn attention. The pre-net output and attention context vector may be concatenated and passed through a stack of two unidirectional LSTM layers with 1024 units. The concatenation of the LSTM output and the attention context vector is then projected through a linear transform to produce a prediction of the target spectrogram frame. Finally, these predictions are passed through 5-layer convolutional post-net which predicts a residual to add to the initial prediction. Each post-net layer has 512 filters shaped 5×1 followed by batch normalization and tanh activation.”) (see Biadsy [0084] “To synthesize an audio signal from the predicted magnitude spectrogram, the system uses the Griffin-Lim algorithm to estimate a phase consistent with the predicted magnitude, followed by an inverse STFT. In some implementations, neural vocoders such as WaveNet may produce improved synthesis quality. In some implementations, WaveNet could replace Griffin-Lim.”) applying one or more clustering algorithms to the plurality of intermediate representations. (see Biadsy [0071-0072] “... The network is composed of an encoder, a spectrogram decoder, and a phoneme decoder, followed by a vocoder to synthesize a time-domain waveform. (examiner interprets clustering algorithms as “the network” used to model “prosodic patterns”) This model can be trained to normalize speech from any speaker even for speech that includes accents, emotions, complex prosodic patterns, imperfections, and background noise, into the voice of a clean single predefined target speaker with a fixed accent and consistent articulation and prosody. This document describes the impact of this approach on speech recognition performance. Moreover, this document demonstrates that the same architecture can be trained on a speech separation task. In some implementations, the end-to-end speech-to-speech model can translate Spanish speech into synthesized English speech. [0072] Encoder-decoder models with attention may be used in modeling a variety of complex sequence-to-sequence problems. These models may be used for speech and natural language processing, such as machine translation, speech recognition, and combined speech translation. The models may also be used in end-to-end Text-To-Speech (TTS) synthesis and Automatic Speech Recognition (ASR), using a single neural network that directly generates the target sequences, given virtually raw inputs.”)
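Biadsy's paragraph [0084], quoted above, describes estimating phase from a predicted magnitude spectrogram with the Griffin-Lim algorithm followed by an inverse STFT. A minimal sketch using librosa's public implementation follows; the example fakes the "predicted" spectrogram by taking the magnitude of a real signal's STFT, since no trained decoder is available here.

```python
# Hedged sketch of magnitude-spectrogram-to-waveform synthesis via Griffin-Lim,
# per the mechanism Biadsy [0084] describes (not Biadsy's code). The bundled
# librosa example audio is downloaded on first use.
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude only; phase discarded
y_hat = librosa.griffinlim(S, n_iter=32, n_fft=1024, hop_length=256)
# y_hat approximates y: Griffin-Lim iteratively estimates a phase consistent
# with the magnitude, then an inverse STFT yields the time-domain waveform.
```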
As to Claim 3, Biadsy in view of Chun teaches: 3. The method of Claim 1,
Furthermore, Biadsy teaches further comprising: reducing one or more repeating first content units from the plurality of first content units. (see Biadsy [0024] “The speech to speech conversion server 112 receives the audio data 102 of the utterance 108 from the computing device 110 and provides the audio data 102 of the utterance 108 to the model 124. The speech to speech conversion server 112 trains the model 124 to convert the audio data 102 of the utterance 108 spoken in a British accent 122 to audio data 106 of the synthesized utterance 114 in an American accent 120. The speech to speech conversion server 112 does not use a speech recognizer 126 to perform this conversion. The speech recognizer 126 may remain inactive during the conversion process. Instead, the model 124 provides the audio data 102 of the utterance 108 to an encoder 128. The encoder 128 may be configured to convert the audio data 102 of the utterance 108 to an internal representation, such as a series of vectors. For example, as the encoder 128 receives the audio data 102 of the utterance 108, the encoder 128 may process five frames of audio and convert those five frames of audio to ten vectors. The vectors are not a transcription of the frames of audio data 102, but rather a mathematical representation of the frames of the audio data 102. The model 124 provides the series of vectors to the spectrogram decoder 130. The spectrogram decoder 130 may be configured to generate audio data of a synthesized utterance based on the vectors received from the encoder 128. For example, the spectrogram decoder 130 may receive the ten vectors from the encoder 128 that represent the five frames of audio. The spectrogram decoder 130 generates five frames of audio data 106 of the synthesized utterance 114 that includes the same words or parts of words as the five frames of audio data, but with a different voice than the user 104..”)
As to Claim 4, Biadsy in view of Chun teaches The method of Claim 1,
Furthermore, Biadsy teaches The method of Claim 1, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder; (see Biadsy [0069] In some implementations, the system may train the model using a collection of utterances received by the system and by other systems. The system obtains a transcription of each utterance in the collection of utterances. The system may generate the transcriptions using automated speech recognition or by manual transcription. The system provides each transcription to a speech synthesizer, or text to speech model, that generates the synthesized utterances in a synthesized voice. The system trains the model using machine learning and the collection of utterances and the corresponding synthesized utterances. The trained model is configured generate a synthesized utterance in the same synthesized voice based on receiving an utterance spoken by a user. The trained model does not use speech recognition to generate the synthesized utterance..”) and wherein the updating the speech normalizer comprises updating the speech normalizer without an intermediate step of generating text transcriptions associated with the first discrete speech units and the second discrete speech units. (see Biadsy [0024] The speech to speech conversion server 112 receives the audio data 102 of the utterance 108 from the computing device 110 and provides the audio data 102 of the utterance 108 to the model 124. The speech to speech conversion server 112 trains the model 124 to convert the audio data 102 of the utterance 108 spoken in a British accent 122 to audio data 106 of the synthesized utterance 114 in an American accent 120. The speech to speech conversion server 112 does not use a speech recognizer 126 to perform this conversion. The speech recognizer 126 may remain inactive during the conversion process. Instead, the model 124 provides the audio data 102 of the utterance 108 to an encoder 128. The encoder 128 may be configured to convert the audio data 102 of the utterance 108 to an internal representation, such as a series of vectors. For example, as the encoder 128 receives the audio data 102 of the utterance 108, the encoder 128 may process five frames of audio and convert those five frames of audio to ten vectors. The vectors are not a transcription of the frames of audio data 102, but rather a mathematical representation of the frames of the audio data 102. The model 124 provides the series of vectors to the spectrogram decoder 130. The spectrogram decoder 130 may be configured to generate audio data of a synthesized utterance based on the vectors received from the encoder 128. For example, the spectrogram decoder 130 may receive the ten vectors from the encoder 128 that represent the five frames of audio. The spectrogram decoder 130 generates five frames of audio data 106 of the synthesized utterance 114 that includes the same words or parts of words as the five frames of audio data, but with a different voice than the user 104.”)
As to Claim 9, Biadsy in view of Chun teaches The method of Claim 1.
Furthermore, Biadsy teaches further comprising: processing a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and training a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language. (see Biadsy [0073] “This document describes combining state of the art speech recognition and synthesis models to build a direct end-to-end speech-to-speech sequence transducer which generates a speech spectrogram as a function of a different input spectrogram, without depending on an intermediate discrete representation. The model may first be applied to voice normalization and speech separation tasks. This model can be used to directly translate one language to another, for example, from Spanish speech into English speech.”)
Regarding independent Claim 10: Claim 10 is a CRM claim with limitations similar to the limitations of Claim 1, which are rejected under similar rationale. Additionally, Biadsy teaches One or more computer-readable non-transitory storage media embodying software that is operable when executed to: (see Biadsy [0105] “These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.”)
Claim 11 is a computer-readable medium claim with limitations similar to the limitations of Claim 2 and is rejected under similar rationale.
Claim 12 is a computer-readable medium claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Claim 13 is a computer-readable medium claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
Claim 18 is a computer-readable medium claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.
Regarding independent Claim 19: Claim 19 is a computing system claim with limitations similar to the limitations of Claim 1, which are rejected under similar rationale. Additionally, Biadsy teaches A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: (see Biadsy [0105] “These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.”) (see Biadsy [0091] “The computing device 600 includes a processor 602, a memory 604, a storage device 606, a high-speed interface 608 connecting to the memory 604 and multiple high-speed expansion ports 610, and a low-speed interface 612 connecting to a low-speed expansion port 614 and the storage device 606. Each of the processor 602, the memory 604, the storage device 606, the high-speed interface 608, the high-speed expansion ports 610, and the low-speed interface 612, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 602 can process instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, such as a display 616 coupled to the high-speed interface 608. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).”) (see Biadsy [0092] “The memory 604 stores information within the computing device 600. In some implementations, the memory 604 is a volatile memory unit or units. In some implementations, the memory 604 is a non-volatile memory unit or units. The memory 604 may also be another form of computer-readable medium, such as a magnetic or optical disk.”)
Claim 20 is a system claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.
Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Biadsy (US Patent Application Publication No. US 2022/0122579 A1), in view of Chun (US Patent No. 9,922,641 B1), and further in view of Sato (US Patent Application Publication No. US 2022/0335965 A1).
As to Claim 5, Biadsy in view of Chun teaches The method of Claim 1, (see Claim 1)
Biadsy does not specifically teach further comprising: accessing a third utterance by a third speaker; and processing the third utterance using the trained speech normalizer to generate a plurality of normalized speech units. However, Sato does teach this limitation (see Sato [0098] “The normalization unit 2242 normalizes the norms of the first auxiliary feature (the feature-extracted audio information of the target speaker) (examiner interprets third speaker as “target speaker”), the second auxiliary feature (the feature-extracted video information of the target speaker), and the third auxiliary feature (the feature-extracted other clue information for the target speaker). The normalization unit 2242 normalizes a sample at each time and applies a generally used method such as dividing each component of the vector by the magnitude of the vector as an operation.”) (see Sato [0034] “As illustrated in FIG. 1, the audio signal processing apparatus 10 includes an audio signal processing unit 11, a first auxiliary feature conversion unit 12, a second auxiliary feature conversion unit 13, and an auxiliary information generation unit 14 (a generation unit). A mixed audio signal including audio from a plurality of sound sources is input to the audio signal processing apparatus 10. Further, an audio signal of a target speaker and video information of speakers at the time of recording the input mixed audio signal are input to the audio signal processing apparatus 10. Here, the audio signal of the target speaker is a signal obtained by recording what the target speaker utters” (examiner notes that in this scenario of a “target speaker”, a first speaker would utter a first utterance, a second speaker would utter a second utterance and a third speaker would utter a third utterance)) (see Sato [0012] “A training apparatus according to the present invention includes a selection unit configured to select a mixed audio signal for training and a plurality of signals relating to processing of an audio signal of a target speaker for training from training data, an auxiliary feature conversion unit configured to convert the plurality of signals relating to processing of the audio signal of the target speaker for training into a plurality of auxiliary features for the plurality of signals using a plurality of auxiliary neural networks, an audio signal processing unit configured to estimate information regarding processing of an audio signal of the target speaker included in the mixed audio signal for training using a main neural network based on a feature of the mixed audio signal for training and the plurality of auxiliary features, and an update unit configured to update parameters of neural networks and cause the selection unit, the auxiliary feature conversion unit, and the audio signal processing unit to repeatedly execute processing until a predetermined criterion is satisfied to set the parameters of the neural networks satisfying the predetermined criterion.”) (see Sato [0096] “The auxiliary information generation unit 224 generates a weighted sum of the first auxiliary feature, the second auxiliary feature, and the third auxiliary feature, multiplied by corresponding attentions, using a neural network while referring to the first intermediate feature, and outputs the weighted sum to the integration unit 2212 as an auxiliary feature. FIG. 7 is a diagram illustrating an example of a configuration of the auxiliary information generation unit 224 illustrated in FIG. 5. As illustrated in FIG. 7, the auxiliary information generation unit 224 includes an attention calculation unit 2241, a normalization unit 2242, (examiner notes that the normalization unit is included in the auxiliary information generation unit which uses a neural network) an aggregation unit 2243, and a scaling unit 2244.”)
Biadsy in view of Chun and Sato are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of claim 1 of Biadsy and Chun to incorporate the teachings of Sato to include accessing a third utterance by a third speaker and processing the third utterance using the trained speech normalizer to generate a plurality of normalized speech units. Doing so allows the audio signal of the target speaker included in the mixed audio signal to be estimated with stable accuracy, as recognized by Sato in [0013].
As to Claim 14, claim 14 is a computer-readable medium claim with limitations similar to that of claim 5 and is rejected under the same rationale.
Claims 6, 7, 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Biadsy (US Patent Application Publication No. US 2022/0122579 A1), in view of Chun (US Patent No. 9,922,641 B1), further in view of Sato (US Patent Application Publication No. US 2022/0335965 A1), and further in view of Wittenstein (US Patent No. 8,335,689 B2).
As to Claim 6, Biadsy in view of Chun and Sato teaches The method of Claim 5 (see Claim 5), the third speaker (see Claim 5), and speech characteristics specific to the third speaker (see Claim 5).
Biadsy in view of Chun and Sato do not specifically teach further comprising: anonymizing based on removing one or more normalized speech units associated with speech characteristics from the plurality of normalized speech units. However, Wittenstein does teach this limitation. (see Wittenstein (11:49-62) “these components may be incorporated in an automatic speech recognizer generalized to separate and recognize different sound sources, including noises, dialects, and voices. To the extent possible with existing technology and in accordance with the noise-transcription, dialect-transcription, and voice-transcription requirements given in transcription job specification 706, it evens out the nonuniformities, by removing those deviations that hinder transcription in general, and by normalizing characteristics where inconsistencies hinder transcription. This normalization also serves to anonymize the source for privacy. Moreover, by normalizing to different norms for different speakers or utterances, the preprocessor can pseudonymize the source for enhanced privacy.”)
Biadsy in view of Chun, Sato, and Wittenstein are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of claim 5 of Biadsy, Chun, and Sato to incorporate the teachings of Wittenstein to include anonymizing based on removing one or more normalized speech units associated with speech characteristics from the plurality of normalized speech units. Doing so allows pseudonymization of the source for enhanced privacy, as recognized by Wittenstein in (11:61-62).
As to Claim 7, Biadsy in view of Chun and Sato teaches 7. The method of Claim 5 (see Claim 5) and the third utterance (see Claim 5).
Biadsy in view of Chun and Sato do not teach further comprising: denoising based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units. However, Wittenstein does teach this limitation. (see Wittenstein (11:49-62) “these components may be incorporated in an automatic speech recognizer generalized to separate and recognize different sound sources, including noises, dialects, and voices. To the extent possible with existing technology and in accordance with the noise-transcription, dialect-transcription, and voice-transcription requirements given in transcription job specification 706, it evens out the nonuniformities, by removing those deviations that hinder transcription in general, and by normalizing characteristics where inconsistencies hinder transcription. This normalization also serves to anonymize the source for privacy. Moreover, by normalizing to different norms for different speakers or utterances, the preprocessor can pseudonymize the source for enhanced privacy.”)
Biadsy in view of Chun, Sato, and Wittenstein are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of claim 5 of Biadsy, Chun, and Sato to incorporate the teachings of Wittenstein to include denoising based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units. Doing so allows evening out of nonuniformities, as recognized by Wittenstein in (11:55-56).
As to Claim 15, Biadsy in view of Chun and Sato teaches The computer-readable medium of Claim 14 (see Claim 14), the third speaker (see Claim 14), speech characteristics specific to the third speaker (see Claim 14), and wherein the software is further operable when executed to (see Claim 14).
Biadsy in view of Chun and Sato do not specifically teach anonymize the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units. However, Wittenstein does teach this limitation. (see Wittenstein (11:49-62) “these components may be incorporated in an automatic speech recognizer generalized to separate and recognize different sound sources, including noises, dialects, and voices. To the extent possible with existing technology and in accordance with the noise-transcription, dialect-transcription, and voice-transcription requirements given in transcription job specification 706, it evens out the nonuniformities, by removing those deviations that hinder transcription in general, and by normalizing characteristics where inconsistencies hinder transcription. This normalization also serves to anonymize the source for privacy. Moreover, by normalizing to different norms for different speakers or utterances, the preprocessor can pseudonymize the source for enhanced privacy.”)
Biadsy in view of Chun, Sato, and Wittenstein are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified The computer-readable medium of Claim 14 of Biadsy, Chun, and Sato to incorporate the teachings of Wittenstein to include anonymizing based on removing one or more normalized speech units associated with speech characteristics from the plurality of normalized speech units. Doing so allows pseudonymization of the source for enhanced privacy, as recognized by Wittenstein in (11:61-62).
As to Claim 16, Biadsy in view of Chun and Sato teaches The computer-readable medium of Claim 15 and the third utterance (see Claim 14).
Biadsy in view of Chun and Sato do not specifically teach wherein the software is further operable when executed to: denoise based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units. However, Wittenstein does teach this limitation. (see Wittenstein (11:49-62) “these components may be incorporated in an automatic speech recognizer generalized to separate and recognize different sound sources, including noises, dialects, and voices. To the extent possible with existing technology and in accordance with the noise-transcription, dialect-transcription, and voice-transcription requirements given in transcription job specification 706, it evens out the nonuniformities, by removing those deviations that hinder transcription in general, and by normalizing characteristics where inconsistencies hinder transcription. This normalization also serves to anonymize the source for privacy. Moreover, by normalizing to different norms for different speakers or utterances, the preprocessor can pseudonymize the source for enhanced privacy.”)
Biadsy in view of Chun, Sato, and Wittenstein are in the same field of endeavor of speech processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified The computer-readable medium of Claim 15 of Biadsy, Chun, and Sato to incorporate the teachings of Wittenstein to include denoising based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units. Doing so allows evening out of nonuniformities, as recognized by Wittenstein in (11:55-56).
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Biadsy (US Patent Application Publication US 2022/0122579 A1), in view of Chun (US Patent US 9,922,641 B1), further in view of Sato (US Patent Application Publication US 2022/0335965 A1), and further in view of Sargin (US Patent US 8,913,103 B1).
As to Claim 8, Biadsy in view of Chun and Sato teach the method of Claim 5 (see Claim 5).
Biadsy in view of Chun and Sato do not specifically teach further comprising: removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units. However, Sargin does teach this limitation. (see Sargin (10:40-11:20) “(46) [E(t) = Σᵢ (x(i)·w(i))²] where x is the audio stream signal and E is the energy of the window. This function sums the energy in the audio signal over "L" time samples after multiplying the input samples by a window function w(i). The window function w can be a Hamming window for example. In step 606, the short time energy E(t) associated with the audio segment is thresholded. If the short time energy E(t) is below a threshold τ, the audio stream at time t is considered as "possibly silent". Because the average amplitude of the audio stream is varying from scenario to scenario and even within the same audio stream, a good threshold should be adapted to the short-time energy detected. One possible adaptation would be to combine the current measure of audio energy E(t) with a weighted previous value such as: τ(t) = α·τ(t-1) + (1-α)·E(t) (3) where the threshold function τ(t) at a given time t is equal to a previous value τ(t-1), weighted with value α, which is a fraction selected from between 0 and 1. When a continuous "possibly silent" region is longer than a pre-determined length, this region is converted from being labeled "possibly silent" to being labeled "silence segment" and is removed from the input audio in step 608. Audio segments which exceed the threshold (≥ τ) are labeled "non-silent" and output in step 610 for further processing. Output from the silence detection process 600 is segmented audio with silent regions removed. Speaker Diarisation (47) Speaker diarisation process 700 takes the output from the silence detection process 600 and identifies discrete speakers. Speaker diarisation process 700 groups segments of an input audio stream into groups according to the speaker identity. The speaker diarisation process extracts audio features, maintains a universal background model (UBM) and performs online speaker clustering. (48) FIG. 7 is a flowchart of an embodiment of the speaker diarisation process 700. Beginning at step 702, the speaker diarisation process 700 first calculates audio features for the non-silent audio segments from silence detection process 600. These features are taken from the set of standard audio features typically used for tasks such as speech recognition and include Mel-Frequency Cepstral Coefficients (MFCC), zero-crossing rate, and pitch, among others. These values are normalized and concatenated into a vector. The segments can overlap in time by as much as 80%. These audio feature vectors are then used to calculate a universal background model for audio feature vectors in step 704.”)
Biadsy, Chun, Sato, and Sargin are in the same field of endeavor of speech processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Claim 5 of Biadsy, Chun, and Sato to incorporate the teachings of Sargin to include removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units. Doing so allows identifying participants' speaking states based on the audio speaking state, as recognized by Sargin at (5:62-63).
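For illustration only, the short-time-energy silence removal that Sargin describes at (10:40-11:20) can be sketched as follows. This is a minimal Python sketch; the function name and the numeric defaults (window length, hop size, alpha, minimum-silence length) are assumed placeholders, not values taken from Sargin.

import numpy as np

def remove_long_silences(x, win_len=400, hop=160, alpha=0.95, min_silent_frames=25):
    # Short-time energy E(t): sum of squared, windowed samples over L points.
    w = np.hamming(win_len)          # window function w(i); Hamming, per Sargin
    tau = None                       # adaptive threshold tau(t)
    silent = []                      # per-frame "possibly silent" labels
    for start in range(0, len(x) - win_len + 1, hop):
        energy = np.sum((x[start:start + win_len] * w) ** 2)
        # Equation (3): tau(t) = alpha*tau(t-1) + (1-alpha)*E(t)
        tau = energy if tau is None else alpha * tau + (1 - alpha) * energy
        silent.append(energy < tau)

    # Convert runs of "possibly silent" frames longer than the minimum
    # length into "silence segment" and remove them from the input audio.
    keep = np.ones(len(x), dtype=bool)
    run_start = None
    for i, is_silent in enumerate(silent + [False]):  # sentinel closes the last run
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            if i - run_start > min_silent_frames:
                keep[run_start * hop : (i - 1) * hop + win_len] = False
            run_start = None
    return x[keep]

On this sketch, a region is dropped only after its run of below-threshold frames exceeds the minimum length, mirroring Sargin's conversion from "possibly silent" to "silence segment" before removal.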
As to Claim 17, Biadsy in view of Chun and further in view of Sato teach the computer-readable medium of Claim 15 and the limitation "wherein the software is further operable when executed to" (see Claim 15).
Biadsy in view of Chun and further in view of Sato do not teach remove one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units. However, Sargin does teach this limitation (see Sargin (10:40-11:20), reproduced in full in the rejection of Claim 8 above, describing the conversion of continuous "possibly silent" regions longer than a pre-determined length into "silence segment" labels and their removal from the input audio).
Biadsy, Chun, Sato, and Sargin are in the same field of endeavor of speech processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the computer-readable medium of Claim 15 of Biadsy, Chun, and Sato to incorporate the teachings of Sargin to include removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units. Doing so allows identifying participants' speaking states based on the audio speaking state, as recognized by Sargin at (5:62-63).
Conclusion
THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Louis Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659