Last updated: May 29, 2026
Application No. 18/439,630
USING TEXT-INJECTION TO RECOGNIZE SPEECH WITHOUT TRANSCRIPTION

Final Rejection §101 §103
Filed
Feb 12, 2024
Priority
Mar 01, 2023 — provisional 63/487,821
Examiner
OGUNBIYI, OLUWADAMILOL M
Art Unit
2653
Tech Center
2600 — Communications
Assignee
Google LLC
OA Round
2 (Final)
Interview Optional

Examiner has a relatively high allowance rate (77%) and a +17.6% interview lift. A written response may suffice.
Based on 309 resolved cases, 2023–2026
Examiner Intelligence

Grants 77% — above average
Career Allowance Rate
239 granted / 309 resolved
+15.3% vs TC avg
Strong +18% interview lift
Interview Lift: +17.6% (with vs. without interview; resolved cases with interview)
Typical timeline
2y 11m
Avg Prosecution
17 currently pending
Career history
340
Total Applications
across all art units
Statute-Specific Performance

§101: 9.6% (-30.4% vs TC avg)
§103: 77.8% (+37.8% vs TC avg)
§102: 2.2% (-37.8% vs TC avg)
§112: 8.0% (-32.0% vs TC avg)
Based on career data from 309 resolved cases
Office Action

DETAILED ACTION
Claims 1 – 20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
With regard to the Non-Final Office Action from 23 September 2025, the Applicant has filed a response on 22 December 2025.
Response to Arguments
The Applicant argues (Remarks: Page 8 par 3) against the Examiner’s 35 U.S.C. 101 rejection regarding the claims being directed to a judicial exception without significantly more, stating that ‘[w]hen analyzed as a whole, the claim contains specific limitations regarding the generation of alignment outputs from unpaired text that the human mind is simply not equipped to perform.’ The Applicant indicates that, by the current amendment, the unspoken textual utterances are not paired with any corresponding audio representation, and that a human mind cannot align text to audio that does not exist, since a human given a text transcript that has never been heard cannot mentally derive the acoustic features necessary to recognise that speech. To this, the Examiner refers to [0030] of the Specification, which shows that the unspoken textual utterance includes text-only data (words, word-pieces, phonemes and/or graphemes) that isn’t paired with any corresponding spoken audio representation. Taking the unspoken textual utterances as words, the Examiner indicates here that these words (which are not paired with any corresponding audio representation) are to be inserted into the redacted portions of the transcriptions of the modified speech. A human is able to take textual words as the unspoken textual utterances, even though these words are yet to be uttered at the time the unspoken textual utterances were obtained, and then insert these unspoken textual utterances into the redacted portions of the transcripts of the modified speech utterances. By this, a human is able to mentally perform the indicated alignment by matching the words of these unspoken textual utterances with redacted portions of the transcripts of the modified speech utterances.
The Applicant further indicates (Remarks: page 9 par 2) that this alignment requires ‘predicting millisecond-level text chunk durations … and upsampling representations to match acoustic frame rates’ and that a human with a pen and paper ‘cannot look at a sentence of text with fake random data and, without any audio reference, mentally calculate and generate “alignment outputs” comprising accurate speech-frame durations suitable for training a neural network.’ To this, the Examiner refers back to the claim (independent claim 1), which simply provides the use of an alignment model to generate a corresponding alignment output for each unspoken textual utterance. The limitation of the independent claim (as currently presented) does not require any of the millisecond-level text chunk duration prediction or representation upsampling that the Applicant has indicated here. By its plain presentation, a human only needs to align the unspoken textual utterances with the redacted portions of the transcript of the modified speech utterances. This is a task that a human is able to perform mentally.
The Applicant indicates (Remarks: page 9 par 3) that the limitation of ‘training … a model’ does not recite a judicial exception unless it explicitly claims the mathematical formulas, and that since the claimed invention provides training without reciting a mathematical equation, with the training involving upwards of 10,000 hours of transcribed speech, this would constitute a machine-learning process, not a mental one. The Examiner holds that the claims as presented do not require 10,000 hours of training for understanding the claimed invention, and that the training presented here, while not involving mathematical equations, does train based on information that is readily available for a human to mentally process, this being the transcribed speech utterances, the modified speech utterances, the unspoken textual utterances and the alignment output information. A human can mentally make use of this information to learn the overall process, making this a mental process.
The Examiner hereby maintains the 35 U.S.C. 101 rejection.
Regarding the 35 U.S.C. 103 rejection given to the independent claims, the Applicant indicates that the applied prior art does not teach the limitation as amended, namely that the unspoken textual utterances ‘are not paired with any corresponding audio representations.’ The Examiner will address this limitation as presented in its section. The Applicant further states (Remarks: page 11 par 2) that the Ganong, III et al. reference fails to teach ‘generating, using an alignment model …,’ the Applicant indicating that the reference teaches a text-to-text alignment, different from the claimed invention’s text-to-time (frames) alignment, which mimics acoustic features and generates a duration output for text that inherently lacks duration. The intent of the claimed invention is, according to the Applicant’s argument, a text-to-time alignment, but this is not reflected in the claim limitation. The claim limitation presents the generation of an alignment output for each unspoken textual utterance of the received training data. The unspoken textual utterances are interpretable as textual words, and the reference aligns the transcripts of the obscured signals (as provided by [0055] of this reference, which provides replacing the sensitive information) with the transcript of the input speech signal. This alignment includes the replacement words, contained as part of the transcript of the obscured speech signal, being aligned with the transcript of the input speech signal, thereby being suitable to teach the indicated limitation just as presented.
The Applicant further states (Remarks: page 12 par 1) that Ganong, III et al. does not include a transcription of the original input in its training. The Examiner refers to [0071] of this reference which provides training of the speech processing model involving the use of the transcription of the input speech signal, thereby teaching contrary to the Applicant’s assertion.
Applicant’s arguments with respect to the independent claims have been considered but are moot because of the new grounds of rejection necessitated by the amendment to the claims. The claims will be addressed by their current presentation.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 – 5, 10 – 15 and 20 are rejected under 35 U.S.C. 101 because this claimed invention is directed to a judicial exception without significantly more.
Independent claims 1 and 11 provide for receiving training data that comprises: transcribed speech utterances spoken in a general domain; modified speech utterances in a target domain, modified to obfuscate classes of sensitive information and each paired with a corresponding transcription that redacts the sensitive information; and unspoken textual utterances that correspond to the transcriptions of the modified speech utterances in the target domain, are not paired with any audio representation, and contain fake random data inserted in the redacted portions that contained sensitive information. The claims go on to generate, using an alignment model, an alignment output for each unspoken textual utterance, and then train a speech recognition model using the transcribed speech utterances, the modified speech utterances and the alignment outputs, so that the speech recognition model can learn to recognise speech in the target domain and phrases representing one or more classes of sensitive information.
Nothing in the claims precludes the claimed technique from being performed in the human mind. The entire process involves data gathering through collecting transcribed speech utterances, modified speech utterances and unspoken textual utterances; data generation for generating an alignment output; data presentation of transcribed speech utterances, modified speech utterances and unspoken textual utterances to a training module; and recognising speech in a target domain as well as sensitive information in the speech, these being presented after recognition. In an attempt to learn to recognise speech as well as classes of sensitive information contained in the speech, a human may receive transcribed speech utterances; receive speech utterances that have been modified to obfuscate sensitive information and are paired with corresponding transcripts that have the sensitive information redacted; and receive textual words that haven’t been spoken, aren’t paired with any audio, and have fake random data inserted into the redacted portions. The human may then align the textual words that haven’t been spoken, and apply all this information to learn to recognise sensitive information contained in a received utterance that contains phrases that fall into classes of sensitive information. The claims hereby recite a mental process.
This judicial exception is not integrated into a practical application as the claims simply teach of data gathering, data generation and data presentation. While the claims make mention of data processing hardware and memory hardware, these are recited in generic terms.
The invention is not tied to any particular defining structure and simply provides instructions to apply the judicial exception. The technique can be performed by a generic computer which would be presented as a tool to implement the abstract idea (classifiable as automation of the mental process steps). The Specification in [0023] shows a computer as a mobile computing device, suitable to read upon the limitations of this claim. The data processing hardware and the memory hardware are recited at such a high level of generality that they amount to no more than mere instructions to apply the exception using a generic computer. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the invention is not tied to a practical application.
The claims provide techniques that amount to no more than mere instructions that apply the judicial exception which can be performed by a generic device. Mere instructions to apply an exception using a generic device cannot provide an inventive concept. Claims 1 and 11 are not eligible.
Claims 2 and 12 provide that the one or more classes of sensitive information comprise at least one of personally identifiable information, protected health information or dates. These are mere instructions to define the classes of sensitive information. This does not integrate any practical application nor does it provide any additional element sufficient to amount to more than the mentioned judicial exception.
Claims 3 and 13 provide that the redacted portions of the transcriptions of the modified speech utterances are tagged with a class identifier that indicates the class of the sensitive information. A human may textually tag the redacted portions of the transcriptions with class identifiers. This does not integrate any practical application nor does it provide any additional element sufficient to amount to more than the mentioned judicial exception.
Claims 4 and 14 provide that the fake random data inserted into the redacted portions is associated with the class of sensitive information identified by the class identifier at the redacted portion. A human may manually insert fake random information into a redacted portion of a transcript that matches the class of sensitive information that was redacted. This does not integrate any practical application nor does it provide any additional element sufficient to amount to more than the mentioned judicial exception.
Claims 5 and 15 provide that the transcribed speech utterances in the general domain comprise a greater number of hours of speech than the modified speech utterances. This simply provides mere instructions to apply the judicial exception by having the transcribed speech utterances in the general domain contain more hours of data than the modified speech utterances. This does not integrate any practical application nor does it provide any additional element sufficient to amount to more than the mentioned judicial exception.
Claims 10 and 20 provide teaching for extracting a textual representation from an unspoken textual utterance, predicting a duration for a text chunk for each text chunk in the unspoken textual utterance, and upsampling the initial textual representation using the predicted duration for each chunk. A human may manually observe a text, make a timing/duration prediction for text chunks in the text, and perform an upsampling as an expansion of the information in the textual representation. This does not integrate any practical application nor does it provide any additional element sufficient to amount to more than the mentioned judicial exception.
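The three steps summarised for claims 10 and 20 (extract a textual representation, predict a duration per text chunk, upsample by that duration) can be illustrated with a minimal sketch. It is an illustration only and is not taken from the Specification or the cited references: the function name `upsample_by_duration`, the toy embedding values and the durations are all hypothetical, and a real system would presumably obtain the vectors and durations from learned models.

```python
from typing import List, Sequence


def upsample_by_duration(
    chunk_vectors: Sequence[Sequence[float]],
    durations_in_frames: Sequence[int],
) -> List[List[float]]:
    """Repeat each text-chunk vector once per predicted frame.

    Hypothetical helper: the output has one vector per frame, so its
    length is the sum of the predicted durations.
    """
    if len(chunk_vectors) != len(durations_in_frames):
        raise ValueError("need exactly one duration per text chunk")
    frames: List[List[float]] = []
    for vector, n_frames in zip(chunk_vectors, durations_in_frames):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not share one mutable list.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Made-up inputs: three text chunks with 2-dimensional representations.
chunk_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations_in_frames = [2, 1, 3]  # assumed values, not model output

alignment_output = upsample_by_duration(chunk_vectors, durations_in_frames)
print(len(alignment_output))  # 6 frames = 2 + 1 + 3
```

In this toy form the result is a frame-level sequence derived from text alone, which is the sense of "text-to-time" alignment argued in the Remarks; whether claim 1 as drafted requires it is the point in dispute.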
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 5, 11, 12 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong, III et al. (US 2023/0395063 A1: hereafter — Ganong) in view of Mozer et al. (US 2023/0229803 A1: hereafter — Mozer) and further in view of Bachtiger et al. (US 11,120,199 B1: hereafter — Bachtiger).
For claim 1, Ganong discloses a computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations (Ganong: [0020] — a processor (as a data processing hardware)) comprising:
receiving training data comprising:
transcribed speech utterances spoken in a general domain, each transcribed speech utterance paired with a corresponding transcription (Ganong: [0065] — having an input speech signal and a transcription of the input speech signal as its pair (this being an original domain whereby sensitive information is left in the speech and transcript));
modified speech utterances in a target domain, the modified speech utterances comprising utterances spoken in the target domain that have been modified to obfuscate one or more classes of sensitive information recited in the utterances, each modified speech utterance paired with a corresponding transcription that redacts the sensitive information obfuscated from the modified speech utterance (Ganong: [0070] — training involving the presence of obscured speech signal (the target domain being such a domain which has certain sensitive data removed); [0068] — an obscured input speech signal with the sensitive information removed); and
unspoken textual utterances corresponding to the transcriptions of the modified speech utterances in the target domain, [[wherein the unspoken textual utterances are not paired with any corresponding audio representation]] and comprise fake [[random]] data inserted into redacted portions of the transcriptions of the modified speech utterances where the sensitive information recited in the modified speech utterance has been redacted (Ganong: [0055]–[0056] — replacing the sensitive portion of the transcription with fake information different from what was previously there; [0071] — training the speech processing model based on manually generated transcription of obscured speech signal (this transcription of the obscured speech signal belonging to unspoken utterances since the obscured speech was never uttered));
generating, using an alignment model, a corresponding alignment output for each unspoken textual utterance of the received training data (Ganong: [0071] — ‘aligning 428 the manually generated transcription of the obscured speech signal with the transcription’); and
training a speech recognition model on the transcribed speech utterances, the modified speech utterances, and the alignment outputs generated for the unspoken textual utterances to teach the speech recognition model to learn to recognize speech in the target domain and phrases within the one or more classes of sensitive information (Ganong: [0070] — the training including the obscured speech signals (as the modified speech utterances); [0071] — training the speech processing model making use of the manually generated transcription, alignment with the transcription of the input speech (as the alignment outputs generated for the unspoken textual utterances, and also indicating that the transcription of the original speech is included for the purpose of training); [0056], FIGs. 6 & 7 — a transcription process that is able to identify the sensitive information in the speech).
The reference of Ganong provides teaching for inserting fake data into redacted portions of the transcriptions of the modified speech utterances and for training a speech recognition model on the transcribed speech utterances and the modified speech utterances, but differs from the claimed invention in that the claimed invention further includes training the speech recognition model using the alignment outputs generated for the unspoken textual utterances, which the reference of Ganong only alludes to.
The reference of Mozer is now introduced to teach of training a speech recognition model by making use of the alignment outputs generated from replacement text information as:
training a speech recognition model on the transcribed speech utterances, the modified speech utterances, and the alignment outputs generated for the unspoken textual utterances to teach the speech recognition model to learn to recognize speech in the target domain and phrases within the one or more classes of sensitive information (Mozer: [0025], [0027] — sanitising through obfuscating or transforming identified PII, by replacing the text features with a random pattern or replacing it with a generic template representative of the type of text that was meant to be conveyed; [0012] — machine learning training, including the training of a speech recognition model by incorporating the sanitised version (indicating training that includes the obfuscated version or in this case, the alignment output generated for the unspoken textual utterances)).
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Mozer, which trains a speech recognition model by making use of the sanitised text version obtained after obfuscation of private information, with the training of the speech processing model that makes use of transcribed speech utterances and the modified speech utterances, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of a robust speech recognition system trained to also recognise confidential information that requires obscuring, and then to properly replace such confidential information before presenting the speech recognition results. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
The combination of Ganong in view of Mozer provides teaching for inserting fake data into redacted portions of the transcriptions of the modified speech utterances, but differs from the claimed invention in that the claimed invention further provides teaching for inserting fake random information.
The reference of Bachtiger is now introduced to teach this as:
unspoken textual utterances corresponding to the transcriptions of the modified speech utterances in the target domain, wherein the unspoken textual utterances are not paired with any corresponding audio representation and comprise fake random data inserted into redacted portions of the transcriptions of the modified speech utterances where the sensitive information recited in the modified speech utterance has been redacted (Bachtiger: Col 4 lines 44–48 — a redaction module which is able to substitute redacted portions of an utterance with random series of symbols such as ***-**-*** or #@$-$#-@#$#@ (which are the unspoken textual utterances that are not paired with any corresponding audio representation)).
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Bachtiger which replaces sensitive information with random data, with the teaching of inserting fake data into such redacted portions that contain sensitive information as taught by the combination of Ganong in view of Mozer, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of being able to quickly generate and present information for replacing sensitive information without following a particular order, resulting in simplicity and being unbiased. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
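The three kinds of training data recited in claim 1 differ mainly in whether audio is present and in what the text contains. The sketch below models that distinction as a small data structure. It is a reading aid under stated assumptions, not the Applicant's or any reference's implementation: the class name `TrainingExample`, the `source` labels and the sample strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the three kinds of training data in claim 1.
SOURCES = {"transcribed", "modified", "unspoken_text"}


@dataclass(frozen=True)
class TrainingExample:
    source: str                    # one of SOURCES
    text: str                      # transcription or text-only utterance
    audio: Optional[bytes] = None  # None for unspoken textual utterances

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        has_audio = self.audio is not None
        if self.source == "unspoken_text" and has_audio:
            raise ValueError("unspoken textual utterances carry no audio")
        if self.source != "unspoken_text" and not has_audio:
            raise ValueError("speech utterances must carry audio")


# Made-up examples; the byte strings stand in for real waveforms.
transcribed = TrainingExample("transcribed", "turn left at the light", b"\x00\x01")
modified = TrainingExample("modified", "my name is [REDACTED]", b"\x00\x02")
unspoken = TrainingExample("unspoken_text", "my name is Qzx Vrml")

print(unspoken.audio is None)  # True: text-only, no paired audio
```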
For claim 2, claim 1 is incorporated and the combination of Ganong in view of Mozer further in view of Bachtiger discloses the method of claim 1, wherein the one or more classes of sensitive information comprises at least one of personally identifiable information, protected health information, or dates (Ganong: [0055] — the sensitive information includes personally identifiable information).
For claim 5, claim 1 is incorporated and the combination of Ganong in view of Mozer further in view of Bachtiger discloses the method, wherein the transcribed speech utterances in the general domain comprise a greater number of hours of speech than the modified speech utterances (Ganong: [0098] — ‘In some implementations, transcription generation process 10 may selectively remove sensitive content from the training of a speech processing system. For example, some or all of sensitive content may be identified as noted above but instead of providing the sensitive content for transcription, transcription generation process 10 may dispose of the one or more sensitive content signals (e.g., sensitive content signals 1112). In this manner, a speech processing system (e.g., speech processing system 514) may be trained using only non-sensitive content from the input speech signal.’ (this indicating that the amount of the transcribed speech utterances containing the sensitive material would inherently contain more data than the utterances without the sensitive material, and thereby leading to longer hours of data)).
As for claim 11, system claim 11 and method claim 1 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Ganong in [0099] provides that the disclosure may take an entirely hardware embodiment, with [0020] providing one or more processors as well as memory architectures, all these being suitable to read upon the limitations of this claim. Accordingly, claim 11 is similarly rejected under the same rationale as applied above with respect to method claim 1.
As for claim 12, system claim 12 and method claim 2 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 12 is similarly rejected under the same rationale as applied above with respect to method claim 2.
As for claim 15, system claim 15 and method claim 5 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 15 is similarly rejected under the same rationale as applied above with respect to method claim 5.
Claims 3, 4, 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US 2023/0395063 A1) in view of Mozer (US 2023/0229803 A1) and further in view of Bachtiger (US 11,120,199 B1) as applied to claims 1 and 11, in view of Gaeta et al. (US 10,002,639 B1: hereafter — Gaeta).
For claim 3, claim 1 is incorporated but the combination of Ganong in view of Mozer further in view of Bachtiger fails to teach the limitation of this claim, for which the reference of Gaeta is now introduced to teach as the method, wherein the redacted portions of the transcriptions of the modified speech utterances are tagged with a class identifier identifying the class of sensitive information that has been redacted (Gaeta: FIG. 1, Col 2 line 66 – Col 3 line 2 — a table showing text that was identified as confidential and an indication of the type of the confidential information identified).
The combination of Ganong in view of Mozer further in view of Bachtiger provides teaching for identifying classes of sensitive information in an utterance, but differs from the claimed invention in that the claimed invention further provides teaching for the transcriptions to be tagged with a class identifier identifying the class of sensitive information being redacted. The reference of Gaeta is however introduced to teach this, as presented above.
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Gaeta, which teaches the clear identification/tagging of the type of sensitive/confidential information, with the teaching of simply identifying sensitive information in speech as taught by the combination of Ganong in view of Mozer further in view of Bachtiger, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of clearly informing a user observing the redacted transcription of the class of sensitive information encountered in an utterance. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
For claim 4, claim 3 is incorporated and the combination of Ganong in view of Mozer further in view of Bachtiger further in view of Gaeta teaches the method, wherein the fake random data inserted into each redacted portion of the transcriptions of the modified speech utterances is associated with the class of sensitive information identified by the class identifier at the redacted portion (Ganong: [0056] — replacing patients’ names, date of birth, medical history/prescription dosage information (the fake data inserted into each redacted portion being the same class of sensitive information that’s being redacted);
Bachtiger: Col 4 lines 44–48 — a redaction module which is able to substitute redacted portions of an utterance with random series of symbols such as ***-**-*** or #@$-$#-@#$#@ (which are the unspoken textual utterances that are not paired with any corresponding audio representation)).
As for claim 13, system claim 13 and method claim 3 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 13 is similarly rejected under the same rationale as applied above with respect to method claim 3.
As for claim 14, system claim 14 and method claim 4 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 14 is similarly rejected under the same rationale as applied above with respect to method claim 4.
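The limitations of claims 3 and 4 (redacted portions tagged with a class identifier, and fake random data of the same class inserted at each tag) can be pictured with a short sketch. Everything here is an assumption made for illustration: the bracketed tag syntax such as `[NAME]`, the generator functions and the sample transcript do not come from the application or from Ganong, Bachtiger or Gaeta.

```python
import random
import re
import string


def fake_name(rng: random.Random) -> str:
    """Return six random uppercase letters as a stand-in name."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(6))


def fake_date(rng: random.Random) -> str:
    """Return a random MM/DD/YYYY date string."""
    return f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}/{rng.randint(1950, 2020)}"


# Hypothetical mapping from class identifier to a class-matched generator.
GENERATORS = {"NAME": fake_name, "DATE": fake_date}
TAG = re.compile(r"\[([A-Z]+)\]")


def fill_redactions(transcript: str, rng: random.Random) -> str:
    """Replace each class-tagged redaction with random data of that class."""
    def replace(match: "re.Match[str]") -> str:
        generator = GENERATORS.get(match.group(1))
        if generator is None:
            return match.group(0)  # leave unknown tags untouched
        return generator(rng)
    return TAG.sub(replace, transcript)


rng = random.Random(0)  # seeded only so the example is repeatable
redacted = "patient [NAME] was born on [DATE]"
unspoken_text = fill_redactions(redacted, rng)
print(unspoken_text)
```

Unlike a fixed mask such as `***-**-***`, the replacement in this sketch depends on the class identifier at each redacted position, which is the distinction claims 4 and 14 turn on.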
Claims 6, 9, 16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US 2023/0395063 A1) in view of Mozer (US 2023/0229803 A1) and further in view of Bachtiger (US 11,120,199 B1) as applied to claims 1 and 11, further in view of Moritz et al. (US 2024/0153508 A1: hereafter — Moritz).
For claim 6, claim 1 is incorporated but the combination of Ganong in view of Mozer further in view of Bachtiger fails to disclose the limitation of this claim, for which the reference of Moritz is now introduced to teach as the method, wherein the speech recognition model comprises an audio encoder and a decoder (Moritz: [0017] — speech recognition usually performed using an encoder and a decoder), the audio encoder comprising a stack of self-attention layers each including a multi-headed self-attention mechanism (Moritz: [0195] — the encoder is composed of self-attention layers; [0194] — the presence of multi-head self-attention layers).
The combination of Ganong in view of Mozer further in view of Bachtiger provides teaching for the presence of speech recognition, but differs from the claimed invention in that the claimed invention further provides that the speech recognition model comprises an audio encoder and decoder, the audio encoder comprising a stack of self-attention layers each including a multi-headed self-attention mechanism. This isn’t new to the art as the reference of Moritz is seen to teach above.
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Moritz, which has a speech recognition model comprising an encoder and decoder, with the plain speech recognition model of the combination of Ganong in view of Mozer further in view of Bachtiger, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of a speech recognition system encompassing an encoder and a decoder as well as self-attention layers, for the purpose of greatly improving upon the speech recognition ability. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
For claim 9, claim 6 is incorporated and the combination of Ganong in view of Mozer further in view of Bachtiger further in view of Moritz discloses the method, wherein the decoder comprises one of a Connectionist Temporal Classification (CTC) decoder, a Listen Attend Spell (LAS) decoder, or Recurrent Neural Network-Transducer (RNN-T) decoder (Moritz: [0070] — a connectionist temporal classification decoder).
As for claim 16, system claim 16 and method claim 6 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 16 is similarly rejected under the same rationale as applied above with respect to method claim 6.
As for claim 19, system claim 19 and method claim 9 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 19 is similarly rejected under the same rationale as applied above with respect to method claim 9.
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US 2023/0395063 A1) in view of Mozer (US 2023/0229803 A1), further in view of Bachtiger (US 11,120,199 B1) further in view of Moritz (US 2024/0153508 A1) as applied to claims 6 and 16, and further in view of CHUNG et al. (US 2023/0134942 A1: hereafter — Chung).
For claim 7, claim 6 is incorporated but the combination of Ganong in view of Mozer further in view of Bachtiger further in view of Moritz fails to explicitly teach the limitation of this claim, for which the reference of Chung is now introduced to teach as the method, wherein the training data further comprises un-transcribed speech utterances spoken in the general domain, each un-transcribed speech utterance not paired with any corresponding transcription (Chung: [0015] — training a speech recognition model using untranscribed speech data).
The combination of Ganong in view of Mozer further in view of Bachtiger and further in view of Moritz provides teaching for the presence of training data in a general domain (the general domain being that where the sensitive information is still present in the utterances), but differs from the claimed invention in that the claimed invention further provides that the training data comprises un-transcribed speech utterances which are not paired with any transcription. This isn’t new to the art as the reference of Chung is seen to teach above.
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Chung which teaches of training using un-transcribed speech utterances not paired with any transcription, with the teaching of the presence of training data for speech recognition as taught by the combination of Ganong in view of Mozer further in view of Bachtiger further in view of Moritz, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of a self-supervised training process, so that the model can be trained at reduced cost, without having to manually create labelled datasets. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
As for claim 17, system claim 17 and method claim 7 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 17 is similarly rejected under the same rationale as applied above with respect to method claim 7.
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US 2023/0395063 A1) in view of Mozer (US 2023/0229803 A1), further in view of Bachtiger (US 11,120,199 B1) and further in view of Moritz (US 2024/0153508 A1), further in view of Chung (US 2023/0134942 A1) as applied to claims 6 and 16, further in view of Fan, Zhiyun, Shiyu Zhou, and Bo Xu. “Unsupervised pre-training for sequence to sequence speech recognition.” arXiv preprint arXiv:1910.12418 (2019) (hereafter — Fan), and further in view of BRONGERS et al. (US 2023/0308666 A1: hereafter — Brongers).
For claim 8, claim 7 is incorporated but the combination of Ganong in view of Mozer further in view of Bachtiger, further in view of Moritz and further in view of Chung fails to disclose the limitations of this claim, for which the reference of Fan is now introduced to teach as the method, wherein training the speech recognition model comprises:
for each un-transcribed speech utterance:
generating a corresponding encoded representation of the un-transcribed utterance (Fan: Fig. 1 — in the acoustic pre-training section, the result of the audio corpus (un-transcribed non-synthetic speech utterance) is used as input for the encoder); and
training the audio encoder on a [[contrastive]] loss applied on the corresponding encoded representation of the un-transcribed speech utterance (Fan: Fig. 1 — applying a mean squared error);
for each alignment output:
generating a corresponding encoded representation of the alignment output (Fan: Fig. 1 — the result of the synthesised speech is used as input for an encoder in the linguistic pre-training section, which combines a text corpus, a TTS system and synthesised audio (to obtain alignment output between the text and the audio)); and
training the audio encoder on a [[contrastive]] loss applied on the corresponding encoded representation of the alignment output (Fan: Fig. 1 — cross entropy error is applied (according to the linguistic pre-training section)); and
for each transcribed speech utterance:
generating a corresponding encoded representation of the transcribed speech utterance (Fan: Fig. 1 — in the post-training section, in-domain audio is processed by an encoder-decoder and in-domain text is used (taking this as the transcribed speech utterance)); and
training the audio encoder on a [[contrastive]] loss applied on the corresponding encoded representation of the transcribed speech utterance (Fan: Fig. 1 — applying a cross-entropy loss).
The combination of Ganong in view of Mozer further in view of Bachtiger, further in view of Moritz and further in view of Chung provides teaching for training a speech recognition model, but differs from the claimed invention in that the claimed invention further provides teaching for training an audio encoder on different losses corresponding to different encoded representations. This is however not new to the art as the reference of Fan is seen to teach above.
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to improve upon the teaching of training a speech recognition model as taught by the combination of Ganong in view of Mozer further in view of Bachtiger, further in view of Moritz, and further in view of Chung, by introducing the teaching of Fan which trains an audio encoder on different losses corresponding to different encoded representations, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of integrating useful representations of speech contained at the different stages, so that the decoder end can obtain rich linguistic information (Fan: Section 3.).
The combination of Ganong in view of Mozer further in view of Bachtiger, further in view of Moritz, further in view of Chung and further in view of Fan provides teaching for training an audio encoder on different losses corresponding to different encoded representations, but fails to teach of a contrastive loss.
The presence of a contrastive loss is however seen to be taught by the reference of Brongers (Brongers: [0067] — training an encoder based on contrastive loss).
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to improve upon the teaching of training an audio encoder on different losses corresponding to different encoded representations as taught by the combination of Ganong in view of Mozer further in view of Bachtiger, further in view of Moritz, further in view of Chung, further in view of Fan, by introducing the teaching of Brongers which trains an encoder based on contrastive loss, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of increasing the agreement between objects of equal features, and maximising the distance to other object representations (Brongers: [0067]).
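As an illustration of what a contrastive loss does (pulling matching representations together and pushing non-matching ones apart), a minimal sketch follows. It uses an InfoNCE-style formulation with cosine similarity, which is one common choice and is assumed here; neither Brongers nor the application is asserted to use this exact form. The function names, the temperature value, and the sample vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, for illustration only
) -> float:
    """InfoNCE-style loss: small when the anchor is closest to its positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-dimensional encoded representations.
anchor = [1.0, 0.0]
loss_when_aligned = info_nce_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_when_misaligned = info_nce_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is lower when the positive is the vector nearest the anchor, which is the behaviour the cited rationale describes.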
As for claim 18, system claim 18 and method claim 8 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 18 is similarly rejected under the same rationale as applied above with respect to method claim 8.
Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US 2023/0395063 A1) in view of Mozer (US 2023/0229803 A1) and further in view of Bachtiger (US 11,120,199 B1) as applied to claim 1, further in view of DENG et al. (US 2024/0105159 A1: hereafter — Deng) (see footnote 1).
For claim 10, claim 1 is incorporated and the combination of Ganong in view of Mozer further in view of Bachtiger fails to disclose the limitations of this claim, for which the reference of Deng is now introduced to teach as:
the method, wherein generating the corresponding alignment output for each unspoken textual utterance of the received training data (Deng: [0277] — performing an alignment technique to align original speech with original text) comprises:
extracting an initial textual representation from the unspoken textual utterance (Deng: [0277] — obtaining target text by collecting the start and end positions of phonemes in the text (indicating an extraction of initial textual representation));
predicting a text chunk duration for each text chunk in the unspoken textual utterance (Deng: [0028] — predicting a first/second duration based on target text through the use of a prediction network whereby the duration is a phoneme duration that corresponds to text in the target text (predicting the duration for the text chunk)); and
upsampling the initial textual representation using the predicted text chunk duration for each text chunk in the unspoken textual utterance (Deng: [0211] — upsampling text vector based on duration of each phoneme).
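The three steps recited above can be illustrated with a minimal sketch, offered only as a reading aid. It is not taken from Deng or from the application: the function name is hypothetical, the per-chunk embeddings stand in for the initial textual representation, and fixed integers stand in for the output of a duration predictor.

```python
from typing import List, Sequence


def upsample_by_duration(
    chunk_embeddings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each text-chunk embedding for its predicted number of frames."""
    if len(chunk_embeddings) != len(durations):
        raise ValueError("need exactly one duration per text chunk")
    frames: List[List[float]] = []
    for embedding, duration in zip(chunk_embeddings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the embedding once per frame so the output is frame-aligned.
        frames.extend(list(embedding) for _ in range(duration))
    return frames


# Hypothetical values: three text chunks with 2-dimensional embeddings.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 1, 3]  # assumed output of a duration predictor
aligned = upsample_by_duration(embeddings, predicted_durations)
```

The result has one row per frame (here 2 + 1 + 3 = 6), so its length matches an audio-rate sequence even though no audio was supplied.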
The combination of Ganong in view of Mozer further in view of Bachtiger provides teaching for the presence of an alignment for each unspoken textual utterance, but differs from the claimed invention in that the claimed invention further provides teaching for generating the corresponding alignment output through duration prediction for text chunks in the unspoken textual utterance, and upsampling an extracted initial textual representation using the predicted text chunk duration for each text chunk in the unspoken textual utterance. This is however not new to the art as the reference of Deng is seen to teach above regarding an original text.
Hence, before the effective filing date of the claimed invention, one of ordinary skill in the art would have found it obvious to combine the known teaching of Deng which teaches of upsampling a text vector based on predicted durations of text chunks as phonemes, with the teaching of the alignment as taught by the combination of Ganong in view of Mozer further in view of Bachtiger, to thereby come up with the claimed invention. The combination of both prior art elements would have provided the predictable result of being able to generate a target speech that aligns with an input unspoken text. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
As for claim 20, system claim 20 and method claim 10 are related as apparatus and the method of using same, with each claimed element’s function corresponding to the claimed method step. Accordingly, claim 20 is similarly rejected under the same rationale as applied above with respect to method claim 10.
Conclusion
Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the Examiner should be directed to OLUWADAMILOLA M. OGUNBIYI whose telephone number is (571)272-4708. The Examiner can normally be reached Monday – Thursday (8:00 AM – 5:30 PM Eastern Standard Time).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s Supervisor, PARAS D. SHAH can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/OLUWADAMILOLA M OGUNBIYI/Examiner, Art Unit 2653

/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653
03/25/2026

1 The Deng reference has a foreign priority date of 03 June 2021, this foreign priority also containing the same subject matter as provided by the applied Deng reference.
Prosecution Timeline

Feb 12, 2024
Application Filed
Sep 23, 2025
Non-Final Rejection mailed — §101, §103
Dec 22, 2025
Response Filed
Mar 27, 2026
Final Rejection mailed — §101, §103
May 08, 2026
Request for Continued Examination
May 09, 2026
Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/069,884
Patent 12640154
Stylizing Text-to-Speech (TTS) Voice Response for Assistant Systems
3y 5m to grant Granted May 26, 2026
18/671,825
Patent 12608427
Drill Back To Original Audio Clip In Virtual Assistant Initiated Lists And Reminders
1y 11m to grant Granted Apr 21, 2026
18/615,766
Patent 12579979
NAMING DEVICES VIA VOICE COMMANDS
1y 11m to grant Granted Mar 17, 2026
19/024,112
Patent 12537007
METHOD FOR DETECTING AIRCRAFT AIR CONFLICT BASED ON SEMANTIC PARSING OF CONTROL SPEECH
1y 0m to grant Granted Jan 27, 2026
18/082,346
Patent 12508086
SYSTEM AND METHOD FOR VOICE-CONTROL OF OPERATING ROOM EQUIPMENT
3y 0m to grant Granted Dec 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

3-4
Expected OA Rounds
77%
Grant Probability
95%
With Interview (+17.6%)
2y 11m (~8m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 309 resolved cases by this examiner. Grant probability derived from career allowance rate.