Last updated: May 29, 2026
Application No. 18/832,325
SEMI-SUPERVISED TEXT-TO-SPEECH BY GENERATING SEMANTIC AND ACOUSTIC REPRESENTATIONS

Non-Final OA §101 §103
Filed
Jul 23, 2024
Priority
Jan 26, 2023 — provisional 63/441,418 +1 more
Examiner
MASTERS, KRISTEN MICHELLE
Art Unit
2659
Tech Center
2600 — Communications
Assignee
Google LLC
OA Round
1 (Non-Final)
This examiner grants 63% of cases after interview (+22.3% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 46 resolved cases, 2023–2026
Examiner Intelligence

MASTERS, KRISTEN MICHELLE
Grants 63% of resolved cases
Career Allowance Rate
29 granted / 46 resolved
+1.0% vs TC avg
Interview Lift: +22.3% (resolved cases with interview vs. without)
Typical timeline
3y 0m
Avg Prosecution
24 currently pending
Statute-Specific Performance

§101
12.4%
-27.6% vs TC avg
§103
85.0%
+45.0% vs TC avg
Based on career data from 46 resolved cases
Office Action

§101 §103
Detailed Action
This communication is in response to the Application filed on 7/23/2024. 
Claims 1-20 are pending and have been examined. 
Claims 1-20 are rejected
Claims 1, 18, and 19 are independent and are method, system, and non-transitory computer storage media claims, respectively.
Apparent priority: 1/26/2023. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 5/22/2025 and 8/21/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Independent Claim 1, Claim 1 recites,
“1. A computer-implemented method for generating an audio signal from input text, the method comprising: 
receiving a request to convert input text into an audio signal, wherein the input text comprises a plurality of tokenized text inputs; [This relates to a human receiving a request to convert text to speech through the auditory or visual systems.]
generating, using a first generative neural network, a semantic representation of the tokenized text inputs comprising semantic tokens representing semantic content of the tokenized text inputs, each semantic token being selected from a vocabulary of semantic tokens; [This relates to a human generating a semantic representation of the tokenized text inputs using pen and paper.]
generating, using a second generative neural network and conditioned on at least the semantic representation, an acoustic representation of the semantic representation comprising one or more respective acoustic tokens representing acoustic properties of the audio signal; [This relates to a human generating an acoustic representation of the semantic representation using voice.]
and processing the acoustic representation using a decoder neural network to generate the audio signal. [This relates to a human processing the acoustic representation in the mind to generate the audio signal through voice.]
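
For orientation only, the data flow recited in claim 1 (tokenized text, then semantic tokens, then acoustic tokens, then an audio signal) can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the vocabulary sizes, and the toy arithmetic standing in for each neural network are all hypothetical and are not taken from the application or the cited references.

```python
from typing import List

SEMANTIC_VOCAB_SIZE = 512   # hypothetical size of the semantic-token vocabulary
ACOUSTIC_VOCAB_SIZE = 1024  # hypothetical size of the acoustic-token vocabulary


def first_network(text_tokens: List[int]) -> List[int]:
    """Placeholder for the first generative neural network.

    Maps tokenized text to semantic tokens drawn from a fixed vocabulary.
    The arithmetic is a toy stand-in, not a trained model.
    """
    return [(7 * t + 3) % SEMANTIC_VOCAB_SIZE for t in text_tokens]


def second_network(semantic_tokens: List[int]) -> List[int]:
    """Placeholder for the second generative neural network.

    Produces acoustic tokens conditioned on the semantic tokens.
    """
    return [(5 * s + 1) % ACOUSTIC_VOCAB_SIZE for s in semantic_tokens]


def decoder_network(acoustic_tokens: List[int]) -> List[float]:
    """Placeholder for the decoder neural network.

    Turns acoustic tokens into waveform samples scaled to [-1, 1].
    """
    return [2.0 * a / (ACOUSTIC_VOCAB_SIZE - 1) - 1.0 for a in acoustic_tokens]


def synthesize(text_tokens: List[int]) -> List[float]:
    """Chain the three stages in the order recited in claim 1."""
    semantic = first_network(text_tokens)
    acoustic = second_network(semantic)
    return decoder_network(acoustic)


audio = synthesize([12, 40, 7])
```

The point of the sketch is only the ordering of the stages and the fact that each intermediate representation is a sequence of discrete tokens selected from a vocabulary.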

The Dependent Claim does not include additional limitations that could incorporate the abstract idea into a practical application or cause the Claim as a whole to amount to significantly more than the underlying abstract idea.
	
Regarding Independent Claim 18, Claim 18 is a System claim with limitations similar to that of claim 1 and is rejected under the same rationale.
Regarding Independent Claim 19, Claim 19 is a non-transitory computer storage media claim with limitations similar to that of claim 1 and is rejected under the same rationale.

This judicial exception is not integrated into a practical application. In particular, claims 18, 19 and 10 recite the additional elements of “computers” and “storage.” For example, paragraphs [0086-0087] of the as-filed specification describe using computer program components…storage… Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using computers and storage is noted as a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed toward insignificant extra-solution activity. The claims are not patent eligible.

Dependent claim 2 recites,
“2. The method of claim 1, wherein the first generative neural network has an encoder-decoder Transformer architecture.” A neural network is noted as an additional limitation.

Dependent claim 3 recites,
“3. The method of claim 2, wherein the first generative neural network is trained on a parallel text-speech dataset that maps text to semantic representations of audio corresponding to the text.” [This relates to a human mapping text to semantic representations of audio using pen and paper.] A neural network is noted as an additional limitation.

Dependent claim 4 recites,
“4. The method of claim 3, wherein the training on the parallel text-speech dataset comprises: pre-training the first generative neural network on a first objective using semantic representations of a speech-only dataset; [This relates to a human pre-training on a first objective using semantic representations in the human mind.]
and fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset.” [This relates to a human fine-tuning the pretraining using logic and reasoning in the human mind.] A neural network is noted as an additional limitation.

Dependent claim 5 recites,
“5. The method of claim 4, wherein fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset further comprises: 
fine-tuning lower layers of the encoder of the pre-trained first generative neural network and [This relates to a human fine-tuning layers using pen and paper]
fixing the upper layers of the encoder of the pre-trained first generative neural network and the decoder of the pre-trained first generative neural network.” [This relates to a human fixing upper layers using pen and paper.] A neural network is noted as an additional limitation.
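
As an aid to reading claim 5, the partial fine-tuning it recites (lower encoder layers updated, upper encoder layers and the decoder held fixed) can be sketched as a per-layer trainable flag. This is a minimal sketch, not the applicant's implementation: the layer counts, the split point, and the `encoder.N` / `decoder.N` naming are assumptions made for illustration.

```python
from typing import Dict


def trainable_flags(num_encoder_layers: int,
                    num_decoder_layers: int,
                    num_lower_layers: int) -> Dict[str, bool]:
    """Return a map from (hypothetical) layer name to whether it is fine-tuned.

    Encoder layers with index below num_lower_layers are fine-tuned; the
    remaining encoder layers and every decoder layer are kept fixed.
    """
    flags: Dict[str, bool] = {}
    for i in range(num_encoder_layers):
        flags[f"encoder.{i}"] = i < num_lower_layers
    for i in range(num_decoder_layers):
        flags[f"decoder.{i}"] = False
    return flags


# Hypothetical configuration: 4 encoder layers, 2 decoder layers, lowest 2 tuned.
flags = trainable_flags(num_encoder_layers=4, num_decoder_layers=2, num_lower_layers=2)
```

In a real training framework the same flags would typically decide which parameters receive gradient updates; which framework is used is outside what the record states.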

Dependent claim 6 recites,
“6. The method of claim 5, wherein the training comprises, after pre-training the first generative neural network: 
generating a backtranslation model that back translates from semantic representations to text by fine-tuning the pre-trained first generative neural network on a third objective using an initial parallel text-speech dataset; and [This relates to a human generating a backtranslation using pen and paper.]
generating the parallel-text speech dataset by processing the speech-only dataset using the backtranslation model.” [This relates to a human generating a parallel text speech dataset using pen and paper.] A neural network is noted as an additional limitation.

Dependent claim 7 recites,
“7. The method of claim 6, wherein the training further comprises, after fine-tuning the pre-trained first generative neural network on a second objective using the parallel text- speech dataset: 
fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset.” [This relates to a human fine tuning a dataset using pen and paper.] A neural network is noted as an additional limitation.

Dependent claim 8 recites,
“8. The method of claim 7, wherein fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset further comprises: 
fine-tuning the decoder of the pre-trained first generative neural network and fixing the encoder of the pre-trained first generative neural network.” [This relates to a human fine tuning a dataset using pen and paper.] A neural network is noted as an additional limitation.

Dependent claim 9 recites,
“9. The method of claim 4, wherein the first objective comprises 
generating uncorrupted semantic representations of the speech-only dataset by denoising corrupted semantic representations of the speech-only dataset.” [This relates to a human generating uncorrupted semantic representations of the speech-only dataset using pen and paper.] No additional limitations present.

Dependent claim 10 recites,
10. The method of claim 4, wherein the second objective comprises 
generating semantic representations of text of the parallel text-speech dataset. [This relates to a human generating semantic representations of text of the parallel text-speech dataset using pen and paper.] No additional limitations present.

Dependent claim 11 recites,
11. The method of claim 6, wherein the third objective comprises 
generating semantic representations of text of the initial parallel-text speech dataset. [This relates to a human generating semantic representations of text of the initial parallel-text speech dataset using pen and paper.] No additional limitations present.

Dependent claim 12 recites,
12. The method of claim 1, wherein the second generative neural network has a decoder-only Transformer architecture. A neural network is noted as an additional limitation.

Dependent claim 13 recites,
13. The method of claim 1, wherein the second generative neural network is trained on an audio-only dataset, wherein the audio-only dataset comprises, for each of a plurality of training audio inputs, a respective semantic representation and a respective acoustic representation. [This relates to a dataset a human can read.] A neural network is noted as an additional limitation.

Dependent claim 14 recites,
14. The method of claim 1, further comprising: 
obtaining a semantic representation of a target voice prompt comprising semantic tokens and a acoustic representation of the target voice prompt comprising acoustic tokens; and [This relates to a human obtaining a semantic representation of a target voice prompt using pen and paper.]
wherein the second generative neural network is conditioned on at least the semantic representation of the target voice prompt and the acoustic representation of the target voice prompt. [This relates to a human conditioning on prompt data in the human mind.] A neural network is noted as an additional limitation.

Dependent claim 15 recites,
“15. The method of claim 14, wherein generating, using a second generative neural network and conditioned on at least the semantic representation, an acoustic representation of the semantic representation further comprises: 
prepending the semantic representation of the target voice prompt prior to the semantic representation of the tokenized inputs; [This relates to a human prepending the semantic representation of the target voice prompt in the human mind.]
and generating an appended semantic representation by appending the acoustic representation of the target voice prompt after the semantic representation of the tokenized text inputs, 
wherein the second generative neural network is conditioned on the appended semantic representation.” [This relates to a human generating an appended semantic representation in the human mind.] A neural network is noted as an additional limitation.

Dependent claim 16 recites,
16. The method of claim 15, wherein generating the appended semantic representation further comprises: 
inserting a first separator token between the semantic representation of the target voice prompt and the semantic representation of the tokenized inputs; [This relates to a human inserting a first separator token in the human mind.]
and inserting a second separator token between the semantic representation of the tokenized inputs and the acoustic representation of the target voice prompt. [This relates to a human inserting a second separator token in the human mind.] No additional limitations present.
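
The sequence layout recited in claims 15 and 16 can be illustrated with a short sketch: the prompt's semantic tokens come first, then a first separator, then the semantic tokens of the text, then a second separator, then the prompt's acoustic tokens. The separator ids and the example token values below are hypothetical placeholders chosen only to make the ordering visible; they do not come from the application.

```python
from typing import List

SEP_1 = -1  # hypothetical id for the first separator token
SEP_2 = -2  # hypothetical id for the second separator token


def build_conditioning_sequence(prompt_semantic: List[int],
                                text_semantic: List[int],
                                prompt_acoustic: List[int]) -> List[int]:
    """Assemble the appended representation in the order recited in claims 15-16.

    Layout: prompt semantic tokens, SEP_1, text semantic tokens, SEP_2,
    prompt acoustic tokens.
    """
    return [*prompt_semantic, SEP_1, *text_semantic, SEP_2, *prompt_acoustic]


seq = build_conditioning_sequence(
    prompt_semantic=[11, 12],
    text_semantic=[21, 22, 23],
    prompt_acoustic=[901, 902],
)
# seq is [11, 12, -1, 21, 22, 23, -2, 901, 902]
```

Negative ids are used for the separators here only so that they cannot collide with the example token values; a real vocabulary would reserve dedicated ids.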

Dependent claim 17 recites,
17. The method of claim 1, wherein the decoder neural network generates the audio signal comprising audio characteristics of voice, tempo, and recording conditions. [This relates to a human generating the audio signal comprising audio characteristics of voice, tempo, and recording conditions using the voice.] A neural network is noted as an additional limitation.

As to dependent Claim 20, Claim 20 is a system claim with limitations similar to that of claim 2 and is rejected under the same rationale.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 18 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature).

Regarding Claim 1, Lin teaches 
a semantic representation of the tokenized text inputs comprising semantic tokens representing semantic content of the tokenized text inputs, each semantic token being selected from a vocabulary of semantic tokens; (see Lin page 12 paragraph 6 “taking non-Chinese as English as an example, under the condition that the English text in the text to be synthesized is followed by the Chinese text, the boundary tone corresponding to the English text in the initial prosodic information is determined as a reduced tone, under the condition that the English text in the text to be synthesized is followed by the English text, The initial prosodic information is not adjusted. by determining the boundary tone corresponding to the non-Chinese text in the initial prosodic information of the text to be synthesized as a reduced tone in the condition that the non-Chinese text is the Chinese text after the non-Chinese text, so that the joint of the prosodic of different language texts in the mixed language text is more natural, so as to further determine the boundary tone corresponding to the non-Chinese text in the text to be synthesized as the mixed language text, The rhythm of the voice synthesized by the text to be synthesized is more natural.”)
generating, using a second generative neural network (see Lin Page 13 paragraph 6 “In some embodiments, the vector of the first training text may be obtained by encoding the first training text through a text encoding model, and the text encoding model may include a BERT model or a simple model. In some embodiments, the prosodic prediction model may be a convolutional neural network or a long-term and short-term memory model.”) and conditioned on at least the semantic representation, an acoustic representation of the semantic representation comprising one or more respective acoustic tokens representing acoustic properties of the audio signal; and (see Lin Page 8 paragraph 6 “In some embodiments, the target prosodic information may be represented by a sequence, and the fused phoneme sequence and the target prosodic information may be a sequence of a spliced phoneme sequence and the target prosodic information to obtain the target phoneme sequence. In some embodiments, the coding model can be specifically determined according to the actual situation, for example, the coding model can be a BERT model or a simple model, and the disclosure does not limit the specific type of the coding model.”) (see Lin (page 5 paragraph 4) “In some embodiments, the phoneme sequence of the text to be synthesized can be obtained by manually marking the phoneme of the text to be synthesized according to the statistical knowledge. In some embodiments, the phoneme sequence of the text to be synthesized can be obtained by inquiring the phoneme of each word or word in the text to be synthesized in a preset dictionary, and the preset dictionary is pre-stored with the phoneme of multiple words or words.”) 
Lin does not specifically teach 1. A computer-implemented method for generating an audio signal from input text, the method comprising: receiving a request to convert input text into an audio signal, wherein the input text comprises a plurality of tokenized text inputs. However, Chenpeng does teach this limitation (see Chenpeng Figure 1(a), txt2vec receives request “phoneme sequence”); generating, using a first generative neural network (see Chenpeng Section 3.1, VQ acoustic feature is generated); and processing the acoustic representation using a decoder neural network to generate the audio signal (see Chenpeng Figure 1(b), vec2wave, Section 3.2).
Lin and Chenpeng are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified a semantic representation of the tokenized text inputs comprising semantic tokens representing semantic content of the tokenized text inputs, each semantic token being selected from a vocabulary of semantic tokens of Lin to incorporate the computer-implemented method for generating an audio signal from input text, the method comprising: receiving a request to convert input text into an audio signal, wherein the input text comprises a plurality of tokenized text inputs; generating, using a first generative neural network processing the acoustic representation using a decoder neural network to generate the audio signal of Chenpeng . This allows state-of-the-art performance and more natural TTS as recognized by Chenpeng Page 1, Introduction, Paragraph 4.

Regarding Independent Claim 18, Claim 18 is a system claim containing limitations similar to those of Claim 1 and is rejected under similar rationale. Furthermore, Lin teaches 18. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: (see Lin Figure 8, computer)

Regarding Independent Claim 19, Claim 19 is a non-transitory computer storage media claim containing limitations similar to those of Claim 1 and is rejected under similar rationale. Furthermore, Lin teaches 19. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: (see Lin Figure 8, computer) (see Lin page 16 paragraph 9 and page 17 paragraph 1, “In particular, in accordance with embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product including a computer program carried on a non-transient computer readable medium, the computer program including program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication device 809, or installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing device 801, the above functions defined in the method of the embodiment of the present disclosure are executed.”)

Claims 2, 12-16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature), and further in view of Kumar (US PATENT US 11580145 B1).

As to Claim 2, Lin in view of Chenpeng teaches 2. The method of claim 1,  
Lin in view of Chenpeng does not specifically teach wherein the first generative neural network has an encoder-decoder Transformer architecture. However, Kumar does teach this limitation (see Kumar, (25:32-41) “(109) In FIG. 8, the paraphrase generator 170 includes a NN rephrasing system 190, which is for example the same as the NN system 114 described with reference to FIGS. 2 to 6. The NN system 114 may be used to implement the methods described herein, using an encoder NN and a decoder NN, to generate a rephrased version of a query that includes words selected from a set of words including a first subset of words including words of the query and a second subset of words including words absent from the query.”)
Lin in view of Chenpeng and Kumar are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng to incorporate the first generative neural network has an encoder-decoder Transformer architecture of Kumar. This allows the system to more reliably return a satisfactory response as recognized by Kumar (2:21-24).

Regarding Claim 12, Lin in view of Chenpeng teaches 12. The method of claim 1,
Lin in view of Chenpeng does not specifically teach wherein the second generative neural network has a decoder-only Transformer architecture. However, Kumar does teach this limitation (see Kumar, (12:5-49) “(56) In the example of FIG. 4, the decoder NN 124 also receives a context vector, c.sub.1, in addition to receiving the start of sequence vector u.sub.SOS and the representation of the query (in this case, the fourth hidden state h.sub.5 of the encoder NN 118). A context vector generally aims to capture context in a sequence of words, and may be used to implement an attention mechanism. An attention mechanism for example allows more relevant, interesting or important words of a sentence or phrase to be identified, and focused on to a greater extent than other, less interesting words (such as common words). By inputting the context vector, c.sub.1, to the decoder NN 124, in conjunction with the start of sequence vector u.sub.SOS and the representation of the query, the accuracy of the decoder NN 124 in correctly predicting the next work in the rephrased version of the query may be increased. For example, the first hidden state of the decoder NN 124 depends on the representation of the query (which in this case is the fourth hidden state h.sub.5 of the encoder NN 118). The first word of the rephrased version of the query (after the start of sequence token) may be more likely to be similar to or the same as the first word of the query (rather than the last word of the query). However, the fourth hidden state h.sub.5 of the encoder NN 118 is four steps removed from the second hidden state h.sub.2 of the encoder NN 118, which is obtained by processing the first word of the query. 
Hence, as the length of the input query increases, the likelihood that the first word of the rephrased version of the query is correctly predicted by the decoder NN 124 may decrease, as the correlation between the representation of the query input to the decoder NN 124 and the hidden state of the encoder NN 118 corresponding to the first word of the query may also decrease. The attention mechanism attempts to compensate for this, by allowing the decoder NN 124 to attend to or focus on different parts of the query as each word of the rephrased version of the query is predicted. This for example allows the decoder NN 124 to more accurately identify an appropriate word of the query for a rephrased version of the query, regardless of the position of the word in the query. A context vector that depends on a plurality of different hidden states of the encoder NN 118, rather than merely the final hidden state of the encoder NN 118, may be used to implement such an attention mechanism. A context vector may be generated as described with reference to FIG. 6.”)
Lin in view of Chenpeng and Kumar are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng to incorporate wherein the second generative neural network has a decoder-only Transformer architecture of Kumar. This allows the system to more reliably return a satisfactory response as recognized by Kumar (2:21-24).

Regarding Claim 13, Lin in view of Chenpeng teaches 13. The method of claim 1, 
Lin in view of Chenpeng does not specifically teach wherein the second generative neural network is trained on an audio-only dataset, wherein the audio-only dataset comprises, for each of a plurality of training audio inputs, a respective semantic representation and a respective acoustic representation. However, Kumar does teach this limitation (see Kumar, (7:63-14) “(37) Second data 122 representative of a rephrased version of the query is generated using a decoder NN 124 and the representation 120 of the query. The encoder NN 118 and the decoder NN 124 for example form a sequence-to-sequence NN system. In such cases, the encoder NN 118 is trained to encode an input sequence (in this case, the query). The decoder NN 124 is trained to decode the representation 120 of the input sequence to generate a target sequence (in this case, the rephrased version of the query). The second data 122 may for example be in the same data format as the first data 116 (such as text data), or in a different format. In such examples, the encoder NN 118 and the decoder NN 124 may be considered to form an encoder and decoder pair, which may be referred to as an autoencoder. Such an autoencoder may be trained in an unsupervised manner, for example using unlabeled training data, which may simplify the training process.”)(examiner notes training query can be text or speech Kumar (1:8-9).) 
Lin in view of Chenpeng and Kumar are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng to incorporate the second generative neural network is trained on an audio-only dataset, wherein the audio-only dataset comprises, for each of a plurality of training audio inputs, a respective semantic representation and a respective acoustic representation of Kumar. This allows the system to more reliably return a satisfactory response as recognized by Kumar (2:21-24).

Regarding Claim 14, Lin in view of Chenpeng teaches 14. The method of claim 1, 
Lin in view of Chenpeng does not specifically teach further comprising: obtaining a semantic representation of a target voice prompt comprising semantic tokens and a acoustic representation of the target voice prompt comprising acoustic tokens; and wherein the second generative neural network is conditioned on at least the semantic representation of the target voice prompt and the acoustic representation of the target voice prompt. However, Kumar does teach this limitation (see Kumar, (11:49-12:5) “(55) At a first time, no words of the rephrased version of the query have been predicted yet. Hence, at this time, a start of sequence token may be received as an input to the decoder NN 124. A token is for example a character, string or other data type that may be used to represent a given concept or characteristic relating to a sequence of words, such as the start of a sequence or the end of a sequence. The start of sequence token may be any suitable token that indicates the start of the rephrased version of the query. A token may be considered to be suitable where it differs from other words from a vocabulary from which the query and the rephrased version of the query may be formed. This allows the token to be distinguished from words of the query or the rephrased version of the query. For example, the start of sequence token may be a control character, such as a null character with a value of zero. In some cases, the start of sequence token may be a learned token, character or value, which is learnt during training of the NN system 114. The start of sequence token may be in the form of a start of sequence vector, u.sub.SOS, which may be generated from a null input. For example, the null input may be associated 126 with the start of sequence vector, u.sub.SOS, to be input to the decoder NN 124 using the same association 126 as described with reference to the encoder NN 118.”)
Lin in view of Chenpeng and Kumar are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng to incorporate obtaining a semantic representation of a target voice prompt comprising semantic tokens and a acoustic representation of the target voice prompt comprising acoustic tokens; and wherein the second generative neural network is conditioned on at least the semantic representation of the target voice prompt and the acoustic representation of the target voice prompt of Kumar. This allows the system to more reliably return a satisfactory response as recognized by Kumar (2:21-24).

Regarding Claim 15, Lin in view of Chenpeng and further in view of Kumar teaches 15. The method of claim 14,
Furthermore, Kumar teaches wherein generating, using a second generative neural network and conditioned on at least the semantic representation, an acoustic representation of the semantic representation further comprises: prepending the semantic representation of the target voice prompt prior to the semantic representation of the tokenized inputs; and generating an appended semantic representation by appending the acoustic representation of the target voice prompt after the semantic representation of the tokenized text inputs, wherein the second generative neural network is conditioned on the appended semantic representation. (see Kumar, (11:49-12:5) “(55) At a first time, no words of the rephrased version of the query have been predicted yet. Hence, at this time, a start of sequence token may be received as an input to the decoder NN 124. A token is for example a character, string or other data type that may be used to represent a given concept or characteristic relating to a sequence of words, such as the start of a sequence or the end of a sequence. The start of sequence token may be any suitable token that indicates the start of the rephrased version of the query. A token may be considered to be suitable where it differs from other words from a vocabulary from which the query and the rephrased version of the query may be formed. This allows the token to be distinguished from words of the query or the rephrased version of the query. For example, the start of sequence token may be a control character, such as a null character with a value of zero. In some cases, the start of sequence token may be a learned token, character or value, which is learnt during training of the NN system 114. The start of sequence token may be in the form of a start of sequence vector, u.sub.SOS, which may be generated from a null input. 
For example, the null input may be associated 126 with the start of sequence vector, u.sub.SOS, to be input to the decoder NN 124 using the same association 126 as described with reference to the encoder NN 118.”)
Lin in view of Chenpeng, and Kumar, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin and Chenpeng to incorporate the limitation that generating, using a second generative neural network and conditioned on at least the semantic representation, an acoustic representation of the semantic representation further comprises: prepending the semantic representation of the target voice prompt prior to the semantic representation of the tokenized inputs; and generating an appended semantic representation by appending the acoustic representation of the target voice prompt after the semantic representation of the tokenized text inputs, wherein the second generative neural network is conditioned on the appended semantic representation, as taught by Kumar. This allows the system to more reliably return a satisfactory response, as recognized by Kumar (2:21-24).

Regarding Claim 16, Lin in view of Chenpeng and further in view of Kumar teaches 16. The method of claim 15,
Furthermore, Kumar teaches wherein generating the appended semantic representation further comprises: inserting a first separator token between the semantic representation of the target voice prompt and the semantic representation of the tokenized inputs; and inserting a second separator token between the semantic representation of the tokenized inputs and the acoustic representation of the target voice prompt. (see Kumar, (11:49-12:5) “(55) At a first time, no words of the rephrased version of the query have been predicted yet. Hence, at this time, a start of sequence token may be received as an input to the decoder NN 124. A token is for example a character, string or other data type that may be used to represent a given concept or characteristic relating to a sequence of words, such as the start of a sequence or the end of a sequence. The start of sequence token may be any suitable token that indicates the start of the rephrased version of the query. A token may be considered to be suitable where it differs from other words from a vocabulary from which the query and the rephrased version of the query may be formed. This allows the token to be distinguished from words of the query or the rephrased version of the query. For example, the start of sequence token may be a control character, such as a null character with a value of zero. In some cases, the start of sequence token may be a learned token, character or value, which is learnt during training of the NN system 114. The start of sequence token may be in the form of a start of sequence vector, u.sub.SOS, which may be generated from a null input. For example, the null input may be associated 126 with the start of sequence vector, u.sub.SOS, to be input to the decoder NN 124 using the same association 126 as described with reference to the encoder NN 118.”)
Lin in view of Chenpeng and further in view of Kumar are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, and Kumar to incorporate the limitation that generating the appended semantic representation further comprises: inserting a first separator token between the semantic representation of the target voice prompt and the semantic representation of the tokenized inputs; and inserting a second separator token between the semantic representation of the tokenized inputs and the acoustic representation of the target voice prompt, as taught by Kumar. This allows the system to more reliably return a satisfactory response, as recognized by Kumar (2:21-24).
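
For orientation only, the sequence layout recited in Claims 15 and 16 can be sketched as a simple concatenation. This is an illustrative sketch, not taken from the application or from Kumar; the function name, the separator IDs `SEP_1` and `SEP_2`, and the toy token values are all assumptions.

```python
from typing import List

# Hypothetical reserved IDs for the two separator tokens. A real system would
# choose IDs that do not collide with the semantic or acoustic vocabularies.
SEP_1 = -1
SEP_2 = -2


def build_conditioning_sequence(
    prompt_semantic: List[int],
    input_semantic: List[int],
    prompt_acoustic: List[int],
) -> List[int]:
    """Prepend the prompt's semantic tokens, append its acoustic tokens,
    and place one separator token at each boundary."""
    return prompt_semantic + [SEP_1] + input_semantic + [SEP_2] + prompt_acoustic


# Toy token values chosen only to make the three segments easy to tell apart.
sequence = build_conditioning_sequence([11, 12], [21, 22, 23], [31, 32])
print(sequence)
```

Under these assumptions the second generative neural network would be conditioned on `sequence` as a whole; how the tokens are embedded is outside the scope of the sketch.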

Dependent Claim 20 is a system claim with limitations similar to those of claim 2 and is rejected under the same rationale.



Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature), and further in view of Zheng (US PATENT US 12050882 B2).

As to Claim 3, Lin in view of Chenpeng teaches 3. The method of claim 2,
Lin in view of Chenpeng does not specifically teach wherein the first generative neural network is trained on a parallel text-speech dataset that maps text to semantic representations of audio corresponding to the text. However, Zheng does teach this limitation (see Zheng, (4:23-36) “(31) In one or more embodiments, a fused acoustic and text (FAT) encoder may be further extended to a sequence-to-sequence framework. Embodiments of an end-to-end fused acoustic and text speech translation model (FAT-ST) are further presented. FAT-ST may be trained from both speech and text machine translation data into a single encoder-decoder model. Meanwhile, the model may also learn from speech recognition data using an extra FAT-MLM loss. This resolves the limitation of existing single encoder and decoder speech translation models, which can only learn from scarce parallel speech translation data, but neglects much larger scale speech recognition and text machine translation data.”)
Lin in view of Chenpeng and Zheng are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng to incorporate the first generative neural network is trained on a parallel text-speech dataset that maps text to semantic representations of audio corresponding to the text of Zheng. This allows speech translation model embodiments to substantially improve translation quality as recognized by Zheng (Abstract).

Claims 4, 5 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature), and further in view of Zheng (US PATENT US 12050882 B2), and further in view of Kumar (US PATENT US 11580145 B1).

Regarding Claim 4, Lin in view of Chenpeng and further in view of Zheng teaches 4. The method of claim 3, 
Furthermore, Zheng teaches and fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset. (see Zheng (4:14-36) “(30) The present patent disclosure presents embodiments of a fused acoustic and text masked language model (FAT-MLM) to unify the representations of different languages for bilingual cross-lingual language model pre-training and speech training. The FAT-MLM may jointly learn a unified representation for both acoustic and text input. In this way, the masked language model's input may be extended from only acoustic or text data to multimodal corpora containing both acoustic and text data, such that speech recognition and speech translation may be implemented in one model. (31) In one or more embodiments, a fused acoustic and text (FAT) encoder may be further extended to a sequence-to-sequence framework. Embodiments of an end-to-end fused acoustic and text speech translation model (FAT-ST) are further presented. FAT-ST may be trained from both speech and text machine translation data into a single encoder-decoder model. Meanwhile, the model may also learn from speech recognition data using an extra FAT-MLM loss. This resolves the limitation of existing single encoder and decoder speech translation models, which can only learn from scarce parallel speech translation data, but neglects much larger scale speech recognition and text machine translation data.”)
Lin in view of Chenpeng and Zheng are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng and Zheng to incorporate fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset of Zheng. This allows speech translation model embodiments to substantially improve translation quality as recognized by Zheng (Abstract).
Lin in view of Chenpeng and further in view of Zheng does not specifically teach wherein the training on the parallel text-speech dataset comprises: pre-training the first generative neural network on a first objective using semantic representations of a speech-only dataset; However, Kumar does teach this limitation (see Kumar (24:35-67) “(107) The paraphrase generator 170 includes an exemplar mapping system 186, which for example performs shallow-parse-exemplar-mapping. Shallow-parse-exemplar-mapping for example involves recognizing an entity and a relation in the query, e.g. using a semantic parser to perform a shallow parse of the query. Shallow parsing for example involves identification of constituent parts of a phrase such as nouns, verbs and adjectives, and then identifying more complex components that for example reflect semantic relations between the constituent parts. In this way, entities and relations present in the query may be identified. Entities are for example a concept or object, and may include named entities, which are for example real-world concepts or objects (which may be abstract or exist physically) that can be denoted with a proper noun. Relations for example represent relationships or facts relating to entities. A semantic parser (or other machine learning model for identifying entities and relations in a query) may be implemented as a combination of an entity tagger, to identify entities in the query, and an intent classifier, to identify the interactions or relationships between the identified entities. An entity tagger may for example use a linear chain conditional random field (CRF) or a recurrent neural network. An intent classifier may be a feedforward neural network. Upon identification of entities and relations in the query, a generic query is generated. The generic query may be considered to be an exemplar query, which for example represents a generalized version of a query. 
For example, an exemplar query may include entity classes and relations rather than specific entities and relations. In this way, the exemplar mapping system 186 for example rephrases a query as a generalized version of the query.”)
Lin in view of Chenpeng and further in view of Zheng and Kumar are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng and Zheng to incorporate the training on the parallel text-speech dataset comprises: pre-training the first generative neural network on a first objective using semantic representations of a speech-only dataset of Kumar. This allows the system to more reliably return a satisfactory response as recognized by Kumar (2:21-24).

Regarding Claim 5, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar teaches 5. The method of claim 4, 
Furthermore Zheng teaches wherein fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset further comprises: fine-tuning lower layers of the encoder of the pre-trained first generative neural network and fixing the upper layers of the encoder of the pre-trained first generative neural network and the decoder of the pre-trained first generative neural network. (See Zheng (9:35-47) “(63) To demonstrate FAT-MLM's ability to unify the representation of different modalities and languages, the self-attention layers of a translation FAT-MLM are graphically shown in FIG. 7, FIG. 8A, and FIG. 8B. FIG. 7 graphically shows the output of one speech self-attention head at the first transformer layer in the acoustic embedding module and its corresponding spectrogram. The model in FIG. 7 is a translation FAT-MLM model trained with speech translation En.fwdarw.De dataset. The clear monotonic attention in FIG. 7 shows that a FAT-MLM method may learn good representation for speech.”) (see Zheng (13:21-48) “(86) In one or more experiments, raw audio files are used to extract multi-dimensional log-Mel filter banks stacked with 3-dimensional pitch features using a window size of 25 ms and step size of 10 ms. Text tokenizer/de-tokenizer models with a joint vocabulary size of 8K for text are trained in each dataset. Training samples that have more than 3,000 frames have been ignored for GPU efficiency. A basic transformer-based end-to-end FAT-ST framework has settings of first down-sampling the speech input with 2 layers of 2D convolution of size 3 with stride size of 2, followed by a standard 12-layer transformer with feed-forward layers of 2,048 hidden size to bridge the source and target side. Four attention heads are used on each side of the transformer and each of them has a dimensionality of 256. 
This section also shows the results of a FAT-ST big model with 4,096 hidden size for feed-forward layers of all transformer layers. For the speech reconstruction module, the outputs of the transformer encoder are simply linearly projected to another latent space, then the latent representations are upsampled with 2-layer deconvolution to match the size of the original input signal. The random masking ratio A is chosen as 30% across all the experiments including pre-training. During inference, there is no masking over the speech input. The last 5 checkpoints are averaged for testing. For decoding, a beam search is used with beam-size 5 and length penalty 0.6 for German, 0.0 for Spanish, and 0.3 for Dutch.”)
Lin in view of Chenpeng, and further in view of Zheng and Kumar, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, Zheng, and Kumar to incorporate the limitation that fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset further comprises: fine-tuning lower layers of the encoder of the pre-trained first generative neural network and fixing the upper layers of the encoder of the pre-trained first generative neural network and the decoder of the pre-trained first generative neural network, as taught by Zheng. This allows speech translation model embodiments to substantially improve translation quality, as recognized by Zheng (Abstract).
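
As an illustration of the selective fine-tuning recited in Claim 5, the following sketch marks which layers would remain trainable. It is not drawn from the application or from Zheng; the `Layer` class, the function name, and the layer counts are hypothetical, and a real implementation would toggle gradient flags in its training framework instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for a network layer with a trainable flag."""
    name: str
    trainable: bool = True


def freeze_for_fine_tuning(
    encoder: List[Layer], decoder: List[Layer], num_lower: int
) -> None:
    """Keep the first `num_lower` encoder layers trainable; fix the upper
    encoder layers and every decoder layer."""
    for index, layer in enumerate(encoder):
        layer.trainable = index < num_lower
    for layer in decoder:
        layer.trainable = False


# Assumed sizes: a 4-layer encoder and a 2-layer decoder.
encoder = [Layer(f"enc{i}") for i in range(4)]
decoder = [Layer(f"dec{i}") for i in range(2)]
freeze_for_fine_tuning(encoder, decoder, num_lower=2)
trainable_names = [layer.name for layer in encoder + decoder if layer.trainable]
print(trainable_names)
```
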

Regarding Claim 10, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar teaches 10. The method of claim 4,
Furthermore Kumar teaches wherein the second objective comprises generating semantic representations of text of the parallel text-speech dataset. (see Kumar, (31:49-56) “(136) “The further processing of the text data performed by the NLU system 202 therefore attempts to make a semantic understanding of the text data, for example to identify an intent of the text data. In this way, the NLU system 202 may be used to identify that the text data (which may for example be first text data as described above) represents a query. In this way, the NLU system 202 may therefore identify understandings of the query.”)
Lin in view of Chenpeng, and further in view of Zheng and Kumar, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, Zheng, and Kumar to incorporate the limitation that the second objective comprises generating semantic representations of text of the parallel text-speech dataset, as taught by Kumar. This allows the system to more reliably return a satisfactory response, as recognized by Kumar (2:21-24).

Claims 6-9 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature), and further in view of Zheng (US PATENT US 12050882 B2), and further in view of Kumar (US PATENT US 11580145 B1), and further in view of Jia (US PATENT US 20210217404 A1).

Regarding Claim 6, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar teaches 6. The method of claim 5, 
Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar does not specifically teach wherein the training comprises, after pre-training the first generative neural network: generating a backtranslation model that back translates from semantic representations to text by fine-tuning the pre-trained first generative neural network on a third objective using an initial parallel text-speech dataset; and generating the parallel-text speech dataset by processing the speech-only dataset using the backtranslation model. However, Jia does teach this limitation (see Jia, [0033] “In some implementations, the training data for the spectrogram generation engine 120 may be generated using the speaker encoder engine 110 after the speaker encoder engine 110 is trained. For example, a set of paired training data may originally include only pairs of input text and mel spectrograms of speech of that text. The mel spectrogram in each pair of the paired training data may be provided to the trained speaker encoder engine 110 which may output a respective speaker vector for each mel spectrogram. The system 100 may then add each speaker vector to the respective pair in the paired training data to generate the training data with triplets of text, an audio representation of speech of the text by a particular speaker, and a speaker vector for the particular speaker.”)
Lin in view of Chenpeng, and further in view of Zheng, Kumar, and Jia, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, Zheng, and Kumar to incorporate the limitation that the training comprises, after pre-training the first generative neural network: generating a backtranslation model that back translates from semantic representations to text by fine-tuning the pre-trained first generative neural network on a third objective using an initial parallel text-speech dataset; and generating the parallel-text speech dataset by processing the speech-only dataset using the backtranslation model, as taught by Jia. This allows improved adaptation quality and enables synthesis of completely novel speakers, as recognized by Jia [0007].
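
To make the data flow of the Claim 6 limitation concrete, the sketch below pairs each semantic-token sequence from a speech-only dataset with text produced by a backtranslation model. It is a minimal illustration under stated assumptions, not the applicant's or Jia's method: `build_parallel_dataset` is a hypothetical name, and `toy_backtranslate` is a placeholder for a trained model.

```python
from typing import Callable, List, Sequence, Tuple


def build_parallel_dataset(
    speech_only: Sequence[List[int]],
    backtranslate: Callable[[List[int]], str],
) -> List[Tuple[str, List[int]]]:
    """Pair each semantic-token sequence with the text the model predicts
    for it, yielding (text, semantic tokens) training examples."""
    return [(backtranslate(tokens), tokens) for tokens in speech_only]


def toy_backtranslate(tokens: List[int]) -> str:
    """Placeholder for a trained backtranslation model (an assumption)."""
    return " ".join(f"w{token}" for token in tokens)


pairs = build_parallel_dataset([[1, 2], [3]], toy_backtranslate)
print(pairs)
```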

Regarding Claim 7, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar and further in view of Jia teaches 7. The method of claim 6,
Furthermore, Zheng teaches wherein the training further comprises, after fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset: fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset. (see Zheng, (12:25-36) “(79) In one or more embodiments, a FAT-ST model may be further improved by fine-tuning from FAT-MLM. Since the FAT-ST transformer decoder predicts text only, it may be initialized from the acoustic and text shared multimodal transformer encoder. For example, parameters of the FAT-ST transformer decoder may be initialized from parameters of the transformer encoder and then be optimized during a training process. Although the transformer decoder is unidirectional which is different from bidirectional FAT-MLM, it may still benefit from FAT-MLM in experiments.”)
Lin in view of Chenpeng, and further in view of Zheng, Kumar, and Jia, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, Zheng, Kumar, and Jia to incorporate the limitation that the training further comprises, after fine-tuning the pre-trained first generative neural network on a second objective using the parallel text-speech dataset: fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset, as taught by Zheng. This allows speech translation model embodiments to substantially improve translation quality, as recognized by Zheng (Abstract).

Regarding Claim 8, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar and further in view of Jia teaches 8. The method of claim 7, 
Furthermore, Zheng teaches wherein fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset further comprises: fine-tuning the decoder of the pre-trained first generative neural network and fixing the encoder of the pre-trained first generative neural network. (see Zheng, (12:25-36) “(79) In one or more embodiments, a FAT-ST model may be further improved by fine-tuning from FAT-MLM. Since the FAT-ST transformer decoder predicts text only, it may be initialized from the acoustic and text shared multimodal transformer encoder. For example, parameters of the FAT-ST transformer decoder may be initialized from parameters of the transformer encoder and then be optimized during a training process. Although the transformer decoder is unidirectional which is different from bidirectional FAT-MLM, it may still benefit from FAT-MLM in experiments.”)
Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar and further in view of Jia and Zheng are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng and Zheng and Kumar and Jia to incorporate fine-tuning the pre-trained first generative neural network on the initial parallel-text speech dataset further comprises: fine-tuning the decoder of the pre-trained first generative neural network and fixing the encoder of the pre-trained first generative neural network of Zheng. This allows speech translation model embodiments to substantially improve translation quality as recognized by Zheng (Abstract).

Regarding Claim 9, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar teaches 9. The method of claim 4, 
Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar does not specifically teach wherein the first objective comprises generating uncorrupted semantic representations of the speech-only dataset by denoising corrupted semantic representations of the speech-only dataset. However, Jia does teach this limitation (see Jia, [0042] “Additionally or alternatively, a decoder of the network may include both L2 loss on spectrogram feature reconstruction with an additional L1 loss. A combined loss may be more robust on noise training data. Additionally or alternatively, noise reduction by spectral subtraction, e.g., at 10-percentile, may be performed on the targets for the mel spectrogram prediction network to further make the synthesized audio clean.”)
Lin in view of Chenpeng, and further in view of Zheng, Kumar, and Jia, are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin, Chenpeng, Zheng, and Kumar to incorporate the limitation that the first objective comprises generating uncorrupted semantic representations of the speech-only dataset by denoising corrupted semantic representations of the speech-only dataset, as taught by Jia. This allows improved adaptation quality and enables synthesis of completely novel speakers, as recognized by Jia [0007].
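
The denoising objective recited in Claim 9 can be pictured as building (corrupted input, clean target) pairs from semantic tokens. The sketch below assumes random token masking as the corruption; the claim does not specify the corruption, and the name `make_denoising_example`, the `MASK_ID` value, and the masking ratio are hypothetical choices made only for illustration.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical ID for the mask token


def make_denoising_example(
    tokens: List[int], mask_ratio: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Return (corrupted input, clean target): each token is independently
    replaced by MASK_ID with probability `mask_ratio`."""
    corrupted = [MASK_ID if rng.random() < mask_ratio else token for token in tokens]
    return corrupted, list(tokens)


clean = [5, 6, 7, 8, 9, 10]
corrupted, target = make_denoising_example(clean, mask_ratio=0.3, rng=random.Random(0))
print(corrupted, target)
```

A model trained on this objective would receive `corrupted` as input and be scored on how well it reproduces `target`.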

Regarding Claim 11, Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar and further in view of Jia teaches 11. The method of claim 6,
Furthermore, Zheng teaches wherein the third objective comprises generating semantic representations of text of the initial parallel-text speech dataset. (see Zheng, (18:19-44) “(79) As can be seen, in examples such as this, the query and the rephrased version of the query are in the same language as each other (English, in this example). Hence, methods such as this may be used to rephrase a query without translating the query into a different language. A language is for example a natural language, which has arisen naturally through human use. Alternatively, a language may be a constructed or artificial language, such as Esperanto, which has been devised for communication. For example, in some cases a language of the query may be determined, for example based on metadata associated with data representative of the query, or based on NLU applied to the query. Then, the words for the rephrased version of the query may be selected from the set of words which are in the language of the query. In such cases, the set of words may include words of a single language only. For example, there may be different sets of words (such as a different second subset of words) in each of a plurality of different languages. In such cases, the set of words (or the second subset of words) in a language which is the same as the language of the query may be selected for use with the methods herein. In other cases, though, the set of words (such as the second subset of words) may include words in various different languages. In such cases, the words for the rephrased version of the query may be selected from those words of the set of words that are in the same language as the language of the query.”)
Lin in view of Chenpeng and further in view of Zheng and further in view of Kumar and further in view of Jia and Zheng are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Lin and Chenpeng and Zheng and Kumar and Jia to incorporate the third objective comprises generating semantic representations of text of the initial parallel-text speech dataset of Zheng. This allows speech translation model embodiments to substantially improve translation quality as recognized by Zheng (Abstract).


Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Lin (Foreign Application Number CN 114242035 B), in view of Chenpeng (Non-Patent Literature VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature), and further in view of Jia (US PATENT US 20210217404 A1).

Regarding Claim 17, Lin in view of Chenpeng teaches 17. The method of claim 1, 
Lin in view of Chenpeng does not specifically teach wherein the decoder neural network generates the audio signal comprising audio characteristics of voice, tempo, and recording conditions. However, Jia does teach this limitation (see Jia, [0038] “The LSTM speaker encoder is used to condition the synthesis network on a reference speech signal from the desired target speaker. Good generalization can be achieved using a reference speech signal which captures the characteristics of different speakers. Good generalization can lead to the identification of these characteristics using only a short adaptation signal, independent of its phonetic content and background noise. These objectives are satisfied using a speaker-discriminative model trained on a text-independent speaker verification task. The LSTM speaker encoder may be a speaker-discriminative audio embedding network, which is not limited to a closed set of speakers. (examiner notes voice as different speakers) (examiner notes recording conditions as “background noise”)”)(see Jia [0028] “As shown in FIG. 1, the system 100 includes a speaker encoder engine 110 and a spectrogram generation engine 120. The speaker encoder engine 110 receives an audio representation of a target speaker speaking and outputs a speaker vector, also called a speaker embedding vector or embedding vector, for the target speaker. For example, the speaker encoder engine 110 receives an audio recording of John Doe saying “Hello my name is John Doe” and, in response, outputs a vector with values that identify John Doe. The speaker vector may also capture the characteristic speaking rate of the speaker. (examiner notes tempo as “speaking rate of the speaker”)”)
Lin in view of Chenpeng and further in view of Jia are in the same field of endeavor of signal processing. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Lin and Chenpeng to incorporate the limitation that the decoder neural network generates the audio signal comprising audio characteristics of voice, tempo, and recording conditions, as taught by Jia. This allows improved adaptation quality and enables synthesis of completely novel speakers, as recognized by Jia [0007].

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659   

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
Prosecution Timeline

Jul 23, 2024: Application Filed
May 06, 2026: Non-Final Rejection mailed (§101, §103; current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/513,614
Patent 12592219
Hearing Device User Communicating With a Wireless Communication Device
4y 5m to grant Granted Mar 31, 2026
17/415,675
Patent 12548569
METHOD AND SYSTEM OF DETECTING AND IMPROVING REAL-TIME MISPRONUNCIATION OF WORDS
3y 2m to grant Granted Feb 10, 2026
17/790,795
Patent 12548564
SYSTEM AND METHOD FOR CONTROLLING A PLURALITY OF DEVICES
3y 7m to grant Granted Feb 10, 2026
17/940,549
Patent 12547894
ENTROPY-BASED ANTI-MODELING FOR MACHINE LEARNING APPLICATIONS
3y 5m to grant Granted Feb 10, 2026
18/311,150
Patent 12547840
MULTI-STAGE PROCESSING FOR LARGE LANGUAGE MODEL TO ANSWER MATH QUESTIONS MORE ACCURATELY
2y 9m to grant Granted Feb 10, 2026
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 63%
With Interview (+22.3%): 85%
Median Time to Grant: 3y 0m (~1y 2m remaining)
PTA Risk: Low
Based on 46 resolved cases by this examiner. Grant probability derived from career allowance rate.