Detailed Action
This communication is in response to the Arguments and Amendments filed on 11/17/2025. Claims 1-6, 8-13 and 15-21 are pending and have been examined. Claims 7 and 14 have been cancelled.
Any previous objection/rejection not mentioned in this Office Action has been withdrawn by the Examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Response to Amendment
The Applicant has not amended the claims.
Regarding the 35 U.S.C. 101 rejection, the Applicant cites
the August 2025 Memo, Example 39, Example 47, and Example 48.
The Applicant argues the claims involve mathematical concepts but do not recite the judicial exceptions and cannot practically be performed in the human mind. Further, the Applicant argues the claims improve the functioning of neural TTS computer systems, pointing to Example 47 (claim 3) and Example 48 (claim 2) of the August 2025 Memo.
Examiner notes Claim 1 (and dependent claims that do not add qualifying technical detail) is directed to judicial exceptions under Step 2A (both mathematical concepts and certain mental process characterizations of data manipulation) and, when considered as a whole, does not sufficiently integrate those exceptions into a practical application nor recite an “inventive concept” under Step 2B.
The Applicant argues that the claims recite limitations that cannot practically be performed in the human mind. The claims explicitly require:
1. "generating, by an encoder, a phone feature"
Examiner notes an encoder is a generic element in speech processing, and a human is capable of generating a phone feature of a text input using voice.
2. "using a word embedding model comprising a sequence-to-sequence encoder-decoder framework"
Examiner notes a sequence-to-sequence encoder-decoder framework is a generic, conventional machine-learning component recited at a high level of generality; as discussed below, naming such a component without structural constraints does not remove the limitation from the judicial exception analysis.
3. "generating, by a vocoder, a speech waveform"
Examiner notes the August 2025 Memo’s reminder that the mental process grouping has limits is acknowledged, but it does not preclude application of the mental process grouping here. Several claim limitations recite high-level cognitive/data manipulation concepts (e.g., generating embeddings, averaging vectors, aligning sequences, generating context features) that can be characterized as mental/data processing concepts. The fact that a neural network performs them in practice does not automatically remove them from the judicial exception analysis absent claim language or specification evidence showing a concrete technological improvement to computer functionality.
Applicant cites Example 39 and compares it to claim 1, arguing the present claims similarly merely involve, but do not recite, mathematics.
Examiner notes Step 2A — Prong One: The claim is directed to judicial exceptions (mathematical concepts and mental process grouping)
• Mathematical concepts: Claim 1 recites a sequence of data transformations and computations: generating a phone feature, generating word embedding vector sequences, computing an average embedding vector sequence, aligning embedding sequences with phone sequences, generating context features, and using these numeric features to synthesize a waveform. These steps are paradigmatic mathematical/data-processing operations (vector generation, averaging, alignment, numeric-feature-based signal synthesis) and therefore fall within the “mathematical concepts” exception recognized by the USPTO and the Federal Circuit (see, e.g., Digitech, SAP America, Electric Power Group).
Applicant notes Mental process grouping (August 2025 Memo). The present claims do not "require specific mathematical calculations by referring to the mathematical calculations by name." They recite neural network components and data transformations without specifying the underlying mathematical algorithms.
Examiner notes mental process grouping (August 2025 Memo): The August 2025 Memo instructs examiners not to overextend the mental process grouping to limitations that “cannot practically be performed in the human mind,” while limitations that are essentially cognitive/data manipulation concepts may remain within it. Applicant argues the recited neural components (encoders, sequence-to-sequence models, vocoders) cannot be performed in the human mind and therefore cannot be mental processes. That point is acknowledged: the literal operation of a deep neural network with millions or billions of parameters cannot be executed mentally.
However, the statutory exception analysis asks whether the claim is “directed to” an abstract idea — and courts and the USPTO have recognized that claims which at a high level recite the performance of cognitive or mental tasks (e.g., categorizing, comparing, organizing information, calculating statistical measures, aligning sequences) may be placed in the mental process grouping even if implemented by machines (see examples in the August 2025 Memo and cases like SAP, Electric Power Group).
o Examiner notes the claim language here is largely high-level and functional: “generating a phone feature,” “generating a word embedding vector sequence,” “generating an average embedding vector sequence,” and “aligning” — these can reasonably be characterized as information processing/mental-process-like steps (conceptual transformations of linguistic and acoustic information). Absent claim detail tying the operations to specific technical mechanisms that go beyond mere data processing, these limitations are susceptible to classification as mental/data manipulation concepts under Step 2A, Prong One.
o Examiner notes that a limitation’s use of a neural network in practice does not alone place it outside the mental process grouping; what matters for Step 2A, Prong One is whether the claim language, read as a whole, is directed to a judicial exception (here, data processing/mathematical/mental concepts). The August 2025 Memo narrows the mental process grouping but does not establish a per se rule that any claim naming a neural network component is removed from the judicial exception analysis.
o Applicant notes the claim merely “involves” mathematics as in Example 39 (training a network) and thus does not recite a judicial exception because no formulas or algorithm names appear. That argument is not persuasive here: courts have found abstractness even where claims do not recite formulas by name, if the claim effectively recites mathematical/data transformations (e.g., generating embeddings, averaging vectors, aligning sequences) (see SAP America; Digitech).
o Examiner notes Example 39 (training an ANN) may be patent eligible in certain contexts where the claim limitation is not a standalone data transformation and the claim otherwise integrates the limitation into a practical application that improves technology. But Example 39’s principle is fact-sensitive: a training limitation that is merely high-level may still be considered as only “involving” mathematics; by contrast, a chain of numeric feature transformations that are the heart of the claimed method can be treated as reciting mathematical concepts. Here, the claim centralizes numeric transformations as the means of achieving the result (a speech waveform), so the mathematical concept/mental process grouping applies.
Step 2A — Prong Two: Does the claim integrate the exception into a practical application?
• Applicant notes the claim integrates the exception into a practical application by solving the TTS “one-to-many mapping” problem and by reciting a specific combination of neural components and multi-level context integration. The specification provides a useful problem/solution narrative, but the eligibility inquiry is primarily directed to the claim language as a whole and whether the claim recites particular technical means that show an improvement in computer functionality or other technology (cf. USPTO Memo, examples).
Examiner notes that, on the present claim wording, the limitations are largely functional and outcome-oriented (generate features, average, align, synthesize) without concrete computational detail or a recitation of how the arrangements materially improve the functioning of the computer system itself (e.g., speed/latency reductions, memory or computational efficiency, novel data representations that reduce error by a measurable metric, or specific unconventional network architectures constrained in a way that produces the improvement).
o Examiner notes Example 47 (network security) was eligible because the claim included steps that changed system behavior in a concrete, real-time way (dropping/blocking packets), and that improvement was reflected in claim steps beyond mere data analysis. Here, claim 1 culminates in creating a speech waveform; while useful, outputting a waveform is a data generation task and not per se equivalent to controlling or altering physical system behavior or computer resources in the way Example 47’s real-time packet mitigation did.
o Examiner notes Example 48 (speech separation) included more explicit problem-solution language and concrete mathematical steps, and demonstrated how the claim’s pipeline provided an improvement over prior art speech separation by reciting clustering, masking, and resynthesis steps tied to the problem explanation. Example 48’s claim recited particular data transformations and a pipeline that is treated as integrating the exception into a practical application; the present claim is similar in domain but, as written, is less explicit about the particular technical steps (e.g., a precise alignment algorithm, a constrained architecture, or a demonstrable technical effect) that would make the improvement evident from the claim language alone.
• Examiner notes the August 2025 Memo instructs examiners to consult the specification to determine whether the claimed invention reflects an improvement to technology. The specification here does state the improvement (multi-level context to alleviate the one-to-many mapping). But per the MPEP and case law, showing an improvement may require either (a) claim language that itself conveys the particular technical means producing the improvement, or (b) evidence that the recited combination is unconventional and produces a technical effect (a factual inquiry). On the present record the claim does not persuasively present either.
Step 2B — Inventive concept: the claim does not add significantly more
• Examiner notes that, assuming arguendo that some claim limitations recite an abstract idea, the additional elements must supply an “inventive concept.” The claim recites known functional components (encoder, sequence-to-sequence embedding model, vocoder) and high-level data transformations. Without claim specificity tying those components to particular unconventional architectures, constrained parameterizations, training-regimen steps, or demonstrable improvements, the recited elements appear to be routine, conventional uses of neural networks and generic software components, and therefore fail to supply an inventive concept (see Alice; Berkheimer — a factual showing may rebut this with evidence).
• Applicant points to the non-conventional arrangement doctrine (BASCOM) and cases recognizing technical improvements from particular configurations (Ancora, Visual Memory). Those cases are fact-specific; they turned on claim limitations that imposed specific technical constraints or caused concrete improvements. To prevail here, Applicant should (and may) either:
o Amend claims to recite the technical features and constraints that make the arrangement unconventional; or
o Submit evidence (declarations, benchmarks, contemporaneous technical documentation) showing that the claimed combination, as recited, was not conventional, and that the claimed steps produce a particular technical improvement to TTS systems (Berkheimer factual showing). Absent such evidence or claim specificity, Step 2B does not find an inventive concept.
• Examiner notes “Encoder, sequence-to-sequence architecture, vocoder cannot be mental processes”: Agreed that the literal performance of millions of neural computations is not a human mental act. But the question under Step 2A, Prong One is whether the claim, read as a whole, is directed to an abstract idea — and high-level cognitive/data manipulations (e.g., generate/align/average embeddings) are within the scope of judicial exceptions unless the claim ties them to particular technical implementations or practical improvements. Naming neural components without structure or constraints does not automatically transform an abstract data processing claim into a patent-eligible technological improvement.
• Applicant notes “No recited algorithms by name — thus not an abstract idea”: Not persuasive. A claim reciting operations that are inherently mathematical/data-transformative may be characterized as reciting a mathematical concept even without naming the algorithm (see Digitech, SAP). The decisive inquiry is the claim’s focus: if the focus is on data transformation or mental-process-like manipulation, it is vulnerable to categorization as an abstract idea.
• Applicant notes “Claim as a whole integrates exception into practical application (one-to-many mapping solved)”: The specification describes the technical problem and solution; to carry this over to claimable subject matter without further detail, Applicant must show the claimed elements themselves reflect that improvement. As written, the claim’s high-level functional language does not make the improvement evident on the face of the claim; therefore the rejection stands absent amendment or evidentiary support.
• Examiner notes the § 101 analysis is highly fact-sensitive. If Applicant can show (via claim amendment and/or evidence) that the recited components and their combination are unconventional, that the claimed steps produce a technical improvement in TTS systems, and that such improvement is reflected in the claim language, the analysis may change. As acknowledged above, the August 2025 Memo narrows the mental process grouping and asks examiners to avoid overbroad application; that instruction is applied here but does not, on the present claim language, lead to allowance.
The Applicant’s arguments and amendments do not overcome the 35 U.S.C. 101 rejection.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-6, 8-13 and 15-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Independent Claim 1 recites, “1. A method for generating speech through neural text-to-speech (TTS) synthesis, comprising:
obtaining a text input;
generating, by an encoder, a phone feature of the text input;
generating a word embedding vector sequence based on a word sequence using a word embedding model comprising a sequence-to-sequence encoder-decoder framework;
generating an average embedding vector sequence corresponding to at least one sentence based on the word embedding vector sequence;
aligning the average embedding vector sequence with the phone sequence of the text input;
generating context features of the text input based on a set of sentences associated with the text input;
and generating, by a vocoder, a speech waveform corresponding to the text input based on the phone feature and the context features, and the aligned average embedding vector sequence.”
The limitations of “obtaining …”, “generating …”, “generating …”, “generating …”, “aligning …”, “generating …”, “generating …”, as drafted, cover a mental activity or human process.
More specifically, a human is capable of obtaining a text input using the human visual system by observing text that is written, inscribed, or displayed visually in some way; a human can then use this visual information in further downstream cognitive processes.
A human is capable of generating the context features based on the word sequence, which comprises generating a word embedding vector sequence based on the word sequence. This relates to a human using cognitive processes to generate context features.
A human is capable of generating, using the logic and reasoning powers of the human mind, a word embedding vector based on the word sequence, and, using human physical processes, typing or writing out the word embedding vector based on the word sequence.
A human is capable of generating an average embedding vector sequence corresponding to the at least one sentence based on the word embedding vector sequence. This relates to a human using cognitive processes, through the logic and reasoning powers of the human mind, to generate an average embedding vector sequence corresponding to the at least one sentence based on the word embedding vector sequence, and, using human physical processes, typing or writing out the average embedding vector sequence.
A human is capable of aligning the average embedding vector sequence with a phone sequence of the text input. This relates to a human using cognitive processes and logic and reasoning to align the average embedding vector sequence with a phone sequence of the text input. The claim relates to generating the context features based on the aligned average embedding vector sequence. This relates to a human using cognitive processes and logic and reasoning to generate the context features based on the aligned average embedding vector sequence.
A human is capable of generating context features of the text input based on a set of sentences associated with the text input using the cognitive processes of the human mind such as natural language understanding and linguistic knowledge to map the text to an element or elements of context that resides in memory and then interpret the context of the given sentence, thereby creating context features.
A human is capable of generating a speech waveform corresponding to the text input based on the phone feature and the context features using the natural process of speech production through the vocal tract to form and articulate words and sentences. A human is capable of aligning the average embedding vector sequence using pen and paper. The claim is directed to an abstract idea. No additional elements are present in the claim.
Regarding Independent Claim 15, Claim 15 is an apparatus claim with limitations similar to that of claim 1 and is rejected under the same rationale.
Regarding Independent Claim 19, Claim 19 is a storage medium claim with limitations similar to those of claim 1 and is rejected under the same rationale.
With respect to Claims 2, 16 and 20, the claim relates to generating the context features comprising: obtaining acoustic features corresponding to at least one sentence of the set of sentences before the text input. This relates to a human using auditory processes to listen to a given spoken sentence and picking out the acoustic features before and after a sentence. The claim relates to generating the context features based on the acoustic features. This relates to a human using auditory processes to listen to a sentence with acoustic features and picking out context features. No additional limitations are present.

With respect to Claims 3, 17 and 21, the claim relates to aligning the context features with a phone sequence of the text input. This relates to a human using the context features in combination with a phone sequence as described above in claim 1. No additional limitations are present.

With respect to Claims 4 and 18, the claim relates to identifying a word sequence from at least one sentence of the set of sentences. This relates to a human using cognitive processes to listen to or observe a sentence and identify a word sequence in a given set of sentences. The claim relates to generating the context features based on the word sequence. This relates to a human using cognitive processes as described in claims 1 and 15 to create context features for a given word sequence. No additional limitations are present.

With respect to Claim 5, the claim relates to the at least one sentence comprising at least one of: a sentence corresponding to the text input, sentences before the text input, and sentences after the text input. This relates to a human using cognitive processes to create or identify a suitable sentence before and after a given sentence text input. No additional limitations are present.

With respect to Claim 6, the claim relates to the at least one sentence representing content of the set of sentences. This relates to a human using cognitive processes to create or identify a sentence that represents content of a set of sentences. No additional limitations are present.

With respect to Claim 8, the claim relates to determining a position of the text input in the set of sentences. This relates to a human using cognitive processes and logic and reasoning to determine a position of the text input in the set of sentences and generate the context features based on the location. No additional limitations are present.

With respect to Claim 9, the claim relates to generating the context features based on the location, comprising: generating a position embedding vector sequence based on the location. This relates to a human using cognitive processes and logic and reasoning to generate a position embedding vector sequence based on the location. The claim relates to aligning the position embedding vector sequence with a phone sequence of the text input. This relates to a human using cognitive processes and logic and reasoning to align the position embedding vector sequence with a phone sequence of the text input. The claim relates to generating the context features based on the aligned position embedding vector sequence. This relates to a human using cognitive processes and logic and reasoning to generate the context features based on the aligned position embedding vector sequence. No additional limitations are present.

With respect to Claim 10, the claim relates to combining the phone feature and the context features into mixed features. This relates to a human using cognitive processes and logic and reasoning to combine the phone feature and the context features into mixed features. The claim relates to applying an attention mechanism on the mixed features to obtain attended mixed features. This relates to a human using cognitive processes and logic and reasoning to apply an attention mechanism on the mixed features to obtain attended mixed features. The claim relates to generating the speech waveform based on the attended mixed features. This relates to a human using cognitive processes and logic and reasoning to generate the speech waveform based on the attended mixed features. No additional limitations are present.

With respect to Claim 11, the claim relates to generating the speech waveform comprising: combining the phone feature and the context features into first mixed features. This relates to a human using cognitive processes and logic and reasoning to combine the phone feature and the context features into first mixed features. The claim relates to applying a first attention mechanism on the first mixed features to obtain first attended mixed features. This relates to a human using cognitive processes and logic and reasoning to apply a first attention mechanism on the first mixed features to obtain first attended mixed features. The claim relates to applying a second attention mechanism on at least one context feature of the context features to obtain at least one attended context feature. This relates to a human using cognitive processes and logic and reasoning to apply a second attention mechanism on at least one context feature of the context features to obtain at least one attended context feature. The claim relates to combining the first attended mixed features and the at least one attended context feature into second mixed features. This relates to a human using cognitive processes and logic and reasoning to combine the first attended mixed features and the at least one attended context feature into second mixed features. The claim relates to generating the speech waveform based on the second mixed features. This relates to a human using cognitive processes and logic and reasoning to generate the speech waveform based on the second mixed features. No additional limitations are present.

With respect to Claim 12, the claim relates to combining the phone feature and the context features into first mixed features. This relates to a human using cognitive processes and logic and reasoning to combine the phone feature and the context features into first mixed features. The claim relates to applying an attention mechanism on the first mixed features to obtain first attended mixed features. This relates to a human using cognitive processes and logic and reasoning to apply an attention mechanism on the first mixed features to obtain first attended mixed features. The claim relates to performing average pooling on at least one context feature of the context features to obtain at least one average context feature. This relates to a human using cognitive processes and logic and reasoning to perform average pooling on at least one context feature of the context features to obtain at least one average context feature. The claim relates to combining the first attended mixed features and the at least one average context feature into second mixed features. This relates to a human using cognitive processes and logic and reasoning to combine the first attended mixed features and the at least one average context feature into second mixed features. The claim relates to generating the speech waveform based on the second mixed features. This relates to a human using cognitive processes and logic and reasoning to generate the speech waveform based on the second mixed features. No additional limitations are present.

With respect to Claim 13, the claim relates to identifying a phone sequence from the text input. This relates to a human using cognitive processes and logic and reasoning to identify a phone sequence from the text input. The claim relates to updating the phone sequence by adding a begin token and/or an end token to the phone sequence. This relates to a human using cognitive processes and logic and reasoning to update the phone sequence by adding a begin token and/or an end token to the phone sequence. The claim relates to determining the length of the begin token and the length of the end token according to the context features. This relates to a human using cognitive processes and logic and reasoning to determine the length of the begin token and the length of the end token according to the context features. The claim relates to generating the phone feature based on the updated phone sequence. This relates to a human using cognitive processes and logic and reasoning to generate the phone feature based on the updated phone sequence. No additional limitations are present.
Allowable Subject Matter
Claims 1-6, 8-13 and 15-21 are rejected under 35 USC § 101 as indicated above, but would be allowable if rewritten or amended to overcome the rejections under 35 USC § 101.
The following is a statement of reasons for the indication of allowable subject matter:
None of the prior art references found, either alone or in combination, discloses the subject matter as claimed.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659