Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (an abstract idea, a law of nature, or a natural phenomenon) without significantly more and without integration into a practical application, specifically for one or more of the following reasons:
1) Not integrating a judicial exception into a practical application (see explanation below), and
2) Not reciting elements that would amount to significantly more than the judicial exception (see explanation below).
Accordingly, claims 1-20 are directed to patent-ineligible subject matter under 35 U.S.C. 101.
The independent claims:
Taking the current claim limitations of the present invention, the claims are directed to training a model (analogous to the human mind) to alter or mask a signal and find the differences upon output, e.g., across mixed languages, in order to better predict future signals in one or more languages. More specifically, the claims demonstrate mathematical waveform or signal operations that alter an x-y axis signal, e.g., through a Fourier transformation or other function f(x) modification, that do not require hardware or software and are not significantly more per se; a practical application or improvement of technology cannot be discerned from either the claims or the specification of the present invention.
Regarding the claim limitations of claims 1 and 15 as recited:
1. A method for training an AI duration model to control the duration of speech utterances by a text-to-speech computing system when converting text into speech, the method comprising: providing training data to the AI duration model, the training data including a plurality of phonemes derived from a string of text, corresponding actual frame durations for each of the phonemes, and a target output speech time duration; the AI duration model masking actual frame durations for a subset of the plurality of phonemes; the AI duration model generating predicted frame durations for the masked actual frame durations of the subset of the plurality of phonemes; and calculating a loss with a loss function to quantify a difference of at least the predicted frame durations and the actual frame durations, and using the loss to train the AI duration model by adjusting parameters of the AI duration model that are used to generate the predicted frame durations.
15. A method for using an AI duration model to control the generation of speech utterances by a text-to-speech computing system when converting text into speech, the method comprising: obtaining an AI duration model trained to generate phonemes and frame durations for the phonemes based on inputs comprising text to be converted into speech and a target output speech time duration; identifying the text to be converted into speech; identifying the target output speech time duration; providing the text and the target output speech time duration to the AI duration model, wherein the AI duration model tokenizes the text into a plurality of phonemes and predicts a frame duration for each phoneme in the plurality of phonemes based on the target output speech time duration, such that the summation of the frame durations for the plurality of phonemes is approximately equal to the target output speech time duration; and generating output based on the phonemes and predicted frame duration for each phoneme.
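For purposes of clarity in the analysis below, the following is a minimal, illustrative Python/PyTorch sketch of the claimed training and inference flow as the Examiner reads claims 1 and 15. The model architecture, mask sentinel, and all hyperparameters are hypothetical and are not drawn from the Applicant's disclosure or the cited references:

    import torch
    import torch.nn as nn

    MASK = -1.0  # hypothetical sentinel standing in for a masked actual frame duration

    class DurationModel(nn.Module):
        def __init__(self, num_phonemes: int = 64, hidden: int = 32):
            super().__init__()
            self.embed = nn.Embedding(num_phonemes, hidden)
            self.head = nn.Linear(hidden + 2, 1)  # phoneme embedding + masked duration + target

        def forward(self, phonemes, masked_durations, target_s):
            tgt = torch.full_like(masked_durations, target_s)
            x = torch.cat([self.embed(phonemes),
                           masked_durations.unsqueeze(-1),
                           tgt.unsqueeze(-1)], dim=-1)
            return self.head(x).squeeze(-1)  # predicted frame durations per phoneme

    model = DurationModel()
    phonemes = torch.randint(0, 64, (8,))   # tokenized text (8 phonemes)
    actual = torch.rand(8) * 10.0           # actual frame durations from training data
    mask = torch.rand(8) < 0.3              # subset of phonemes to mask (claim 1)
    masked = actual.masked_fill(mask, MASK)

    pred = model(phonemes, masked, target_s=5.0)
    loss = ((pred[mask] - actual[mask]) ** 2).mean()  # difference of predicted vs. actual
    loss.backward()                                   # adjust model parameters (claim 1)

    # Inference (claim 15): rescale so the summed durations approximate the target.
    with torch.no_grad():
        raw = model(phonemes, torch.full((8,), MASK), target_s=5.0)
        scaled = raw * (5.0 / raw.sum())  # sum(scaled) is approximately the target duration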
Step 1: IS THE CLAIM DIRECTED TO A PROCESS, MACHINE, MANUFACTURE OR COMPOSITION OF MATTER?
Yes
Step 2A.1: IS THE CLAIM DIRECTED TO A LAW OF NATURE, A NATURAL PHENOMENON (PRODUCT OF NATURE) OR AN ABSTRACT IDEA?
Yes
Step 2A.2: DOES THE CLAIM RECITE ADDITIONAL ELEMENTS THAT INTEGRATE THE JUDICIAL EXCEPTION INTO A PRACTICAL APPLICATION?
No. Regarding the independent claims: analogous to Solutran, Inc. v. Elavon, Inc., 931 F.3d 1161, 2019 USPQ2d 281076 (Fed. Cir. 2019), the claims are directed to training a model (analogous to the human mind) to alter or mask a signal and find the differences upon output, e.g., across mixed languages, in order to better predict future signals in one or more languages. More specifically, the claims demonstrate mathematical waveform or signal operations that alter an x-y axis signal, e.g., through a Fourier transformation or other function f(x) modification, that do not require hardware or software and are not significantly more per se; a practical application or improvement of technology, such as a clear improvement of function/technology, cannot be discerned from either the claims or the specification of the present invention.
Further as demonstrated in Solutran, Inc. v. Elavon, Inc., 931 F.3d 1161, 2019 USPQ2d 281076 (Fed. Cir. 2019), the claims were to methods for electronically processing paper checks, all of which contained limitations setting forth receiving merchant transaction data from a merchant, crediting a merchant’s account, and receiving and scanning paper checks after the merchant’s account is credited. In part one of the Alice/Mayo test, the Federal Circuit determined that the claims were directed to the abstract idea of crediting the merchant’s account before the paper check is scanned. The court first determined that the recited limitations of “crediting a merchant’s account as early as possible while electronically processing a check” is a “long-standing commercial practice” like in Alice and Bilski. 931 F.3d at 1167, 2019 USPQ2d 281076, at *5 (Fed. Cir. 2019). The Federal Circuit then continued with its analysis under part one of the Alice/Mayo test finding that the claims are not directed to an improvement in the functioning of a computer or an improvement to another technology. In particular, the court determined that the claims “did not improve the technical capture of information from a check to create a digital file or the technical step of electronically crediting a bank account” nor did the claims “improve how a check is scanned.” Id.
The following addresses the December 5, 2025 Memorandum in light of the September 26, 2025 Appeals Review Panel decision in Ex parte Desjardins, Appeal 2024-000567 (Application 16/319,040), concerning whether a recited abstract idea does or does not direct the entire claim to an abstract idea when the claim is considered as a whole.
The claim found to demonstrate improvements to technology and/or function recites: "adjust the first values of the plurality of parameters to optimize performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task."
The decision recites that “We are persuaded that constitutes an improvement to how the machine learning model itself operates, and not, for example, the identified mathematical calculation.”
When considering the limitation decided upon, there are clear improvements to machine learning that are not rudimentary or a long-standing practice; for instance, adjusting for optimization and protection of performance, as claimed, constitutes an improvement to a machine learning model's operations, not simply a general mathematical or generic recitation, but rather an improvement to function.
Specifically, Ex Parte Desjardins explained the following:
Enfish ranks among the Federal Circuit's leading cases on the eligibility of technological improvements. In particular, Enfish recognized that “[m]uch of the advancement made in computer technology consists of improvements to software that, by their very nature, may not be defined by particular physical features but rather by logical structures and processes.” 822 F.3d at 1339. Moreover, because “[s]oftware can make non-abstract improvements to computer technology, just as hardware improvements can,” the Federal Circuit held that the eligibility determinations should turn on whether “the claims are directed to an improvement to computer functionality versus being directed to an abstract idea.” Id. at 1336. (Desjardins, page 8).
Further, specifically:
“Paragraph 21 of the Specification, which the Appellant cites, identifies improvements in training the machine learning model itself. Of course, such an assertion in the Specification alone is insufficient to support a patent eligibility determination, absent a subsequent determination that the claim itself reflects the disclosed improvement. See MPEP § 2106.05(a) (citing Intellectual Ventures I LLC v. Symantec Corp., 838 F.3d 1307, 1316 (Fed. Cir. 2016)). Here, however, we are persuaded that the claims reflect such an improvement. For example, one improvement identified in the Specification is to "effectively learn new tasks in succession whilst protecting knowledge about previous tasks." Spec. ¶ 21. The Specification also recites that the claimed improvement allows artificial intelligence (AI) systems to "us[e] less of their storage capacity" and enables "reduced system complexity." Id. When evaluating the claim as a whole, we discern at least the following limitation of independent claim 1 that reflects the improvement: "adjust the first values of the plurality of parameters to optimize performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task." We are persuaded that constitutes an improvement to how the machine learning model itself operates, and not, for example, the identified mathematical calculation. Under a charitable view, the overbroad reasoning of the original panel below is perhaps understandable given the confusing nature of existing § 101 jurisprudence, but troubling, because this case highlights what is at stake. Categorically excluding AI innovations from patent protection in the United States jeopardizes America's leadership in this critical emerging technology. Yet, under the panel's reasoning, many AI innovations are potentially unpatentable, even if they are adequately described and nonobvious, because the panel essentially equated any machine learning with an unpatentable "algorithm" and the remaining additional elements as "generic computer components," without adequate explanation. Dec. 24. Examiners and panels should not evaluate claims at such a high level of generality.”
Further, in Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential), the claimed invention was a method of training a machine learning model on a series of tasks. The Appeals Review Panel (ARP) credited benefits including reduced storage, reduced system complexity and streamlining, and preservation of performance attributes associated with earlier tasks during subsequent computational tasks as technological improvements disclosed in the patent application specification. Specifically, the ARP upheld the Step 2A Prong One finding that the claims recited an abstract idea (i.e., a mathematical concept). In Step 2A Prong Two, the ARP then determined that the specification identified improvements as to how the machine learning model itself operates, including training a machine learning model to learn new tasks while protecting knowledge about previous tasks to overcome the problem of “catastrophic forgetting” encountered in continual learning systems. Importantly, the ARP evaluated the claims as a whole in discerning that at least the limitation “adjust the first values of the plurality of parameters to optimize performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task” reflected the improvement disclosed in the specification. Accordingly, the claims as a whole integrated what would otherwise be a judicial exception into a practical application at Step 2A Prong Two, and therefore the claims were held patent eligible.
The claim itself does not need to explicitly recite the improvement described in the specification (e.g., “thereby increasing the bandwidth of the channel”). See, e.g., Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential), in which the specification identified the improvement to machine learning technology by explaining how the machine learning model is trained to learn new tasks while protecting knowledge about previous tasks to overcome the problem of “catastrophic forgetting,” and the claims reflected the improvement identified in the specification. Indeed, the enumerated improvements identified in the Desjardins specification included the effective learning of new tasks in succession while specifically protecting knowledge concerning previously accomplished tasks; allowing the system to reduce its use of storage capacity; and enabling reduced system complexity. Such improvements pertained to how the machine learning model itself would function in operation and were therefore not subsumed in the identified mathematical calculation.
Per the December 5, 2025 Memorandum, the second paragraph of MPEP § 2106.05(a), subsection I, is revised to add new examples xiii and xiv to the list of examples that may show an improvement in computer functionality:
xiii. An improved way of training a machine learning model that protected the model’s knowledge about previous tasks while allowing it to effectively learn new tasks; Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential); and
xiv. Improvements to computer component or system performance based upon adjustments to parameters of a machine learning model associated with tasks or workstreams; Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential).
Step 2B: DOES THE CLAIM RECITE ADDITIONAL ELEMENTS THAT AMOUNT TO SIGNIFICANTLY MORE THAN THE JUDICIAL EXCEPTION?
No. The claims do not demonstrate significantly more; they utilize mathematical concepts, and the recitation of an “AI model” is extra-solution activity analogous to the human mind learning, e.g., language translation. The claims amount to training a model (analogous to the human mind) to alter or mask a signal and find the differences upon output, e.g., across mixed languages, in order to better predict future signals in one or more languages. More specifically, the claims demonstrate mathematical waveform or signal operations that alter an x-y axis signal, e.g., through a Fourier transformation or other function f(x) modification, that do not require hardware or software and are not significantly more per se; a practical application or improvement of technology cannot be discerned from either the claims or the specification of the present invention. The claims are analogous to concepts the courts have identified as abstract ideas, including:
• Collecting and comparing known information (Classen);
• Collecting information, analyzing it, and displaying certain results of the collection and analysis (Electric Power Group; West View†);
• Comparing new and stored information and using rules to identify options (SmartGene†); and
• Data recognition and storage (Content Extraction).
(† indicates a non-precedential decision.)
Assistance for Applicant in amending to overcome the § 101 rejection:
Limitations that the courts have found to qualify as “significantly more” when recited in a claim with a judicial exception include:
i. Improvements to the functioning of a computer, e.g., a modification of conventional Internet hyperlink protocol to dynamically produce a dual-source hybrid webpage, as discussed in DDR Holdings, LLC v. Hotels.com, L.P., 773 F.3d 1245, 1258-59, 113 USPQ2d 1097, 1106-07 (Fed. Cir. 2014) (see MPEP § 2106.05(a));
ii. Improvements to any other technology or technical field, e.g., a modification of conventional rubber-molding processes to utilize a thermocouple inside the mold to constantly monitor the temperature and thus reduce under- and over-curing problems common in the art, as discussed in Diamond v. Diehr, 450 U.S. 175, 191-92, 209 USPQ 1, 10 (1981) (see MPEP § 2106.05(a));
iii. Applying the judicial exception with, or by use of, a particular machine, e.g., a Fourdrinier machine (which is understood in the art to have a specific structure comprising a headbox, a paper-making wire, and a series of rolls) that is arranged in a particular way to optimize the speed of the machine while maintaining quality of the formed paper web, as discussed in Eibel Process Co. v. Minn. & Ont. Paper Co., 261 U.S. 45, 64-65 (1923) (see MPEP § 2106.05(b));
iv. Effecting a transformation or reduction of a particular article to a different state or thing, e.g., a process that transforms raw, uncured synthetic rubber into precision-molded synthetic rubber products, as discussed in Diehr, 450 U.S. at 184, 209 USPQ at 21 (see MPEP § 2106.05(c));
v. Adding a specific limitation other than what is well-understood, routine, conventional activity in the field, or adding unconventional steps that confine the claim to a particular useful application, e.g., a non-conventional and non-generic arrangement of various computer components for filtering Internet content, as discussed in BASCOM Global Internet v. AT&T Mobility LLC, 827 F.3d 1341, 1350-51, 119 USPQ2d 1236, 1243 (Fed. Cir. 2016) (see MPEP § 2106.05(d)); or
vi. Other meaningful limitations beyond generally linking the use of the judicial exception to a particular technological environment, e.g., an immunization step that integrates an abstract idea of data comparison into a specific process of immunizing that lowers the risk that immunized patients will later develop chronic immune-mediated diseases, as discussed in Classen Immunotherapies Inc. v. Biogen IDEC, 659 F.3d 1057, 1066-68, 100 USPQ2d 1492, 1499-1502 (Fed. Cir. 2011) (see MPEP § 2106.05(e)).
To help in amending the claims and for analysis purposes, example claims 3 and 4 from the courts are listed below; however, potential amendments are not limited to the provided examples, and alternative amendments are possible using items i-vi above. The examples below show the difference between an eligible claim (court claim 4) and an ineligible claim (court claim 3), illustrating significantly more that is tied to hardware not generically recited in the art: in this case, generic changing of font size in claim 3 versus a significant step of conditionally changing font size tied to hardware in claim 4.
The examples below are based on the MPEP, not on the current claim set, and are provided to help amend the claims to overcome the § 101 rejection:
Regarding independent claim examples, consider example claims 3 and 4 below:
Ineligible
3. A computer‐implemented method of resizing textual information within a window displayed in a graphical user interface, the method comprising:
(not significant) generating first data for describing the area of a first graphical element;
(not significant) generating second data for describing the area of a second graphical element containing textual information;
(not significant) calculating, by the computer, a scaling factor for the textual information which is proportional to the difference between the first data and second data.
The claim recites that the step of calculating a scaling factor is performed by “the computer” (referencing the computer recited in the preamble). Such a limitation gives “life, meaning and vitality” to the preamble and, therefore, the preamble is construed to further limit the claim. (See MPEP 2111.02.)
However, the mere recitation of “computer‐implemented” is akin to adding the words “apply it” in conjunction with the abstract idea. Such a limitation is not enough to qualify as significantly more. With regards to the graphical user interface limitation, the courts have found that simply limiting the use of the abstract idea to a particular technological environment is not significantly more. (See, e.g., Flook.)
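To illustrate the point, the operative step of example claim 3 reduces to the following bare arithmetic (the proportionality constant k is an assumed detail), which can be performed mentally or with pencil and paper:

    def scaling_factor(first_area: float, second_area: float, k: float = 1.0) -> float:
        # Proportional to the difference between the first data and the second data.
        return k * (first_area - second_area)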
By contrast, consider similar claim 4:
Eligible
4. A computer‐implemented method for dynamically relocating textual information within an underlying window displayed in a graphical user interface, the method comprising:
displaying a first window containing textual information in a first format within a graphical user interface on a computer screen;
displaying a second window within the graphical user interface;
constantly monitoring the boundaries of the first window and the second window to detect an overlap condition where the second window overlaps the first window such that the textual information in the first window is obscured from a user’s view;
determining the textual information would not be completely viewable if relocated to an unobstructed portion of the first window;
calculating a first measure of the area of the first window and a second measure of the area of the unobstructed portion of the first window;
calculating a scaling factor which is proportional to the difference between the first measure and the second measure;
scaling the textual information based upon the scaling factor;
(significant step) automatically relocating the scaled textual information, by a processor, to the unobscured portion of the first window in a second format during an overlap condition so that the entire scaled textual information is viewable on the computer screen by the user;
(significant step) automatically returning the relocated scaled textual information, by the processor, to the first format within the first window when the overlap condition no longer exists.
These limitations are not merely attempting to limit the mathematical algorithm to a particular technological environment. Instead, these claim limitations recite a specific application of the mathematical algorithm that improves the functioning of the basic display function of the computer itself. As discussed above, the scaling and relocating the textual information in overlapping windows improves the ability of the computer to display information and interact with the user.
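By way of illustration only, the following hypothetical Python sketch (not the court's claim language and not the Applicant's method) shows how the significant steps of example claim 4 tie the same arithmetic to display behavior: the scaling is applied only while an overlap condition obscures the text, and the original format is restored when the condition ends:

    from dataclasses import dataclass

    @dataclass
    class Window:
        x: int
        y: int
        w: int
        h: int

    def overlap_area(a: Window, b: Window) -> int:
        # Area of the rectangle where the two windows intersect (0 if disjoint).
        dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
        dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
        return max(dx, 0) * max(dy, 0)

    def rescale_on_overlap(first: Window, second: Window, base_font: float) -> float:
        obscured = overlap_area(first, second)
        if obscured == 0:
            return base_font                       # overlap ended: return to the first format
        total = first.w * first.h
        unobstructed = total - obscured
        return base_font * (unobstructed / total)  # scale so the text stays viewable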
The dependent claims are rejected for the same reasoning, as being directed to patent-ineligible subject matter under 35 U.S.C. 101 and as not adding eligible subject matter to their respective parent claims.
Claims 2-14 and 16-20 are directed to additional signal-transform and mathematical steps, such as mean-squared error, randomizing, time sequencing, language translation based on durations, and text-to-speech operations that a human can perform by reading aloud. Overall, the dependent claims do not demonstrate significantly more or a practical application and remain directed to the parent claim operation of training a model (analogous to the human mind) to alter or mask a signal and find the differences upon output, e.g., across mixed languages, in order to better predict future signals in one or more languages. More specifically, the claims demonstrate mathematical waveform or signal operations that alter an x-y axis signal, e.g., through a Fourier transformation or other function f(x) modification, that do not require hardware or software and are not significantly more per se; a practical application or improvement of technology cannot be discerned from either the claims or the specification of the present invention.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 6, 10-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over US 20240161730 A1 Elias; Isaac et al. (hereinafter Elias) in view of US 20250149023 A1 KIM; Tae Woo et al. (hereinafter KIM).
Re claim 1, Elias teaches
1. A method for training an AI duration model to control the duration of speech utterances by a text-to-speech computing system when converting text into speech, the method comprising: (neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
providing training data to the AI duration model, the training data including a plurality of phonemes derived from a string of text, corresponding actual frame durations for each of the phonemes, and a target output speech time duration; (phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
calculating a loss with a loss function to quantify a difference of at least the predicted frame durations and the actual frame durations, and using the loss to train the AI duration model by adjusting parameters of the AI duration model that are used to generate the predicted frame durations. (predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
However, while Elias teaches neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
the AI duration model masking actual frame durations for a subset of the plurality of phonemes; (KIM phoneme duration model specific to masked speech 0030 and 0043 with fig. 2)
the AI duration model generating predicted frame durations for the masked actual frame durations of the subset of the plurality of phonemes; and (KIM prediction output based on phoneme duration model specific to masked speech 0030 and 0043 with fig. 2)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias to incorporate the above claim limitations as taught by KIM, as a simple substitution of one known element for another to yield predictable results, such as the masked speech in place of the unmasked speech of Elias, to allow the model to learn to infer durations from contextual information rather than relying on direct, unmasked acoustic cues, leading to more natural and controllable speech synthesis and providing this option for more exact and smooth synthesis outputs.
Re claim 6, Elias teaches
6. The method according to Claim 1, the method further comprising parsing the string of text into a plurality of phonemes. (using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 10, Elias teaches
10. The method according to Claim 1, where the loss is calculated using cross-entropy loss. (using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
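As an illustrative aside, a cross-entropy duration loss of the kind Elias mentions at paragraph 0051 can be sketched as follows in PyTorch, assuming frame durations are discretized into classes; the bin count and tensors here are hypothetical, not taken from the reference:

    import torch
    import torch.nn.functional as F

    num_bins = 32                                           # hypothetical duration classes
    logits = torch.randn(10, num_bins, requires_grad=True)  # predicted logits, 10 phonemes
    target_bins = torch.randint(0, num_bins, (10,))         # actual durations, binned
    loss = F.cross_entropy(logits, target_bins)             # cross-entropy duration loss
    loss.backward()                                         # gradients adjust model parameters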
Re claim 11, Elias teaches
11. The method according to Claim 1, the method further comprising generating one or more audio representations based on the phonemes, frame time durations for the phonemes, and the target output speech time duration. (phoneme based, duration and frame based, and output speech dependent thereof…predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 12, Elias teaches
12. The method according to Claim 11, the audio representations being one or more Mel spectrograms. (utilizing Mel frequency spectrograms for predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
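For reference, a Mel spectrogram audio representation can be produced along the following lines with librosa; the parameter values are illustrative, not taken from Elias or the claims:

    import librosa
    import numpy as np

    y, sr = librosa.load(librosa.ex("trumpet"))     # example clip bundled with librosa
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=256)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # log scale, common in TTS pipelines
    print(log_mel.shape)                            # (n_mels, n_frames)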
Re claim 13, Elias teaches
13. The method according to Claim 11, the method further comprising converting the audio representations into an output waveform. (synthesized output by predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 14, Elias teaches
14. The method according to Claim 13, where the output waveform is a time-domain signal. (using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 20, Elias teaches
20. The method of Claim 15, wherein the AI duration model was previously trained with training data including a plurality of phonemes derived from a string of text, corresponding actual frame durations for each of the phonemes, and a target output speech time duration, wherein the training of the AI duration model included: (predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
calculating a loss with a loss function to quantify a difference of at least the predicted frame durations and the actual frame durations, and using the loss to train the AI duration model by adjusting parameters of the AI duration model that are used to generate the predicted frame durations. (based on predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
However, while Elias teaches neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
the AI duration model masking actual frame durations for a subset of the plurality of phonemes; (KIM phoneme duration model specific to masked speech 0030 and 0043 with fig. 2)
the AI duration model generating predicted frame durations for the masked actual frame durations of the subset of the plurality of phonemes; and (KIM prediction output based on phoneme duration model specific to masked speech 0030 and 0043 with fig. 2)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias to incorporate the above claim limitations as taught by KIM, as a simple substitution of one known element for another to yield predictable results, such as the masked speech in place of the unmasked speech of Elias, to allow the model to learn to infer durations from contextual information rather than relying on direct, unmasked acoustic cues, leading to more natural and controllable speech synthesis and providing this option for more exact and smooth synthesis outputs.
Claims 2-5 are rejected under 35 U.S.C. 103 as being unpatentable over US 20240161730 A1 Elias; Isaac et al. (hereinafter Elias) in view of US 20250149023 A1 KIM; Tae Woo et al. (hereinafter KIM) and further in view of US 11704507 B1 Fantinuoli; Claudio (hereinafter Fantinuoli).
Re claim 2, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model-driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
2. The method according to Claim 1, where the target output speech time duration is approximately equal to a time duration for an initial speech. (Fantinuoli approximating the input speech to the output speech by constraining speech features such as word and time ratios, speed, and latency thresholds between conversions e.g. translations col 10 lines 26 to col 11 line 16)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Fantinuoli, combining prior art elements according to known methods to yield predictable results, such as using latency reduction during transformations from one speech to another (e.g., in TTS modeling), wherein latency is reduced from source or target by synchronizing durations and rate of speech. This improves awareness of when pauses or the removal of stop-words/pauses are needed to sync output speech to a more natural-sounding flow of conversation or general speaking style, dependent on the user or scenario, in combination with compression for non-complex synchronization.
Re claim 3, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model-driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
3. The method according to Claim 2, where the string of text and the speech generated from the string of text are of a first language, and the initial speech is of a second language, such that the target output speech time duration for the speech of the first language is approximately equal to the time duration for the initial speech of the second language. (Fantinuoli approximating the input speech to the output speech by constraining speech features such as word and time ratios, speed, and latency thresholds between conversions e.g. translations col 10 lines 26 to col 11 line 16)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Fantinuoli, combining prior art elements according to known methods to yield predictable results, such as using latency reduction during translations between languages, from one speech to another (e.g., in TTS modeling), wherein latency is reduced from source or target by synchronizing durations and rate of speech. This improves awareness of when pauses or the removal of stop-words/pauses are needed to sync output speech to a more natural-sounding flow of conversation or general speaking style, dependent on the user or scenario, in combination with compression for non-complex synchronization.
Re claim 4, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model-driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
4. The method according to Claim 1, where the target output speech time duration is greater than a time duration for an initial speech, where speech at the target output speech time duration is a speed-up version of the initial speech. (Fantinuoli approximating the input speech to the output speech by constraining speech features such as word and time ratios, speed, and latency thresholds between conversions e.g. translations col 10 lines 26 to col 11 line 16)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Fantinuoli, combining prior art elements according to known methods to yield predictable results, such as using latency reduction and speed increase/slow-down via thresholds during transformations from one speech to another (e.g., in TTS modeling), wherein latency is reduced from source or target by synchronizing durations and rate of speech. This improves awareness of when pauses or the removal of stop-words/pauses are needed to sync output speech to a more natural-sounding flow of conversation or general speaking style, dependent on the user or scenario, in combination with compression for non-complex synchronization.
Re claim 5, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model-driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
5. The method according to Claim 1, where the target output speech time duration is less than a time duration for an initial speech, where speech at the target output speech time duration is a slowed-down version of the initial speech. (Fantinuoli approximating the input speech to the output speech by constraining speech features such as word and time ratios, speed, and latency thresholds between conversions e.g. translations col 10 lines 26 to col 11 line 16)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Fantinuoli, combining prior art elements according to known methods to yield predictable results, such as using latency reduction and speed increase/slow-down via thresholds during transformations from one speech to another (e.g., in TTS modeling), wherein latency is reduced from source or target by synchronizing durations and rate of speech. This improves awareness of when pauses or the removal of stop-words/pauses are needed to sync output speech to a more natural-sounding flow of conversation or general speaking style, dependent on the user or scenario, in combination with compression for non-complex synchronization.
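For illustration of the duration-matching rationale applied to claims 2-5 above, the relationship between an initial speech duration and a target output duration reduces to a single ratio; the convention below is one assumed possibility, and the claims recite their own relationships between the two durations:

    def tempo_ratio(initial_s: float, target_s: float) -> float:
        # Hypothetical convention: ratio > 1.0 compresses (faster output);
        # ratio < 1.0 stretches (slower output) relative to the initial speech.
        return initial_s / target_s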
Claims 7-9 are rejected under 35 U.S.C. 103 as being unpatentable over US 20240161730 A1 Elias; Isaac et al. (hereinafter Elias) in view of US 20250149023 A1 KIM; Tae Woo et al. (hereinafter KIM) and further in view of US 20240038212 A1 Shih; Kevin et al. (hereinafter Shih).
Re claim 7, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
7. The method according to Claim 1, where the AI model masks the actual frame durations for the subset of the plurality of phonemes non-sequentially. (Shih mask phoneme duration processing 0047 with non-sequential or sequential task performance, random sampling, and K-means inclusive of mean-square 0098 and 0101)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Shih, combining prior art elements according to known methods to yield predictable results, such as using well-known mathematical or processing techniques for significant improvements in training efficiency, phonetic alignment, and generation quality, particularly in non-autoregressive, autoregressive, or text-to-speech systems, without requiring strict, pre-aligned, or frame-level-only operations, instead relying on semantic or cluster-based information prioritized over duration where alignment is not needed or has already taken place and further improvement in signal quality (e.g., a more natural sound) is achievable.
Re claim 8, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
8. The method according to Claim 1, where the AI model masks the actual frame durations for the subset of the plurality of phonemes randomly. (Shih mask phoneme duration processing 0047 with non-sequential or sequential task performance, random sampling, and K-means inclusive of mean-square 0098 and 0101)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Shih, combining prior art elements according to known methods to yield predictable results, such as using well-known mathematical or processing techniques for significant improvements in training efficiency, phonetic alignment, and generation quality, particularly in non-autoregressive, autoregressive, or text-to-speech systems, without requiring strict, pre-aligned, or frame-level-only operations, instead relying on semantic or cluster-based information prioritized over duration where alignment is not needed or has already taken place and further improvement in signal quality (e.g., a more natural sound) is achievable.
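As an illustrative sketch of the random and non-sequential masking addressed in claims 7 and 8, mask indices can be drawn as a non-contiguous random subset; the mask rate and seeding below are assumptions, not details from Shih:

    import random

    def choose_mask_indices(num_phonemes: int, mask_rate: float = 0.15, seed: int = 0) -> set:
        rng = random.Random(seed)                       # seeded for reproducibility
        k = max(1, int(num_phonemes * mask_rate))
        return set(rng.sample(range(num_phonemes), k))  # random, non-contiguous subset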
Re claim 9, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
9. The method according to Claim 1, where the loss is calculated using mean-squared error loss. (Shih mask phoneme duration processing 0047 with non-sequential or sequential task performance, random sampling, and K-means inclusive of mean-square 0098 and 0101)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias in view of KIM to incorporate the above claim limitations as taught by Shih, combining prior art elements according to known methods to yield predictable results, such as using well-known mathematical or processing techniques for significant improvements in training efficiency, phonetic alignment, and generation quality, particularly in non-autoregressive, autoregressive, or text-to-speech systems, without requiring strict, pre-aligned, or frame-level-only operations, instead relying on semantic or cluster-based information prioritized over duration where alignment is not needed or has already taken place and further improvement in signal quality (e.g., a more natural sound) is achievable.
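Similarly, the mean-squared-error duration loss of claim 9 can be sketched as follows in PyTorch, treating durations as continuous values (an assumption for illustration; the tensor values are hypothetical):

    import torch
    import torch.nn.functional as F

    pred = torch.tensor([4.2, 7.9, 3.1], requires_grad=True)  # predicted frame durations
    actual = torch.tensor([5.0, 8.0, 3.0])                    # actual frame durations
    loss = F.mse_loss(pred, actual)                           # mean-squared error
    loss.backward()                                           # drives parameter adjustment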
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 15 and 17-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20240161730 A1 Elias; Isaac et al. (hereinafter Elias).
Re claim 15, Elias teaches
15. A method for using an AI duration model to control the generation of speech utterances by a text-to-speech computing system when converting text into speech, the method comprising: (neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
obtaining an AI duration model trained to generate phonemes and frame durations for the phonemes based on inputs comprising text to be converted into speech and a target output speech time duration; (using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
identifying the text to be converted into speech; (TTS… using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
identifying the target output speech time duration; (TTS output thereof as speech… using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
providing the text and the target output speech time duration to the AI duration model, wherein the AI duration model tokenizes the text into a plurality of phonemes and predicts a frame duration for each phoneme in the plurality of phonemes based on the target output speech time duration, such that the summation of the frame durations for the plurality of phonemes is approximately equal to the target output speech time duration; and (predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
generating output based on the phonemes and predicted frame duration for each phoneme. (synthesized output, predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 17, Elias teaches
17. The method according to Claim 15, the method further comprising generating an audio representation of the output based on the plurality of phonemes, corresponding predicted frame durations, and the target output speech time duration. (phoneme based, duration and frame based, and output speech dependent thereof…predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 18, Elias teaches
18. The method according to Claim 17, the method further comprising converting the output into an output waveform. (synthesized output by predicting frames 0054, calculating the loss difference between prediction and reference/actual as in fig. 2 and using a method such as cross-entropy 0051 using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Re claim 19, Elias teaches
19. The method according to Claim 18, where the output waveform is a time-domain signal. (using phonemes from text sequence 0004 and 0010 such as by parsing 0049 to supply training data, based on a neural network modeling with TTS to produce synthesized speech in the time-domain 0064 with fig. 1)
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over US 20240161730 A1 Elias; Isaac et al. (hereinafter Elias) in view of US 11704507 B1 Fantinuoli; Claudio (hereinafter Fantinuoli).
Re claim 16, while Elias teaches synthesizing and synchronizing speech features to output speech, as well as neural network model driven duration training for predicted frames based on loss calculations for phonemes in frames, it fails to teach:
16. The method according to Claim 15, wherein the method includes using the AI duration model to translate a first speech segment in a first language having a first speech segment duration into a second speech segment in a second language having a second speech segment duration, such that the target speech time duration is approximately equal to the first speech segment duration. (Fantinuoli approximating the input speech to the output speech by constraining speech features such as word and time ratios, speed, and latency thresholds between conversions e.g. translations col 10 lines 26 to col 11 line 16)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Elias to incorporate the above claim limitations as taught by Fantinuoli, combining prior art elements according to known methods to yield predictable results, such as using latency reduction during translations between languages, from one speech to another (e.g., in TTS modeling), wherein latency is reduced from source or target by synchronizing durations and rate of speech. This improves awareness of when pauses or the removal of stop-words/pauses are needed to sync output speech to a more natural-sounding flow of conversation or general speaking style, dependent on the user or scenario, in combination with compression for non-complex synchronization.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 10741169 B1 (Trueba; Jaime Lorenzo et al.): phoneme analysis.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C COLUCCI whose telephone number is (571)270-1847. The examiner can normally be reached on M-F 9 AM - 5 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL COLUCCI/Primary Examiner, Art Unit 2655 (571)-270-1847
Examiner FAX: (571)-270-2847
Michael.Colucci@uspto.gov