DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed on 10/16/2025 have been fully considered but they are not persuasive.
Applicant argues that a person cannot mentally perform the steps of generating synthetic speech data based at least on the second rate and generating the second portion of the simulated multi-speaker recording based at least on the synthetic speech data. Applicant further submits that amended claim 1 does not recite any abstract idea. In addition, Applicant submits that amended claim 1 integrates any alleged abstract idea into a practical application and recites significantly more than any alleged abstract idea. Further, independent claims 11 and 19 have been amended to recite limitations similar to those discussed above in connection with amended claim 1.
In response, the Examiner addresses Applicant’s specific points below:
“A person cannot mentally generate synthetic speech data”:
The analysis under § 101 does not require that every step be performable mentally; rather, the claim as a whole is directed to an abstract idea where the core advance lies in mathematical/data analysis and decision-making. The addition of generic computer implementation to carry out content generation does not, by itself, confer eligibility. See Alice, 573 U.S. at 223; Content Extraction v. Wells Fargo, 776 F.3d 1343, 1347 (Fed. Cir. 2014).
The claim fails to recite a specific technical solution for synthetic speech generation that improves the operation of computers or the synthesis technology. Without particularized algorithmic or hardware constraints, the claim reads on routine implementation.
“Integration into practical application”:
As explained above (Step 2A, Prong Two), using the abstract calculations to guide speech synthesis in a simulated recording is an application of the abstract idea in a technological environment, but the claims do not impose meaningful limits or recite a specific improvement to computer technology. MPEP § 2106.04(d); ChargePoint, Inc. v. SemaConnect, Inc., 920 F.3d 759, 766–69 (Fed. Cir. 2019).
As such, claims 1-20 remain rejected under 35 U.S.C. § 101 as being directed to an abstract idea (mathematical concepts and mental processes) without reciting additional elements that integrate the exception into a practical application or add significantly more than the abstract idea.
Regarding the prior art rejection, Applicant contends that Bodin fails to teach or suggest (i) computing a first discrepancy that represents a first difference between the first rate and a first target rate for the first speech-based attribute, and (ii) determining, based at least on the first discrepancy, a second rate for a second portion of a simulated multi-speaker recording.
Examiner response: Under the broadest reasonable interpretation (BRI), the claim language reads on Bodin’s dynamic prosody adjustment framework, in which:
A prosody “rate” is identified/selected for a section of synthesized data and used during rendering (Abstract; [0005]–[0008]; [0150]–[0159]; [0160]–[0166]; FIG. 13).
A target prosody (including rate) is determined from parameters such as context, user instruction, and user prosody history (FIGS. 14A–14D; [0153]–[0177]).
The system compares current/observed voice characteristics or previously applied settings with predetermined profiles/targets and selects the prosody to apply to subsequent sections (FIG. 14D; [0177]; FIG. 15; [0180]–[0187]).
BRI of “computing a first discrepancy” reasonably encompasses Bodin’s comparison/selection operations between an observed/current characteristic and a stored/target prosody (e.g., voice-pattern profiles with associated prosody settings; [0177]) used to decide how to adjust the rate. The computation of a “difference” need not be limited to a formal arithmetic subtraction when, as here, the art teaches comparing current characteristics to target ranges/profiles and making an adjustment decision based on that comparison (see FIG. 14D and accompanying text at [0177]; see also FIG. 15, discussing selection logic executed prior to rendering the next section, [0180]–[0187]).
BRI of “determining, based at least on the first discrepancy, a second rate … for a second portion” reads on Bodin’s disclosure of applying the selected/adjusted prosody rate to the next section to be rendered after the comparison to target settings or user voice characteristics (FIGS. 14C–14D; [0172]–[0177]) and rendering that subsequent section with the adjusted rate (FIG. 13; [0150]–[0159]). Bodin expressly describes conditional selection and application of prosody prior to rendering a next section (FIG. 15; [0180]–[0187]).
Applicant’s argument construes “discrepancy” as requiring an explicit numeric delta stored as a standalone value. The claims, however, do not require a particular mathematical format, storage representation, or explicit arithmetic operation; under BRI, a difference/comparison between an observed attribute and a target attribute that is used to select/adjust the prosody rate for the next section meets the claimed “discrepancy” and “second rate determined based on the discrepancy.” With respect to “simulated multi-speaker recording,” the claims do not positively recite structural features that exclude Bodin’s sections of synthesized, voice-rendered data. Bodin’s system synthesizes content into a voice-rendered format and renders sections with selected prosody (FIGS. 9–10, 13, 15). Under BRI, a “recording” comprising sequential portions rendered with selected prosody reads on Bodin’s generated/voiced sections. To the extent applicant’s specification may ascribe a narrower meaning (e.g., explicit turn-taking or overlap simulation), the present claim language does not affirmatively impose such restrictions.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Alice Step 1 — The claims are directed to an abstract idea. Claims 1–20 (method, processor, and system claims) are directed to the abstract idea of measuring/quantifying attributes (rates) of speech in portions of a recording, computing discrepancies between measured rates and target rates, using those computed differences to determine adjusted rates, and generating synthetic speech data/segments based on those adjusted rates.
The claimed steps amount to mental processes (collecting data, calculating rates and discrepancies, comparing values, and using those results to select/generate synthetic data). These concepts are analogous to information collection, analysis, comparison, and adjustment based on calculated numerical results — categories of abstract ideas found in Alice, Mayo, and subsequent Federal Circuit and USPTO guidance (for example, aggregating information and manipulating numbers is an abstract concept per Electric Power Group).
The claim language does not tie the abstract idea to a specific improvement in the functioning of the computer itself or otherwise recite a particular way to implement the calculations or generation that would be a technical solution to a technical problem. Instead, the claims recite high-level steps (e.g., “determining,” “computing,” “generating synthetic speech data,” “generating the … recording”) that can be performed mentally or with generic computer components.
Alice Step 2 — No inventive concept that transforms the abstract idea into a patent-eligible application. After determining the claims are directed to an abstract idea, the next question is whether the claims contain an “inventive concept” that amounts to significantly more than the abstract idea itself (Alice step 2).
The independent claims (claims 1, 11, 19) recite generic computer implementation elements: “generating synthetic speech data,” “generating the … recording,” “one or more processors,” “one or more circuits,” and broadly recited systems (cloud, edge device, LLMs, etc.). These elements are recited at a high level and perform conventional tasks — data generation, sampling from distributions, computing differences, selecting rates — without specifying a particular, non-routine way of implementing those tasks or any unconventional hardware architecture, or demonstrating an improvement in the functioning of the computer or other technical field.
The dependent claims further specify variations (e.g., sampling from distributions, selecting speakers based on turn probability, the speech-based attributes being overlap or silence, computing based on running length), but these are additional abstract data manipulations or mathematical/statistical operations and do not supply a specific technological means or other unconventional feature that transforms the claims into patent-eligible subject matter.
The specification does not recite an improvement to the underlying computer architecture, a novel machine, or a specific non-generic technique for generating synthetic speech data that improves synthetic generation in a technical manner. The claims instead use generic computing terms to effectuate the abstract idea.
Claims 1-20 are rejected under 35 U.S.C. § 101 as being directed to an abstract idea and lacking an inventive concept that transforms the abstract idea into patent-eligible subject matter.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1, 5, 8, 10-11, 15, 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bodin et al., US 20070100628 (hereinafter Bodin).
Regarding claims 1, 11 and 19, Bodin discloses a system, one or more processors (see FIG. 1), and a method comprising: determining a first rate at which a first speech-based attribute occurs within a first portion of a simulated multi-speaker recording (i.e., Bodin discloses determining a prosody “rate” attribute for synthesized speech content (paras. [0153]–[0166], FIG. 13), where the “first portion” is a section of synthesized data selected for rendering (paras. [0155]–[0159], FIG. 15); under BRI, “speech-based attribute” encompasses prosody rate, pause/duration, or similar TTS attributes); computing a first discrepancy that represents a first difference between the first rate and a first target rate for the first speech-based attribute (i.e., comparing current prosody settings with desired or stored prosody settings based on context or user preferences (paras. [0172]–[0177]) and adjusting accordingly, which under BRI constitutes computing a difference between a measured rate and a target rate); determining, based at least on the first discrepancy, a second rate at which the first speech-based attribute is to occur within a second portion of the simulated multi-speaker recording (i.e., Bodin discloses adjusting the prosody rate for subsequent sections based on the difference between current and desired settings (paras. [0172]–[0177], FIGS. 14C–14D)); generating synthetic speech data based at least on the second rate (i.e., Bodin discloses generating voice output with the adjusted prosody rate (paras. [0150]–[0166], FIG. 13)); and generating the second portion of the simulated multi-speaker recording based at least on the synthetic speech data (i.e., Bodin discloses rendering the next section of synthesized data using the generated synthetic speech data (paras. [0155]–[0159], FIG. 15)).
Regarding claims 5 and 15, Bodin discloses a method (and a processor) (see claims 1 and 11 above) wherein the determining the second rate comprises at least one of computing the second rate based at least on a sampled value associated with the first speech-based attribute, (i.e., typical values of attributes) (see paragraph 96), an amount of the first speech-based attribute within the first portion of the simulated multi-speaker recording (see paragraph 181), and/or a running length associated with the first portion of the simulated multi-speaker recording (i.e., section length) (see paragraph 157).
Regarding claim 8, Bodin discloses a method (see claim 1 above) further comprising determining the first target rate based at least on a set of parameters associated with generating the simulated multi-speaker recording (i.e., prosody settings) (see paragraph 159).
Regarding claim 10, Bodin discloses a method (see claim 1 above) wherein the first speech-based attribute comprises at least one of an overlap in speech or a silence (i.e., pauses) (see paragraphs 153 and 159).
Regarding claims 18 and 20, Bodin discloses wherein the system (see the claim 19 rejection) comprises a system for performing simulation operations (rendering synthesized data) (see Abstract; also refer to paragraphs 86 and 161).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2, 12, 17 are rejected under 35 U.S.C. 103 as being unpatentable over Bodin in view of Ghosh et al., US 20240161728 (hereinafter Ghosh).
Regarding claims 2 and 12, Bodin discloses a processor and method as described above (see claims 1 and 11).
Although Bodin discloses a method (and processor) as disclosed above, Bodin does not specifically disclose a method further comprising: determining a second discrepancy that represents a second difference between a third rate at which a second speech-based attribute occurs within the first portion of the simulated multi-speaker recording and a second target rate for the second speech-based attribute; and determining that the first discrepancy exceeds the second discrepancy prior to generating the second portion of the simulated multi-speaker recording.
However, Ghosh discloses a method comprising determining a second discrepancy that represents a second difference between a third rate at which a second speech-based attribute occurs within the first portion of the simulated multi-speaker recording and a second target rate for the second speech-based attribute (i.e., during training of a machine learning model (e.g., SM 120), MSTE 114 may select a training input and apply the speech model to the selected training input to generate a training output. MSTE 114 may then compare the training output with the target output (ground truth) and evaluate the observed mismatch using a loss function. The mismatch may be back-propagated through the model (e.g., using gradient descent techniques), and the weights and biases of the model may be adjusted to make the training outputs evolve in the direction of the target outputs. Such adjustments may be repeated over any number of iterations, epochs, etc.) (see paragraph 31; also refer to paragraph 30, which discloses generating synthetic speech based on speech-based attributes of multiple speakers); and determining that the first discrepancy exceeds the second discrepancy prior to generating the second portion of the simulated multi-speaker recording (i.e., until the output mismatch for a given training input satisfies a predetermined condition (e.g., falls below a predetermined value, converges to an acceptable level of accuracy, etc.). Subsequently, a different training input may be selected, a new training output generated, and a new series of adjustments implemented based on a mismatch with the target output, until the model is trained to a target degree of accuracy) (see paragraph 31).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Ghosh to arrive at the claimed invention. A motivation for doing so would have been to generate speech synthesis of high quality.
Regarding claim 17, Bodin discloses a processor as described above (see claim 11 above).
Bodin does not specifically disclose a processor further comprising sampling the first target rate from a distribution associated with a set of parameters for generating the simulated multi-speaker recording.
However, Ghosh discloses a processor comprising sampling the first target rate from a distribution associated with a set of parameters for generating the simulated multi-speaker recording (i.e., generative models allow sampling from the determined probability distributions during generation of new speech and impart some natural diversity to the generated speech) (see paragraph 16). Ghosh additionally discloses that a computing system 100 may be configured to process text 151 to generate synthetic audio data 170 that may include a suitable audio representation of text 151, e.g., a spoken version of text 151 synthesized based on prior speech samples stored in data repository 101. In some embodiments, synthetic audio data 170 may correspond to an artificial speaker whereas prior speech samples may be produced by real speakers. Prior speech samples may include suitable audio data, e.g., training spectrogram(s) 103, characterizing speech of a person pronouncing a respective training text 102 (see paragraph 23).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Ghosh to arrive at the claimed invention. A motivation for doing so would have been to generate speech synthesis of high quality.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Bodin in view of Di Gangi et al., US 20250118336 (hereinafter Di Gangi).
Regarding claim 6, Bodin discloses a method (see claim 1 rejection above).
Although Bodin discloses a method as described above (see claim 1 rejection), Bodin does not specifically disclose a method further comprising determining a speaker associated with the second portion of the simulated multi-speaker recording based at least on a turn probability associated with the simulated multi-speaker recording.
However, Di Gangi discloses a method further comprising determining a speaker associated with the second portion of the simulated multi-speaker recording based at least on a turn probability associated with the simulated multi-speaker recording (i.e., a system 1200 selects and/or determines a voice vector for each speaker. The voice vector encodes speaker characteristics like pitch, volume, timbre, or tone, which uniquely identify a person's voice. Its input is a sequence of time segments 1220, with start and stop time stamps, their speaker labels 1230, and an audio signal 1210 for each segment. This process uses a speaker encoder whose input is the audio signal from a segment and whose output is a single vector for the segment. In some embodiments, the speaker encoder for this step is the same as the one used for speaker diarization) (see paragraph 72). In addition, Di Gangi discloses that a speaker diarization system is used to assign audio windows, and their respective word sequences, to speaker labels (see paragraph 56).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Di Gangi to arrive at the claimed invention. A motivation for doing so would have been to improve the quality of speaker identification in a multi-channel environment.
Allowable Subject Matter
Although claims 3-4, 7, 9, 13-14, 16 are rejected under 35 U.S.C. § 101 as being directed to an abstract idea, assuming the § 101 rejection is overcome due to a potential amendment, those claims would be objected to as being dependent upon a rejected base claim (see the associated 102 and 103 rejections above), but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PIERRE LOUIS DESIR whose telephone number is (571)272-7799. The examiner can normally be reached Monday-Friday 9AM-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659