Prosecution Insights
Last updated: April 19, 2026
Application No. 18/526,600

PROBABILISTIC GENERATION OF SPEAKER DIARIZATION DATA

Final Rejection — §101, §102, §103

Filed: Dec 01, 2023
Examiner: DESIR, PIERRE LOUIS
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Nvidia Corporation
OA Round: 2 (Final)

Grant Probability: 61% (Moderate)
Expected OA Rounds: 3-4
Projected Time to Grant: 4y 4m
Grant Probability With Interview: 92%

Examiner Intelligence

Grants 61% of resolved cases.

Career Allow Rate: 61% (173 granted / 285 resolved; -1.3% vs TC avg)
Interview Lift: +31.5% (strong; allow rate of resolved cases with vs. without an interview)
Avg Prosecution (typical timeline): 4y 4m
Career History: 295 total applications across all art units; 285 resolved, 10 currently pending
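The examiner-intelligence figures above are simple ratios over resolved cases. A minimal sketch of how they could be computed, assuming a per-case record with `granted` and `had_interview` flags (illustrative field names, not the vendor's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ResolvedCase:
    granted: bool        # did the application issue as a patent?
    had_interview: bool  # was an examiner interview held?

def allow_rate(cases):
    """Share of resolved cases that were granted."""
    return sum(c.granted for c in cases) / len(cases)

def interview_lift(cases):
    """Allow-rate gap (in fractional points) between cases
    with and without an examiner interview."""
    with_iv = [c for c in cases if c.had_interview]
    without_iv = [c for c in cases if not c.had_interview]
    return allow_rate(with_iv) - allow_rate(without_iv)
```

With the career numbers above, `allow_rate` over 285 resolved cases with 173 grants returns roughly 0.607, displayed as 61%.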

Statute-Specific Performance

§101: 14.4% (-25.6% vs TC avg)
§103: 48.4% (+8.4% vs TC avg)
§102: 18.8% (-21.2% vs TC avg)
§112: 11.8% (-28.2% vs TC avg)

Tech Center averages are estimates. Based on career data from 285 resolved cases.
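The "vs TC avg" deltas let a reader back out the Tech Center baseline each statute is compared against (delta = examiner rate minus TC average, so the baseline is rate minus delta). A quick sanity check using the displayed numbers:

```python
# Displayed per-statute rates and their deltas vs the TC average (percent),
# copied from the table above.
stats = {
    "101": (14.4, -25.6),
    "103": (48.4, 8.4),
    "102": (18.8, -21.2),
    "112": (11.8, -28.2),
}

# delta = examiner_rate - tc_avg, so the implied baseline is rate - delta.
implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
```

Every statute backs out to the same 40.0% baseline, consistent with the footnote that the Tech Center average is an estimate (apparently a single flat figure rather than a per-statute measurement).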

Office Action

Rejections: §101, §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed on 10/16/2025 have been fully considered but they are not persuasive. Applicant argues that amended claim 1 now recites generating synthetic speech data based at least on the second rate, and generating the second portion of the simulated multi-speaker recording based at least on the synthetic speech data. Applicant submits that amended claim 1 does not recite any abstract idea. In addition, Applicant submits that amended claim 1 integrates any alleged abstract idea into a practical application and recites significantly more than any alleged abstract idea. Further, independent claims 11 and 19 have been amended to recite limitations similar to those discussed above in connection with amended claim 1.

In response to Applicant's specific points:

"A person cannot mentally generate synthetic speech data": The analysis under § 101 does not require that every step be performable mentally; rather, the claim as a whole is directed to an abstract idea where the core advance lies in mathematical/data analysis and decision-making. The addition of generic computer implementation to carry out content generation does not, by itself, confer eligibility. See Alice, 573 U.S. at 223; Content Extraction v. Wells Fargo, 776 F.3d 1343, 1347 (Fed. Cir. 2014). The claim fails to recite a specific technical solution for synthetic speech generation that improves the operation of computers or the synthesis technology. Without particularized algorithmic or hardware constraints, the claim reads on routine implementation.

"Integration into practical application": As explained above (Step 2A, Prong Two), using the abstract calculations to guide speech synthesis in a simulated recording is an application of the abstract idea in a technological environment, but the claims do not impose meaningful limits or recite a specific improvement to computer technology. MPEP § 2106.04(d); ChargePoint, Inc. v. SemaConnect, Inc., 920 F.3d 759, 766–69 (Fed. Cir. 2019). As such, claims 1-20 remain rejected under 35 U.S.C. § 101 as being directed to an abstract idea (mathematical concepts and mental processes) without reciting additional elements that integrate the exception into a practical application or add significantly more than the abstract idea.

Regarding the art rejection, Applicant contends that Bodin fails to teach or suggest (i) computing a first discrepancy that represents a first difference between the first rate and a first target rate for the first speech-based attribute, and (ii) determining, based at least on the first discrepancy, a second rate for a second portion of a simulated multi-speaker recording.

Examiner response: Under the broadest reasonable interpretation (BRI), the claim language reads on Bodin's dynamic prosody adjustment framework, in which: a prosody "rate" is identified/selected for a section of synthesized data and used during rendering (Abstract; [0005]–[0008]; [0150]–[0159]; [0160]–[0166]; FIG. 13); a target prosody (including rate) is determined from parameters such as context, user instruction, and user prosody history (FIGS. 14A–14D; [0153]–[0177]); and the system compares current/observed voice characteristics or previously applied settings with predetermined profiles/targets and selects the prosody to apply to subsequent sections (FIG. 14D; [0177]; FIG. 15; [0180]–[0187]).

BRI of "computing a first discrepancy" reasonably encompasses Bodin's comparison/selection operations between an observed/current characteristic and a stored/target prosody (e.g., voice-pattern profiles with associated prosody settings; [0177]) used to decide how to adjust rate. The computation of a "difference" need not be limited to a formal arithmetic subtraction when, as here, the art teaches comparing current characteristics to target ranges/profiles and making an adjustment decision based on that comparison (see FIG. 14D and accompanying text at [0177]; also FIG. 15, discussing selection logic executed prior to rendering the next section, [0180]–[0187]).

BRI of "determining, based at least on the first discrepancy, a second rate … for a second portion" reads on Bodin's disclosure of applying the selected/adjusted prosody rate to the next section to be rendered after the comparison to target settings or user voice characteristics (FIGS. 14C–14D; [0172]–[0177]) and rendering that subsequent section with the adjusted rate (FIG. 13; [0150]–[0159]). Bodin expressly describes conditional selection and application of prosody prior to rendering a next section (FIG. 15; [0180]–[0187]).

Applicant's argument construes "discrepancy" as requiring an explicit numeric delta stored as a standalone value. The claims, however, do not require a particular mathematical format, storage representation, or explicit arithmetic operation; under BRI, a difference/comparison between an observed attribute and a target attribute that is used to select/adjust the prosody rate for the next section meets the claimed "discrepancy" and "second rate determined based on the discrepancy."

With respect to "simulated multi-speaker recording," the claims do not positively recite structural features that exclude Bodin's sections of synthesized, voice-rendered data. Bodin's system synthesizes content into a voice-rendered format and renders sections with selected prosody (FIGS. 9–10, 13, 15). Under BRI, a "recording" comprising sequential portions rendered with selected prosody reads on Bodin's generated/voiced sections. To the extent Applicant's specification may ascribe a narrower meaning (e.g., explicit turn-taking or overlap simulation), the present claim language does not affirmatively impose such restrictions.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Alice Step 1 — Claims directed to an abstract idea. Claims 1–10 and 11–20 (device/process/system claims) are directed to the abstract idea of measuring/quantifying attributes (rates) of speech in portions of a recording, computing discrepancies between measured rates and target rates, using those computed differences to determine adjusted rates, and generating synthetic speech data/segments based on those adjusted rates. The claimed steps amount to mental processes (collecting data, calculating rates and discrepancies, comparing values, and using those results to select/generate synthetic data). These concepts are analogous to information collection, analysis, comparison, and adjustment based on calculated numerical results — categories of abstract ideas found in Alice, Mayo, and subsequent Federal Circuit and USPTO guidance (for example, aggregating information and manipulating numbers is an abstract concept per Electric Power Group). The claim language does not tie the abstract idea to a specific improvement in the functioning of the computer itself or otherwise recite a particular way to implement the calculations or generation that would be a technical solution to a technical problem. Instead, the claims recite high-level steps (e.g., "determining," "computing," "generating synthetic speech data," "generating the … recording") that can be performed mentally or with generic computer components.

Alice Step 2 — No inventive concept that transforms the abstract idea into a patent-eligible application. After determining the claims are directed to an abstract idea, the next question is whether the claims contain an "inventive concept" that amounts to significantly more than the abstract idea itself (Alice step 2). The independent claims (claims 1, 11, 19) recite generic computer implementation elements: "generating synthetic speech data," "generating the … recording," "one or more processors," "one or more circuits," and broadly recited systems (cloud, edge device, LLMs, etc.). These elements are recited at a high level and perform conventional tasks — data generation, sampling from distributions, computing differences, selecting rates — without specifying a particular, non-routine way of implementing those tasks or any unconventional hardware architecture, or demonstrating an improvement in the functioning of the computer or another technical field. The dependent claims further specify variations (e.g., sampling from distributions, selecting speakers based on turn probability, the speech-based attributes being overlap or silence, computing based on running length), but these are additional abstract data manipulation or mathematical/statistical operations and do not supply a specific technological means or other unconventional feature that transforms the claims into patent-eligible subject matter.
The specification does not recite an improvement to the underlying computer architecture, a novel machine, or a specific non-generic technique for generating synthetic speech data that improves synthetic generation in a technical manner. The claims instead use generic computing terms to effectuate the abstract idea. Claims 1-20 are rejected under 35 U.S.C. § 101 as being directed to an abstract idea and lacking an inventive concept that transforms the abstract idea into patent-eligible subject matter.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 5, 8, 10-11, 15, 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bodin et al., US 20070100628 (hereinafter Bodin).

Regarding claims 1, 11 and 19, Bodin discloses a system, one or more processors (see FIG. 1), and a method comprising:

determining a first rate at which a first speech-based attribute occurs within a first portion of a simulated multi-speaker recording (i.e., Bodin discloses determining a prosody "rate" attribute for synthesized speech content (paras. [0153]–[0166], FIG. 13), where the "first portion" is a section of synthesized data selected for rendering (paras. [0155]–[0159], FIG. 15); under BRI, "speech-based attribute" encompasses prosody rate, pause/duration, or similar TTS attributes);

computing a first discrepancy that represents a first difference between the first rate and a first target rate for the first speech-based attribute (i.e., comparing current prosody settings with desired or stored prosody settings based on context or user preferences (paras. [0172]–[0177]) and adjusting accordingly, which under BRI constitutes computing a difference between a measured rate and a target rate);

determining, based at least on the first discrepancy, a second rate at which the first speech-based attribute is to occur within a second portion of the simulated multi-speaker recording (i.e., Bodin discloses adjusting the prosody rate for subsequent sections based on the difference between current and desired settings (paras. [0172]–[0177], FIGS. 14C–14D));

generating synthetic speech data based at least on the second rate (i.e., Bodin discloses generating voice output with the adjusted prosody rate (paras. [0150]–[0159], [0160]–[0166], FIG. 13)); and

generating the second portion of the simulated multi-speaker recording based at least on the synthetic speech data (i.e., Bodin discloses rendering the next section of synthesized data using the generated synthetic speech data (paras. [0155]–[0159], FIG. 15)).

Regarding claims 5 and 15, Bodin discloses a method (and a processor) (see claims 1 and 11 above) wherein the determining the second rate comprises at least one of computing the second rate based at least on a sampled value associated with the first speech-based attribute (i.e., typical values of attributes) (see paragraph 96), an amount of the first speech-based attribute within the first portion of the simulated multi-speaker recording (see paragraph 181), and/or a running length associated with the first portion of the simulated multi-speaker recording (i.e., section length) (see paragraph 157).

Regarding claim 8, Bodin discloses a method (see claim 1 above) further comprising determining the first target rate based at least on a set of parameters associated with generating the simulated multi-speaker recording (i.e., prosody settings) (see paragraph 159).

Regarding claim 10, Bodin discloses a method (see claim 1 above) wherein the first speech-based attribute comprises at least one of an overlap in speech or a silence (i.e., pauses) (see paragraphs 153 and 159).

Regarding claims 18 and 20, Bodin discloses wherein the system (see the claim 19 rejection) comprises a system for performing simulation operations (rendering synthesized data) (see Abstract; also refer to paragraphs 86 and 161).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2, 12, 17 are rejected under 35 U.S.C. 103 as being unpatentable over Bodin in view of Ghosh et al., US 20240161728 (hereinafter Ghosh).

Regarding claims 2 and 12, Bodin discloses a processor and method as described above (see claims 1 and 11).
Although Bodin discloses a method (and processor) as disclosed above, Bodin does not specifically disclose a method further comprising: determining a second discrepancy that represents a second difference between a third rate at which a second speech-based attribute occurs within the first portion of the simulated multi-speaker recording and a second target rate for the second speech-based attribute; and determining that the first discrepancy exceeds the second discrepancy prior to generating the second portion of the simulated multi-speaker recording.

However, Ghosh discloses a method comprising determining a second discrepancy that represents a second difference between a third rate at which a second speech-based attribute occurs within the first portion of the simulated multi-speaker recording and a second target rate for the second speech-based attribute (i.e., during training of a machine learning model (e.g., SM 120), MSTE 114 may select a training input and apply the speech model to the selected training input to generate a training output; MSTE 114 may then compare the training output with the target output (ground truth) and evaluate the observed mismatch using a loss function; the mismatch may be back-propagated through the model (e.g., using gradient descent techniques), and the weights and biases of the model may be adjusted to make the training outputs evolve in the direction of the target outputs, with such adjustments repeated over any number of iterations, epochs, etc.) (see paragraph 31; also refer to paragraph 30, which discloses generating synthetic speech based on speech-based attributes of multiple speakers); and determining that the first discrepancy exceeds the second discrepancy prior to generating the second portion of the simulated multi-speaker recording (i.e., until the output mismatch for a given training input satisfies a predetermined condition (e.g., falls below a predetermined value, converges to an acceptable level of accuracy, etc.); subsequently, a different training input may be selected, a new training output generated, and a new series of adjustments implemented based on a mismatch with the target output, until the model is trained to a target degree of accuracy) (see paragraph 31).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Ghosh to arrive at the claimed invention. A motivation for doing so would have been to generate speech synthesis of high quality.

Regarding claim 17, Bodin discloses a processor as described above (see the claim 11 rejection). Bodin does not specifically disclose a processor further comprising sampling the first target rate from a distribution associated with a set of parameters for generating the simulated multi-speaker recording. However, Ghosh discloses a processor comprising sampling the first target rate from a distribution associated with a set of parameters for generating the simulated multi-speaker recording (i.e., generative models allow sampling from the determined probability distributions during generation of new speech and impart some natural diversity to the generated speech) (see paragraph 16). Ghosh additionally discloses that a computing system 100 may be configured to process text 151 to generate synthetic audio data 170 that may include a suitable audio representation of text 151, e.g., a spoken version of text 151 synthesized based on prior speech samples stored in data repository 101. In some embodiments, synthetic audio data 170 may correspond to an artificial speaker, whereas prior speech samples may be produced by real speakers. Prior speech samples may include suitable audio data, e.g., training spectrogram(s) 103, characterizing speech of a person pronouncing a respective training text 102 (see paragraph 23). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Ghosh to arrive at the claimed invention. A motivation for doing so would have been to generate speech synthesis of high quality.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Bodin in view of Di Gangi et al., US 20250118336 (hereinafter Di Gangi).

Regarding claim 6, Bodin discloses a method (see the claim 1 rejection above). Although Bodin discloses a method as described above, Bodin does not specifically disclose a method further comprising determining a speaker associated with the second portion of the simulated multi-speaker recording based at least on a turn probability associated with the simulated multi-speaker recording. However, Di Gangi discloses a method further comprising determining a speaker associated with the second portion of the simulated multi-speaker recording based at least on a turn probability associated with the simulated multi-speaker recording (i.e., a system 1200 selects and/or determines a voice vector for each speaker; the voice vector encodes speaker characteristics like pitch, volume, timbre, or tone, which uniquely identify a person's voice; its input is a sequence of time segments 1220, with start and stop time stamps, their speaker labels 1230, and an audio signal 1210 for each segment; this process uses a speaker encoder whose input is the audio signal from a segment and whose output is a single vector for the segment; in some embodiments, the speaker encoder for this step is the same as the one used for speaker diarization) (see paragraph 72). In addition, Di Gangi discloses that a speaker diarization system is used to assign audio windows, and their respective word sequences, to speaker labels (see paragraph 56). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teaching of Bodin with the teaching of Di Gangi to arrive at the claimed invention. A motivation for doing so would have been to improve the quality of speaker identification in a multi-channel environment.

Allowable Subject Matter

Although claims 3-4, 7, 9, 13-14, 16 are rejected under § 101 as being directed to an abstract idea, assuming the § 101 rejection is overcome due to a potential amendment, those claims would be objected to as being dependent upon a rejected base claim (see the associated § 102 and § 103 rejections above), but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PIERRE LOUIS DESIR, whose telephone number is (571) 272-7799. The examiner can normally be reached Monday-Friday, 9AM-5:30PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659

Prosecution Timeline

Dec 01, 2023 — Application Filed
Jul 13, 2025 — Non-Final Rejection (§101, §102, §103)
Oct 16, 2025 — Response Filed
Feb 12, 2026 — Final Rejection (§101, §102, §103) (current)
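The timeline arithmetic is easy to reproduce. A small sketch computing each event's offset from filing, using the dates above (whole-month granularity is an assumption for display purposes):

```python
from datetime import date

def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

filed = date(2023, 12, 1)
timeline = [
    ("Non-Final Rejection", date(2025, 7, 13)),
    ("Response Filed", date(2025, 10, 16)),
    ("Final Rejection", date(2026, 2, 12)),
]
for name, when in timeline:
    # e.g. the Final Rejection lands in month 26 of prosecution
    print(f"{name}: month {months_between(filed, when)}")
```

At 26 months in with a final rejection pending, the 4y 4m (52-month) median time to grant implies roughly two more years of prosecution if the projection holds.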

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585679: EXECUTING UNSUPERVISED PRE-TRAINING TASKS WITH A MACHINE LEARNING MODEL TO PREDICT DOCUMENT GRAPH ATTRIBUTES
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12562154: Scalable Model Specialization Framework for Speech Model Personalization
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12555594: SYSTEM AND METHOD FOR TRACKING EMOTIONAL STATE OF A CALLER USING ARTIFICIAL INTELLIGENCE
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12542137: MULTI-PERSON LLM ASSISTANT INTERACTIONS
Granted Feb 03, 2026 (2y 5m to grant)

Patent 12541672: ADDRESSING CATASTROPHIC FORGETTING AND OVER-GENERALIZATION WHILE TRAINING A NATURAL LANGUAGE TO A MEANING REPRESENTATION LANGUAGE SYSTEM
Granted Feb 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 61%
With Interview: 92% (+31.5%)
Median Time to Grant: 4y 4m
PTA Risk: Moderate

Based on 285 resolved cases by this examiner. Grant probability derived from career allow rate.
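The 92% with-interview figure is consistent with simply adding the +31.5-point interview lift to the 61% base grant probability. The additive model below is an inference from the displayed numbers, not the vendor's documented methodology:

```python
base_grant_probability = 0.61  # examiner's career allow rate
interview_lift = 0.315         # +31.5-point lift among interviewed cases

# Assumed additive model: base rate plus lift, capped at 1.0.
with_interview = min(base_grant_probability + interview_lift, 1.0)
# 0.925, shown on the dashboard rounded to 92%
```

An additive point model like this ignores selection effects (applicants who request interviews may differ systematically from those who do not), so the 92% figure is best read as correlation, not a causal estimate of interviewing.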
