Prosecution Insights
Last updated: April 19, 2026
Application No. 18/598,996

SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS

Final Rejection §103
Filed
Mar 07, 2024
Examiner
SIRJANI, FARIBA
Art Unit
2659
Tech Center
2600 — Communications
Assignee
Datum Point Labs Inc.
OA Round
2 (Final)
76%
Grant Probability
Favorable
3-4
OA Rounds
2y 10m
To Grant
99%
With Interview

Examiner Intelligence

Grants 76% — above average
76%
Career Allow Rate
414 granted / 547 resolved
+13.7% vs TC avg
Strong +31% interview lift
+31.0%
Interview Lift
Based on resolved cases with interview
Typical timeline
2y 10m
Avg Prosecution
31 currently pending
Career history
578
Total Applications
across all art units

Statute-Specific Performance

§101: 14.1% (-25.9% vs TC avg)
§103: 49.1% (+9.1% vs TC avg)
§102: 14.7% (-25.3% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)
Tech Center average shown for comparison is an estimate • Based on career data from 547 resolved cases
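The per-statute deltas above are internally consistent with a single Tech Center baseline. Below is a quick check, assuming (as this dashboard appears to do, though the methodology is not documented) that "vs TC avg" is a simple percentage-point difference; the variable names are illustrative only.

```python
# Sketch of how the "vs TC avg" deltas appear to relate to the per-statute rates shown above,
# assuming the delta is a simple percentage-point difference (an assumption about this
# dashboard's methodology, not a documented formula).
examiner_rates = {"§101": 14.1, "§103": 49.1, "§102": 14.7, "§112": 10.7}   # percent
deltas_vs_tc   = {"§101": -25.9, "§103": +9.1, "§102": -25.3, "§112": -29.3}

implied_tc_avg = {k: round(examiner_rates[k] - deltas_vs_tc[k], 1) for k in examiner_rates}
print(implied_tc_avg)   # {'§101': 40.0, '§103': 40.0, '§102': 40.0, '§112': 40.0}
```

Every statute implies the same 40.0% baseline, which suggests the comparison line is a single Tech Center average rather than a statute-specific one.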

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claims 1-20 are pending. Claims 1, 8, and 15 are independent and are amended. This Application was published as U.S. 20240339104. Apparent priority: 6 April 2023. Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims. This action is Final.

Response to Amendments and Arguments

Applicant’s arguments are directed to the amended language and are moot in view of the modified grounds of rejection. Claim 1 is amended as follows, and the other independent Claims are amended similarly: “receiving, via a data interface, an input text, a reference spectrogram, and an emotion identifier (emotion ID);” to remove the option of a speaker identifier, which was claimed in the alternative in the original claim. A new reference is added to address the amendment.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-6, 8-10, 12-13, 15-17, and 19 are rejected under 35 U.S.C. 103 as unpatentable over Li (U.S. 20250349282) in view of Mukherjee (U.S. 20230099732).

Regarding Claim 1, Li teaches:

1. A method of text to speech synthesis, the method comprising:

receiving, via a data interface, an input text, a reference spectrogram, and an emotion identifier (emotion ID); [Li, Figure 2, “zero-shot personalized TTS model 200.” The “TTS module 206” receives “input text 215,” the “reference Mel-Spectrogram 210,” and the “Speaker Embedding 214” / “Speaker ID.” “[0063] … The speaker encoder module then takes the acoustic features as input, and output speaker embedding, which represents speaker identity of target speaker….”]

generating, via a first encoder, a vector representation of the input text; [Li, Figure 5 shows an expanded version of the “TTS module 206” as “TTS module 500,” which includes a “phone embedding 506” and “conformer encoder 508” / “first encoder” that receives the “input text 502” and generates an embedding/vector of the “input text 502.” “[0073] … As illustrated in FIG. 5, the TTS module comprises components based on a conformer TTS model, wherein the TTS module 500 takes input text 502 and converts it to phoneme identifiers (e.g., Phoneme IDs 504). These are then converted to a phone embedding 506.”]

generating, via a second encoder, a vector representation of the reference spectrogram; [Li, Figure 4, the “Reference Mel-Spectrogram 402” is input to the “Speaker Encoder 400” / “second encoder.” In Figure 5, the same “Mel-spectrogram 520” is input to the “speaker encoder 522” / “second encoder.” In this reference, the element numbers are changed to conform to the figure number.
“[0071] … The speaker encoder 400 takes a reference Mel-spectrogram (e.g., Mel-spectrogram 314 from FIG. 3) as input and generates a 256-dimension speaker embedding for each target speaker (e.g., speaker embedding 408) based on each Mel-spectrogram received. As shown in FIG. 4, the speaker encoder 400 comprises one or more LSTM layers (e.g., LSTM layer 404) and linear transform layer 406.”]

generating, via a variance adaptor, a modified vector representation based on a combined representation including a combination of the vector representation of the input text, the vector representation of the reference spectrogram, and an embedding of the emotion ID; and [Li, Figure 5, “Variance Adaptor 516” / “variance adaptor” receives three inputs: (1) the “Input Text 502,” after passing through the stages that convert it to an embedding and after the “conformer encoder 508”; and (2 and 3) an input coming from the “speaker embedding 528,” which includes both the “reference spectrogram” and the “speaker ID” of the Claim. See Figure 4, where the “speaker embedding 408” is generated from the “reference mel-spectrogram 402,” and [0063] provided above, which teaches that the speaker ID is reflected in the speaker embedding. The output of the “Variance Adaptor 516” is the “modified vector representation.” (There are also another three inputs coming from the global prosodic, style, and language values (514, 512, 510), which are the global values that are modified by the adaptor 516.)]

generating, via a decoder, an audio waveform based on the modified vector representation. [Li, Figure 2, the output of the “TTS Module 206” is “synthesized speech 216” / the waveform of the Claim. Figure 5 shows that the “TTS Module 500,” which sits in place of TTS Module 206 of Figure 2, ends with a “Conformer Decoder 518” / “Decoder” of the Claim. “[0080] … The conformer decoder 518 is configured to generate acoustic features, such as predicted Mel-spectrogram (e.g., Mel-spectrogram 520) for the target speaker voice.” “[0081] Finally, a vocoder (such as a well-trained Universal MelGAN vocoder, for example), can be used to convert the predicted Mel-spectrogram to waveforms.”]

Li does not teach the use of an emotion identifier for the TTS process. Mukherjee teaches:

receiving, via a data interface, an input text, a reference spectrogram, and an emotion identifier (emotion ID); [Mukherjee, Figure 1, the input is “text 112” and the output is “audio data 116.” Figure 2, the inputs are “text 112” and “spectrograms 208,” and the outputs of an “emotional classifier 124” are “emotional labels” that teach the “emotion ID” of the Claim and are input to another encoder for generating an embedding. “[0030] … The emotional classifier model 124 is trained based upon emotional classification, that is, the emotional classifier model 124 is trained with words or phrases that have emotional labels assigned thereto, where the emotional labels identify respective emotions assigned to the words or phrases (e.g., happy, sad, angry, etc.). As such, the emotional classifier model 124 is configured to assign an emotional label (e.g., happy, sad, angry, etc.) to text based upon words of the text….”]

generating, via a first encoder, a vector representation of the input text; [Mukherjee has different encoders for different tasks, Figure 2: “[0038] As depicted in FIG. 2, the emotional classifier model 124 takes the text 112 as input and outputs a textual embedding 224 based upon the input.
In general, the textual embedding 224 captures semantic information about the text 112….” “[0037] The TTS model 122 includes N encoder(s) 212, where N is a positive integer and where each of the encoder(s) 212 are connected to one another…”]

generating, via a second encoder, a vector representation of the reference spectrogram; [Mukherjee, Figure 2, the “spectrograms 208” are input to a “decoder pre-net 206,” which works like an encoder, and there is also the “scaled positional encoder 210” down the line, which can be mapped to the “second encoder” of the Claim.]

generating, via a variance adaptor, a modified vector representation based on a combined representation including a combination of the vector representation of the input text, the vector representation of the reference spectrogram, and an embedding of the emotion ID; and [Mukherjee, Figure 2, the output of the “emotional classifier 124” is a “textual embedding 224,” which teaches the “embedding of the emotion ID” of the Claim. Two input branches coming from the bottom of Figure 2 are provided to a “concatenation 226” and then input to the Decoders, and the output of the “scaled positional encoder 210” comes to the decoders as well. “[0038] As depicted in FIG. 2, the emotional classifier model 124 takes the text 112 as input and outputs a textual embedding 224 based upon the input. In general, the textual embedding 224 captures semantic information about the text 112. The TTS model 122 concatenates the textual embedding 224 and the phoneme encoding 222 to generate a concatenation 226 of the textual embedding 224 and the phoneme encoding 222. According to embodiments, a variance adapter (not shown in FIG. 2) adds variance information to the phoneme encoding 222 (e.g., duration, pitch, and energy).” The “variance adaptor” in Mukherjee does not receive the spectrogram embedding, but adding the variance information (duration, pitch, and energy) adds the information that is normally in a spectrogram.]

generating, via a decoder, an audio waveform based on the modified vector representation. [Mukherjee, Figure 1, the input is “text 112” and the output is “audio data 116” generated by a decoder of the TTS. “[0035] The TTS application 110 may further include an audio converter 132. According to embodiments, the audio converter 132 is included in a separate application that executes on a separate computing device. The audio converter 132 is generally configured to convert the spectrogram 130 into audio data 116 (e.g., a waveform) such that the speech 118 can be output by the speaker 120 based upon the audio data 116.” “[0025] The computing system 102 includes a processor 104 and memory 106. The computing system 102 may also include a data store 108. The memory 106 has a TTS application 110 loaded therein. As will be described in greater detail below, the TTS application 110, when executed by the processor 104, is generally configured to (1) obtain computer-readable text 112 from an (electronic) text source 114; (2) facilitate generation of audio data 116 (e.g., a waveform) based upon the computer-readable text 112; and (3) cause speech 118 to be played over a speaker 120 based upon the audio data 116, where the speech 118 reflects an emotion that is associated with words of the text 112.” “…The computing system generates, by way of an encoder of a text to speech (TTS) model, a phoneme encoding based upon the phoneme sequence.
The computing system provides the textual embedding and the phoneme encoding as input to a decoder of the TTS model. The computing system causes speech that includes the words to be played over a speaker based upon output of the decoder of the TTS model, where the speech reflects an emotion underlying the text due to the textual embedding provided to the encoder.” Abstract.] [Figure 4, 418.]

Li and Mukherjee pertain to TTS, and it would have been obvious to add the emotionally labeled/tagged data of Mukherjee to the system of Li to generate emotional speech instead of speaker-specific speech or in addition to it. This combination falls under simple substitution of one element for a similar one with predictable results, combining prior art elements according to known methods to yield predictable results, or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. [Cited figures omitted.]

Regarding Claim 2, Li teaches:

2. The method of claim 1, wherein the generating the vector representation of the input text includes generating a sequence of phoneme embeddings. [Li, Figure 5, “Phone embedding 506” generated from the “input text 502.”]

Regarding Claim 3, Li teaches:

3. The method of claim 1, wherein the modified vector representation is a Mel-spectrogram. [Li, the “modified vector” of the Claim is the output from the “variance adaptor 516” and after decoding generates the “Mel-spectrogram 520.” “[0079] The combination of embeddings, as described above, is finally received by the variance adaptor 516. The variance adaptor 516 is used to predict phoneme duration, which means it predicts the total time taken by each phoneme. It also predicts the phone-level fundamental frequency, which is the relative highness or lowness of a tone as perceived by human.” “[0080] After the phone-level duration and fundamental frequency prediction, the encoder output expands according to phoneme duration, which is then input into the conformer decoder 518. The conformer decoder 518 is configured to generate acoustic features, such as predicted Mel-spectrogram (e.g., Mel-spectrogram 520) for the target speaker voice.”]

Regarding Claim 5, Li teaches:

5. The method of claim 4, further comprising: generating at least one of a vector representation of a pitch prediction or a vector representation of an energy prediction based on the combined representation, [Li, the “combined representation” is the combination that is input to the “variance adaptor” of the Claim, which is mapped to “variance adaptor 516” of Figure 5. The output is: “[0079] The combination of embeddings, as described above, is finally received by the variance adaptor 516. The variance adaptor 516 is used to predict phoneme duration, which means it predicts the total time taken by each phoneme. It also predicts the phone-level fundamental frequency, which is the relative highness or lowness of a tone as perceived by human.” Fundamental frequency is pitch.

wherein the generating the modified vector representation is based on a combination of the combined representation and at least one of the vector representation of the energy prediction or the vector representation of the pitch prediction. [Li, Figure 2 shows the generation of the “speaker embedding 528” of Figure 5, which is one of the inputs to the “variance adaptor 516” that generates the “modified representation” of the Claim.
Figure 2 shows that the “feature extractor 202” extracts “prosodic features 212” that include both “fundamental frequency 212A” and “energy 212B.” The output of “prosodic features 212” goes into the “TTS Module 206,” which is shown as “TTS Module 500” in Figure 5, which includes the internals of the TTS module.]

Regarding Claim 6, Li teaches:

6. The method of claim 1, wherein at least one of the first encoder or the decoder includes at least one convolutional neural network (CNN) layer and at least one self-attention layer. [Li, Figure 4 shows the “speaker encoder 400” as including a “LSTM layer 404,” which is normally not a CNN unless combined with one. But Figure 1 shows that its “ML Engines 150” include a “training engine 152” that trains CNNs: “[0057] The training engine 152 is configured to train the parallel convolutional recurrent neural networks and/or the individual convolutional neural networks, recurrent neural networks, learnable scalars, or other models included in the parallel convolutional recurrent neural networks. The training engine 152 is configured to train the zero-shot model 144 and/or the individual model components (e.g., feature extractor 145, speaker encoder 146, and/or TTS module 147, etc.).” It also teaches: “[0077] The global style tokens 512 are generated by a global style token module which consists of a reference encoder and a style attention layer. The global style token module is configured to help capture residual prosodic features, including the speaking rate of the target speaker, in addition to other prosodic features extracted using the feature extractor.” Li thus teaches the use of an attention layer in encoding the “global style tokens 512” that are input to the “variance adaptor 516” in Figure 5, which teaches the use of transformer/attention models inside the first encoder / TTS module 206/500.]

Claim 8 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Additionally:

8. A system for text to speech synthesis, the system comprising:

a memory that stores a plurality of processor executable instructions; [Li, Figure 1, “hardware storage Devices 124, 140.”]

a data interface that receives an input text, a reference spectrogram, and at least one of an emotion ID or speaker ID; and [Li, Figure 1, “I/O Devices 116.”]

one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: [Li, Figure 1, “processors 112, 112.”] …

Claim 9 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.

Claim 10 is a system claim with limitations corresponding to the limitations of Claim 3 and is rejected under similar rationale.

Claim 12 is a system claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.

Claim 13 is a system claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
Claim 15 is a computer program product claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.

Claim 16 is a computer program product claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale.

Claim 17 is a computer program product claim with limitations corresponding to the limitations of method Claim 3 and is rejected under similar rationale.

Claim 19 is a computer program product claim with limitations corresponding to the limitations of method Claim 5 and is rejected under similar rationale.

Claims 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li and Mukherjee in view of Joly (U.S. 11978431).

Regarding Claim 4, Li teaches:

4. The method of claim 1, wherein the generating the modified vector representation includes: [Li, the “modified vector” of the Claim is the output from the “variance adaptor 516”.]

generating a duration prediction based on the combined representation; and [Li, this is what is generated by the “Variance Adaptor 516,” which is called a “variance adaptor” by the Claim as well: “[0079] The combination of embeddings, as described above, is finally received by the variance adaptor 516. The variance adaptor 516 is used to predict phoneme duration, which means it predicts the total time taken by each phoneme. It also predicts the phone-level fundamental frequency, which is the relative highness or lowness of a tone as perceived by human.”]

duplicating the modified vector representation based on the duration prediction. The duplication in the instant Application is for adjusting the length for Mel-Spectrum calculation purposes and in the context of calculation of the phoneme duration. (See instant Application: “[0023] … In some embodiments, the output vector 212 of length regulator 210 is a Mel-spectrogram, and to adjust for duration the Mel-spectrogram frames (Mel-frames) are duplicated….”) This is not taught by Li or Mukherjee.

Joly teaches:

generating a duration prediction based on the combined representation; and [Joly, “The phoneme duration predictor 550 may include one or more BiLSTM layer(s) that may process the phoneme embedding data 506, and one or more CNN layer(s) that may process the output of the BiLSTM layer(s). One or more LSTM layer(s) may process the output(s) of the CNN layer(s) to determine the duration data 554. In some embodiments, the phoneme duration predictor 550 includes one BiLSTM layer, three CNN layers, and one LSTM layer.” 10:16-23.]

duplicating the modified vector representation based on the duration prediction. [Joly teaches that duplication of data is conducted in the context of upsampling and in the course of determining the phoneme duration: “Referring to FIG. 5, a speech-synthesis component 270g may include an upsampling encoder 552 that upsamples the feature embedding data 526 for application to the phoneme embedding data 506 in accordance with duration data 554 determined by a phoneme duration predictor 550. The phoneme duration predictor 550 may thus determine, for a given item of feature embedding data 526, how many items of phoneme data (and/or parts of phonemes) should be modified using the item of feature embedding data 526.
For example, if a given item of duration data corresponds to duration of “5,” the upsampling encoder 552 may upsample (e.g., duplicate) a corresponding item of feature embedding data 526 by a factor of 5 and apply the feature embedding data to five phonemes in the phoneme embedding data 506.” 10:1-15.]

Li/Mukherjee and Joly pertain to the training and use of TTS models implemented in neural networks which include a phoneme duration prediction step, and it would have been obvious to combine the upsampling/duplication of Joly, which is conducted in the course of phoneme duration prediction, with the system of Li/Mukherjee as a known method of phoneme duration detection which Li does not care to elaborate upon. This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 11 is a system claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale.

Claim 18 is a computer program product claim with limitations corresponding to the limitations of method Claim 4 and is rejected under similar rationale.

Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li and Mukherjee in view of Kim (U.S. 20250349282).

Regarding Claim 7, Li teaches:

7. The method of claim 1, further comprising: updating parameters of at least one of the first encoder, the second encoder, the variance adaptor, or the decoder via backpropagation based on a comparison of the modified vector representation to a ground truth vector representation. [Li teaches a “training engine 152” in Figure 1 and shows training / “updating of the parameters” of its “TTS model 608” in Figure 6, and Figure 5 also teaches a loss calculation at 525, which is for training of the “TTS module 500,” including its encoders 508 and decoders 518 and the variance adaptor 516. Figure 6, “Training Corpus 602” includes the “ground truth.” “[0043] Herein, “training data” refers to labeled data and/or ground truth data configured to be used to pre-train the TTS model used as the source model that is configurable as the zero-shot model 144….” “[0083] Using the flow-based architecture described above, the model is able to learn proper target speaker alignment without a requirement of ground truth duration from an external tool.” The comparison of the Claim is taught by: “[0088] Attention will now be directed to FIG. 6, which illustrates an example embodiment of a process flow diagram for training a source text-to-speech model … A speaker cycle consistency training loss (e.g., cycle loss 614) is added to minimize cosine similarity between speaker embedding generated from ground truth and synthesized audio, which encourages the TTS model 608 to synthesize higher speaker similarity speech….”]

Claim 7 expresses a standard training process, and training of neural networks like the LSTM of Li normally uses backpropagation in the loss calculation, but this term is not express in Li or Mukherjee. Kim teaches:

updating parameters of at least one of the first encoder, the second encoder, the variance adaptor, or the decoder via backpropagation based on a comparison of the modified vector representation to a ground truth vector representation.
[Kim, “[0081] The artificial neural network-based text-to-speech synthesis apparatus may be learned using a large database where texts and speech signals are present in pairs. A loss function may be defined by comparing an output obtained by entering a text as an input to an answer speech signal. The text-to-speech synthesis apparatus may learn the loss function through an error back propagation algorithm and thus finally may obtain a single artificial neural network text-to-speech synthesis model that outputs a desired speech when any text is input.” Figure 11, “Model Learning 1114.” “[0134] Also, the model learning unit 1114 may train the data learning model using, for example, a learning algorithm including error back propagation or gradient descent.”]

Li/Mukherjee and Kim pertain to the training and use of TTS models implemented in neural networks, and it would have been obvious to use the back propagation algorithm of Kim in the loss function calculation of Li/Mukherjee as one method of optimizing the loss and training the model. This combination falls under simple substitution of one known element for another to obtain predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 14 is a system claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale.

Claim 20 is a computer program product claim with limitations corresponding to the limitations of method Claim 7 and is rejected under similar rationale.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. This Application and 18/604,278, published as US 20240339103, by the same inventors and assigned to the same assignee, include similar material but originate from different provisional applications, and ODP was not observed.

Stewart (US 4015087): “It is a further object of the invention to provide improved apparatus for the analysis and display of the spectrum of complex signals, such as speech, in real time to provide a spectrogram wherein the distribution of energy of the signal in frequency is displayed continuously with respect to time so that the fine detail present in the spectrum, such as pitch and other periodic variations, will be shown in the spectrogram.” 2:22-30.

Kim (U.S. 20240105160): [cited figure omitted]

Squibbs (U.S. 7,103,548): [cited figure omitted]

Kirkeby (U.S. 20100145705): [cited figures omitted] [0003] The most obvious solution to this problem is to store and/or transmit the added multimedia content with the original text content. However, this increases the amount of data by at least an order of magnitude since the text format is much more compact than graphics and sound. U.S. Pat. No. 7,103,548 disclosed a system for converting a text message into audio form, the text message having embedded emotion indicators and feature-type indications, the latter of which serve to determine which of multiple audio-form presentation feature types are to be used to express, in the audio form of the text message, the emotions indicated by said emotion indicators. And currently MSN Messenger allows for the sender to write tags in a text that is then translated into a picture at the receiving end.
However, preparing the content in advance eliminates the possibility of a context-dependent ‘surprise effect’. Furthermore, if a certain ambient soundscape, say, rain and wind, is added to the speech and played back through a single loudspeaker in a conventional mobile device it will sound like disturbing background noise and reduce intelligibility.

Jang (U.S. 20210090551): [cited figure omitted]

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI, whose telephone number is (571) 270-1499. The examiner can normally be reached 9 to 5, M-F.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Fariba Sirjani/
Primary Examiner, Art Unit 2659
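For orientation, the pipeline recited in amended Claim 1, as characterized in the rejection above, can be sketched in a few dozen lines of PyTorch-style code. This is a minimal illustrative sketch only: module names, dimensions, and layer choices (including the conv-plus-self-attention text encoder touching Claim 6 and the duration-based frame duplication touching Claim 4) are assumptions made for readability, not the applicant's implementation and not the architectures of Li or Mukherjee.

```python
# Illustrative sketch only: a minimal PyTorch-style rendering of the pipeline recited in
# amended Claim 1 (text encoder, reference-spectrogram encoder, emotion-ID embedding,
# variance adaptor, decoder). Names, dimensions, and layer choices are assumptions made
# for readability; they are not taken from the application, Li, or Mukherjee.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """'First encoder': phoneme/token IDs -> sequence of vectors (conv + self-attention)."""
    def __init__(self, vocab_size=256, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, token_ids):                       # (B, T_text)
        x = self.embed(token_ids)                       # (B, T_text, d)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.attn(x)                             # (B, T_text, d)


class ReferenceEncoder(nn.Module):
    """'Second encoder': reference mel-spectrogram -> a single style/speaker vector."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, ref_mel):                         # (B, T_ref, n_mels)
        _, (h, _) = self.lstm(ref_mel)
        return self.proj(h[-1])                         # (B, d)


class VarianceAdaptor(nn.Module):
    """Predicts duration, pitch, and energy from the combined representation and
    expands the sequence by duplicating frames according to the predicted duration."""
    def __init__(self, d_model=256):
        super().__init__()
        self.duration = nn.Linear(d_model, 1)
        self.pitch = nn.Linear(d_model, 1)
        self.energy = nn.Linear(d_model, 1)

    def forward(self, combined):                        # (B, T_text, d)
        dur = torch.clamp(torch.round(torch.exp(self.duration(combined))), min=1).long()
        combined = combined + self.pitch(combined) + self.energy(combined)
        # Length regulation: duplicate each position dur[i] times (per-example loop for clarity).
        expanded = [seq.repeat_interleave(d.squeeze(-1), dim=0) for seq, d in zip(combined, dur)]
        return nn.utils.rnn.pad_sequence(expanded, batch_first=True)   # (B, T_mel, d)


class Decoder(nn.Module):
    """Decoder: expanded representation -> mel frames (a separate vocoder would give the waveform)."""
    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, x):
        return self.to_mel(self.attn(x))                # (B, T_mel, n_mels)


class EmotionTTS(nn.Module):
    def __init__(self, n_emotions=8, d_model=256):
        super().__init__()
        self.text_encoder = TextEncoder(d_model=d_model)
        self.ref_encoder = ReferenceEncoder(d_model=d_model)
        self.emotion_embed = nn.Embedding(n_emotions, d_model)   # embedding of the emotion ID
        self.adaptor = VarianceAdaptor(d_model)
        self.decoder = Decoder(d_model)

    def forward(self, token_ids, ref_mel, emotion_id):
        text_vecs = self.text_encoder(token_ids)                  # (B, T_text, d)
        style = self.ref_encoder(ref_mel).unsqueeze(1)            # (B, 1, d)
        emo = self.emotion_embed(emotion_id).unsqueeze(1)         # (B, 1, d)
        combined = text_vecs + style + emo                        # the "combined representation"
        return self.decoder(self.adaptor(combined))               # predicted mel frames
```

In these terms, the dispute in the rejection is narrow: the examiner reads everything except the emotion_embed lookup onto Li, and reads that remaining element onto Mukherjee's textual embedding 224, which is an emotion-derived embedding concatenated with the phoneme encoding rather than a lookup of an explicit emotion ID.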
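Claim 7's backpropagation limitation, which the rejection reads onto Kim's error back-propagation teaching, corresponds to an ordinary supervised training step. The following is a minimal sketch, assuming the hypothetical EmotionTTS module from the sketch above and a dataset of (text, reference mel, emotion ID, ground-truth mel) examples; the loss choice, length alignment, and optimizer settings are illustrative assumptions, not taken from the application or the cited references.

```python
# Illustrative sketch only: ordinary supervised training of the EmotionTTS sketch above,
# showing the Claim 7 limitation (parameter updates via backpropagation against a
# ground-truth representation). Dataset, loss, and optimizer choices are assumptions.
import torch
import torch.nn as nn


def train_step(model, optimizer, token_ids, ref_mel, emotion_id, target_mel):
    """One update: compare predicted mel frames to ground truth and backpropagate."""
    optimizer.zero_grad()
    pred_mel = model(token_ids, ref_mel, emotion_id)
    # Crude length alignment for the comparison (real systems align via durations).
    t = min(pred_mel.size(1), target_mel.size(1))
    loss = nn.functional.l1_loss(pred_mel[:, :t], target_mel[:, :t])
    loss.backward()            # backpropagation through decoder, adaptor, and both encoders
    optimizer.step()           # updates parameters of all sub-modules
    return loss.item()

# Usage (shapes and values are illustrative):
# model = EmotionTTS()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = train_step(model, opt, token_ids, ref_mel, emotion_id, target_mel)
```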

Prosecution Timeline

Mar 07, 2024
Application Filed
Jan 03, 2026
Non-Final Rejection — §103
Feb 04, 2026
Response Filed
Feb 27, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603099
SELF-ADJUSTING ASSISTANT LLMS ENABLING ROBUST INTERACTION WITH BUSINESS LLMS
2y 5m to grant • Granted Apr 14, 2026
Patent 12579482
Schema-Guided Response Generation
2y 5m to grant • Granted Mar 17, 2026
Patent 12572737
GENERATIVE THOUGHT STARTERS
2y 5m to grant • Granted Mar 10, 2026
Patent 12537013
AUDIO-VISUAL SPEECH RECOGNITION CONTROL FOR WEARABLE DEVICES
2y 5m to grant • Granted Jan 27, 2026
Patent 12492008
Cockpit Voice Recorder Decoder
2y 5m to grant • Granted Dec 09, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
76%
Grant Probability
99%
With Interview (+31.0%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 547 resolved cases by this examiner. Grant probability derived from career allow rate.
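Below is a quick sketch of the stated derivation, assuming "grant probability" is simply grants divided by resolved cases from the Examiner Intelligence figures above; the interview-adjusted 99% figure is not derived here, since its method is not stated.

```python
# Sketch of the stated derivation "grant probability derived from career allow rate",
# assuming it is grants / resolved cases from the figures shown above (an assumption
# about this dashboard's methodology, not a documented formula).
granted, resolved = 414, 547
career_allow_rate = granted / resolved
print(f"{career_allow_rate:.1%}")                                        # 75.7% -> shown as 76%
print(f"TC average implied by +13.7%: {career_allow_rate - 0.137:.1%}")  # ~62.0%
```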
