Prosecution Insights
Last updated: April 19, 2026
Application No. 17/936,101

SYLLABLE-BASED TEXT CONVERSION FOR PRONUNCIATION HELP

Non-Final OA (§101, §103)
Filed: Sep 28, 2022
Examiner: CHUNG, DANIEL WONSUK
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: International Business Machines Corporation
OA Round: 1 (Non-Final)
Grant Probability: 54% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 2y 10m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 54% (grants 54% of resolved cases; 24 granted / 44 resolved; -7.5% vs TC avg)
Interview Lift: strong, +37.5% among resolved cases with an interview
Typical Timeline: 2y 10m average prosecution; 33 applications currently pending
Career History: 77 total applications across all art units
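A quick consistency check of these figures; this is a sketch only, and it assumes the interview lift is reported in percentage points over the career allow rate:

```python
# Career allow rate from the examiner's resolved docket (24 granted / 44 resolved).
granted, resolved = 24, 44
allow_rate = granted / resolved               # 0.545 -> displayed as 54%

# Reported interview lift, taken here as percentage points.
interview_lift = 0.375
with_interview = allow_rate + interview_lift

print(f"career allow rate: {allow_rate:.1%}")      # 54.5%
print(f"with interview:    {with_interview:.1%}")  # 92.0%, matching the dashboard
```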

Statute-Specific Performance

§101: 25.2% (-14.8% vs TC avg)
§103: 52.3% (+12.3% vs TC avg)
§102: 17.3% (-22.7% vs TC avg)
§112: 5.2% (-34.8% vs TC avg)
Tech Center averages are estimates. Based on career data from 44 resolved cases.

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending and have been examined.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 9/28/2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement was considered and attached by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claims 1, 19, and 20, the limitations of “receiving, via a computer, an input text in a first language”, “receiving, via the computer, a selection of a target language that is different from the first language”, “obtaining, via the computer and from the target language, syllables with a pronunciation most closely matching a pronunciation of the input text in the first language, wherein the obtaining is based on a comparison of one or more spectrograms for the input text with one or more spectrograms for text of the target language”, and “presenting, via the computer, the obtained syllables in the target language”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, the claims recite the mental process of a human reading text and thinking of characters in another language that would pronounce the read text by using spectrogram comparison of language syllables in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the --Mental Processes-- grouping of abstract ideas. Accordingly, the claims recite an abstract idea.

This judicial exception is not integrated into a practical application because the recitation of a computer system in claim 19 reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted using P0038-P0040 in the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to read text and think of characters in another language that would pronounce the read text in the mind amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
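For orientation, the comparison step the independent claims recite (scoring target-language syllables by spectrogram similarity) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the applicant's disclosed implementation: it assumes mel spectrograms have already been computed as numpy arrays shaped (frequency, time), and every function and variable name here is illustrative.

```python
import numpy as np

def spectrogram_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two (freq x time) spectrograms.

    Truncates to the shorter time axis; a real system would time-align
    or pad, but truncation keeps the sketch self-contained."""
    t = min(a.shape[1], b.shape[1])
    return float(np.linalg.norm(a[:, :t] - b[:, :t]))

def closest_syllable(input_spec: np.ndarray,
                     target_inventory: dict[str, np.ndarray]) -> str:
    """Return the target-language syllable whose spectrogram best
    matches the spectrogram of one input-text syllable."""
    return min(target_inventory,
               key=lambda s: spectrogram_distance(input_spec, target_inventory[s]))
```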
With respect to claim 2, the claim recites “separating, via the computer, the input text into syllables in the first language, wherein the syllables in the first language are used to generate the one or more spectrograms for the input text”, which reads on a human separating text into syllables in the mind. No additional limitations are present.

With respect to claim 3, the claim recites “calculating, via the computer, a time to pronounce each of the syllables in the first language” and “dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time”, which reads on a human thinking of the time to pronounce syllables in the mind and dividing the spectrogram according to that time. No additional limitations are present.

With respect to claim 4, the claim recites “identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text” and “dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the identified points of zero amplitude”, which reads on a human identifying points of no sound when pronouncing text in the mind and dividing the spectrogram according to the identified points. No additional limitations are present.

With respect to claim 5, the claim recites “determining a respective time value at the identified points of zero amplitude, wherein the dividing is based on the determined respective time value”, which reads on a human identifying points of no sound when pronouncing text in the mind and dividing the spectrogram according to the identified points. No additional limitations are present.

With respect to claim 6, the claim recites “calculating, via the computer, a time to pronounce each of the syllables in the first language”, “identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text”, and “dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time and on the identified points of zero amplitude”, which reads on a human identifying points of no sound when pronouncing text and the time to pronounce syllables in the mind, and dividing the spectrogram according to the identified points and the time to pronounce the syllables. No additional limitations are present.

With respect to claim 7, the claim recites “recording as an audio waveform the pronunciation of the input text in the first language” and “generating the one or more spectrograms for the input text based on the audio waveform”, which reads on a human thinking of the pronunciation of the input text and thinking of a spectrogram according to the pronunciation. No additional limitations are present.

With respect to claim 8, the claim recites “generating embeddings from the spectrograms from the received input text, wherein the comparison of the one or more spectrograms for the input text with the one or more spectrograms for the text of the target language comprises comparing the generated embeddings for the input text with embeddings generated from the one or more spectrograms for the text of the target language”, which reads on a human dividing read text into graphemes and comparing the graphemes to graphemes of the target language. No additional limitations are present.
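Claims 3-6 recite two concrete ways to cut the input-text spectrogram into per-syllable pieces: by a calculated pronunciation time per syllable, and by points of zero amplitude in the recorded waveform. A minimal sketch of both follows, assuming a fixed hop between spectrogram frames; the hop value and all names are illustrative, not taken from the application.

```python
import numpy as np

HOP_S = 512 / 22050  # assumed seconds between adjacent spectrogram frames

def split_by_durations(spec: np.ndarray, syllable_secs: list[float]) -> list[np.ndarray]:
    """Claim-3 style: cut a (freq x time) spectrogram at the cumulative
    per-syllable pronunciation times, converted to frame indices."""
    cuts = np.cumsum([round(s / HOP_S) for s in syllable_secs])[:-1]
    return np.split(spec, cuts, axis=1)

def zero_amplitude_times(wave: np.ndarray, sr: int, eps: float = 1e-4) -> np.ndarray:
    """Claim-4 style: times (in seconds) where the waveform is effectively
    silent; these times would then be mapped to frame indices for cutting."""
    return np.flatnonzero(np.abs(wave) < eps) / sr
```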
With respect to claim 9, the claim recites “wherein the comparison of the generated embeddings for the input text with the embeddings generated from the one or more spectrograms for the text of the target language comprises performing cosine similarity calculations”, which reads on a human performing cosine similarity calculations on vector representations of a grapheme and a grapheme of the target language in the mind. No additional limitations are present.

With respect to claim 10, the claim recites “wherein the presenting of the obtained syllables in the target language comprises displaying the obtained syllables in the target language on a screen of the computer along with other text in the first language”, which reads on a human writing the obtained syllables on paper using a pen or pencil. No additional limitations are present.

With respect to claim 11, the claim recites “wherein the presenting of the obtained syllables comprises playing an audio recording of the syllables in the target language”, which reads on a human thinking of the syllable pronunciation in the mind and producing a verbal output. No additional limitations are present.

With respect to claim 12, the claim recites “receiving, via the computer, an indication of the first language via a selection of the first language”, which reads on a human obtaining a first language in the mind through an indication in the text. No additional limitations are present.

With respect to claim 13, the claim recites “determining, via the computer, the first language via machine learning analysis of text being displayed on the computer”, which reads on a human determining a first language by reading text in the mind. No additional limitations are present.

With respect to claim 14, the claim recites “wherein the input text is received via a selection of a portion of text that is displayed on a screen of the computer”, which reads on a human reading text in the mind. No additional limitations are present.

With respect to claim 15, the claim recites “wherein the selection of the portion of the text is made via click-and-drag of a text box over the input text on the screen”, which reads on a human reading text that is highlighted in the mind. No additional limitations are present.

With respect to claim 16, the claim recites “wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes embeddings representing the one or more spectrograms for the input text and the one or more spectrograms for the text of the target language”, which reads on a human utilizing a set of instructions or rules that include information about text and spectrograms of the first language and target language in the mind. No additional limitations are present.

With respect to claim 17, the claim recites “wherein the obtaining is performed via a first machine learning model that is trained via an autoencoder, wherein for the training the autoencoder converts the one or more spectrograms for the input text and the one or more spectrograms for text of the target language into respective tokens”, which reads on a human utilizing a set of instructions or rules that include information about text and spectrograms of the first language and target language in the mind. No additional limitations are present.
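Claim 9's cosine similarity over embeddings is a standard vector comparison. A minimal numpy sketch (names illustrative); under this reading, the syllable with the highest score would be the one presented:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_embedding_match(query: np.ndarray,
                         candidates: dict[str, np.ndarray]) -> str:
    """Pick the target-language syllable whose embedding is most similar
    to the embedding of the input-text syllable."""
    return max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))
```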
With respect to claim 18, the claim recites “wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes a combination of tokens representing textual syllables from the input text and tokens representing the one or more spectrograms for the input text”, which reads on a human utilizing a set of instructions or rules that include information about text and spectrograms of the first language and target language in the mind. No additional limitations are present.

These claims further fail to integrate the judicial exception into a practical application and fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 8-10, 12, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Prasad et al. (U.S. PG Pub No. 20230116268), hereinafter Prasad, in view of Voss et al. (U.S. PG Pub No. 20220199071), hereinafter Voss.

Regarding claims 1, 19, and 20, Prasad teaches:

(Claim 1) A method for syllable-based pronunciation assistance, the method comprising: (P0012, The method includes receiving an input text in a first script from a user. Each character of the input text is phonetically mapped with a second script corresponding to the second language. The permutations of mapping of each input character with each character of the second script are validated and the input text in the first script is transliterated into an output text in the second script.)

(Claim 19) A computer system for syllable-based pronunciation assistance, the computer system comprising: one or more processors, one or more computer-readable memories, and program instructions stored on at least one of the one or more computer-readable memories for execution by at least one of the one or more processors to cause the computer system to: (P0087, The computing device comprises at least one processor and a non-transitory, computer-readable storage medium, for example, a memory unit, for storing computer program instructions defined by modules of the transliteration engine.)

(Claim 20) A computer program product for syllable-based pronunciation assistance, the computer program product comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: (P0090, The transliteration engine comprises modules defining computer program instructions, which when executed by the hardware processor, cause the processor to transliterate input text of the first language into the output text of the second language.)

receiving, via a computer, an input text in a first language; (P0012, Receiving an input text in a first script from a user.)
receiving, via the computer, a selection of a target language that is different from the first language; (P0012, Each character of the input text is phonetically mapped with a second script corresponding to the second language.; Fig. 4a, Language selection button.)

obtaining, via the computer and from the target language, syllables with a pronunciation most closely matching a pronunciation of the input text in the first language, wherein the obtaining is based on a comparison of one or more spectrograms for the input text with one or more spectrograms for text of the target language; and (P0048, Transliteration based on grapheme to phoneme mapping and cross-lingual pronunciation mapping models. The embodiments herein transliterate text in any input language (or first language) to text comprising characters of a base language (or a second language) based on pronunciation. The system and method utilize a conventional word mapping algorithm along with the pretrained transliteration model.; P0113, The transliteration engine is configured to phonetically map each grapheme (or character) of the input text with a second script.; P0091, The encoder trains a pre-trained model with the data files and corresponding transliterated text using transfer learning. The acoustic model is pretrained on multiple datasets of the base language.)

presenting, via the computer, the obtained syllables in the target language. (P0012, Transliterated into an output text in the second script.; Fig. 4a, Transliteration output.)

Prasad does not specifically teach: obtaining, via the computer and from the target language, syllables with a pronunciation most closely matching a pronunciation of the input text in the first language, wherein the obtaining is based on a comparison of one or more spectrograms for the input text with one or more spectrograms for text of the target language; and

Voss, however, teaches: obtaining, via the computer and from the target language, syllables with a pronunciation most closely matching a pronunciation of the input text in the first language, wherein the obtaining is based on a comparison of one or more spectrograms for the input text with one or more spectrograms for text of the target language; and (P0110, Embedding vectors representing encoded audio data in accordance with certain embodiments of the invention can include grapheme probability vectors (as described above) or other representations of the audio signal, including (but not limited to) the raw signal, derived features such as MFCCs, neural network representations (such as the hidden states or memory states of an LSTM), neural embeddings derived from the network, or any other numerical embeddings.; P0112, Template matching can utilize various common mathematical distance metrics to compute the match signal. Generally, when using embedding vectors, geometric distances (e.g., cosine similarity, Euclidean distance, inner product, etc.) can be appropriate.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to compare spectrograms of input text and target language text. It would have been obvious to combine the references because comparing audio or representations of audio is a known technique that yields a predictable result of matching audio where each audio segment represents a word, character, phoneme, or grapheme. (Voss P0017, P0055).

Regarding claim 8, Prasad in view of Voss teaches claim 1.
Prasad does not specifically teach: generating embeddings from the spectrograms from the received input text, wherein the comparison of the one or more spectrograms for the input text with the one or more spectrograms for the text of the target language comprises comparing the generated embeddings for the input text with embeddings generated from the one or more spectrograms for the text of the target language.

Voss, however, teaches: generating embeddings from the spectrograms from the received input text, wherein the comparison of the one or more spectrograms for the input text with the one or more spectrograms for the text of the target language comprises comparing the generated embeddings for the input text with embeddings generated from the one or more spectrograms for the text of the target language. (P0110, Embedding vectors representing encoded audio data in accordance with certain embodiments of the invention can include grapheme probability vectors (as described above) or other representations of the audio signal, including (but not limited to) the raw signal, derived features such as MFCCs, neural network representations (such as the hidden states or memory states of an LSTM), neural embeddings derived from the network, or any other numerical embeddings.; P0112, Template matching can utilize various common mathematical distance metrics to compute the match signal. Generally, when using embedding vectors, geometric distances (e.g., cosine similarity, Euclidean distance, inner product, etc.) can be appropriate.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to compare spectrograms of input text and target language text. It would have been obvious to combine the references because comparing audio or representations of audio is a known technique that yields a predictable result of matching audio where each audio segment represents a word, character, phoneme, or grapheme. (Voss P0017, P0055).

Regarding claim 9, Prasad in view of Voss teaches claim 8. Voss further teaches: wherein the comparison of the generated embeddings for the input text with the embeddings generated from the one or more spectrograms for the text of the target language comprises performing cosine similarity calculations. (P0112, Template matching can utilize various common mathematical distance metrics to compute the match signal. Generally, when using embedding vectors, geometric distances (e.g., cosine similarity, Euclidean distance, inner product, etc.) can be appropriate.)

Regarding claim 10, Prasad in view of Voss teaches claim 1. Prasad further teaches: wherein the presenting of the obtained syllables in the target language comprises displaying the obtained syllables in the target language on a screen of the computer along with other text in the first language. (P0084, FIG. 4A-4C exemplarily illustrate a graphical representation displayed on a display unit of an electronic device, showing transliterated text suggestions on a suggestion bar interface 403.)

Regarding claim 12, Prasad in view of Voss teaches claim 1. Prasad further teaches: receiving, via the computer, an indication of the first language via a selection of the first language. (P0012, Each character of the input text is phonetically mapped with a second script corresponding to the second language.; Fig. 4a, Language selection button.)

Claims 2, 7, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, and further in view of Gunasekara et al. (U.S.
PG Pub No. 20220382999), hereinafter Gunasekara.

Regarding claim 2, Prasad in view of Voss teaches claim 1. Prasad further teaches: separating, via the computer, the input text into syllables in the first language, wherein the syllables in the first language are used to generate the one or more spectrograms for the input text. (P0113, The transliteration engine is configured to phonetically map each grapheme (or character) of the input text with a second script. According to an embodiment herein, for mapping each character of the input text with the phonemes based matching characters of the second script.)

Prasad does not specifically teach: separating, via the computer, the input text into syllables in the first language, wherein the syllables in the first language are used to generate the one or more spectrograms for the input text.

Gunasekara, however, teaches: separating, via the computer, the input text into syllables in the first language, wherein the syllables in the first language are used to generate the one or more spectrograms for the input text. (P0125, Tacotron2 comprises a recurrent sequence-to-sequence feature prediction network that accepts the character embedding of a given text as input and produces its corresponding mel-spectrogram.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to generate a spectrogram for the input text. It would have been obvious to combine the references because utilizing segments of the text input is a known technique to yield a predictable result of generating a spectrogram for the input text. (Gunasekara P0125).

Regarding claim 7, Prasad in view of Voss teaches claim 1. Prasad does not specifically teach: recording as an audio waveform the pronunciation of the input text in the first language; and generating the one or more spectrograms for the input text based on the audio waveform.

Gunasekara, however, teaches: recording as an audio waveform the pronunciation of the input text in the first language; and (P0125, Tacotron2 comprises a recurrent sequence-to-sequence feature prediction network that accepts the character embedding of a given text as input and produces its corresponding mel-spectrogram. … Tacotron2 may also employ a modified WaveNet as a vocoder to convert the mel-spectrogram into a waveform.) generating the one or more spectrograms for the input text based on the audio waveform. (P0125, Tacotron2 comprises a recurrent sequence-to-sequence feature prediction network that accepts the character embedding of a given text as input and produces its corresponding mel-spectrogram.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to generate a spectrogram for the input text. It would have been obvious to combine the references because utilizing an audio waveform is a known technique to yield a predictable result of generating a spectrogram for the input text. (Gunasekara P0125).

Regarding claim 11, Prasad in view of Voss teaches claim 1. Prasad does not specifically teach: wherein the presenting of the obtained syllables comprises playing an audio recording of the syllables in the target language.

Gunasekara, however, teaches: wherein the presenting of the obtained syllables comprises playing an audio recording of the syllables in the target language. (P0141, A transliteration of the proper noun may be generated using a transliteration engine (TLT engine) onboard the mobile device.
In some examples, the TLT engine may receive as input the proper noun in a first language, and may generate as its output a transliteration of that proper noun in a second language.; P0174, Processing module may comprise one or more of the STT engine, the TTT engine, and the TTS engine. Moreover, in some examples, the TTT engine may comprise one or more of the TC engine, the NER engine, the TLT engine, and the SEL engine.; P0158, Processor module may also convert translated text data into output speech data using a TTS engine onboard the mobile device.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to play audio in the target language. It would have been obvious to combine the references because playing audio allows users to understand the words. (Gunasekara P0064).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, in view of Gunasekara, and further in view of Rosenberg et al. (U.S. PG Pub No. 20240029715), hereinafter Rosenberg.

Regarding claim 3, Prasad in view of Voss and further in view of Gunasekara teaches claim 2. Prasad in view of Voss and further in view of Gunasekara does not specifically teach: calculating, via the computer, a time to pronounce each of the syllables in the first language; and dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time.

Rosenberg, however, teaches: calculating, via the computer, a time to pronounce each of the syllables in the first language; and (P0034, The duration predictor receives the initial textual representation from the embedding extractor and predicts a corresponding text chunk duration (i.e., word, word-piece, phoneme, and/or grapheme duration). The text chunk duration indicates a duration the corresponding text chunk would be spoken if a human (or text-to-speech system) spoke the unspoken textual utterance.)

dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time. (P0040, The speech encoder receives, as input, each transcribed non-synthetic speech utterance as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames of FIG. 1) and generates, as output, for each of a plurality of output steps, an encoded audio representation (es) that corresponds to the transcribed non-synthetic speech utterance at the corresponding output step. In parallel, the alignment model receives the transcription corresponding to the same non-synthetic speech utterance and generates an alignment output according to Equation 1.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to calculate a time for pronunciation and divide the spectrogram based on that time. It would have been obvious to combine the references because calculating a duration of time is a known technique to yield a predictable result of mapping a sequence of text chunks to speech frames directly. (Rosenberg P0039).

Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, in view of Gunasekara, and further in view of Netzer (U.S. PG Pub No. 20220013120).

Regarding claim 4, Prasad, in view of Voss, and further in view of Gunasekara teaches claim 2.
Prasad in view of Voss and further in view of Gunasekara does not specifically teach: identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text; and dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the identified points of zero amplitude.

Netzer, however, teaches: identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text; and (P0053, By analyzing the representation of the analog speech signal, the client of FIG. 1 may be configured to distinguish between segments of the speech signal, wherein speech segments (SS) (representing the words ELIAV and DANIELLE respectively) are examples of a speech segment, while other segments are an example of a silence segment. In some exemplary embodiments, SS303 is attributed to speaking pause, end of speech, silence, or the like due to lack of speech signal or a substantially low speech signal amplitude. … the segment represents speech elements selected from a group comprising of: a syllable; a plurality of syllables; a word; a fraction of a word; a plurality of words; and a combination thereof.)

dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the identified points of zero amplitude. (P0063, A spectrogram of a segment may be produced.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to identify points of zero amplitude and divide the spectrogram based on the identified points. It would have been obvious to combine the references because dividing the spectrogram by points of zero amplitude can account for speaking pauses, end of speech, silence, or the like due to lack of speech signal or a substantially low speech signal amplitude. (Netzer P0053).

Regarding claim 5, Prasad, in view of Voss, in view of Gunasekara, and further in view of Netzer teaches claim 4. Netzer further teaches: determining a respective time value at the identified points of zero amplitude, wherein the dividing is based on the determined respective time value. (P0061, An extracted segment has a time duration of T=231.9 msec. Since it was sampled at a rate of f=16 KHz, the total number of samples (values) in this embodiment will be T*f=3711. In some exemplary embodiments, the 1st time frame (TF) 410 may comprise 512 samples, as well as the 2nd TF 420, the 3rd TF (not shown), and so on until the last TF of the segment.)

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, in view of Gunasekara, in view of Rosenberg, and further in view of Netzer.

Regarding claim 6, Prasad, in view of Voss, and further in view of Gunasekara teaches claim 2. Prasad, in view of Voss, and further in view of Gunasekara does not specifically teach: calculating, via the computer, a time to pronounce each of the syllables in the first language; identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text; and dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time and on the identified points of zero amplitude.
Rosenberg, however, teaches: calculating, via the computer, a time to pronounce each of the syllables in the first language; (P0034, The duration predictor receives the initial textual representation from the embedding extractor and predicts a corresponding text chunk duration (i.e., word, word-piece, phoneme, and/or grapheme duration). The text chunk duration indicates a duration the corresponding text chunk would be spoken if a human (or text-to-speech system) spoke the unspoken textual utterance.)

dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time and on the identified points of zero amplitude. (P0040, The speech encoder receives, as input, each transcribed non-synthetic speech utterance as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames of FIG. 1) and generates, as output, for each of a plurality of output steps, an encoded audio representation (es) that corresponds to the transcribed non-synthetic speech utterance at the corresponding output step. In parallel, the alignment model receives the transcription corresponding to the same non-synthetic speech utterance and generates an alignment output according to Equation 1.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to calculate a time for pronunciation and divide the spectrogram based on that time. It would have been obvious to combine the references because calculating a duration of time is a known technique to yield a predictable result of mapping a sequence of text chunks to speech frames directly. (Rosenberg P0039).

Prasad, in view of Voss, in view of Gunasekara, and further in view of Rosenberg does not specifically teach: identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text; and dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time and on the identified points of zero amplitude.

Netzer, however, teaches: identifying, via the computer, points of zero amplitude in an audio waveform generated via pronouncing the input text; and (P0053, By analyzing the representation of the analog speech signal 300, the client of FIG. 1 may be configured to distinguish between segments of the speech signal, wherein speech segments (SS) (representing the words ELIAV and DANIELLE respectively) are examples of a speech segment, while other segments are an example of a silence segment. In some exemplary embodiments, SS303 is attributed to speaking pause, end of speech, silence, or the like due to lack of speech signal or a substantially low speech signal amplitude. … the segment represents speech elements selected from a group comprising of: a syllable; a plurality of syllables; a word; a fraction of a word; a plurality of words; and a combination thereof.)

dividing, via the computer, an input text spectrogram for the input text into a spectrogram per syllable of the input text, wherein the dividing is based on the calculated time and on the identified points of zero amplitude. (P0063, A spectrogram of a segment may be produced.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to identify points of zero amplitude and divide the spectrogram based on the identified points.
It would have been obvious to combine the references because dividing the spectrogram by points of zero amplitude can account for speaking pauses, end of speech, silence, or the like due to lack of speech signal or a substantially low speech signal amplitude. (Netzer P0053).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, and further in view of Sites et al. (U.S. PG Pub No. 20100312545), hereinafter Sites.

Regarding claim 13, Prasad in view of Voss teaches claim 1. Prasad in view of Voss does not specifically teach: determining, via the computer, the first language via machine learning analysis of text being displayed on the computer.

Sites, however, teaches: determining, via the computer, the first language via machine learning analysis of text being displayed on the computer. (P0034, The detection system can be used to detect writing systems and languages represented in text by performing operations.; P0031, The n-grams, associated probability estimates, and respective counts can be stored in a classification model for use by a classifier, e.g., a Bayesian classifier, that detects languages in input text.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to determine the language of text. It would have been obvious to combine the references because a particular writing system can be used to represent more than one language, and identification of the language is important before processing the text. (Sites P0002).

Claims 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, and further in view of Wyardet (U.S. Patent No. RE38883).

Regarding claim 14, Prasad in view of Voss teaches claim 1. Prasad in view of Voss does not specifically teach: wherein the input text is received via a selection of a portion of text that is displayed on a screen of the computer.

Wyardet, however, teaches: wherein the input text is received via a selection of a portion of text that is displayed on a screen of the computer. (Col. 1, Lines 51-55, A user may select text with a mouse by positioning the point at the beginning of the selection, depressing a predefined mouse button, dragging the insertion point to the end of the selection while holding down the mouse button and then releasing the mouse button.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to receive text via a selection of text. It would have been obvious to combine the references because selecting text displayed on a screen is a known technique that yields a predictable result of receiving text. (Sites P0002).

Regarding claim 15, Prasad in view of Voss and further in view of Wyardet teaches claim 14. Wyardet further teaches: wherein the selection of the portion of the text is made via click-and-drag of a text box over the input text on the screen. (Col. 1, Lines 51-55, A user may select text with a mouse by positioning the point at the beginning of the selection, depressing a predefined mouse button, dragging the insertion point to the end of the selection while holding down the mouse button and then releasing the mouse button.)

Claims 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Prasad, in view of Voss, and further in view of Chen et al. (U.S. PG Pub No. 20220122581), hereinafter Chen.

Regarding claim 16, Prasad in view of Voss teaches claim 1.
Prasad further teaches: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes embeddings representing the one or more spectrograms for the input text and the one or more spectrograms for the text of the target language. (P0077, The encoder trains a pre-trained model with the data files and corresponding transliterated text using transfer learning. The acoustic model is pre-trained on multiple datasets of the base language. The decoder performs decoding, for example, of an output text of the trained model to generate text comprising characters of the second language.)

Prasad in view of Voss does not specifically teach: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes embeddings representing the one or more spectrograms for the input text and the one or more spectrograms for the text of the target language.

Chen, however, teaches: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes embeddings representing the one or more spectrograms for the input text and the one or more spectrograms for the text of the target language. (P0031, The synthesized speech representations may include mel-frequency spectrogram frames for training the ASR model.; P0038, The training process generates a spectrogram consistency loss.; P0007, Transliterating the input text sequence in the first language into a native script. … Generating, using a variational autoencoder, a native audio encoder embedding for the native synthesized speech representation; generating, using the variational autoencoder, a cross-lingual audio encoder embedding for the cross-lingual synthesized speech representation; determining an adversarial loss term conditioned on the first language based on the native and cross-lingual audio encoder embeddings.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to train a model with embeddings representing spectrograms for the input text. It would have been obvious to combine the references because the use of embeddings representing spectrograms in training a model is a known technique to yield a predictable result of training an acoustic model. (Chen P0027).

Regarding claim 17, Prasad in view of Voss teaches claim 1. Prasad further teaches: wherein the obtaining is performed via a first machine learning model that is trained via an autoencoder, wherein for the training the autoencoder converts the one or more spectrograms for the input text and the one or more spectrograms for text of the target language into respective tokens.

Prasad in view of Voss does not specifically teach: wherein the obtaining is performed via a first machine learning model that is trained via an autoencoder, wherein for the training the autoencoder converts the one or more spectrograms for the input text and the one or more spectrograms for text of the target language into respective tokens.
Chen, however, teaches: wherein the obtaining is performed via a first machine learning model that is trained via an autoencoder, wherein for the training the autoencoder converts the one or more spectrograms for the input text and the one or more spectrograms for text of the target language into respective tokens. (P0031, The synthesized speech representations may include mel-frequency spectrogram frames for training the ASR model.; P0038, The training process generates a spectrogram consistency loss.; P0007, Transliterating the input text sequence in the first language into a native script. … Generating, using a variational autoencoder, a native audio encoder embedding for the native synthesized speech representation; generating, using the variational autoencoder, a cross-lingual audio encoder embedding for the cross-lingual synthesized speech representation; determining an adversarial loss term conditioned on the first language based on the native and cross-lingual audio encoder embeddings.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to train a model with embeddings representing spectrograms for the input text. It would have been obvious to combine the references because the use of embeddings representing spectrograms in training a model is a known technique to yield a predictable result of training an acoustic model. (Chen P0027).

Regarding claim 18, Prasad in view of Voss teaches claim 1. Prasad further teaches: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes a combination of tokens representing textual syllables from the input text and tokens representing the one or more spectrograms for the input text. (P0077, The encoder trains a pre-trained model with the data files and corresponding transliterated text using transfer learning. The acoustic model is pre-trained on multiple datasets of the base language. The decoder performs decoding, for example, of an output text of the trained model to generate text comprising characters of the second language.)

Prasad in view of Voss does not specifically teach: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes a combination of tokens representing textual syllables from the input text and tokens representing the one or more spectrograms for the input text.

Chen, however, teaches: wherein the obtaining is performed via a first machine learning model that is trained via a second machine learning model, wherein for the training the second machine learning model analyzes a combination of tokens representing textual syllables from the input text and tokens representing the one or more spectrograms for the input text. (P0031, The synthesized speech representations may include mel-frequency spectrogram frames for training the ASR model.; P0038, The training process generates a spectrogram consistency loss.; P0007, Transliterating the input text sequence in the first language into a native script.
… Generating, using a variational autoencoder, a native audio encoder embedding for the native synthesized speech representation; generating, using the variational autoencoder, a cross-lingual audio encoder embedding for the cross-lingual synthesized speech representation; determining an adversarial loss term conditioned on the first language based on the native and cross-lingual audio encoder embeddings.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to train a model with embeddings representing spectrograms for the input text. It would have been obvious to combine the references because the use of embeddings representing spectrograms in training a model is a known technique to yield a predictable result of training an acoustic model. (Chen P0027).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL WONSUK CHUNG whose telephone number is (571)272-1345. The examiner can normally be reached Monday - Friday (7am-4pm) [PT].

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DANIEL W CHUNG/
Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659
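The claim 13 rejection leans on Sites' character n-gram Bayesian classifier for language identification. A toy sketch of that general idea, assuming per-language n-gram count models and add-one smoothing; all names are illustrative and nothing here is taken from Sites' actual implementation:

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts, the features n-gram language detectors use."""
    t = f" {text.lower()} "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def detect_language(text: str, models: dict[str, Counter]) -> str:
    """Score the text against each language's n-gram counts with a
    naive-Bayes-style log-likelihood and return the best language."""
    grams = char_ngrams(text)
    def score(lang: str) -> float:
        counts = models[lang]
        total = sum(counts.values()) + len(counts) + 1  # add-one smoothing
        return sum(c * math.log((counts[g] + 1) / total) for g, c in grams.items())
    return max(models, key=score)
```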

Prosecution Timeline

Sep 28, 2022: Application Filed
Jan 06, 2026: Non-Final Rejection (§101, §103)
Mar 31, 2026: Interview Requested
Apr 07, 2026: Applicant Interview (Telephonic)
Apr 08, 2026: Examiner Interview Summary

Precedent Cases

Applications on similar technology granted by this examiner

Patent 12579471
DATA AUGMENTATION AND BATCH BALANCING METHODS TO ENHANCE NEGATION AND FAIRNESS
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12493892
METHOD AND SYSTEM FOR EXTRACTING CONTEXTUAL PRODUCT FEATURE MODEL FROM REQUIREMENTS SPECIFICATION DOCUMENTS
Granted Dec 09, 2025 (2y 5m to grant)
Patent 12400078
INTERPRETABLE EMBEDDINGS
Granted Aug 26, 2025 (2y 5m to grant)
Patent 12387000
PRIVACY-PRESERVING AVATAR VOICE TRANSMISSION
Granted Aug 12, 2025 (2y 5m to grant)
Patent 12380875
SPEECH SYNTHESIS WITH FOREIGN FRAGMENTS
Granted Aug 05, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 54%
With Interview: 92% (+37.5%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 44 resolved cases by this examiner. Grant probability derived from career allow rate.

Free tier: 3 strategy analyses per month