Prosecution Insights
Last updated: April 19, 2026
Application No. 18/689,696

OUTPUT VOICE TRACK GENERATION

Status: Non-Final OA (§103)
Filed: Mar 06, 2024
Examiner: SIRJANI, FARIBA
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Gan Studio Inc.
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (above average; 414 granted / 547 resolved; +13.7% vs TC avg)
Interview Lift: +31.0% on resolved cases with an interview (strong)
Typical Timeline: 2y 10m average prosecution; 31 applications currently pending
Career History: 578 total applications across all art units
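
The headline numbers are consistent with simple arithmetic on the counts shown. Below is a minimal sketch, assuming the card is computed directly from the granted, resolved, and pending counts; the implied Tech Center baseline is an inference from the "+13.7% vs TC avg" figure, not a value reported by the page.

    # Assumed derivation of the examiner-intelligence figures above.
    granted, resolved, pending = 414, 547, 31

    career_allow_rate = granted / resolved          # 0.757 -> displayed as 76%
    total_applications = resolved + pending         # 578, matching "Total Applications"
    implied_tc_average = career_allow_rate - 0.137  # inferred from "+13.7% vs TC avg"

    print(f"{career_allow_rate:.1%}, {total_applications}, {implied_tc_average:.1%}")
    # 75.7%, 578, 62.0%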

Statute-Specific Performance

§101: 14.1% (-25.9% vs TC avg)
§103: 49.1% (+9.1% vs TC avg)
§102: 14.7% (-25.3% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)
Tech Center averages are estimates • Based on career data from 547 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . DETAILED ACTION Claims 1-20 are pending. Claims 1, 9, and 16 are independent. Claims have different scopes in the sense that the same limitations are shuffled. This Application was published as U.S. 20240347038. Apparent priority: 7 September 2021. 35 U.S.C. 112(f) Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “voice generation engine” in Claim 1 and dependents. These limitations are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says: For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts. Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory and possibly transducers such as microphones and loudspeakers or to a combination of software and hardware. PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim. 
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim (US 20200394998) in view of Huang (U.S. 2011/0184723). Regarding Claim 1, Kim teaches: 1. A system comprising: a microphone to receive a reference voice sample from a user; [Kim, Figure 3, the “speech signal or sample” is received and it may be from a “speech database 425” of Figure 4 or if received directly from a speaker it would be at a microphone:“[0051] … According to another embodiment, a speech recognition unit (not shown) may receive a specific speech, convert it into a character corresponding to the input speech, and provide the converted character to the speech synthesizer 110 as an input text….” However, the word “microphone” is not mentioned. “[0068] … However, the text-to-speech synthesis system 400 may further include an input/output device (I/O device; not shown). Accordingly, the text-to-speech synthesis system 400 may directly receive an input from a user, and output at least one of a text, a speech, and a video to the user.”] a processor; and [Kim, Figure 11: “[0146] … For example, at least one of the data acquiring unit 1111, the pre-processing unit 1112, the learning data selecting unit 1113, the model learning unit 1114, or the model evaluation unit 1115 may be manufactured in the form of an AI-dedicated hardware chip or a portion of a conventional general-purpose processor (e.g., a CPU or an application processor) or a graphics processor (e.g., a GPU) and thus may be mounted on various electronic apparatuses that have been described above.”] a voice generation engine coupled to the processor, [Kim, Figure 4 teaches the “voice generation engine”: [0032] FIG. 4 is a block diagram of a text-to-speech synthesis system according to an embodiment of the present disclosure.” Figures 1-3 show the output of synthesized speech from the “speech synthesizer 110” which is part of an ensemble of 3 components which together form the “voice generation engine” of the Claim.] wherein the voice generation engine is to: extract a voice characteristic information from the received reference voice sample; [Kim, Figure 3, two types of voice characteristics are being extracted from the “speech signal or sample 220/320” at the “sequential prosody feature extractor 230” and the “articulatory feature extractor 310.” “[0058] … In other embodiments, the two speech signals or samples 220 and 320 may be the same as each other. …” “[0055] According to an embodiment, the sequential prosody feature extractor 230 may receive a speech signal or sample 220 and extract a sequential prosody feature 210 from the received speech signal or sample. 
In this example, the received speech signal or sample may include speech spectrum data representing information related to the sequential prosody feature 210, and may include melody, speech of a specific speaker, etc., for example.” “[0059] The articulatory feature extractor 310 may be configured to extract an articulatory feature of the speaker from the speech data. The articulatory feature of the speaker may not only simulate the speaker's speech, but may also include at least one of various elements such as style, prosody, emotion, tone, and pitch that may form the speech….”] process the voice characteristic information based on a voice generation model to assign a weight for each voice characteristics to generate a weighted voice characteristics information, [Kim, the “prosody information or attribute values” of Kim (Figure 3, 210) teach or include the “characteristics information” of the Claim and are extracted (Figure 3, 230) and presented as “orthogonal embeddings” (meaning independent characteristics) with “tags”/ weights that determine the importance of each attribute/characteristic information and the output is generated as a weighted sum of these attributes: “[0121] The text-to-speech synthesis system 1000 may generate the sequential prosody information based on the received prosody information on at least a part of the input text, or change the sequential prosody information corresponding to the input text, that is, change the prosody information corresponding to the input text, and generate a synthesized speech corresponding to the input text in which the generated or changed sequential prosody information is reflected. According to an embodiment, the text-to-speech synthesis system 1000 may apply the prosody information (e.g., attribute values) corresponding to the respective parts of the input text input to a reference embedding vector corresponding to reference sequential prosody information. In this example, the reference embedding vector may include a plurality of embedding vectors representing predetermined sequential prosody feature information. For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc.) The text-to-speech synthesis system 1000 may separate the intrinsic elements of the reference embedding vector. For example, the text-to-speech synthesis system 1000 may acquire a plurality of unit embedding vectors orthogonal to each other based on the reference embedding vector. … In addition, the text-to-speech synthesis system 1000 may perform regularization during learning of the text-to-speech synthesis system when learning the embedding vector for the sequential prosody feature so that elements inherent in the embedding vector may be separated. This normalization may be performed through the normalizer 420 of FIG. 4. When the text-to-speech synthesis system 1000 performs machine learning by performing normalization during learning, the reference embedding vector may be trained as a sparse vector. Accordingly, the text-to-speech synthesis system 900 may accurately separate the inherent elements using the principle component analysis (PCA) from the embedding vector trained as the sparse vector. 
According to this configuration, the text-to-speech synthesis system 1000 may modify the reference embedding vector based on the attribute values in a tag provided in the speech synthesis markup language. For example, the text-to-speech synthesis system 1000 may change weights for a plurality of unit embedding vectors based on the received attribute values in the tag.”] wherein the voice generation model is trained based on a training voice characteristic information and a training text data; and [Kim, Figure 4, speech synthesizer includes a “data learning unit 450” and Figure 11, “model learning unit 1114.” “[0107] The artificial neural network-based text-to-speech synthesis system may be trained using a large database existing as a pair of a learning text and a speech signal. …” “[0063] According to an embodiment, the communication unit 405 may be configured to receive a text from an external device. In this example, the text may include a learning text to be used for training the artificial neural network text-to-speech synthesis model. …” “[0064] … The sequential prosody feature and/or the articulatory feature of the speaker extracted as described above may be transmitted to the encoder 435 and/or the decoder 440 through the data learning unit 450 to be used to train the artificial neural network text-to-speech synthesis model….”] based on weighted voice characteristic information of the reference voice sample, generate an output voice track corresponding to an input text data. [Kim, Figures 1-3 show the output of synthesized speech from the “speech synthesizer 110” based on the input “Text ‘Hello’ 120.” Figure 10, input of text 101 and prosody 102 and output of speech 1030 with the input text including the prosody information / “voice characteristics” using markup language that shows the weights to be applied to speed and style of the generated speech. “[0118] FIG. 10 is a schematic diagram of a text-to-speech synthesis system 1000 that outputs a synthesized speech by applying, to an input text, attribute values input to a tag provided in a markup language according to an embodiment of the present disclosure. In an embodiment, the text-to-speech synthesis system 1000 may correspond to the text-to-speech synthesis system 400 of FIG. 4 and/or a text-to-speech synthesis system 1100 of FIG. 11.”] A microphone is required for receiving the speech signals of Kim but a microphone is not called by name. Huang teaches: a microphone to receive a reference voice sample from a user; [Huang is directed to an early form of speech synthesis (LTS in Huang is Letter to Sound) and teaches that its interface includes a microphone: “[0063] The user interface module 232 may interact with a user via a user interface. The user interface may include a data output device (e.g., visual display, audio speakers), and one or more data input devices. The data input devices may include, but are not limited to, combinations of one or more of keypads, keyboards, mouse devices, touch screens, microphones, speech recognition packages, and any other suitable devices or other electronic/software selection methods. The user interface module 232 may facilitate the entry of one or more input letter strings 108 into the phonetic suggestion engine 102.
Further, the user interface module 232 may enable a user to designate a language-specific localized LTS module 206, a language-specific phoneme confusion table 222, one or more language-specific dictionaries 214, and/or a language-specific language frequency model 230….”] Kim and Huang pertain to speech synthesis and it would have been obvious to include the microphone of Huang in the user interface of Kim for receiving speech samples and because speech related applications of TTS and STT are the two sides of the same coin and both types of speech and text are used in the training of their models and therefore input devices for both modalities are required. Regarding Claim 2, Kim teaches: 2. The system as claimed in claim 1, wherein the voice characteristic information comprises attribute values corresponding to a plurality of voice characteristics of the reference voice sample based on which the input text data is to be converted into output voice track. [Kim, Figures 1-3 show the output of synthesized speech from the “speech synthesizer 110” based on the input “Text ‘Hello’ 120” and “sequential prosody feature 130” which teaches the “attributes corresponding to a plurality of voice characteristics of the reference voice sample” of the Claim. Figure 10, input of text 101 and prosody 102 and output of speech 1030 with the markup language showing the speed and style to be applied to the text. The “reference voice” of the Claim is taught by the “target speaker” of Kim because the TTS is synthesizing speech in the voice of the target speaker. “[0118] FIG. 10 is a schematic diagram of a text-to-speech synthesis system 1000 that outputs a synthesized speech by applying, to an input text, attribute values input to a tag provided in a markup language according to an embodiment of the present disclosure. ….” “[0066] In an embodiment, the communication unit 405 may receive, as a sequential prosody feature, the prosody information on an input text from an external device. In this example, the prosody information may include attribute values input with a tag provided in a speech synthesis markup language for respective parts (e.g., phoneme, character, syllable, word, etc.) of the input text.” “[0121] …According to an embodiment, the text-to-speech synthesis system 1000 may apply the prosody information (e.g., attribute values) corresponding to the respective parts of the input text input to a reference embedding vector corresponding to reference sequential prosody information….”] (To distinguish attributes from characteristics, define them with particularity and include dependent claims directed to examples of each. Otherwise, in some of the claims the language can be interpreted such that they are interchangeable. Finding examples even in the supporting Specification is not easy. In one place it seems that language, e.g. English, is characteristic and attributes are the pitch and duration of phonemes in the English language. ([0027] of the published Application). Such that attributes are just numeric or alphanumeric values of characteristics: pitch is a characteristic and attribute is 160 Hz, for example, which is the numeric value of pitch.) Regarding Claim 3, Kim teaches: 3. The system as claimed in claim 2, wherein the voice generation model comprises a categorized weight assigned to an attribute value of a categorized voice characteristic amongst the plurality of voice characteristics. 
[Kim, “[0121] … For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc….” Figure 10 shows speed and style as two types of attributes with values as “speed = 1.5” and “style=emphasis.” Speed and Style are “categories” of voice characteristics.] Kim does not teach that the weights are categorically assigned such that each attribute has a constant categorical weight according to its category. Huang teaches that each language has its own group of phonemes: wherein the voice generation model comprises a categorized weight assigned to an attribute value of a categorized voice characteristic amongst the plurality of voice characteristics. [Huang, Figure 3 showing the choice of languages that impacts the phonetic suggestion: “[0065] The example web page 302 may further include a desired language portion 310 that enables the user to designate the desired language for the word or phrase suggestions, thus enabling the phonetic suggestion engine 102 to implement the one or more corresponding language-specific dictionaries 214, and/or the corresponding language-specific language frequency model 230. Moreover, the example webpage 302 may also include a native language portion 312 that enables the user to select the user's native language. In turn, the phonetic suggestion engine 102 may implement the corresponding language-specific localized LTS module 206, and/or the corresponding language-specific phoneme confusion table 222….” Huang also teaches that the phonemes are weighted according to their language/category: “[0072] At block 408, the phonetic suggestion engine 102 may rank the surviving candidate phoneme sequences 122, or the scored phoneme sequences 124, via the rank component 116. … The ranking component 116 may further use the linearly weighted scores of the scored phoneme sequences 124 to rank and/or prune the scored phoneme sequences 124 and generate the ranked phoneme sequences 126.”] Kim and Huang pertain to speech synthesis and it would have been obvious to add the categorization of phonemes by language from Huang with the TTS of Kim, which mentions the possibility of several languages, in order to have a TTS that operates in different languages smoothly. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Regarding Claim 4, Kim teaches: 4. The system as claimed in claim 3, wherein the voice generation model is trained based on the training text data and a training voice sample, [Kim, Figure 11, “model learning unit 1114.” “[0063] According to an embodiment, the communication unit 405 may be configured to receive a text from an external device. In this example, the text may include a learning text to be used for training the artificial neural network text-to-speech synthesis model.
…” “[0064] … The sequential prosody feature and/or the articulatory feature of the speaker extracted as described above may be transmitted to the encoder 435 and/or the decoder 440 through the data learning unit 450 to be used to train the artificial neural network text-to-speech synthesis model….”] wherein the training text data and training voice sample is obtained from a sample data repository. [Kim, Figure 4, speech synthesizer includes a “data learning unit 450.” “[0107] The artificial neural network-based text-to-speech synthesis system may be trained using a large database existing as a pair of a learning text and a speech signal. …”] Regarding Claim 5, Kim teaches: 5. The system as claimed in claim 3, wherein to process the voice characteristic information, the voice generation engine is to: derive the attribute value pertaining to one category of voice characteristic from the reference voice sample; [Kim, Figure 3 the attributes/characteristics are extracted from the “speech signal or sample 220/320” by the “sequential prosody feature extractor 230” and “articulatory feature extractor 310.”] compare the derived attribute value with a value linked with the categorized weight assigned to the attribute value of categorized voice characteristic; and on determining the derived attribute value to match with the value linked with the categorized weight, assign the categorized weight as the weight for the voice characteristic. Kim does not teach categorization of phonemes according to language because it only mentions but does not delve into multi-lingual TTS. Huang teaches: compare the derived attribute value with a value linked with the categorized weight assigned to the attribute value of categorized voice characteristic; and [Huang, Figures 1 and 2, “Fast Matching Component 112” including the “Phonetic Distance Module 220” is performing a comparison. Matching is by comparison. See [0022] provided below.] on determining the derived attribute value to match with the value linked with the categorized weight, assign the categorized weight as the weight for the voice characteristic. [Huang, the process of pruning that leads to the final results uses the weights assigned to groups/categories of phonemes of a language. “[0057] As a result, the rank component 116 may use these scores of each scored phoneme sequences 124 to rank the sequences. In some embodiments, the rank component 116 may use linear weighting to combine the scores for each scored phoneme sequence 124….” “[0022] The phonetic suggestion engine 102 may use the fast matching component 112 to identify candidate phoneme sequences 122 from a pool of phoneme sequences that may match the query phoneme sequence 120. The pool of phoneme sequences may be from a standardized language reference resource, such as a dictionary. In some embodiments, the fast matching component 112 may identify the candidate phoneme sequences 122 by applying one or more pruning constraints. In other embodiments, the fast matching component 112 may identify the candidate phoneme sequences 122 by comparing the phonetic distance between the phonemes in the query phoneme sequence 120 and the phonemes in each of the candidate phoneme sequences 122. In further embodiments, the fast matching component 112 may use both the one or more pruning constraints and the phonetic distance comparison to identify the candidate phoneme sequences 122.”] Rationale similar to that provided for claim 3. Regarding Claim 6, Kim teaches: 6. 
The system as claimed in claim 1, wherein the training voice characteristic information is extracted from the training voice sample and comprises attribute values corresponding to a plurality of voice characteristics. [Kim, Figure 3 the attributes/characteristics are extracted from the “speech signal or sample 220/320” by the “sequential prosody feature extractor 230” and “articulatory feature extractor 310.” “[0064] … The sequential prosody feature and/or the articulatory feature of the speaker extracted as described above may be transmitted to the encoder 435 and/or the decoder 440 through the data learning unit 450 to be used to train the artificial neural network text-to-speech synthesis model….” “[0069] … For example, the sequential prosody feature extraction unit 410 may extract a sequential prosody feature from a received speech signal or sample using a speech processing method such as mel frequency cepstral (MFC). Alternatively, the sequential prosody feature may be extracted by inputting a trained prosody feature model (e.g., an artificial neural network) using a speech sample….”] Regarding Claim 7, Kim teaches: 7. The system as claimed in claim 6, wherein the plurality of voice characteristics comprises a type of phonemes present in the voice sample, number of phonemes, duration of each phoneme, pitch of each phoneme, and energy of each phoneme. [Kim teaches that the prosody information may correspond to each “phoneme” which determines the type and number of phonemes and may include volume/energy, pitch/pitch, length/duration, and additionally pause duration and style which are not called for by the Claim. “[0012] According to one embodiment of the present disclosure, in the text-to-speech synthesis method, the sequential prosody feature may include a prosody information corresponding to at least one unit of a frame, character, phoneme, syllable, or word in chronological order, and wherein the prosody information may include at least one of information on a volume of the sound, information on a pitch of the sound, information on a length of the sound, information on a pause duration of the sound, or information on a style of the sound.”] Regarding Claim 8, Kim teaches: 8. The system as claimed in claim 1, wherein the reference voice sample is a voice sample by using which user wanted to manipulate output voice track based on its voice characteristics. [Kim calls the “reference voice sample” of the Claim “target voice” and irrespective of the “source voice” the output has to come out in the “target voice.” “[0107] The artificial neural network-based text-to-speech synthesis system may be trained using a large database existing as a pair of a learning text and a speech signal. According to an embodiment, a speech synthesis apparatus may receive a text and compare an output speech signal to a target speech signal to define a loss function. 
The speech synthesis apparatus may learn the loss function through an error back-propagation algorithm and thus may obtain an artificial neural network that outputs a finally desired speech when any text is input.” “[0108] … A text-to-speech synthesis system may learn by comparing an output speech signal and a target speech signal, so as to generate, when receiving a text, an articulatory feature of the speaker and a sequential prosody feature, output speech data that corresponds to the speaker's speech reading the text in which the sequential prosody feature is reflected.” “[0005] In addition, when the difference between the pitch range of the source speaker and the pitch of the target speaker is large, it may be difficult to reflect the prosody feature of the source speaker to the target speaker. …Considering such circumstances, it may be required to pre-process the prosody feature before applying the prosody feature to the artificial neural network model in order to improve the quality of the synthesized speech in which the prosody feature is reflected.”] Regarding Claim 9, Kim teaches: 9. A method comprising: obtaining a training voice sample and a training text data; [Kim, “[0107] The artificial neural network-based text-to-speech synthesis system may be trained using a large database existing as a pair of a learning text and a speech signal…..”] extracting a training voice characteristic information from the training voice sample; [Kim, Figures 1-3 all show the prosody extraction or extracted prosody which teaches the extracting of voice characteristics and the extracted characteristics may be used for training or generation. “[0115] FIG. 9 is an exemplary view showing a network of a sequential prosody feature extraction unit 920 configured to extract a plurality of embedding vectors 930 representing the sequential prosody feature from a speech signal or sample 910 according to an embodiment of the present disclosure…..” “[0121] … Accordingly, the text-to-speech synthesis system 900 may accurately separate the inherent elements using the principle component analysis (PCA) from the embedding vector trained as the sparse vector. …”] training a voice generation model based on the training voice characteristic information, [Kim, “[0108] … A text-to-speech synthesis system may learn by comparing an output speech signal and a target speech signal, so as to generate, when receiving a text, an articulatory feature of the speaker and a sequential prosody feature, output speech data that corresponds to the speaker's speech reading the text in which the sequential prosody feature is reflected.”] wherein while training, the voice generation model is to classify voice characteristic as a categorized voice characteristic based on the type of an attribute value of the voice characteristic; and assigning a weight for the categorized voice characteristic based on the attribute value of the categorized voice characteristic. [Kim, each sub-embedding/ “attribute value of the … voice characteristic” is assigned a weight. “[0121] … In this example, the reference embedding vector may include a plurality of embedding vectors representing predetermined sequential prosody feature information. For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc.) 
… For example, the text-to-speech synthesis system 1000 may change weights for a plurality of unit embedding vectors based on the received attribute values in the tag.”] (“[0027] … Example of voice characteristics include, but may not be limited to, number of phonemes, type of phonemes present in the voice sample, duration of each phoneme, pitch of each phoneme, and energy of each phoneme. The attribute values, such as training attribute values 222, corresponding to the voice characteristics of the training voice sample 218 may include numeric or alphanumeric values representing the level or quantity of each voice characteristic. For example, English language has total of 44 phonemes and each phoneme is represented by a certain value. Each of the phoneme has certain duration, pitch and energy which may also be represented numerically and alphanumerically….” Published version of the instant Application.) Kim does not teach the categorization of prosody such as male and female or based on language, for example, English vs. Spanish. Huang teaches: wherein while training, the voice generation model is to classify voice characteristic as a categorized voice characteristic based on the type of an attribute value of the voice characteristic; and [Huang, Figure 3 categorizes the phonemes/attribute values according to language and therefore determines the voice characteristic that is based on such attribute values/phonemes. “[0044] In at least one embodiment, the phoneme confusion table 222 may be constructed based on language-specific training data. For example, in an instance where the phonetic suggestion engine 102 is intended for use by Chinese (Mandarin) speaker to obtain English word or phrase suggestions, the training data may be English phonemes as pronounced by one or more Chinese (Mandarin) speakers. In this way, phoneme confusion table 222 may enable the phonetic distance module 220 to account for speech, ethnic, and/or regional pronunciation differences. However, in other embodiments, the phoneme confusion table 222 may include phonetic distances for phonemes of multiple languages and pronounced by different language speakers.”] assigning a weight for the categorized voice characteristic based on the attribute value of the categorized voice characteristic. [Huang, see mapping for Claim 3 and [0057], [0062] and [0075] of Huang all provided above.] Rationale as provided for Claim 3. Claim 10 is a method claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale. Regarding Claim 11, Kim teaches: 11. The method as claimed in claim 9, wherein the training voice sample and training text data pertaining to different languages is obtained from a sample data repository. [Kim, “[0076] … The learning text may be written in at least one language, and may include at least one of a word, a phrase, and a sentence that can be understood by human. …” “[0090] According to an embodiment, the encoder 610 may receive a character embedding for an input text, as shown in FIG. 6. According to another embodiment, the input text may include at least one of word, phrase, or sentence used in one or more languages….”] Regarding Claim 12, Kim teaches: 12.
The method as claimed in claim 9, further comprising: obtaining a subsequent training voice sample; [Kim, Figure 3 shows the arriving of the “speech signal or sample 220/320” from the “speech database 425” of Figure 4 and the “sequential prosody feature extraction unit 230/410” and the fact that training is not done with a single sample indicate the sequential nature of receiving a number of voice samples. Figure 10 shows the arriving of the prosody information added to text which has the same effect. “[0107] The artificial neural network-based text-to-speech synthesis system may be trained using a large database existing as a pair of a learning text and a speech signal….”] extracting a subsequent training voice characteristic information from the subsequent training voice sample, wherein the subsequent training voice characteristic information comprises attribute values corresponding to voice characteristics of the subsequent training voice sample; and [Kim, Figures 3 and 4: “sequential prosody feature extraction unit 230/410.” “[0121] … According to this configuration, the text-to-speech synthesis system 1000 may modify the reference embedding vector based on the attribute values in a tag provided in the speech synthesis markup language. For example, the text-to-speech synthesis system 1000 may change weights for a plurality of unit embedding vectors based on the received attribute values in the tag.”] training the voice generation model based on the extracted subsequent voice characteristic information. [Kim, Figure 4, “data learning unit 450.” “[0064] … he sequential prosody feature and/or the articulatory feature of the speaker extracted as described above may be transmitted to the encoder 435 and/or the decoder 440 through the data learning unit 450 to be used to train the artificial neural network text-to-speech synthesis model….” Figures 7-9 showing the training: “[0110]… Through this process, the artificial neural network text-to-speech synthesis model included in the encoder 710 and the decoder 720 may be trained so that the sequential prosody feature can be more naturally reflected.” [0114]. Figure 11: ‘[0139] Also, the model learning unit 1114 may learn the data learning model through, for example, supervised learning having the learning data as an input value.”] Regarding Claim 13, Kim teaches: 13. The method as claimed in claim 12, wherein while training, on determining that the attribute values and the category of the subsequent voice characteristics does not correspond to any of the weight assigned, assigning a new weight and a new category for subsequent voice characteristic based on the attribute value of the subsequent voice characteristic. [Kim, Figure 10, input of text 101 and prosody 102 and output of speech 1030 with the markup language showing the speed and style to be applied to the text and obviously as the input prosody/attribute is changed the weights of the characteristics/sub-embeddings also change and must be updated. “[0122] In an embodiment, the text-to-speech synthesis system 1000 may be configured to modify the reference embedding vector based on the received attribute values in the tag provided in the speech synthesis markup language. For example, the text-to-speech synthesis system 1000 may resynthesize the embedding vectors corresponding to the sequential prosody feature by summing the values obtained by multiplying the weights changed according to the received attribute values by a plurality of unit embedding vectors. 
The text-to-speech synthesis system 1000 may output the embedding vector for the changed sequential prosody feature information. The text-to-speech synthesis system 1000 may input the modified embedding vector to the artificial neural network text-to-speech synthesis model, to convert the output speech data into speech data corresponding to the input text in which the information included in attribute values in a tag provided in the speech synthesis markup language is reflected.”] Regarding Claim 14, Kim teaches: 14. The method as claimed in claim 9, wherein the training voice characteristic information comprises an attribute value corresponding to a plurality of voice characteristics. [Kim, the “prosody information or attribute values” of Kim (Figure 3, 210) can include several “characteristics information” and are extracted (Figure 3, 230) and presented as “orthogonal sub-embeddings” such as length and style. Sub-embeddings of prosody/attribute are characteristics. “[0121] …. According to an embodiment, the text-to-speech synthesis system 1000 may apply the prosody information (e.g., attribute values) corresponding to the respective parts of the input text input to a reference embedding vector corresponding to reference sequential prosody information. In this example, the reference embedding vector may include a plurality of embedding vectors representing predetermined sequential prosody feature information. For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc.) ….”] Regarding Claim 15, Kim teaches: 15. The method as claimed in claim 14, wherein based on the attribute values, assigning corresponding weights to each voice characteristic amongst the plurality of voice characteristics. [Kim, the “prosody information or attribute values” of Kim include several “characteristics information”/“orthogonal sub-embeddings” and the weighted sum of characteristics/sub-embeddings forms the prosody/attribute such that the weights are based on the prosody/attribute and vice versa. “[0121] …. For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc.) ….”] [Kim: “0130] Meanwhile, at least one of the data learning unit 1110 or the data recognition unit 1120 may be implemented as a software module. When at least one of the data learning unit 1110 or the data recognition unit 1120 is implemented as a software module (or a program module including instructions), the software module may be stored in a memory or a non-transitory computer-readable medium….”] Regarding Claim 16, Kim teaches: 16. A non-transitory computer-readable medium comprising computer-readable instructions, which when executed by a processor, causes a computing device to: [Kim: “[0130] Meanwhile, at least one of the data learning unit 1110 or the data recognition unit 1120 may be implemented as a software module. 
When at least one of the data learning unit 1110 or the data recognition unit 1120 is implemented as a software module (or a program module including instructions), the software module may be stored in a memory or a non-transitory computer-readable medium….”] receive a request from a user to convert an input text data into an output voice track; [Kim, Figure 5, “receive input text S520” is an implied request to conduct TTS.] generate a predicted output voice track in a specified language based on a predefine voice characteristic information using a voice generation model; [Kim, Figure 5, “generate output speech data for input text reflecting sequential prosody feature S540.” The “voice generation model” is taught by device 400 of Figure 4 or the ensemble of components in Figure 3. ] if in case the predicted output voice track is inappropriate, [Kim, this portion pertains to the training which keeps comparing the actual output to the desired output by calculating a loss and iterating until the loss is minimized. “[0107] … According to an embodiment, a speech synthesis apparatus may receive a text and compare an output speech signal to a target speech signal to define a loss function. The speech synthesis apparatus may learn the loss function through an error back-propagation algorithm and thus may obtain an artificial neural network that outputs a finally desired speech when any text is input.”] receive a reference voice sample from the user; [Kim, Figures 2-3 showing the receiving of the “speech signal or sample 220/320.”] extracting a voice characteristic information from the reference voice sample; [Kim, Figure 5, “receive sequential prosody feature S530.” See also Figures 1-3 and 10 for receiving of the prosody information and Figures 2-3 for extraction of the prosody features by the “sequential prosody feature extractor 230.” “[0121] … Accordingly, the text-to-speech synthesis system 900 may accurately separate the inherent elements using the principle component analysis (PCA) from the embedding vector trained as the sparse vector. According to this configuration, the text-to-speech synthesis system 1000 may modify the reference embedding vector based on the attribute values in a tag provided in the speech synthesis markup language. For example, the text-to-speech synthesis system 1000 may change weights for a plurality of unit embedding vectors based on the received attribute values in the tag.”] process the voice characteristic information to assign a weight to each voice characteristics using the voice generation model to generate a weighted voice characteristic information; and [Kim, “[0121] … According to an embodiment, the text-to-speech synthesis system 1000 may apply the prosody information (e.g., attribute values) corresponding to the respective parts of the input text input to a reference embedding vector corresponding to reference sequential prosody information. In this example, the reference embedding vector may include a plurality of embedding vectors representing predetermined sequential prosody feature information. 
For example, the reference embedding vector includes a prosody feature vector over time, and each prosody feature information may be represented by a weighted sum of a plurality of sub-embedding vectors orthogonal to each other (for example, height, size, length, pause duration, style vector, etc.)…”] based on the weighted voice characteristic information of the reference voice sample, generate an updated output voice track corresponding to the input text data. [Kim teaches modifying and resynthesizing the embedding vectors that convey the prosody by changing the weights of the sub-embeddings/attributes according to an input tagged text where the tags determine the attribute values as shown in Figure 10. “[0122] In an embodiment, the text-to-speech synthesis system 1000 may be configured to modify the reference embedding vector based on the received attribute values in the tag provided in the speech synthesis markup language. For example, the text-to-speech synthesis system 1000 may resynthesize the embedding vectors corresponding to the sequential prosody feature by summing the values obtained by multiplying the weights changed according to the received attribute values by a plurality of unit embedding vectors. The text-to-speech synthesis system 1000 may output the embedding vector for the changed sequential prosody feature information. The text-to-speech synthesis system 1000 may input the modified embedding vector to the artificial neural network text-to-speech synthesis model, to convert the output speech data into speech data corresponding to the input text in which the information included in attribute values in a tag provided in the speech synthesis markup language is reflected.”] In Kim, The language is not specified separately but is included in the text: “[0090] … According to another embodiment, the input text may include at least one of word, phrase, or sentence used in one or more languages….” Huang teaches: generate a predicted output voice track in a specified language based on a predefine voice characteristic information using a voice generation model; [Huang, Figure 3, shows a selection of language such that the particular phonemes of that language can be used for the generation of speech. “[0003] The phonetic suggestion engine may initially use one or more letters-to-sound (LTS) databases to convert an input letter string into phonemes, or segments of sound that form meaningful contrasts between utterances. Subsequently, the phonemes may be further pruned and scored to match candidate words or phrases from a particular language dictionary. ….” “0018] The phonetic suggestion engine 102 may ultimately provide word or phrase suggestions 106 for the input letter string 108. In various embodiments, the phonetic suggestion engine 102 may include one or more updateable language-specific components (e.g., dictionaries, letter-to-sound converters, letter-to-sound correlation databases, and/or the like) that are specific to different languages. Thus, depending on its language configuration, the phonetic suggestion engine 102 may provide word or phrase suggestions in different languages for the same input letter string 108. …’] Kim and Huang pertain to speech synthesis and it would have been obvious to add the selection of a particular language from Huang with the TTS of Kim which mentions the possibility of several languages in order to permit the TTS to access the particular phonemes of the particular language directly. 
This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Claim 17 is a computer program product system claim with limitations corresponding to the limitations of method Claim 4 and is rejected under similar rationale. Claim 18 is a computer program product system claim with limitations corresponding to the limitations of method Claim 4 and is rejected under similar rationale. (Claim 4 has two limitations and Claims 17 and 18 each correspond to a different limitation.) Claim 19 is a computer program product system claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale. Claim 20 is a computer program product system claim with limitations corresponding to the limitations of method Claims 7 and 10 and is rejected under similar rationale. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Several interpretations of weights can be mapped. See Kannan (U.S. 20200380215): wherein the voice generation model comprises a categorized weight assigned to an attribute value of a categorized voice characteristic amongst the plurality of voice characteristics. [Kannan categorizes the characteristics/phonemes according to language as a category: “[0043] The second stage of training modifies the globally-trained model 150 to include an adapter layer after every LSTM layer 216 of the encoder network 200. Each adapter layer has a separate set of weights for each language. The training process resumes with data from a single language, such as French. Now, after each batch of examples, gradients are computed only for the adapter weights for French, and not for any other parameters in the model. Thus, the adapter weights for French only update for a series of steps. Then, the training process continues with data from another language, such as English. Again, after each batch of examples, gradients are computed only for the adapter weights for English, and not for any other parameters in the model. Thus, the adapter weights for English only update for a series of steps. This process repeats until the adapter weights for each language are learned.” Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. 
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Fariba Sirjani/ Primary Examiner, Art Unit 2659
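
For context on the technology at issue: the mechanism the examiner maps to the claimed "weighted voice characteristics information" is Kim's representation of a prosody embedding as a weighted sum of mutually orthogonal sub-embedding vectors, with the weights changed according to attribute values carried in markup-language tags and the embedding then resynthesized (Kim [0121]-[0122], Fig. 10). The sketch below illustrates only that mechanism; the characteristic names, dimensions, tag format, and numeric values are assumptions made for illustration and are not taken from Kim, Huang, or the application.

    import numpy as np

    # Illustrative only: one orthogonal unit sub-embedding per prosody
    # "characteristic" (cf. Kim's example list: height/pitch, size/volume,
    # length/duration, pause duration, style). Dimension is arbitrary here.
    DIM = 8
    characteristics = ["pitch", "volume", "duration", "pause", "style"]
    basis = np.eye(DIM)[: len(characteristics)]   # orthogonal unit vectors

    # Hypothetical weights extracted from a reference voice sample.
    weights = {"pitch": 0.9, "volume": 0.4, "duration": 0.6, "pause": 0.2, "style": 0.5}

    def prosody_embedding(w):
        # Weighted sum of orthogonal sub-embeddings (cf. Kim [0121]).
        return sum(w[name] * vec for name, vec in zip(characteristics, basis))

    reference = prosody_embedding(weights)        # embedding before any tag is applied

    # Markup-style attribute values (cf. Kim Fig. 10: speed=1.5, style=emphasis)
    # change the corresponding weights, and the embedding is resynthesized
    # (cf. Kim [0122]). The tag format below is invented for illustration.
    tag_attributes = {"duration": 1.5, "style": 2.0}
    adjusted = dict(weights)
    for name, scale in tag_attributes.items():
        adjusted[name] = weights[name] * scale

    modified = prosody_embedding(adjusted)

    # Because the sub-embeddings are orthogonal, each weight can be read back
    # out of the embedding independently, which is the sense in which the
    # characteristics remain separable.
    recovered = {name: round(float(modified @ vec), 3)
                 for name, vec in zip(characteristics, basis)}
    print(recovered)
    # {'pitch': 0.9, 'volume': 0.4, 'duration': 0.9, 'pause': 0.2, 'style': 1.0}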

Prosecution Timeline

Mar 06, 2024: Application Filed
Jan 07, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603099
SELF-ADJUSTING ASSISTANT LLMS ENABLING ROBUST INTERACTION WITH BUSINESS LLMS
2y 5m to grant • Granted Apr 14, 2026
Patent 12579482
Schema-Guided Response Generation
2y 5m to grant • Granted Mar 17, 2026
Patent 12572737
GENERATIVE THOUGHT STARTERS
2y 5m to grant • Granted Mar 10, 2026
Patent 12537013
AUDIO-VISUAL SPEECH RECOGNITION CONTROL FOR WEARABLE DEVICES
2y 5m to grant • Granted Jan 27, 2026
Patent 12492008
Cockpit Voice Recorder Decoder
2y 5m to grant • Granted Dec 09, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 99% (+31.0%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 547 resolved cases by this examiner. Grant probability derived from career allow rate.
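
The projected figures can be reproduced from the career statistics above. Below is a minimal sketch, assuming the +31.0% interview lift is applied multiplicatively to the unrounded allow rate and capped at 100%; the exact model behind the projection is not stated on the page.

    # Assumed reading of the projections; reproduces the displayed 76% / 99%.
    base_rate = 414 / 547                      # career allow rate, ~0.757
    interview_lift = 0.310                     # "+31.0% interview lift"

    with_interview = min(base_rate * (1 + interview_lift), 1.0)
    print(f"Grant probability: {base_rate:.0%}")       # 76%
    print(f"With interview: {with_interview:.0%}")     # 99%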
