Prosecution Insights
Last updated: April 19, 2026
Application No. 18/113,671

METHOD AND SYSTEM FOR APPLYING SYNTHETIC SPEECH TO SPEAKER IMAGE

Status: Final Rejection (§103)
Filed: Feb 24, 2023
Examiner: MASTERS, KRISTEN MICHELLE
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Neosapience Inc.
OA Round: 2 (Final)

Grant Probability: 62% (Moderate)
OA Rounds: 3-4
To Grant: 3y 2m
With Interview: 87%

Examiner Intelligence

Career Allow Rate: 62% (25 granted / 40 resolved; +0.5% vs TC avg)
Interview Lift: +24.7% for resolved cases with interview (strong)
Typical Timeline: 3y 2m average prosecution; 36 applications currently pending
Career History: 76 total applications across all art units
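
To make the arithmetic behind these figures explicit, the short sketch below recomputes the career allow rate from the granted/resolved counts and applies the interview lift as additive percentage points, an assumption consistent with the displayed values (62% plus 24.7 points is roughly 87%). The variable names and the additive-lift assumption are illustrative, not part of any published methodology.

```python
# Illustrative arithmetic only; the names and the additive-lift assumption are ours,
# not a published methodology.
granted, resolved = 25, 40
career_allow_rate = granted / resolved                 # 0.625 -> shown as 62%

interview_lift_points = 24.7                           # dashboard value, in percentage points
with_interview = career_allow_rate * 100 + interview_lift_points

print(f"Career allow rate: {career_allow_rate:.0%}")   # 62%
print(f"With interview:    {with_interview:.0f}%")     # ~87%
```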

Statute-Specific Performance

§101: 35.2% (-4.8% vs TC avg)
§103: 46.9% (+6.9% vs TC avg)
§102: 8.0% (-32.0% vs TC avg)
§112: 7.1% (-32.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 40 resolved cases.

Office Action

§103
Detailed Action

This communication is in response to the Amendments and Arguments filed on 6/24/2025. Claims 1-10 are pending and have been examined. Hence, this Action has been made FINAL. Apparent priority: 8/27/2020. Any previous objection/rejection not mentioned in this Office Action has been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 4/25/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Amendment

The Applicants have amended the independent claims to include “text, wherein the voice data comprises voice style characteristics or other voice information and outputting the voice data comprises outputting an embedding vector representing at least one voice style characteristic of the voice style characteristics acquired from the voice data;” “data, wherein the synthesis voice reflects one or more of the other voice information or at least one voice style characteristic of the voice style characteristics, and wherein generating the synthesis voice includes providing the embedding vector to a decoder of the artificial neural network text-to-speech synthesis model;” “the artificial neural network text-to-speech synthesis model further includes an attention module and an artificial neural network duration prediction model trained to predict a duration of each of the plurality of phonemes; and the generating the timing information for each of the plurality of phonemes includes inputting an embedding for each of the plurality of phonemes from the output voice data to the artificial neural network duration prediction model through the attention module to predict a number of frames for each of the plurality of phonemes as the timing information.”

Regarding the 35 U.S.C. § 103 rejection, the Applicants' arguments and amendments overcome the 35 U.S.C. § 103 rejection. Hence, new grounds of rejection have been made in view of Wang (US 8224652 B2), in view of KOLLURU (US 20150052084 A1), and further in view of Cosatto (US 6112177 A).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Wang (US Patent Number US 8224652 B2), in view of KOLLURU (US Patent Number US 20150052084 A1) and further in view of Cosatto (US Patent Number US 6112177 A). Regarding Claim 1, Wang teaches 1. A method for applying synthesis voice to a speaker image, the method being performed by one or more processors and comprising: receiving an input text; (see Wang (2:35-53) “(14) More specifically, given the aforementioned types of inputs for training, the synchronized motion and speech of those inputs is probabilistically modeled at various speech levels, including, for example, sentences (or multiple sentences), phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. Note that the duration of particular sentences, phrases, words, phonemes, etc. can also be modeled in order to account for differences in speaking speeds (e.g., a fast speaker or a slow speaker). Each body part, e.g., mouth (i.e., lip sync and other mouth motions), nose, eyes, eyebrows, ears, face, head, fingers, hands, arms, legs, feet, torso, spine, skeletal elements of a body, etc., can be modeled using either the same or separate inputs to create one or more probabilistic models. Further, while these speech/motion models can be updated or changed as often as desired, once trained, these models can be stored to a computer-readable medium for later use in synthesizing animations based on new text or speech inputs.”) inputting the input text to an artificial neural network text-to-speech synthesis model and outputting voice data for the input text; (see Wang (6:3-26) “(22) In any case, as described in Section 2.4, once the probabilistic trainable models 115 have been learned from the training data, the Animation Synthesizer uses these models to generate new animation sequences for avatars or robots given an arbitrary text or speech input. Specifically, in one embodiment, a text/speech input module 120 receives an input of text 125, either typed or read from an electronic file or database. In additional embodiments, the text/speech input module 120 receives a speech input that is either recorded 130, or provided as a live input 135 via a microphone or other audio input device. Note that this speech input can also be provided as the audio portion of a typical video recording or live feed. (23) In the case where the text/speech input module 120 receives a text 125, the text/speech input module passes the input to a speech synthesis module 145 that uses conventional text-to-speech techniques to generate a speech signal from the text input. Such techniques are well known to those skilled in the art, and will not be described herein. In various embodiments, an emotion input module 140 allows the user to associate particular emotional characteristics or an emotional context with some or all of the text 125. For example, the user may with the speech synthesis module 145 to generate speech from the text 125 input that sounds happy, sad, angry, confused, etc.”) generating a synthesis voice corresponding to the output voice data; (see Wang (10:34-41) “(53) As noted above, in various embodiments, the user is provided with a user interface that allows the user to associate particular emotional characteristics or an emotional context with some or all of the text or speech input. 
For example, by selecting or assigning particular emotional characteristics with particular portions of a text input (or with a text input generated from a speech input) speech will be generated that sounds happy, sad, angry, confused, excited, etc.”) Wang does not specifically teach wherein the voice data comprises voice style characteristics or other voice information and outputting the voice data comprises outputting an embedding vector representing at least one voice style characteristic of the voice style characteristics acquired from the voice data; However KOLLURU does teach this limitation (see KOLLURU [0024-0030] “In an embodiment, a system for emulating a subject is provided. The system allows a user to interact with a computer generated talking head with the subject's face and voice; [0025] said system comprising a processor, a user interface and a personality storage section, [0026] the user interface being configured to emulate the subject, by displaying a talking head which comprises the subject's face and output speech from the mouth of the face with the subject's voice, the user interface further comprising a receiver for receiving a query from the user, the emulated subject being configured to respond to the query received from the user, [0027] the processor comprising a dialogue section or system and a talking head generation section, [0028] wherein said dialogue section is configured to generate a response to a query inputted by a user from the user interface and generate a response to be outputted by the talking head, the response being generated by retrieving information from said personality storage section, said personality storage section comprising content created by or about the subject, [0029] and said talking head generation section is configured to: [0030] convert said response into a sequence of acoustic units, the talking head generation section further comprising a statistical model, said statistical model comprising a plurality of model parameters, said model parameters being derived from said personality storage section, the model parameters describing probability distributions which relate an acoustic unit to an image vector and speech vector, said image vector comprising a plurality of parameters which define the subject's face and said speech vector comprising a plurality of parameters which define the subject's voice, the talking head generation section being further configured to output a sequence of speech vectors and image vectors which are synchronized such that the head appears to talk.”) (see KOLLURU [0031] “In a further embodiment, the head outputs an expressive response such that said face and voice demonstrate expression, said processor further comprising an expression deriving section configured to determine the expression with which to output the generated response, and wherein the said model parameters describe probability distributions which relate an acoustic unit to an image vector and speech vector for an associated expression.”) (examiner notes the speech vectors are retrieved information from personality storage section which define the subjects voice(examiner interprets voice style as “personality…subjects voice”) generating a synthesis voice corresponding to the output voice data, (see KOLLURU [0078-0081] receiving a user inputted query; [0079] generating a response to a query inputted by a user from the user interface and generate a response to be outputted by the talking head, the response being generated by retrieving information 
from said personality storage section, said personality storage section comprising content created by or about the subject; and [0080] outputting said response by displaying a talking head which comprises the subject's face and output speech from the mouth of the face with the subject's voice, [0081] wherein said talking head outputs said response by:”) wherein the synthesis voice reflects one or more of the other voice information or at least one voice style characteristic of the voice style characteristics, (see KOLLURU [0082-0083] converting said response into a sequence of acoustic units using a statistical model, said statistical model comprising a plurality of model parameters, the model parameters describing probability distributions which relate an acoustic unit to an image vector and speech vector, said image vector comprising a plurality of parameters which define the subject's face and said speech vector comprising a plurality of parameters which define the subject's voice, [0083] the talking head appearing to talk by outputting a sequence of speech vectors and image vectors which are synchronised.”) (see KOLLURU [0031] “In a further embodiment, the head outputs an expressive response such that said face and voice demonstrate expression, said processor further comprising an expression deriving section configured to determine the expression with which to output the generated response, and wherein the said model parameters describe probability distributions which relate an acoustic unit to an image vector and speech vector for an associated expression.”) (examiner notes the speech vectors are retrieved information from personality storage section which define the subjects voice(examiner interprets voice style as “personality…subjects voice”) and wherein generating the synthesis voice includes providing the embedding vector to a decoder of the artificial neural network text-to-speech synthesis model; (see KOLLURU [0308] Then the word level feature vector and the full context phone level feature vector are concatenated as linguistic features to form the linguistic feature vector in S5313. The feature vector is then mapped via the NN in S5315 to the expression parameters to be used in the system of FIG. 6.”) the artificial neural network text- to-speech synthesis model further includes an attention module and an artificial neural network duration prediction model trained to predict a duration of each of the plurality of phonemes; (see KOLLURU [0315] In an embodiment, the above described head generation system uses "expression weights" to introduce expression into both the expression on the face and the speech. The expression deriving section described above with reference to FIGS. 15 to 17 can output these expression dependent weights directly. [0316] This allows expressiveness dependent HMM parameters to be represented as the linear interpolation of cluster models and the interpolation weights for each cluster HMM model are used to represent the expressiveness information. [0317] Therefore, the training data can be classified into groups and the group dependent CAT weights can be estimated using all the training sentences in this group. If N training sentences are classified into M groups (M<<N), the training data can be expressed as M points in the CAT weight space. 
[0318] In an embodiment, the NN used as transformation to map the linguistic features into the synthesis features and the CAT model which is used to construct the expressive synthesis feature space, can be trained jointly. The joint training process can be described as follows [0319] 1. Initial CAT model training to generate initial canonical model M0 and the initial CAT weight set .LAMBDA..sub.0 which is composed of the CAT weights for all the training sentences, set iteration number i=0 [0320] 2. Given the expressive linguistic features of training sentences and the CAT weight set of training sentences .LAMBDA..sub.i, the NN for iteration i, i.e. NN.sub.i is trained using least square error criterion.”) and the generating the timing information for each of the plurality of phonemes includes inputting an embedding for each of the plurality of phonemes from the output voice data to the artificial neural network duration prediction model through the attention module to predict a number of frames for each of the plurality of phonemes as the timing information. (see KOLLURU [0279] In step S1303, the images for building the AAM are selected. In an embodiment, only about 100 frames are required to build the AAM. The images are selected which allow data to be collected over a range of frames where the subject exhibits a wide range of emotions. For example, frames may be selected where the subject demonstrates different expressions such as different mouth shapes, eyes open, closed, wide open etc. In one embodiment, frames are selected which correspond to a set of common phonemes in each of the emotions to be displayed by the head.”) (see KOLLURU [0289] In one embodiment, the AAM parameters and their first time derivates are used at the input for a CAT-HMM training algorithm as previously described. (see KOLLURU [0325-0332] Through the process mentioned above, the NN and the CAT model are updated jointly which can improve performance at the synthesis stage. [0326] This joint training process is not limited to NN and CAT models. In general a transformation from linguistic feature space to synthesis feature space other than NN and the methods to construct the synthesis feature space other than CAT can be updated using joint training in the same framework.[0327] The above has described the training for the system. The text to speech synthesis will now be described with reference to FIG. 18. [0328] The synthesis system shown in FIG. 18 comprises an expressive linguistic feature extraction block 401 which extracts an expressive feature vector from the response generated by the dialogue section in an expressive linguistic space 403 as described with reference to the training. The process for extracting this vector in the synthesis stage is identical to the process described in the training stage. [0329] The expressive feature vector is then mapped via transformation block 405 to an expressive synthesis vector in an expressive synthesis space 407. The transformation block 405 has been trained as described above. [0330] The determined expressive synthesis vector is then used directly as an input to the head generating section 409. As described above, in one embodiment the transformation block 405 maps the expressive linguistic feature vector directly to CAT weights in the expressive synthesis feature space 407. [0331] In a method in accordance with an embodiment, there is no need to prepare special training data or require human interaction to assess training data. 
Further, the text to be synthesized is converted into the linguistic feature vector directly. This linguistic feature vector contains much more emotion information than a single emotion ID. The transformation block converts a linguistic feature vector into an expressive synthesis feature with same emotion. Further, this synthesis feature can be used to synthesize the speech with same emotion as in original text data.[0332] If in expressive synthesis feature space, each training sentence is related to a unique synthesis feature vector, the unique emotion information in each sentence is learned by the transformation, e.g. NN. It can provide the user with very rich emotion resources for synthesis.”) Wang and KOLLURU are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Claim 1 of Wang to incorporate the teachings of KOLLURU to include the voice data comprises voice style characteristics or other voice information and outputting the voice data comprises outputting an embedding vector representing at least one voice style characteristic of the voice style characteristics acquired from the voice data; generating a synthesis voice corresponding to the output voice data, wherein the synthesis voice reflects one or more of the other voice information or at least one voice style characteristic of the voice style characteristics, and wherein generating the synthesis voice includes providing the embedding vector to a decoder of the artificial neural network text-to-speech synthesis model; the artificial neural network text- to-speech synthesis model further includes an attention module and an artificial neural network duration prediction model trained to predict a duration of each of the plurality of phonemes; and the generating the timing information for each of the plurality of phonemes includes inputting an embedding for each of the plurality of phonemes from the output voice data to the artificial neural network duration prediction model through the attention module to predict a number of frames for each of the plurality of phonemes as the timing information. Doing so improves performance at the synthesis stage as recognized by KOLLURU in [0325]. Wang in view of KOLLURU does not specifically teach and generating information on a plurality of phonemes included in the output voice data, However, Cosatto does teach this limitation. (see Cosatto, (6:13-30) “(7) The processor timestamps the sample (step 204). That is, the processor associates a time with each sound and image sample. Timestamping is important for the processor to know which image is associated with which sound so that later, the processor can synchronize the concatenated sounds with the correct images of the talking head. Next, in step 206 the processor decomposes the image sample into a hierarchy of segments, each segment representing a part of the sample (such as a facial part). Decomposition of the image sample is advantageous because it substantially reduces the memory requirements of the algorithm when the animation sequence (FIG. 3b) is implemented…”) wherein the information on the plurality of phonemes includes timing information for each of the plurality of phonemes included in the output voice data, (see Cosatto, (8:25-40) “(20) Referring again to prong 201 of FIG. 
3a, the processor captures a sample of a multiphone (step 203), which is typically the image, movement, and associated sound of the subject speaking a designated phoneme sequence. As in the animation prong (200), this sampling process may be performed by a video or other means. After the multiphone sample is recorded, it is timestamped by the processor so that the processor will recognize which sounds are associated with which images when it later performs the TTS synthesis. A sound is "associated" with an image (or with data characterizing an image) where the same sound was uttered by the subject at the time that image was sampled. Thus, at this point, the processor has recorded image, movement, and associated acoustic information with respect to a particular phoneme sequence. The image information for a phoneme sequence constitutes a plurality of frames.”) Wang in view of KOLLURU and Cosatto are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Claim 1 of Wang and KOLLURU to incorporate the teachings of Cosatto to include and generating information on a plurality of phonemes included in the output voice data, the information on the plurality of phonemes includes timing information for each of the plurality of phonemes included in the output voice data. Doing so allows synchronization of the concatenated sounds with the correct images of the talking head as recognized by Cosatto in (6:13-30). Claims 2 and 3 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US Patent Number US 8224652 B2), in view of KOLLURU (US Patent Number US 20150052084 A1) and further in view of Cosatto (US Patent Number US 6112177 A), and further in view of Rezvani (US Patent Number US 20130124206 A1). As to Claim 2, Wang in view of KOLLURU and further in view of Cosatto teaches 2. The method according to claim 1, Furthermore, Cosatto teaches further comprising: generating one or more frames including a speaker's mouth shape corresponding to each of the plurality of phonemes based on the timing information for each of the plurality of phonemes; (see Cosatto, (8:25-40) (20) Referring again to prong 201 of FIG. 3a, the processor captures a sample of a multiphone (step 203), which is typically the image, movement, and associated sound of the subject speaking a designated phoneme sequence. As in the animation prong (200), this sampling process may be performed by a video or other means. After the multiphone sample is recorded, it is timestamped by the processor so that the processor will recognize which sounds are associated with which images when it later performs the TTS synthesis. A sound is "associated" with an image (or with data characterizing an image) where the same sound was uttered by the subject at the time that image was sampled. Thus, at this point, the processor has recorded image, movement, and associated acoustic information with respect to a particular phoneme sequence. The image information for a phoneme sequence constitutes a plurality of frames.”) (see Cosatto, (7:34-45) “(13) One solution to providing for a greater variation of mouth shapes while minimizing memory storage requirements is to use warping or morphing techniques. 
That is, the parameterization of the mouth parts can be kept quite low, and the mouth parts existing in the animation library can be warped or morphed to create new intermediate mouth shapes. For example, where the ultimate animated synthesis requires a high degree of resolution of changes to the mouth to appear realistic, an existing mouth shape in memory can be warped to generate the next, slightly different mouth shape for the sequence. For image warping, control points are defined using the existing mouth parameters for the sample image.”) Wang in view of KOLLURU and Cosatto are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Claim 1 of Wang and KOLLURU to incorporate the teachings of Cosatto to include generating one or more frames including a speaker's mouth shape corresponding to each of the plurality of phonemes based on the timing information for each of the plurality of phonemes. Doing so allows synchronization of the concatenated sounds with the correct images of the talking head as recognized by Cosatto in (6:13-30). Wang in view of KOLLURU and further in view of Cosatto do not specifically teach and dubbing the generated synthesis voice to the generated one or more frames to generate video content. However, Rezvani does teach this limitation. (see Rezvani, [0062] “The output of the TTS system is not just an audio sequence but also can include timing marker (also called timestamps) for words and phonemes. For example, consider that the word "rain" is part of the text to be converted to an audio sequence. The audio model will not only generate the word "rain", but will generate the beginning and end timestamp for this audio sequence for "rain" relative to the beginning time of the generated audio sequence. This timestamp information can be utilized for audio video synchronization which is disclosed in later paragraphs.”) (see Rezvani, [0077] “FIG. 10 illustrates an example process of synchronization between the visual and audio sequences. Text 1010 is decomposed into words, phonemes, or utterances (1020). Each word, phoneme, or utterance has a time duration 1030. Each word, phoneme, or utterance is matched with an entry in a dictionary (1040). The entry contains a time series of multipliers or VQ centers (also called VQ index). The procedure checks if the duration of word or phoneme in the audio generated by the TTS system matches the duration of the corresponding visual motion produced by the dictionary. If the durations do not match, the situation can be remedied by the beginning and ending timestamps provided by the TTS system. The ratio of these durations can be used to generate by interpolation a time series of multipliers which match the timestamps from the audio (1050). In this way synchronization between TTS generated audio sequence and dictionary generated visual sequence is achieved.”) Wang in view of KOLLURU in view of Cosatto and Rezvani are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified The method of combination of Wang and KOLLURU and Cosatto to incorporate the teachings of Rezvani to include and dubbing the generated synthesis voice to the generated one or more frames to generate video content. 
Doing so allows produce a speech that is directly related to the audio data used to generate the model as recognized by Rezvani in [0013]. As to Claim 3, Wang in view of KOLLURU and further in view of Cosatto and further in view of Rezvani teaches 3. The method according to claim 2, Furthermore, Cosatto teaches wherein the generating the one or more frames including the speaker's mouth shape corresponding to each of the plurality of phonemes includes: generating a facial landmark feature based on the information on the plurality of phonemes, (see Cosatto, (7:59-8:40) “(16) Depending on the application, a two-dimensional parameterization may be too limited to cover all transitions of the mouth shape smoothly. As such, a three or four dimensional parameterization may be taken into account. This means that one or two additional parameters will be measured from the mouth shape samples and stored in the library. The use of additional parameters results in a more refined and detailed spectrum of available mouth shape variations to be used in the synthesis. The cost of using additional parameters is the requirement of greater memory space. Nevertheless, the use of additional parameters to describe the mouth features may be necessary in some applications to stitch these mouth parts seamlessly together into a synthesized face in the ultimate sequence. (17) One solution to providing for a greater variation of mouth shapes while minimizing memory storage requirements is to use warping or morphing techniques. That is, the parameterization of the mouth parts can be kept quite low, and the mouth parts existing in the animation library can be warped or morphed to create new intermediate mouth shapes. For example, where the ultimate animated synthesis requires a high degree of resolution of changes to the mouth to appear realistic, an existing mouth shape in memory can be warped to generate the next, slightly different mouth shape for the sequence. For image warping, control points are defined using the existing mouth parameters for the sample image. (18) Alternatively, the mouth spaces may be sampled by recording a set of sample images that maps the space of one mouth parameter only, and image warping or morphing may be used to create new sample images necessary to map the space of the remaining parameters. (19) Another sampling method is to first extract all sample images from a video sequence of a person talking naturally. Then, using automatic face/facial features location, these samples are registrated so that they are normalized. The normalized samples are labeled with their respective measured parameters. Then, to reduce the total number of samples, vector quantization may be used with respect to the parameters associated with each sample. (20) It should be noted that where the sample images are derived from photographs, the resulting face is very realistic. However, caution should be exercised when synthesizing these photographs to align and scale each image precisely. If the scale of the mouth and its position is not the same in each frame, a jerky and unnatural motion will result in the animation.”) (see Cosatto, (9:3-15) “(23) Different types of rules, equations, or other parameters may be used to characterized the mouth shapes derived from the phoneme sequence samples. In some cases, extraction of simple equations to characterize the mouth movements provides for optimal efficiency. In one embodiment, specific mouth parameters (e.g., data points representing degree of lip protrusion, etc.) 
representing each multiphone sample image (step 211) are extracted. In this way, the specific mouth parameters can be linked up by the processor with the multiphones to which they correspond. The mouth parameters described in step 211 may also comprise one or more stored rules or equations which characterize the shape and/or movement of the mouth derived from the samples.”)wherein the generated facial landmark feature includes a landmark feature for the speaker's mouth shape; (see Cosatto, (9:3-15) “(23) Different types of rules, equations, or other parameters may be used to characterized the mouth shapes derived from the phoneme sequence samples. In some cases, extraction of simple equations to characterize the mouth movements provides for optimal efficiency. In one embodiment, specific mouth parameters (e.g., data points representing degree of lip protrusion, etc.) representing each multiphone sample image (step 211) are extracted. In this way, the specific mouth parameters can be linked up by the processor with the multiphones to which they correspond. The mouth parameters described in step 211 may also comprise one or more stored rules or equations which characterize the shape and/or movement of the mouth derived from the samples.”) (see Cosatto, (10:8-26) “(30) In sum, FIG. 3a describes a preferred embodiment of the sampling techniques which are used to create the animation and coarticulation libraries. These libraries can then be used in generating the actual animated talking-head sequence, which is the subject of FIG. 3b. FIG. 3b shows a flowchart which also portrays, for simplicity, two separate process sections 216 and 221. The animated sequence begins in the coarticulation process section 221. Some stimulus, such as text, is input into a memory accessible by the processor (step 223). This stimulus represents the particular data that the animated sequence will track. The stimulus may be voice, text, or other types of binary or encoded information that is amenable to interpretation by the processor as a trigger to initiate and conduct an animated sequence. As an illustration, where a computer interface uses a talking head to transmit E-mail messages to a remote party, the input stimulus is the E-mail message text created by the sender. The processor will generate a talking head which tracks, or generates speech associated with, the sender's message text.”) and generating one or more frames including the speaker's mouth shape based on the generated facial landmark feature. (see Cosatto, (7:59-8:40) “(20) Referring again to prong 201 of FIG. 3a, the processor captures a sample of a multiphone (step 203), which is typically the image, movement, and associated sound of the subject speaking a designated phoneme sequence. As in the animation prong (200), this sampling process may be performed by a video or other means. After the multiphone sample is recorded, it is timestamped by the processor so that the processor will recognize which sounds are associated with which images when it later performs the TTS synthesis. A sound is "associated" with an image (or with data characterizing an image) where the same sound was uttered by the subject at the time that image was sampled. Thus, at this point, the processor has recorded image, movement, and associated acoustic information with respect to a particular phoneme sequence. 
The image information for a phoneme sequence constitutes a plurality of frames.”) (see Cosatto, (9:3-15) “(23) Different types of rules, equations, or other parameters may be used to characterized the mouth shapes derived from the phoneme sequence samples. In some cases, extraction of simple equations to characterize the mouth movements provides for optimal efficiency. In one embodiment, specific mouth parameters (e.g., data points representing degree of lip protrusion, etc.) representing each multiphone sample image (step 211) are extracted. In this way, the specific mouth parameters can be linked up by the processor with the multiphones to which they correspond. The mouth parameters described in step 211 may also comprise one or more stored rules or equations which characterize the shape and/or movement of the mouth derived from the samples.”) Wang in view of KOLLURU and Cosatto are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Claim 1 of Wang and KOLLURU to incorporate the teachings of Cosatto to include the generating the one or more frames including the speaker's mouth shape corresponding to each of the plurality of phonemes includes: generating a facial landmark feature based on the information on the plurality of phonemes, wherein the generated facial landmark feature includes a landmark feature for the speaker's mouth shape; and generating one or more frames including the speaker's mouth shape based on the generated facial landmark feature. Doing so allows synchronization of the concatenated sounds with the correct images of the talking head as recognized by Cosatto in (6:13-30). Claims 4-8 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US Patent Number US 8224652 B2), in view of KOLLURU (US Patent Number US 20150052084 A1) and further in view of Cosatto (US Patent Number US 6112177 A), and further in view of GAO (US Patent Number US 20210020161 A1). As to Claim 4, Wang in view of KOLLURU and further in view of Cosatto teaches 4. The method according to claim 1, Wang in view of Cosatto do not teach teach wherein the generating the synthesis voice corresponding to the output voice data includes inputting the output voice data to a vocoder to generate the synthesis voice, However, Gao does teach this limitation (see Gao [0056] “Generating the second speech signal segment from the generated second feature vectors may comprise using a third trained algorithm, for example a vocoder. In an embodiment, the audio data comprises spectral data, for example spectral envelope data. The first feature vectors may further comprise one or more of: information relating to the fundamental frequency of the frame of the first speech signal, information relating to the aperiodicity of the frame of the first speech signal, and information relating to whether the frame is voiced or unvoiced. The second feature vectors may comprise the same features as the first feature vectors.”) and the generating the information on the plurality of phonemes included in the output voice data includes inputting the output voice data to an artificial neural network phoneme recognition model and outputting timing information for each of the plurality of phonemes. 
(see Gao [0057-0061] “In an embodiment, the second trained algorithm comprises a neural network, wherein generating a second feature vector at a current time step comprises: [0058] generating a weighted combination of the representational vectors; [0059] generating a vector representing the weighted combination; [0060] combining the sequence of vectors generated for the current time step and one or more previous time steps with the speaker vector to generate a combined vector; and [0061] inputting the combined vector into the neural network, the neural network outputting the second feature vector for the current time step.”) Wang in view of KOLLURU in view of Cosatto and Gao are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Wang and KOLLURU and Cosatto to incorporate the teachings of Gao to include generating the synthesis voice corresponding to the output voice data includes inputting the output voice data to a vocoder to generate the synthesis voice, and the generating the information on the plurality of phonemes included in the output voice data includes inputting the output voice data to an artificial neural network phoneme recognition model and outputting timing information for each of the plurality of phonemes. Doing so allows capturing different characteristics within a sentence and expression of a different characteristic for each word, as recognized by Gao in [0277-0278]. As to Claim 5, Wang in view of KOLLURU and further in view of Cosatto and further in view of Gao teaches 5. The method according to claim 4, Furthermore, Gao teaches wherein the inputting the output voice data to the artificial neural network phoneme recognition model and outputting the timing information for each of the plurality of phonemes includes: receiving information on a plurality of phoneme sequences of the input text; (see Gao [0044] “The text signals may be signals comprising actual text, or may alternatively comprise text related information, for example a sequence of phonemes. The text signals may further comprise additional information such as timing information for example.”) and inputting the information on the plurality of phoneme sequences and the output voice data to the artificial neural network phoneme recognition model, (see Gao [0047] “The first trained algorithm may comprise a trained neural network. [0048] In an embodiment, the second trained algorithm is a neural network based text-to-speech system.”) and outputting timing information for each of the plurality of phonemes. (see Gao [0045] The timing information may be used to align the second speech signal segment within the output second speech signal.”) (see Gao [0046] “The timing information may be used to detect a difference in duration between the first speech signal segment and the second speech signal segment. 
The timing information may additionally be used to resample the second speech signal segment to match the duration of the first speech signal segment.”)(see Gao [0049] “In an embodiment, the system is a spoken language translation system in which speaker characteristics from the input are preserved and transposed into the output speech, in a different language, on a neural network based text-to-speech system.”) (see Gao [0050] “The speaker characteristics may embody the expression and intonation of the input speech that are lost during compression to plain text. The speech synthesis system may use these characteristics to recreate the speaker, their expressions and intonations in the target language.”) Wang in view of KOLLURU in view of Cosatto and Gao are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Wang and KOLLURU and Cosatto to incorporate the teachings of Gao to include the inputting the output voice data to the artificial neural network phoneme recognition model and outputting the timing information for each of the plurality of phonemes includes: receiving information on a plurality of phoneme sequences of the input text; and inputting the information on the plurality of phoneme sequences and the output voice data to the artificial neural network phoneme recognition model, and outputting timing information for each of the plurality of phonemes. Doing so allows capturing different characteristics within a sentence and expression of a different characteristic for each word, as recognized by Gao in [0277-0278]. As to Claim 6, Wang in view of KOLLURU and further in view of Cosatto teaches 6. The method according to claim 1, Wang in view of Cosatto do not teach wherein the artificial neural network text-to- speech synthesis model includes an attention module configured to determine a length of the synthesis voice based on a length of the input text, (see Gao [0142] Taking the matrix output of size T times L, a time-wise normalisation function 403 may then be applied across time (i.e. across the frames) to obtain a vector of size L. The time-wise normalisation function may be the max function, mean function or a recurrent neural network for example. For example, where the max function is used, for each column L in the array, the value of the row T with the maximum entry is extracted as the value for the output vector.”) and the generating the information on the plurality of phonemes included in the output voice data (see Gao [0150] In an embodiment, the encoder 304 comprises a look-up table, where each phonetic unit is assigned a unique numerical integer corresponding to a row in the look-up table. The look up table comprises a 2D matrix of size V times H, where each integer corresponds to a row in the 2D matrix, where V is the total number of possible phonetic units and H is a fixed length. In an embodiment, H=128. The values in the 2D matrix may be learnt automatically during a training stage, and stored for use during implementation. The representational vector corresponding to an input phonetic unit is a vector of the values in the corresponding row. 
There is a one to one correspondence between the phonetic unit and the representational vector, thus where five phonetic units are inputted, five representational vectors are outputted, as shown in the figure.”) includes generating timing information for each of the plurality of phonemes through the attention module. (see Gao [0153] In the described example, the attention mechanism 303 uses the attention vector itself (i.e. the vector output from the attention mechanism 303 in the previous step, which is cached for use in the next step), and the memory state (i.e. the current sequence of memory vectors stored in the memory module 305, described later). The attention mechanism may however use any combination of information from itself (such as the location of the attention, i.e. the previous attention vector), the encoder contents (the encoder output), the output itself (i.e. the WORLD vectors output by the decoder in the final step), the speaker vector, the decoder (i.e. the information passed from the decoder to the memory module) and memory module for example. The use of the speaker vector by the attention mechanism could influence how quickly or slowly the attention mechanism changes its weights, in order to accommodate different speakers speaking at different speeds for example. In particular, the attention mechanism 303 may not take the attention vector itself as input”) Wang in view of KOLLURU in view of Cosatto and Gao are in the same field of endeavor of speech processing, therefore It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of combination of Wang and KOLLURU and Cosatto to incorporate the teachings of Gao to include the artificial neural network text-to- speech synthesis model includes an attention module configured to determine a length of the synthesis voice based on a length of the input text, and the generating the information on the plurality of phonemes included in the output voice data includes generating timing information for each of the plurality of phonemes through the attention module. Doing so allows capturing different characteristics within a sentence and expression of a different characteristic for each word, as recognized by Gao in [0277-0278]. As to Claim 7, Wang in view of KOLLURU and further in view of Cosatto and further in view of Gao teaches 7. The method according to claim 6, Furthermore, Gao teaches wherein the artificial neural network text-to- speech synthesis model includes an artificial neural network duration prediction model trained to predict a duration of each of the plurality of phonemes, (see Gao [0046] The timing information may be used to detect a difference in duration between the first speech signal segment and the second speech signal segment. The timing information may additionally be used to resample the second speech signal segment to match the duration of the first speech signal segment.”) (see Gao [0311] Any mismatch in timings may be compensated for by modifying the duration of the output speech segments directly (for example as described in FIG. 24 below) or by modifying the text (for example as described in FIG. 25). For example, an output segment 117 may correspond to an input segment having a start time point of 2 hours 13 minutes, and an end time point of 2 hours 13 minutes and 10 seconds. The output segment is combined with the other output segments at a start point of 2 hours 13 minutes. 
If the duration of the output segment is longer than 10 seconds, modifications may be made as described in relation to the figures below. Timing information relating to words or sentences within the speech segment may also be used to control align”) and the generating the timing information for each of the plurality of phonemes through the attention module includes inputting an embedding for each of the plurality of phonemes to the artificial neural network duration prediction model to predict a duration for each of the plurality of phonemes. (see Gao [0310] “The speech synthesis stage 103 then generates a plurality of speech segments 117 from the target language text segments 103. A speech concatenation 2301 process then uses the timing information 2311 to align each of the corresponding output speech segments 17 to produce a longer aligned speech signal 2312. For example, the start and/or end times of each input segment can be used to combine the output segments together, so that the timing is the same as the input signal. This may be helpful where the audio signal corresponds to an input video signal for example.”) (see Gao [0302] “Thus for a training example in a batch, the gradient of the above loss with respect to each of the parameters (i.e. weights and biases in the neural networks, the speaker vectors, language vectors etc.) is calculated. Gradients are not calculated for the parameters of the adversarial neural network 300 in this step. As has been described previously, a computational graph may record where elements were added, and be used to determine the gradient values. The output function of the computational graph at each time step is the loss function at the time step t. For parameters used across the time steps, an average of the gradient values for the time steps is taken.”) (see Gao [0051-0055] “In an embodiment, generating the second speech signal segment comprises: [0052] converting the second text signal into a sequence of phonetic units; [0053] converting the phonetic units into representat
Read full office action
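
For readers less familiar with the technology at issue, the sketch below is a minimal, hypothetical illustration of the pipeline recited in amended claim 1: an embedding for each phoneme is routed through an attention module to a duration prediction model, which outputs a number of frames per phoneme as the timing information used to align mouth-shape frames with the synthesized voice. All function names, shapes, and weights are assumptions made for illustration; this is not the applicant's implementation and not code from Wang, KOLLURU, Cosatto, or any other cited reference.

```python
# Hypothetical sketch of the claim-1 pipeline; all names, shapes, and weights are
# illustrative assumptions, not the applicant's or any cited reference's code.
import numpy as np

rng = np.random.default_rng(0)

def attention_module(phoneme_embeddings: np.ndarray) -> np.ndarray:
    """Toy self-attention: re-weights each phoneme embedding by its context."""
    scores = phoneme_embeddings @ phoneme_embeddings.T                  # (P, P)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ phoneme_embeddings                                 # (P, D)

def duration_prediction_model(attended: np.ndarray) -> np.ndarray:
    """Toy duration predictor: maps each attended embedding to a frame count."""
    w = rng.normal(size=attended.shape[1])              # stand-in for trained weights
    frames = np.maximum(1, np.round(np.abs(attended @ w) * 5)).astype(int)
    return frames                                       # (P,) frames per phoneme = timing information

# Voice data for an input text of 4 phonemes: 16-dim phoneme embeddings plus a
# style embedding ("embedding vector representing a voice style characteristic").
phoneme_embeddings = rng.normal(size=(4, 16))
style_embedding = rng.normal(size=16)

timing = duration_prediction_model(attention_module(phoneme_embeddings))
print("Frames per phoneme:", timing)

# Downstream (not shown): the style embedding is provided to the TTS decoder/vocoder
# to render the synthesis voice, and each phoneme's frame count selects mouth-shape
# frames so the speaker image can be dubbed with that voice.
```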

Prosecution Timeline

Feb 24, 2023: Application Filed
Mar 17, 2025: Non-Final Rejection (§103)
Jun 24, 2025: Response Filed
Sep 27, 2025: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592219: Hearing Device User Communicating With a Wireless Communication Device (granted Mar 31, 2026; 2y 5m to grant)
Patent 12548569: METHOD AND SYSTEM OF DETECTING AND IMPROVING REAL-TIME MISPRONUNCIATION OF WORDS (granted Feb 10, 2026; 2y 5m to grant)
Patent 12548564: SYSTEM AND METHOD FOR CONTROLLING A PLURALITY OF DEVICES (granted Feb 10, 2026; 2y 5m to grant)
Patent 12547894: ENTROPY-BASED ANTI-MODELING FOR MACHINE LEARNING APPLICATIONS (granted Feb 10, 2026; 2y 5m to grant)
Patent 12547840: MULTI-STAGE PROCESSING FOR LARGE LANGUAGE MODEL TO ANSWER MATH QUESTIONS MORE ACCURATELY (granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 62%
With Interview: 87% (+24.7%)
Median Time to Grant: 3y 2m
PTA Risk: Moderate

Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
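
On the PTA line: one common reading of PTA exposure is term adjustment under 35 U.S.C. § 154(b)(1)(B), which accrues a day of adjustment for each day of pendency beyond three years from the actual filing date, subject to carve-outs for RCEs, appeals, and applicant delay. The sketch below is a rough illustration using the filing date above and an assumed grant at the 3y 2m median; it ignores every carve-out and all A- and C-delay, so it is not a real PTA calculation.

```python
# Rough illustration of "B-delay" exposure (35 U.S.C. 154(b)(1)(B)): days of pendency
# beyond three years from the actual filing date. Ignores applicant delay, RCE/appeal
# carve-outs, and A-/C-delay, so this is not a real PTA calculation.
from datetime import date

filing_date = date(2023, 2, 24)                        # from the dashboard above
projected_grant = date(2026, 4, 24)                    # assumption: filing + the 3y 2m median
three_year_mark = filing_date.replace(year=filing_date.year + 3)

b_delay_days = max(0, (projected_grant - three_year_mark).days)
print(f"Projected pendency beyond 3 years: {b_delay_days} days")   # 59 days under these assumptions
```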
