Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 5, and 13 are independent.
This Application was published as US 20250104693.
The apparent priority date is 9/26/2023.
The instant Application is directed to a method of generating synthetic speech with prosody characteristics based on a natural language input.
Response to Arguments
35 USC 101
Applicant’s arguments with respect to independent claim 1 have been fully considered and are persuasive. Specifically, using different sets of layers configured for natural language generation and prosody prediction within a single language model represents an improvement in the technical field of synthetic speech generation. Therefore, the rejection of claim 1 is withdrawn.
Applicant's arguments with regard to claims 5 and 13 have been fully considered, but they are not persuasive. MPEP 2106.05(a) states that “the claim must be evaluated to ensure the claim itself reflects the disclosed improvement in technology.” Claims 5 and 13 recite only a single model for both natural language generation and prosody prediction, which is found in the prior art and is therefore not an improvement. Therefore, the rejection of claims 5, 11-13, and 19-20 is maintained.
Claims 6 and 14 incorporate the improvement outlined for claim 1 above; therefore, the rejection of claims 6 and 14 is withdrawn.
35 USC 103
Applicant’s arguments with respect to 35 USC 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 5, 11-13, and 19-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Step 1: The independent Claims are directed to statutory categories:
Claim 1 is a method claim and directed to the process category of patentable subject matter.
Claim 5 is a method claim and directed to the process category of patentable subject matter.
Claim 13 is a device claim and directed to the machine or manufacture category of patentable subject matter.
Step 2A, Prong One: Does the Claim recite a Judicially Recognized Exception, i.e., an Abstract Idea? Are these Claims nevertheless considered Abstract as a Mathematical Concept (mathematical relationships, mathematical formulas or equations, or mathematical calculations); a Mental Process (concepts performed in the human mind, including an observation, evaluation, judgment, or opinion); or a Certain Method of Organizing Human Activity (1- fundamental economic principles or practices, including hedging, insurance, and mitigating risk; 2- commercial or legal interactions, including agreements in the form of contracts, legal obligations, advertising, marketing or sales activities or behaviors, and business relations; 3- managing personal behavior or relationships or interactions between people, including social activities, teaching, and following rules or instructions), and do they therefore fall under the judicial exception to patentable subject matter?
The rejected Claims recite Mental Processes.
Step 2A, Prong Two: Are there Additional Elements that Integrate the Judicial Exception into a Practical Application? This step involves identifying whether there are any additional elements recited in the claim beyond the judicial exception(s), and evaluating those additional elements to determine whether they integrate the exception into a practical application of the exception. “Integration into a practical application” requires an additional element or a combination of additional elements in the claim to apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the exception. The analysis uses the considerations laid out by the Supreme Court and the Federal Circuit to evaluate whether the judicial exception is integrated into a practical application.
The rejected Claims do not include additional limitations that point to integration of the abstract idea into a practical application and are therefore directed to a Mental Process.
Claim 5 is a generic automation of a mental process because a human agent can receive an input from a customer and determine an appropriate response. Prong Two of Step 2A of the 101 analysis asks whether the abstract idea is integrated into a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied; it does not describe an application.
5. A computer-implemented method comprising: receiving first input data corresponding to a user input; [Agent receives a message from a customer that their phone is broken.]
determining first prompt data including the first input data, wherein the first prompt data represents a first request for a first language model to determine an output responsive to the user input; [Agent creates a case note that the customer has a broken iPhone 7.]
processing, using the first language model, the first prompt data to generate first natural language data responsive to the user input; [Agent determines the response: “I’m very sorry, but your phone is out of warranty.”]
processing, using the first language model, the first prompt data to generate first prosody data representing at least a first voice characteristic; [Agent annotates the response: Apologetic]
using the first natural language data and the first prosody data to generate first output audio data representing first synthetic speech corresponding to the at least first voice characteristic; and causing presentation of the first output audio data. [Agent calls the customer and reads the response in an apologetic tone.]
Step 2B: Search for an Inventive Concept: The Additional Elements Do Not Amount to Significantly More: The limitations of "computer-implemented” and “language model” are well-understood, routine, and conventional machine components that are being used for their well-understood, routine, conventional, and generic functions. Additionally, these limitations are expressed parenthetically and lack nexus to the Claim language, and as such are a separable and divisible mention of a machine. Accordingly, they are not sufficient to cause the Claim to amount to significantly more than the underlying abstract idea.
The Dependent Claims do not add limitations that could help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim:
11. The computer-implemented method of claim 5, wherein the first prosody data represents a natural language description of the at least first voice characteristic. [Agent uses the natural language description: “Apologetic and a little bit sad”]
12. The computer-implemented method of claim 5, further comprising: receiving context data associated with the user input, wherein the first prompt data represents a further instruction for the first language model to generate prosody data associated with the first natural language data based on the first input data and the context data. [Agent sees in the customer’s profile that the customer is 85, and adds an additional note to speak slowly and clearly.]
The additional limitations introduced by the Dependent Claims are not sufficient as additional elements that integrate the judicial exception into a practical application or as additional elements that cause the Claim as a whole to amount to significantly more than the underlying abstract idea.
With respect to Independent Claim 13, which has limitations similar to the limitations of Claim 5, the limitations of “processor” and “memory” are expressed parenthetically and lack nexus to the Claim language, and as such are a separable and divisible mention of a machine. Accordingly, they do not include additional limitations that cause the Claim as a whole to amount to more than the underlying abstract idea.
The Dependent Claims 19-20 are similar to claims 11-12 and do not add limitations that could integrate the judicial exception into a practical application or help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 5, 11-13, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Bonar et al. (US 20240169974A1).
Regarding claim 5, Bonar discloses: 5. A computer-implemented method comprising: receiving first input data corresponding to a user input; ("Audio Input 110" Fig. 1)
determining first prompt data including the first input data, wherein the first prompt data represents a first request for a first language model to determine an output responsive to the user input; ("Large Language Model Input 218" Fig. 2)
processing, using the first language model, the first prompt data to generate first natural language data responsive to the user input; ("Text Response(s) 318" Fig. 3)
processing, using the first language model, the first prompt data to generate first prosody data representing at least a first voice characteristic; ("Style Cue(s) 322" Fig. 3)
using the first natural language data and the first prosody data to generate first output audio data representing first synthetic speech corresponding to the at least first voice characteristic; and causing presentation of the first output audio data. ("Audio Output 406" Fig. 4; see also “Translate the text response and the style cue to generate an audio output response to the user input 514” Fig. 5)
Regarding claim 11, Bonar discloses: 11. The computer-implemented method of claim 5, wherein the first prosody data represents a natural language description of the at least first voice characteristic. (“Style Cue(s) 322” (“Friendly”) Fig. 3 – “friendly” is a natural language description of the characteristic.)
Regarding claim 12, Bonar discloses: 12. The computer-implemented method of claim 5, further comprising: receiving context data associated with the user input, wherein the first prompt data represents a further instruction for the first language model to generate prosody data associated with the first natural language data based on the first input data and the context data. (See Fig. 3 – the “Conversation Prompt 308” (“Be friendly and good at conversation”) reads on context data and is included in the prompt to the LLM.)
See also: "[0031] Furthermore, the large language model 302 can be configured with a conversational profile 312 which can enable the large language model 302 to not only respond to individual inputs but rather carry on a conversation in which context can persist and change over time. Consequently, what constitutes an appropriate response can be nebulous and depend heavily on implications of previous statements, the current mood, and other indefinite factors. ... As such, the large language model 302 can appropriately respond to user inputs while accounting for conversational history, mood, and other context clues."
See also: “[0034] The word selection and phrasing of the text response 318 can be determined by the large language model 302 based on a context derived from the speech-to-text translation 306 in combination with the instructions of the conversation prompt 308 and/or the style prompt 310…”
Claim 13 is a system claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale. Additionally, the “at least one processor” and “at least one memory including instructions” limitations of the Claim are taught by Bonar. (“Processing Unit(s) 602”; “Memory 604,” Fig. 6)
Claim 19 is a system claim with limitations corresponding to the limitations of Claim 11 and is rejected under similar rationale.
Claim 20 is a system claim with limitations corresponding to the limitations of Claim 12 and is rejected under similar rationale.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 6-9, and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Bonar in view of Liu et al. (“Recurrent Neural Network for Text Classification with Multi-Task Learning”).
Regarding claim 1, Bonar discloses: 1. A computer-implemented method comprising: receiving first input data representing a first natural language input; ("Audio Input 110" Fig. 1)
receiving first context data associated with the first natural language input; ("Conversation Prompt 308" Fig. 3)
generating first prompt data including the first input data and the first context data, wherein the first prompt data represents a first request for a first language model to determine a first output responsive to the first natural language input; ("Large Language Model Input 218" Fig. 2)
processing, using a first set of layers of the first language model, the first prompt data to generate first natural language data responsive to the first natural language input, ("Text Response(s) 318" Fig. 3)
wherein the first set of layers are configured for natural language generation; (not explicitly disclosed)
processing, using a second set of layers of the first language model, the first prompt data to generate first prosody data representing at least a first synthetic voice characteristic, ("Style Cue(s) 322" Fig. 3)
wherein the second set of layers are configured for prosody prediction; (not explicitly disclosed)
processing the first natural language data and the first prosody data to generate first output audio data representing first synthetic speech corresponding to the first synthetic voice characteristic and responsive to the first natural language input; and causing presentation of the first output audio data. ("Audio Output 406" Fig. 4)
Bonar does not explicitly disclose multiple task specific layers.
Liu discloses multiple task specific layers. (“Model-II: Coupled-Layer Architecture In Model-II, we assign a LSTM layer for each task, which can use the information for the LSTM layer of the other task.”)
Liu Fig. 2 (reproduced as media_image1.png, greyscale)
Bonar and Liu are considered analogous art to the claimed invention because they disclose neural networks for NLP. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Bonar with a multi-layer shared architecture for Multi-Task learning as taught by Liu for natural language and prosody tasks. Doing so would have been beneficial because features learned from a task may be useful for other tasks. (Liu pg. 3 para 1)
Regarding claim 2, Bonar discloses: 2. The computer-implemented method of claim 1, wherein processing the first prompt data to generate the first prosody data comprises: receiving, from a first layer of the first set of layers, first embedding data representing the first prompt data; and processing, by the second set of layers, the first prompt data and the first embedding data to generate the first prosody data. (See claim 1.)
Bonar does not disclose: embedding data that is processed by shared layers.
Liu discloses: embedding data (“In all of our experiments, the word embeddings are trained using word2vec [Mikolov et al., 2013] on the Wikipedia corpus (1B words).” Pg. 4, Section 5.2)
Liu also discloses a shared architecture where information is shared between task layers. (See claim 1)
Bonar and Liu are considered analogous art to the claimed invention because they disclose deep neural networks for NLP. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Bonar with word embedding as taught by Liu. This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Regarding claim 3, Bonar does not disclose a shared layer architecture.
Liu discloses: 3. The computer-implemented method of claim 2, wherein the first embedding data is received at a second layer of the second set of layers, (“In Model-II, we assign a LSTM layer for each task, which can use the information for the LSTM layer of the other task.” Pg. 3, section Model-II)
and the method further comprises: processing, by the second layer of the second set of layers, the first prompt data and the first embedding data to generate second embedding data representing the first prompt data and the first embedding data; and (Fig. 2, (b) shows that each layer has an output)
processing the second embedding data using a third layer of the second set of layers to generate third embedding data representing the first prompt data, (Fig. 2(b) shows at least 4 layers for each task.)
wherein the first prosody data is generated based at least in part on the third embedding data. (Fig. 2(b) shows the task specific outputs are based on at least a third embedding data.)
Bonar and Liu are considered analogous art to the claimed invention because they disclose deep neural networks for NLP. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Bonar with a shared architecture for Multi-Task learning as taught by Liu for natural language and prosody tasks. Doing so would have been beneficial because features learned from a task may be useful for other tasks. (Liu pg. 3 para 1)
Note that the specification of the instant application discloses: “[0064] … data representing a latent representation (e.g., embedding data) representing synthesized speech…” Based on this, “embedding data” is interpreted to mean any latent representation of data that has been embedded.
Regarding claim 4, Bonar does not disclose a shared layer architecture.
Liu discloses: 4. The computer-implemented method of claim 1, wherein processing the first prompt data to generate the first natural language data comprises: receiving, from a first layer of the second set of layers and at a second layer of the first set of layers, first embedding data representing the first prompt data, and processing, by the first set of layers, the first prompt data and the first embedding data to generate the first natural language data. (Fig. 2, (b) discloses that embeddings from both tasks are shared with the layers for both tasks.)
See claim 3 for motivation statement.
Regarding claim 6, Bonar discloses: 6. The computer-implemented method of claim 5, wherein: processing the first prompt data to generate the first natural language data comprises processing, using a first set of layers of the first language model, the first prompt data, wherein the first set of layers are configured for natural language generation; and processing the first prompt data to generate the first prosody data comprises processing, using a second set of layers of the first language model, the first prompt data, wherein the second set of layers are configured for prosody prediction.
Bonar does not explicitly disclose multiple task specific layers.
Liu discloses multiple task specific layers. (“Model-II: Coupled-Layer Architecture In Model-II, we assign a LSTM layer for each task, which can use the information for the LSTM layer of the other task.”)
Bonar and Liu are considered analogous art to the claimed invention because they disclose neural networks for NLP. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Bonar with a multi-layer shared architecture for Multi-Task learning as taught by Liu for natural language and prosody tasks. Doing so would have been beneficial because features learned from a task may be useful for other tasks. (Liu pg. 3 para 1)
Regarding claim 7, Bonar does not disclose a shared layer architecture with embeddings.
Liu discloses: 7. The computer-implemented method of claim 6, wherein processing, using the second set of layers of the first language model, the first prompt data comprises: receiving, from a first layer of the first set of layers and at a second layer of the second set of layers, first embedding data representing the first prompt data, wherein the first prosody data is generated based at least in part by processing the first embedding data. (Fig. 2, (b) discloses that embeddings from both tasks are shared with the layers for both tasks.)
See claim 3 for motivation statement.
Regarding claim 8, Bonar does not disclose a shared layer architecture with embeddings.
Liu discloses: 8. The computer-implemented method of claim 6, wherein processing, using the first set of layers of the first language model, the first prompt data comprises: receiving, from a first layer of the second set of layers and at a second layer of the first set of layers, first embedding data representing the first prompt data, and processing, by the first set of layers, the first prompt data and the first embedding data to generate the first natural language data. (Fig. 2, (b) discloses that embeddings from both tasks are shared with the layers for both tasks.)
See claim 3 for motivation statement.
Regarding claim 9, Bonar does not disclose a shared layer architecture with embeddings.
Liu discloses: 9. The computer-implemented method of claim 7, further comprising: processing, by the second layer of the second set of layers, the first prompt data and the first embedding data to generate second embedding data representing the first prompt data and the first embedding data; and processing the second embedding data using a third layer of the second set of layers to generate third embedding data representing the first prompt data, wherein the first prosody data is generated based at least in part on the third embedding data. (Fig. 2(b) shows at least 4 layers for each task, and each layer receives the output of the previous layer as well as shared data.)
See claim 3 for motivation statement.
Claim 14 is a system claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
Claim 15 is a system claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale.
Claim 16 is a system claim with limitations corresponding to the limitations of Claim 8 and is rejected under similar rationale.
Claim 17 is a system claim with limitations corresponding to the limitations of Claim 9 and is rejected under similar rationale.
Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Bonar in view of Calapodescu et al. (US 20230215421 A1).
Regarding claim 10, Bonar discloses: 10. The computer-implemented method of claim 5, wherein the first prosody data corresponds to a spectrogram, ([0040] discloses that style cues can include a sentiment)
and generating the first output audio data comprises: processing, using a vocoder, the spectrogram. (“Vocal Synthesizer 422,” Fig. 4)
Bonar does not disclose that the prosody data is a spectrogram.
Calapodescu discloses: wherein the first prosody data corresponds to a spectrogram, and generating the first output audio data comprises: processing, using a vocoder, the spectrogram. ("[0045] Speech representations may additionally or alternatively be embodied in a speech signal that can be processed downstream by a voice synthesizer, vocoder, etc. to generate speech. An example of such speech signals is a spectrogram, such as a Mel-spectrogram." )
Bonar and Calapodescu are considered analogous art to the claimed invention because they disclose methods of TTS with prosody control. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Bonar to use a spectrogram for the prosody data. Doing so would have been beneficial to achieve fast speed with comparable voice quality. (Calapodescu [0063].) This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claim 18 is a system claim with limitations corresponding to the limitations of Claim 10 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yi et al. (“Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings”). Yi discloses a multi-task architecture for determining prosody information. (See Fig. 4)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JON CHRISTOPHER MEIS/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654