Last updated: May 29, 2026
Application No. 18/110,141
AUDIO SIGNAL GENERATION USING NEURAL NETWORKS

Final Rejection §101 §103
Filed: Feb 15, 2023
Examiner: BECKER, TYLER JUSTIN
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: Nvidia Corporation
OA Round: 2 (Final)
Interview: Optional
Examiner has a relatively high allowance rate (75%) and a +16.5% interview lift. A written response may suffice.
Based on 20 resolved cases, 2023–2026
Examiner Intelligence

BECKER, TYLER JUSTIN
Career allowance rate: 75% (15 granted / 20 resolved), above average; +13.0% vs TC avg
Interview lift: +16.5% (strong)
Average prosecution: 2y 7m; 11 currently pending
Statute-Specific Performance
§101: 1.2% (-38.8% vs TC avg)
§103: 90.4% (+50.4% vs TC avg)
§102: 3.6% (-36.4% vs TC avg)
§112: 4.8% (-35.2% vs TC avg)
Based on career data from 20 resolved cases; Tech Center averages are estimates.
Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed July 24th, 2025 has been entered. Claims 1, 5, 8, 12, 15, and 18 have been amended. Claims 1-20 are pending and have been examined. Applicant’s amendments to the claims have overcome all claim objections previously set forth.

Response to Arguments
Applicant's arguments with respect to the rejection of claims 1-5, 8-12, and 15-18 under 35 U.S.C. 101, filed July 24th, 2025, have been fully considered but they are not persuasive.
The applicant argues that the claimed invention is integrated into a practical application because it “provides significant technical advantages to processors that generate audio signals from input speech and reference speech using neural networks.” The examiner respectfully disagrees that the claimed invention is integrated into a practical application. While the applicant’s claims may be directed to a specific technological improvement, that specific improvement is not clearly represented in the language of the claims. Furthermore, the recited abstract idea is not integrated into a practical application because the improvement in technology must be provided by one or more additional elements.
According to MPEP 2106.05(a), “After the examiner has consulted the specification and determined that the disclosed invention improves technology, the claim must be evaluated to ensure the claim itself reflects the disclosed improvement in technology [emphasis added]. Intellectual Ventures I LLC v. Symantec Corp., 838 F.3d 1307, 1316, 120 USPQ2d 1353, 1359 (Fed. Cir. 2016) (patent owner argued that the claimed email filtering system improved technology by shrinking the protection gap and mooting the volume problem, but the court disagreed because the claims themselves did not have any limitations that addressed these issues). That is, the claim must include the components or steps of the invention that provide the improvement described in the specification. However, the claim itself does not need to explicitly recite the improvement described in the specification (e.g., "thereby increasing the bandwidth of the channel"). The full scope of the claim under the BRI should be considered to determine if the claim reflects an improvement in technology (e.g., the improvement described in the specification).”
MPEP 2106.05(a) also states, “It is important to note, the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements. [emphasis added] See the discussion of Diamond v. Diehr, 450 U.S. 175, 187 and 191-92, 209 USPQ 1, 10 (1981)) in subsection II, below. In addition, the improvement can be provided by the additional element(s) in combination with the recited judicial exception. See MPEP § 2106.04(d) (discussing Finjan, Inc. v. Blue Coat Sys., Inc., 879 F.3d 1299, 1303-04, 125 USPQ2d 1282, 1285-87 (Fed. Cir. 2018)). Thus, it is important for examiners to analyze the claim as a whole when determining whether the claim provides an improvement to the functioning of computers or an improvement to other technology or technical field.”
As described in the § 101 rejection below, the claimed invention is an abstract idea, capable of being performed in the human mind or with pen and paper, that has been implemented using generic computing components. While the applicant’s invention may perform computations or take steps beyond human capability, that is not reflected in the language of the claims, and thus the claimed invention does not amount to more than an abstract idea.
Furthermore, as described in the § 101 rejection below, the only additional elements in claim 1 are “a processor”, “one or more circuits”, and “one or more neural networks”.
These additional elements are the only limitations of the claim that should be evaluated to determine whether the claim provides the improvement, because the remaining limitations describe the mental process itself and therefore cannot provide the improvement.
When analyzed, the “processor”, “one or more circuits”, and “one or more neural networks” are generic computing hardware and software recited at a high level of generality. Thus, these limitations amount to mere instructions to apply the exception using generic computing components (MPEP 2106.05(f)).
Independent claims 8 and 15 are written similarly and recite no additional elements beyond generic computing components of the kind discussed above for claim 1. Therefore, these independent claims also do not integrate the abstract idea into a practical application.
Claims 2-5, 9-12, and 16-18 each depend from claim 1, 8, or 15 and recite no additional elements that amount to significantly more than the judicial exception.
As such, claims 1-5, 8-12, and 15-18 are directed to an abstract idea without significantly more.

Applicant’s arguments with respect to the rejection of claim(s) 1-20 under 35 U.S.C. 102 and 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Specification
Content of Specification
(a) TITLE OF THE INVENTION: See 37 CFR 1.72(a) and MPEP § 606. The title of the invention should be placed at the top of the first page of the specification unless the title is provided in an application data sheet. The title of the invention should be brief but technically accurate and descriptive, preferably from two to seven words. It may not contain more than 500 characters.
(b) CROSS-REFERENCES TO RELATED APPLICATIONS: See 37 CFR 1.78 and MPEP § 211 et seq.
(c) STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT: See MPEP § 310.
(d) THE NAMES OF THE PARTIES TO A JOINT RESEARCH AGREEMENT. See 37 CFR 1.71(g).
(e) INCORPORATION-BY-REFERENCE OF MATERIAL SUBMITTED ON A READ-ONLY OPTICAL DISC, AS A TEXT FILE OR AN XML FILE VIA THE PATENT ELECTRONIC SYSTEM: The specification is required to include an incorporation-by-reference of electronic documents that are to become part of the permanent United States Patent and Trademark Office records in the file of a patent application. See 37 CFR 1.77(b)(5) and MPEP § 608.05. See also the Legal Framework for Patent Electronic System posted on the USPTO website (https://www.uspto.gov/sites/default/files/documents/2019LegalFrameworkPES.pdf) and MPEP § 502.05.
(f) STATEMENT REGARDING PRIOR DISCLOSURES BY THE INVENTOR OR A JOINT INVENTOR. See 35 U.S.C. 102(b) and 37 CFR 1.77.
(g) BACKGROUND OF THE INVENTION: See MPEP § 608.01(c). The specification should set forth the Background of the Invention in two parts:
(1) Field of the Invention: A statement of the field of art to which the invention pertains. This statement may include a paraphrasing of the applicable U.S. patent classification definitions of the subject matter of the claimed invention. This item may also be titled “Technical Field.”
(2) Description of the Related Art including information disclosed under 37 CFR 1.97 and 37 CFR 1.98: A description of the related art known to the applicant and including, if applicable, references to specific related art and problems involved in the prior art which are solved by the applicant’s invention. This item may also be titled “Background Art.”
(h) BRIEF SUMMARY OF THE INVENTION: See MPEP § 608.01(d). A brief summary or general statement of the invention as set forth in 37 CFR 1.73. The summary is separate and distinct from the abstract and is directed toward the invention rather than the disclosure as a whole. The summary may point out the advantages of the invention or how it solves problems previously existent in the prior art (and preferably indicated in the Background of the Invention). In chemical cases it should point out in general terms the utility of the invention. If possible, the nature and gist of the invention or the inventive concept should be set forth. Objects of the invention should be treated briefly and only to the extent that they contribute to an understanding of the invention.
(i) BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S): See MPEP § 608.01(f). A reference to and brief description of the drawing(s) as set forth in 37 CFR 1.74.
(j) DETAILED DESCRIPTION OF THE INVENTION: See MPEP § 608.01(g). A description of the preferred embodiment(s) of the invention as required in 37 CFR 1.71. The description should be as short and specific as is necessary to describe the invention adequately and accurately. Where elements or groups of elements, compounds, and processes, which are conventional and generally widely known in the field of the invention described, and their exact nature or type is not necessary for an understanding and use of the invention by a person skilled in the art, they should not be described in detail. However, where particularly complicated subject matter is involved or where the elements, compounds, or processes may not be commonly or widely known in the field, the specification should refer to another patent or readily available publication which adequately describes the subject matter.
(k) CLAIM OR CLAIMS: See 37 CFR 1.75 and MPEP § 608.01(m). The claim or claims must commence on a separate sheet or electronic page (37 CFR 1.52(b)(3)). Where a claim sets forth a plurality of elements or steps, each element or step of the claim should be separated by a line indentation. There may be plural indentations to further segregate subcombinations or related steps. See 37 CFR 1.75 and MPEP 608.01(i) - (p).
(l) ABSTRACT OF THE DISCLOSURE: See 37 CFR 1.72 (b) and MPEP § 608.01(b). The abstract is a brief narrative of the disclosure as a whole, as concise as the disclosure permits, in a single paragraph preferably not exceeding 150 words, commencing on a separate sheet following the claims. In an international application which has entered the national stage (37 CFR 1.491(b)), the applicant need not submit an abstract commencing on a separate sheet if an abstract was published with the international application under PCT Article 21. The abstract that appears on the cover page of the pamphlet published by the International Bureau (IB) of the World Intellectual Property Organization (WIPO) is the abstract that will be used by the USPTO. See MPEP § 1893.03(e).
(m) SEQUENCE LISTING: See 37 CFR 1.821 - 1.825 and MPEP §§ 2421 - 2431. The requirement for a sequence listing applies to all sequences disclosed in a given application, whether the sequences are claimed or not. See MPEP § 2422.01.

The disclosure is objected to because of the following informalities:
	The specification is missing a brief summary of the invention as outlined in (h) of the content of the specification above.
Appropriate correction is required.

The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-5, 8-12, and 15-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, the claim recites element (a) “one or more 
The judicial exception is not integrated into a practical application. The claim recites the additional elements (b) “a processor”, (c) “one or more circuits”, and (d) “one or more neural networks”. Here, elements (b)-(d) account for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES).
The claim does not include any other additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, element (a) amounts to no more than a mental process, and elements (b)-(d) amount to no more than generic computing components. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer, and do not provide an inventive concept (Step 2B: NO).
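For readers following the eligibility analysis, the decision sequence applied in this section (Step 1, Step 2A Prong One, Step 2A Prong Two, Step 2B) can be sketched as a short function. This is a simplified, illustrative model of the MPEP 2106 flowchart, not a legal test: the `ClaimAnalysis` record and its field names are hypothetical, and the Step 1 answer for claim 1 (a processor, i.e., a machine) is assumed, since Step 1 is not discussed in this section.

```python
from dataclasses import dataclass


@dataclass
class ClaimAnalysis:
    """Hypothetical record of the answers reached at each step for one claim."""
    statutory_category: bool           # Step 1: process, machine, manufacture, or composition?
    recites_exception: bool            # Step 2A, Prong One: recites a judicial exception?
    integrated_into_application: bool  # Step 2A, Prong Two: integrated into a practical application?
    significantly_more: bool           # Step 2B: additional elements amount to significantly more?


def is_eligible(c: ClaimAnalysis) -> bool:
    """Follow a simplified MPEP 2106 flowchart; True means eligible under 35 U.S.C. 101."""
    if not c.statutory_category:
        return False                   # Step 1: NO -> not eligible
    if not c.recites_exception:
        return True                    # Step 2A, Prong One: NO -> eligible
    if c.integrated_into_application:
        return True                    # Step 2A, Prong Two: YES -> not "directed to" the exception
    return c.significantly_more        # Step 2B: YES -> eligible; NO -> not eligible


# Findings stated for claim 1 (Step 1 assumed satisfied for a claimed processor).
claim_1 = ClaimAnalysis(
    statutory_category=True,
    recites_exception=True,
    integrated_into_application=False,
    significantly_more=False,
)
print(is_eligible(claim_1))  # False
```

The ordering is the point of the sketch: Step 2B is reached only when Prong One is answered YES and Prong Two is answered NO, and a claim whose additional elements do not amount to significantly more receives a Step 2B answer of NO, which ends the analysis with ineligibility.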

Claim 2 depends on claim 1, and thus recites the limitations of claim 1, with the additional elements (e) “wherein the one or more second features comprises a timbre of the second voice signal” and (f) “the one or more 
For the reasons discussed above for claim 1, the claim 1 limitations recite abstract ideas. The additional elements of claim 2 do not preclude the steps of claim 1 from practically being performed in the mind. Elements (e) and (f) further modify the abstract idea by disclosing that the second features comprise a timbre of the second voice signal, and that the first audio features and the timbre are combined to generate the output audio. Here, elements (e) and (f) fall under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional element (g) “one or more neural networks”. Here, element (g) accounts for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).

Claim 3 depends on claim 1, and thus recites the limitations of claim 1, with the additional element (h) “wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content.”
For the reasons discussed above for claim 1, the claim 1 limitations recite abstract ideas. The additional element of claim 3 does not preclude the steps of claim 1 from practically being performed in the mind. Element (h) further modifies the abstract idea by disclosing that the first audio features include pitch, amplitude, linguistic content, or some combination of the three. Here, element (h) falls under the mental process of collecting information (MPEP 2106.04(a)(2)(III)(A)). Accordingly, the claim recites a judicial exception (Step 2A).
Claim 3 does not recite any additional elements; therefore, the claim does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception (Step 2A, Prong Two and Step 2B).
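Claim 3 names pitch and amplitude among the first audio features. As a minimal sketch of how such frame-level features are commonly computed from raw samples, the following estimates pitch with a plain autocorrelation search and amplitude as a root-mean-square (RMS) value. Neither technique is taken from the application or the cited references; the function names, the 60 to 500 Hz search range, and the synthetic 200 Hz test tone are assumptions for illustration only.

```python
import math


def rms_amplitude(frame: list[float]) -> float:
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def autocorr_pitch(frame: list[float], sample_rate: int,
                   f_min: float = 60.0, f_max: float = 500.0) -> float:
    """Estimate the fundamental frequency (Hz) of a voiced frame by picking the lag
    with the largest (unnormalized) autocorrelation within [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)                       # shortest period considered
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# A synthetic 200 Hz tone (65 ms at 16 kHz) stands in for a voiced frame of input speech.
SR = 16_000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SR) for n in range(1040)]
print(f"pitch ~ {autocorr_pitch(frame, SR):.1f} Hz, RMS ~ {rms_amplitude(frame):.3f}")
```

Run as written, it should report a pitch of about 200 Hz and an RMS of about 0.354 (0.5/√2). An unnormalized autocorrelation is the simplest choice; practical pitch trackers typically add normalization and a voiced/unvoiced decision.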

Claim 4 depends on claim 3, and thus recites the limitations of claim 3, with the additional element (i) “wherein the linguistic content is represented by one or more phoneme posteriorgrams.”
For the reasons discussed above for claim 3, the claim 3 limitations recite abstract ideas. The additional element of claim 4 does not preclude the steps of claim 3 from practically being performed in the mind. Element (i) further modifies the abstract idea by disclosing that the linguistic content is represented by posteriorgrams. Here, element (i) falls under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Accordingly, the claim recites a judicial exception (Step 2A). 
Claim 4 does not recite any additional elements; therefore, the claim does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception (Step 2A, Prong Two and Step 2B).
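Claim 4 recites that the linguistic content is represented by phoneme posteriorgrams. A posteriorgram is a matrix with one row per time frame and one column per phoneme class, where each row is a probability distribution over the phonemes. The sketch below builds one with a softmax and reads off the most probable phoneme per frame; it assumes a hypothetical five-phoneme inventory and made-up per-frame scores standing in for the output of an acoustic model, and is not drawn from the application.

```python
import math

PHONEMES = ["sil", "h", "eh", "l", "ow"]  # hypothetical, tiny phoneme inventory


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: turns one frame's scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posteriorgram(frame_logits: list[list[float]]) -> list[list[float]]:
    """Build a T x P posteriorgram: row t holds P(phoneme | frame t)."""
    return [softmax(row) for row in frame_logits]


def best_path(pg: list[list[float]]) -> list[str]:
    """Most probable phoneme in each frame, a crude readout of the linguistic content."""
    return [PHONEMES[max(range(len(row)), key=row.__getitem__)] for row in pg]


# Hypothetical acoustic-model scores for 4 frames (rows) over the 5 phonemes (columns).
logits = [
    [4.0, 0.1, 0.2, 0.0, 0.1],
    [0.3, 3.5, 0.9, 0.1, 0.0],
    [0.1, 0.8, 3.9, 0.4, 0.2],
    [0.0, 0.2, 0.5, 3.2, 1.0],
]
pg = posteriorgram(logits)
print(best_path(pg))  # ['sil', 'h', 'eh', 'l']
```

Because each row keeps the full distribution rather than a single label, a posteriorgram carries the spoken content while discarding most speaker-specific detail, which is why it is a common choice for representing linguistic content in voice conversion.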

Claim 5 depends on claim 1, and thus recites the limitations of claim 1, with the additional element (j) “wherein the one or more 
For the reasons discussed above for claim 1, the claim 1 limitations recite abstract ideas. The additional element of claim 5 does not preclude the steps of claim 1 from practically being performed in the mind. Element (j) further modifies the abstract idea by disclosing that a generator generates the audio signal based on encodings. Here, element (j) falls under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional elements (k) “one or more neural networks” and (l) “a generator”. Here, elements (k) and (l) account for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).

Regarding claim 8, the claim recites element (a) “using one or more 
The judicial exception is not integrated into a practical application. The claim recites the additional element (b) “one or more neural networks”. Here, element (b) accounts for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES).
The claim does not include any other additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, element (a) amounts to no more than a mental process, and element (b) amounts to no more than generic computing components. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer, and do not provide an inventive concept (Step 2B: NO).

Claim 9 depends on claim 8, and thus recites the limitations of claim 8, with the additional elements (c) “wherein the one or more second features comprises a timbre of the second voice signal” and (d) “the one or more 
For the reasons discussed above for claim 8, the claim 8 limitations recite abstract ideas. The additional elements of claim 9 do not preclude the steps of claim 8 from practically being performed in the mind. Elements (c) and (d) further modify the abstract idea by disclosing that the second features comprise a timbre of the second voice signal, and that the first audio features and the timbre are combined to generate the output audio. Here, elements (c) and (d) fall under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional element (e) “one or more neural networks”. Here, element (e) accounts for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).

Claim 10 depends on claim 8, and thus recites the limitations of claim 8, with the additional element (f) “wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content.”
For the reasons discussed above for claim 8, the claim 8 limitations recite abstract ideas. The additional element of claim 10 does not preclude the steps of claim 8 from practically being performed in the mind. Element (f) further modifies the abstract idea by disclosing that the first audio features include pitch, amplitude, linguistic content, or some combination of the three. Here, element (f) falls under the mental process of collecting information (MPEP 2106.04(a)(2)(III)(A)). Accordingly, the claim recites a judicial exception (Step 2A).
Claim 10 does not recite any additional elements; therefore, the claim does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception (Step 2A, Prong Two and Step 2B).

Claim 11 depends on claim 10, and thus recites the limitations of claim 10, with the additional element (g) “wherein the linguistic content is represented by one or more phoneme posteriorgrams.”
For the reasons discussed above for claim 10, the claim 10 limitations recite abstract ideas. The additional element of claim 11 does not preclude the steps of claim 10 from practically being performed in the mind. Element (g) further modifies the abstract idea by disclosing that the linguistic content is represented by posteriorgrams. Here, element (g) falls under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Accordingly, the claim recites a judicial exception (Step 2A). 
Claim 11 does not recite any additional elements; therefore, the claim does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception (Step 2A, Prong Two and Step 2B).

Claim 12 depends on claim 8, and thus recites the limitations of claim 8, with the additional element (h) “wherein the one or more 
For the reasons discussed above for claim 8, the claim 8 limitations recite abstract ideas. The additional element of claim 12 does not preclude the steps of claim 8 from practically being performed in the mind. Element (h) further modifies the abstract idea by disclosing that a generator generates the audio signal based on encodings. Here, element (h) falls under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional elements (i) “one or more neural networks” and (j) “a generator”. Here, elements (i) and (j) account for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).

Regarding claim 15, the claim recites element (a) “one or more 
The judicial exception is not integrated into a practical application. The claim recites the additional elements (b) “one or more processors”, (c) “one or more circuits”, and (d) “one or more neural networks”. Here, elements (b)-(d) account for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES).
The claim does not include any other additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, element (a) amounts to no more than a mental process, and elements (b)-(d) amount to no more than generic computing components. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer, and do not provide an inventive concept (Step 2B: NO).

Claim 16 depends on claim 15, and thus recites the limitations of claim 15, with the additional elements (e) “wherein the one or more second features comprise a timbre of the second voice signal” and (f) “the one or more 
For the reasons discussed above for claim 15, the claim 15 limitations recite abstract ideas. The additional elements of claim 16 do not preclude the steps of claim 15 from practically being performed in the mind. Elements (e) and (f) further modify the abstract idea by disclosing that the second features comprise a timbre of the second voice signal, and that the first audio features and the timbre are combined to generate the output audio. Here, elements (e) and (f) fall under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional element (g) “one or more neural networks”. Here, element (g) accounts for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).

Claim 17 depends on claim 15, and thus recites the limitations of claim 15, with the additional element (h) “wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content.”
For the reasons discussed above for claim 15, the claim 15 limitations recite abstract ideas. The additional element of claim 17 does not preclude the steps of claim 15 from practically being performed in the mind. Element (h) further modifies the abstract idea by disclosing that the first audio features include pitch, amplitude, linguistic content, or some combination of the three. Here, element (h) falls under the mental process of collecting information (MPEP 2106.04(a)(2)(III)(A)). Accordingly, the claim recites a judicial exception (Step 2A).
Claim 17 does not recite any additional elements; therefore, the claim does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception (Step 2A, Prong Two and Step 2B).

Claim 18 depends on claim 15, and thus recites the limitations of claim 15, with the additional element (i) “wherein the one or more 
For the reasons discussed above for claim 15, the claim 15 limitations recite abstract ideas. The additional element of claim 18 does not preclude the steps of claim 15 from practically being performed in the mind. Element (i) further modifies the abstract idea by disclosing that a generator generates the audio signal based on encodings. Here, element (i) falls under the mental process of collecting information, analyzing it, and displaying results of the collection and analysis (MPEP 2106.04(a)(2)(III)(A)). Furthermore, the claim also includes the additional elements (j) “one or more neural networks” and (k) “a generator”. Here, elements (j) and (k) account for generic computing components recited at a high level of generality (MPEP 2106.04(a)(2)(III)(C)). Even when viewed in combination, the claim elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception (Step 2A: YES). Even when viewed in combination, the mental process and generic computing components in the claim do not amount to significantly more than a mental process (Step 2B: NO).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 5, 8-10, 12, and 15-18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Trueba et al. (US Pat. No. 11,735,156 B1 hereinafter Trueba), in view of Gupta et al. (US Pat. No. 11,605,388 B1 hereinafter Gupta).
Regarding claim 1, Trueba discloses a processor, comprising: one or more circuits to use one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) to generate, from an input speech and a reference speech, an audio signal based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Trueba, Fig. 1B; Col. 4, lines 27-65: "The user device 110 and/or remote system 120 processes (134) the first audio data to determine first encoded data corresponding to phoneme characteristics of the first speech."; "The user device 110 and/or remote system 120 may also process (136) the first audio data to determine second encoded data corresponding to a phrase corresponding to the first speech."; "The user device 110 and/or remote system 120 processes (138) the second audio data (e.g., the target input data 152) to determine third encoded data corresponding to vocal characteristics of the second speech (e.g., the target speech)."; "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."). 
However, Trueba fails to expressly recite an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech.
Gupta teaches an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Gupta, Col. 3, lines 37-41: “The methods and systems described in this specification enable speech audio to be generated in a target speaker's voice, while maintaining the performance (e.g. speech prosody) and timing of source speech audio from which the acoustic features relating to a source speaker are derived.”).
Trueba and Gupta are analogous art because both belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba to incorporate the teachings of Gupta to maintain the prosody of the input speech during speech generation. This allows the sound of a person’s voice to be modified without changing the original speaker’s performance and timing (Gupta, Col. 3), which helps retain the quality of the original speech even when it is modified to sound different, resulting in higher quality output audio.

	Regarding claim 2, the rejection of claim 1 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more second features comprises a timbre of the second voice signal (Trueba, Col. 3, lines 63-65: "the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency."), and the one or more neural networks are to generate the audio signal such that the audio signal comprises the one or more first audio features and the timbre corresponding to the second voice signal (Trueba, Col. 4, line 65 to Col. 5, line 3: "The output data 162 thus may include a representation of the phrase and/or phoneme characteristics corresponding to the source input data 150, while the representation further corresponds to the vocal characteristics represented in the target input data 152."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

	Regarding claim 3, the rejection of claim 1 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content (Trueba, Col. 3, lines 58-65: "The first audio data may further correspond to phoneme characteristics and vocal characteristics; the phoneme characteristics may represent pronunciation of the first speech that is independent of a voice of a particular speaker, such as syllable breaks, cadence, and/or emphasis, while the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency.").

	Regarding claim 5, the rejection of claim 1 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) comprise a generator to generate the audio signal based, at least in part, on one or more first encodings of the one or more second features, and one or more second encodings of the one or more first audio features (Trueba, Col. 3, lines 60-65: "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

	Regarding claim 8, Trueba discloses a method, comprising: using one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) to generate, from an input speech and a reference speech, an audio signal based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Trueba, Fig. 1B; Col. 4, lines 27-65: "The user device 110 and/or remote system 120 processes (134) the first audio data to determine first encoded data corresponding to phoneme characteristics of the first speech."; "The user device 110 and/or remote system 120 may also process (136) the first audio data to determine second encoded data corresponding to a phrase corresponding to the first speech."; "The user device 110 and/or remote system 120 processes (138) the second audio data (e.g., the target input data 152) to determine third encoded data corresponding to vocal characteristics of the second speech (e.g., the target speech)."; "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."). 
However, Trueba fails to expressly recite an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech.
	Gupta teaches an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Gupta, Col. 3, lines 37-41: “The methods and systems described in this specification enable speech audio to be generated in a target speaker's voice, while maintaining the performance (e.g. speech prosody) and timing of source speech audio from which the acoustic features relating to a source speaker are derived.”).
Trueba and Gupta are analogous art because both belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba to incorporate the teachings of Gupta to maintain the prosody of the input speech during speech generation. This allows the sound of a person’s voice to be modified without changing the original speaker’s performance and timing (Gupta, Col. 3), which helps retain the quality of the original speech even when it is modified to sound different, resulting in higher quality output audio.

	Regarding claim 9, the rejection of claim 8 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more second features comprises a timbre of the second voice signal (Trueba, Col. 3, lines 63-65: "the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency."), and the one or more neural networks are to generate the audio signal such that the audio signal comprises the one or more first audio features and the timbre corresponding to the second voice signal (Trueba, Col. 4, line 65 to Col. 5, line 3: "The output data 162 thus may include a representation of the phrase and/or phoneme characteristics corresponding to the source input data 150, while the representation further corresponds to the vocal characteristics represented in the target input data 152."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

	Regarding claim 10, the rejection of claim 8 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content (Trueba, Col. 3, lines 58-65: "The first audio data may further correspond to phoneme characteristics and vocal characteristics; the phoneme characteristics may represent pronunciation of the first speech that is independent of a voice of a particular speaker, such as syllable breaks, cadence, and/or emphasis, while the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency.").

	Regarding claim 12, the rejection of claim 8 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) comprise a generator to generate the audio signal based, at least in part, on one or more first encodings of the one or more second features, and one or more second encodings of the one or more first audio features (Trueba, Col. 3, lines 60-65: "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

	Regarding claim 15, Trueba discloses a system, comprising: one or more processors comprising one or more circuits to use one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) to generate, from an input speech and a reference speech, an audio signal based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Trueba, Fig. 1B; Col. 4, lines 27-65: "The user device 110 and/or remote system 120 processes (134) the first audio data to determine first encoded data corresponding to phoneme characteristics of the first speech."; "The user device 110 and/or remote system 120 may also process (136) the first audio data to determine second encoded data corresponding to a phrase corresponding to the first speech."; "The user device 110 and/or remote system 120 processes (138) the second audio data (e.g., the target input data 152) to determine third encoded data corresponding to vocal characteristics of the second speech (e.g., the target speech)."; "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."). 
However, Trueba fails to expressly recite an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech.
	Gupta teaches an audio signal that maintains prosody of the input speech, wherein the audio signal is generated based, at least in part, on one or more first audio features corresponding to a first voice signal of the input speech and one or more second features different from the one or more first audio features corresponding to a second voice signal of the reference speech (Gupta, Col. 3, lines 37-41: “The methods and systems described in this specification enable speech audio to be generated in a target speaker's voice, while maintaining the performance (e.g. speech prosody) and timing of source speech audio from which the acoustic features relating to a source speaker are derived.”).
Trueba and Gupta are analogous art because both belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba to incorporate the teachings of Gupta to maintain the prosody of the input speech during speech generation. This allows the sound of a person’s voice to be modified without changing the original speaker’s performance and timing (Gupta, Col. 3), which helps retain the quality of the original speech even when it is modified to sound different, resulting in higher quality output audio.

	Regarding claim 16, the rejection of claim 15 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more second features comprise a timbre of the second voice signal (Trueba, Col. 3, lines 63-65: "the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency."), and the one or more neural networks are to generate the audio signal such that the audio signal comprises the one or more first audio features and the timbre corresponding to the second voice signal (Trueba, Col. 4, line 65 to Col. 5, line 3: "The output data 162 thus may include a representation of the phrase and/or phoneme characteristics corresponding to the source input data 150, while the representation further corresponds to the vocal characteristics represented in the target input data 152."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

	Regarding claim 17, the rejection of claim 15 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more first audio features comprise at least one of: pitch, amplitude, and linguistic content (Trueba, Col. 3, lines 58-65: "The first audio data may further correspond to phoneme characteristics and vocal characteristics; the phoneme characteristics may represent pronunciation of the first speech that is independent of a voice of a particular speaker, such as syllable breaks, cadence, and/or emphasis, while the vocal characteristics may represent features of the voice of a particular speaker, such as tone, resonance, timbre, pitch, and/or frequency.").

	Regarding claim 18, the rejection of claim 15 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. Trueba further discloses wherein the one or more neural networks (Trueba, Col. 2, lines 7-11: “The processing component(s), referred to herein as a voice-transfer component, may include one or more neural-network models configured as one or more encoders and one or more neural-network models configured as one or more decoders.”) comprise a generator to generate the audio signal based, at least in part, on one or more first encodings of the one or more second features, and one or more second encodings of the one or more first audio features (Trueba, Col. 3, lines 60-65: "The user device 110 and/or remote system 120 may then process (140) the first encoded data, the second encoded data, and the third encoded data to determine third audio data (e.g., the output data 162) that corresponds to the phrase encoded data, the phoneme characteristics encoded data, and the vocal characteristics encoded data."; Col. 3, lines 50-51: "The user device 110 and/or other device may output audio 14 corresponding to the output data 162.").

Claim(s) 4 and 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Trueba, in view of Gupta, as applied to claims 1-3, 5, 8-10, 12, and 15-18 above, and further in view of Carmiel et al. (US Pat. Pub. No. 2023/0352001 A1 hereinafter Carmiel).
Regarding claim 4, the rejection of claim 3 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the linguistic content is represented by one or more phoneme posteriorgrams.
Carmiel teaches wherein the linguistic content is represented by one or more phoneme posteriorgrams (Carmiel, [0055]: "In another example, another approach is based on converting speech using phonetic posteriorgrams (PPGs). Such prior approach is limited to voice conversion, whereas at least some embodiments described herein enable converting other and/or selected voice attributes such as accent. Such prior approach is are limited to a “many-to-one” approach, i.e., different input voices are mapped to a single output voice. At least some embodiments described herein provide a “many-to-many” approach where the input voice may be converted to different sets of output voice attributes.").
Trueba, Gupta, and Carmiel are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Carmiel to represent linguistic content using phoneme posteriorgrams. Using phoneme posteriorgrams is a known technique for speech recognition in many-to-one voice conversion systems (Carmiel, [0082]). This technique allows a voice conversion system to effectively recognize the input speech.

Regarding claim 11, the rejection of claim 10 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the linguistic content is represented by one or more phoneme posteriorgrams.
Carmiel teaches wherein the linguistic content is represented by one or more phoneme posteriorgrams (Carmiel, [0055]: "In another example, another approach is based on converting speech using phonetic posteriorgrams (PPGs). Such prior approach is limited to voice conversion, whereas at least some embodiments described herein enable converting other and/or selected voice attributes such as accent. Such prior approach is are limited to a “many-to-one” approach, i.e., different input voices are mapped to a single output voice. At least some embodiments described herein provide a “many-to-many” approach where the input voice may be converted to different sets of output voice attributes.").
Trueba, Gupta, and Carmiel are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Carmiel to represent linguistic content using phoneme posteriorgrams. Using phoneme posteriorgrams is a known technique for speech recognition in many-to-one voice conversion systems (Carmiel, [0082]). This technique allows a voice conversion system to effectively recognize the input speech.

Claim(s) 6, 13, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Trueba, in view of Gupta, as applied to claims 1-3, 5, 8-10, 12, and 15-18 above, and further in view of Jia et al. (US Pat. Pub. No. 2022/0068256 A1 hereinafter Jia).
Regarding claim 6, the rejection of claim 5 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal.
Jia teaches wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal (Jia, [0003]: "The method includes receiving, at data processing hardware, a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training, at the data processing hardware, a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers.").
Trueba, Gupta, and Jia are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Jia to train the audio generation model based on audio that differs from the specified audio. The accuracy and/or robustness of a neural network for audio generation depends on the training data set (Jia, [0002]). As such, it is important to have a varied set of data for training a neural network.

Regarding claim 13, the rejection of claim 12 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal.
Jia teaches wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal (Jia, [0003]: "The method includes receiving, at data processing hardware, a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training, at the data processing hardware, a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers.").
Trueba, Gupta, and Jia are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Jia to train the audio generation model based on audio that differs from the specified audio. The accuracy and/or robustness of a neural network for audio generation depends on the training data set (Jia, [0002]). As such, it is important to have a varied set of data for training a neural network.

Regarding claim 19, the rejection of claim 18 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal.
Jia teaches wherein the generator is trained based, at least in part, on one or more audio signals including voices not included in the second voice signal (Jia, [0003]: "The method includes receiving, at data processing hardware, a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training, at the data processing hardware, a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers.").
Trueba, Gupta, and Jia are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Jia to train the audio generation model based on audio that differs from the specified audio. The accuracy and/or robustness of a neural network for audio generation depends on the training data set (Jia, [0002]). As such, it is important to have a varied set of data for training a neural network.

Claim(s) 7, 14, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Trueba, in view of Gupta, as applied to claims 1-3, 5, 8-10, 12, and 15-18 above, and further in view of Prenger et al. (US Pat. Pub. No. 2020/0394994 A1 hereinafter Prenger).
Regarding claim 7, the rejection of claim 5 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input.
Prenger teaches wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input (Prenger, [0023]: "In at least one embodiment, WN( ) is an audio transformation that uses layers of dilated convolutions with gated-tanh nonlinearities, as well as residual connections and skip connections.").
Trueba, Gupta, and Prenger are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Prenger to use residual blocks in an audio generation neural network. Residual blocks can be used to create a system that generates “high quality audio without sacrificing quality at rates that may even exceed real-time requirements” (Prenger, [0002]). Generating high quality audio is important to ensure a good user experience.

Regarding claim 14, the rejection of claim 12 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input.
Prenger teaches wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input (Prenger, [0023]: "In at least one embodiment, WN( ) is an audio transformation that uses layers of dilated convolutions with gated-tanh nonlinearities, as well as residual connections and skip connections.").
Trueba, Gupta, and Prenger are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Prenger to use residual blocks in an audio generation neural network. Residual blocks can be used to create a system that generates “high quality audio without sacrificing quality at rates that may even exceed real-time requirements” (Prenger, [0002]). Generating high quality audio is important to ensure a good user experience.

Regarding claim 20, the rejection of claim 18 is incorporated. Trueba, in view of Gupta, discloses all of the elements of the current invention as stated above. However, Trueba, in view of Gupta, fails to expressly recite wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input.
Prenger teaches wherein the generator comprises one or more residual blocks, and each of the one or more residual blocks receive the one or more of the first encodings as input (Prenger, [0023]: "In at least one embodiment, WN( ) is an audio transformation that uses layers of dilated convolutions with gated-tanh nonlinearities, as well as residual connections and skip connections.").
Trueba, Gupta, and Prenger are analogous art because they all belong to the field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic speech processing system of Trueba, as modified by the speaker conversion system of Gupta, to incorporate the teachings of Prenger to use residual blocks in an audio generation neural network. Residual blocks can be used to create a system that generates “high quality audio without sacrificing quality at rates that may even exceed real-time requirements” (Prenger, [0002]). Generating high quality audio is important to ensure a good user experience.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TYLER J BECKER whose telephone number is (703)756-1271. The examiner can normally be reached M-Th, 7:15am-5:45pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TYLER BECKER/ Examiner, Art Unit 2657

/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657
Prosecution Timeline

Feb 15, 2023: Application filed
Feb 24, 2025: Non-final rejection mailed (§101, §103)
Jul 24, 2025: Response filed
Oct 02, 2025: Final rejection mailed (§101, §103)
Apr 02, 2026: Request for Continued Examination
Apr 03, 2026: Response after non-final action
Precedent Cases

Applications granted by this same examiner with similar technology

18/346,232 (Patent 12632657): Joint Speech and Text Streaming Model for ASR. 2y 10m to grant; granted May 19, 2026.
18/274,767 (Patent 12614560): REVERBERATION REMOVAL DEVICE, PARAMETER ESTIMATION DEVICE, REVERBERATION REMOVAL METHOD, PARAMETER ESTIMATION METHOD, AND PROGRAM. 2y 9m to grant; granted Apr 28, 2026.
18/484,927 (Patent 12597433): SPEECH SIGNAL ENHANCEMENT METHOD AND APPARATUS, AND ELECTRONIC DEVICE. 2y 5m to grant; granted Apr 07, 2026.
18/334,771 (Patent 12585893): Full Media Translator. 2y 9m to grant; granted Mar 24, 2026.
17/692,070 (Patent 12518777): SYSTEMS AND METHODS FOR AUTHENTICATION USING SOUND-BASED VOCALIZATION ANALYSIS. 3y 10m to grant; granted Jan 06, 2026.
Based on this examiner's 5 most recent grants.
Prosecution Projections

Expected OA rounds: 3-4
Grant probability: 75%
Grant probability with interview: 92% (+16.5%)
Median time to grant: 2y 7m (~0m remaining)
PTA risk: Moderate
Based on 20 resolved cases by this examiner. Grant probability derived from career allowance rate.