Prosecution Insights
Last updated: April 19, 2026
Application No. 18/108,080

METHOD FOR PERFORMING SYNTHETIC SPEECH GENERATION OPERATION ON TEXT

Status: Non-Final OA (§103)
Filed: Feb 10, 2023
Examiner: WITHEY, THEODORE JOHN
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Neosapience Inc.
OA Round: 3 (Non-Final)
Grant Probability: 44% (Moderate)
Expected OA Rounds: 3-4
Median Time to Grant: 2y 11m
Grant Probability With Interview: 90%

Examiner Intelligence

Career Allow Rate: 44% (10 granted / 23 resolved; -18.5% vs TC avg)
Interview Lift: +46.9% (strong); 90% grant rate in resolved cases with an interview
Typical Timeline: 2y 11m average prosecution; 39 applications currently pending
Career History: 62 total applications across all art units

Statute-Specific Performance

§101: 22.0% (-18.0% vs TC avg)
§103: 48.6% (+8.6% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 12.0% (-28.0% vs TC avg)
Tech Center averages are estimates • Based on career data from 23 resolved cases
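The figures above reduce to simple ratios over the examiner's 23 resolved cases. A minimal sketch of how a dashboard like this might derive them; the counts and Tech Center averages come from this page, while the function names and the interview counts are illustrative assumptions, not any real analytics API:

```python
from dataclasses import dataclass

@dataclass
class ExaminerStats:
    granted: int           # resolved cases ending in allowance
    resolved: int          # granted + abandoned
    granted_with_iv: int   # granted among interviewed resolved cases
    resolved_with_iv: int  # resolved cases that included an interview

def allow_rate(granted: int, resolved: int) -> float:
    """Allow rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

def interview_lift(s: ExaminerStats) -> float:
    """Percentage-point gap between interviewed and overall allow rates."""
    return allow_rate(s.granted_with_iv, s.resolved_with_iv) - allow_rate(s.granted, s.resolved)

# 10/23 is from this page; the interview counts are assumed values that
# land near the displayed 90% with-interview rate and +46.9 lift.
stats = ExaminerStats(granted=10, resolved=23, granted_with_iv=9, resolved_with_iv=10)
print(f"allow rate: {allow_rate(stats.granted, stats.resolved):.1f}%")  # 43.5%, shown as 44%
print(f"interview lift: {interview_lift(stats):+.1f} pts")              # +46.5, shown as +46.9

# Statute-specific deltas; the displayed deltas all imply a ~40% TC average.
tc_avg = {"101": 40.0, "103": 40.0, "102": 40.0, "112": 40.0}
examiner = {"101": 22.0, "103": 48.6, "102": 17.1, "112": 12.0}
for statute, rate in examiner.items():
    print(f"§{statute}: {rate:.1f}% ({rate - tc_avg[statute]:+.1f}% vs TC avg)")
```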

Office Action

§103
DETAILED ACTION

This office action is in response to Applicant's request for continued examination (RCE), received on 11/17/2025. Claims 1 and 5 have been amended. Claim 11 has been cancelled. Claims 1, 3, and 5-10 are pending and have been considered.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Acknowledgment is made of applicant's claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed for the parent Application No. KR10-2020-0102500, filed on 08/14/2020.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/17/2025 has been entered.

Response to Arguments

Applicant's arguments filed 11/17/2025, see pgs. 5-8, with respect to "Rejections under 35 U.S.C. 103" for independent claim 1, have been fully considered but they are not persuasive.

Applicant's representative asserts, "The cited references do not teach or suggest 'determining whether the plurality of speech style characteristics are revealed by the plurality of synthetic speeches; selecting, by the processor of the information processing system, at least one sentence as an inspection target from among the plurality of sentences based on a result of analyzing at least one of the plurality of speech style characteristics or the plurality of synthetic speeches, wherein the analyzing the plurality of synthetic speeches comprises determining that an characteristic of a synthetic speech corresponding to the at least one sentence is different from an characteristic of a synthetic speech corresponding to an adjacent sentence,' as recited in amended independent claim 1. Support for the amendments can be found, for example, at paragraph [0086] of the present publication (US 2023/0186895). Kaszczuk is directed to a system and method for improving TTS output by incorporating user feedback into iterative TTS processing. Kaszczuk describes a device that converts text into an acoustic waveform that is recognizable as speech corresponding to the input text, and the text is in the form of sentences. (col. 1, lines 9-11 and col. 3, lines 65-67). The input text may include special annotations for processing by the TTS module 214 to indicate how particular text is to be pronounced when spoken aloud. (col. 4, lines 1-3). The Examiner alleges that column 10, lines 24-28 of Kaszczuk disclose selecting at least one sentence to be an inspection target from among the plurality of sentences based on a result of analyzing at least one of the plurality of speech style characteristics or the plurality of synthetic speeches. (Final Office Action, page 8). Column 10, lines 24-28 state: '…a user to edit the preliminary TTS results by adjusting acoustic features of the preliminary TTS results, by substituting phonetic units in the preliminary TTS results, or by making other edits. The user edits (or commands based on those edits) are then taken as inputs to the TTS system along…' However, the cited portion of Kaszczuk merely describes that a user edits the preliminary TTS results to obtain a second round of TTS results that is more desirable than the preliminary TTS results. Kaszczuk does not teach or suggest selecting a certain sentence as an inspection target among a plurality of sentences by a processor. The Examiner alleges that the selection of a sentence in Kaszczuk is based on a required first waveform generation which inherently will require speech style analysis. (Final Office Action, page 8). However, Kaszczuk does not teach or suggest determining whether a speech style characteristic for a certain sentence is revealed by a synthetic speech generated with respect to the certain sentence. In addition, Kaszczuk does not teach or suggest 'wherein the analyzing the plurality of synthetic speeches comprises determining that an characteristic of a synthetic speech corresponding to the at least one sentence is different from an characteristic of a synthetic speech corresponding to an adjacent sentence,' as recited in amended independent claim 1."

In response, the examiner would like to refer to the previously cited user editing operation of Kaszczuk with regard to the "selecting a certain sentence as an inspection target among a plurality of sentences by a processor". Specifically, the examiner respectfully asserts that presentation of a sentence to be edited by a user through a user interface (see the example interface of Figs. 7A-7B) requires the processor of the computing system to "select" the to-be-edited result for display. The user is not presented with a physical sentence; therefore, there is a required step of "selecting" by the processor for displaying the inspection target. Further, referring to the flowchart of Fig. 6, the step 606 of "providing preliminary TTS results" before step 608, "receive user feedback on the preliminary TTS results", indicates that the preliminary result is selected by a processor for provision to the user.

Continuing, with regard to the newly added claim elements, new portions of Kaszczuk and Sung will be relied upon. Specifically, the process of Kaszczuk will be revisited. With regard to the "determining whether the plurality of speech style characteristics are revealed by the plurality of synthetic speeches" element, the examiner respectfully asserts that this operation is synonymous with the user feedback/editing operation of Kaszczuk. "Revealing" speech style characteristics in synthetic speeches maps to an operation of determining whether a synthesized speech has an expected prosody/style. If the user provides positive or no feedback, this indicates that the speech style characteristic is revealed in the synthesized speech. If the user provides negative feedback altering speech synthesis results (see [Col. 10, Lines 34-50]), then this is indicative of a desired speech style characteristic not being revealed, causing the need for adjustment. See updated rejections below with complete mapping. With regard to the "wherein the analyzing the plurality of synthetic speeches comprises determining that an characteristic…" element, new sections of Sung will be incorporated. Specifically, [0086] of Sung discloses changing the style of speech based on the importance of the meaning of the sentences the speech represents. Determining to apply different speech styles to adjacent sentences indicates a determination that the styles, i.e., characteristics, will be different owing to the differing importance determinations, wherein an importance determination has a direct relation to the style. See the complete mapping below.

Applicant's representative continues, "Second, the Office alleges that column 11, lines 8-10 of Kaszczuk disclose outputting a visual representation indicating the inspection target in an area corresponding to the selected at least one sentence. (Office Action, page 5). However, column 11, lines 8-10 describe that a user selects a portion of TTS results to edit. The TTS results are illustrated in FIG. 7A below. As illustrated in FIG. 7A, the TTS results show an input text. The displayed input text is not selected by a processor based on a result of analyzing at least one of the plurality of speech style characteristics or the plurality of synthetic speeches. Instead, the input text is selected by a user, who is different from a processor. In addition, the input text in Kaszczuk is not selected based on a result of analyzing at least one of the plurality of speech style characteristics or the plurality of synthetic speeches."

In response, the examiner would like to refer to the language of the claims as currently amended. Specifically, the "instructing…" element of the claim has no relation, connection, or suggestion that the visual representation of the inspection target is output based on a result of analyzing at least one of the plurality of speech style characteristics, as asserted by Applicant's representative. The examiner has previously demonstrated that Kaszczuk discloses "selecting a certain sentence as an inspection target among a plurality of sentences by a processor". Further, presentation of a result on a user interface indicates that the processor selected the sentence for presentation, wherein presentation of the sentence with associated prosody information (see the example interface of Figs. 7A-7B) indicates a previous analysis of the synthesized speech in order to generate and present an accurate waveform of the synthesized speech on the interface. Applicant's representative asserts that the user is selecting the input text (see pg. 7 of the remarks), though in view of flowchart Fig. 6 of Kaszczuk, which discloses generating results (604) before presentation (606) without any specific disclosure that the user selects what is to be presented, it is unclear to the examiner where in Kaszczuk this assertion is founded. The examiner respectfully asserts that the functionality of Kaszczuk appears to be presentation of all synthesized speech without user selection, as provision of the results is never defined to be based on user selection; therefore, the processor is selecting the synthesized speech to be presented to the user for correction, wherein the synthesized speech characteristics will have been analyzed by a processor as would be required for appropriate generation of the waveform representing the synthesized speech on the user interface.

Applicant's arguments filed 11/17/2025, see pgs. 8-9, with respect to "Rejections under 35 U.S.C. 103" for independent claim 5, have been fully considered but they are not persuasive.

Applicant's representative asserts, "The Examiner alleges that col. 6, lines 55-62 of Czuczman disclose selecting at least one sentence as an inspection target by analyzing a behavior pattern of the first user account. (Final Office Action, page 17). Col. 6, lines 55-62 of Czuczman states: '…scheduled for review prior to the lower priority sentences. Sentences may be prioritized in a number of different ways, for example how many different prosodic units are included in each sentence, the linguistic coverage of the sentence, the length of the sentence, how easy the sentence is to pronounce (which may be represented by a readability score), the number of new units in the sentence not already in the corpus, and/or a number of other factors. The overall priority may be a combination of such factors, for example, a…' The Examiner appears to interpret the feature of analyzing a behavior pattern of the first user very broadly, and alleges that the different prosodic units described in Czuczman correspond to the behavior pattern of claim 5. (Final Office Action, page 17). The cited references do not teach or suggest 'wherein the at least one sentence is selected as the inspection target in response to determining that the speech style characteristics selected for the at least one sentence is different from recommended speech style characteristics, determining that selection of the speech style characteristics for the at least one sentence is made in less than a predetermined time, or determining that selection of the speech style characteristics for the at least one sentence is frequently changed from the first terminal of the first user,' as recited in amended independent claim 5. Regarding claim 11, the Examiner alleges that column 6, lines 50-60 of Czuczman describe that sentences may be prioritized based on how many different prosodic units are included in each sentence, and that this description corresponds to the feature of determining that selection of speech style characteristics for the at least one sentence is frequently changed. (Final Office Action, page 20). However, Czuczman does not teach or suggest determining that the speech style characteristics selected for the at least one sentence is different from recommended speech style characteristics, determining that selection of the speech style characteristics for the at least one sentence is made in less than a predetermined time, or determining that selection of the speech style characteristics for the at least one sentence is frequently changed from the first terminal of the first user, as recited in amended independent claim 5. Czuczman merely describes determining priorities based on how many different prosodic units are included in each sentence. For at least these reasons, amended independent claim 5 and its dependent claims are patentable over the cited references."

Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. It is unclear to the examiner why the interpretation of behavior patterns based on prosodic units, as cited from Czuczman, is inappropriate for the given claim. Consider the prosodic units of Czuczman in view of the speech units of Kaszczuk (see Fig. 7A), which clearly define the units to be specific tones and/or ways of speaking, indicating that a change in the prosodic units of Czuczman will change a manner of speaking. Applicant's arguments also do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. See updated rejections below.
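The element at the center of this dispute amounts to comparing each sentence's synthesized style against that of its neighbors. A minimal sketch of that claim-1 condition, assuming per-sentence style labels are already available; the discrete labels and list interface are illustrative assumptions, not the applicant's or any cited reference's implementation:

```python
from typing import Sequence

def select_inspection_targets(styles: Sequence[str]) -> list[int]:
    """Flag sentence i when its synthetic-speech characteristic differs
    from the characteristic of an adjacent sentence."""
    targets = []
    for i, style in enumerate(styles):
        neighbors = []
        if i > 0:
            neighbors.append(styles[i - 1])
        if i < len(styles) - 1:
            neighbors.append(styles[i + 1])
        if any(style != n for n in neighbors):
            targets.append(i)
    return targets

# Example echoing Sung [0086]: one 'pleasant' sentence amid 'slow' speech.
print(select_inspection_targets(["slow", "slow", "pleasant", "slow"]))  # [1, 2, 3]
```

A production system would more likely compare continuous prosody embeddings against a distance threshold than string labels, but the adjacency logic is the same.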
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: "A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made."

Claims 1 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Kaszczuk et al. (US-9978359-B1), hereinafter Kaszczuk, in view of Sung et al. (US-20200234693-A1), hereinafter Sung.

Regarding claim 1, Kaszczuk discloses: a method for performing a synthetic speech generation operation on text ([Col. 1, Lines 9-11] In such text-to-speech (TTS) systems, a device converts text into an acoustic waveform that is recognizable as speech corresponding to the input text [TTS tracks to a method of speech synthesis]), comprising:

receiving, from a first terminal of a first user ([Col. 2, Lines 3-4] device 106 receives text 108 for TTS processing… [Col. 3, Lines 60-65] textual data may originate from an internal component of the TTS device 202 or may be received by the TTS device 202 from an input device such as a keyboard [Receiving text from an input device such as a keyboard indicates a user entering that text]), a plurality of sentences ([Col. 3, Lines 65-67] The text may be in the form of sentences including text, numbers, and/or punctuation for conversion by the TTS module 214 into speech);

receiving, from the first terminal of the first user ([In view of the previously disclosed first terminal and user]), a plurality of speech style characteristics for the plurality of sentences ([Col. 4, Lines 1-3] The input text may also include special annotations for processing by the TTS module 214 to indicate how particular text is to be pronounced when spoken aloud [Information regarding pronunciation tracks to speech style characteristics]);

selecting, by the processor of the information processing system ([Fig. 2, controller/processor 208], [Wherein the system the processor is used in is for TTS, i.e., information processing]), at least one sentence as an inspection target from among the plurality of sentences based on a result of analyzing at least one of the plurality of speech style characteristics or the plurality of synthetic speeches ([Fig. 7B, Speech Waveform 704], [Col. 10, Lines 24-28] allow a user to edit the preliminary TTS results by adjusting acoustic features of the preliminary TTS results, by substituting phonetic units in the preliminary TTS results, or by making other edits… As the user feedback is taken as an input to the TTS system, the second round of TTS results is likely to be more desirable than the preliminary TTS results [In view of the previously disclosed input being comprised of multiple sentences, indicating selection of at least one sentence in view of the user interface of Fig. 7B, i.e., sentence "Hello", wherein that sentence is "selected" and sent to a user from a system based on results of a preliminary TTS operation, wherein generation of a speech waveform resembling input text indicates a required analysis of speech style characteristics, i.e., the selection is "based on" a required first waveform generation which inherently will require speech style analysis. This method could be extended to more than one speech without a change in functionality]);

instructing, by the processor of the information processing system ([In view of the previously disclosed processor of an information processing system of Kaszczuk]), a second terminal of a second user to output ([Fig. 1, User 102, Display 110], [In view of the input text being received by device 106 through a different medium, indicating the user 102 is distinct, i.e., second, from a user with a keyboard entering original text as previously disclosed, further indicating that the keyboard input system is distinct from the display 110 on device 106, meaning a second terminal/user could be implemented here without a change in functionality to Kaszczuk. Disclosing "input device(s)", [Fig. 2, 206], indicates multiple terminals with associated users]), on a screen of the second terminal ([Fig. 7B, 700b], [Col. 13, Lines 40-45] if the user selects portions of TTS results to edit, the user interface then displays a screen 700b showing the text 702 corresponding to the selected portion of TTS results to edit), a visual representation indicating the inspection target in an area corresponding to the selected at least one sentence ([Col. 11, Lines 8-10] The user may then select a portion of the TTS results to edit. The user interface may then display the selected results to be altered in a separate screen, 700 [In view of the previously disclosed input text containing multiple sentences indicating inspection of at least one sentence, i.e., "Hello", a waveform tracks to a visual representation of the inspection target]); and,

receiving, from the second terminal of the second user ([In view of the previously disclosed second terminal and user of Kaszczuk]), a response to at least one of the plurality of synthetic speeches including a request to change at least one speech style characteristic corresponding to the at least one sentence ([Fig. 6, 608], [Fig. 7B, Select Characteristic 706b], [Col. 10, Lines 37-39] to receive feedback from a user on those results, and to incorporate that feedback to alter the speech synthesis results [Selecting/changing emotional characteristics of input text indicates a response to at least one synthetic speech 704 including a request to change a speech style characteristic, i.e., tone, received by the TTS system at step 608 of Fig. 6]).

Kaszczuk does not disclose: inputting, by a processor of an information processing system, the plurality of sentences and the plurality of speech style characteristics into an artificial neural network text-to-speech synthesis model, so as to generate a plurality of synthetic speeches for the plurality of sentences that reflect the plurality of speech style characteristics; and, wherein the analyzing the plurality of synthetic speeches comprises determining that an characteristic of a synthetic speech corresponding to the at least one sentence is different from an characteristic of a synthetic speech corresponding to an adjacent sentence.

Sung discloses: inputting, by a processor of an information processing system ([In view of the previously disclosed processor of an information processing system of Kaszczuk]), the plurality of sentences and the plurality of speech style characteristics into an artificial neural network text-to-speech synthesis model ([0250] Here, the artificial intelligence model may be an artificial intelligence model learned using at least one of artificial intelligence algorithms such as machine learning, neural networks, genes, deep learning, and classification algorithms, and in particular, may include at least one artificial neural network, [In view of the Fig. 5A speech synthesizer]), so as to generate a plurality of synthetic speeches for the plurality of sentences that reflect the plurality of speech style characteristics ([0065] In addition, the same text may be output as a 'calm style' speech and a 'friendly style' speech, respectively [Generating multiple styles for the same sentence tracks to generation of sentences reflecting a plurality of speech style characteristics in view of the TTS synthesis model disclosed in Kaszczuk, indicating that this method can be applied to multiple sentences simultaneously with no change in functionality]); and, wherein the analyzing the plurality of synthetic speeches comprises determining that an characteristic of a synthetic speech corresponding to the at least one sentence is different from an characteristic of a synthetic speech corresponding to an adjacent sentence ([0086] when it is determined that a meaning of a specific sentence is important as a result of speech recognition of a specific sentence for responding to the received user's speech, the electronic device may change the information on the tone of the output speech among the plurality of parameter information into 'pleasant style' in real time. Accordingly, the electronic device may output the output speech of 'slow style', output the output speech of 'pleasant style' for the specific sentence determined to be important, and then output the output speech of 'slow style' again for the sentence following the specific sentence determined to be important, [Determining to "change" styles from pleasant to slow indicates a required awareness that a current speech style is pleasant/slow in order to know when a change is required, further indicating that the sentence characteristic/style information is being monitored and determined from sentence to sentence. Further, as importance is related to style, the importance measure has a direct correlation to style, indicating that a change in one, i.e., importance, will result in a direct change in the other, such that the importance determination is equivalent to a characteristic determination]).

Kaszczuk and Sung are considered analogous art within speech synthesis.
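As mapped above, the combined teaching amounts to a model that takes (sentence, style) pairs and emits one styled waveform per sentence. A minimal interface sketch of that element; the `StyleTTS` class and its `synthesize` method are hypothetical stand-ins, not the API of Kaszczuk's or Sung's systems:

```python
from dataclasses import dataclass

@dataclass
class SyntheticSpeech:
    text: str
    style: str
    waveform: bytes  # placeholder for audio samples

class StyleTTS:
    """Hypothetical artificial-neural-network TTS synthesis model."""

    def synthesize(self, sentence: str, style: str) -> SyntheticSpeech:
        # A real model would condition a neural vocoder on a style
        # embedding; this stub only records the inputs.
        return SyntheticSpeech(text=sentence, style=style, waveform=b"")

def generate_all(model: StyleTTS, sentences: list[str], styles: list[str]) -> list[SyntheticSpeech]:
    """Generate one synthetic speech per sentence, reflecting its chosen style."""
    assert len(sentences) == len(styles)
    return [model.synthesize(s, st) for s, st in zip(sentences, styles)]

# The 'calm'/'friendly' labels echo Sung [0065].
speeches = generate_all(StyleTTS(), ["Hello.", "How are you?"], ["calm", "friendly"])
```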
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk to incorporate the teachings of Sung, because of the novel way to synthesize a plurality of response sentences in differing styles to more accurately respond to a user's situational intent or feelings (Sung, [0005]).

Kaszczuk further discloses: determining whether the plurality of speech style characteristics are revealed by the plurality of synthetic speeches ([Col. 11, Lines 50-59] the user may provide audio feedback to the TTS device to provide the TTS device with an example of how certain selected portions of the preliminary TTS results should sound, similar to a line reading in an acting context. Thus the user may speak the input text with the same stresses, mannerisms, and other characteristics the user wishes the TTS device to emulate in the voice of the synthesized speech [Providing feedback based on how TTS results, i.e., synthesized speech, should sound indicates a user determination that the plurality of speech style characteristics are not revealed by the synthetic speeches; therefore, should a user have no feedback, i.e., accept the preliminary TTS results (see [Col. 11, Lines 17-20]), this is indicative that the speech style characteristics are revealed, i.e., the amount of feedback tracks to a measure of "revealed-ness" as determined by the user]).

Regarding claim 10, Kaszczuk in view of Sung discloses: the method according to claim 1. Kaszczuk further discloses: a non-transitory computer-readable recording medium storing instructions that, when executed by one or more processors, cause performance of the method according to claim 1 ([Col. 14, Lines 30-35] Aspects of the present disclosure may be implemented as a computer implemented method, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium [In view of the rejection of claim 1, see above]).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Kaszczuk in view of Sung, further in view of Kang (US-20140088970-A1).

Regarding claim 3, Kaszczuk in view of Sung discloses: the method according to claim 1. Kaszczuk in view of Sung does not disclose: wherein the receiving the response to the at least one of the plurality of synthetic speeches further includes receiving a request to change at least one sentence associated with the at least one synthetic speech.

Kang discloses: wherein the receiving the response to the at least one of the plurality of synthetic speeches further includes receiving a request to change at least one sentence associated with the at least one synthetic speech ([Fig. 7], [0028] FIG. 7 illustrates a method through which the user inputs an editing command for the selected range of the text, [0037] The editing command may correspond to one of a copy command, a cut command, an edit command, a transmit command and a search command for the selected range of the text [An edit command tracks to a request to change at least one sentence associated with the synthetic speech output from the Kaszczuk in view of Sung synthesizing system]).

Kaszczuk, Sung, and Kang are considered analogous art within speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk in view of Sung to incorporate the teachings of Kang, because of the novel way to incorporate a multi-modal environment allowing users to edit text from synthesized output through an intuitive interface for increased accessibility (Kang, [0021]).

Sung further discloses: inputting the changed at least one speech style characteristic and the changed at least one sentence into the artificial neural network text-to-speech synthesis model ([0250] Here, the artificial intelligence model may be an artificial intelligence model learned using at least one of artificial intelligence algorithms such as machine learning, neural networks, genes, deep learning, and classification algorithms, and in particular, may include at least one artificial neural network, [In view of the style editing of Kaszczuk, the sentence editing of Kang, and the text-to-speech synthesis model of Kaszczuk in view of Sung]), so as to generate at least one synthetic speech for the changed at least one sentence that reflects the changed at least one speech style characteristic ([Fig. 4B], [Fig. 4B indicates a scenario where two responses are generated varying in both wording and prosody, i.e., speech style characteristics, shown as 'slowly with high tone', in view of the speech style characteristic tracking/collection of Kaszczuk, further in view of the text editing of Kang]).

Claims 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Kaszczuk in view of Sung, further in view of Czuczman et al. (US-9679554-B1), hereinafter Czuczman.

Regarding claim 5, Kaszczuk discloses: a method for performing a synthetic speech generation operation on text ([Col. 1, Lines 9-11] In such text-to-speech (TTS) systems, a device converts text into an acoustic waveform that is recognizable as speech corresponding to the input text [TTS tracks to a method of speech synthesis]), comprising:

receiving, from a first terminal of a first user ([Col. 2, Lines 3-4] device 106 receives text 108 for TTS processing… [Col. 3, Lines 60-65] textual data may originate from an internal component of the TTS device 202 or may be received by the TTS device 202 from an input device such as a keyboard [Receiving text from an input device such as a keyboard indicates a user entering that text]), a plurality of sentences ([Col. 3, Lines 65-67] The text may be in the form of sentences including text, numbers, and/or punctuation for conversion by the TTS module 214 into speech);

receiving, from the first terminal of the first user ([In view of the previously disclosed first terminal and user]), a plurality of speech style characteristics for the plurality of sentences ([Col. 4, Lines 1-3] The input text may also include special annotations for processing by the TTS module 214 to indicate how particular text is to be pronounced when spoken aloud [Information regarding pronunciation tracks to speech style characteristics]);

instructing, by the processor of the information processing system ([In view of the previously disclosed processor of an information processing system of Kaszczuk]), a second terminal of a second user to output ([Fig. 1, User 102, Display 110], [In view of the input text being received by device 106 through a different medium, indicating the user 102 is distinct, i.e., second, from a user with a keyboard entering original text as previously disclosed, further indicating that the keyboard input system is distinct from the display 110 on device 106, meaning a second terminal/user could be implemented here without a change in functionality to Kaszczuk. Disclosing "input device(s)", [Fig. 2, 206], indicates multiple terminals with associated users]), on a screen of the second terminal ([Fig. 7B, 700b], [Col. 13, Lines 40-45] if the user selects portions of TTS results to edit, the user interface then displays a screen 700b showing the text 702 corresponding to the selected portion of TTS results to edit), a visual representation indicating the inspection target in an area corresponding to the selected at least one sentence ([Col. 11, Lines 8-10] The user may then select a portion of the TTS results to edit. The user interface may then display the selected results to be altered in a separate screen, 700 [In view of the previously disclosed input text containing multiple sentences indicating inspection of at least one sentence, i.e., "Hello", a waveform tracks to a visual representation of the inspection target]); and,

receiving, from the second terminal of the second user ([In view of the previously disclosed second terminal and user of Kaszczuk]), a response to at least one of the plurality of synthetic speeches including a request to change at least one speech style characteristic corresponding to the at least one sentence ([Fig. 6, 608], [Fig. 7B, Select Characteristic 706b], [Col. 10, Lines 37-39] to receive feedback from a user on those results, and to incorporate that feedback to alter the speech synthesis results [Selecting/changing emotional characteristics of input text indicates a response to at least one synthetic speech 704 including a request to change a speech style characteristic, i.e., tone, received by the TTS system at step 608 of Fig. 6]).

Kaszczuk does not disclose: inputting, by a processor of an information processing system, the plurality of sentences and the plurality of speech style characteristics into an artificial neural network text-to-speech synthesis model, so as to generate a plurality of synthetic speeches for the plurality of sentences that reflect the plurality of speech style characteristics.

Sung discloses: inputting, by a processor of an information processing system ([In view of the previously disclosed processor of an information processing system of Kaszczuk]), the plurality of sentences and the plurality of speech style characteristics into an artificial neural network text-to-speech synthesis model ([0250] Here, the artificial intelligence model may be an artificial intelligence model learned using at least one of artificial intelligence algorithms such as machine learning, neural networks, genes, deep learning, and classification algorithms, and in particular, may include at least one artificial neural network, [In view of the Fig. 5A speech synthesizer]), so as to generate a plurality of synthetic speeches for the plurality of sentences that reflect the plurality of speech style characteristics ([0065] In addition, the same text may be output as a 'calm style' speech and a 'friendly style' speech, respectively [Generating multiple styles for the same sentence tracks to generation of sentences reflecting a plurality of speech style characteristics in view of the TTS synthesis model disclosed in Kaszczuk, indicating that this method can be applied to multiple sentences simultaneously with no change in functionality]).

Kaszczuk and Sung are considered analogous art within speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk to incorporate the teachings of Sung, because of the novel way to synthesize a plurality of response sentences in differing styles to more accurately respond to a user's situational intent or feelings (Sung, [0005]).

Kaszczuk in view of Sung does not disclose: selecting, by the processor of the information processing system, at least one sentence as an inspection target from among the plurality of sentences by analyzing a behavior pattern of the first user that selects the plurality of speech style characteristics for the plurality of sentences; and, receiving, from the second terminal of the second user, a request to change at least one speech style characteristic corresponding to the at least one sentence, wherein the at least one sentence is selected as the inspection target in response to determining that the speech style characteristics selected for the at least one sentence is different from recommended speech style characteristics, determining that selection of the speech style characteristics for the at least one sentence is made in less than a predetermined time, or determining that selection of the speech style characteristics for the at least one sentence is frequently changed.

Czuczman discloses: selecting, by the processor of the information processing system ([In view of the previously disclosed processor of an information processing system of Kaszczuk]), at least one sentence as an inspection target from among the plurality of sentences by analyzing a behavior pattern of the first user that selects the plurality of speech style characteristics for the plurality of sentences ([Col. 6, Lines 55-62] Sentences may be prioritized in a number of different ways, for example how many different prosodic units are included in each sentence, the linguistic coverage of the sentence, the length of the sentence, how easy the sentence is to pronounce (which may be represented by a readability score), the number of new units in the sentence not already in the corpus, and/or a number of other factors. The overall priority may be a combination of such factors [Determining priority sentences, i.e., sentences to be sent to proofreaders, based on different prosodic units tracks to determining inspection sentences based on the behavior pattern of the first user (output from server 112); i.e., the speech style characteristics or prosodic units known in the system at the time of determining priority sentences are used to determine which sentences are different, and therefore have new prosodic units, i.e., behavior patterns]); and, receiving, from the second terminal of the second user ([In view of the second user and terminal of Kaszczuk]), a request to change at least one speech style characteristic corresponding to the at least one sentence ([Col. 8, Lines 25-30] In another aspect, the present system may offer proofreaders the ability to accept or reject text segments and may also offer the ability to edit text segments or to otherwise indicate why a text segment is rejected. For example, a proofreader may mark words that are incorrect, from another language, have confusing pronunciations, etc. [A proofreader, i.e., a second user, indicating reasons why a text segment is rejected tracks to a request to change at least one speech style, i.e., pronunciation, corresponding to the text segment, which can be a sentence, so that the text segment is not rejected]), wherein the at least one sentence is selected as the inspection target in response to determining that the speech style characteristics selected for the at least one sentence is different from recommended speech style characteristics, determining that selection of the speech style characteristics for the at least one sentence is made in less than a predetermined time, or determining that selection of the speech style characteristics for the at least one sentence is frequently changed ([Col. 6, Lines 50-60] Prior to sending a sentence to a proofreader for review, the system may prioritize (302) the selected sentences. The top priority sentences may be scheduled for review prior to the lower priority sentences. Sentences may be prioritized in a number of different ways, for example how many different prosodic units are included in each sentence, [Selecting priority sentences to be received before other text indicates the priority sentence is an inspection target, wherein priority is based on the number of prosodic units, indicating potentially frequently changing speech style characteristics if the total number of differing prosodic units is large. The examiner notes that due to the disjunctive nature of the claim, the other "determining…" steps do not require a mapping]).

Kaszczuk, Sung, and Czuczman are considered analogous art within speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk in view of Sung to incorporate the teachings of Czuczman, because of the novel way to introduce proofreaders to identify appropriately synthesized text and to use the approved sentences as templates for other similarly formatted sentences, reducing training time (Czuczman, [Col. 1, Lines 45-60]).

Regarding claim 6, Kaszczuk in view of Sung discloses: the method according to claim 1. Kaszczuk in view of Sung does not disclose: wherein the receiving, from the second terminal of the second user, the response to the at least one synthetic speech further includes receiving a marker indicating whether or not to use the at least one synthetic speech, in an area displaying at least one sentence associated with the at least one synthetic speech.

Czuczman discloses: wherein the receiving, from the second terminal of the second user, the response to the at least one synthetic speech further includes receiving a marker indicating whether or not to use the at least one synthetic speech, in an area displaying at least one sentence associated with the at least one synthetic speech ([Fig. 5], [A second user 10 responding as to whether or not speech is accepted, in view of the synthesized speech output of Kaszczuk in view of Sung, indicates whether or not to use that speech, i.e., yes or no, in view of the display of Kaszczuk determining whether input text needs to be edited or accepted. The results from the proofreader of Czuczman could be sent to the display of Kaszczuk with no change in functionality. Consider a situation in which the preliminary TTS results are those sent from the proofreader with the accompanying yes/no]).

Kaszczuk, Sung, and Czuczman are considered analogous art within speech synthesis.
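As mapped for claims 5 and 6, the selection turns on behavior-pattern heuristics (any one of three disjunctive triggers) plus an accept/reject marker from the reviewer. A minimal sketch under those readings; every field name and threshold here is an illustrative assumption, not drawn from the application or the cited art:

```python
from dataclasses import dataclass

@dataclass
class StyleSelection:
    chosen: str           # style the first user picked for the sentence
    recommended: str      # system-recommended style
    seconds_spent: float  # time taken to make the selection
    change_count: int     # how often the selection was revised

def is_inspection_target(sel: StyleSelection,
                         min_seconds: float = 2.0,
                         max_changes: int = 3) -> bool:
    """Claim-5 style disjunction: any one trigger suffices."""
    return (sel.chosen != sel.recommended          # differs from recommendation
            or sel.seconds_spent < min_seconds     # chosen suspiciously fast
            or sel.change_count > max_changes)     # frequently changed

@dataclass
class ReviewResponse:
    sentence_index: int
    use_speech: bool               # claim-6 marker: keep or reject the speech
    new_style: str | None = None   # optional requested style change

sel = StyleSelection(chosen="pleasant", recommended="slow", seconds_spent=1.2, change_count=5)
print(is_inspection_target(sel))  # True: all three triggers fire here
resp = ReviewResponse(sentence_index=2, use_speech=False, new_style="slow")
```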
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk in view of Sung to incorporate the teachings of Czuczman, because of the novel way to introduce proofreaders to identify appropriately synthesized text and to use the approved sentences as templates for other similarly formatted sentences, reducing training time (Czuczman, [Col. 1, Lines 45-60]).

Regarding claim 7, Kaszczuk in view of Sung, further in view of Czuczman discloses: the method according to claim 6. Czuczman further discloses: if the marker indicates that the at least one synthetic speech is not to be used, providing information on the at least one sentence associated with the at least one synthetic speech to the first terminal of the first user ([Col. 8, Lines 25-30] In another aspect, the present system may offer proofreaders the ability to accept or reject text segments and may also offer the ability to edit text segments or to otherwise indicate why a text segment is rejected [A rejected text segment tracks to one which should not be used; this information is sent to server 112 and could then reasonably be sent to the speech synthesis system of Kaszczuk in view of Sung, where the first user would be alerted not to use it]).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Kaszczuk in view of Sung, further in view of Morita et al. (US-20200066250-A1), hereinafter Morita.

Regarding claim 8, Kaszczuk in view of Sung discloses: the method according to claim 1. Kaszczuk further discloses: receiving a response for selecting at least one speech style characteristic from among the plurality of speech style characteristic candidates ([Col. 10, Lines 37-39] to receive feedback from a user on those results, and to incorporate that feedback to alter the speech synthesis results [Where feedback in Kaszczuk is given based on synthesized speech 124 having speech style characteristics]). Kaszczuk in view of Sung does not disclose: outputting a plurality of speech style characteristic candidates for each of the plurality of sentences.

Morita discloses: outputting a plurality of speech style characteristic candidates for each of the plurality of sentences ([Fig. 6], [0050] The screen 100 illustrated in FIG. 6 includes a text box 101 for inputting the text information to be subjected to speech synthesis; a pulldown menu 102 for selecting the speaker individuality to be used; slide bars 103a, 103b, and 103c for setting general voice quality parameters [Having a display consisting of a plurality of speaker identities 102 and a plurality of voice settings 103 indicates that the text output when the synthesizer button 104 is pressed will have a speech style characteristic candidate for a sentence, wherein there is no change to functionality in producing more samples with differing styles or words; see store button 105 for this purpose]).

Kaszczuk, Sung, and Morita are considered analogous art within speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk in view of Sung to incorporate the teachings of Morita, because of the novel way to enable creation of a plurality of speaker individualities while keeping track of existing profiles to prevent voice theft for increased security (Morita, [0004]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Kaszczuk in view of Sung, further in view of Morita, further in view of Knight et al. (US-9788777-B1), hereinafter Knight.

Regarding claim 9, Kaszczuk in view of Sung, further in view of Morita discloses: the method according to claim 8. Kaszczuk in view of Sung, further in view of Morita does not disclose: wherein the plurality of speech style characteristic candidates includes a recommended speech style characteristic candidate that is determined based on a result of analyzing the plurality of sentences.

Knight discloses: wherein the plurality of speech style characteristic candidates includes a recommended speech style characteristic candidate that is determined based on a result of analyzing the plurality of sentences ([Col. 12, Lines 35-45] In some examples, mood and/or emotion-based recommendation is used during a telephone call (e.g., as on-hold music). The music may be recommended by the example recommendation engine 155 based on a mood of a caller (as measured by the tone of voice and/or sensors), a type of caller (e.g., work colleague, family member, friend, etc.), the type of call (e.g., irate customer support, status inquiry, etc.), an environmental context, etc. [Making a recommendation about what kind of on-hold music to play based on the mood of the caller, gathered through tone and other speech style characteristics of previous sentences of the call, tracks to recommending a speech style characteristic, i.e., type of music, based on the context of how the caller is speaking]).

Kaszczuk, Sung, Morita, and Knight are considered analogous art within speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kaszczuk in view of Sung, further in view of Morita, to incorporate the teachings of Knight, because of the novel way to use pre-verbal utterances to build a classification model for improved emotional classification, leading to an improved user experience in responses informed by surrounding tone knowledge (Knight, [Col. 5, Lines 1-15]).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Park et al. (US-20210335381-A1) discloses "According to an embodiment of the present invention, there is provided an artificial intelligence (AI) apparatus for mutually converting a text and a speech, including: a memory configured to store a plurality of Text-To-Speech (TTS) engines; and a processor configured to: obtain image data containing a text, determine a speech style corresponding to the text, generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and output the generated speech" (abstract). See entire document.

Freud (US-20160365087-A1) discloses "A guide track based speech synthesis system and method that uses an imitator voice and extracted parameter from the imitator voice to enhance the speech synthesized by conventional approach using the library built from an original voice with performance idiosyncrasies, emotions, and characteristics. The imitator voice reads from an input script to recorded speech in substantially the same way as the original voice. The recorded speech is stored in a guide track. Prior recordings of audio from the original voice are used to build a voice library. Context features and prosodic features are extracted from the guide track and corrected. Spectral features which align with the context features and prosodic features of the guide track are generated from the voice library. The aligned acoustic features are then converted to a speech waveform of an enhanced synthetic voice" (abstract). See entire document.

Sato (US-8214216-B2) discloses "A simply configured speech synthesis device and the like for producing a natural synthetic speech at high speed. When data representing a message template is supplied, a voice unit editor (5) searches a voice unit database (7) for voice unit data on a voice unit whose sound matches a voice unit in the message template. Further, the voice unit editor (5) predicts the cadence of the message template and selects, one at a time, a best match of each voice unit in the message template from the voice unit data that has been retrieved, according to the cadence prediction result. For a voice unit for which no match can be selected, an acoustic processor (41) is instructed to supply waveform data representing the waveform of each unit voice. The voice unit data that is selected and the waveform data that is supplied by the acoustic processor (41) are combined to generate data representing a synthetic speech" (abstract). See entire document.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY, whose telephone number is (703) 756-1754. The examiner can normally be reached Monday-Friday, 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached on (571) 272-7516. The fax…

Prosecution Timeline

Feb 10, 2023
Application Filed
Jan 21, 2025
Non-Final Rejection — §103
May 28, 2025
Response Filed
Jul 11, 2025
Final Rejection — §103
Nov 17, 2025
Request for Continued Examination
Nov 26, 2025
Response after Non-Final Action
Dec 10, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591744: METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM (granted Mar 31, 2026; 2y 5m to grant)
Patent 12536994: APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF (granted Jan 27, 2026; 2y 5m to grant)
Patent 12475330: METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM (granted Nov 18, 2025; 2y 5m to grant)
Patent 12417759: SPEECH RECOGNITION USING CADENCE PATTERNS (granted Sep 16, 2025; 2y 5m to grant)
Patent 12412580: Sound Extraction System and Sound Extraction Method (granted Sep 09, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 44%
With Interview: 90% (+46.9%)
Median Time to Grant: 2y 11m
PTA Risk: High
Based on 23 resolved cases by this examiner. Grant probability derived from career allow rate.
