Prosecution Insights
Last updated: April 19, 2026
Application No. 18/683,786

SPEECH SYNTHESIS APPARATUS, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM

Final Rejection — §102, §103
Filed: Feb 15, 2024
Examiner: LAM, PHILIP HUNG FAI
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: The University of Tokyo
OA Round: 2 (Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 8m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (above average; +20.9% vs TC avg), 107 granted / 129 resolved
Interview Lift: strong, +45.5% among resolved cases with interview
Typical Timeline: 2y 8m average prosecution; 29 currently pending
Career History: 158 total applications across all art units
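
The headline figures follow directly from the raw counts above; a quick sanity check (an editor's sketch, not part of the dashboard):

```python
# Reproduce the dashboard's headline figures from the raw counts shown above.
granted, resolved, pending = 107, 129, 29

allow_rate = granted / resolved          # 107 / 129 = 0.829... -> "83%"
print(f"Career allow rate: {allow_rate:.0%}")

total_applications = resolved + pending  # 129 + 29 = 158
print(f"Total applications: {total_applications}")
```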

Statute-Specific Performance

§101: 23.7% (-16.3% vs TC avg)
§103: 53.7% (+13.7% vs TC avg)
§102: 11.1% (-28.9% vs TC avg)
§112: 5.3% (-34.7% vs TC avg)
Based on career data from 129 resolved cases; Tech Center averages are estimates.

Office Action

§102, §103
DETAILED ACTION

Introduction

This office action is in response to Applicant’s amended submission filed on 12/17/2025.

Response to Amendment and Arguments

35 U.S.C. 112 Rejections: The amendment and argument are persuasive; therefore, the rejection has been withdrawn.

35 U.S.C. 103 Rejections: Applicant’s arguments are moot in view of the new or modified grounds of rejection that were necessitated by the amendments to the claims. Applicant’s arguments are directed to material that was added by the most recent amendments to the independent claims. Response, p. 8.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-2 and 6-8 are rejected under 35 U.S.C. 102(a)(1) and (a)(2) as being anticipated by Kingsbury (US 20190043474).

Regarding Claim 1, Kingsbury discloses:

A speech synthesis apparatus comprising: a memory; and a processor coupled to the memory and configured to: ([0009] In accordance with embodiments herein, a device is provided that is comprised of a processor and a memory storing program instructions accessible by the processor.)

obtain (i) utterance information on texts contained in specific book data of a book; ([0009] Responsive to execution of the instructions, the processor analyzes textual content to identify narratives for associated content segments of the textual content; designates character models for the corresponding narratives; and generates an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.) Also see para 0034.

obtain (ii) image information on images that are contained in the specific book data of the book; ([0024] The term “textual content” refers to any and all textual, graphical, image or video information or data that may be converted to corresponding audio information or data. The textual content may represent various types of incoming and outgoing textual, graphical, image and video content including, but not limited to, electronic books, electronic stories, correspondence, email, webpages, technical material, text messages, social media content, alerts, advertisements and the like.)

obtain (iii) speech data of a reading of the text contained in the specific book data of the book; ([0009] Responsive to execution of the instructions, the processor analyzes textual content to identify narratives for associated content segments of the textual content; designates character models for the corresponding narratives; and generates an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.) Also see para 0027.

and generate, based on the obtained (i) utterance information, the obtained (ii) image information, and the obtained (iii) speech data, all of which are obtained based on same specific book data of the book, a speech synthesis model for reading out a text associated with an image. ([0009] Responsive to execution of the instructions, the processor analyzes textual content to identify narratives for associated content segments of the textual content; designates character models for the corresponding narratives; and generates an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.) Also see paras 0027 and 0034.

Regarding Claim 2, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury further discloses:

wherein the processor configured to obtain, as the image information, information on an image that is contained in a specific page of the book and that is associated with one of the texts that is contained in the specific page of the book. ([0024] The term “textual content” refers to any and all textual, graphical, image or video information or data that may be converted to corresponding audio information or data. The textual content may represent various types of incoming and outgoing textual, graphical, image and video content including, but not limited to, electronic books, electronic stories, correspondence, email, webpages, technical material, text messages, social media content, alerts, advertisements and the like. [0085] The camera unit 616 may capture one or more frames of image data.)

Claim 6 recites a method that corresponds to the apparatus of claim 1 and is therefore rejected on the same grounds as claim 1 above.

Claim 7 recites a non-transitory computer-readable storage medium that corresponds to the apparatus of claim 1 and is therefore rejected on the same grounds as claim 1 above. Kingsbury discloses: ([0097] implemented as hardware with associated instructions (for example, software stored on a tangible and non-transitory computer readable storage medium, such as a computer hard drive, ROM, RAM, or the like) that perform the operations described herein.)

Regarding Claim 8, Kingsbury discloses:

A speech synthesis apparatus comprising: a memory; and a processor coupled to the memory and configured to: ([0009] In accordance with embodiments herein, a device is provided that is comprised of a processor and a memory storing program instructions accessible by the processor.)

obtain (i) utterance information on a text contained in data of a book; ([0009] Responsive to execution of the instructions, the processor analyzes textual content to identify narratives for associated content segments of the textual content; designates character models for the corresponding narratives; and generates an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.) Also see para 0034.

obtain (ii) image information on an image contained in the data of the book, wherein the image information corresponds to the text contained in the data of the book; ([0024] The term “textual content” refers to any and all textual, graphical, image or video information or data that may be converted to corresponding audio information or data. The textual content may represent various types of incoming and outgoing textual, graphical, image and video content including, but not limited to, electronic books, electronic stories, correspondence, email, webpages, technical material, text messages, social media content, alerts, advertisements and the like.)

acquire a synthesized speech corresponding to the text by inputting the obtained (i) utterance information and the obtained (ii) image information to a speech synthesis model for reading out a text that is associated with an image, and the speech synthesis model having been generated based upon utterance information, image information, and speech data, all of which were obtained based on a same specific book data of a specific book. ([0009] Responsive to execution of the instructions, the processor analyzes textual content to identify narratives for associated content segments of the textual content; designates character models for the corresponding narratives; and generates an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.) Also see paras 0027 and 0034.
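
To make the claim structure easier to follow, here is an editor's sketch of the training and synthesis interfaces recited in independent claims 1 and 8. This is not code from the application or the cited art; every name below is hypothetical.

```python
# Hypothetical interface mirroring the data flow of claims 1 and 8:
# train on (i) utterance info, (ii) image info, (iii) speech data from
# the same book; synthesize from (i) and (ii) alone.
from dataclasses import dataclass

@dataclass
class BookExample:
    utterance_info: str   # (i) utterance information on a text in the book
    image_info: bytes     # (ii) image information associated with that text
    speech_data: bytes    # (iii) recorded reading of that text

class SpeechSynthesisModel:
    def fit(self, examples: list[BookExample]) -> "SpeechSynthesisModel":
        # Claim 1: generate the model from (i), (ii), and (iii), all
        # obtained from the same specific book data.
        ...
        return self

    def synthesize(self, utterance_info: str, image_info: bytes) -> bytes:
        # Claim 8: acquire synthesized speech by inputting (i) and (ii)
        # into the trained model.
        ...
```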
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Kingsbury in view of Li (US 20200027454).

Regarding Claim 3, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the feature recited below.

Li (in the related field of text and voice information processing) discloses:

wherein the processor configured to obtain, as the speech data, data of speech reading out one of the texts that is contained in a specific page of the book and that is associated with an image contained in the specific page of the book. ([0048] When attending a conference, to better record meeting content, the user may not only obtain the target picture by using the terminal, but also obtain the voice file corresponding to the text information in the target picture. The voice file is in one-to-one correspondence with the text information.) [Also see fig. 4, where the text describing the image appears on the same page.]

Kingsbury and Li are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Kingsbury with the teaching of Li to disclose the above feature, because a voice associating text with an image helps the reader/viewer understand the content, especially readers who may be visually impaired (Li, [0048]).

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Kingsbury in view of Mitsui (US 20170345412).

Regarding Claim 4, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the feature recited below.

Mitsui (in the related field of processing speech) discloses:

wherein the processor is configured to obtain the utterance information presenting at least one of accents, ([0081] The applicable segment searching unit 108 may compare an accent phrase obtained by dividing input utterance information (hereinafter referred to as an input accent phrase) with an accent phrase obtained by dividing original-speech utterance information (hereinafter referred to as an original-speech accent phrase).) [The claim only requires one of the recited elements.]

Kingsbury and Mitsui are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Kingsbury with the teaching of Mitsui to disclose the above feature, because when generating synthetic speech, matching against a pre-segmented accent phrase library allows the system to produce more natural-sounding output, as it can select segments that best match the desired prosodic features (Mitsui, [0081]).

Claims 5, 9, and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Kingsbury in view of Ma (already of record).

Regarding Claim 5, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.
Ma discloses:

wherein the processor further configured to: convert the utterance information into language vectors, wherein each language vector represents linguistic information on a corresponding text of the texts; ([Sect. 3.1, Modality-specific encoders] Text encoder: We process text into a sequence of 128-D character-level embeddings via a 66-symbol trainable lookup table. We then feed each of the embeddings into two fully-connected (FC) layers.) [Text reads on utterance information, as utterance information comes from text.] Also see fig. 3 for the architecture of the modality-specific encoders and decoders, reproduced in the original action for convenience of viewing. [Figure image omitted here.]

convert the image information into visual feature vectors, wherein each visual feature vector represents a visual feature of a corresponding image contained in the specific book data of the book; ([Sect. 3.1, Modality-specific encoders] Image encoder: We feed images to a three-layer CNN and perform max-pooling to obtain the output e_img ∈ R^512.) Also see fig. 3 for the architecture of the modality-specific encoders and decoders.

and generate the speech synthesis model using training data containing the speech data that is associated with the language vectors and the visual feature vectors. ([Sect. 3.2, Training of the memory fusion module] Therefore, we solve the cross-modal generation tasks as provided by two paired datasets. Specifically, we aim to reconstruct x_j^A from u_j^A and x_j^B from u_j^B in a cross-modal fashion. We define our loss as: see equations (3)-(6). Sect. 3.3, Modality-Specific Decoders, and Sect. 3.4, Learning Objective and Optimization, also disclose further details about the training.)

Kingsbury and Ma are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Kingsbury with the teaching of Ma to disclose the above feature, because the technique described by Ma improves performance on traditional cross-modal generation, suggesting that it improves data efficiency in solving individual tasks (Ma, [Abstract]).
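
Because the rejections of claims 5 and 9 turn on Ma's modality-specific encoders, a minimal sketch may help visualize the cited architecture. Only the quoted facts are fixed (a 66-symbol trainable lookup table producing 128-D character embeddings followed by two FC layers, and a three-layer CNN max-pooled to a 512-D image vector); the hidden widths, kernel sizes, and activations below are the editor's assumptions, not taken from Ma.

```python
# Editor's illustrative sketch of the encoders quoted from Ma (Sect. 3.1).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, n_symbols=66, emb_dim=128, hidden=256):
        super().__init__()
        self.lookup = nn.Embedding(n_symbols, emb_dim)  # 66-symbol trainable lookup table
        self.fc1 = nn.Linear(emb_dim, hidden)           # first FC layer (width assumed)
        self.fc2 = nn.Linear(hidden, hidden)            # second FC layer

    def forward(self, char_ids):                        # char_ids: (batch, seq_len)
        x = self.lookup(char_ids)                       # (batch, seq_len, 128)
        return self.fc2(torch.relu(self.fc1(x)))        # per-character features

class ImageEncoder(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                       # three conv layers (sizes assumed)
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, out_dim, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)             # max-pool to e_img in R^512

    def forward(self, images):                          # images: (batch, 3, H, W)
        return self.pool(self.cnn(images)).flatten(1)   # (batch, 512)
```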
Regarding Claim 9, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.

Ma discloses:

wherein the speech synthesis model includes: a first input configured to receive a language vector derived from the utterance information; ([Sect. 3.1, Modality-specific encoders] Text encoder: We process text into a sequence of 128-D character-level embeddings via a 66-symbol trainable lookup table. We then feed each of the embeddings into two fully-connected (FC) layers.) [Text reads on utterance information, as utterance information comes from text.]

an encoder layer configured to encode the language vector; (see Sect. 3.1 and fig. 3)

a second input configured to receive a visual feature vector derived from the image information; ([Sect. 3.1, Modality-specific encoders] Image encoder: We feed images to a three-layer CNN and perform max-pooling to obtain the output e_img ∈ R^512.) Also see fig. 3 for the architecture of the modality-specific encoders and decoders.

a visual information extraction layer configured to extract features from the visual feature vector; ([Sect. 3.1, Modality-specific encoders] Image encoder: We feed images to a three-layer CNN and perform max-pooling to obtain the output e_img ∈ R^512.) Also see fig. 3 for the architecture of the modality-specific encoders and decoders.

and a decoder layer configured to receive an output from the encoder layer and an output from the visual information extraction layer to generate speech features. (See fig. 3 above.)

The rationale for the combination would be similar to the one already provided.

Regarding Claim 12, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.

Ma discloses:

wherein the processor is configured to convert the image information into a visual feature vector using a neural network for image identification that has been trained previously. ([Sect. 3.1, Modality-specific encoders] Image encoder: We feed images to a three-layer CNN and perform max-pooling to obtain the output e_img ∈ R^512.) Also see fig. 3 for the architecture of the modality-specific encoders and decoders.

The rationale for the combination would be similar to the one already provided.

Regarding Claim 13, Kingsbury discloses all the limitations of Claim 12 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.

Ma discloses:

wherein the processor is configured to use an output of an intermediate layer of the neural network as the visual feature vector. ([Sect. 3.1, Modality-specific encoders] Image encoder: We feed images to a three-layer CNN and perform max-pooling to obtain the output e_img ∈ R^512.) Also see fig. 3 for the architecture of the modality-specific encoders and decoders.

The rationale for the combination would be similar to the one already provided.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Kingsbury in view of Ahmed (US 20240127832).

Regarding Claim 10, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.

Ahmed discloses:

wherein the speech data includes speech parameters including a basic frequency and spectral parameters including a mel-spectrogram. ([0538] These linguistics features (bitstream 3) may then be converted, e.g. through an acoustic model, to acoustics features, like MFCCs, fundamental frequency, mel-spectrogram for example, or a combination of those. This operation may be performed by a preconditioning layer 710c, which may be either deterministic or learnable.)

Kingsbury and Ahmed are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Kingsbury with the teaching of Ahmed to disclose the above feature, because the technique described by Ahmed improves human perception by using a mel-spectrogram, leading to higher-quality and more natural sound (Ahmed, [0538]).
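
Claim 10's "basic frequency" and mel-spectrogram are standard acoustic features. The following editor's sketch shows how such features are commonly extracted; the librosa library appears nowhere in the record, and all parameter values here are assumptions for illustration only.

```python
# Editor's sketch (not from the cited references): extract the acoustic
# features named in Ahmed [0538] -- fundamental frequency and mel-spectrogram.
import librosa
import numpy as np

y, sr = librosa.load("reading.wav", sr=22050)      # hypothetical input file

# Mel-spectrogram: the spectral parameters recited in claim 10.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)                 # shape: (80, n_frames)

# Fundamental frequency ("basic frequency" in the claim), via pYIN.
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, hop_length=256)

print(log_mel.shape, np.nanmean(f0))
```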
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kingsbury in view of Lee (US 20210027168).

Regarding Claim 11, Kingsbury discloses all the limitations of Claim 1 (see detailed mapping above). Kingsbury does not explicitly disclose the features cited below.

Lee discloses:

wherein the processor is configured to convert the utterance information into a language vector using a one-hot expression, and a number of dimensions of the one-hot expression corresponds to a number of characters contained in the utterance information. ([0043] When input data is input, the processor 120 may represent the input data as a vector (a matrix or tensor). Here, the method of representing the input data as a vector (a matrix or tensor) may vary depending on the type of the input data. For example, if a text (or a text converted from a user voice) is input as input data, the processor 120 may represent the text as a vector through One hot Encoding, or represent the text as a vector through Word Embedding. Here, the One hot encoding is a method in which only the value of the index of a specific word is represented as 1 and the value of the remaining index is represented as 0, and the Word Embedding is a method in which a word is represented as a real number in the dimension (e.g., 128 dimensions) of a vector set by a user. As an example of the Word Embedding method, Word2Vec, FastText, Glove, etc. may be used.)

Kingsbury and Lee are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Kingsbury with the teaching of Lee to disclose the above feature, because the technique described by Lee has the advantage of speed and simplicity, as one-hot encoding uses low computational resources for simple tasks (Lee, [0043]).
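
The one-hot expression of claim 11, with dimensionality tied to the characters contained in the utterance information, can be sketched in a few lines. The helper below and its character-level reading of the claim are the editor's illustration, not Lee's code.

```python
# Editor's sketch of a one-hot expression whose dimensionality equals the
# number of distinct characters in the utterance information (one reading
# of claim 11). The helper name is hypothetical.
import numpy as np

def one_hot_language_vector(text: str) -> np.ndarray:
    """Return a (len(text), n_chars) one-hot matrix for `text`."""
    alphabet = sorted(set(text))            # dimensions = characters contained
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = np.zeros((len(text), len(alphabet)))
    for pos, ch in enumerate(text):
        vectors[pos, index[ch]] = 1.0       # 1 at the character's index, 0 elsewhere
    return vectors

v = one_hot_language_vector("once upon a time")
print(v.shape)   # (16, 11): 16 characters, 11 distinct symbols
```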
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.

Wang (US 20210191506) discloses: “In some embodiments, generator 210 may generate a multimodal affective expression that may be a combination of a text, a voice, a symbol, a facial expression, and any other relevant modality.” See Abstract and paras 0046 and 0054 for additional details.

Gopala (US 20210074260) discloses: “system 700 may analyze the inputs. For example, open-face parameters may be extracted from the one or more images 710 by facial analysis engine (FAE) 716, additional parameters may be determined by convolutional neural network (CNN) 718, and audio features (such as text, an emotional state, etc.) may be extracted by audio feature extractors (AFE) 720 from voice clip 712 and/or synthetic audio 714. Next, a multi-modal fusion engine (MMFE) 722 may combine these determined features to determine a representation for frame x_t.” See Abstract, para 0134, and figs. 2-5 and 7-9 for additional details.

Breazeal (US 20180133900) discloses: “a social robot may employ multiple output modes for expression, an important use case is to supplement spoken semantic meaning with reinforcing multi-modal cues such as visuals, sounds, lighting and/or movement (e.g., via use of jiboji). The on-screen content can be a static graphic or an animation, or it could be supplemented with other multi-modal cues. As an example, the spoken word “pizza” may connote the same meaning as an image of a pizza. For instance, the robot might say “John wants to know if you want pizza for dinner” where an icon of a pizza appears on screen when the robot says “pizza”. Alternatively, the robot may put text on the screen “John wants to know if you want [graphic pizza icon] for [graphic dinner place setting icon].” Text display on a screen may be derived from text in a TTS source file and may be used for display as well as speech generation contemporaneously.” See Abstract and para 0141 for additional details.

Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Philip H Lam, whose telephone number is (571) 272-1721. The examiner can normally be reached 9 AM-3 PM Pacific Time.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656

Prosecution Timeline

Feb 15, 2024
Application Filed
Sep 20, 2025
Non-Final Rejection — §102, §103
Nov 26, 2025
Interview Requested
Dec 12, 2025
Examiner Interview Summary
Dec 12, 2025
Applicant Interview (Telephonic)
Dec 17, 2025
Response Filed
Jan 23, 2026
Final Rejection — §102, §103
Mar 31, 2026
Interview Requested
Apr 10, 2026
Examiner Interview Summary
Apr 10, 2026
Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591626
SEARCH STRING ENHANCEMENT
2y 5m to grant • Granted Mar 31, 2026
Patent 12572735
DOMAIN-SPECIFIC DOCUMENT VALIDATION
2y 5m to grant • Granted Mar 10, 2026
Patent 12572747
MULTI-TURN DIALOGUE RESPONSE GENERATION WITH AUTOREGRESSIVE TRANSFORMER MODELS
2y 5m to grant • Granted Mar 10, 2026
Patent 12562158
ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF
2y 5m to grant • Granted Feb 24, 2026
Patent 12561194
ROOT CAUSE PATTERN RECOGNITION BASED MODEL TRAINING
2y 5m to grant • Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+45.5% lift)
Median Time to Grant: 2y 8m
PTA Risk: Moderate
Based on 129 resolved cases by this examiner. Grant probability derived from career allow rate.
