Last updated: May 29, 2026

Application No. 18/404,943

INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD FOR ARTIFICIAL SPEECH GENERATION

Non-Final OA §103

Filed

Jan 05, 2024

Priority

Jan 12, 2023 — EP 23151301.1

Examiner

NEWAY, SAMUEL G

Art Unit

2657

Tech Center

2600 — Communications

Assignee

Sony Group Corporation

OA Round

2 (Non-Final)

Interview Optional

This examiner has a 75% grant rate with a +7.7% interview lift. An interview has already been conducted in this application's prosecution history, so a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 688 resolved cases, 2023–2026

Examiner Intelligence

NEWAY, SAMUEL G

Grants 75% — above average

Career Allowance Rate

518 granted / 688 resolved

+13.3% vs TC avg
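
As a quick check on the figures above, the career allowance rate can be recomputed from the granted and resolved counts. This is a minimal sketch; it assumes that "+13.3% vs TC avg" means a difference in percentage points, which the page does not state, and the variable names are illustrative only.

```python
# Figures copied from the examiner summary above.
granted = 518
resolved = 688

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100 * granted / resolved

# Assumption: "+13.3% vs TC avg" is a percentage-point difference,
# so the implied Tech Center average is the rate minus 13.3.
implied_tc_avg = allowance_rate - 13.3

print(f"allowance rate: {allowance_rate:.1f}%")
print(f"implied TC average: {implied_tc_avg:.1f}%")
```

Under that assumption the rate is roughly 75.3%, which the page rounds to 75%.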

Interview Lift: +7.7% (moderate), comparing resolved cases without and with an interview

Typical timeline

3y 0m

Avg Prosecution

29 currently pending

Career history

718

Total Applications

across all art units

Statute-Specific Performance

§101: 9.7% (-30.3% vs TC avg)

§103: 66.8% (+26.8% vs TC avg)

§102: 7.3% (-32.7% vs TC avg)

§112: 12.1% (-27.9% vs TC avg)

Tech Center average is an estimate. Based on career data from 688 resolved cases.

Office Action

§103

DETAILED ACTION
This is responsive to the amendment filed 20 November 2025.
Claims 1-6, 8-16 and 18-20 are pending and considered below.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-6, 8-16 and 18-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Federico et al. (US 11,545,134) in view of Deyle et al. (US 2018/0077095).
Claim 1:
Federico discloses an information processing device for generating artificial speech data (Abstract), comprising circuitry configured to:
obtain, based on speech data, speech emotional indicators and associated timing data of the emotional indicators (“extract paralinguistic information (e.g. accent, pitch, volume, speech rate, modulation, and fluency) from the source utterances and prosodically aligned text and timing information to use to reproduce an equivalent or at least credible target utterance”, col. 7, lines 28-36, see also col. 9, lines 44-46); 
obtain, based on the speech data, text data (“speech recognizer 307 receives the utterances and transcribes them into a sequence of words including timing, punctuation and casing information”, col. 6, lines 41-43, see also col. 9, lines 33-36); 
obtain, based on video data associated with the speech data, video emotional indicators (“the paralanguage modeler 313 utilizes video to create and use … model prosody that is consistent with the visual”, col. 7, lines 33-36, see also “information from the video is used in the generation of prosody information”, col. 9, lines 46-48);
associate the speech emotional indicators and the video emotional indicators with the text data based on the associated timing data (“a prosodic aligner 312 temporally aligns the machine translator 309 output with the speech segments of the original audio. As such, the prosodic aligner 312 takes in the utterances (or text from a text file) and translated text (including timing) to match the distribution of words and pauses to generate prosodically aligned translated text and timing (such as splits, etc.). In some embodiments, pre-processed video information is also provided to the prosodic aligner 312 to use in this alignment”, col. 7, lines 5-15); and
generate artificial speech data based on the text data, the speech emotional indicators associated with the text data, the video emotional indicators associated with the text data, and the associated timing data (“speech generator 315 creates a speech signal that reproduces a given sentence with a specified timbre and prosody for text by attempting to match a specified time interval as provided by the machine translator. In particular, the speech generator 315 uses a ML speaker model corresponding to a speaker of each segment. The ML speaker model is selected based on the speaker label and then fed the prosody, text, and timing information for a corresponding segment to create a speech signal”, col. 8, lines 7-15, see also “information from the video is used in the generation of prosody information”, col. 9, lines 46-48).
Federico does not explicitly disclose that the video emotional indicators comprise at least one of facial expressions, eye direction, gestures, and body pose.
In an analogous art similarly generating artificial speech based on video emotional indicators (“adjusting speech output from a text-to-speech processor with inflections indicated by emotional metadata”, [0026]), Deyle discloses that the video emotional indicators comprise at least one of facial expressions, eye direction, gestures, and body pose (“Facial expression recognition module 222 may receive image data captured by one or more camera(s) 212. Such image data may include one or more still images and/or video that includes a user's face. The facial expression recognition module 222 may analyze such image data to detect facial expressions that are indicative of certain emotions. For instance, the facial expression recognition module 222 may detect emotion by analyzing the shape and/or position of a user eye or eye's (e.g., how open or closed the eye(s) are), and/or by analyzing the position of the user's mouth (e.g., smiling, frowning, or neither)”, [0039], see also “Body expression recognition module 224 may also receive image data captured by one or more camera(s) 212. Such image data may include one or more still images and/or video that includes a portion of the user's body (e.g., the upper half of the user's body), and possibly the entirety of the user's body. The body expression recognition module 224 may analyze such image data to detect “body language” that is indicative of certain emotions; certain gestures, movements, and/or positioning of the body or portions thereof, which is characteristic of certain emotions. For instance, certain hand gestures, head movements, arm movements, whole-body movements, and/or stances, may be considered to be indicative of certain emotional states. Accordingly, when such a gesture, movement, and/or positioning is detected, body expression recognition module 224 may generate emotion data that is indicative of the emotion or emotions associated with the detected gesture, movement, and/or positioning”, [0040]).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of determining Federico’s video emotional indicators using at least one of facial expressions, eye direction, gestures, and body pose as disclosed by Deyle because those physical characteristics satisfactorily convey human emotional states (see Deyle, [0039] and [0040]).
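
The timing-based association recited in claim 1 can be illustrated with a small sketch. It is not taken from the application, Federico, or Deyle; it only assumes that words and indicators each carry a start and end time and that an indicator attaches to every word whose interval it overlaps. All class, function, and label names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Indicator:
    label: str    # e.g. a pitch cue from speech or a facial expression from video
    start: float  # seconds
    end: float


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float
    speech_indicators: list = field(default_factory=list)
    video_indicators: list = field(default_factory=list)


def overlaps(word, indicator):
    """True when the word and indicator intervals share any span of time."""
    return word.start < indicator.end and indicator.start < word.end


def associate(words, speech, video):
    """Attach each indicator label to every word whose interval it overlaps."""
    for word in words:
        word.speech_indicators = [i.label for i in speech if overlaps(word, i)]
        word.video_indicators = [i.label for i in video if overlaps(word, i)]
    return words


# Hypothetical transcript with word timings and two indicators.
words = [Word("I", 0.0, 0.2), Word("am", 0.2, 0.4), Word("fine", 0.4, 0.9)]
speech = [Indicator("rising pitch", 0.4, 0.9)]
video = [Indicator("smile", 0.3, 1.0)]

for w in associate(words, speech, video):
    print(w.text, w.speech_indicators, w.video_indicators)
```

A speech generator would then consume each word together with its attached indicators and timing. The overlap rule is one possible design choice; nearest-timestamp matching would be another.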
Claim 2:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech emotional indicators are associated with the text data based on the associated timing data, and wherein the generation of the artificial speech data is based on the speech emotional indicators associated with the text data (Federico, col. 7, lines 28-36, see also col. 9, lines 44-46).
Claim 3:
Federico in view of Deyle discloses the information processing device according to claim 2, wherein the associated timing data are indicative of time intervals (Federico, col. 7, lines 5-11).
Claim 4:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech emotional indicators are obtained based on speech indicators of the speech data (Federico, col. 7, lines 28-36, see also col. 9, lines 44-46).
Claim 5:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech emotional indicators are at least one of: inferred emotion, speech tempo, speech pause, emotional pause, speech pitch, speech rhythm (Federico, col. 7, lines 28-36, see also col. 9, lines 44-46).
Claim 6:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the circuitry is further configured to: obtain the speech emotional indicators based on an artificial neural network, which is configured to determine the speech emotional indicators based on the speech data (Federico, “a ML-based paralanguage modeler”, col. 7, lines 28-36).
Claim 8:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech emotional indicators and the video emotional indicators are associated to each other, based on the associated timing data (Federico, col. 7, lines 33-36, see also col. 8, lines 7-15).
Claim 9:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech data is captured of a speaker (Federico, col. 6, lines 63-67).
Claim 10:
Federico in view of Deyle discloses the information processing device according to claim 1, wherein the speech data is captured of multiple speakers (Federico, col. 6, lines 63-67), and wherein the emotional speech indicators are obtained associated with each speaker based on the video data, and wherein the generation of the artificial speech is based on the speech emotional indicators associated with one speaker of the multiple speakers (Federico, col. 6, lines 63-67, see also col. 7, lines 33-36).
Claims 11-16 and 18-20:
Federico in view of Deyle discloses an information processing method for generating artificial speech data, comprising the steps performed by the information processing device of claims 1-6 and 8-10 as shown above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
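
The reply periods described above can be worked through with simple date arithmetic. This sketch assumes the mailing date is the Jan 30, 2026 entry shown in the prosecution timeline and uses a hypothetical `add_months` helper that clamps to the end of the month. It does not model weekends, federal holidays, or extension fees, so it is an illustration rather than a docketing calculation.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the same day-of-month `months` later, clamped to month end."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Assumption: mailing date taken from the timeline entry
# "Jan 30, 2026 / Final Rejection mailed".
mailing_date = date(2026, 1, 30)

two_month_mark = add_months(mailing_date, 2)        # first-reply window for the advisory action rule
shortened_period_end = add_months(mailing_date, 3)  # THREE-MONTH shortened statutory period
statutory_limit = add_months(mailing_date, 6)       # SIX-MONTH outer limit

print("two-month mark:", two_month_mark.isoformat())
print("shortened period ends:", shortened_period_end.isoformat())
print("statutory limit:", statutory_limit.isoformat())
```

Under these assumptions the three dates fall on Mar 30, Apr 30, and Jul 30, 2026.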
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMUEL G NEWAY whose telephone number is (571)270-1058. The examiner can normally be reached Monday-Friday 9:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SAMUEL G NEWAY/Primary Examiner, Art Unit 2657


Prosecution Timeline


Sep 29, 2025

Interview Requested

Oct 14, 2025

Examiner Interview Summary

Oct 14, 2025

Applicant Interview (Telephonic)

Nov 20, 2025

Response Filed

Jan 30, 2026

Final Rejection mailed — §103

Mar 27, 2026

Response after Non-Final Action

Apr 16, 2026

Request for Continued Examination

Apr 19, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/161,767

Patent 12619834

SYSTEMS AND METHODS FOR INTENT CLASSIFICATION IN A NATURAL LANGUAGE PROCESSING AGENT

3y 3m to grant Granted May 05, 2026

18/489,772

Patent 12613789

ARTIFICIAL INTELLIGENCE BASED GENERATION OF DATA CONNECTORS

2y 6m to grant Granted Apr 28, 2026

18/441,889

Patent 12608561

STRUCTURED DOCUMENT GENERATION USING DOCUMENT-SCALE EMBEDDINGS

2y 2m to grant Granted Apr 21, 2026

18/736,727

Patent 12608554

Method And System For Understanding Medical Chinese Spoken Language, Electronic Device, And Storage Medium

1y 10m to grant Granted Apr 21, 2026

18/067,086

Patent 12602538

METHOD AND SYSTEM FOR EXEMPLAR LEARNING FOR TEMPLATIZING DOCUMENTS ACROSS DATA SOURCES

3y 4m to grant Granted Apr 14, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

2-3

Expected OA Rounds

75%

Grant Probability

83%

With Interview (+7.7%)

3y 0m (~8m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 688 resolved cases by this examiner. Grant probability derived from career allowance rate.
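
The "With Interview" projection appears to be the base grant probability plus the interview lift. The sketch below assumes the lift is additive in percentage points, which the page implies but does not state; the variable names are illustrative.

```python
# Values copied from the projections above.
base_grant_probability = 75.0  # percent, derived from the career allowance rate
interview_lift = 7.7           # assumed to be additive percentage points

with_interview = base_grant_probability + interview_lift

print(f"with interview: {with_interview:.1f}% (displayed as {round(with_interview)}%)")
```

This reproduces the 83% figure after rounding.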