Prosecution Insights
Last updated: April 19, 2026
Application No. 18/741,267

ELECTRONIC DEVICE AND CONTROL METHOD THEREOF

Non-Final OA §103
Filed
Jun 12, 2024
Examiner
CASTILLO-TORRES, KEISHA Y
Art Unit
2659
Tech Center
2600 — Communications
Assignee
Kia Corporation
OA Round
1 (Non-Final)
Prediction

Grant probability: 74% (Favorable)
Expected OA rounds: 1-2
Estimated time to grant: 3y 0m
Grant probability with interview: 99%

Examiner Intelligence

Career allowance rate: 74% (80 granted / 108 resolved), above average (+12.1% vs Tech Center average)
Interview lift: +30.5% (allowance rate among resolved cases with vs. without an interview)
Typical timeline: 3y 0m average prosecution; 32 applications currently pending
Career history: 140 total applications across all art units
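The headline figures above are simple ratios over the examiner's resolved cases. A minimal sketch of the arithmetic, assuming the reported "+12.1% vs TC avg" delta is in percentage points (the helper name is illustrative, not part of any analytics API):

```python
# Career allowance rate from raw disposition counts (80 granted, 108 resolved,
# as reported above). Helper name is illustrative only.

def allowance_rate(granted: int, resolved: int) -> float:
    """Allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

career = allowance_rate(80, 108)
print(round(career, 1))          # 74.1, reported as 74%

# Under the percentage-point assumption, the implied Tech Center
# average allowance rate would be:
tc_avg = career - 12.1
print(round(tc_avg, 1))          # 62.0
```

This is only a reconstruction of how the dashboard numbers relate to each other; the report itself does not state the Tech Center average directly.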

Statute-Specific Performance

§101: 26.2% (-13.8% vs TC avg)
§103: 42.9% (+2.9% vs TC avg)
§102: 15.1% (-24.9% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)

Tech Center averages are estimates. Based on career data from 108 resolved cases.
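Assuming each delta above is in percentage points, the statute-specific figures are mutually consistent: subtracting each delta from its rate recovers the same Tech Center average estimate for every statute. A small sketch with the values copied from the table (nothing here comes from an external data source):

```python
# Statute-specific rates and deltas vs the Tech Center average, as reported
# above. Back-calculating rate - delta for each statute should recover the
# same implied TC average estimate.

rates  = {"101": 26.2, "103": 42.9, "102": 15.1, "112": 8.8}
deltas = {"101": -13.8, "103": 2.9, "102": -24.9, "112": -31.2}

implied_tc_avg = {s: round(rates[s] - deltas[s], 1) for s in rates}
print(implied_tc_avg)   # every statute implies the same 40.0% TC average
```

In other words, the comparison baseline appears to sit at about 40% for all four statutes, which is why §103 is the only statute where this examiner performs above it.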

Office Action

§103
DETAILED ACTION

Claims 1-20 of the instant application are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 06/12/2024 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Priority

Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. 10-2024-0014163, filed in the Korean Intellectual Property Office on January 30, 2024.

Specification

The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Claim Objections

Claims 1, 11, and 20 are objected to because of the following informalities: the claims recite “a first word segment being a first unit on grammar” and “a second word segment being a second unit on grammar” on lines 5-6 of claim 1 and on lines 2-3 of claims 11 and 20. The Examiner notes that “unit on grammar” should read “grammatical unit” or “unit of grammar.” Appropriate correction is required.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.
Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 9, 11-12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, published online 16 November 2019) in view of Aguayo et al. (US 20240135927 A1).

As to independent claim 1, Pérez-Mayos et al. teaches:

1. An electronic device (see ¶ 1-3 of 3 Experimental Framework: “Since experimenting with humanoid robots is costly and potentially dangerous—accidental falls or unexpected movements may damage the robot itself and/or its surroundings—we conducted our experiments using a simulation software, able to model the physics that control the robot limitations of movement and speed. Both the technical specifications and the environment for gesture generation are described in this section.” “3.1 Technical Specifications: For the implementation of our model, we used REEMC, a full-size biped humanoid robotics research platform developed by PAL Robotics. The robot weighs 80 kg and is 165 cm tall. It has 44 degrees of freedom, offering applications for walking, navigation, grasping, face and speech recognition, and running over Ubuntu Linux 12.04 LTS and is ROS Hydro and OROCOS compatible.
The ROS (Robot Operating System) is an open-source framework for robot software development, which provides standard operating system services such as hardware abstraction, low-level device control, implementation of commonly used functionality, message-passing between processes, and package management. Pyttsx and eSpeak constitute the speech processing framework. Pyttsx is a cross-platform Python wrapper for text-to-speech synthesis that relies on the default speech synthesiser on each OS. For Ubuntu, it uses eSpeak, a compact open source software speech synthesiser written in C which supports SSML (Speech Synthesis Markup Language) and HTML.”), comprising:

identify a first word segment being a first unit on grammar and a second word segment being a second unit on grammar from a target sentence included in a corpus (see Fig. 3 (PoS-based approach architecture. A motion sequence is computed assigning an emblematic gesture to each keyword found on the text and a beat gesture to each content-word with no associated keyword) and ¶ 1 of 4.1 Part-of-Speech (PoS)-Based Approach: “Inspired by the semantic and pragmatic synchronization rules suggested by [21], which state that co-occurring gestures and speech relate to the same idea unit and to the same pragmatic function, we used a shallow parsing technique to analyse the input text, identify certain keywords and fire events for the robot to perform gestures related to those keywords.” ¶ 1-2 of 4.2 Prosody-Based Approach: “… according to all the literature being conducted that states that gestures and speech can be synchronized through prosody cues, we designed a prosody-based approach to synchronize our gestures with the speech. In this approach, we decided to use just beat gestures, instead of mixing beat and emblematic gestures, to better appreciate the difference. Speech prosody is conveyed by the intonation, rhythm, and stress elements. In this work, we used the pitch value to model intonation.
The goal was to align beat gestures with speech prosody, more precisely with the pitch. The main idea was to assign each beat gesture a certain prosody curve, in order to find the beat gestures sequence that better matches the speech prosody. For each beat gesture, we stored the name and a list of poses pitch and time points, as a representation of the prosody of the gesture. For example, a certain beat gesture where we move both hands up and down twice to highlight certain parts of our speech (e.g. “We should do it quick and well”, where we want to highlight the words “quick” and “well”), could be described with the sequence of time points and pitch [(250, 3), (500, 9), (750, 3), (1000, 9)], where the time points 500 ms, 1000 ms correspond to the two hands reaching the lower space point, accompanying the words “quick” and “well”).” ); determine a target phrase including the first word segment and the second word segment, (see Fig. 2, ¶ 1 of 4.1 Part-of-Speech (PoS)-Based Approach, and ¶ 1-2 of 4.2 Prosody-Based Approach citations as in limitation above. More specifically: ¶ 2 of 4.2 Prosody-Based Approach: “… For example, a certain beat gesture where we move both hands up and down twice to highlight certain parts of our speech (e.g. “We should do it quick and well”, where we want to highlight the words “quick” and “well”), could be described with the sequence of time points and pitch [(250, 3), (500, 9), (750, 3), (1000, 9)], where the time points 500 ms, 1000 ms correspond to the two hands reaching the lower space point, accompanying the words “quick” and “well”).”), determine a change in voice tone in the target phrase, based on information about a pitch contour of voice data obtained by applying the target sentence to a text-to-speech model and an utterance time of each word segment included in the target phrase (see Fig. 1 (Experimental framework overview. Top-left modules correspond to the speech synthesis software, responsible of the TTS and voice control. 
Top-right modules correspond to gestures design and simulation software. Bottom modules correspond to ROS packages involved in the process: movement control and speech and gesture synchronization), Fig. 4 (Plot of the pitch values for demo text), Fig. 5 (Prosody-based approach architecture. A prosody curve is manually described for each beat gesture in the gestures database. Then, a motion sequence is computed by assigning the gesture with the most similar form to each pitch peak of the text to be spoken. No emblem gestures are performed), and Fig. 6 (Combined approach architecture. A prosody curve is manually described for each beat gesture in the gestures database. Then, a motion sequence is computed by assigning an emblematic gesture to each keyword found on the text, and assigning the gesture with the most similar form to each pitch peak of the text to be spoken), and ¶ 2 of 4.1 Part-of-Speech (PoS)-Based Approach: “…Finally, for each uttered word, an event is fired by Pyttsx with the word that is about to be spoken. If that word is in our motion sequence, we command the robot to perform the appropriate gesture.” ¶ 3 of 4.2 Prosody-Based Approach: “We used mbrola voices18 to be able to extract the pitch and time points pairs from the SSML text. Figure 4 shows a plot of the pitch values for the demo text Hello! My name is Reem and I am here to show you that it is possible to use a prosody-based approach to speech and gesture synchronization. As you see, I will keep moving while I talk according to the pitch accent of the text. Have a nice day! Bye!. 
Then, we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph, randomly choosing if different motions were similar enough, and cleaned that sequence to delete gestures that overlapped with other gestures.” and ¶ 2 of 4.3 Combined Approach: “This approach works exactly as explained for the prosody-based approach, with one difference: when the synthesizer is about to pronounce words associated with keywords, an event is fired and the proper emblematic gesture is sent to the robot to perform, as we did for the first approach…”).

However, Pérez-Mayos et al. does not explicitly teach, but Aguayo et al. does teach:

one or more processors (see ¶ [0027]: “Implementations of the systems, devices, and methods disclosed herein may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed herein…”); and a storage medium storing computer-readable instructions that, when executed by the one or more processors, enable the one or more processors (see ¶ [0027]: “…Computer-readable media that store computer-executable instructions are computer storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, implementations of the disclosure can comprise at least two distinctly different kinds of computer-readable media: computer storage media (devices) and transmission media.”) determine a target phrase including the first word segment and the second word segment, based on comparing a first pause time between the first word segment and the second word segment and a threshold time (see ¶ [0050]: “…The speech begins with a wake-up key phrase “hey robot”, followed by a period of no voice activity detection (VAD) 22.
The system chooses a NVAD cut-off period of 5 seconds. Next, the system detects voice activity and proceeds to receive words, “what's the weather”, during which time there is no complete parse and so the system chooses a NVAD cut-off period of 2 seconds. Next, there is a pause in the speech 24, during which time there is no VAD, but a complete parse. Since there is a complete parse, the system chooses a shorter NVAD period of 1 second. Next, the speech continues, so there is VAD but again no complete parse, so the system returns to a NVAD cut-off period of 2 seconds. Finally, is another period of silence 26, during which there is no VAD, but a complete parse, so the system chooses a NVAD period of 1 second.” ¶ [0053]: “Some embodiments use a continuously adaptive algorithm to continuously adapt the NVAD cut-off period. Some such embodiments gradually decrease one or more NVAD cut-off periods, such as by 1% of the NVAD cut-off period each time there is a cut-off, and, if the speaker continues a sentence after a partial period threshold, such as 80%, the NVAD cut-off period, the NVAD cut-off period increases, such as by 5% for each such occurrence of a user continuing a sentence. Some embodiments increase the NVAD cut-off period in proportion to the amount of time beyond a partial-period threshold (such as 80%) after which that the user continued the sentence.” and ¶ [0076]: “FIG. 10 is a flow diagram depicting an embodiment of a method 100 to increase an NVAD cut-off period. At 101, a processing system receives audio of at least one spoken sentence. Next, at 102, the processing system detects periods of voice activity and no voice activity in the audio associated with the spoken sentence. At 103, the processing system maintains an NVAD cut-off based on the detection. At 104, the processing system decreases the NVAD cutoff period responsive to detecting a complete sentence. 
Finally, at 105, the processing system increases the NVAD cut-off period responsive to detecting a period of voice activity within a partial period threshold of detecting a period of no voice activity where the partial period threshold is less than the NVAD cut-off period.”),

Pérez-Mayos et al. and Aguayo et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. to incorporate the teachings of Aguayo et al. of one or more processors, a storage medium storing computer-readable instructions that, when executed by the one or more processors, enable the one or more processors, and determin[ing] a target phrase including the first word segment and the second word segment, based on comparing a first pause time between the first word segment and the second word segment and a threshold time, which provides the benefit of avoiding cutting off slow speakers while improving responsiveness for fast speakers and avoiding premature cut-offs for incomplete sentences ([0012] of Aguayo et al.).

As to independent claim 11, Pérez-Mayos et al. further teaches:

11. A control method (see ¶ 1 and 4 of 3. Experimental Framework: ¶ 1: “…we conducted our experiments using a simulation software, able to model the physics that control the robot limitations of movement and speed.” ¶ 4 “The robot was controlled using Play motion,9 MoveIt!10 and Joint Trajectory Controller.…”), comprising: [the limitations as in claim 1, as taught by Pérez-Mayos et al. in combination with Aguayo et al., above.]

As to independent claim 20, Pérez-Mayos et al. further teaches:

20. A control method (see ¶ 1 and 4 of 3.
Experimental Framework: ¶ 1: “…we conducted our experiments using a simulation software, able to model the physics that control the robot limitations of movement and speed.” ¶ 4 “The robot was controlled using Play motion,9 MoveIt!10 and Joint Trajectory Controller.…”), comprising: [the limitations as in claim 1, as taught by Pérez-Mayos et al. in combination with Aguayo et al., above.]

Pérez-Mayos et al. further teaches: determining a change in voice tone in each of collective phrases included in the target sentence (see Fig. 4 (Plot of the pitch values for demo text) as presented above and ¶ 3 of 4.2 Prosody-Based Approach: “We used mbrola voices18 to be able to extract the pitch and time points pairs from the SSML text. Figure 4 shows a plot of the pitch values for the demo text Hello! My name is Reem and I am here to show you that it is possible to use a prosody-based approach to speech and gesture synchronization. As you see, I will keep moving while I talk according to the pitch accent of the text. Have a nice day! Bye! Then, we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph, randomly choosing if different motions were similar enough, and cleaned that sequence to delete gestures that overlapped with other gestures.”); determining a largest phrase having a largest change in voice tone among the collective phrases included in the target sentence as a gesture assignment candidate (see Fig. 4 (Plot of the pitch values for demo text) as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…” and further Fig. 6 (Combined approach architecture. A prosody curve is manually described for each beat gesture in the gestures database.
Then, a motion sequence is computed by assigning an emblematic gesture to each keyword found on the text, and assigning the gesture with the most similar form to each pitch peak of the text to be spoken) as presented above. Here, the Examiner notes that the limitation of “largest phrase” is unclear since no support for the “largest phrase” was found in the as filed Specification. The Examiner further notes that the “largest phrase” is interpreted under the broadest reasonable interpretation (BRI) as including a phrase having a largest change in voice tone.), wherein the gesture assignment candidate corresponds to a gesture execution interval of the target sentence, based on the determining of the change in voice tone in each of the collective phrases included in the target sentence (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…”); determining a gesture of the gesture assignment candidate, based on a gesture type corresponding to the gesture assignment candidate (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…”); and designating the gesture to correspond to an utterance time of the gesture assignment candidate to generate a robot gesture of a robot scheduled to output the target sentence (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…” and further ¶ 4 of 4.2 Prosody-Based Approach: “Instead of working with events, as we had the time points, we launched the speech and sent to the robot the gestures in the right time point, using a timer. 
Figure 5 shows the architecture used in this approach.19”). Regarding claims 2 and 12, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 1 and 11, above. Pérez-Mayos et al. further teaches: 2 and 12. The device/method of claims 1 and 11, (claim 2) wherein the instructions further enable the one or more processors to: (claim 12) wherein the determining of the target phrase (see Fig. 2, ¶ 1 of 4.1 Part-of-Speech (PoS)-Based Approach, and ¶ 1-2 of 4.2 Prosody-Based Approach citations as in claim 1, above.) includes: (claims 2 and 12) identify a first utterance time of the first word segment and a second utterance time of the second word segment from the voice data (see ¶ 1-2 of 4.2 Prosody-Based Approach citations as in claim 1, above. More specifically: ¶ 2 of 4.2 Prosody-Based Approach: “… For example, a certain beat gesture where we move both hands up and down twice to highlight certain parts of our speech (e.g. “We should do it quick and well”, where we want to highlight the words “quick” and “well”), could be described with the sequence of time points and pitch [(250, 3), (500, 9), (750, 3), (1000, 9)], where the time points 500 ms, 1000 ms correspond to the two hands reaching the lower space point, accompanying the words “quick” and “well”).”); determine a first change in voice tone in the first word segment by use of the first utterance time and information about a first pitch contour of the first word segment (see ¶ 3 of 4.2 Prosody-Based Approach: “We used mbrola voices18 to be able to extract the pitch and time points pairs from the SSML text. Figure 4 shows a plot of the pitch values for the demo text Hello! My name is Reem and I am here to show you that it is possible to use a prosody-based approach to speech and gesture synchronization. As you see, I will keep moving while I talk according to the pitch accent of the text. Have a nice day! Bye!. 
Then, we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph, randomly choosing if different motions were similar enough, and cleaned that sequence to delete gestures that overlapped with other gestures.” and ¶ 2 of 4.3 Combined Approach: “This approach works exactly as explained for the prosody-based approach, with one difference: when the synthesizer is about to pronounce words associated with keywords, an event is fired and the proper emblematic gesture is sent to the robot to perform, as we did for the first approach…”); and determine a second change in voice tone in the second word segment by use of the second utterance time and information about a second pitch contour of the second word segment (see Fig. 4, ¶ 3 of 4.2 Prosody-Based Approach and ¶ 2 of 4.3 Combined Approach citations as in limitation(s) above. (Fig. 4: peaks corresponding to the words in the demo text)).

Regarding claim 9, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claim 1, above. Pérez-Mayos et al. further teaches:

9. The device of claim 1, wherein the instructions further enable the one or more processors to: determine a largest phrase having a largest change in voice tone among collective phrases included in the target sentence as a gesture assignment candidate (see Fig. 4 (Plot of the pitch values for demo text), Fig. 6 (Combined approach architecture. A prosody curve is manually described for each beat gesture in the gestures database. Then, a motion sequence is computed by assigning an emblematic gesture to each keyword found on the text, and assigning the gesture with the most similar form to each pitch peak of the text to be spoken) as presented above and ¶ 3 of 4.2 Prosody-Based Approach: “We used mbrola voices18 to be able to extract the pitch and time points pairs from the SSML text. Figure 4 shows a plot of the pitch values for the demo text Hello!
My name is Reem and I am here to show you that it is possible to use a prosody-based approach to speech and gesture synchronization. As you see, I will keep moving while I talk according to the pitch accent of the text. Have a nice day! Bye! Then, we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph, randomly choosing if different motions were similar enough, and cleaned that sequence to delete gestures that overlapped with other gestures.” Here, the Examiner notes that the limitation of “largest phrase” is unclear since no support for the “largest phrase” was found in the as filed Specification. The Examiner further notes that the “largest phrase” is interpreted under the broadest reasonable interpretation (BRI) as including a phrase having a largest change in voice tone.), wherein the gesture assignment candidate corresponds to a gesture execution interval of the target sentence, based on a change in voice tone in each phrase included in the target sentence being determined (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…”); determine a gesture of the gesture assignment candidate, based on a gesture type corresponding to the gesture assignment candidate (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…”); and allow the gesture to correspond to an utterance time of the gesture assignment candidate to generate a robot gesture of a robot scheduled to output the target sentence (see Fig. 4 and 6 as presented above and ¶ 3 of 4.2 Prosody-Based Approach as in citation above. 
More specifically: “…we found the pitch peaks and assigned to each peak the gesture with the most similar pitch graph…” and further ¶ 4 of 4.2 Prosody-Based Approach: “Instead of working with events, as we had the time points, we launched the speech and sent to the robot the gestures in the right time point, using a timer. Figure 5 shows the architecture used in this approach.19”).

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, published online 16 November 2019) and Aguayo et al. (US 20240135927 A1) as applied to claims 2 and 12, above, and further in view of Hirabayashi et al. (US 20090055188 A1).

Regarding claims 3 and 13, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 2 and 12, above. Pérez-Mayos et al. further teaches:

3 and 13. The device/method of claims 2 and 12, (claim 3) wherein the instructions further enable the one or more processors to: (claim 13) wherein the determining of the target phrase (see Fig. 2, ¶ 1 of 4.1 Part-of-Speech (PoS)-Based Approach, and ¶ 1-2 of 4.2 Prosody-Based Approach citations as in claim 1, above.) includes:

However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Hirabayashi et al. does teach: obtain a first rate of change in voice tone at intervals of a set unit time from the first pitch contour of the first word segment corresponding to the first utterance time to determine the first change in voice tone (see Fig. 2 (example of pitch patterns generated for each accent phrase (syllable/accent phrase boundaries)), Fig.
11A-B (method of smoothing processing which changes a pitch at the connection point based on the degree of emphasis according to a modification example 3), and ¶ [0055]: “A case in which the emphasis degree information 200 is "Emphasis 0 (no emphasis)" or "Emphasis 1 (weak emphasis)" will be explained. In this case, the smoothing processing section in the connection portion between the accent phrase and the next accent phrase is considered to be divided into a flat type (the accent phrase without accented syllable) and not-flat type (the accent phrase with accented syllable).” and ¶ [0093-0100]: “(5-3) Modification Example 3: [0093] It is also preferable that the modification method is decided by deciding the pitch of the connection point at the connection boundary which is used in the pattern connection module 13 based on at least the emphasis degree information 200. [0094] Specifically, when the accent type of the accent phrase is the flat-type, a connection-point pitch at the connection boundary between the accent phrase and the next accent phrase is decided to be a value at the end point of the accent phrase. [0095] When the accent type of the accent phrase is not the flat type, the pitch is decided according to the following conditions. [0096] The first condition is when the emphasis degree is stronger than the emphasis degree of the next accent phrase. At this time, the connection-point pitch is decided to be a value higher than an average value of the pitch of the end point in the accent phrase and the pitch of the start point in the next accent phrase. [0097] The second condition is when the emphasis degree is equal. At this time, the average value of the above pitches is decided. [0098] The third condition is when the emphasis degree of the accent phrase is weaker than the emphases degree of the next accent phrase. At this time, a value lower than the average value is decided. 
[0099] As described above, the modification method of the pitch pattern at the connection point can be controlled also by changing the pitch at the connection point according to the emphasis degree. [0100] An example of changing the method of deciding the boundary point according to the emphasis degree is shown in FIG. 11A and FIG. 11B. Since both the accent phrase and the next accent phrase are not emphasized (emphasis degree 0) in FIG. 11A, the second condition is applied [i.e., emphasis degree equal], and the connection pitch is decided to be the average value of the end-point pitch of the accent phrase and the start-point pitch of the next accent pitch. On the other hand, since the accent phrase is emphasized in FIG. 11B, the first condition is applied [i.e., emphasis degree is stronger] and the connection pitch is decided to be the value higher than the average value, thereby connecting the emphasized accent phrase and the not-emphasized next accent phrase smoothly without unnatural pitch change at the connection portion.”); and obtain a second rate of change in voice tone at the intervals of the set unit time from the second pitch contour of the second word segment corresponding to the second utterance time to determine the second change in voice tone (see Figs. 2 and 11A-B and ¶ [0055, and 0093-0100] citations as in limitation above. More specifically: Fig. 2 (syllable/accent phrase boundaries) and “[0093] It is also preferable that the modification method is decided by deciding the pitch of the connection point at the connection boundary which is used in the pattern connection module 13 based on at least the emphasis degree information 200.”).

Pérez-Mayos et al., Aguayo et al., and Hirabayashi et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques.
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Hirabayashi et al. of obtain[ing] a first rate of change in voice tone at intervals of a set unit time from the first pitch contour of the first word segment corresponding to the first utterance time to determine the first change in voice tone and obtain[ing] a second rate of change in voice tone at the intervals of the set unit time from the second pitch contour of the second word segment corresponding to the second utterance time to determine the second change in voice tone, which provides the benefit of putting proper stress and emphasis to intonation and improving understandability or naturalness of the synthetic speech to be generated ([0084] of Hirabayashi et al.).

Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, published online 16 November 2019) and Aguayo et al. (US 20240135927 A1) as applied to claims 2 and 12, above, and further in view of Hirabayashi et al. (US 20090055188 A1) and Kato et al. (US 20140136192 A1).

Regarding claims 4 and 14, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 2 and 12, above. However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Hirabayashi et al. does teach:

4 and 14.
The device/method of claims 2 and 12, (claim 4) wherein the instructions further enable the one or more processors to: (claim 14) further comprising: determine an average of the first change in voice tone and the second change in voice tone, based on the first word segment and the second word segment being included in the target phrase (see Figs. 2 and 11A-B and ¶ [0055, and 0093-0100] citations as in limitation above. More specifically: Fig. 2 (syllable/accent phrase boundaries), ¶ [0093]: “It is also preferable that the modification method is decided by deciding the pitch of the connection point at the connection boundary which is used in the pattern connection module 13 based on at least the emphasis degree information 200.” and ¶ [0096-0100]: “[0096] The first condition is when the emphasis degree is stronger than the emphasis degree of the next accent phrase. At this time, the connection-point pitch is decided to be a value higher than an average value of the pitch of the end point in the accent phrase and the pitch of the start point in the next accent phrase. [0097] The second condition is when the emphasis degree is equal. At this time, the average value of the above pitches is decided. [0098] The third condition is when the emphasis degree of the accent phrase is weaker than the emphases degree of the next accent phrase. At this time, a value lower than the average value is decided. [0099] As described above, the modification method of the pitch pattern at the connection point can be controlled also by changing the pitch at the connection point according to the emphasis degree. [0100] An example of changing the method of deciding the boundary point according to the emphasis degree is shown in FIG. 11A and FIG. 11B. Since both the accent phrase and the next accent phrase are not emphasized (emphasis degree 0) in FIG. 
11A, the second condition is applied [i.e., emphasis degree equal], and the connection pitch is decided to be the average value of the end-point pitch of the accent phrase and the start-point pitch of the next accent pitch. On the other hand, since the accent phrase is emphasized in FIG. 11B, the first condition is applied [i.e., emphasis degree is stronger] and the connection pitch is decided to be the value higher than the average value, thereby connecting the emphasized accent phrase and the not-emphasized next accent phrase smoothly without unnatural pitch change at the connection portion.”); Pérez-Mayos et al., Aguayo et al., and Hirabayashi et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Hirabayashi et al. of determin[ing] an average of the first change in voice tone and the second change in voice tone, based on the first word segment and the second word segment being included in the target phrase which provides the benefit of putting proper stress and emphasis to intonation and improving understandability or naturalness of the synthetic speech to be generated ([0084] of Hirabayashi et al.). However, Pérez-Mayos et al. in combination with Aguayo et al. and Hirabayashi et al. do not explicitly teach, but Kato et al. does teach: determine a value obtained by applying the average to a normalization function as a target phrase change in voice tone for the target phrase (see ¶ [0045]: “The normalization degree calculation unit 6 calculates a degree of normalization as a function value of an increasing function with the scalar S indicating power (average amplitude in the present example) as a variable. 
Assuming a degree of normalization α and an increasing function A(S) with the scalar S indicating power as a variable, α=A(S) is established. As described above, the degree of normalization is a defined value for defining an intermediate form between a form in which a plurality of pitch waveforms corresponding to one segment are completely normalized and a form in which normalization is not performed at all to maintain the original pitch waveforms.”). Pérez-Mayos et al., Aguayo et al., Hirabayashi et al., and Kato et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. and Hirabayashi et al. to incorporate the teachings of Kato et al. of determin[ing] a value obtained by applying the average to a normalization function as a target phrase change in voice tone for the target phrase which provides the benefit of acquiring a natural synthesis speech ([0017] of Kato et al.). Claims 5-6 and 15-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, Published online: 16 November 2019.) and Aguayo et al. (US 20240135927 A1) as applied to claims 1 and 11, above, and further in view of Fukuda (US 12367866 B2). Regarding claims 5 and 15, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 1 and 11, above. Pérez-Mayos et al. further teaches: 5 and 15. 
The device/method of claims 1 and 11, (claim 5) wherein the instructions further enable the one or more processors to: (claim 15) wherein the determining of the target phrase (see Fig. 2, ¶ 1 of 4.1 Part-of-Speech (PoS)-Based Approach, and ¶ 1-2 of 4.2 Prosody-Based Approach citations as in claim 1, above.) includes: However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Fukuda does teach: include the first word segment and the second word segment in the target phrase, based on the first pause time being less than the threshold time (see ¶ Col. 11, line 34 – Col. 12, line 15: “(42) In step 604, a determination is made as to whether the duration of silence between either the one-word utterance (e.g., ‘okay’) and the preceding sentence (e.g., 5.34 seconds) or the succeeding sentence (e.g., 0.1 seconds) is less than or equal to a given threshold. The use of threshold is to ensure that the duration of silence/non-speech between the one-word utterance and the preceding or succeeding sentence is not too long. Namely, if the duration of silence is too long, then the non-speech portion will factor too much into the model training. In other words, if the amount of training data for non-speech increases, it also means that the amount of speech training data is relatively reduced. Doing so, can degrade the classification ability of the model for speech data (the data in the speech portion), while also undesirably improving the classification ability of the model for non-speech/noise signals which can include background noise. The threshold can be set by a user. According to an exemplary embodiment, the threshold is less than or equal to 1 second. 
(43) If it is determined in step 604 that, YES, the duration of silence between the one-word utterance and the preceding sentence and/or the succeeding sentence is less than or equal to the threshold (e.g., less than or equal to 1 second), then in step 606 a determination is made as to whether both (a) the duration of silence between the one-word utterance and the preceding sentence and (b) the duration of silence between the one-word utterance and the succeeding sentence meet the constraint, i.e., they are both less than or equal to the threshold. If it is determined in step 606 that, NO. (a) the duration of silence between the one-word utterance and the preceding sentence and (b) the duration of silence between the one-word utterance and the succeeding sentence do not both meet the constraint (i.e., only one of the preceding or succeeding sentence meets the constraint), then in step 608 the one-word utterance is concatenated with whichever (preceding or succeeding) sentence meets the constraint. (44) This concept is illustrated in scenario A in FIG. 5. Namely, the duration of silence between the one-word utterance ‘okay’ and the preceding sentence ‘i don't know i would much rather be in the warm of sun’ is 5.34 seconds which is greater than the threshold (e.g., 1 second). However, the duration of silence between ‘okay’ and the succeeding sentence ‘well throughout all of this i called my parents who live in florida’ is 0.1 seconds which is less than the threshold. In that case, the one-word utterance ‘okay’ is concatenated with the succeeding sentence, i.e., ‘well throughout all of this i called my parents who live in florida okay.’” and claim 12: “12. 
The computer program product of claim 11, wherein the program instructions further cause the computer to perform: determining, for each one-word utterance in the initial training dataset, whether the duration of silence between the one-word utterance and either the preceding or the succeeding sentence is less than or equal to a threshold duration, wherein the duration of silence with both the preceding and the succeeding sentence is greater than the threshold duration, and wherein the program instructions further cause the computer to remove the one-word utterance from the initial training dataset; and concatenating the one-word utterance with whichever of the preceding or the succeeding sentence with the duration of silence that is less than or equal to the threshold duration.”); and include one of the first word segment or the second word segment in the target phrase, based on the first pause time being greater than or equal to the threshold time (see ¶ Col. 11, line 34 – Col. 12, line 15 and Claim 12 citations as in limitation above. More specifically, see the example: “(44) This concept is illustrated in scenario A in FIG. 5. Namely, the duration of silence between the one-word utterance ‘okay’ and the preceding sentence ‘i don't know i would much rather be in the warm of sun’ is 5.34 seconds which is greater than the threshold (e.g., 1 second). However, the duration of silence between ‘okay’ and the succeeding sentence ‘well throughout all of this i called my parents who live in florida’ is 0.1 seconds which is less than the threshold. In that case, the one-word utterance ‘okay’ is concatenated with the succeeding sentence, i.e., ‘well throughout all of this i called my parents who live in florida okay.’”). Pérez-Mayos et al., Aguayo et al., and Fukuda are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques. 
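Fukuda's silence-threshold rule quoted above can be sketched as follows. This is an illustrative reconstruction, not Fukuda's implementation: the tie-break when both gaps meet the threshold is an assumption (the quoted passage does not specify it), and the word order follows the quoted scenario A example, where the one-word utterance is appended after the sentence it joins.

```python
# Illustrative sketch of the silence-threshold concatenation rule described
# in Fukuda at Col. 11, line 34 - Col. 12, line 15 and claim 12. Assumptions:
# when both gaps satisfy the threshold, attach to the side with the shorter
# gap; the utterance is appended after the chosen sentence, per scenario A.

def attach_one_word_utterance(preceding, utterance, succeeding,
                              gap_before, gap_after, threshold=1.0):
    """Return the concatenated sentence, or None to drop the utterance."""
    before_ok = gap_before <= threshold
    after_ok = gap_after <= threshold
    if not before_ok and not after_ok:
        # Both silences exceed the threshold: remove the one-word utterance.
        return None
    if before_ok and (not after_ok or gap_before <= gap_after):
        return preceding + " " + utterance
    return succeeding + " " + utterance

# Scenario A from Fukuda's FIG. 5: 5.34 s of silence before "okay",
# 0.1 s after, threshold 1 s -> concatenated with the succeeding sentence.
print(attach_one_word_utterance(
    "i don't know i would much rather be in the warm of sun",
    "okay",
    "well throughout all of this i called my parents who live in florida",
    gap_before=5.34, gap_after=0.1))
# -> "well throughout all of this i called my parents who live in florida okay"
```

This mirrors the claimed comparison of a pause time against a threshold time to decide whether both word segments, or only one, join the target phrase.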
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Fukuda of includ[ing] the first word segment and the second word segment in the target phrase, based on the first pause time being less than the threshold time; and includ[ing] one of the first word segment or the second word segment in the target phrase, based on the first pause time being greater than or equal to the threshold time which provides the benefit of reducing insertion errors (due to a failure of the voice activated detection systems in correctly identifying background noise) by improving automatic speech recognition models (Col. 7, lines 47-50 of Fukuda). Regarding claims 6 and 16, Pérez-Mayos et al. in combination with Aguayo et al. and Fukuda teach the limitations as in claims 5 and 15, above. Fukuda further teaches: 6 and 16. The device/method of claims 5 and 15, (claim 6) wherein the instructions further enable the one or more processors to: (claim 16) further comprising: determine positions of the first word segment and the second word segment in the voice data, based on the first pause time being less than the threshold time (see Fig. 5 and ¶ Col. 11, line 34 – Col. 12, line 15 and Claim 12 citations as in claims 5 and 15, above. More specifically, see the example: “(44) This concept is illustrated in scenario A in FIG. 5. Namely, the duration of silence between the one-word utterance ‘okay’ and the preceding sentence ‘i don't know i would much rather be in the warm of sun’ is 5.34 seconds which is greater than the threshold (e.g., 1 second). However, the duration of silence between ‘okay’ and the succeeding sentence ‘well throughout all of this i called my parents who live in florida’ is 0.1 seconds which is less than the threshold. 
In that case, the one-word utterance ‘okay’ is concatenated with the succeeding sentence, i.e., ‘well throughout all of this i called my parents who live in florida okay.’”); identify a third word segment subsequent to the second word segment from the voice data, based on the second word segment being subsequent to the first word segment (see Fig. 5 and ¶ Col. 11, line 34 – Col. 12, line 15 and Claim 12 citations as in limitation above. More specifically, see the example above: (1) ‘i don't know i would much rather be in the warm of sun’, (2) ‘okay’, and (3) ‘well throughout all of this i called my parents who live in florida’ and the determination of whether they met the silence threshold(s) to determine where the (2) ‘okay’ would be concatenated); and determine whether to include the third word segment in the target phrase, based on comparing a second pause time between the second word segment and the third word segment and the threshold time (see Fig. 5 and ¶ Col. 11, line 34 – Col. 12, line 15 and Claim 12 citations as in limitation above. More specifically, see the example above: (1) ‘i don't know i would much rather be in the warm of sun’, (2) ‘okay’, and (3) ‘well throughout all of this i called my parents who live in florida’ and the determination of whether they met the silence threshold(s) to determine where the (2) ‘okay’ would be concatenated). Pérez-Mayos et al., Aguayo et al., and Fukuda are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/voice analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. 
to incorporate the teachings of Fukuda of determin[ing] positions of the first word segment and the second word segment in the voice data, based on the first pause time being less than the threshold time; identify[ing] a third word segment subsequent to the second word segment from the voice data, based on the second word segment being subsequent to the first word segment; and determin[ing] whether to include the third word segment in the target phrase, based on comparing a second pause time between the second word segment and the third word segment and the threshold time which provides the benefit of reducing insertion errors (due to a failure of the voice activated detection systems in correctly identifying background noise) by improving automatic speech recognition models (Col. 7, lines 47-50 of Fukuda). Claims 7 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, Published online: 16 November 2019.) and Aguayo et al. (US 20240135927 A1) as applied to claims 1 and 11, above, and further in view of Zhang et al. (US 20230136368 A1). Regarding claims 7 and 17, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 1 and 11, above. However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Zhang et al. does teach: 7 and 17. The device/method of claims 1 and 11, (claim 7) wherein the instructions further enable the one or more processors to: (claim 17) further comprising: obtain a first target vector with a first number of dimensions, by use of word embedding of a target window including the first word segment and the second word segment (see ¶ [0030-0035]: “[0030] As shown in FIG. 1, the present application provides a keyword extraction method applicable to a Word text. 
The method includes the steps described below. [0031] In S1, a Word text is acquired and a body is extracted. [0032] In S2, a set number of keywords are extracted by a TFIDF algorithm and a set number of keywords are extracted by a TextRank algorithm, respectively. [0033] In S3, a text name and a text title are acquired and word segmentation is performed. [0034] In S4, text feature vectors are constructed and inputted into a trained keyword extraction model. [0035] In S5, a final keyword set is extracted from the keywords extracted by the TextRank algorithm by using the keyword extraction model to complete text keyword extraction.”); apply the first target vector to a phrase unit recognition model to obtain an output indicating whether to perform segmentation of word segments included in the target window (see ¶ [0030-0035] citations as in limitation above. More specifically: “[0034] In S4, text feature vectors are constructed and inputted into a trained keyword extraction model. [0035] In S5, a final keyword set is extracted from the keywords extracted by the TextRank algorithm by using the keyword extraction model to complete text keyword extraction.”); and train the phrase unit recognition model, based on a first loss obtained by use of comparing the output and the target sentence (see ¶ [0030-0035] citations as in limitations above. More specifically: “[0034] In S4, text feature vectors are constructed and inputted into a trained keyword extraction model.” and further ¶ [0068-0073]: “[0068] 5, the 100 keywords extracted by the TextRank algorithm are re-determined using the text feature vector and the keyword extraction model, and the words determined to be keywords are finally extracted as the final keyword set. [0069] 6 the keywords extracted by different models in the method of the present application are compared with the original keywords, and the accuracy rate and the recall rate are calculated. 
[0070] 7, the keywords of the test set of 535 papers are extracted by the TextRank algorithm and the TFIDF algorithm and compared with the original keywords, so as to calculate the accuracy rate and the recall rate. [0071] After the test of the test set, the accuracy rate and the recall rate of extracted keywords are described in Table 2. [0072] The following conclusions may be drawn from the analysis of the test results. [0073] 1, by comparison, the accuracy rate and the recall rate of keywords extracted by different models in this method are generally better than those of keywords extracted by the TextRank algorithm and keywords extracted by the TFIDF algorithm. The recall rate is not improved much, and that is because when the TextRank algorithm and the TFIDF algorithm are used in this embodiment, 10 keywords are respectively extracted, and the number of extracted keywords is relatively large. The accuracy rate is improved significantly, up to 16% higher than the TextRank algorithm and 31% higher than the TFIDF algorithm.”). Pérez-Mayos et al., Aguayo et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in natural language (e.g., keyword) analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Zhang et al. 
of obtain[ing] a first target vector with a first number of dimensions, by use of word embedding of a target window including the first word segment and the second word segment; apply[ing] the first target vector to a phrase unit recognition model to obtain an output indicating whether to perform segmentation of word segments included in the target window; and train[ing] the phrase unit recognition model, based on a first loss obtained by use of comparing the output and the target sentence which provides the benefit of improving accuracy of keyword extraction ([0096] of Zhang et al.). Claims 8 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, Published online: 16 November 2019.) and Aguayo et al. (US 20240135927 A1) as applied to claims 1 and 11, above, and further in view of Oplustil Gallegos et al. (US 20240087558 A1). Regarding claims 8 and 18, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 1 and 11, above. However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Oplustil Gallegos et al. does teach: 8 and 18. The device/method of claims 1 and 11, (claim 8) wherein the instructions further enable the one or more processors to: (claim 18) further comprising: obtain a second target vector including vectors with a second number of dimensions for every word segment included in the target phrase, by use of word embedding of the target phrase (see ¶ [0286-0291]: “[0286] This speech audio 367 is the speech signal that is obtained in S303 of FIG. 12(b) [0287] In step S305, intonation vector is then obtained for the input text in step S300. 
In an embodiment, the intonation vector is derived from a pitch track that is extracted from the synthesised speech 367 derived from the input text. [0288] The intonation vector is a real valued single dimension pitch vector with a length equal to the number of time steps of the decoder output. In an embodiment, the intonation vector is obtained from a real valued pitch vector with the length of phonemes or encoder input steps and this is then sampled to a vector with a length equal to the decoder timesteps. [0289] The intonation vector is derived from the pitch vector which has pitch values for encoder timesteps. This is shown in stage 2 of FIG. 12(c). The audio track is taken from synthesised speech 367. This is then analysed in 369 to obtain the start and end times of each phoneme. In an embodiment, the start and end times of each phoneme can be obtained from the alignments produced in the attention network during synthesis. The average pitch is then calculated for each phoneme by averaging the pitch between the start and end time of each phoneme. [0290] Once the pitch has been obtained for each phoneme, an n_encoder steps vector is produced where each step corresponds to the average pitch for the phoneme in 371. [0291] Once the pitch vector has been derived from the synthesised speech, the user can modify the vector by the user control 373. This allows the user to increased and/or decrease the pitch of one or more of the phonemes. Referring back to the interface shown in FIG. 12(a), the interface shows the encoder time step pitch vector and the interface allow it is possible to move with a mouse one or more of the phonemes to increase or decrease its pitch to produce the intonation vector.”); apply the second target vector to an encoder for reducing an input target dimension of an input target to reduce the second number of dimensions of the second target vector (see ¶ [0286-0291] citations as in limitation above. 
More specifically: “[0288] The intonation vector is a real valued single dimension pitch vector with a length equal to the number of time steps of the decoder output. In an embodiment, the intonation vector is obtained from a real valued pitch vector with the length of phonemes or encoder input steps and this is then sampled to a vector with a length equal to the decoder timesteps.” and “[0290] Once the pitch has been obtained for each phoneme, an n_encoder steps vector is produced where each step corresponds to the average pitch for the phoneme in 371.”); apply the second target vector after applied to the encoder to a voice tone change prediction model to obtain a temporary change in voice tone in the target phrase (see ¶ [0286-0291] citations as in limitation above. More specifically: “[0288] The intonation vector is a real valued single dimension pitch vector with a length equal to the number of time steps of the decoder output. In an embodiment, the intonation vector is obtained from a real valued pitch vector with the length of phonemes or encoder input steps and this is then sampled to a vector with a length equal to the decoder timesteps.” and “[0290] Once the pitch has been obtained for each phoneme, an n_encoder steps vector is produced where each step corresponds to the average pitch for the phoneme in 371.” and further ¶ [0296-0300]: “[0296] Referring to FIG. 12(b), the intonation vector is then input into an intonation model in step 307 and modified speech is then output in step S309. [0297] FIG. 12(c) shows the intonation model 381 which is very similar to the model 353 of stage 1. The intonation model 381 has an encoder 383 decoder 387 architecture where the encoder 383 and decoder 387 are linked by an attention network 385. 
However, the intonation model 381 differs in a number of significant ways: [0298] i) The model 381 has been trained using text and synthesised speech as opposed to real speech [0299] ii) The alignment matrix of the attention network 383 is not predicted from the input text, but the alignment matrix is taken from stage 1 and imposed on the output of the encoder of stage 2. [0300] iii) The decoder 387 also receives the intonation vector.”); and train the voice tone change prediction model, based on a second loss obtained by use of comparing the temporary change in voice tone in the target phrase and a target phrase change in voice tone for the target phrase (see ¶ [0286-0291, 0296-0300] citations as in limitation above. More specifically: “[0298] i) The model 381 has been trained using text and synthesized speech as opposed to real speech” and further ¶ [0272]: “Once the prominence vector is obtained from the training data, the model is trained as usual, feeding in the text and the prominence vector and learning to produce the Mel spectrogram of the audio by back-propagating the mean squared error loss between the synthesised Mel-spectrogram and the real Mel-spectrogram.” and ¶ [0312]: “Then, with the text, original audio, and time aligned average phoneme pitch it is possible to train the intonation model for the single stage model and/or the two-stage model. In an embodiment, the loss function used is a MSE error loss between the exact Mel spectrogram and the predicted spectrogram.”). Pérez-Mayos et al., Aguayo et al., and Oplustil Gallegos et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in natural language (e.g., keyword) analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Oplustil Gallegos et al. 
of obtain[ing] a second target vector including vectors with a second number of dimensions for every word segment included in the target phrase, by use of word embedding of the target phrase; apply[ing] the second target vector to an encoder for reducing an input target dimension of an input target to reduce the second number of dimensions of the second target vector; apply[ing] the second target vector after applied to the encoder to a voice tone change prediction model to obtain a temporary change in voice tone in the target phrase; and train[ing] the voice tone change prediction model, based on a second loss obtained by use of comparing the temporary change in voice tone in the target phrase and a target phrase change in voice tone for the target phrase which provides the benefit of enabling the modification of the speech signals while maintaining the accuracy and quality of trained TTS systems ([0141] of Oplustil Gallegos et al.). Claims 10 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pérez-Mayos et al. (“Part-of-Speech and Prosody-based Approaches for Robot Speech and Gesture Synchronization,” Journal of Intelligent & Robotic Systems (2020) 99:277–287, https://doi.org/10.1007/s10846-019-01100-3, Published online: 16 November 2019.) and Aguayo et al. (US 20240135927 A1) as applied to claims 1 and 11, above, and further in view of Gibson et al. (US 6336092 B1). Regarding claims 10 and 19, Pérez-Mayos et al. in combination with Aguayo et al. teach the limitations as in claims 1 and 11, above. However, Pérez-Mayos et al. in combination with Aguayo et al. do not explicitly teach, but Gibson et al. does teach: 10 and 19. The device/method of claims 1 and 11, (claim 10) wherein the instructions further enable the one or more processors to: (claim 19) further comprising: apply cubic spline interpolation to the voice data to identify the information about the pitch contour (see ¶ Col. 
5, lines 18-30: “(13) There are several ways in which the interpolation can be accomplished. In all cases, the goal is to create an interpolated voiced signal having a pitch contour which blends with the bounding pitch contour in a meaningful way (for example, for singing, the interpolated notes should sound good with the background music). For some applications, the interpolated pitch contour may be calculated automatically, using, for example, cubic spline interpolation. In the preferred embodiment, the pitch contour is first computed using spline interpolation, and then any portions which are deemed unsatisfactory are fixed manually by an operator.”). Pérez-Mayos et al., Aguayo et al., and Gibson et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in voice/speech analysis techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pérez-Mayos et al. in combination with Aguayo et al. to incorporate the teachings of Gibson et al. of apply[ing] cubic spline interpolation to the voice data to identify the information about the pitch contour which provides the benefit of allowing the transformation of voice either with or without pitch correction to match the pitch of the target singer (e.g., speaker) (Col. 2, lines 39-31 of Gibson et al.). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Regarding generation of gestures in a robot by analyzing speech text (¶ [0007 and 0045]; pertinent to claims 1, 11, and 20): Ng-Thow-Hing et al. (US 20120191460 A1). Regarding dialog/speech authoring tools associated with expressive social robot (¶ [0027, 0034, 0065, 0072]; pertinent to claims 1, 11, and 20): Breazeal et al. (US 20180133900 A1). Regarding speech pauses (¶ [0070]; pertinent to claims 1, 11, and 20): Sen et al. (US 20200105286 A1). 
Regarding sentiment adapted communication (¶ [0033, 0056-0057]; pertinent to claims 1, 11, and 20): Takano et al. (US 20200218781 A1). Regarding intonation modifications in text-to-speech (¶ [0032 and 0055]; pertinent to claims 1, 11, and 20): Lahr et al. (US 20230386446 A1). Regarding pitch contour of sentences (¶ [0015, 0025-0026]; pertinent to claims 3 and 13): Chen (US 20140195242 A1). Regarding change of pitch between two voice signals (¶ [0041]; pertinent to claims 3 and 13): Aher et al. (US 20210272550 A1). Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. Keisha Y. Castillo-Torres Examiner Art Unit 2659 /Keisha Y. Castillo-Torres/Examiner, Art Unit 2659

Prosecution Timeline

Jun 12, 2024
Application Filed
Feb 20, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12573402
GENERATING AND/OR UTILIZING UNINTENTIONAL MEMORIZATION MEASURE(S) FOR AUTOMATIC SPEECH RECOGNITION MODEL(S)
2y 5m to grant Granted Mar 10, 2026
Patent 12536989
Language-agnostic Multilingual Modeling Using Effective Script Normalization
2y 5m to grant Granted Jan 27, 2026
Patent 12531050
VOICE DATA CREATION DEVICE
2y 5m to grant Granted Jan 20, 2026
Patent 12499332
TRANSLATING TEXT USING GENERATED VISUAL REPRESENTATIONS AND ARTIFICIAL INTELLIGENCE
2y 5m to grant Granted Dec 16, 2025
Patent 12488180
SYSTEMS AND METHODS FOR GENERATING DIALOG TREES
2y 5m to grant Granted Dec 02, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
74%
Grant Probability
99%
With Interview (+30.5%)
3y 0m
Median Time to Grant
Low
PTA Risk
Based on 108 resolved cases by this examiner. Grant probability derived from career allow rate.
