Last updated: May 29, 2026
Application No. 18/043,105
VOICE CONVERSION DEVICE, VOICE CONVERSION METHOD, PROGRAM, AND RECORDING MEDIUM

Non-Final OA §103
Filed: Feb 27, 2023
Priority: Jun 30, 2021 (JP 2021-108707, +1 more)
Examiner: DORVIL, RICHEMOND
Art Unit: 2658
Tech Center: 2600 (Communications)
Assignee: DWANGO CO., LTD.
OA Round: 3 (Non-Final)
This examiner grants 28% of cases after interview: +22.5% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 53 resolved cases, 2023–2026
Examiner Intelligence

DORVIL, RICHEMOND
Grants only 28% of cases
Career Allowance Rate
15 granted / 53 resolved
-33.7% vs TC avg
Strong +22% interview lift
Typical timeline: 3y 6m avg prosecution, 3 currently pending
Statute-Specific Performance

§101: 3.4% (-36.6% vs TC avg)
§103: 90.4% (+50.4% vs TC avg)
§102: 3.4% (-36.6% vs TC avg)
§112: 0.7% (-39.3% vs TC avg)
Based on career data from 53 resolved cases
Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.


Response to Amendment
This communication is responsive to the applicant’s amendment dated 04/09/2025.

Response to Arguments
Applicant's arguments filed on 04/09/2025 have been fully considered but they are not persuasive.
Applicant’s response to the rejection of claim 9 as being unpatentable over Mohammadi in view of Pan argued that “The prior art does not teach at least the feature of claim 9 … an extraction unit … an adjustment unit ….; and a generation unit …” without clearly pointing out which of these features are not disclosed by the combination of Mohammadi in view of Pan. The non-final rejection clearly states how these features are rejected in view of the Broadest Reasonable Interpretation (BRI) of the claim language. See page 8 of the non-final rejection mailed 01/15/2025.
The applicant’s response further asserted that “Mohammadi obtains the phoneme information and the pitch information from 2 different sources” while the applicant’s claimed invention recites “(1) an extraction unit extracts phoneme and pitch information from a conversion source and (2) applies the pitch information from the source voice to the destination voice”. This assertion is incorrect.
Claim 9 simply recites extracting time series data including a phoneme and a pitch (see claim 9, lines 3-4). Mohammadi in view of Pan discloses this limitation (see Mohammadi, col. 6, lines 2 – 5: “each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech”).
The limitation “applies the pitch information from the source voice to the destination” is not explicitly recited in claim 9.
Nothing in the claim(s) requires “applies the pitch information from the source voice to the destination”. In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e. “applies the pitch information from the source voice to the destination”) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
The applicant’s response further argued that “the pitch used to synthesize a designated person’s voice is not the pitch obtained by the extraction unit”. The examiner cannot concur because, as disclosed by Mohammadi, the word spoken by the user is obtained (extracted) from the source speech (see col. 5, line 64 to col. 6, line 5).
Regarding claim 11, the applicant’s response argues that in Mohammadi “the text sentences are not same sentences as the utterance content of the source voice”. The examiner cannot concur. As stated in the office action, Mohammadi discloses that the source speech is converted to text … (see col. 5, lines 64 – 65).
Regarding claim 13, the applicant’s response argued that “Neither Siinghal nor the other cited references teach that compressed or expanded speech phonemes be inputted into the generation unit.”
The argument has been fully considered and is persuasive. The rejection of claim 13 has been withdrawn.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 9-12, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over US 10186251 B1 Mohammadi (“Mohammadi”) in view of US 20220068259 A1 Pan et al. (“Pan”).
Regarding Claim 9, Mohammadi teaches a voice conversion apparatus, comprising: 
an input unit that inputs designation of a conversion destination voice ("In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi Col 3, lines 8-20]); 
an extraction unit that analyzes voice data of a conversion source voice and extracts time series data including a pitch ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5 lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 to Col. 6 line 7]– the examiner notes that the preceding quotations describe the process of extracting linguistic content and pitch from a conversion source voice);
an adjustment unit that matches a height of the pitch to a height of the designated conversion destination voice ("The intonation of the source speech is linearly mapped to have the same average as the average of the target intonation, by multiplying the source fundamental frequency by a constant value, which the constant value is computed from the training sentences. The intonation is represented using a sequence of frequency values comprising the fundamental frequency, f0, in a sequence of frames. This sequence of fundamental frequencies, f0, may be referred to as the pitch contour. The pitch contour of the source speech is transformed linearly so that its average and standard deviation is the same at the calculated average and standard deviation of the pitch contour of the target, respectively." [Mohammadi Col. 6, lines 21 - 37]); and
a generation unit that inputs the pitch to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generates voice data obtained by synthesizing the designated conversion destination voice ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, line 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 – Col 6, line 7]; "Each spectrum and pitch of the synthesized intermediate speech is then converted 156 by the DNN into a target speech spectrum and pitch. The pitch corresponding to the target speech training data is then converted 157 to an excitation signal, and the excitation signal convolved 158 with the target spectrum retrieved from the DNN after conversion 156. Each convolution 158 produces a segment of target speech in the voice of the target, and the plurality of target speech segments produce 160 a target voice sequence that may also be referred to as target speech. In accordance with this invention, the target speech has the linguistic content of the source speaker but the voice color of the target speaker." [Mohammadi Col. 6, lines 6 - 20] – the examiner notes the preceding quotations describe the process of using linguistic content and pitch as inputs to a deep learning model; "In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. 
That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi col. 3, lines 8 - 20] – the examiner notes that the system of Mohammadi can select one of multiple conversion destination voices).
Mohammadi does not explicitly teach:
an extraction unit that analyzes voice data of a conversion source voice and extracts time series data including a phoneme; or
a generation unit that inputs the phoneme to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generates voice data obtained by synthesizing the designated conversion destination voice.
Pan, however, does teach extracting phonemes from a conversion source voice and using phonemes as an input to a deep learning voice conversion model (“The waveform of the source speaker data is converted to phonetic data (e.g., phonetic posterior grams, PPGs). The PPG data is aligned to the waveform speaker data and defines one or more features corresponding to the prosody style of the source speaker. In addition to the one or more features defined by the PPG data, the computing system also extracts one or more additional prosody features from the source speaker data. Then, based on (i) the PPG data, (ii) the additionally extracted prosody features, and (iii) the voice timbre of the target speaker, the computing system generates a spectrogram having the voice timbre of the target speaker and the prosody style of the source speaker.” [Pan 0033]; “In some instances, the storage 140 includes computer-executable instructions 118 for instantiating or executing one or more of the models and/or engines shown in computing system 110. In some instances, the one or more models are configured as machine learning models or machine learned models.” [Pan 0042]; “Once the PPG data 936 is generated, the PPG-Mel model 938 receives the PPG data. The PPG-Mel model 938, or more broadly, the PPG-spectrogram model is configured to generate spectrogram data based on the received PPG data 936. The PPG to Mel model is initially a source PPG to Mel model, wherein the source PPG to Mel model 938 is trained on multi-speaker data. After the initial training, the PPG to Mel model 938 is then refined and/or adapted to be speaker dependent for a particular (or additional) target speaker. This is done by training the PPG to Mel model 938 on a target speaker's data (e.g., target speaker data 141). 
In this manner, the PPG-Mel model is able to generate spectrograms, or mel-spectrograms, in the target speaker's timbre with increased quality due to the speaker dependent adaptation.” [Pan 0110]; See Fig. 8 for a high-level view of Pan’s system.).
Mohammadi and Pan are both in the voice conversion art. Therefore, it would have been obvious to one of ordinary skill in the art prior to the filing date of the invention to incorporate Pan’s usage of phonemes as an input to a voice conversion system into Mohammadi’s usage of generic linguistic information because phonemes are a well-known type of linguistic information, and because Pan’s method allows for multiple speaker style features to be represented (“In these embodiments, a computing system converts a waveform of the source speaker data to phonetic posterior gram (PPG) data by aligning the waveform of the source speaker data to the PPG data, wherein the PPG data define one or more features corresponding to the prosody style of the source speaker data.” [Pan 0014]).
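For illustration only, the linear pitch-contour mapping quoted from Mohammadi (Col. 6, lines 21 - 37) in the adjustment-unit limitation above can be sketched in Python as follows. This is a minimal sketch, not Mohammadi's or the applicant's implementation: the function names, the convention that unvoiced frames are marked with 0.0, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def scale_to_target_mean(source_f0, target_mean):
    """Multiply voiced f0 values by one constant so their mean equals target_mean."""
    voiced = [f for f in source_f0 if f > 0.0]  # assumption: 0.0 marks unvoiced frames
    if not voiced:
        return list(source_f0)
    factor = target_mean / mean(voiced)
    return [f * factor if f > 0.0 else 0.0 for f in source_f0]


def match_mean_and_std(source_f0, target_mean, target_std):
    """Linearly transform voiced f0 values to the target mean and standard deviation."""
    voiced = [f for f in source_f0 if f > 0.0]
    if not voiced:
        return list(source_f0)
    src_mean = mean(voiced)
    src_std = pstdev(voiced)
    if src_std == 0.0:
        # Flat contour: no spread to rescale, so only move it to the target mean.
        return [target_mean if f > 0.0 else 0.0 for f in source_f0]
    scale = target_std / src_std
    return [(f - src_mean) * scale + target_mean if f > 0.0 else 0.0
            for f in source_f0]


# Hypothetical contour in Hz; the second frame is unvoiced.
contour = [100.0, 0.0, 120.0, 140.0]
converted = match_mean_and_std(contour, target_mean=200.0, target_std=20.0)
```

The first function corresponds to the quoted "multiplying the source fundamental frequency by a constant value"; the second corresponds to the quoted transformation that matches both average and standard deviation. In either case the target statistics would come from the stored training data for the selected target voice.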
Regarding Claim 10, Mohammadi in view of Pan teaches the device of Claim 9 as well as a training unit that extracts phonemes and pitches from many people's voice data which become conversion destination voices ("To capture the necessary linguistic content, the target speaker may be prompted to recite one or more sentences that include all or most of the most common phonemes and diphones in the English language." [Mohammadi Col. 2, line 63 – Col 3, line 7]; "In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. That is, the target voice is one of a number of human and/or non-human voices" [Mohammadi Col. 3, lines 8 - 20]; "Thereafter, the voice conversion system also computes 120 the mean pitch and standard deviation for the audio segment of the target speech training data based on the data extracted in step 112 above. The representation of the DNN, mean pitch, and standard deviation are stored in a DB 120 for later use by the voice conversion system." [Mohammadi Col. 4, lines 10 - 30]), and trains a deep learning model capable of synthesizing each of the many people's voices from the phonemes and the pitches ("In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. That is, the target voice is one of a number of human and/or non-human voices" [Mohammadi Col. 3, lines 8 - 20]).
Regarding Claim 11, Mohammadi in view of Pan teaches the device of Claim 9 as well as wherein the extraction unit inputs the same sentences as in an utterance content of the conversion source voice in combination with the voice data of the conversion source voice, and analyzes the sentence to extract phonemes ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 – Col. 6, line 7]; "Each spectrum and pitch of the synthesized intermediate speech is then converted 156 by the DNN into a target speech spectrum and pitch. The pitch corresponding to the target speech training data is then converted 157 to an excitation signal, and the excitation signal convolved 158 with the target spectrum retrieved from the DNN after conversion 156. Each convolution 158 produces a segment of target speech in the voice of the target, and the plurality of target speech segments produce 160 a target voice sequence that may also be referred to as target speech. In accordance with this invention, the target speech has the linguistic content of the source speaker but the voice color of the target speaker." [Mohammadi Col. 6, lines 7 - 20] – Mohammadi describes the process of extracting linguistic content from voice data, and, as described in the rejection for Claim 9, Pan teaches using phonemes as the linguistic content input to a voice conversion system: “The waveform of the source speaker data is converted to phonetic data (e.g., phonetic posterior grams, PPGs). The PPG data is aligned to the waveform speaker data and defines one or more features corresponding to the prosody style of the source speaker. In addition to the one or more features defined by the PPG data, the computing system also extracts one or more additional prosody features from the source speaker data. Then, based on (i) the PPG data, (ii) the additionally extracted prosody features, and (iii) the voice timbre of the target speaker, the computing system generates a spectrogram having the voice timbre of the target speaker and the prosody style of the source speaker.” [Pan 0033].)
Regarding Claim 12, Mohammadi in view of Pan teaches the device of Claim 9 as well as wherein the extraction unit extracts phonemes by analyzing a sentence instead of the voice data of the conversion source voice, reads out pitches corresponding to the phonemes from a storage device, and transmits the pitches to the adjustment unit ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 – Col. 6, line 7]; "Each spectrum and pitch of the synthesized intermediate speech is then converted 156 by the DNN into a target speech spectrum and pitch. The pitch corresponding to the target speech training data is then converted 157 to an excitation signal, and the excitation signal convolved 158 with the target spectrum retrieved from the DNN after conversion 156. Each convolution 158 produces a segment of target speech in the voice of the target, and the plurality of target speech segments produce 160 a target voice sequence that may also be referred to as target speech. In accordance with this invention, the target speech has the linguistic content of the source speaker but the voice color of the target speaker." [Mohammadi Col. 6, lines 8 - 20]).
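As a rough illustration of the synthesis step quoted above from Mohammadi (Col. 6), in which the pitch is converted to an excitation signal that is then convolved with the target spectrum, the following Python sketch builds a simple excitation and convolves it with a short filter. It is a sketch under stated assumptions, not the method of the cited reference: the impulse-train excitation, the constant f0, the hypothetical function names, and the use of a short time-domain impulse response as a stand-in for the target spectrum are all chosen only for the example.

```python
def impulse_train(f0_hz, sample_rate, n_samples):
    """Assumed excitation: unit impulses spaced one pitch period apart."""
    period = max(1, round(sample_rate / f0_hz))  # pitch period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def convolve(signal, kernel):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Hypothetical values: 100 Hz pitch at an 800 Hz sample rate gives an 8-sample period.
excitation = impulse_train(f0_hz=100.0, sample_rate=800, n_samples=16)
# Hypothetical impulse response standing in for the target spectrum.
segment = convolve(excitation, [1.0, 0.5])
```

Each impulse in the excitation places one copy of the filter response in the output, so the spacing of the copies carries the pitch while the shape of the response carries the spectral envelope (the "voice color" in the quoted passage).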
Regarding Claim 14, Mohammadi teaches a voice conversion method causing a computer to:
input designation of a conversion destination voice ("In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi Col. 3, lines 8 - 20]); 
analyze voice data of a conversion source voice and extract time series data including a pitch ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 -63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 – Col 6, line 7]– the examiner notes that the preceding quotations describe the process of extracting linguistic content and pitch from a conversion source voice);
match a height of the pitch to a height of the designated conversion destination voice ("The intonation of the source speech is linearly mapped to have the same average as the average of the target intonation, by multiplying the source fundamental frequency by a constant value, which the constant value is computed from the training sentences. The intonation is represented using a sequence of frequency values comprising the fundamental frequency, f0, in a sequence of frames. This sequence of fundamental frequencies, f0, may be referred to as the pitch contour. The pitch contour of the source speech is transformed linearly so that its average and standard deviation is the same at the calculated average and standard deviation of the pitch contour of the target, respectively." [Mohammadi Col. 6, lines 21 - 37]); and
input the pitch to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generate voice data obtained by synthesizing the designated conversion destination voice ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 64 – Col 6 line 7]; "Each spectrum and pitch of the synthesized intermediate speech is then converted 156 by the DNN into a target speech spectrum and pitch. The pitch corresponding to the target speech training data is then converted 157 to an excitation signal, and the excitation signal convolved 158 with the target spectrum retrieved from the DNN after conversion 156. Each convolution 158 produces a segment of target speech in the voice of the target, and the plurality of target speech segments produce 160 a target voice sequence that may also be referred to as target speech. In accordance with this invention, the target speech has the linguistic content of the source speaker but the voice color of the target speaker." [Mohammadi Col 6, lines 7 - 20] – the examiner notes the preceding quotations describe the process of using linguistic content and pitch as inputs to a deep learning model; "In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. 
That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi Col. 3, lines 8 - 20] – the examiner notes that the system of Mohammadi can select one of multiple conversion destination voices).
Mohammadi does not explicitly teach:
analyze voice data of a conversion source voice and extract time series data including a phoneme; 
or input the phoneme to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generate voice data obtained by synthesizing the designated conversion destination voice.
Pan, however, does teach extracting phonemes from a conversion source voice and using phonemes as an input to a deep learning voice conversion model (“The waveform of the source speaker data is converted to phonetic data (e.g., phonetic posterior grams, PPGs). The PPG data is aligned to the waveform speaker data and defines one or more features corresponding to the prosody style of the source speaker. In addition to the one or more features defined by the PPG data, the computing system also extracts one or more additional prosody features from the source speaker data. Then, based on (i) the PPG data, (ii) the additionally extracted prosody features, and (iii) the voice timbre of the target speaker, the computing system generates a spectrogram having the voice timbre of the target speaker and the prosody style of the source speaker.” [Pan 0033]; “In some instances, the storage 140 includes computer-executable instructions 118 for instantiating or executing one or more of the models and/or engines shown in computing system 110. In some instances, the one or more models are configured as machine learning models or machine learned models.” [Pan 0042]; “Once the PPG data 936 is generated, the PPG-Mel model 938 receives the PPG data. The PPG-Mel model 938, or more broadly, the PPG-spectrogram model is configured to generate spectrogram data based on the received PPG data 936. The PPG to Mel model is initially a source PPG to Mel model, wherein the source PPG to Mel model 938 is trained on multi-speaker data. After the initial training, the PPG to Mel model 938 is then refined and/or adapted to be speaker dependent for a particular (or additional) target speaker. This is done by training the PPG to Mel model 938 on a target speaker's data (e.g., target speaker data 141). 
In this manner, the PPG-Mel model is able to generate spectrograms, or mel-spectrograms, in the target speaker's timbre with increased quality due to the speaker dependent adaptation.” [Pan 0110]; See Fig. 8 for a high-level view of Pan’s system.).
Mohammadi and Pan are both in the voice conversion art. Therefore, it would have been obvious to one of ordinary skill in the art prior to the filing date of the invention to incorporate Pan’s usage of phonemes as an input to a voice conversion system into Mohammadi’s usage of generic linguistic information because phonemes are a well-known type of linguistic information, and because Pan’s method allows for multiple speaker style features to be represented (“In these embodiments, a computing system converts a waveform of the source speaker data to phonetic posterior gram (PPG) data by aligning the waveform of the source speaker data to the PPG data, wherein the PPG data define one or more features corresponding to the prosody style of the source speaker data.” [Pan 0014]).
Regarding Claim 15, Mohammadi teaches a recording medium that records a program causing a computer to execute: 
processing of inputting designation of a conversion destination voice ("In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech. That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi Col. 3, lines 8 - 20]); 
processing of analyzing voice data of a conversion source voice and extracting time series data including a pitch ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5 line 63 – Col. 6, line 7]– the examiner notes that the preceding quotations describe the process of extracting linguistic content and pitch from a conversion source voice);
processing of matching a height of the pitch to a height of the designated conversion destination voice ("The intonation of the source speech is linearly mapped to have the same average as the average of the target intonation, by multiplying the source fundamental frequency by a constant value, which the constant value is computed from the training sentences. The intonation is represented using a sequence of frequency values comprising the fundamental frequency, f0, in a sequence of frames. This sequence of fundamental frequencies, f0, may be referred to as the pitch contour. The pitch contour of the source speech is transformed linearly so that its average and standard deviation is the same at the calculated average and standard deviation of the pitch contour of the target, respectively." [Mohammadi Col. 6, lines 21 - 37]); and 
processing of inputting the pitch to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generating voice data obtained by synthesizing the designated conversion destination voice ("the user may provide 150 source speech by speaking... The source speech... is transmitted to an ASR unit that recognizes and extracts 152 words and/or other linguistic content. In some embodiments, the ASR unit also outputs the start time and end time and/or duration of each of the extracted words as it appears in the source voice recording." [Mohammadi Col. 5, lines 52 - 63]; "Once the source speech is converted to text, that text is used to synthesize 154 intermediate speech... each word spoken by the user is used to retrieve a spectrum and pitch for the segment of speech as defined in the intermediate speech corpus." [Mohammadi Col. 5, line 63 – Col. 6, line 7]; "Each spectrum and pitch of the synthesized intermediate speech is then converted 156 by the DNN into a target speech spectrum and pitch. The pitch corresponding to the target speech training data is then converted 157 to an excitation signal, and the excitation signal convolved 158 with the target spectrum retrieved from the DNN after conversion 156. Each convolution 158 produces a segment of target speech in the voice of the target, and the plurality of target speech segments produce 160 a target voice sequence that may also be referred to as target speech. In accordance with this invention, the target speech has the linguistic content of the source speaker but the voice color of the target speaker." [Mohammadi Col. 6, lines 8 - 20] – the examiner notes the preceding quotations describe the process of using linguistic content and pitch as inputs to a deep learning model; "In the preferred embodiment, the target voice is one of a plurality of target voices that may be selected by the user and utilized to generate target speech.
That is, the target voice is one of a number of human and/or non-human voices to which the user can choose to convert his or her voice." [Mohammadi Col 3, lines 8 - 20] – the examiner notes that the system of Mohammadi can select one of multiple conversion destination voices).
Mohammadi does not explicitly teach:
processing of analyzing voice data of a conversion source voice and extracting time series data including a phoneme; or
processing of inputting the phoneme to a deep learning model that learns voice data of many people and is capable of synthesizing a designated person's voice in time-series order, and generating voice data obtained by synthesizing the designated conversion destination voice.
Pan, however, does teach extracting phonemes from a conversion source voice and using phonemes as an input to a deep learning voice conversion model (“The waveform of the source speaker data is converted to phonetic data (e.g., phonetic posterior grams, PPGs). The PPG data is aligned to the waveform speaker data and defines one or more features corresponding to the prosody style of the source speaker. In addition to the one or more features defined by the PPG data, the computing system also extracts one or more additional prosody features from the source speaker data. Then, based on (i) the PPG data, (ii) the additionally extracted prosody features, and (iii) the voice timbre of the target speaker, the computing system generates a spectrogram having the voice timbre of the target speaker and the prosody style of the source speaker.” [Pan 0033]; “In some instances, the storage 140 includes computer-executable instructions 118 for instantiating or executing one or more of the models and/or engines shown in computing system 110. In some instances, the one or more models are configured as machine learning models or machine learned models.” [Pan 0042]; “Once the PPG data 936 is generated, the PPG-Mel model 938 receives the PPG data. The PPG-Mel model 938, or more broadly, the PPG-spectrogram model is configured to generate spectrogram data based on the received PPG data 936. The PPG to Mel model is initially a source PPG to Mel model, wherein the source PPG to Mel model 938 is trained on multi-speaker data. After the initial training, the PPG to Mel model 938 is then refined and/or adapted to be speaker dependent for a particular (or additional) target speaker. This is done by training the PPG to Mel model 938 on a target speaker's data (e.g., target speaker data 141). 
In this manner, the PPG-Mel model is able to generate spectrograms, or mel-spectrograms, in the target speaker's timbre with increased quality due to the speaker dependent adaptation.” [Pan 0110]; See Fig. 8 for a high-level view of Pan’s system.).
Mohammadi and Pan are both in the voice conversion art. Therefore, it would have been obvious to one of ordinary skill in the art prior to the filing date of the invention to incorporate Pan’s usage of phonemes as an input to a voice conversion system into Mohammadi’s usage of generic linguistic information because phonemes are a well-known type of linguistic information, and because Pan’s method allows multiple speaker style features to be represented (“In these embodiments, a computing system converts a waveform of the source speaker data to phonetic posterior gram (PPG) data by aligning the waveform of the source speaker data to the PPG data, wherein the PPG data define one or more features corresponding to the prosody style of the source speaker data.” [Pan 0014]).

Allowable Subject Matter
Claim 13 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

The following is a statement of reasons for the indication of allowable subject matter: See applicant's arguments with respect to claim 13. The prior art, taken alone or in combination, fails to teach, inter alia, the voice conversion apparatus according to claim 9 wherein the extraction unit extracts an utterance section of each of the phonemes, and inputs the utterance section that is compressed or expanded to the generation unit, and the adjustment unit compresses or expands the pitches in a time direction in accordance with the compression or expansion of the utterance section.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
US Patent No. 10614826, Huffman
US Patent No. 10157608, Ohtani

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
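For illustration, the reply-period rules stated above can be sketched as a date calculation. This is a minimal sketch under stated assumptions, not USPTO tooling: the names `add_months` and `reply_deadlines` are hypothetical, a month offset is assumed to clamp to the last day of a shorter month, and carry-over for weekends or federal holidays is not modeled.

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months (assumption: clamp to month end)."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_deadlines(
    mailed: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> Tuple[date, date]:
    """Return (shortened statutory period end, six-month outer limit)."""
    shortened = add_months(mailed, 3)   # THREE MONTHS from the mailing date
    maximum = add_months(mailed, 6)     # never later than SIX MONTHS
    replied_in_two_months = (
        first_reply is not None and first_reply <= add_months(mailed, 2)
    )
    advisory_is_late = advisory_mailed is not None and advisory_mailed > shortened
    if replied_in_two_months and advisory_is_late:
        # The period then expires on the advisory action mailing date,
        # still capped by the six-month limit.
        shortened = min(advisory_mailed, maximum)
    return shortened, maximum


# Illustrative inputs only: the advisory action date below is hypothetical.
print(reply_deadlines(date(2025, 8, 18)))
print(reply_deadlines(date(2025, 8, 18), date(2025, 10, 14), date(2025, 12, 1)))
```

Extension fees under 37 CFR 1.136(a) would be measured from the first date returned; the second date is the outer limit that no extension can pass.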

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHEMOND DORVIL whose telephone number is (571)272-7602. The examiner can normally be reached 8:30 - 5:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
Prosecution Timeline

Feb 27, 2023: Application Filed
Jan 15, 2025: Non-Final Rejection mailed (§103)
Apr 09, 2025: Response Filed
Aug 18, 2025: Final Rejection mailed (§103)
Oct 14, 2025: Response after Non-Final Action
Dec 17, 2025: Request for Continued Examination
Jan 12, 2026: Response after Non-Final Action
May 26, 2026: Non-Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/173,402, Patent 12591738: Autocorrect Candidate Selection (3y 1m to grant; granted Mar 31, 2026)
18/461,095, Patent 12573397: ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF (2y 6m to grant; granted Mar 10, 2026)
18/301,064, Patent 12567401: EVALUATING RELIABILITY OF AUDIO DATA FOR USE IN SPEECH PROCESSING (2y 10m to grant; granted Mar 03, 2026)
18/447,506, Patent 12547849: ABSTRACTIVE SUMMARIZATION OF INFORMATION TECHNOLOGY ISSUES USING A METHOD OF GENERATING COMPARATIVES (2y 6m to grant; granted Feb 10, 2026)
18/005,801, Patent 12505853: SIGNAL PROCESSING DEVICE AND METHOD (2y 11m to grant; granted Dec 23, 2025)
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 28%
With Interview (+22.5%): 51%
Median Time to Grant: 3y 6m (~3m remaining)
PTA Risk: High
Based on 53 resolved cases by this examiner. Grant probability derived from career allowance rate.
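The projection figures can be reproduced with simple arithmetic. The sketch below assumes that the grant probability is the career allowance rate (15 granted out of 53 resolved, as reported in the examiner profile) and that the interview lift is added in percentage points. The additive reading matches the 28% and 51% shown but is an assumption, and the function names are hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved


def with_interview(base_rate: float, lift_points: float) -> float:
    """Assumption: the lift is additive percentage points, capped at 100."""
    return min(100.0, base_rate + lift_points)


base = allowance_rate(15, 53)          # about 28.3
boosted = with_interview(base, 22.5)   # about 50.8
print(f"{base:.0f}% -> {boosted:.0f}%")  # prints "28% -> 51%"
```

Rounding to whole percentages happens only at display time, which is why 28% plus 22.5 points appears as 51% rather than 50.5%.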