Prosecution Insights
Last updated: April 19, 2026
Application No. 18/596,031

METHODS FOR REAL-TIME ACCENT CONVERSION AND SYSTEMS THEREOF

Non-Final OA §103

Filed: Mar 05, 2024
Examiner: BLANKENAGEL, BRYAN S
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Sanas AI Inc.
OA Round: 1 (Non-Final)
Grant Probability: 67% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 67% (254 granted / 377 resolved; +5.4% vs TC avg)
Interview Lift: +35.2% on resolved cases with interview
Avg Prosecution: 2y 7m (23 currently pending)
Total Applications: 400 (across all art units)
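The headline figures in this panel follow from the raw counts shown above. A minimal sketch of the arithmetic; the rounding convention and the implied Tech Center average (allow rate minus the stated +5.4% delta) are assumptions, not data from the page:

```python
# Reproduce the examiner panel's headline numbers from the raw counts.
granted, resolved = 254, 377          # counts from the panel
allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")   # shown on the page as 67%

tc_delta = 5.4                        # percentage points above TC average (from the panel)
tc_avg = allow_rate * 100 - tc_delta  # implied TC 2600 average (derived, not stated)
print(f"Implied TC average: {tc_avg:.1f}%")
```

This back-of-envelope check is why "67%" and "+5.4% vs TC avg" are mutually consistent: 254/377 is about 67.4%, implying a Tech Center average near 62%.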

Statute-Specific Performance

§101: 25.6% (-14.4% vs TC avg)
§103: 49.3% (+9.3% vs TC avg)
§102: 13.3% (-26.7% vs TC avg)
§112: 6.5% (-33.5% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 377 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claim 4 is objected to because of the following informalities: the last line reads “the first pronunciation of the first speech content,” but there is no antecedent basis for this in the claims. Claim 11 appears to recite similar matter, and instead reads “the first pronunciation of the second speech content.” Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-7, 15-16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chang et al. (US 2014/0187210 A1), hereinafter referred to as Chang, in view of Dirac et al. (US 10,163,451 B2), hereinafter referred to as Dirac.
Regarding claim 1, Chang teaches: A system, comprising memory having instructions stored thereon and one or more processors coupled to the memory (Fig. 10 elements 1020, 1030, para [0069], where the components of the system are used) and configured to execute the instructions to: apply the first machine-learning algorithm to second speech content comprising a set of phonemes associated with a first pronunciation of the second speech content to generate an output (Fig. 8, para [0063], where a speaker configures their voice to have an accent removed, and para [0031-32], where the phonemes are used to determine the dialect of the speaker); based on the output, synthesize, using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent, third audio data representative of the second speech content having the second accent (Chang Fig. 8, para [0040], [0054], [0064], where another speaker chooses to apply voice modifications to other speakers for synthesis in another accent); and convert the synthesized third audio data into a synthesized version of the second speech content having the second accent (para [0035], where the synthesizer converts the spectrogram to a time-domain digital speech signal).

Chang does not teach: train a first machine-learning algorithm with first speech content from a first plurality of speakers having a first accent; using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent.

Dirac teaches: train a first machine-learning algorithm with first speech content from a first plurality of speakers having a first accent (Fig. 1 elements 131-134, col. 3 lines 36-60, where sample sets for different accents are used, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models); using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent (Dirac Fig. 3 elements 131-132, 321-322, col. 5 line 63 - col. 6 line 16, where sample sets for different accents are used, and accent translation models use both accents, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang by using the machine learning of Dirac (Dirac col. 7 lines 52-60) in the accent modification of Chang (Chang para [0063-64]), in order to refine the accent translation models continually using new audio samples (Dirac col. 7 lines 52-60).

Regarding claim 3, Chang in view of Dirac teaches: The system of claim 1, wherein the one or more processors are further configured to execute the instructions to map at least a first non-text linguistic representation of a first phoneme of the set of phonemes to a second non-text linguistic representation of a second phoneme associated with a second pronunciation of the second speech content, wherein the synthesized version of the second speech content further comprises the second phoneme and the first and second phonemes are different phonemes (Chang para [0031], where phonemes are distinguished by a unique pattern or signature in a spectrogram, and para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Regarding claim 4, Chang in view of Dirac teaches: The system of claim 3, wherein the second pronunciation of the second speech content is different than the first pronunciation of the first speech content (Chang para [0054], where a voice is modified to add an accent).
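For orientation, the claim 1 data flow (a first model applied to the input phonemes, a second model synthesizing speech in the target accent, then conversion to a time-domain signal, cf. Chang para [0035]) can be sketched as follows. The model internals here are hypothetical stand-ins; neither the claim nor the cited art fixes an architecture, so only the flow of data mirrors the claim language:

```python
# Illustrative sketch of the two-model pipeline recited in claim 1.
# Model internals are placeholders; only the data flow follows the claim.
import numpy as np

def first_ml_algorithm(phonemes):
    """Stands in for the first ML algorithm: maps the input phoneme
    sequence to a non-text linguistic representation (one vector each)."""
    rng = np.random.default_rng(0)
    return {p: rng.standard_normal(8) for p in phonemes}

def second_ml_algorithm(linguistic_repr):
    """Stands in for the second ML algorithm (trained on both accents):
    synthesizes a spectrogram-like representation in the second accent."""
    return np.stack(list(linguistic_repr.values()))

def to_time_domain(spectrogram):
    """Converts the synthesized representation to audio samples
    (cf. Chang para [0035], spectrogram -> time-domain signal)."""
    return np.fft.irfft(spectrogram, axis=1).ravel()

phonemes = ["HH", "EH", "L", "OW"]      # first pronunciation (illustrative)
output = first_ml_algorithm(phonemes)   # apply first model
spec = second_ml_algorithm(output)      # synthesize in second accent
audio = to_time_domain(spec)            # convert to waveform samples
print(audio.shape)
```

The dispute in the rejection is not this flow (which the examiner reads onto Chang) but whether the two stages are *machine-learning* models trained on accent-specific data, which is where Dirac is brought in.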
Regarding claim 5, Chang in view of Dirac teaches: The system of claim 1, wherein the one or more processors are further configured to execute the instructions to apply, to the output, a learned mapping between the first audio data and the second audio data (Chang para [0031], where phonemes are distinguished by a unique pattern or signature in a spectrogram, and para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Regarding claim 6, Chang in view of Dirac teaches: The system of claim 3, wherein the one or more processors are further configured to execute the instructions to map one or more frames in the output to one or more corresponding frames in the second non-text linguistic representation (Chang para [0031], [0035], where a spectrogram represents the spectra of the frames, and where the locations of the phonemes are modified or shaped to form the modified spectrogram).

Regarding claim 7, Chang in view of Dirac teaches: The system of claim 1, wherein the first audio data corresponds to a second plurality of speakers having the first accent and the second audio data corresponds to a single speaker having the second accent (Dirac Fig. 1 elements 131-134, col. 3 lines 36-60, where sample sets for different accents using various individuals are used, and Chang para [0032], where the dialect of a speaker is known ahead of time).

Regarding claim 15, Chang teaches: A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor (Fig. 10 elements 1020, 1030, para [0069], [0073], where the components of the system are used), cause the at least one processor to: apply a first machine-learning algorithm to first speech content comprising first phonemes associated with a first pronunciation to derive a non-text linguistic representation of the first phonemes (Fig. 8, para [0063], where a speaker configures their voice to have an accent removed, and para [0031-32], where the phonemes are used to determine the dialect of the speaker, and are identified using formants); based on the non-text linguistic representation of the first phonemes (para [0035], where the formants in the spectrogram are modified), synthesize, using a second machine-learning algorithm trained with first audio data comprising a first accent and second audio data comprising a second accent, third audio data representative of the first speech content having the second accent (Fig. 8, para [0040], [0054], [0064], where another speaker chooses to apply voice modifications to other speakers for synthesis in another accent), wherein the synthesizing comprises mapping at least a first non-text linguistic representation of a first phoneme of the first phonemes to a second non-text linguistic representation of a second phoneme of second phonemes associated with a second pronunciation of the first speech content (para [0031], where phonemes are distinguished by a unique pattern or signature in a spectrogram, and para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect); and convert the synthesized third audio data into a synthesized version of the first speech content having the second accent and comprising the second phonemes (para [0035], where the synthesizer converts the spectrogram to a time-domain digital speech signal).

Chang does not teach: a first machine-learning algorithm; using a second machine-learning algorithm trained with first audio data comprising a first accent and second audio data comprising a second accent.

Dirac teaches: a first machine-learning algorithm (Fig. 1 elements 131-134, col. 3 lines 36-60, where sample sets for different accents are used, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models); using a second machine-learning algorithm trained with first audio data comprising a first accent and second audio data comprising a second accent (Fig. 3 elements 131-132, 321-322, col. 5 line 63 - col. 6 line 16, where sample sets for different accents are used, and accent translation models use both accents, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang by using the machine learning of Dirac (Dirac col. 7 lines 52-60) in the accent modification of Chang (Chang para [0063-64]), in order to refine the accent translation models continually using new audio samples (Dirac col. 7 lines 52-60).

Regarding claim 16, Chang in view of Dirac teaches: The non-transitory computer-readable medium of claim 15, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to train the first machine-learning algorithm with fourth audio data comprising second speech content from speakers having the first accent (Dirac Fig. 1 elements 131-134, col. 3 lines 36-60, where sample sets for different accents are used, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models).

Regarding claim 18, Chang in view of Dirac teaches: The non-transitory computer-readable medium of claim 15, wherein the second pronunciation of the first speech content is different than the first pronunciation of the first speech content and the first and second phonemes are different phonemes (Chang para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).
Regarding claim 19, Chang in view of Dirac teaches: The non-transitory computer-readable medium of claim 15, wherein the first speech content further comprises a set of prosodic features, the instructions, when executed by the at least one processor, further cause the at least one processor to synthesize the third audio data and the set of prosodic features, and the synthesized version of the first speech content has the set of prosodic features (Chang para [0035], where the synthesized voice removes the accent while still sounding like the speaker, and para [0046], where tone and tempo are options for modification).

Regarding claim 20, Chang in view of Dirac teaches: The non-transitory computer-readable medium of claim 15, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to transmit the synthesized version of the first speech content to a computing device (Chang Fig. 8, para [0064], where the modified audio is transmitted to a speaker in a conference call).

Claims 2, 8-11, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chang, in view of Dirac, and further in view of Peng et al. (US 2015/0170642 A1), hereinafter referred to as Peng.

Regarding claim 2, Chang in view of Dirac teaches: The system of claim 1.

Chang in view of Dirac does not teach: wherein the one or more processors are further configured to execute the instructions to align and classify each of a plurality of frames of the first speech content corresponding to respective ones of the speakers to facilitate the training.

Peng teaches: wherein the one or more processors are further configured to execute the instructions to align and classify each of a plurality of frames of the first speech content corresponding to respective ones of the speakers to facilitate the training (para [0025-26], where alignment of frames is performed, and para [0030], where the accent group identifier is determined).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang in view of Dirac by using the training of Peng (Peng para [0018]) on the system of Chang in view of Dirac (Chang para [0063-64]) to associate actual pronunciations to an expected pronunciation and perform replacements of substitutions before processing an utterance (Peng para [0003]).

Regarding claim 8, Chang teaches: A method implemented by one or more computing devices and comprising: applying the first machine-learning algorithm to second speech content comprising a first set of phonemes associated with a first pronunciation of the second speech content (Fig. 8, para [0063], where a speaker configures their voice to have an accent removed, and para [0031-32], where the phonemes are used to determine the dialect of the speaker); based on the application of the first machine-learning algorithm, synthesizing, using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent, third audio data representative of the second speech content having the second accent (Chang Fig. 8, para [0040], [0054], [0064], where another speaker chooses to apply voice modifications to other speakers for synthesis in another accent); and converting the synthesized third audio data into a synthesized version of the second speech content having the second accent (para [0035], where the synthesizer converts the spectrogram to a time-domain digital speech signal).

Chang does not teach: aligning and classifying each of a plurality of frames of first speech content corresponding to respective speakers having a first accent to train a first machine-learning algorithm; using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent.

Dirac teaches: using a second machine-learning algorithm trained with first audio data comprising the first accent and second audio data comprising a second accent (Fig. 3 elements 131-132, 321-322, col. 5 line 63 - col. 6 line 16, where sample sets for different accents are used, and accent translation models use both accents, and col. 7 lines 52-60, where machine learning is used to update and refine the accent translation models).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang by using the machine learning of Dirac (Dirac col. 7 lines 52-60) in the accent modification of Chang (Chang para [0063-64]), in order to refine the accent translation models continually using new audio samples (Dirac col. 7 lines 52-60).

Peng teaches: aligning and classifying each of a plurality of frames of first speech content corresponding to respective speakers having a first accent to train a first machine-learning algorithm (para [0025-26], where alignment of frames is performed, and para [0030], where the accent group identifier is determined).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang in view of Dirac by using the training of Peng (Peng para [0018]) on the system of Chang in view of Dirac (Chang para [0063-64]) to associate actual pronunciations to an expected pronunciation and perform replacements of substitutions before processing an utterance (Peng para [0003]).
Regarding claim 9, Chang in view of Dirac and Peng teaches: The method of claim 8, further comprising mapping at least a first non-text linguistic representation of a first phoneme of the first set of phonemes to a second non-text linguistic representation of a second phoneme of a second set of phonemes associated with a second pronunciation of the second speech content to facilitate the synthesizing (Chang para [0031], where phonemes are distinguished by a unique pattern or signature in a spectrogram, and para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Regarding claim 10, Chang in view of Dirac and Peng teaches: The method of claim 9, wherein the synthesized version of the second speech content comprises the second set of phonemes (Chang para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Regarding claim 11, Chang in view of Dirac and Peng teaches: The method of claim 9, wherein the second pronunciation of the second speech content is different than the first pronunciation of the second speech content and the first and second phonemes are different phonemes (Chang para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Regarding claim 13, Chang in view of Dirac and Peng teaches: The method of claim 8, further comprising receiving a first user input indicating a selection of the first accent and a second user input indicating a selection of the second accent (Dirac col. 8 lines 33-52, col. 9 lines 15-38, where users manually select the first and second accents).

Regarding claim 17, Chang in view of Dirac teaches: The non-transitory computer-readable medium of claim 16.

Chang in view of Dirac does not teach: wherein the instructions, when executed by the at least one processor, further cause the at least one processor to align and classify each of a plurality of frames of the second speech content corresponding to respective ones of the speakers to train the first machine-learning algorithm.

Peng teaches: wherein the instructions, when executed by the at least one processor, further cause the at least one processor to align and classify each of a plurality of frames of the second speech content corresponding to respective ones of the speakers to train the first machine-learning algorithm (para [0025-26], where alignment of frames is performed, and para [0030], where the accent group identifier is determined).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chang in view of Dirac by using the training of Peng (Peng para [0018]) on the system of Chang in view of Dirac (Chang para [0063-64]) to associate actual pronunciations to an expected pronunciation and perform replacements of substitutions before processing an utterance (Peng para [0003]).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Chang, in view of Dirac, and Peng, and further in view of Feinauer et al. (US 2020/0193971 A1), hereinafter referred to as Feinauer.

Regarding claim 12, Chang in view of Dirac and Peng teaches: The method of claim 8, further comprising continuously converting the synthesized third audio data into a synthesized version of third speech content having the second accent between 50-700 ms after receiving the third speech content having the first accent, wherein the synthesized version of the third speech content has the second accent (Chang para [0015], where the voice modification is performed in real time).
Chang in view of Dirac and Peng does not teach: between 50-700 ms.

Feinauer teaches: between 50-700 ms (col. 22 line 60 - col. 23 line 9, where an example of 100 ms is determined as sounding like real time).

Chang in view of Dirac and Peng teaches accent conversion in real time (Chang para [0015]). However, claim 12 recites that the conversion is performed between 50-700 ms after receiving the content. Feinauer teaches a lag duration that sounds like real time to a user, with an example of 100 ms (Feinauer col. 22 line 60 - col. 23 line 9). The cited section of Feinauer gives the value of 100 ms only as an example, recognizing that any value that is perceived as real time would suffice, and selection of such would be within the level of ordinary skill in the art and would lead to predictable results. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have substituted the specific time value of Feinauer in place of the real time designation of Chang in view of Dirac and Peng, where the result of the substitution would predictably result in a non-noticeable delay.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Chang, in view of Dirac, and Peng, and further in view of Ganapathiraju et al. (US 2014/0025379 A1), hereinafter referred to as Ganapathiraju.
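The 50-700 ms limitation at issue in claim 12 is just a bound on total conversion delay. A toy latency-budget check; the component figures (frame size, lookahead, processing time) are hypothetical, since the claim bounds only the total:

```python
# Toy latency budget for the claimed 50-700 ms conversion window.
# All component figures below are illustrative assumptions.
frame_ms = 20          # analysis frame length (assumed)
lookahead_frames = 3   # model context beyond the current frame (assumed)
processing_ms = 35     # inference + vocoding per chunk (assumed)

# Algorithmic latency: current frame + lookahead, plus processing time.
latency_ms = frame_ms * (1 + lookahead_frames) + processing_ms
print(latency_ms)               # -> 115
assert 50 <= latency_ms <= 700  # inside the claimed range
```

Framing it this way shows why the examiner treats Feinauer's 100 ms example as a routine design point rather than a critical value: many component combinations land inside the claimed window.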
Regarding claim 14, Chang in view of Dirac and Peng teaches: The method of claim 8, wherein the first machine-learning algorithm comprises a non-text learned linguistic representation for the first accent (Chang para [0031], where accents consist of phonemes that are distinguished by a unique pattern or signature in a spectrogram) and the method further comprises: aligning and classifying each of the plurality of frames according to phoneme sounds of the first speech content to train the first machine-learning algorithm (Peng para [0025-26], where alignment of frames is performed, and para [0030], where the accent group identifier is determined); and detecting, for each of another plurality of frames in the second speech content, a respective monophone and triphone sound based on the non-text learned linguistic representation (Chang para [0031], where phonemes are distinguished by a unique pattern or signature in a spectrogram, and para [0035], where the phoneme formant stream is modified to shape formant patterns to match phonemes in another dialect).

Chang in view of Dirac and Peng does not teach: monophone and triphone sounds.

Ganapathiraju teaches: monophone and triphone sounds (para [0023], where monophones and triphones are both used).

Chang in view of Dirac and Peng teaches processing using phonemes (Chang para [0031]). However, claim 14 recites that monophones and triphones are used. Ganapathiraju teaches using monophones and triphones (Ganapathiraju para [0023]). Para [0023] recognizes that phonemes can be modeled in isolation or in context of other phonemes, both being within the level of ordinary skill in the art and predictable in usage. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have substituted the monophones and triphones of Ganapathiraju in place of the phonemes of Chang in view of Dirac and Peng, where the result of the substitution would predictably allow for processing of individual phonemes or phonemes in context of other phonemes.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 8,650,035 B1 col. 7 lines 3-22 teaches performing accent conversion in a conference call using selected conversion heuristics; US 2020/0193971 A1 para [0041-42] teaches accent and dialect modification in real time on a call; US 2023/0223006 A1 Abstract teaches adding and removing accents from voices.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRYAN S BLANKENAGEL whose telephone number is (571)270-0685. The examiner can normally be reached 8:00am-5:30pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BRYAN S BLANKENAGEL/
Primary Examiner, Art Unit 2658
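The monophone/triphone distinction the claim 14 rejection turns on (Ganapathiraju para [0023]) is simply modeling each phoneme in isolation versus in the context of its neighbors. A minimal sketch, using an illustrative phoneme sequence and the left-center+right triphone notation common in ASR toolkits (both are assumptions, not from the record):

```python
# Monophone vs. triphone units over one phoneme sequence.
def monophones(phones):
    """Each phoneme modeled in isolation."""
    return list(phones)

def triphones(phones):
    """Each phoneme modeled with its left and right neighbors,
    padded with silence ("sil") at the edges."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{l}-{c}+{r}" for l, c, r in zip(padded, padded[1:], padded[2:])]

phones = ["HH", "EH", "L", "OW"]
print(monophones(phones))  # ['HH', 'EH', 'L', 'OW']
print(triphones(phones))   # ['sil-HH+EH', 'HH-EH+L', 'EH-L+OW', 'L-OW+sil']
```

Either unit inventory describes the same underlying speech, which is the substitution rationale the rejection relies on.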

Prosecution Timeline

Mar 05, 2024
Application Filed
Jul 26, 2024
Response after Non-Final Action
Feb 11, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602551
GENERATION OF SYNTHETIC DOCUMENTS FOR DATA AUGMENTATION
2y 5m to grant Granted Apr 14, 2026
Patent 12579993
Multi-Talker Audio Stream Separation, Transcription and Diarization
2y 5m to grant Granted Mar 17, 2026
Patent 12572759
MULTILINGUAL CONVERSATION TOOL
2y 5m to grant Granted Mar 10, 2026
Patent 12555591
MACHINE LEARNING ASSISTED SPATIAL NOISE ESTIMATION AND SUPPRESSION
2y 5m to grant Granted Feb 17, 2026
Patent 12547836
KNOWLEDGE FACT RETRIEVAL THROUGH NATURAL LANGUAGE PROCESSING
2y 5m to grant Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 67%
With Interview (+35.2%): 99%
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 377 resolved cases by this examiner. Grant probability derived from career allow rate.
