Prosecution Insights
Last updated: April 19, 2026
Application No. 18/747,173

HIGH PRIVACY DSP-BASED AUDIO ANONYMIZATION WITH AUDIO SEGMENTATION AND RANDOMIZATION

Status: Non-Final OA (§103)
Filed: Jun 18, 2024
Examiner: PASHA, ATHAR N
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Huawei Technologies Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 90% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 90% (above average), 138 granted / 154 resolved, +27.6% vs TC avg
Interview Lift: strong, +17.0% across resolved cases with interview
Typical Timeline: 2y 8m avg prosecution, 18 currently pending
Career History: 172 total applications across all art units

Statute-Specific Performance

Statute   Rate     vs TC Avg
§101      21.9%    -18.1%
§103      49.4%    +9.4%
§102      16.9%    -23.1%
§112      5.2%     -34.8%
Tech Center averages are estimates. Based on career data from 154 resolved cases.
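As a sanity check, the per-statute deltas above can be reproduced with simple arithmetic. A minimal sketch, assuming the Tech Center average is simply the examiner's rate minus the stated delta (the tool's actual estimation method may differ); all variable names are invented for illustration:

```python
# Figures shown on this page: examiner's per-statute rate and delta vs TC avg (%).
examiner = {"101": 21.9, "103": 49.4, "102": 16.9, "112": 5.2}
delta    = {"101": -18.1, "103": 9.4, "102": -23.1, "112": -34.8}

for statute in examiner:
    # Back out the implied Tech Center average for each statute.
    tc_avg = examiner[statute] - delta[statute]
    print(f"§{statute}: examiner {examiner[statute]:.1f}% vs TC avg {tc_avg:.1f}% "
          f"({delta[statute]:+.1f} pts)")

# Career allow rate shown above: 138 granted of 154 resolved.
career = 138 / 154
print(f"Career allow rate: {career:.1%}")  # 89.6%, displayed as 90%
```

Notably, all four statutes back out the same 40.0% implied baseline, which suggests a single Tech Center average sits behind every delta in the table.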

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 6/18/24 is being considered by the examiner.

Claim Objections

Listed claims are objected to for the informalities shown and may be addressed with suggested amendments: Claims 8 and 18: … another first adjusted..

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3, 10, 11, 13, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (US 20250174214 A1) in further view of Suskind (US 20240078339 A1).

With respect to Claims 1 and 11, Zhang teaches (Claim 1) A method of generating an anonymized audio output, the method executable by a processor, the method comprising: (Claim 11) An electronic device comprising a non-transitory computer-readable medium and a processor for generating an anonymized audio output, the non-transitory computer-readable medium comprising instructions, which upon being executed by the processor, configure the processor to (¶[0097] According to an example implementation of the present disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer readable medium and includes computer-executable instructions that are executed by a processor to implement the method described above. According to the example embodiments of the present disclosure, there is provided a computer program product having a computer program stored thereon, and the computer program, when executed by a processor, implements the method described above):

acquiring an original audio input, the original audio input being an audio recording of a speaker (¶[0033] Specifically, the user may upload the first media data via the control 220, and the first media data may include timbre that the user desires to specify...
For another example, if the musical work desired to be generated by the user includes his or her own timbre, a piece of audio (and/or video) including his or her own speaking or singing may be uploaded.);

stochastically determining a base pitch value based on at least a first probabilistic function (Zhang ¶[0051] As shown in FIG. 6, the process of adjusting the pitch is described by taking only the audio segment 420 as an example of a plurality of audio segments. At block 610, the pitch 620 of the audio segment 420 may be determined, for example, the pitch 620 may be determined based on the mean 612 of pitches at various time points of the audio segment 420. Alternatively, and/or additionally, the pitch 620 may be determined based on a median 614 of the pitches at various time points of the audio segment 420; ¶[0044] In particular, audio segments that are too short in time length may be discarded, and only audio segments with time length meeting a predetermined length condition [probabilistic function] (e.g., not below 0.3 seconds or other numerical value) are retained.);

segmenting the original audio input into a plurality of audio segments, each of the plurality of audio segments being associated with a respective pitch (Zhang ¶[0040] dividing the first media data into a plurality of audio segments based on pitch information in the first media data; and generating the second media data using the plurality of audio segments.
It should be understood that each audio segment has a specified timbre, so that the second media data generated by using respective audio segments will have the specified timbre.);

for a first audio segment from the plurality of audio segments: generating a first pitch adjustment value using a combination of a first value and the base pitch value, the first value being determined using a second probabilistic function (¶[0052] For example, the pitch 650 of the note 510 [first value] may be obtained, and then the same pitch 650 may be obtained by resampling 632 and/or time scaling 634, and in this case, the pitch of the obtained audio segment 640 is the same as the pitch of the note 510; ¶[0048] According to an example implementation of the present disclosure, the mapping relationship may be established in a plurality of manners. For example, a corresponding target audio segment may be selected for the target note based on a random selection mode, in this case, the audio segments corresponding to the individual notes 510 [first value] are randomly selected [second probabilistic function]);

generating a first adjusted audio segment by adjusting a first pitch of the first audio segment using the first pitch adjustment value, the first adjusted audio segment having a first adjusted pitch that is different from the first pitch (¶[0052] For example, the pitch 650 of the note 510 may be obtained, and then the same pitch 650 may be obtained by resampling 632 and/or time scaling 634, and in this case, the pitch of the obtained audio segment 640 is the same as the pitch of the note 510; ¶[0048] According to an example implementation of the present disclosure, the mapping relationship may be established in a plurality of manners.
For example, a corresponding target audio segment may be selected for the target note based on a random selection mode, in this case, the audio segments corresponding to the individual notes 510 are randomly selected [second probabilistic function]);

Zhang does not explicitly disclose, however Suskind teaches, generating the anonymized audio output using the first adjusted audio segment (Suskind ¶[0055] Anonymization module 312 applies one or more levels of anonymization to captured video, images, and/or audio, as applicable. Transceiver module 316 transmits and/or receives anonymized and/or plaintext video, images, and/or audio via a network interface. Audio/video display module 320 enables the presentation of anonymized and/or plaintext video, images, and/or audio to a user via a user interface). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang to include the anonymization of Suskind in order to provide privacy in communication ([0066], Suskind).

With respect to Claims 3 and 13, Zhang teaches wherein the method further comprises: for a second audio segment from the plurality of audio segments (Zhang ¶[0047] At this time, for the target note in the plurality of notes, the target note may be mapped to a certain audio segment in the plurality of audio segments.
For example, the note 510 may be mapped to the audio segment 420, the note 512 may be mapped to the audio segment 422 [second audio segment], the note 514 may be mapped to the audio segment 426, the note 516 may be mapped to the audio segment 420, and so forth.):

generating a second pitch adjustment value using a combination of a second value and the base pitch value, the second value being different from the first value, the second pitch adjustment value being different from the first pitch adjustment value (Zhang ¶[0047] At this time, for the target note in the plurality of notes, the target note may be mapped to a certain audio segment in the plurality of audio segments. For example, the note 510 may be mapped to the audio segment 420, the note 512 may be mapped to the audio segment 422, the note 514 may be mapped to the audio segment 426, the note 516 may be mapped to the audio segment 420, and so forth.);

generating a second adjusted audio segment by adjusting a second pitch of the second audio segment using the second pitch adjustment value, the second adjusted audio segment having a second adjusted pitch that is different from the second pitch (Zhang ¶[0047] At this time, for the target note in the plurality of notes, the target note may be mapped to a certain audio segment in the plurality of audio segments. For example, the note 510 may be mapped to the audio segment 420, the note 512 may be mapped to the audio segment 422 [second audio segment], the note 514 may be mapped to the audio segment 426, the note 516 may be mapped to the audio segment 420, and so forth.); and

Zhang does not explicitly disclose, however Suskind teaches, wherein the generating the anonymized audio output further comprises using the second adjusted audio segment (Suskind ¶[0055] FIG. 300 illustrates a computer system 200 executing program instructions stored in memory 208 using processor 212 to provide various modules to offer anonymizing videoconferencing functionality.
Audio/video capture module 304 captures video and/or still images using a camera and/or a microphone, as applicable. Participant location module 308 locates a participant in an image or video capture or audio sample. Anonymization module 312 applies one or more levels of anonymization to captured video, images, and/or audio, as applicable. Transceiver module 316 transmits and/or receives anonymized and/or plaintext video, images, and/or audio via a network interface. Audio/video display module 320 enables the presentation of anonymized and/or plaintext video, images, and/or audio to a user via a user interface).

With respect to Claims 10 and 20, Zhang teaches trigger transmission, to a server over a communication network, of the anonymized audio output in lieu of the original audio input (Suskind ¶[0055] Anonymization module 312 applies one or more levels of anonymization to captured video, images, and/or audio, as applicable. Transceiver module 316 transmits and/or receives anonymized and/or plaintext video, images, and/or audio via a network interface. Audio/video display module 320 enables the presentation of anonymized and/or plaintext video, images, and/or audio to a user via a user interface). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang to include the anonymization of Suskind in order to provide privacy in communication ([0066], Suskind).

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang and Suskind in further view of Chen (US 20070198263 A1).

With respect to Claims 2 and 12, Zhang and Suskind do not explicitly disclose, however Chen teaches, determining a gender of the speaker using the original audio input (Chen, Claim 6:
The method of claim 5 wherein determining the speaker's age and/or gender includes determining whether the runtime pitch falls into a range, wherein the range depends on the speaker's age and/or gender.); and wherein the stochastically determining a base pitch value is further based on the gender of the speaker (Chen, Claim 6: The method of claim 5 wherein determining the speaker's age and/or gender includes determining whether the runtime pitch falls into a range, wherein the range depends on the speaker's age and/or gender.). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind to include the gender-based pitch of Chen in order to provide a proper pitch adjustment from a baseline pitch.

Claims 4-5 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang, Suskind and Chen in further view of Ll (US 20210043222 A1).

With respect to Claims 4 and 14, none of Zhang, Suskind and Chen explicitly disclose, however Ll teaches, wherein the method further comprises extracting a plurality of features from the original audio input using a feature extraction model ([0030] Since the feature extraction unit 113 uses the CNN, characteristics of feature extraction are formed through learning with the neural network using the CNN), and wherein the determining the gender further comprises: inputting the plurality of features into a gender classification model ([0031] The classification unit 114 performs a plurality of sets of classification processing for classifying emotion and gender, using the feature data output from the feature extraction unit 113. The classification unit 114 is constituted by a neural network that has three hierarchical structures, for example.
Although the details will be described later, the classification unit 114 includes, for example, (i) a layer that has a bidirectional LSTM configuration, (ii) a layer that has an attention mechanism, and (iii) a layer that classifies and outputs emotion and gender corresponding to speech data); and outputting a gender class value by the gender classification model, the gender class value being indicative of the gender of the speaker in the original audio input ([0031] The classification unit 114 performs a plurality of sets of classification processing for classifying emotion and gender, using the feature data output from the feature extraction unit 113. The classification unit 114 is constituted by a neural network that has three hierarchical structures, for example. Although the details will be described later, the classification unit 114 includes, for example, (i) a layer that has a bidirectional LSTM configuration, (ii) a layer that has an attention mechanism, and (iii) a layer that classifies and outputs emotion and gender corresponding to speech data). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind and the gender-based pitch of Chen to include the classification model of Ll in order to provide a proper pitch adjustment from a baseline pitch.

With respect to Claims 5 and 15, Ll further teaches wherein the feature extraction model is at least one of: a Convolutional Neural Network (CNN), Recurrent Neural Networks (RNN), Gaussian Mixture Models (GMM) ([0030] Since the feature extraction unit 113 uses the CNN, characteristics of feature extraction are formed through learning with the neural network using the CNN).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind and the gender-based pitch of Chen to include the classification model of Ll in order to provide a proper pitch adjustment from a baseline pitch.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang and Suskind in further view of Divikaran (US 20040143434 A1).

With respect to Claims 6 and 16, none of Zhang and Suskind explicitly disclose, however Divikaran teaches, wherein the segmenting the original audio input comprises: employing a segmentation model to segment the original audio input into the plurality of audio segments, the segmentation model being at least one of: Hidden Markov Models (HMMs), and Gaussian Mixture Models (GMM) (Divikaran ¶[0008] The invention uses a generalized sound recognition hidden Markov model (HMM) framework for joint segmentation and classification of the audio signal of the news video. The HMM not only provides a classification label for audio segment, but also compact state duration histogram descriptors). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind to include the segmentation model of Divikaran in order to provide joint segmentation and classification of male/female (Divikaran, [0008]).

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang and Suskind in further view of Zhang_2 (US 20110307260 A1).

With respect to Claims 7 and 17, none of Zhang and Suskind explicitly disclose, however Zhang_2 teaches, wherein the method further comprises: generating the first value using the gender classification model based on the extracted pitch of the speaker (Zhang_2 ¶[0084] Audio Feature Extraction module 302 will also identify the pitch of the speaker.
Alternatively, Audio Feature Extraction 302 can identify data that can be used to determine pitch. That information is provided to Gender Classifier-Pitch 312. In response to the data from Audio Feature Extraction 302, the Gender Classifier-Pitch 312 will classify whether the speaker is male or female. It is understood that females have a higher pitch than males). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind to include the classification based on pitch of Zhang_2 in order to produce an accurate recognition of a user's gender (Zhang_2, [0025]).

Claims 8-9 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang and Suskind in further view of Sakurai (US 20090144064 A1).

With respect to Claims 8 and 18, Suskind further teaches generating the anonymized audio output using the other first adjusted audio segment (Suskind ¶[0055] FIG. 300 illustrates a computer system 200 executing program instructions stored in memory 208 using processor 212 to provide various modules to offer anonymizing videoconferencing functionality. Audio/video capture module 304 captures video and/or still images using a camera and/or a microphone, as applicable. Participant location module 308 locates a participant in an image or video capture or audio sample. Anonymization module 312 applies one or more levels of anonymization to captured video, images, and/or audio, as applicable. Transceiver module 316 transmits and/or receives anonymized and/or plaintext video, images, and/or audio via a network interface. Audio/video display module 320 enables the presentation of anonymized and/or plaintext video, images, and/or audio to a user via a user interface).
None of Zhang and Suskind explicitly disclose, however Sakurai teaches, wherein the method further comprises: generating an other first adjusted audio segment using a time-scale modification based pitch shifting algorithm and the first adjusted audio segment (Sakurai ¶[0003] One prior art technique achieves pitch shifting by seamless time-scale modification (TSM) and restoration of the original time scale through sampling rate conversion). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind to include the pitch shifting of Sakurai in order to accurately switch pitch (Sakurai, [0027]).

With respect to Claims 9 and 19, Sakurai further teaches wherein the time-scale modification based pitch shifting algorithm is at least one of: Phase Vocoder (PV), Synchronous Overlap and Add (SOLA), Pitch-Synchronous Overlap and Add (PSOLA), and Waveform Similarity Overlap-Add (WSOLA) (Sakurai ¶[0003] One prior art technique achieves pitch shifting by seamless time-scale modification (TSM) and restoration of the original time scale through sampling rate conversion; ¶[0013] There are two common approaches to changing the fundamental frequency contour in speech synthesis systems. The first approach uses a speech production model. Voiced speech is approximated as the output of a vocal tract filter fed by an impulse train or another excitation signal source. Controlling the fundamental frequency is relatively straightforward, since it is dictated by the fundamental frequency of the source. However, such systems only work satisfactorily for signals containing pure speech that can be approximated by the model. The second approach is known as PSOLA).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the pitch determination of Zhang in view of the anonymization of Suskind to include the pitch shifting of Sakurai in order to accurately switch pitch (Sakurai, [0027]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408) 918-7675. The examiner can normally be reached on Monday-Thursday and alternate Fridays, 7:30-4:30 PT.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/
Primary Examiner, Art Unit 2657
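Setting the legal mapping aside, the signal-processing pipeline the rejection attributes to the claims (segment the input, discard segments under 0.3 s, summarize each segment's pitch as the mean of framewise estimates, draw a base pitch stochastically, then shift each segment by a per-segment random offset via resampling/time-scaling) can be sketched briefly. This is an illustrative reading only: the fixed-length segmentation, autocorrelation pitch estimator, pitch ranges, and every function and parameter name below are invented stand-ins, and a production system would restore the time scale with PSOLA/WSOLA-style TSM rather than bare resampling.

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 16_000          # sample rate (Hz), assumed
MIN_LEN_S = 0.3      # minimum segment length condition (Zhang ¶[0044])

def segment(audio, seg_len_s=0.5):
    """Naive fixed-length segmentation; Zhang segments on pitch information,
    and the claims also contemplate HMM/GMM models -- this is a stand-in."""
    hop = int(seg_len_s * SR)
    segs = [audio[i:i + hop] for i in range(0, len(audio), hop)]
    # Keep only segments meeting the minimum-length condition.
    return [s for s in segs if len(s) >= MIN_LEN_S * SR]

def segment_pitch(seg):
    """Crude f0 estimate via autocorrelation of one frame, standing in for
    the mean/median of framewise pitches (Zhang ¶[0051])."""
    frame = seg[: int(0.03 * SR)]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // 400, SR // 60          # search 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    return SR / lag

def shift_pitch(seg, ratio):
    """Pitch shift by resampling (this also changes duration; a TSM stage
    would restore the original time scale, as in Sakurai ¶[0003])."""
    idx = np.arange(0, len(seg), ratio)
    return np.interp(idx, np.arange(len(seg)), seg)

def anonymize(audio):
    base = rng.uniform(100.0, 250.0)       # stochastic base pitch (first probabilistic function)
    out = []
    for seg in segment(audio):
        offset = rng.uniform(-20.0, 20.0)  # per-segment random value (second probabilistic function)
        out.append(shift_pitch(seg, (base + offset) / segment_pitch(seg)))
    return np.concatenate(out)

t = np.arange(SR) / SR                     # 1 s of a 150 Hz test tone
anon = anonymize(np.sin(2 * np.pi * 150 * t))
```

Because each segment receives its own random offset around a randomly drawn base, no single inverse shift recovers the speaker's original pitch contour, which is the privacy argument the combination of references is being read onto.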

Prosecution Timeline

Jun 18, 2024
Application Filed
Mar 07, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596882
COMPLIANCE DETECTION USING NATURAL LANGUAGE PROCESSING
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12586563
Method, System and Apparatus for Understanding and Generating Human Conversational Cues
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579173
SYSTEMS AND METHODS FOR DYNAMICALLY PROVIDING INTELLIGENT RESPONSES
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12566921
GAZETTEER INTEGRATION FOR NEURAL NAMED ENTITY RECOGNITION
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12547844
INTELLIGENT MODEL SELECTION SYSTEM FOR STYLE-SPECIFIC DIGITAL CONTENT GENERATION
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 90%
With Interview: 99% (+17.0%)
Median Time to Grant: 2y 8m
PTA Risk: Low
Based on 154 resolved cases by this examiner. Grant probability derived from career allow rate.
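The headline projections can be reproduced from the career figures, under the assumption (consistent with the note above) that grant probability is the rounded career allow rate; the 99% cap on the interview figure is my guess at how the tool avoids quoting more than 99%:

```python
granted, resolved = 138, 154
allow_rate = granted / resolved               # 0.896, shown as 90%
interview_lift = 0.17                         # stated +17.0% interview lift
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"Grant probability: {allow_rate:.0%}")      # prints "Grant probability: 90%"
print(f"With interview:    {with_interview:.0%}")  # prints "With interview:    99%"
```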
