Last updated: May 29, 2026
Application No. 18/654,898
ADAPTIVE SPEECH REGENERATION

Final Rejection §103
Filed: May 03, 2024
Priority: May 04, 2023 (provisional 63/500,164)
Examiner: ROBERTS, SHAUN A
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Shure Acquisition Holdings Inc.
OA Round: 2 (Final)
Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 76% grant rate with a +10.5% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 652 resolved cases, 2023–2026
Examiner Intelligence

ROBERTS, SHAUN A
Career Allowance Rate: 76% (above average); 495 granted / 652 resolved; +13.9% vs TC avg
Interview Lift: +10.5% (moderate), based on resolved cases with interview
Avg Prosecution: 2y 11m (typical timeline); 18 currently pending
Total Applications: 679 across all art units (career history)
Statute-Specific Performance

§101: 1.8% (-38.2% vs TC avg)
§103: 83.6% (+43.6% vs TC avg)
§102: 12.5% (-27.5% vs TC avg)
§112: 0.1% (-39.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 652 resolved cases.
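The examiner figures above can be cross-checked against the raw counts. The following is a minimal sketch, not the analytics provider's method. It assumes each "vs TC avg" figure is a plain percentage-point difference, which the page does not state; the variable names are illustrative.

```python
# Cross-check of the examiner statistics quoted above.
# Assumption (not stated in the source): "vs TC avg" is a plain
# percentage-point difference between the examiner and the Tech Center.

granted, resolved = 495, 652
allowance_rate = 100 * granted / resolved
print(f"Career allowance rate: {allowance_rate:.1f}%")

# Statute-specific figures as (examiner %, delta vs TC avg in points).
statute_stats = {
    "101": (1.8, -38.2),
    "103": (83.6, 43.6),
    "102": (12.5, -27.5),
    "112": (0.1, -39.9),
}

# Back out the Tech Center baseline implied by each pair of figures.
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in statute_stats.items()
}
for statute, avg in implied_tc_avg.items():
    print(f"§{statute}: implied TC average {avg}%")
```

Under that assumption, every statute implies the same 40% baseline, which is consistent with the Tech Center figure being a flat estimate rather than a per-statute measurement.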
Office Action

DETAILED ACTION
1.	This action is responsive to remarks filed 3/5/26.
Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
3.	Claim 24 is newly added.
Response to Arguments
4.	Applicant’s arguments filed have been fully considered but are not persuasive.
Regarding claim 1, Applicant argues that claim 1 requires a single model generating both representations, where the first audio transformation model generates the first voice vector representation and the first speech vector representation (and that the limitations are mapped to different systems and neither reference individually teaches a single model generating both representations). Examiner respectfully disagrees.
Regarding claim 1 Hijazi teaches An audio processing apparatus comprising at least one processor and a memory storing instructions that are operable, when executed by the at least one processor (fig 3A; para 53-55; 0053: system for data driven audio enhancement; processing apparatus that is configured to receive…audio from one or more microphones; 54: processors; memory), to cause the audio processing apparatus to: 
receive audio signals captured by one or more audio capture devices, wherein the audio signals comprise one or more first speech signals associated with a first speaking entity (0053; 57: microphones capture the audio data; 0068: speech signals as uttered by the one of the one or more users; speaker identification); 
divide the audio signals into one or more audio segments (0055: determine a window of audio samples; 0035); 
input a first audio segment of the one or more audio segments to a first audio transformation model to generate a first voice vector representation, wherein the first voice vector representation comprises one or more characteristics related to a first speaking voice associated with the first speaking entity (0068: speaker identification; 
[0086] In some implementations (not shown in FIG. 5), a speech signal may be further classified by speaker and the audio enhancement network may be selected and/or modified based on the speaker identification. For example, once the likely speaker is identified by voice and/or facial recognition, the speech enhancement may be further improved by being swapping in a speaker-specific deep neural for speech improvement, or bring in a speaker-specific embedding (e.g., an “i-vector”) to a speech enhancement network trained to use speaker embedding information to further improve disambiguation of speech from noise. For example, the process 500 may include selecting 540, based on the speaker identification (e.g., in addition to one or more classifications), the audio enhancement network from the set of multiple audio enhancement networks. For example, the selected audio enhancement network may include a neural network that has been trained using speech signals of a person associated with the speaker identification. In some implementations, the selected audio enhancement network includes a speaker-specific embedding associated with the speaker identification.); 
{input the one or more audio segments to the first audio transformation model to generate a first speech vector representation, wherein the first speech vector representation comprises one or more words spoken by the first speaking entity and one or more respective contextual attributes associated with the one or more words;} and 
output one or more of the first voice vector representation or the first speech vector representation (fig 1;
[0090] The process 500 includes storing, playing, or transmitting 560 an enhanced audio signal based on the enhanced audio segment. For example, determining the enhanced audio signal may include blending the sequence of respective enhanced audio segments, including combining (e.g., using weighted averaging) corresponding samples from multiple segments in overlapping portions of the enhanced audio segments. In some implementations, where the audio data is extracted from AV data, the enhanced audio signal is merged (e.g., synchronized and encoded) with a visual track from the AV data and stored or transmitted 560 as part of enhanced AV data.).
Hijazi doesn’t specifically teach where Balasubramaniam teaches
input the one or more audio segments to the first audio transformation model to generate a first speech vector representation, wherein the first speech vector representation comprises one or more words spoken by the first speaking entity and one or more respective contextual attributes associated with the one or more words
(Balasubramaniam et al (10,896,428) teaches 
Fig 3; Col 10 l. 15-30: ASR; Col 12 l. 45-62: speech analyzer may generate a feature vector for the frame using these statistics, such as a feature vector that includes a separate dimension for each of the computed statistics; 
Col 13 l. 17-47: At block 316, the speech analyzer 116 or some other module or component of the dynamic contact management system 100 can generate speech feature data using the feature vectors generated above. In some embodiments, the speech analyzer 116 can analyze the feature vectors with respect to a model trained to classify the feature vectors into particular classifications, such as classifications associated with particular emotions. Illustratively, the model may be a k-means clustering trained model, a support vector machine (“SVM”) trained model, or some other classification model. In some embodiments, the speech analyzer 116 may use different models in different contexts. In this way, the speech analyzer 116 may accurately determine speech features (e.g., emotion classifications) across different contexts, when a general model may not accurately determine speech features in specific contexts. The different models may be designed for analyzing speech that has different speech characteristics, such as speaking rate, prosody, pitch, volume, language, dialect, accent, some combination thereof, etc. For example, the speech analyzer 116 may access data regarding the gender of a speaker (e.g., data associated with a profile of a customer or agent, modelling applied to predict the gender of a speaker, etc.). The speech analyzer may use a first model if the person whose voice is represented by the audio data is a male, and a second model if the person whose voice is represented by the audio data is a female. As another example, the speech analyzer 116 may access data regarding a geographic location of a speaker. The speech analyzer 116 may select a model tailored to speech characteristics of speakers (e.g., accent, prosody, etc.) in the geographic location.;
Col 14 l. 47-50).

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Balasubramaniam to obtain additional speech components for improved speech enhancement.
Hijazi already teaches obtaining audio and speech components and processing using multiple networks/models to enhance the speech.  Hijazi teaches performing speaker identification and removing noise.  In a similar field of endeavor, Balasubramaniam is directed to a system that analyzes aspects of speech from a customer contact and generates dynamic output based on the analysis (Col 1 l. 54-56).  One could thus look to Balasubramaniam to further incorporate additional speech components for more comprehensive speech enhancement.  Hijazi also teaches the multiple enhancement models and using a plurality of different input data (video, speech, music, etc – figure 1), and further incorporating Balasubramaniam (and its additional speech vectors) would still allow for the audio processing and merging (of Hijazi) presenting a reasonable expectation of success in still obtaining the enhanced signals.
Hijazi teaches a speech enhancement network:
[0040] The system 100 includes a speech enhancement network 130 that may be trained to determine an enhanced speech segment 131 based on a window of audio samples 121 for which the speech enhancement network 130 has been selected based on a classification of the window of audio samples 121 determined by the audio segmentation and classification module 120. Speech enhancement may be a complex and sophisticated audio enhancement method in the system 100. For example, the speech enhancement network 130 may be a learned system, using deep neural network structures trained on a wide variety of speech impaired by noise, speech babble, music and acoustic reverberation to cover the wide range possible speech inputs. The training of the deep neural network may be key to its success in mapping noisy, reverberant speech into cleaner speech with less reverberation. An example of a structure of a training process is shown in FIG. 2. For example, the process 700 of FIG. 7 may be implemented to train a neural network of the speech enhancement network 130.
Therefore, Hijazi teaches the speech enhancement network 130 may be a learned system, using deep neural network structures trained on a wide variety of speech and it would be obvious to allow the particular network to further incorporate Balasubramaniam to extract additional speech components, for an improved system allowing for more comprehensive speech enhancement.

Allowable Subject Matter
5.	Claim 24 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The claim recites:
The audio processing apparatus of claim 1, wherein the first audio transformation model is configured to (i) determine one or more frequency patterns associated with a speaking voice of the first speaking entity, (ii) extract one or more speech primitives comprising the one or more words, and (iii) determine the one or more respective contextual attributes associated with the one or more speech primitives;
Which is not taught by the cited prior art of record.

Claim Rejections - 35 USC § 103
6.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8.	Claims 1-13, 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hijazi et al (2020/0075033) in view of Balasubramaniam et al (10,896,428).
	
Regarding claim 1 Hijazi teaches An audio processing apparatus comprising at least one processor and a memory storing instructions that are operable, when executed by the at least one processor (fig 3A; para 53-55; 0053: system for data driven audio enhancement; processing apparatus that is configured to receive…audio from one or more microphones; 54: processors; memory), to cause the audio processing apparatus to: 
receive audio signals captured by one or more audio capture devices, wherein the audio signals comprise one or more first speech signals associated with a first speaking entity (0053; 57: microphones capture the audio data; 0068: speech signals as uttered by the one of the one or more users; speaker identification); 
divide the audio signals into one or more audio segments (0055: determine a window of audio samples; 0035); 
input a first audio segment of the one or more audio segments to a first audio transformation model to generate a first voice vector representation, wherein the first voice vector representation comprises one or more characteristics related to a first speaking voice associated with the first speaking entity (0068: speaker identification; 
[0086] In some implementations (not shown in FIG. 5), a speech signal may be further classified by speaker and the audio enhancement network may be selected and/or modified based on the speaker identification. For example, once the likely speaker is identified by voice and/or facial recognition, the speech enhancement may be further improved by being swapping in a speaker-specific deep neural for speech improvement, or bring in a speaker-specific embedding (e.g., an “i-vector”) to a speech enhancement network trained to use speaker embedding information to further improve disambiguation of speech from noise. For example, the process 500 may include selecting 540, based on the speaker identification (e.g., in addition to one or more classifications), the audio enhancement network from the set of multiple audio enhancement networks. For example, the selected audio enhancement network may include a neural network that has been trained using speech signals of a person associated with the speaker identification. In some implementations, the selected audio enhancement network includes a speaker-specific embedding associated with the speaker identification.); 
{input the one or more audio segments to the first audio transformation model to generate a first speech vector representation, wherein the first speech vector representation comprises one or more words spoken by the first speaking entity and one or more respective contextual attributes associated with the one or more words;} and 
output one or more of the first voice vector representation or the first speech vector representation (fig 1;
[0090] The process 500 includes storing, playing, or transmitting 560 an enhanced audio signal based on the enhanced audio segment. For example, determining the enhanced audio signal may include blending the sequence of respective enhanced audio segments, including combining (e.g., using weighted averaging) corresponding samples from multiple segments in overlapping portions of the enhanced audio segments. In some implementations, where the audio data is extracted from AV data, the enhanced audio signal is merged (e.g., synchronized and encoded) with a visual track from the AV data and stored or transmitted 560 as part of enhanced AV data.).
Hijazi doesn’t specifically teach where Balasubramaniam teaches
input the one or more audio segments to the first audio transformation model to generate a first speech vector representation, wherein the first speech vector representation comprises one or more words spoken by the first speaking entity and one or more respective contextual attributes associated with the one or more words
(Balasubramaniam et al (10,896,428) teaches 
Fig 3; Col 10 l. 15-30: ASR; Col 12 l. 45-62: speech analyzer may generate a feature vector for the frame using these statistics, such as a feature vector that includes a separate dimension for each of the computed statistics; 
Col 13 l. 17-47: At block 316, the speech analyzer 116 or some other module or component of the dynamic contact management system 100 can generate speech feature data using the feature vectors generated above. In some embodiments, the speech analyzer 116 can analyze the feature vectors with respect to a model trained to classify the feature vectors into particular classifications, such as classifications associated with particular emotions. Illustratively, the model may be a k-means clustering trained model, a support vector machine (“SVM”) trained model, or some other classification model. In some embodiments, the speech analyzer 116 may use different models in different contexts. In this way, the speech analyzer 116 may accurately determine speech features (e.g., emotion classifications) across different contexts, when a general model may not accurately determine speech features in specific contexts. The different models may be designed for analyzing speech that has different speech characteristics, such as speaking rate, prosody, pitch, volume, language, dialect, accent, some combination thereof, etc. For example, the speech analyzer 116 may access data regarding the gender of a speaker (e.g., data associated with a profile of a customer or agent, modelling applied to predict the gender of a speaker, etc.). The speech analyzer may use a first model if the person whose voice is represented by the audio data is a male, and a second model if the person whose voice is represented by the audio data is a female. As another example, the speech analyzer 116 may access data regarding a geographic location of a speaker. The speech analyzer 116 may select a model tailored to speech characteristics of speakers (e.g., accent, prosody, etc.) in the geographic location.;
Col 14 l. 47-50).

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Balasubramaniam to obtain additional speech components for improved speech enhancement.
Hijazi already teaches obtaining audio and speech components and processing using multiple networks/models to enhance the speech.  Hijazi teaches performing speaker identification and removing noise.  In a similar field of endeavor, Balasubramaniam is directed to a system that analyzes aspects of speech from a customer contact and generates dynamic output based on the analysis (Col 1 l. 54-56).  One could thus look to Balasubramaniam to further incorporate additional speech components for more comprehensive speech enhancement.  Hijazi also teaches the multiple enhancement models and using a plurality of different input data (video, speech, music, etc – figure 1), and further incorporating Balasubramaniam (and its additional speech vectors) would still allow for the audio processing and merging (of Hijazi) presenting a reasonable expectation of success in still obtaining the enhanced signals.
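The cited passages describe dividing audio into windows of samples (Hijazi 0055) and blending overlapping enhanced segments by weighted averaging (Hijazi 0090) only in prose. A minimal sketch of that general technique follows. It assumes fixed-size windows, a fixed hop, and triangular weights; the function names and all parameters are illustrative and are not taken from Hijazi or from the application.

```python
def split_into_windows(samples, window, hop):
    """Divide a sample sequence into overlapping windows (hop < window)."""
    last_start = max(len(samples) - window, 0)
    return [(start, samples[start:start + window])
            for start in range(0, last_start + 1, hop)]


def blend_windows(windows, total_length, window):
    """Recombine windows; overlapping samples get a weighted average."""
    # Triangular weights (an assumption): samples near a window's
    # centre count more than samples near its edges.
    weights = [1 + min(i, window - 1 - i) for i in range(window)]
    acc = [0.0] * total_length
    norm = [0.0] * total_length
    for start, segment in windows:
        for i, value in enumerate(segment):
            acc[start + i] += weights[i] * value
            norm[start + i] += weights[i]
    # Samples not covered by any window are left at 0.0.
    return [a / n if n else 0.0 for a, n in zip(acc, norm)]


signal = [float(i % 5) for i in range(16)]
windows = split_into_windows(signal, window=8, hop=4)
# No enhancement is applied here, so blending should reproduce the input.
restored = blend_windows(windows, len(signal), window=8)
```

In a real pipeline each window would pass through an enhancement step before blending; the identity case is used here only to show that the weighting is normalized.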

Regarding claim 2 Hijazi teaches The audio processing apparatus of claim 1, wherein the first audio transformation model comprises a neural network (0040: speech enhancement network may be a learned system, using deep neural network structures; 0086: the selected audio enhancement network may include a neural network).

Regarding claim 3 Hijazi teaches The audio processing apparatus of claim 1, wherein the instructions are operable, when executed by the at least one processor, to further cause the audio processing apparatus to: 
prior to inputting the one or more audio segments to the first audio transformation model, extract the one or more words spoken by the first speaking entity from the one or more audio segments and convert the one or more words into a computer-readable format ([0057] The one or more microphones 316 are configured to capture the audio data. The audio data may include one or more audio tracks that are synchronized with one or more visual tracks captured by the one or more image sensors 314. The one or more microphones 316 may convert sound waves incident on the one or more microphones 316 into one or more digital audio recordings. The one or more microphones 316 may detect sound and convey information constituting an audio signal as electrical signals (e.g., analog or digital signals). In some implementations, the one or more microphones 316 include an analog-to-digital converter.).

Regarding claim 4 Balasubramaniam teaches The audio processing apparatus of claim 1, wherein the first audio transformation model is configured to generate the first speech vector representation based on one or more speech primitives associated with the one or more words spoken by the first speaking entity and the one or more respective contextual attributes associated with the one or more words (Fig 3; Col 10 l. 15-30: ASR; Col 12 l. 45-62: speech analyzer may generate a feature vector for the frame using these statistics, such as a feature vector that includes a separate dimension for each of the computed statistics; 
Col 13 l. 17-47: At block 316, the speech analyzer 116 or some other module or component of the dynamic contact management system 100 can generate speech feature data using the feature vectors generated above.).
Rejected for similar rationale and reasoning as claim 1
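The Balasubramaniam passage cited for claim 4 refers to a per-frame feature vector with a separate dimension for each computed statistic. A rough sketch of that idea follows. The excerpt does not say which statistics are used, so the choice of mean, standard deviation, minimum, and maximum is an assumption, as are the function names and the frame size.

```python
import statistics


def frame_feature_vector(frame):
    """Return one dimension per statistic computed over the frame."""
    # The four statistics below are illustrative, not from the reference.
    return [
        statistics.fmean(frame),    # average level
        statistics.pstdev(frame),   # spread around the mean
        min(frame),
        max(frame),
    ]


def frames(samples, size):
    """Split samples into consecutive, non-overlapping frames."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


samples = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
vectors = [frame_feature_vector(f) for f in frames(samples, size=4)]
```

Each entry of `vectors` could then be passed to a classifier of the kind the reference mentions (for example a k-means or SVM trained model), which is outside the scope of this sketch.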

Regarding claim 5 Balasubramaniam teaches The audio processing apparatus of claim 1, wherein the one or more respective contextual attributes comprise at least one of one or more acoustic features, one or more emotive qualities, or one or more speech delivery characteristics associated with the one or more words spoken by the first speaking entity 
(Fig 3; Col 10 l. 15-30: ASR, acoustic features; Col 12 l. 45-62: speech analyzer may generate a feature vector for the frame using these statistics, such as a feature vector that includes a separate dimension for each of the computed statistics; 
Col 13 l. 17-47: At block 316, the speech analyzer 116 or some other module or component of the dynamic contact management system 100 can generate speech feature data using the feature vectors generated above; particular emotions; speech characteristics).
Rejected for similar rationale and reasoning as claim 1

Regarding claim 6 Balasubramaniam teaches The audio processing apparatus of claim 5, wherein the one or more acoustic features comprise at least one of pitch, articulation, volume, or intensity (Col 13 l 33-34: speech characteristics…pitch, volume).
Rejected for similar rationale and reasoning as claim 1

Regarding claim 7 Balasubramaniam teaches The audio processing apparatus of claim 5, wherein the one or more speech delivery characteristics comprise at least one of pause duration, pace, or speech rate (Col 13 l 33-34: speech characteristics, such as speaking rate).
Rejected for similar rationale and reasoning as claim 1

Regarding claim 8 Balasubramaniam teaches The audio processing apparatus of claim 5, wherein the one or more emotive qualities are characterized by one or more respective emotional dimensions, wherein the one or more respective emotional dimensions comprise at least one of valence, activation, or dominance (col 13 l. 48-50: the model may differentiate between a set of emotions, such as anger, boredom, disgust, anxiety, happiness, sadness, and neutral.).
Rejected for similar rationale and reasoning as claim 1

Regarding claim 9 Hijazi teaches The audio processing apparatus of claim 1, wherein the first speech vector representation and the first voice vector representation are configured for regeneration of the one or more first speech signals associated with the first speaking entity by a second audio processing apparatus
([0090] The process 500 includes storing, playing, or transmitting 560 an enhanced audio signal based on the enhanced audio segment. For example, determining the enhanced audio signal may include blending the sequence of respective enhanced audio segments, including combining (e.g., using weighted averaging) corresponding samples from multiple segments in overlapping portions of the enhanced audio segments. In some implementations, where the audio data is extracted from AV data, the enhanced audio signal is merged (e.g., synchronized and encoded) with a visual track from the AV data and stored or transmitted 560 as part of enhanced AV data; 
0092: the enhanced audio signal may be transmitted 560 to an external device (e.g., the media server 410 or a personal computing device) for display or storage.).

Regarding claim 10 Hijazi teaches The audio processing apparatus of claim 1, wherein the instructions are further operable to cause the audio processing apparatus to: 
transmit, in near real-time, the audio signals captured by the one or more audio capture devices to a second audio processing apparatus for outputting of the audio signals by the second audio processing apparatus ([0090] The process 500 includes storing, playing, or transmitting 560 an enhanced audio signal based on the enhanced audio segment.; 0092: the enhanced audio signal may be transmitted 560 to an external device (e.g., the media server 410 or a personal computing device) for display or storage).

Regarding claim 11 Hijazi does not specifically teach where Balasubramaniam teaches The audio processing apparatus of claim 10, wherein the audio signals are configured for generating, at the second audio processing apparatus, a live text transcript representative of the one or more first speech signals associated with the first speaking entity (col 8 l. 2-5: may also include a contact transcript section that provides a transcript of the utterances made by the customer and/or the agent; col 10 l. 18-20).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate a transcript for recognition and additional representation of the speech signal.

Regarding claim 12 Hijazi teaches The audio processing apparatus of claim 11, wherein the {live text transcript is configured for} regeneration of the one or more first speech signals associated with the first speaking entity at the second audio processing apparatus, and wherein the regeneration of the one or more first speech signals is executed at the second audio processing apparatus simultaneously with the outputting of the audio signals by the second audio processing apparatus (fig 1; [0090] The process 500 includes storing, playing, or transmitting 560 an enhanced audio signal based on the enhanced audio segment.; 0092: the enhanced audio signal may be transmitted 560 to an external device (e.g., the media server 410 or a personal computing device) for display or storage),
And does not specifically teach where Balasubramaniam teaches the live text transcript (col 8 l. 2-5: may also include a contact transcript section that provides a transcript of the utterances made by the customer and/or the agent; col 10 l. 18-20).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate a transcript for recognition and additional representation of the speech/regenerated speech signal.

Regarding claim 13 Hijazi teaches The audio processing apparatus of claim 1, wherein the one or more audio segments correspond to a predetermined duration of time (0035: windows of samples of an appropriate duration; window duration).

Regarding claim 15 Hijazi teaches The audio processing apparatus of claim 1, wherein the instructions are further operable to cause the audio processing apparatus to: 
compress at least one of the first speech vector representation or the first voice vector representation (31: encoded, compression algorithm; 0045; 131).

Regarding claim 16 Hijazi teaches The audio processing apparatus of claim 1, wherein the instructions are further operable to cause the audio processing apparatus to: 
capture, via one or more video or image capture devices, one or more portions of image data, wherein the one or more portions of image data are associated with the first speaking entity (0033: extract an audio track from a video; 0053: the system includes a video capture device…that includes a processing apparatus that is configured to receive images).

Regarding claim 17 Hijazi teaches The audio processing apparatus of claim 16, wherein the first voice vector representation associated with the first speaking entity is generated based on capturing the one or more portions of image data associated with the first speaking entity (0086: speaker is identified by voice and facial recognition).

Regarding claim 18 Hijazi teaches The audio processing apparatus of claim 1, wherein the first audio transformation model is further configured to generate one or more portions of audio localization data associated with the first speaking entity, wherein the one or more portions of audio localization data comprises an estimated location of the first speaking entity relative to the audio processing apparatus
(0030: For example, application of an environment specific model for enhancement of audio from such environment may provide an advantage given the training and testing conditions are well matched. For example, training of an audio enhancement network may be achieved with less training examples by using data augmentation based on a collection of room acoustic models. Some implementations may provide speaker identification and/or customization for particular users.; 49-50;
0068: In some implementations, the selected audio enhancement network includes a neural network that has been trained using audio signals that have been augmented using room acoustic models.).

Regarding claim 19 Hijazi teaches The audio processing apparatus of claim 18, wherein the first voice vector representation associated with the first speaking entity is generated based on the audio localization data (0068: input the window of audio samples to a speaker identification network to obtain a speaker identification, in which the speaker identification network includes a neural network that has been trained using speech signals from one or more users to recognize speech signals as uttered by one of the one or more users; and select, based on the speaker identification, the audio enhancement network from the set of multiple audio enhancement networks, in which the selected audio enhancement network includes a speaker-specific embedding associated with the speaker identification.  In some implementations, the selected audio enhancement network includes a neural network that has been trained using audio signals that have been augmented using room acoustic models).

Regarding claim 20 Hijazi teaches The audio processing apparatus of claim 1, wherein the instructions are further operable to cause the audio processing apparatus to: 
receive a second voice vector representation (53; 57; 68; 86); 
{receive a second speech vector representation;} 
input the second voice vector representation {and the second speech vector representation} into a second audio transformation model (fig 1; 68; 86); 
generate, based on the second voice vector representation, {the second speech vector representation,} and model output generated by the second audio transformation model, one or more second speech signals (fig 1; 0090); and 
output the one or more second speech signals in a second speaking voice associated with a second speaking entity related to the second voice vector representation (fig 1; 0090; 86: the selected audio enhancement network may include a neural network that has been trained using speech signals of a person associated with the speaker identification).
Hijazi does not specifically teach, but Balasubramaniam teaches, a second speech vector representation.
Claim 20 is rejected for rationale and reasoning similar to claim 1, where the voice information is processed based on the specific speaker and the processing can be performed for multiple speakers.


9.	Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Hijazi et al (2020/0075033) in view of Balasubramaniam et al (10,896,428) in further view of Streit (2021/0141896).

Regarding claim 14, Hijazi and Balasubramaniam do not specifically teach, but Streit teaches, the audio processing apparatus of claim 1, wherein the instructions are further operable to cause the audio processing apparatus to:
encrypt both the first voice vector representation and the first speech vector representation (0067: generate encrypted embeddings or encrypted feature vectors (voice feature vectors)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate encryption to further secure the specific data. 

Conclusion
10.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
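The reply-period rule in the preceding paragraph can be sketched as date arithmetic. The following is a minimal illustration, not a docketing tool: the helper names `add_months` and `reply_deadlines` are hypothetical, the mailing date of Mar 30, 2026 is taken from the prosecution timeline on this page, and the reply and advisory-action dates in the usage note are invented for illustration. The sketch assumes calendar-month arithmetic and does not adjust for weekends or federal holidays.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return the date `months` calendar months after `d`,
    clamping the day to the last day of the target month."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_deadlines(
    mailing_date: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> dict:
    """Hypothetical helper: key dates for replying to a final action.

    Assumption: no weekend or holiday adjustment is applied.
    """
    two_month = add_months(mailing_date, 2)
    shortened = add_months(mailing_date, 3)
    statutory_max = add_months(mailing_date, 6)

    # If the first reply was filed within two months and the advisory
    # action is mailed after the three-month period, the shortened
    # period instead expires on the advisory action's mailing date.
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= two_month
        and advisory_mailed > shortened
    ):
        shortened = advisory_mailed

    # The period for reply never runs past six months from mailing.
    shortened = min(shortened, statutory_max)

    return {
        "two_month_date": two_month,
        "shortened_period_ends": shortened,
        "statutory_max": statutory_max,
    }
```

For example, `reply_deadlines(date(2026, 3, 30))` yields a two-month date of May 30, 2026, a shortened period ending June 30, 2026, and a six-month limit of September 30, 2026. With an illustrative first reply on May 15, 2026 and an advisory action mailed July 10, 2026, the shortened period would end on July 10, 2026 instead.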

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.  To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center.  Unpublished application information in Patent Center is available to registered users.  To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/Primary Examiner, Art Unit 2655
Prosecution Timeline

May 03, 2024: Application Filed
Dec 05, 2025: Non-Final Rejection mailed (§103)
Mar 04, 2026: Applicant Interview (Telephonic)
Mar 04, 2026: Examiner Interview Summary
Mar 05, 2026: Response Filed
Mar 30, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/575,883, Patent 12639534: WEBTOON CONTENT MULTILINGUAL TRANSLATION METHOD (2y 3m to grant; granted May 26, 2026)
18/667,219, Patent 12626705: APPARATUS AND METHOD FOR MAPPING EMERGENCY CALL DATA MANUAL (1y 12m to grant; granted May 12, 2026)
18/566,268, Patent 12621616: METHOD OF OPERATING A HEARING AID SYSTEM AND A HEARING AID SYSTEM USING SPEECH FORECASTING (2y 5m to grant; granted May 05, 2026)
18/274,775, Patent 12609133: SCENE ESTIMATE METHOD, SCENE ESTIMATE APPARATUS, AND PROGRAM (2y 8m to grant; granted Apr 21, 2026)
18/312,688, Patent 12586599: AUDIO SIGNAL PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM WITH MACHINE LEARNING AND FOR MICROPHONE MUTE STATE FEATURES IN A MULTI PERSON VOICE CALL (2y 10m to grant; granted Mar 24, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview (+10.5%): 86%
Median Time to Grant: 2y 11m (~10m remaining)
PTA Risk: Moderate
Based on 652 resolved cases by this examiner. Grant probability derived from career allowance rate.