Last updated: May 29, 2026

Application No. 18/471,876

SYSTEMS AND METHODS FOR EFFICIENT SPEECH REPRESENTATION

Non-Final OA §103

Filed: Sep 21, 2023

Priority: Sep 23, 2022 (provisional 63/376,820)

Examiner: COLUCCI, MICHAEL C

Art Unit: 2655

Tech Center: 2600 (Communications)

Assignee: JPMorgan Chase Bank, N.A.

OA Round: 3 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (76%) and a +15.2% interview lift. A written response may suffice.

Based on 999 resolved cases, 2023–2026

Examiner Intelligence

COLUCCI, MICHAEL C

Grants 76% — above average

Career Allowance Rate

758 granted / 999 resolved

+13.9% vs TC avg

Strong +15% interview lift

Interview Lift: +15.2% (resolved cases with interview versus without)

Typical timeline

3y 1m

Avg Prosecution

32 currently pending

Career history

1033

Total Applications

across all art units

Statute-Specific Performance

§101: 3.6% (-36.4% vs TC avg)

§103: 86.9% (+46.9% vs TC avg)

§102: 2.9% (-37.1% vs TC avg)

§112: 1.1% (-38.9% vs TC avg)

Based on career data from 999 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Response to Arguments
Applicant's arguments with respect to claims 1-20 have been considered but are moot in view of the new ground(s) of rejection. Applicant’s arguments are directed to the amended subject matter; new prior art is provided below.


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-8, 11-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over US 20230033768 A1 Chong; Dading et al. (hereinafter Chong) in view of US 20210319266 A1 Chen; Ting et al. (hereinafter Chen) and further in view of US 20190355366 A1 NG; Raymond W.M. et al. (hereinafter NG).
Re claim 1, Chong teaches
1. A method for efficient speech representation, comprising: (fig. 4 & fig. 5)
receiving, by a speech representation learning computer program, …audio training data comprising raw audio data; (raw audio data expressly taught 0088-0090 and 0021…fig. 4 & 5 receiving audio input in a speech recognition context 0092)
training, by the speech representation learning computer program, a teacher model using the raw audio data to generate target outputs; (using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092… raw audio data expressly taught 0088-0090 and 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5…)
adding, by the speech representation learning computer program, audio distortion to the raw audio data; (using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092… raw audio data expressly taught 0088-0090 and 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5…)
training, by the speech representation learning computer program, a student model with the raw audio data and audio distortion to mimic the target outputs; (the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092… raw audio data expressly taught 0088-0090 and 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5…)
injecting, by the speech representation learning computer program, a known audio feature into a layer of the student model, wherein workers guide the layer of the student model in learning the known audio feature; (neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)
…comprising features from the teacher model and features from the known audio features (using raw audio for instance 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5…)



However, while Chong teaches neural network layers for speech recognition or audio concepts and labels as in element 420, as well as the task of ASR in itself, it fails to teach heads per se as follows:
Unlabeled audio training data… (Chen unlabeled training data can be audio 0119)
providing, by the speech representation learning computer program, a neural network head to the student model; (Chen heads, in the context of audio content training 0112, applicable in a teacher model 0135 fig. 13, for student models 0007-0008, using fig. 2a-2b with 0127-0129 heads for NN and specific task, 0139 per application, utilizing CNN layers 0123)
training, by the speech representation learning computer program, the neural network head with a labeled dataset …; and (Chen heads, in the context of audio content training 0112, applicable in a teacher model 0135 fig. 13, for student models labeled 0007-0008, using fig. 2a-2b with 0127-0129 heads for NN and specific task, 0139 per application, utilizing CNN layers 0123)
deploying, by the speech representation learning computer program, the student model with the neural network head to an application for the specific task. (Chen 0127-0129 heads for NN and specific task, 0139 per application… heads, in the context of audio content training 0112, applicable in a teacher model 0135 fig. 13, for student models labeled 0007-0008, using fig. 2a-2b with 0127-0129 heads for NN and specific task, 0139 per application, utilizing CNN layers 0123)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong to incorporate the above claim limitations as taught by Chen to allow for use of a known technique of neural network heads to improve similar methods of teacher-student model training with labeled outputs in the same way, wherein the system of Chong is improved with pretrained projection head neural network layers and is fine-tuned with a small number or fraction of labeled data, and then distillation training is performed based on reusing the unlabeled pretraining data to distill the network to a student network that performs one or more specialized tasks, thus learning improves accuracy and computational efficiency.

However, while the combination teaches frame operations and teacher-student model groups on an end-to-end operation, in lieu of official notice on the operations end-to-end teacher model layer and intermediate outputs, the combination thus fails to teach:
wherein the target outputs comprise embeddings at a frame level; (NG taking raw microphone data into a teacher model as in fig. 5, the teacher model operating at a frame-level and outputting embedding vectors to a student model 0048-0049 and 0056)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong in view of Chen to incorporate the above claim limitations as taught by NG to allow for combining prior art elements according to known methods to yield predictable results such as for enabling the student to learn high-fidelity speaker characteristics at a granular level rather than just the final utterance-level decision, improving the teacher model of the combination and end-to-end concepts such applicable to knowledge distillation for creating compact, lightweight, and efficient models for real-time or resource-constrained devices.

Re claim 15, this claim has been rejected for teaching a broader, or narrower claim based on general inclusion of hardware alone (e.g. processor, memory, instructions), representation of claim 1 omitting/including hardware for instance, otherwise amounting to a virtually identical scope.
See fig. 1-2 of Chong.

	
Re claims 4, 11, and 18, Chong teaches
4. The method of claim 1, further comprising: calculating, by the speech representation learning computer program, a worker loss for the student model, wherein the worker loss represents a difference between an output of the student model and the target outputs. (loss thereof with element 440 as a worker analogous to a layer per se 0075… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)
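
As an illustration only of the claim 4 limitation (not taken from Chong or from the application's specification), a worker loss of this kind could be sketched as a mean squared error between frame-level student outputs and the teacher's target outputs. The function name `worker_loss`, the choice of mean squared error, and the toy vectors are all assumptions; the claim itself only requires that the loss represent "a difference".

```python
from typing import Sequence


def worker_loss(student_out: Sequence[Sequence[float]],
                target_out: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between student outputs and teacher targets.

    Each argument is a list of frames; each frame is a list of floats.
    MSE is an assumed choice of "difference", not one named in the source.
    """
    if len(student_out) != len(target_out):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_out, target_out):
        if len(s_frame) != len(t_frame):
            raise ValueError("frame dimensions differ")
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical toy data: two frames of two-dimensional embeddings.
teacher_targets = [[1.0, 0.0], [0.0, 1.0]]
student_outputs = [[0.5, 0.0], [0.0, 1.0]]
loss = worker_loss(student_outputs, teacher_targets)
```

A loss of zero would mean the student reproduces the target outputs exactly on these frames.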

Re claims 5, 12, and 19, Chong teaches
5. The method of claim 1, wherein the specific task comprises speech recognition, speaker verification, keyword spotting, emotion recognition, and/or speech separation. (as in fig. 4 & 5 receiving audio input in a speech recognition context 0092… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)

Re claims 6 and 13, Chong teaches
6. The method of claim 1, wherein the known audio feature comprises Mel Frequency Cepstral Coefficients (MFCC), gammatone, Log power spectrum (LPS), FBank, and/or prosody. (0021, 0029, 0088… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)

Re claims 7 and 14, Chong teaches
7. The method of claim 1, wherein the audio distortion comprises additive noise, reverberation, and/or clipping. (0083 with fig. 4… features 0021, 0029, 0088… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)
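
For orientation, the three distortion types recited in claim 7 can be sketched on a raw waveform held as a list of samples. This is a minimal sketch under stated assumptions, not the applicant's or Chong's implementation: the function names, the Gaussian noise model, the symmetric clipping limit, and the short impulse response used for reverberation are all hypothetical.

```python
import random
from typing import List, Sequence


def add_noise(x: Sequence[float], scale: float, seed: int = 0) -> List[float]:
    """Additive noise: add zero-mean Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in x]


def clip(x: Sequence[float], limit: float) -> List[float]:
    """Clipping: saturate every sample to the range [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in x]


def reverberate(x: Sequence[float], impulse: Sequence[float]) -> List[float]:
    """Reverberation: convolve with an impulse response, truncated to len(x)."""
    out = [0.0] * (len(x) + len(impulse) - 1)
    for i, v in enumerate(x):
        for j, h in enumerate(impulse):
            out[i + j] += v * h
    return out[:len(x)]


# Hypothetical raw audio samples and distortion settings.
raw = [0.0, 0.8, -1.5, 0.3]
noisy = add_noise(raw, scale=0.05)
clipped = clip(raw, limit=1.0)
reverbed = reverberate(raw, impulse=[1.0, 0.5])
```

In the claimed arrangement the student model would see the distorted signal while the teacher's target outputs come from the undistorted raw audio.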

Re claim 8, Chong teaches
8. A system, comprising: (fig. 1-2)
a data source…audio training data comprising raw audio data; (raw audio data expressly taught 0088-0090 and 0021…fig. 4 and 5)
a downstream system executing a specific task; and (downstream as in the flow from teacher to student per se fig. 4 and 5)
an electronic device executing a speech representation learning computer program that is configured to receive the raw audio data from the data source, train a teacher model using the raw audio data to generate target outputs; add audio distortion to the raw audio data, train a student model with the raw audio data and raw audio distortion to mimic the target outputs, inject a known audio feature into a layer of the student model, wherein workers guide the layer of the student model in learning the known audio feature… (raw audio data expressly taught 0088-0090 and 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)
…comprising features from the teacher model and features from the known audio features (using raw audio for instance 0021 with 0074 where audio features element 506 and teacher data/features element 514 are used to produce labeled outputs for loss calculation element 516 and model updates using an iterative routine as in fig. 5…)

However, while Chong teaches neural network layers for speech recognition or audio concepts and labels as in element 420, as well as the task of ASR in itself, it fails to teach heads per se as follows:
Unlabeled audio training data… (Chen unlabeled training data can be audio 0119)
…provide a neural network head to the student model, train the neural network head with a labeled dataset …, and deploy the student model with the neural network head to the downstream system. (Chen 0127-0129 heads for NN and specific task, 0139 per application… heads, in the context of audio content training 0112, applicable in a teacher model 0135 fig. 13, for student models labeled 0007-0008, using fig. 2a-2b with 0127-0129 heads for NN and specific task, 0139 per application, utilizing CNN layers 0123)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong to incorporate the above claim limitations as taught by Chen to allow for use of a known technique of neural network heads to improve similar methods of teacher-student model training with labeled outputs in the same way, wherein the system of Chong is improved with pretrained projection head neural network layers and is fine-tuned with a small number or fraction of labeled data, and then distillation training is performed based on reusing the unlabeled pretraining data to distill the network to a student network that performs one or more specialized tasks, thus learning improves accuracy and computational efficiency.

However, while the combination teaches frame operations and teacher-student model groups on an end-to-end operation, in lieu of official notice on the operations end-to-end teacher model layer and intermediate outputs, the combination thus fails to teach:
wherein the target outputs comprise embeddings at a frame level; (NG taking raw microphone data into a teacher model as in fig. 5, the teacher model operating at a frame-level and outputting embedding vectors to a student model 0048-0049 and 0056)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong in view of Chen to incorporate the above claim limitations as taught by NG to allow for combining prior art elements according to known methods to yield predictable results such as for enabling the student to learn high-fidelity speaker characteristics at a granular level rather than just the final utterance-level decision, improving the teacher model of the combination and end-to-end concepts such applicable to knowledge distillation for creating compact, lightweight, and efficient models for real-time or resource-constrained devices.


Re claim 20, Chong teaches
20. The non-transitory computer readable storage medium of claim 15, wherein the known audio feature comprises Mel Frequency Cepstral Coefficients (MFCC), gammatone, Log power spectrum (LPS), FBank, and/or prosody, and the audio distortion comprises additive noise, reverberation, and/or clipping. (0083 with fig. 4… features 0021, 0029, 0088… neural network based audio + noise teacher and student model training overall featuring labels, layers, and frequency features, with element 440 as a worker analogous to a layer per se 0075, the result is element 430 for instance using the output of the student model with noisy + audio data, and using audio as well as inserted noise to train teacher and student models 0021-0029 such as perturbing the signal 0006, and as in fig. 4 & 5 receiving audio input in a speech recognition context 0092)


Claims 2, 3, 9, 10, 16, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over US 20230033768 A1 Chong; Dading et al. (hereinafter Chong) in view of US 20210319266 A1 Chen; Ting et al. (hereinafter Chen) and further in view of US 20190355366 A1 NG; Raymond W.M. et al. (hereinafter NG) and further in view of US 20220180202 A1 Yin; Yichun et al. (hereinafter Yin).
Re claims 2, 9, and 16, the combination while teaching neural networks and student-teacher models with layers, fails to teach:
2. The method of claim 1, wherein the teacher model and the student model each comprise a plurality of convolutional encoder layers and a plurality of transformer layers. (Yin CNN layers 0121 and transformer layers 0161 with fig. 5, and reduction of M to N layers for student models)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong in view of Chen and NG to incorporate the above claim limitations as taught by Yin to allow for the use of a known technique such as CNN layers and also transformer layers in attention mechanisms or heads to improve similar neural network teacher-student methods in the same way, thereby improving the combination's speech recognition or audio handling, wherein the student model can learn to imitate the output data of the intermediate layer and the output layer of the teacher model, so that the student model more accurately learns semantic representation of the teacher model to implement effective knowledge transfer, thereby improving accuracy of a text or audio processing result of the student model, with the capability to operate with a smaller number of parameters and achieve a higher quality output.

Re claims 3, 10, and 17, the combination while teaching neural networks and student-teacher models with layers, fails to teach:
3. The method of claim 2, wherein the student model is initiated with weights from the teacher model and keeps fewer transformer layers than the teacher model. (Yin CNN layers 0121 and transformer layers 0161 with fig. 5, and reduction of M to N layers for student models)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chong in view of Chen and NG to incorporate the above claim limitations as taught by Yin to allow for the use of a known technique such as CNN layers and also transformer layers in attention mechanisms or heads to improve similar neural network teacher-student methods in the same way, thereby improving the combination's speech recognition or audio handling, wherein the student model can learn to imitate the output data of the intermediate layer and the output layer of the teacher model, so that the student model more accurately learns semantic representation of the teacher model to implement effective knowledge transfer, thereby improving accuracy of a text or audio processing result of the student model, with the capability to operate with a smaller number of parameters and achieve a higher quality output.
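
To make the claim 3 limitation concrete, the following is a minimal sketch of a student that starts from the teacher's weights and keeps fewer transformer layers. It is an assumption-laden illustration, not Yin's M-to-N layer reduction and not the applicant's method: the dictionary layout, the name `init_student_from_teacher`, and the choice to keep the first layers rather than some other subset are all hypothetical.

```python
import copy
from typing import Dict, List


def init_student_from_teacher(teacher: Dict[str, List[List[float]]],
                              keep_transformer_layers: int
                              ) -> Dict[str, List[List[float]]]:
    """Copy the teacher's weights, keeping only some transformer layers.

    Keeping the first `keep_transformer_layers` layers is an assumed policy.
    """
    total = len(teacher["transformer_layers"])
    if not 0 < keep_transformer_layers < total:
        raise ValueError("student must keep fewer transformer layers than the teacher")
    return {
        # Convolutional encoder weights are copied unchanged.
        "conv_layers": copy.deepcopy(teacher["conv_layers"]),
        # Only a prefix of the transformer stack is retained.
        "transformer_layers": copy.deepcopy(
            teacher["transformer_layers"][:keep_transformer_layers]),
    }


# Hypothetical toy weights: each layer is a flat list of floats.
teacher = {
    "conv_layers": [[0.1, 0.2], [0.3, 0.4]],
    "transformer_layers": [[1.0], [2.0], [3.0], [4.0]],
}
student = init_student_from_teacher(teacher, keep_transformer_layers=2)
```

The deep copy keeps later student updates from modifying the teacher's weights.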


Conclusion

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 03/24/2026 has been entered.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

US 20220051105 A1	Fukuda; Takashi et al.
Teacher student models

US 20200334538 A1	MENG; Zhong et al.
Teacher student models with ground truth

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI whose telephone number is (571)270-1847.  The examiner can normally be reached on M-F 9 AM - 7 PM. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571)-270-1847
Examiner FAX: (571)-270-2847
Michael.Colucci@uspto.gov


Prosecution Timeline

Sep 21, 2023: Application Filed

Aug 29, 2025: Non-Final Rejection mailed (§103)

Nov 25, 2025: Response Filed

Jan 15, 2026: Final Rejection mailed (§103)

Mar 11, 2026: Response after Non-Final Action

Mar 24, 2026: Request for Continued Examination

Mar 26, 2026: Response after Non-Final Action

Apr 29, 2026: Non-Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/422,681

Patent 12640144

Generating Synthetic Conference Transcripts Using Natural Language Processing

2y 4m to grant Granted May 26, 2026

18/401,171

Patent 12633286

MACHINE LEARNING MODEL IMPROVEMENT

2y 4m to grant Granted May 19, 2026

18/352,601

Patent 12626697

SYSTEM AND METHOD FOR KEYWORD FALSE ALARM REDUCTION

2y 10m to grant Granted May 12, 2026

19/225,487

Patent 12620262

USING ARTIFICIAL ENTITIES FOR GENERATING PERSONALIZED RESPONSES

11m to grant Granted May 05, 2026

18/515,502

Patent 12592240

ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT

2y 4m to grant Granted Mar 31, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4

Grant Probability: 76%

With Interview (+15.2%): 91%

Median Time to Grant: 3y 1m (~5m remaining)

PTA Risk: High

Based on 999 resolved cases by this examiner. Grant probability derived from career allowance rate.
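
The projection figures above appear to follow from simple arithmetic on the examiner's career counts. The sketch below reproduces them under one assumption that the page does not state outright: the interview lift is treated as additive percentage points on top of the career allowance rate. Variable names are illustrative.

```python
# Career counts reported for the examiner.
granted = 758
resolved = 999

# Reported interview lift, assumed here to be additive percentage points.
interview_lift_points = 15.2

# Career allowance rate as a percentage.
allowance_rate = 100.0 * granted / resolved

# Projected grant probability with an interview, capped at 100%.
with_interview = min(100.0, allowance_rate + interview_lift_points)

print(f"Grant probability: {allowance_rate:.0f}%")
print(f"With interview: {with_interview:.0f}%")
```

Under that assumption the rounded values agree with the 76% and 91% shown in the projections.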