Prosecution Insights
Last updated: April 19, 2026
Application No. 18/350,464

Augmentation of Audiographic Images for Improved Machine Learning

Latest event: Final Rejection (§103; nonstatutory double patenting)
Filed: Jul 11, 2023
Examiner: BLANKENAGEL, BRYAN S
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 6 (Final)

Predictions: 67% grant probability (favorable); 7-8 OA rounds expected; approximately 2y 7m to grant; 99% grant probability with an examiner interview.

Examiner Intelligence

Career allowance rate: 67% (254 granted of 377 resolved, i.e. 254/377 ≈ 67.4%), above average at +5.4% vs. the Tech Center average.
Interview lift: +35.2% higher allowance rate among resolved cases with an examiner interview than without, a strong effect.
Typical timeline: 2y 7m average prosecution, with 23 applications currently pending.
Career history: 400 total applications across all art units.

Statute-Specific Performance

§101: 25.6% (-14.4% vs. TC avg)
§103: 49.3% (+9.3% vs. TC avg)
§102: 13.3% (-26.7% vs. TC avg)
§112: 6.5% (-33.5% vs. TC avg)

Tech Center averages are estimates; figures are based on career data from 377 resolved cases.

Office Action

Rejections at issue: §103 obviousness and nonstatutory double patenting.
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 12/30/2025 have been fully considered but they are not persuasive.

Regarding arguments on pages 15-16 of the Remarks, Examiner agrees that the tangent vectors of Denker do not represent the actual pattern to be recognized. However, such is not required by the claims. Applicant argues that Denker's tangent vectors are not taught to be input into any network configured to perform the image recognition task. While the section of Denker cited on page 16 of the arguments does not appear to teach the tangent vectors being input to a network, other sections do. Examiner notes that Figs. 9 and 10 teach inputting an original image and a tangent vector into the network functions. Denker col. 9 lines 65-67 teaches that the modules of Fig. 10 are incorporated by neural network chips, and col. 10 lines 56-65 teaches inputting alphanumeric symbols into the networks. Further, col. 11 line 38 - col. 12 line 2 teaches training neural networks directly from tangent vectors. Therefore, the limitations are taught by the combination of references.

Regarding arguments on pages 16-17 of the Remarks, Examiner notes that the citations for Figs. 6 and 7 appear to refer to Denker's cited prior art, not to the invention of Denker. Denker col. 6 lines 57-59 is the beginning of the paragraph that refers to Figs. 6 and 7. In contrast, the paragraph at col. 7 lines 5-16 refers to Fig. 8, and shows both training data U and tangent training data T being input to the learning machine.

Regarding modifying Zhou, Examiner notes that Zhou teaches receiving an input spectrogram at a neural network. The principles of Denker are then applied to optimize the model of Zhou to be more robust to variance in the input data. Therefore, the use of Denker's tangent vectors for training the neural network of Zhou would not change the principle of operation, because the tangent vectors would be used for training rather than for transcribing.

Regarding arguments on page 17 of the Remarks, in response to Applicant's argument that the Examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
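For orientation, the "training neural networks directly from tangent vectors" the Examiner attributes to Denker is a tangent-propagation-style scheme: the network is penalized when its output changes along the direction of a small invariance-preserving transformation of the input. Below is a minimal, hedged sketch of that idea in PyTorch, not Denker's actual circuit or the applicant's method; all names (model, tangent, lam) are illustrative, and a finite difference stands in for the exact Jacobian-vector product.

    import torch

    def tangent_prop_loss(model, x, y, tangent, task_loss_fn, lam=0.1, eps=1e-3):
        # Ordinary task loss on the unperturbed input.
        preds = model(x)
        task = task_loss_fn(preds, y)
        # Finite-difference estimate of the directional derivative of the
        # network output along the tangent direction; penalizing it pushes
        # the model toward invariance along that transformation.
        preds_shifted = model(x + eps * tangent)
        penalty = ((preds_shifted - preds) / eps).pow(2).mean()
        return task + lam * penalty

Used this way, the tangent vector shapes training only; at inference the model transcribes as usual, which is the Examiner's point that the principle of operation of Zhou is unchanged.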
Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 21, 26-36, 38, and 40-47 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-10, 12-15, and 17-18 of U.S. Patent No. 11,138,471 in view of Zhou et al. (US 2019/0130897 A1), hereinafter referred to as Zhou. The prior art contained a device (method, product, etc.) which differed from the claimed device by the substitution of some components (spectrogram) with other components (a plurality of audiographic images corresponding to a plurality of times of an audio signal); the substituted components and their functions were known in the art; one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable.
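To make the augmentation operations mapped in the chart below concrete, here is a hedged NumPy sketch of frequency masking, time masking, and a crude piecewise time warp over a spectrogram of shape (n_freq_bins, n_time_steps). The parameter names (freq_mask_param, time_mask_param, warp_param) are illustrative stand-ins for the claimed "mask parameter" and "distance" hyperparameters, not the applicant's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def frequency_mask(spec, freq_mask_param):
        # Claims 30-33: draw a band width f from [0, freq_mask_param], pick a
        # random start bin f0, and set the band to the image mean value.
        spec = spec.copy()
        f = int(rng.integers(0, min(freq_mask_param, spec.shape[0]) + 1))
        f0 = int(rng.integers(0, spec.shape[0] - f + 1))
        spec[f0:f0 + f, :] = spec.mean()
        return spec

    def time_mask(spec, time_mask_param):
        # Claims 35-36: the same masking operation along the time axis.
        spec = spec.copy()
        t = int(rng.integers(0, min(time_mask_param, spec.shape[1]) + 1))
        t0 = int(rng.integers(0, spec.shape[1] - t + 1))
        spec[:, t0:t0 + t] = spec.mean()
        return spec

    def time_warp(spec, warp_param):
        # Claims 26-29: keep spatial dimensions fixed while shifting a randomly
        # selected time point by a random distance, remapping the time indices
        # piecewise linearly on either side of that point.
        n_t = spec.shape[1]
        if n_t <= 2 * warp_param:
            return spec.copy()
        center = int(rng.integers(warp_param, n_t - warp_param))
        shift = int(rng.integers(-warp_param, warp_param + 1))
        src = np.concatenate([
            np.linspace(0, center + shift, center, endpoint=False),
            np.linspace(center + shift, n_t - 1, n_t - center),
        ])
        idx = np.rint(np.clip(src, 0, n_t - 1)).astype(int)
        return spec[:, idx]

The output of each function has the same shape as its input, consistent with claim 27's "fixing spatial dimensions"; drawing the widths from [0, param] mirrors the claimed "distribution extending from zero to a ... mask parameter."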
Claim chart (instant application vs. U.S. Patent No. 11,138,471; "Instant" is the instant claim language, "'471" the patented claim language, with Zhou where cited):

Instant claim 21: A computer-implemented method to train a model to perform speech recognition, the method comprising:
'471 claim 1: A computer-implemented method to generate augmented training data, the method comprising:

Instant: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances;
'471: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 26: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time.
'471 claim 1: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time;

Instant claim 27: The computer-implemented method of claim 26, wherein performing the time warping operation comprises fixing spatial dimensions of the at least one audiographic image and warping the image content of the at least one audiographic image to shift a point within the image content a distance along the axis representative of time.
'471 claim 2: The computer-implemented method of claim 1, wherein performing the time warping operation comprises fixing spatial dimensions of the at least one audiographic image and warping the image content of the at least one audiographic image to shift a point within the image content a distance along the axis representative of time.

Instant claim 28: The computer-implemented method of claim 27, wherein the distance comprises a user-specified hyperparameter or a learned value.
'471 claim 3: The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.

Instant claim 29: The computer-implemented method of claim 27, wherein the point within the image content is randomly selected.
'471 claim 4: The computer-implemented method of claim 2, wherein the point within the image content is randomly selected.

Instant claim 30: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image.
'471 claim 9: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;

Instant claim 31: The computer-implemented method of claim 30, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.
'471 claim 10: The computer-implemented method of claim 9, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.

Instant claim 32: The computer-implemented method of claim 31, wherein the frequency mask parameter comprises a user-specified hyperparameter or a learned value.
'471 claim 3: The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.

Instant claim 33: The computer-implemented method of claim 30, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the at least one audiographic image.
'471 claim 12: The computer-implemented method of claim 9, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the at least one audiographic image.

Instant claim 34: The computer-implemented method of claim 30, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.
'471 claim 13: The computer-implemented method of claim 9, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.

Instant claim 35: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image.
'471 claim 14: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image;

Instant claim 36: The computer-implemented method of claim 35, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.
'471 claim 15: The computer-implemented method of claim 14, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.

Instant claim 38: A computer system configured to perform operations, the operations comprising:
'471 claim 8: A computing system comprising: one or more processors; a controller model; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

Instant: obtaining, by the computer system, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances
'471 claim 9: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: inputting, by the computer system, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 40: One or more non-transitory computer-readable media that store instructions that, when executed by a computer system, cause the computer system to perform operations, the operations comprising:
'471 claim 8: A computing system comprising: one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

Instant: obtaining, by the computer system, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances
'471 claim 9: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: inputting, by the computer system, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating, by the computer system, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying, by the computer system, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 41: A computer-implemented method, the method comprising: providing, by one or more computing devices, audio data as input to a machine-learned audio processing model to generate one or more predictions, the machine-learned audio processing model trained based on training operations comprising:
Zhou: Fig. 2, para [0043], where the audio wave is input to the model.

Instant: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances;
'471 claim 9: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function; and
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant: receiving, by the one or more computing devices, an output based on the predictions generated by the machine-learned audio processing model.
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant claim 42: The computer-implemented method of claim 41, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;
'471 claim 9: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;

Instant: performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or
'471 claim 1: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time;

Instant: performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image.
'471 claim 14: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image;

Instant claim 43: The computer-implemented method of claim 41, wherein the output comprises: output audio data encoding one or more output human speech utterances representing a transformation of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model; or text data representing a textual transcription of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model.
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant claim 44: A computer system configured to perform operations, the operations comprising: providing, by the computer system, audio data as input to a machine-learned audio processing model to generate one or more predictions, the machine-learned audio processing model trained based on training operations comprising:
Zhou: Fig. 2, para [0043], where the audio wave is input to the model.

Instant: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances;
'471 claim 9: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function; and
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant: receiving, by the computer system, an output based on the predictions generated by the machine-learned audio processing model.
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant claim 45: The computer system of claim 44, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;
'471 claim 9: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;

Instant: performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or
'471 claim 1: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time;

Instant: performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image.
'471 claim 14: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image;

Instant claim 46: One or more non-transitory computer-readable media that store a machine-learned audio processing model configured to perform speech recognition, the machine-learned audio processing model trained based on training operations comprising:
'471 claim 8: A computing system comprising: one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

Instant: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances;
'471 claim 9: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'471: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image;
'471 claim 8: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'471: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
'471: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images;
Zhou: para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'471: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and

Instant: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'471: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 47: The one or more non-transitory computer-readable media of claim 46, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;
'471 claim 9: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image;

Instant: performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or
'471 claim 1: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time;

Instant: performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image.
'471 claim 14: wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image;
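Across all of the mapped limitations above, claim 21's training flow (inputting augmented images, receiving predictions, evaluating an objective function, modifying parameters) is an ordinary supervised update. A minimal PyTorch sketch of that loop, where every name (model, optimizer, objective, augment) is an illustrative placeholder rather than anything recited in the claims:

    import torch

    def train_step(model, optimizer, objective, spec_batch, targets, augment):
        # "performing ... augmentation operations" (e.g., the masking ops above)
        augmented = augment(spec_batch)
        # "inputting ... into a machine-learned audio processing model" and
        # "receiving ... one or more predictions"
        predictions = model(augmented)
        # "evaluating ... an objective function that scores the ... predictions"
        loss = objective(predictions, targets)
        # "modifying ... respective values of one or more parameters ... based
        # on the objective function"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()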
Claims 21, 26-36, 38, and 40-47 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-8 and 21 of U.S. Patent No. 11,816,577 in view of Zhou et al. (US 2019/0130897 A1), hereinafter referred to as Zhou. The prior art contained a device (method, product, etc.) which differed from the claimed device by the substitution of some components (spectrogram) with other components (a plurality of audiographic images corresponding to a plurality of times of an audio signal); the substituted components and their functions were known in the art; one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable.

Claim chart (instant application vs. U.S. Patent No. 11,816,577; "Instant" is the instant claim language, "'577" the patented claim language, with Zhou where cited):

Instant claim 21: A computer-implemented method to train a model to perform speech recognition, the method comprising:
'577 claim 1: A computer-implemented method to generate augmented training data, the method comprising:

Instant: obtaining, by one or more computing devices, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances;
'577: obtaining, by one or more computing devices, a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'577: generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image;
'577 claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: inputting, by the one or more computing devices, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'577: inputting, by the one or more computing devices, the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'577: evaluating, by the one or more computing devices, an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and

Instant: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'577: modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 26: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time.
'577 claim 1: wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time

Instant claim 27: The computer-implemented method of claim 26, wherein performing the time warping operation comprises fixing spatial dimensions of the at least one audiographic image and warping the image content of the at least one audiographic image to shift a point within the image content a distance along the axis representative of time.
'577 claim 2: The computer-implemented method of claim 1, wherein performing the time warping operation comprises fixing spatial dimensions of the audiographic image and warping the image content of the audiographic image to shift a point within the image content a distance along the axis representative of time.

Instant claim 28: The computer-implemented method of claim 27, wherein the distance comprises a user-specified hyperparameter or a learned value.
'577 claim 3: The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.

Instant claim 29: The computer-implemented method of claim 27, wherein the point within the image content is randomly selected.
'577 claim 4: The computer-implemented method of claim 2, wherein the point within the image content is randomly selected.

Instant claim 30: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image.
'577 claim 1: wherein the one or more augmentation operations includes: a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image,

Instant claim 31: The computer-implemented method of claim 30, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.
'577 claim 5: The computer-implemented method of claim 1, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.

Instant claim 32: The computer-implemented method of claim 31, wherein the frequency mask parameter comprises a user-specified hyperparameter or a learned value.
'577 claim 3: The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.

Instant claim 33: The computer-implemented method of claim 30, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the at least one audiographic image.
'577 claim 6: The computer-implemented method of claim 1, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the audiographic image.

Instant claim 34: The computer-implemented method of claim 30, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.
'577 claim 7: The computer-implemented method of claim 1, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.

Instant claim 35: The computer-implemented method of claim 21, wherein performing, by the one or more computing devices, the one or more augmentation operations comprises performing, by the one or more computing devices, a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image.
'577 claim 1: wherein the one or more augmentation operations includes: a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image;

Instant claim 36: The computer-implemented method of claim 35, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.
'577 claim 8: The computer-implemented method of claim 1, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.

Instant claim 38: A computer system configured to perform operations, the operations comprising:
'577 claim 21: One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising:

Instant: obtaining, by the computer system, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances
'577: obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'577: generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image;
'577 claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

Instant: inputting, by the computer system, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition;
'577: inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions;
Zhou: para [0045], [0052], where the spectrograms are input to an RNN that performs speech recognition.

Instant: receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances;
Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions.

Instant: evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and
'577: evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and

Instant: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
'577: modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

Instant claim 40: One or more non-transitory computer-readable media that store instructions that, when executed by a computer system, cause the computer system to perform operations, the operations comprising:
'577 claim 21: One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising:

Instant: obtaining, by the computer system, one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances
'577: obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal;
Zhou: para [0037], where spectrograms are computed from speech signals.

Instant: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together;
'577: generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image;
Claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image; inputting, by the computer system, the one or more augmented images into a machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition; inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; Zhou: para [0045], [0052], where the spectrograms are input to an RNN, that performs speech recognition receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances; Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions evaluating, by the computer system, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying, by the computer system, respective values of one or more parameters of the machine-learned audio processing model based on the objective function. modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function. Claim 41: A computer-implemented method, the method comprising: providing, by one or more computing devices, audio data as input to a machine-learned audio processing model to generate one or more predictions, the machine-learned audio processing model trained based on training operations comprising: Zhou Fig. 
2, para [0043], where the audio wave is input to the model obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances; obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; Zhou: para [0037], where spectrograms are computed from speech signals generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition; inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; Zhou: para [0045], [0052], where the spectrograms are input to an RNN, that performs speech recognition receiving one or more predictions respectively generated by the machine- learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances; Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions evaluating an objective 
function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying respective values of one or more parameters of the machine- learned audio processing model based on the objective function; and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function. receiving, by the one or more computing devices, an output based on the predictions generated by the machine-learned audio processing model. Zhou para [0062], where the model outputs recognized text transcriptions Claim 42: The computer-implemented method of claim 41, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image; Claim 1: wherein the one or more augmentation operations includes: a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or Claim 1: wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image. Claim 1: wherein the one or more augmentation operations includes: a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim 43: The computer-implemented method of claim 41, wherein the output comprises: output audio data encoding one or more output human speech utterances representing a transformation of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model; or text data representing a textual transcription of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model. Zhou para [0062], where the model outputs recognized text transcriptions Claim 44: A computer system configured to perform operations, the operations comprising: providing, by the computer system, audio data as input to a machine-learned audio processing model to generate one or more predictions, the machine-learned audio processing model trained based on training operations comprising: Zhou Fig. 
2, para [0043], where the audio wave is input to the model obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances; obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; Zhou: para [0037], where spectrograms are computed from speech signals generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition; inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; Zhou: para [0045], [0052], where the spectrograms are input to an RNN, that performs speech recognition receiving one or more predictions respectively generated by the machine- learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances; Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions evaluating an objective 
function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying respective values of one or more parameters of the machine- learned audio processing model based on the objective function; and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function. receiving, by the computer system, an output based on the predictions generated by the machine-learned audio processing model. Zhou para [0062], where the model outputs recognized text transcriptions Claim 45: The computer system of claim 44, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image; Claim 1: wherein the one or more augmentation operations includes: a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or Claim 1: wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image. 
Claim 1: wherein the one or more augmentation operations includes: a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim 46: One or more non-transitory computer-readable media that store a machine-learned audio processing model configured to perform speech recognition, the machine- learned audio processing model trained based on training operations comprising: Claim 21: One or more non-transitory computer-readable media that collective store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances; obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; Zhou: para [0037], where spectrograms are computed from speech signals generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim 1: a time warping operation that comprises warping image content of the audiographic image along an axis representative of time; a frequency masking operation that comprises changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image; and a time masking operation that comprises changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises: inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition; inputting the plurality of 
augmented images into a machine-learned audio processing model to generate one or more predictions; Zhou: para [0045], [0052], where the spectrograms are input to an RNN, that performs speech recognition receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances; Zhou: para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model; and evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function. modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function. Claim 47: The one or more non-transitory computer-readable media of claim 46, wherein: performing the one or more augmentation operations comprises performing a frequency masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the frequency masking operation comprises changing pixel values for image content associated with a certain subset of frequencies represented by the at least one audiographic image; Claim 1: wherein the one or more augmentation operations includes: a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, performing the one or more augmentation operations comprises performing a time warping operation on at least one audiographic image of the one or more audiographic images, wherein performing the time warping operation comprises warping image content of the at least one audiographic image along an axis representative of time; or Claim 1: wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time performing the one or more augmentation operations comprises performing a time masking operation on at least one audiographic image of the one or more audiographic images, wherein performing the time masking operation comprises changing pixel values for image content associated with a certain subset of a time steps represented by the at least one audiographic image. Claim 1: wherein the one or more augmentation operations includes: a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21, 37-38, 40-41, 43-44, and 46 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (US 2019/0130897 A1), hereinafter referred to as Zhou, in view of Denker et al. (US 5,572,628 A), hereinafter referred to as Denker.

Regarding claim 21, Zhou teaches: A computer-implemented method to train a model to perform speech recognition, the method comprising: obtaining, by one or more computing devices, training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where spectrograms are computed from speech signals); inputting, by the one or more computing devices, the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition); receiving, by the one or more computing devices, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions); evaluating, by the one or more computing devices, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045]-[0047], where the error and gradients calculated by the objective function are fed back to the RNN for training).
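Before turning to what Zhou does not teach, readers outside the art may find it useful to see the mapped training flow in code: a spectrogram is input to a recurrent model, predictions are received, an objective function scores them against reference transcriptions, and parameter values are modified from the resulting gradients. The following is a minimal sketch only; the architecture, tensor shapes, and the CTC objective are illustrative assumptions, not details taken from Zhou or from the claims.

```python
# Hypothetical sketch of the mapped training loop; not Zhou's actual model.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    """Toy stand-in for an RNN speech recognizer: spectrogram in, token scores out."""
    def __init__(self, n_mels=80, hidden=128, n_tokens=29):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, spec):                      # spec: (batch, time, n_mels)
        h, _ = self.rnn(spec)
        return self.out(h).log_softmax(dim=-1)    # per-frame log-probabilities

model = SpeechRNN()
objective = nn.CTCLoss(blank=0)                   # assumed objective function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

spec = torch.randn(4, 200, 80)                    # batch of (augmented) spectrograms
targets = torch.randint(1, 29, (4, 30))           # reference transcriptions (token ids)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

log_probs = model(spec)                           # "receiving ... predictions"
loss = objective(log_probs.permute(1, 0, 2),      # CTC expects (time, batch, tokens)
                 targets, input_lengths, target_lengths)  # "evaluating an objective function"
optimizer.zero_grad()
loss.backward()
optimizer.step()                                  # "modifying ... parameter values"
```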
Zhou does not teach: generating, by the one or more computing devices, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating, by the one or more computing devices, the variance in the training data comprises: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training, by the one or more computing devices, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating, by the one or more computing devices, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), wherein generating, by the one or more computing devices, the variance in the training data comprises: performing, by the one or more computing devices, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting); training, by the one or more computing devices, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col.
11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).

Regarding claim 37, Zhou in view of Denker teaches: The computer-implemented method of claim 21, wherein the one or more audiographic images comprise: one or more spectrograms (Zhou para [0037], where a spectrogram is computed from speech signals).

Regarding claim 38, Zhou teaches: A computer system configured to perform operations, the operations comprising: obtaining, by the computer system, training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where a spectrogram is computed from speech signals); inputting, by the computer system, the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition); receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions); evaluating, by the computer system, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and modifying, by the computer system, respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045]-[0047], where the error and gradients calculated by the objective function are fed back to the RNN for training).
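Because both the claims and Zhou treat spectrograms as the operative "audiographic images" (see claim 37 above), a minimal sketch of how such an image can be computed from a speech waveform follows. The window, hop, and sampling values are assumptions chosen for illustration, not parameters taken from Zhou para [0037].

```python
# Illustrative only: computing a log-magnitude spectrogram ("audiographic image")
# from a waveform. All parameter values are assumed, not drawn from Zhou.
import numpy as np

def log_spectrogram(wave, n_fft=512, hop=160):
    """Return an image of shape (time frames, frequencies)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)        # frequency content per frame
    return np.log(np.abs(spectrum) + 1e-8)        # log magnitude as pixel values

wave = np.random.randn(16000)                     # stand-in for 1 s of 16 kHz speech
image = log_spectrogram(wave)                     # axes: time steps x frequencies
```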
Zhou does not teach: generating, by the computer system, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating, by the computer system, the variance in the training data comprises: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training, by the computer system, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating, by the computer system, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating, by the computer system (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), the variance in the training data comprises: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting); training, by the computer system, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col. 11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).
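The frequency- and time-masking operations recited in the examined claims (and charted above) are simple to state in code. The sketch below is a hedged illustration under assumed parameter values: a band of frequencies or a span of time steps is selected, its width is drawn from a distribution extending from zero to a mask parameter, the affected pixel values are set to the image mean, and an upper bound on the masked ratio can be enforced. The function and parameter names are hypothetical, not taken from the application.

```python
# Hedged sketch of frequency masking and time masking on a spectrogram image.
import numpy as np

rng = np.random.default_rng(0)

def frequency_mask(image, freq_mask_param=27):
    """image: (time, freq). Replace a random band of frequencies with the mean."""
    t, f = image.shape
    width = rng.integers(0, freq_mask_param + 1)      # distance in [0, mask param]
    start = rng.integers(0, max(f - width, 1))        # first frequency of the band
    out = image.copy()
    out[:, start:start + width] = image.mean()        # mean-value masking
    return out

def time_mask(image, time_mask_param=100, max_ratio=0.2):
    """Mask a random span of time steps, bounded to a fraction of all steps."""
    t, f = image.shape
    width = min(rng.integers(0, time_mask_param + 1), int(max_ratio * t))
    start = rng.integers(0, max(t - width, 1))        # first time step of the span
    out = image.copy()
    out[start:start + width, :] = image.mean()
    return out

augmented = time_mask(frequency_mask(np.random.randn(200, 80)))
```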
Regarding claim 40, Zhou teaches: One or more non-transitory computer-readable media that store instructions that, when executed by a computer system, cause the computer system to perform operations (para [0040], where a computer readable medium is used), the operations comprising: obtaining, by the computer system, training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where a spectrogram is computed from speech signals); inputting, by the computer system, the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition); receiving, by the computer system, one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions); evaluating, by the computer system, an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and modifying, by the computer system, respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045]-[0047], where the error and gradients calculated by the objective function are fed back to the RNN for training).

Zhou does not teach: generating, by the computer system, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating, by the computer system, the variance in the training data comprises: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training, by the computer system, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating, by the computer system, variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating, by the computer system (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), the variance in the training data comprises: performing, by the computer system, one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting); training, by the computer system, a machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col. 11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).

Regarding claim 41, Zhou teaches: A computer-implemented method, the method comprising: providing, by one or more computing devices, audio data as input to a machine-learned audio processing model to generate one or more predictions (Fig.
2, para [0043], where the audio wave is input to the model), the machine-learned audio processing model trained based on training operations comprising: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where a spectrogram is computed from speech signals); inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition); receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions); evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045]-[0047], where the error and gradients calculated by the objective function are fed back to the RNN for training); and receiving, by the one or more computing devices, an output based on the predictions generated by the machine-learned audio processing model (para [0062], where the model outputs recognized text transcriptions).

Zhou does not teach: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting); training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col. 11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).

Regarding claim 43, Zhou in view of Denker teaches: The computer-implemented method of claim 41, wherein the output comprises: output audio data encoding one or more output human speech utterances representing a transformation of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model (where another limitation is chosen); or text data representing a textual transcription of one or more input human speech utterances encoded by the audio data provided as input to the machine-learned audio processing model (para [0062], where the model outputs recognized text transcriptions).

Regarding claim 44, Zhou teaches: A computer system configured to perform operations, the operations comprising: providing, by the computer system, audio data as input to a machine-learned audio processing model to generate one or more predictions (Fig.
2, para [0043], where the audio wave is input to the model), the machine-learned audio processing model trained based on training operations comprising: obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where a spectrogram is computed from speech signals); inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition); receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions); evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045]-[0047], where the error and gradients calculated by the objective function are fed back to the RNN for training); and receiving, by the computer system, an output based on the predictions generated by the machine-learned audio processing model (para [0062], where the model outputs recognized text transcriptions).

Zhou does not teach: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting); training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col. 11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).
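The alternative augmentation operations recited in these claims (adding noise, masking randomly selected pixels, translating, rotating) are likewise straightforward pixel-level transforms. The following is a hedged sketch under assumed magnitudes and probabilities; scipy.ndimage supplies the geometric transforms, and nothing here is taken from Denker's actual implementation.

```python
# Hypothetical sketch of the claim's alternative augmentation operations.
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)

def augment_once(image):
    """Apply one randomly chosen pixel-level augmentation to a spectrogram image."""
    ops = [
        lambda im: im + 0.1 * rng.standard_normal(im.shape),          # add noise
        lambda im: np.where(rng.random(im.shape) < 0.05, 0.0, im),    # mask random pixels
        lambda im: shift(im, (0, rng.integers(-5, 6))),               # translate
        lambda im: rotate(im, angle=2.0, reshape=False),              # rotate slightly
    ]
    return ops[rng.integers(0, len(ops))](image)

augmented = augment_once(np.random.randn(200, 80))
```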
Regarding claim 46, Zhou teaches: One or more non-transitory computer-readable media that store a machine-learned audio processing model configured to perform speech recognition (para [0040], where a computer readable medium is used), the machine-learned audio processing model trained based on training operations comprising:

obtaining training data comprising one or more audiographic images that respectively visually represent one or more audio signals, wherein the audio signals encode one or more human speech utterances (para [0037], where a spectrogram is computed from speech signals);

inputting the one or more augmented images into the machine-learned audio processing model, wherein the machine-learned audio processing model is configured to perform speech recognition (para [0045], [0052], where the spectrograms are input to an RNN, which is updated by feedback and performs speech recognition);

receiving one or more predictions respectively generated by the machine-learned audio processing model based on the one or more augmented images, wherein the one or more predictions comprise textual transcriptions of the one or more human speech utterances (para [0045], where the RNN generates predictions including softmax probabilities, and para [0062], where the model outputs recognized text transcriptions);

evaluating an objective function that scores the one or more predictions respectively generated by the machine-learned audio processing model (para [0045], where a maximum likelihood objective function calculates error and gradients); and

modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function (para [0045-47], where the error and gradients calculated by the objective function are fed back to the RNN for training).
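The training operations mapped to Zhou above follow a conventional supervised loop: predict, score with an objective function, back-propagate, update parameters. A minimal sketch, assuming an invented recurrent model and fake data rather than Zhou's actual architecture:

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=80, hidden_size=128)    # stand-in recurrent model
head = nn.Linear(128, 29)                       # 29 hypothetical output tokens
objective = nn.CrossEntropyLoss()               # stand-in objective function
optimizer = torch.optim.SGD(
    list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

augmented = torch.randn(400, 1, 80)             # (time, batch, freq_bins)
targets = torch.randint(0, 29, (400,))          # fake per-frame labels

hidden, _ = rnn(augmented)                      # run the recurrent model
logits = head(hidden.squeeze(1))                # "predictions" per frame
loss = objective(logits, targets)               # evaluate the objective
loss.backward()                                 # gradients of the objective
optimizer.step()                                # modify parameter values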
Zhou does not teach: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images, wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together; training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data, wherein the training comprises:

Denker teaches: generating variance in the training data based on one or more augmentation operations applied to the one or more audiographic images (col. 11 lines 12-27, where the spectrograms are augmented such as by changing pitch or performing time shifting), wherein generating the variance in the training data comprises: performing the one or more augmentation operations on each of the one or more audiographic images to generate one or more augmented images, wherein performing the one or more augmentation operations on each of the one or more audiographic images comprises changing one or more pixel values of each of the one or more audiographic images to generate the one or more augmented images, and wherein performing the one or more augmentation operations comprises performing one or more of the following on each audiographic image: warping image content of the audiographic image, masking randomly selected pixels of the audiographic image, adding noise to the pixel values of the audiographic image, rotating some or all of the audiographic image, translating some or all of the audiographic image, or averaging two or more audiographic images together (col. 11 lines 12-27, col. 6 lines 11-56, where rotations or translations are performed on the image and pixels are altered, and where the spectrograms are augmented such as by changing pitch or performing time shifting);

training the machine-learned audio processing model to perform speech recognition with invariance to the generated variance in the training data (col. 11 lines 12-27, where the network is trained to be invariant to the transformations), wherein the training comprises:

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou by using the invariance training of Denker (Denker col. 11 lines 12-27) on the spectrogram inputs of Zhou (Zhou para [0037]), in order to make the network output invariant to speech-related transformations (Denker col. 11 lines 12-27).

Claims 22-24 and 39 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou, in view of Denker, and further in view of Le Roux et al. (US 2019/0318725 A1), hereinafter referred to as Le Roux.

Regarding claim 22, Zhou in view of Denker teaches: The computer-implemented method of claim 21.

Zhou in view of Denker does not teach: wherein the machine-learned audio processing model comprises an encoder model and a decoder model.

Le Roux teaches: wherein the machine-learned audio processing model comprises an encoder model and a decoder model (Figs. 14, 16, para [0118], where the architecture includes an encoder network and a decoder network).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou in view of Denker by using the architecture of Le Roux (Le Roux Figs. 14, 16, para [0118]) in the model of Zhou in view of Denker (Zhou para [0045], [0052]) to perform end-to-end learning for speech recognition, without explicit separation of the underlying speech signals (Le Roux para [0007]).

Regarding claim 23, Zhou in view of Denker teaches: The computer-implemented method of claim 21.

Zhou in view of Denker does not teach: wherein the machine-learned audio processing model is configured to generate a series of attention outputs.

Le Roux teaches: wherein the machine-learned audio processing model is configured to generate a series of attention outputs (Figs. 14, 16, para [0118], where the network uses attention).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou in view of Denker by using the architecture of Le Roux (Le Roux Figs. 14, 16, para [0118]) in the model of Zhou in view of Denker (Zhou para [0045], [0052]) to perform end-to-end learning for speech recognition, without explicit separation of the underlying speech signals (Le Roux para [0007]).

Regarding claim 24, Zhou in view of Denker teaches: The computer-implemented method of claim 21.

Zhou in view of Denker does not teach: wherein the machine-learned audio processing model comprises a sequence to sequence model.

Le Roux teaches: wherein the machine-learned audio processing model comprises a sequence to sequence model (Figs. 14, 16, para [0118], where the model converts an input sequence to an output sequence).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou in view of Denker by using the architecture of Le Roux (Le Roux Figs. 14, 16, para [0118]) in the model of Zhou in view of Denker (Zhou para [0045], [0052]) to perform end-to-end learning for speech recognition, without explicit separation of the underlying speech signals (Le Roux para [0007]).

Regarding claim 39, Zhou in view of Denker teaches: The computer system of claim 38.

Zhou in view of Denker does not teach: wherein the machine-learned audio processing model comprises an encoder model and a decoder model, and wherein the machine-learned audio processing model is configured to generate a series of attention outputs, and wherein the machine-learned audio processing model comprises a sequence to sequence model.

Le Roux teaches: wherein the machine-learned audio processing model comprises an encoder model and a decoder model, and wherein the machine-learned audio processing model is configured to generate a series of attention outputs, and wherein the machine-learned audio processing model comprises a sequence to sequence model (Figs. 14, 16, para [0118], where the architecture includes an encoder network and a decoder network that converts an input sequence to an output sequence using attention).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou in view of Denker by using the architecture of Le Roux (Le Roux Figs. 14, 16, para [0118]) in the model of Zhou in view of Denker (Zhou para [0045], [0052]) to perform end-to-end learning for speech recognition, without explicit separation of the underlying speech signals (Le Roux para [0007]).
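Claims 22-24 and 39 turn on a standard attention-based encoder/decoder (sequence to sequence) arrangement. A minimal sketch of that arrangement in general, with invented sizes and a fixed-length decode; it is not Le Roux's network:

import torch
import torch.nn as nn

encoder = nn.GRU(input_size=80, hidden_size=128, batch_first=True)
decoder_cell = nn.GRUCell(input_size=128, hidden_size=128)
out_proj = nn.Linear(256, 29)        # attention context + state -> tokens

spec = torch.randn(1, 400, 80)       # (batch, time, freq_bins)
enc_out, h = encoder(spec)           # encoder states: (1, 400, 128)

state = h[-1]                        # (1, 128) initial decoder state
tokens = []
for _ in range(50):                  # fixed-length decode, for brevity
    # Dot-product attention over encoder states -> one attention output.
    scores = torch.bmm(enc_out, state.unsqueeze(2)).squeeze(2)     # (1, 400)
    weights = torch.softmax(scores, dim=1)
    context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (1, 128)
    state = decoder_cell(context, state)
    tokens.append(out_proj(torch.cat([context, state], dim=1)).argmax(dim=1))

Each loop iteration emits one attention output and one token, so the decoder produces the "series of attention outputs" recited in claims 23 and 39.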
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Zhou, in view of Denker, and further in view of Zhou et al. (US 2019/0130896 A1), hereinafter referred to as Zhou2.

Regarding claim 25, Zhou in view of Denker teaches: The computer-implemented method of claim 21.

Zhou in view of Denker does not teach: wherein a parameter value for at least one of the one or more augmentation operations is randomly selected for a respective audiographic image of the one or more audiographic images.

Zhou2 teaches: wherein a parameter value for at least one of the one or more augmentation operations is randomly selected for a respective audiographic image of the one or more audiographic images (para [0058], where pseudo-random noise is added to the speech sample).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Zhou in view of Denker by including the noise augmenter of Zhou2 (Zhou2 Fig. 1 element 124) in the augmentation operations of Zhou in view of Denker (Denker col. 11 lines 12-27, col. 6 lines 11-56), in order to increase the variation in generation of a synthetic training data set (Zhou2 para [0005-6], [0009]).

Allowable Subject Matter

Claims 26-36, 42, 45, and 47 would be allowable if rewritten to overcome the Double Patenting rejection(s) set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

The following is a statement of reasons for the indication of allowable subject matter: The closest prior art of Zhou, Li, Seltzer, and Le Roux teaches the limitations of the independent claims. However, Zhou, Li, Seltzer, and Le Roux do not teach the claimed methods of augmentation defined in claims 22, 26, and 30. While Li does teach time and frequency masking of spectrograms, no mention is made of changing pixel values. Hence, none of the cited prior art, either alone or in combination, teaches the combination of limitations found in the claims.
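For reference, time and frequency masking of a spectrogram, expressed as operations on an array of pixel values; a generic sketch with invented mask widths, not Li's method, offered only to make the recited "pixel values" language concrete:

import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((80, 400))       # (freq_bins, time_steps) pixel array

f0 = rng.integers(0, 80 - 8)       # random frequency-band start
spec[f0:f0 + 8, :] = 0.0           # frequency mask: zero a band of pixels

t0 = rng.integers(0, 400 - 20)     # random time-window start
spec[:, t0:t0 + 20] = 0.0          # time mask: zero a window of pixels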
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 2023/0309915 A1 para [0158-159] teaches training a network using a set of spectrogram images that is enlarged by applying signal augmentation techniques to the original audio signal.

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRYAN S BLANKENAGEL, whose telephone number is (571) 270-0685. The examiner can normally be reached 8:00am-5:30pm.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BRYAN S BLANKENAGEL/
Primary Examiner, Art Unit 2658

Prosecution Timeline

Jul 11, 2023
Application Filed
Apr 16, 2024
Non-Final Rejection — §103, §DP
Jul 19, 2024
Response Filed
Aug 06, 2024
Final Rejection — §103, §DP
Nov 11, 2024
Request for Continued Examination
Nov 14, 2024
Response after Non-Final Action
Feb 12, 2025
Non-Final Rejection — §103, §DP
May 08, 2025
Applicant Interview (Telephonic)
May 08, 2025
Examiner Interview Summary
May 16, 2025
Response Filed
May 22, 2025
Final Rejection — §103, §DP
Jul 11, 2025
Interview Requested
Jul 15, 2025
Applicant Interview (Telephonic)
Jul 15, 2025
Examiner Interview Summary
Jul 25, 2025
Response after Non-Final Action
Aug 27, 2025
Request for Continued Examination
Aug 28, 2025
Response after Non-Final Action
Sep 25, 2025
Non-Final Rejection — §103, §DP
Dec 10, 2025
Applicant Interview (Telephonic)
Dec 10, 2025
Examiner Interview Summary
Dec 30, 2025
Response Filed
Feb 12, 2026
Final Rejection — §103, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602551
GENERATION OF SYNTHETIC DOCUMENTS FOR DATA AUGMENTATION
2y 5m to grant Granted Apr 14, 2026
Patent 12579993
Multi-Talker Audio Stream Separation, Transcription and Diarization
2y 5m to grant Granted Mar 17, 2026
Patent 12572759
MULTILINGUAL CONVERSATION TOOL
2y 5m to grant Granted Mar 10, 2026
Patent 12555591
MACHINE LEARNING ASSISTED SPATIAL NOISE ESTIMATION AND SUPPRESSION
2y 5m to grant Granted Feb 17, 2026
Patent 12547836
KNOWLEDGE FACT RETRIEVAL THROUGH NATURAL LANGUAGE PROCESSING
2y 5m to grant Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

7-8
Expected OA Rounds
67%
Grant Probability
99%
With Interview (+35.2%)
2y 7m
Median Time to Grant
High
PTA Risk
Based on 377 resolved cases by this examiner. Grant probability derived from career allow rate.
