Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending for examination. Claims 1, 11, and 19 are independent.
Response to Amendment
This Office action is responsive to the amendments filed on 11/19/2025. As directed by the amendments, claims 1-3, 11-13, and 19-20 are amended.
Response to Arguments
Applicant's arguments filed 11/19/2025 have been fully considered.
Examiner response: Applicant's arguments with respect to the claims have been considered but are moot in view of the new ground of rejection.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim 1, line 12, recites "updating non-zero integration weights for the plurality of prediction-net branches." Support for this limitation does not appear in the specification, drawings, or claims as originally filed. Based on the Examiner's review, the specification discloses integration weights but does not further limit the weights as being non-zero.
Independent claims 11 and 19 recite similar limitations and are also rejected under 35 U.S.C. 112(a).
Dependent claims 2-10, 12-18, and 20 do not cure the deficiencies of independent claims 1, 11, and 19 and are therefore also rejected under 35 U.S.C. 112(a).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4-5, 7-12, 14-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US 2024/0265924 A1, hereinafter "Li") in view of Gururangan et al. ("DEMIX Layers: Disentangling Domains for Modular Language Modeling," hereinafter "Suchin") and Yang et al. ("Multi-Task Language Modeling for Improving Speech Recognition of Rare Words," hereinafter "Yang").
Regarding Claim 1
Li discloses: A computer-implemented method for training a neural transducer for speech recognition, the method comprising: ([0032, 0039, 0061], a configurable multilingual model that is trained (undergoes a single training process) and the universal automatic speech recognition and language-specific modules (configurable multilingual models) are able to perform speech recognition using the training data)
initializing the neural transducer having a prediction network, an encoder network and a joint network ([0061, 0065, FIG. 2A], in an example embodiment, user input comprises LID information that is configured as a null value, i.e., the language identification vector is null. In other example embodiments, the LID information's language identification vector can be configured/initialized to be a one-hot or multi-hot vector. The configurable multilingual model can be configured to be an RNN-T, and the RNN-T model comprises a prediction network, an encoder network and a joint network);
expanding the prediction network by changing the prediction network to a plurality of prediction-net branches (a language-specific layer for distinguishing between different languages is used in the prediction network for increased performance; the computing system compiles the universal automatic speech recognition module with the plurality of language-specific automatic speech recognition modules to generate a configurable multilingual model that can recognize speech of all the different languages);
training, by a hardware processor, an entirety of the neural transducer by using training data sets for all of the plurality of specific sub-tasks ([0007, 0039, 0062], the configurable multilingual model is trained by the training data; the training data comprises multiple datasets for different models/purposes (e.g., training a model that recognizes speech in one specific language and a language-independent model that can recognize speech from multiple languages)); and
fusing the plurality of prediction-net branches ([0059, 0096], the configurable multilingual model comprises a compilation of the universal ASR module and the plurality of language-specific ASR modules).
Li does not specifically disclose expanding the prediction network by changing the prediction network to a plurality of prediction-net branches by copying weights from the prediction network to the plurality of prediction-net branches.
However, Suchin teaches expanding the prediction network by changing the prediction network to a plurality of prediction-net branches by copying weights from the prediction network to the plurality of prediction-net branches (page 10, "6. Adaptive Pretraining with New", right column, lines 4-21, and figure 5; the cited section discloses how a new expert or network is added to expand the network to perform a different task, by copying an existing expert with all of its network parameters and then retraining the new expert on the new specific task).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to implement the teaching of Suchin into the teaching of Li to expand the network by copying weights into branches of the expanded network. The modification would have been obvious because one of ordinary skill in the art would have been motivated to utilize the feature of Suchin to adapt to new tasks by copying an existing branch of the network and training the new branch as needed, providing efficient new-task adaptation that improves the functionality of the model without extensive retraining.
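For illustration only (not drawn from Li or Suchin; all names and values here are hypothetical), the branch-expansion-by-weight-copying discussed above can be sketched as follows: a single prediction network's parameters are deep-copied into independent per-task branches, each of which could then be fine-tuned on its own sub-task.

```python
# Illustrative sketch: expand one prediction network into per-task branches
# by copying its weights. Hypothetical structures; not from any cited art.
import copy

def expand_prediction_net(pred_net_weights, num_branches):
    """Return independent per-branch copies of the prediction-net weights.

    Each branch starts from identical parameters; training one branch
    afterwards does not alter the others or the original network.
    """
    return [copy.deepcopy(pred_net_weights) for _ in range(num_branches)]

# Toy parameter container standing in for real network weights.
base = {"embedding": [[0.1, 0.2], [0.3, 0.4]], "lstm": [0.5, 0.6]}
branches = expand_prediction_net(base, 3)

# Mutating one branch leaves the other copies (and the source) unchanged.
branches[0]["lstm"][0] = 9.9
```

The deep copy is the key design point: a shallow copy would share the inner parameter lists, so "fine-tuning" one branch would corrupt the others.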
Li in view of Suchin does not explicitly disclose: updating non-zero integration weights for the plurality of prediction-net branches matched to the plurality of specific sub-tasks to obtain a trained neural transducer.
However, Yang discloses: fusing the plurality of prediction-net branches by updating non-zero integration weights for the plurality of prediction-net branches matched to the plurality of specific sub-tasks to obtain a trained neural transducer ([Sections 3.2-4.3] describe updating weights for sub-tasks and training the model; [Section 4.3 and Algorithm 1] disclose non-zero weights initialized to 1).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to implement the teaching of Yang into the teaching of Li in view of Suchin to update non-zero weights for predictions matched to specific sub-tasks. The modification would have been obvious because one of ordinary skill in the art would have been motivated to utilize the feature of Yang to weigh the different sub-tasks and compute an overall aggregation.
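For illustration only (a simplified, hypothetical sketch, not the algorithm of Yang or the claimed method), the kind of non-zero weighted aggregation discussed above can be expressed as a weighted sum of per-branch outputs, with every integration weight kept non-zero and initialized to 1 as in the cited algorithm.

```python
# Illustrative sketch of fusing per-branch outputs with non-zero
# integration weights. Hypothetical function; not from any cited art.

def fuse_branches(branch_outputs, weights):
    """Weighted sum of branch output vectors; every weight must be non-zero."""
    assert all(w != 0 for w in weights), "integration weights must be non-zero"
    fused = [0.0] * len(branch_outputs[0])
    for out, w in zip(branch_outputs, weights):
        fused = [f + w * o for f, o in zip(fused, out)]
    return fused

outputs = [[1.0, 2.0], [3.0, 4.0]]  # one output vector per prediction-net branch
weights = [1.0, 1.0]                # initialized to 1, per the cited algorithm
print(fuse_branches(outputs, weights))  # -> [4.0, 6.0]
```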
Regarding Claim 2
Li in view of Suchin and Yang disclose: The computer-implemented method of claim 1, wherein the fusing includes integrating a plurality of combinations of an output of the encoder network and an output of each of the plurality of prediction-net branches by using the non-zero integration weights, each of the non-zero integration weights being changed depending on each of the plurality of specific sub-tasks. ([0052, 0073-0074], Li, the configurable multilingual/user-selected model output is the weighted combination of the universal ASR module output and the outputs from all language-specific modules. The fusing refers to the joint network, which is a combination of the outputs from the encoder and prediction network and the plurality refers to each language-specific module).
Regarding claim 4, Li teaches:
wherein the integrating is performed by the joint network ([0061, 0074], the joint network is the combination of both the encoder and prediction network outputs).
Regarding claim 5, Li teaches:
wherein each specific sub-task is a sub-task for recognition of a language with a specific dialect ([0005], a plurality of language-specific ASR modules is trained (on different language-specific datasets) to recognize a variety of different languages; a dialect is considered to be a variation of a language that is either classified by its own vocabulary and grammar rules or a variety of languages that are altogether recognized as a single language).
Regarding claim 7, Li teaches:
The computer-implemented method of claim 1, wherein the neural transducer is randomly initialized ([0051,0089], the language identification (one-hot and multi-hot) vectors are randomly initialized with either a “1” or “0” during the configurable multilingual model’s training process).
Regarding claim 8, Li teaches:
comprising applying a softmax operation to an output of the joint network to obtain a softmax output for the neural transducer ([0061], the softmax operation is performed on the output of the joint network to calculate a softmax output).
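For illustration only (a toy sketch, not the joint network of Li or of the claims), the mapping above, combining an encoder output with a prediction-network output in a joint step and then applying a softmax, can be sketched as:

```python
# Illustrative sketch: a toy "joint" step (elementwise sum of encoder and
# prediction vectors) followed by a numerically stable softmax.
# Hypothetical simplification; real joint networks are typically learned layers.
import math

def joint_and_softmax(enc_out, pred_out):
    """Combine encoder and prediction outputs, then normalize to probabilities."""
    logits = [e + p for e, p in zip(enc_out, pred_out)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

probs = joint_and_softmax([1.0, 2.0], [0.5, -0.5])
# probs is a valid distribution: non-negative entries summing to 1
```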
Regarding claim 9, Li teaches:
comprising performing a speech recognition session using the trained neural transducer to recognize a user utterance ([0001, 0061-0062], the configurable multilingual model can be configured to be an RNN transducer that is trained to perform multilingual speech recognition and automatic speech recognition (ASR). ASR allows a model to recognize speech and transcribe the audio into a textual output).
Regarding claim 10, Li teaches:
wherein the neural transducer is a recurrent neural network transducer (RNN-T) ([0061], the configurable multilingual model can be configured to be an RNN-T).
Regarding claim 11,
This claim recites an article of manufacture (a computer program product) that performs the method described in claim 1. Therefore, claim 11 is rejected for the same reasons given for claim 1. The additional elements of claim 11 are addressed below:
A computer program product for training a neural transducer for speech recognition, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: (Li: [0035, 0112], includes computer-readable media that store computer-executable instructions and are defined as physical hardware storage media/devices that exclude transmission media. Also, a computing system which is configured to execute computer-readable instructions using one or more processors.)
Regarding claim 12 – rejected under the same rationale as claim 2. In addition, claim 12 depends from claim 11; therefore, the same rationale applies.
Regarding claim 14 – rejected under the same rationale as claim 4. In addition, claim 14 depends from claim 12; therefore, the same rationale applies.
Regarding claim 15 – rejected under the same rationale as claim 5. In addition, claim 15 depends from claim 11; therefore, the same rationale applies.
Regarding claim 17 – rejected under the same rationale as claim 7. In addition, claim 17 depends from claim 11; therefore, the same rationale applies.
Regarding claim 18 – rejected under the same rationale as claim 8. In addition, claim 18 depends from claim 11; therefore, the same rationale applies.
Regarding claim 19,
This claim recites an apparatus (a computer processing system) that performs the method described in claims 1 and 11. Therefore, claim 19 is rejected for the same reasons given for claim 1. The additional elements of claim 19 are addressed below with the Li reference:
A computer processing system (Li: [0034-0035, FIG. 1A], FIG. 1A illustrates the architecture and components of the computing system).
a memory device for storing program code; a hardware processor operatively coupled to the memory device for running the program code to: ([0035, 0118, FIG. 1A], program modules can be stored in both local and remote memory devices and contain computer-readable instructions).
Regarding claim 20 – rejected under the same rationale as claim 2. In addition, claim 20 depends from claim 19; therefore, the same rationale applies.
Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Suchin, Yang, and Ogawa et al. (US 20220262356 A1, hereinafter "Ogawa").
Regarding Claim 3
Li in view of Suchin and Yang disclose: The computer-implemented method of claim 2, wherein a first non-zero integration weight is used for a main one of the plurality of prediction-net branches matched to an input dialect ([0043, 0051-0052], Li describes language vectors (input for the configurable multilingual model) that represent the activation of one or more language-specific modules; non-zero values are positive weights that are activated/included in the model).
Li in view of Suchin and Yang does not explicitly disclose: with a second non-zero integration weight smaller than the first non-zero integration weight being used for non-main ones of the plurality of prediction-net branches.
However, Ogawa discloses in the same field of endeavor: wherein a first non-zero integration weight is used for a main one of the plurality of prediction-net branches, with a second non-zero integration weight smaller than the first non-zero integration weight used for non-main ones of the plurality of prediction-net branches ([Paras. 0105 and 0138] describe a main model having larger weights than other models).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to implement the teaching of Ogawa into the teaching of Li in view of Suchin and Yang to weigh main models higher than non-main models. The modification would have been obvious because one of ordinary skill in the art would have been motivated to utilize the feature of Ogawa to give a higher weight to a main model and smaller weights to non-main models.
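For illustration only (a hypothetical sketch, not the weighting scheme of Ogawa or of the claims, and the 0.7 value is an arbitrary assumption), a main/non-main weight assignment in which every branch keeps a non-zero weight can be sketched as:

```python
# Illustrative sketch: assign the main branch a larger non-zero weight and
# split the remainder evenly among the non-main branches.
# The main_w default is an arbitrary illustrative value.

def make_integration_weights(num_branches, main_idx, main_w=0.7):
    """Return normalized weights: main branch gets main_w, others share the rest."""
    rest = (1.0 - main_w) / (num_branches - 1)
    return [main_w if i == main_idx else rest for i in range(num_branches)]

weights = make_integration_weights(4, main_idx=0)
# weights[0] is 0.7; the other three branches each get a smaller,
# still non-zero share, and the weights sum to 1.
```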
Regarding claim 13 – rejected under the same rationale as claim 3. In addition, claim 13 depends from claim 12; therefore, the same rationale applies.
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Suchin and Yang, and further in view of Joshi et al. ("Transfer Learning Approaches for Streaming End-to-End Speech Recognition System," hereinafter "Joshi").
Regarding claim 6, Li teaches a neural transducer ([0061], the configurable multilingual model can be configured to be an RNN-T).
Li in view of Suchin and Yang do not specifically disclose a neural transducer initialized with a pre-trained single-dialect network as the prediction network.
However, Joshi teaches wherein the neural transducer is initialized with a pre-trained single-dialect network as the prediction network ([Abstract, 4. Transfer learning methods for RNN-T], the prediction network of the RNN-T model (of the target language) can be initialized with pre-trained models from the source language to improve the target language model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to initialize the transducer with a pre-trained network to improve the average word error rate and overall model efficiency (compared to randomly initializing the model) and to leverage the amount of training data the source language (English) has in comparison to the target language (Hindi) (Joshi: 4. Transfer learning methods for RNN-T; 6. Discussion of results).
Regarding claim 16 – rejected under the same rationale of claim 6.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ye et al. (US 20210304769 A1) describes a recurrent neural network transducer (RNN-T) with a main model and non-main models.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TEWODROS E MENGISTU whose telephone number is (571)270-7714. The examiner can normally be reached Mon-Fri 9:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ABDULLAH KAWSAR can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TEWODROS E MENGISTU/Examiner, Art Unit 2127