DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent No. 12062363. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the current application differ only by reciting single-head processing, averaging, projection, and normalization steps, all of which are routine neural network refinements that would have been obvious to a person of ordinary skill in the art in view of the claims of the patent. A comparison of representative claims is shown below.
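Application No. 18347842: claims 1, 2, and 11 (reproduced first below)
U.S. Patent No. 12062363: claims 1, 2, and 11 (reproduced after the application claims)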
Claim 1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing an utterance; at each of a plurality of time steps subsequent to an initial time step: receiving, as input to a prediction network of a recurrent neural network- transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; generating, by a single head of the prediction network, a sequence of embeddings for the sequence of non-blank symbols received as input at the corresponding time step, each corresponding embedding in the sequence of embeddings corresponds to a respective non-blank symbol in the sequence of non-blank symbols; weighting, by the single head of the prediction network, each corresponding embedding in the sequence of embeddings based on a respective position embedding assigned to the respective non-blank symbol that corresponds to the corresponding embedding; generating, as output from the single head of the prediction network, a weighted average of the sequence of weighted embeddings; generating, by a projection layer of the prediction network, a projection output for the weighted average of the sequence of weighted embeddings output from the single head of the prediction network; and normalizing the projection output for the weighted average of the sequence of weighted embeddings to provide, as output from the prediction network, a single embedding vector at the corresponding time step; and generating, by the final Softmax layer, as output, a speech recognition result for the sequence of acoustic frames based the single embedding vectors output from the prediction network at each of the plurality of time steps.
Claim 2. The computer-implemented method of claim 1, wherein the operations further comprise, at each of the plurality of time steps subsequent to the initial time step: generating, by a joint network of the RNN-T model, based the single embedding vector output from the prediction network at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step, wherein generating the speech recognition result for the sequences of acoustic is based on the probability distribution over possible speech recognition hypotheses generated at each of the plurality of time steps.
Claim 11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing an utterance; at each of a plurality of time steps subsequent to an initial time step: receiving, as input to a prediction network of a recurrent neural network-transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; generating, by a single head of the prediction network, a sequence of embeddings for the sequence of non-blank symbols received as input at the corresponding time step, each corresponding embedding in the sequence of embeddings corresponds to a respective non-blank symbol in the sequence of non-blank symbols; weighting, by the single head of the prediction network, each corresponding embedding in the sequence of embeddings based on a respective position embedding assigned to the respective non-blank symbol that corresponds to the corresponding embedding; generating, as output from the single head of the prediction network, a weighted average of the sequence of weighted embeddings; generating, by a projection layer of the prediction network, a projection output for the weighted average of the sequence of weighted embeddings output from the single head of the prediction network; and normalizing the projection output for the weighted average of the sequence of weighted embeddings to provide, as output from the prediction network, a single embedding vector at the corresponding time step; and generating, by the final Softmax layer, as output, a speech recognition result for the sequence of acoustic frames based the single embedding vectors output from the prediction network at each of the plurality of time steps.
Claim 1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: at each of a plurality of time steps subsequent to an initial time step: receiving, as input to a prediction network of a recurrent neural network- transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; and assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; generating, by the prediction network, a sequence of the embeddings each weighted proportional to a similarity between the embedding and the respective position vector; and generating, by a joint network of the RNN-T model, based on the sequence of the weighted embeddings generated by the prediction network at the corresponding time step, a probability distribution over possible speech recognition hypotheses at the corresponding time step.
Claim 2. The computer-implemented method of claim 1, wherein the operations further comprise: receiving, as input to an audio encoder of the RNN-T model, a sequence of acoustic frames; generating, by the audio encoder, at each of the plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and receiving, as input to the joint network, the higher order feature representation generated by the audio encoder at the corresponding time step.
Claim 11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: at each of a plurality of time steps subsequent to an initial time step: receiving, as input to a prediction network of a recurrent neural network-transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; and assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; generating, by the prediction network, at the corresponding time step, a sequence of the embeddings each weighted proportional to a similarity between the embedding and the respective position vector; and generating, by a joint network of the RNN-T model, based on the sequence of the weighted embeddings generated by the prediction network at the corresponding time step, a probability distribution over possible speech recognition hypotheses at the corresponding time step.
A person of ordinary skill in the art would have had reason to modify the architecture recited in the claims of U.S. Patent No. 12062363 to reduce model complexity, align module dimensionalities, and stabilize training. Specifically, one skilled in the art would have been motivated to (1) select a single attention/head configuration rather than multiple heads to reduce compute and parameter count, (2) aggregate a sequence of weighted embeddings to a single vector to simplify downstream interfacing and reduce decoding complexity, (3) apply a linear projection to match the dimensionality expected by the joint network or to reduce model size, and (4) apply normalization for known training stability benefits. These modifications produce predictable results and do not render the claimed subject matter nonobvious. See KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398 (2007).
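For illustration only, the four modifications discussed above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the implementation of either the application or the patent: the function name `prediction_step`, the use of a dot product as the similarity between an embedding and its position vector, the plain mean used for the averaging step, the layer-normalization formula without learned parameters, and all numeric values are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prediction_step(symbol_ids, embedding_table, position_vectors, projection):
    """One time step of a hypothetical single-head prediction network."""
    # Look up one embedding per previously emitted non-blank symbol.
    embeddings = [embedding_table[s] for s in symbol_ids]
    # (1) Single head: only one set of position vectors is used.
    # Assumption: the weight is the dot-product similarity between each
    # embedding and the position vector assigned to its slot.
    weights = [dot(e, p) for e, p in zip(embeddings, position_vectors)]
    weighted = [[w * x for x in e] for w, e in zip(weights, embeddings)]
    # (2) Aggregate the weighted embeddings into a single vector (mean).
    count = len(weighted)
    averaged = [sum(column) / count for column in zip(*weighted)]
    # (3) Linear projection; each row of `projection` yields one output value.
    projected = [dot(row, averaged) for row in projection]
    # (4) Normalize to zero mean and (approximately) unit variance.
    mean = sum(projected) / len(projected)
    var = sum((x - mean) ** 2 for x in projected) / len(projected)
    return [(x - mean) / math.sqrt(var + 1e-6) for x in projected]


# Hypothetical toy values: a 3-symbol vocabulary with 2-dimensional embeddings.
embedding_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
position_vectors = [[1.0, 0.0], [0.5, 0.5]]  # one vector per history slot
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps 2 dims to 3 dims

out = prediction_step([2, 1], embedding_table, position_vectors, projection)
print(len(out))  # one embedding vector with three components
```

In this sketch the two input symbols collapse into one three-component vector, which is the kind of single per-step output that item (2) describes as simplifying the interface to a downstream joint network.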
Applicant may overcome this rejection by one of the following: (A) filing a timely terminal disclaimer; or (B) amending the claims of the present application (Application No. 18347842) to include additional claim limitations that are supported by the specification and that render the claimed subject matter nonobvious over the claims of the parent (U.S. Patent No. 12062363).
Allowable Subject Matter
Claims 1-20 are allowable over the prior art of record. The prior art, taken alone or in combination, fails to teach a prediction network having a “single head” that (i) “generates, by [the prediction network], a sequence of embeddings for the sequence of non‑blank symbols,” (ii) “weights . . . each corresponding embedding . . . based on a respective position embedding,” (iii) “generates . . . a weighted average of the sequence of weighted embeddings,” and (iv) applies a “projection layer” and “normalizing the projection output . . . to provide . . . a single embedding vector,” with the final Softmax producing the speech recognition result.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Vikas et al., “Transfer Learning Approaches for Streaming End-to-End Speech Recognition System,” arXiv.org, Cornell University Library, 12 August 202, disclose baseline RNN‑T elements (encoder, prediction network taking the prior non‑blank symbol, joint network, Softmax, and an input embedding matrix), but do not disclose the claimed prediction‑network pipeline of single‑head positional weighting, aggregation to a single vector via weighted averaging, followed by projection and normalization steps.
Kurata et al., PgPub US2022/0208179, disclose a method that involves synthesizing first domain audio data from first domain text data. The synthesized first domain audio data is fed into a trained encoder of the recurrent neural network transducer (RNN-T) having an initial condition. The encoder is updated using the synthesized first domain audio data and the first domain text data. The second domain audio data is synthesized from second domain text data. The synthesized second domain audio data is fed into the updated encoder of the recurrent neural network transducer (RNN-T). The prediction network is updated using the synthesized second domain audio data and the second domain text data. The updated encoder is restored to the initial condition.
Prabhavalkar et al., US Patent No. 11145293, disclose methods, systems, and apparatus, including computer-readable media, for performing speech recognition using sequence-to-sequence models. An automated speech recognition (ASR) system receives audio data for an utterance and provides features indicative of acoustic characteristics of the utterance as input to an encoder. The system processes an output of the encoder using an attender to generate a context vector and generates speech recognition scores using the context vector and a decoder trained using a training process that selects at least one input to the decoder with a predetermined probability. An input to the decoder during training is selected between input data based on a known value for an element in a training example, and input data based on an output of the decoder for the element in the training example. A transcription is generated for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHEMOND DORVIL whose telephone number is (571)272-7602. The examiner can normally be reached 8:30 - 5:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RICHEMOND DORVIL/ Supervisory Patent Examiner, Art Unit 2658