Prosecution Insights
Last updated: April 19, 2026
Application No. 18/744,592

JOINT AUTOMATIC SPEECH RECOGNITION AND SPEAKER DIARIZATION

Non-Final OA: §101, §102, §DP
Filed: Jun 14, 2024
Examiner: CHAWAN, VIJAY B
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 88% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 88% (above average; 776 granted / 882 resolved; +26.0% vs TC avg)
Interview Lift: +11.6% (moderate), based on resolved cases with an interview
Typical Timeline: 2y 8m average prosecution (21 currently pending)
Career History: 903 total applications across all art units
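
The headline figures above are simple functions of the raw counts. A minimal sketch, assuming the interview lift is applied additively to the career allow rate (the tool's exact combination formula is not shown):

```python
# Minimal sketch (assumed formulas): reproduce the dashboard's headline
# examiner statistics from the raw counts shown above.
granted, resolved = 776, 882            # career totals
allow_rate = granted / resolved         # 0.8798 -> shown as "88% Career Allow Rate"

tc_avg = allow_rate - 0.260             # "+26.0% vs TC avg" implies a ~62% TC average

interview_lift = 0.116                  # "+11.6% Interview Lift"
with_interview = min(allow_rate + interview_lift, 1.0)  # assumption: additive lift

print(f"allow rate:     {allow_rate:.1%}")      # 88.0%
print(f"implied TC avg: {tc_avg:.1%}")          # 62.0%
print(f"with interview: {with_interview:.1%}")  # 99.6%, shown as 99%
```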

Statute-Specific Performance

§101: 20.9% (-19.1% vs TC avg)
§103: 13.8% (-26.2% vs TC avg)
§102: 33.8% (-6.2% vs TC avg)
§112: 9.4% (-30.6% vs TC avg)
Tech Center averages are estimates. Based on career data from 882 resolved cases.
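
Assuming each delta is the examiner's per-statute rate minus the Tech Center average, the TC baselines can be recovered by subtraction. A quick sketch of that arithmetic (note that every implied baseline works out to 40.0%, suggesting the tool compares against a single TC-wide figure):

```python
# Recover the implied Tech Center averages from the examiner's per-statute
# rates and the displayed deltas (assumption: delta = examiner - TC average).
examiner = {"101": 0.209, "103": 0.138, "102": 0.338, "112": 0.094}
delta    = {"101": -0.191, "103": -0.262, "102": -0.062, "112": -0.306}

for statute, rate in examiner.items():
    tc_avg = rate - delta[statute]     # every statute yields 0.400
    print(f"§{statute}: examiner {rate:.1%} vs implied TC avg {tc_avg:.1%}")
```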

Office Action

Rejection grounds: §101, §102, §DP
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 2-21 are rejected under 35 U.S.C. 101 because the claims are directed toward an abstract idea without significantly more. Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (an abstract idea) and does not include additional elements that amount to significantly more than the judicial exception.

Step 1: Claim 2 is directed toward a "computer implemented method", which is a method and thus falls within a statutory category under the most recent guidelines of 35 U.S.C. 101.

Step 2A, Prong 1: Claim 2 recites instructions for "obtaining an audio segment sequence characterizing an audio segment"; "mapping, using a neural network, the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols"; and "determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word." These limitations collectively recite the collection, evaluation, and translation of information, including language evaluation and translation. As characterized by USPTO guidance and case law, such activities fall within the abstract-idea groupings of mental processes (e.g., observations, evaluations, and judgments that could be performed in the human mind or with pen and paper) and organizing/transmitting information. Reference can be made to the latest patent eligibility guidelines. Accordingly, claim 2 recites an abstract idea.

Step 2A, Prong 2: The claim is implemented on a server using a neural network, including one or more processors and memory storing one or more programs to be executed by the one or more processors. These are generic computer components performing their well-understood, routine, and conventional functions of storing and executing instructions, receiving requests, and sending content. The claim does not recite any specific improvement to computer functionality (e.g., a particular translation algorithm, model architecture, data structure, memory organization, caching mechanism, latency-reduction technique, or network protocol that improves the operation of the computer or network). Nor does it effect a transformation of a physical article or use the abstract idea in any other manner that imposes a meaningful limit on the claim's scope. Therefore, the claim does not integrate the abstract idea into a practical application under Step 2A, Prong 2.
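For concreteness, the "determining" step recited above can be pictured in code. The sketch below assumes the attribution rule of dependent claim 8 (each word is identified as spoken by the speaker whose label symbol immediately follows the word's text symbols) and uses a hypothetical symbol encoding; it illustrates the claim language, not the application's actual implementation.

```python
# Hypothetical encoding: text symbols are characters, speaker labels look
# like "<spk:N>", and "<blank>" is the blank symbol from the claimed set.
BLANK = "<blank>"

def transcribe(output_symbols):
    """Map an output symbol sequence to (word, speaker) pairs, attributing
    each word to the speaker label that immediately follows it (claim 8)."""
    pairs, current_word = [], []
    for sym in output_symbols:
        if sym == BLANK:
            continue                          # blank symbols carry no content
        if sym.startswith("<spk:"):           # a speaker label symbol
            if current_word:
                pairs.append(("".join(current_word), sym))
                current_word = []
        else:                                 # a text symbol
            current_word.append(sym)
    return pairs

# Two words attributed to two speakers:
seq = ["h", "i", "<blank>", "<spk:1>", "o", "k", "<spk:2>"]
print(transcribe(seq))   # [('hi', '<spk:1>'), ('ok', '<spk:2>')]
```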
Step 2B: The ordered combination of limitations mirrors the abstract idea itself performed using routine computer operations. There is no recited unconventional hardware, no technical improvement to the functioning of the computer itself, and no unconventional arrangement of known components. Accordingly, claim 2 does not include an "inventive concept" sufficient to transform the abstract idea into a patent-eligible application. Therefore, claim 2 is directed to an abstract idea and does not recite additional elements that integrate the exception into a practical application or amount to significantly more than the exception itself. Claim 2 is therefore rejected under 35 U.S.C. § 101.

Dependent claims 3-9 do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional elements, when considered both individually and as an ordered combination, do not amount to significantly more than the abstract idea.

Independent claim 10 recites the steps of "obtaining an audio segment sequence characterizing an audio segment"; "mapping, using a neural network, the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols"; and "determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word." All the steps can be performed by a human being, including applying a translation service algorithm. These limitations collectively recite the collection, evaluation, and translation of information, including language evaluation and translation. As characterized by USPTO guidance and case law, such activities fall within the abstract-idea groupings of mental processes (e.g., observations, evaluations, and judgments that could be performed in the human mind or with pen and paper) and organizing/transmitting information. Reference can be made to the latest patent eligibility guidelines. Accordingly, claim 10 recites an abstract idea.

Step 2B: Beyond the abstract idea, the additional elements are the generic "server," "one or more processors," "computers," and "memory" performing their conventional functions. Implementing the abstract idea on generic computer components does not amount to significantly more. Alice, 573 U.S. at 223-24. The ordered combination of limitations mirrors the abstract idea itself performed using routine computer operations. There is no recited unconventional hardware, no technical improvement to the functioning of the computer itself, and no unconventional arrangement of known components. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
With respect to integration of the abstract idea into a practical application, the additional element of using a generic computing device to perform the determining and data-gathering steps amounts to no more than mere instructions to apply the exception using a generic computer. The current specification, at paragraphs 0066-0069, clearly specifies that:

"[0066] This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

[0067] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

[0068] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0069] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system.
A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network."

The additional elements have been considered both individually and as an ordered combination in the significantly-more consideration. The inclusion of the computer or memory and controller to perform the selecting and generating steps amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computing device cannot provide an inventive concept. Therefore, claim 10 as drafted is not patent eligible.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional elements, when considered both individually and as an ordered combination, do not amount to significantly more than the abstract idea. Thus, taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation. Independent claim 10 is therefore not drawn to eligible subject matter, as it is directed to an abstract idea without significantly more.

Claims 11-13 are dependent claims and do not contain subject matter that can overcome the rejection of independent claim 10. Claim 10 is directed toward a non-transitory computer readable medium with instructions to implement the method of claim 2 and is rejected under a similar rationale. Claims 14-21 are system claims similar in scope and content to method claims 2-9 and are rejected under a similar rationale. All dependent claims, when analyzed as a whole, are held to be patent ineligible under 35 U.S.C. §101 because any additional recited limitations fail to establish that the claims are not directed to an abstract idea, for the same reasons already recited for the independent claims.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir.
1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting, provided the reference application or patent either is shown to be commonly owned with the examined application or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA, as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 2-21 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent No. 12,039,082. Although the claims at issue are not identical, they are not patentably distinct from each other because claims 2-21 of the instant application are similar in scope and content to patented claims 1-20 of the patent issued to the same Applicant. It is clear that all the elements of application claims 2-21 are to be found in patented claims 1-20 (as application claims 2-21 fully encompass patented claims 1-20). The difference between the application claims and the patent claims lies in the fact that the patent claims include many more elements and are thus much more specific. Thus the invention of claims 1-20 of the patent is in effect a "species" of the "generic" invention of application claims 2-21. It has been held that the generic invention is "anticipated" by the "species". See In re Goodman, 29 USPQ2d 2010 (Fed. Cir. 1993). Since application claims 2-21 are anticipated by claims 1-20 of the patent, they are not patentably distinct from the patented claims.

Claim comparison (Application No. 18/744,592 vs. Patent No. 12,039,982):
Application claim 2: A computer-implemented method comprising: obtaining an audio segment sequence characterizing an audio segment; mapping, using a neural network, the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Patent claim 1: A computer-implemented method comprising: obtaining an audio segment sequence characterizing an audio segment, the audio segment sequence comprising a plurality of audio frames; mapping, using a joint automatic speech recognition-speaker diarization (ASR-SD) neural network, the audio segment sequence to an output sequence comprising a respective output symbol for each of a plurality of time steps, wherein, for each of the time steps, the output symbol for the time step in the output sequence is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Application claim 3: The method of claim 2, wherein the output sequence comprises a respective output symbol at each of a plurality of time steps, and wherein the neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Patent claim 2: The method of claim 1, wherein the joint ASR-SD neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Application claim 4: The method of claim 3, wherein the neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Patent claim 3: The method of claim 2, wherein the joint ASR-SD neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Application claim 5: The method of claim 4, wherein the neural network comprises a joint neural network and a softmax output layer, and wherein mapping the audio segment sequence comprises, for each time step: processing the encoded representation for the time step and the prediction representation for the time step to generate a respective logit for each of the output symbols in the set of output symbols; and processing the logits for the output symbols using the softmax output layer to generate a probability distribution over the output symbols in the set of output symbols.

Patent claim 4: The method of claim 3, wherein the joint ASR-SD neural network comprises a joint neural network and a softmax output layer, and wherein mapping the audio segment sequence comprises, for each time step: processing the encoded representation for the time step and the prediction representation for the time step to generate a respective logit for each of the output symbols in the set of output symbols; and processing the logits for the output symbols using the softmax output layer to generate a probability distribution over the output symbols in the set of output symbols.

Application claim 6: The method of claim 5, wherein mapping the audio segment sequence comprises, for each time step: selecting an output symbol from the set of output symbols using the probability distribution.

Patent claim 5: The method of claim 4, wherein mapping the audio segment sequence comprises, for each time step: selecting an output symbol from the set of output symbols using the probability distribution.

Application claim 7: The method of claim 2, wherein the text symbols comprise symbols representing one or more of phonemes, morphemes, or characters.

Patent claim 6: The method of claim 1, wherein the text symbols represent phonemes, morphemes, or characters.

Application claim 8: The method of claim 2, wherein determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word comprises: identifying words represented by the text symbols defined by the output sequence; and for each identified word: identifying a speaker label symbol that immediately follows the text symbols representing the word in the output sequence; and identifying the word as having been spoken by a speaker represented by the identified speaker label.

Patent claim 7: The method of claim 1, wherein determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word comprises: identifying words represented by the text symbols in the output sequence; and for each identified word: identifying a speaker label symbol that immediately follows the text symbols representing the word in the output sequence; and identifying the word as having been spoken by a speaker represented by the identified speaker label.

Application claim 9: The method of claim 2, wherein the set of possible speakers is a set of possible speaking roles in a conversation, and wherein each speaker label symbol identifies a different speaking role from the plurality of possible speaking roles.

Patent claim 8: The method of claim 1, wherein the set of possible speakers is a set of possible speaking roles in a conversation, and wherein each speaker label symbol identifies a different speaking role from the plurality of possible speaking roles.

Application claim 10: One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining an audio segment sequence characterizing an audio segment; mapping, using a neural network, the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Patent claim 9: One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining an audio segment sequence characterizing an audio segment, the audio segment sequence comprising a plurality of audio frames; mapping, using a joint automatic speech recognition-speaker diarization (ASR-SD) neural network, the audio segment sequence to an output sequence comprising a respective output symbol for each of a plurality of time steps, wherein, for each of the time steps, the output symbol for the time step in the output sequence is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Application claim 11: The computer-readable storage media of claim 10, wherein the output sequence comprises a respective output symbol at each of a plurality of time steps, wherein the neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Patent claim 10: The computer-readable storage media of claim 9, wherein the joint ASR-SD neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Application claim 12: The computer-readable storage media of claim 11, wherein the neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Patent claim 11: The computer-readable storage media of claim 10, wherein the joint ASR-SD neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Application claim 13: The computer-readable storage media of claim 12, wherein the neural network comprises a joint neural network and a softmax output layer, and wherein mapping the audio segment sequence comprises, for each time step: processing the encoded representation for the time step and the prediction representation for the time step to generate a respective logit for each of the output symbols in the set of output symbols; and processing the logits for the output symbols using the softmax output layer to generate a probability distribution over the output symbols in the set of output symbols.

Patent claim 12: The computer-readable storage media of claim 11, wherein the joint ASR-SD neural network comprises a joint neural network and a softmax output layer, and wherein mapping the audio segment sequence comprises, for each time step: processing the encoded representation for the time step and the prediction representation for the time step to generate a respective logit for each of the output symbols in the set of output symbols; and processing the logits for the output symbols using the softmax output layer to generate a probability distribution over the output symbols in the set of output symbols.

Application claim 14: A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: obtaining an audio segment sequence characterizing an audio segment; mapping, using a neural network, the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Patent claim 13: A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: obtaining an audio segment sequence characterizing an audio segment, the audio segment sequence comprising a plurality of audio frames; mapping, using a joint automatic speech recognition-speaker diarization (ASR-SD) neural network, the audio segment sequence to an output sequence comprising a respective output symbol for each of a plurality of time steps, wherein, for each of the time steps, the output symbol for the time step in the output sequence is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols, each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols; and determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word.

Application claim 15: The system of claim 14, wherein the output sequence comprises a respective output symbol at each of a plurality of time steps, wherein the neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Patent claim 14: The system of claim 13, wherein the joint ASR-SD neural network comprises a transcription neural network, and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps.

Application claim 16: The system of claim 15, wherein the neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Patent claim 15: The system of claim 14, wherein the joint ASR-SD neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence.

Application claim 18: The system of claim 17, wherein mapping the audio segment sequence comprises, for each time step: selecting an output symbol from the set of output symbols using the probability distribution.

Patent claim 17: The system of claim 16, wherein mapping the audio segment sequence comprises, for each time step: selecting an output symbol from the set of output symbols using the probability distribution.

Application claim 19: The system of claim 14, wherein the text symbols comprise symbols representing one or more of phonemes, morphemes, or characters.

Patent claim 18: The system of claim 13, wherein the text symbols represent phonemes, morphemes, or characters.

Application claim 20: The system of claim 14, wherein determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word comprises: identifying words represented by the text symbols defined by the output sequence; and for each identified word: identifying a speaker label symbol that immediately follows the text symbols representing the word in the output sequence; and identifying the word as having been spoken by a speaker represented by the identified speaker label.

Patent claim 19: The system of claim 13, wherein determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word comprises: identifying words represented by the text symbols in the output sequence; and for each identified word: identifying a speaker label symbol that immediately follows the text symbols representing the word in the output sequence; and identifying the word as having been spoken by a speaker represented by the identified speaker label.

Application claim 21: The system of claim 14, wherein the set of possible speakers is a set of possible speaking roles in a conversation, and wherein each speaker label symbol identifies a different speaking role from the plurality of possible speaking roles.

Patent claim 20: The system of claim 13, wherein the set of possible speakers is a set of possible speaking roles in a conversation, and wherein each speaker label symbol identifies a different speaking role from the plurality of possible speaking roles.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless - (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 2-21 are rejected under 35 U.S.C.
102(a)(2) as being anticipated by Thompson et al. (US 2020/0175961 A1).

As per claims 2, 10, and 14, Thompson et al. teach a computer-implemented method/system/non-transitory computer readable medium with instructions to implement the method comprising: obtaining an audio segment sequence characterizing an audio segment (0237, 0367, Table 1); mapping, using a neural network (0261, 0268), the audio segment sequence to an output sequence defining a plurality of output symbols, wherein each of the plurality of output symbols is selected from a set of output symbols that includes (i) a plurality of text symbols, (ii) a plurality of speaker label symbols (0256, 0269, 0284-0285), each speaker label symbol identifying a different speaker from a set of possible speakers, and (iii) a blank symbol, wherein the respective output symbols in the output sequence comprise a plurality of text symbols and at least one speaker label symbol selected from the plurality of speaker label symbols (1118, 1187, 1265); and determining, from the output sequence, a transcription of the audio segment that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word (0256, 0269, 0284-0285).

As per claims 3, 11, and 15, Thompson et al. teach the method/system/non-transitory computer readable medium of claims 2, 10, and 14, wherein the output sequence comprises a respective output symbol at each of a plurality of time steps, and wherein the neural network comprises a transcription neural network (0256, 0261, 0269, 0284-0285), and wherein mapping the audio segment sequence comprises: processing the audio segment sequence using the transcription neural network, wherein the transcription neural network is configured to process the audio segment data to generate a respective encoded representation of each of the plurality of time steps (1118, 1187, 1265).

As per claims 4, 12, and 16, Thompson et al. teach the method/system/non-transitory computer readable medium of claims 3, 11, and 15, wherein the neural network further comprises a prediction neural network, and wherein mapping the audio segment sequence comprises, for each time step: identifying a current output symbol for the time step, and processing the current output symbol for the time step using the prediction neural network, wherein the prediction neural network is configured to process the current output symbol to generate a prediction representation for the time step conditioned on any non-blank output symbols that have already been included at any earlier time steps in the output sequence (0367, 0766, 1481).

As per claims 5, 13, and 17, Thompson et al. teach the method/system/non-transitory computer readable medium of claims 4, 12, and 16, wherein the neural network comprises a joint neural network and a softmax output layer, and wherein mapping the audio segment sequence comprises, for each time step (0994-0995): processing the encoded representation for the time step and the prediction representation for the time step to generate a respective logit for each of the output symbols in the set of output symbols (0994-0995); and processing the logits for the output symbols using the softmax output layer to generate a probability distribution over the output symbols in the set of output symbols (0253, 0585).
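Claims 3-6 (and their storage-media and system counterparts mapped above) together describe a transducer-style decoding loop: a transcription (encoder) network produces an encoded representation per time step, a prediction network is conditioned on the non-blank symbols emitted so far, a joint network turns the two into per-symbol logits, and a softmax yields the distribution from which a symbol is selected. A minimal numpy sketch of one such loop follows; the additive joint, greedy selection, random stand-in networks, and all shapes are assumptions for illustration, not the architecture of the application or of Thompson:

```python
import numpy as np

rng = np.random.default_rng(0)
SYMBOLS = ["a", "b", "<spk:1>", "<spk:2>", "<blank>"]  # hypothetical symbol set
D = 8                                                   # hidden size (assumed)
W_joint = rng.normal(size=(D, len(SYMBOLS)))            # stand-in joint network

def encoder(frames):
    # Stand-in "transcription neural network": one encoding per time step.
    return rng.normal(size=(len(frames), D))

def predictor(history):
    # Stand-in "prediction neural network", conditioned only on the
    # non-blank symbols emitted so far (here crudely, via history length).
    return rng.normal(size=D) * (1 + len(history))

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

frames = np.zeros((5, 40))                 # 5 dummy audio frames
enc = encoder(frames)
history, output = [], []
for t in range(len(enc)):
    joint = enc[t] + predictor(history)    # joint network (additive, assumed)
    probs = softmax(joint @ W_joint)       # distribution over output symbols
    sym = SYMBOLS[int(np.argmax(probs))]   # greedy selection (assumed)
    output.append(sym)
    if sym != "<blank>":
        history.append(sym)                # blanks do not condition the predictor
print(output)
```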
As per claims 6 and 18, Thompson et al. teach the method/system of claims 5 and 17, wherein mapping the audio segment sequence comprises, for each time step: selecting an output symbol from the set of output symbols using the probability distribution (0227, 0248, 0255).

As per claims 7 and 19, Thompson et al. teach the method/system of claims 2 and 14, wherein the text symbols comprise symbols representing one or more of phonemes, morphemes, or characters (0255-0256).

As per claims 8 and 20, Thompson et al. teach the method/system of claims 5 and 17, wherein determining, from the output sequence, a transcription of the audio segment data that identifies (i) words spoken in the audio segment and (ii) for each of the spoken words, the speaker from the set of possible speakers that spoke the word comprises: identifying words represented by the text symbols defined by the output sequence; and for each identified word: identifying a speaker label symbol that immediately follows the text symbols representing the word in the output sequence; and identifying the word as having been spoken by a speaker represented by the identified speaker label (1118, 1187, 1265).

As per claims 9 and 21, Thompson et al. teach the method of claim 2, wherein the set of possible speakers is a set of possible speaking roles in a conversation, and wherein each speaker label symbol identifies a different speaking role from the plurality of possible speaking roles (0171-0172, 0250).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see attached form PTO-892.

Graff et al. (US 12,249,334 B2) teach a text mining engine running on an artificial intelligence platform that is trained to perform conversation role identification, semantic analysis, summarization, language detection, etc. The text mining engine analyzes words in a transcript that represent unique characteristics of a conversation and, based on the unique characteristics and utilizing classification predictive modeling, determines a conversation role for each participant of the conversation and metadata describing the conversation, such as tonality of words spoken by a participant in a particular conversation role. Outputs from the text mining engine are indexed and useful for various purposes. For instance, because the system can identify which speaker in a customer service call is likely an agent and which speaker is likely a customer, words spoken by the agent can be analyzed for compliance reasons, training agents, providing quality assurance for improving customer service, providing feedback to improve the performance of the text mining engine, etc.

Sak et al. (US 10,706,840 B2) teach methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step.
The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VIJAY B CHAWAN, whose telephone number is (571) 272-7601. The examiner can normally be reached 7-5, Monday through Thursday.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/VIJAY B CHAWAN/
Primary Examiner, Art Unit 2658

Prosecution Timeline

Jun 14, 2024: Application Filed
Feb 07, 2026: Non-Final Rejection, §101, §102, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603089: ELECTRONIC APPARATUS PERFORMING SPEECH RECOGNITION AND METHOD FOR CONTROLLING THEREOF (granted Apr 14, 2026; 2y 5m to grant)
Patent 12592229: WAKEWORD DETECTION (granted Mar 31, 2026; 2y 5m to grant)
Patent 12586579: End-To-End Segmentation in a Two-Pass Cascaded Encoder Automatic Speech Recognition Model (granted Mar 24, 2026; 2y 5m to grant)
Patent 12585895: Communication Channel Quality Improvement System Using Machine Conversions (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579968: METHOD OF DETERMINING END POINT DETECTION TIME AND ELECTRONIC DEVICE FOR PERFORMING THE METHOD (granted Mar 17, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 88%
With Interview: 99% (+11.6%)
Median Time to Grant: 2y 8m
PTA Risk: Low
Based on 882 resolved cases by this examiner. Grant probability derived from career allow rate.
