Prosecution Insights
Last updated: April 19, 2026
Application No. 18/417,724

Method and System for Low-Complexity Real-Time Multiclass Hierarchical Audio Classification

Non-Final OA: §103, §112, §DP
Filed: Jan 19, 2024
Examiner: SIRJANI, FARIBA
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Audio Technologies and Codecs, Inc.
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (414 granted / 547 resolved), +13.7% vs TC avg, above average
Interview Lift: +31.0%, a strong lift in grant rate for resolved cases with an examiner interview vs. without
Typical Timeline: 2y 10m average prosecution, 31 applications currently pending
Career History: 578 total applications across all art units

Statute-Specific Performance

§101: 14.1% (-25.9% vs TC avg)
§103: 49.1% (+9.1% vs TC avg)
§102: 14.7% (-25.3% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)
Tech Center average comparisons are estimates • Based on career data from 547 resolved cases
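The dashboard does not say what these percentages measure, but the four vs-TC-avg deltas are each consistent with a single Tech Center baseline of roughly 40%, which suggests one flat estimate is used for the comparison. A quick arithmetic check (an inference from the displayed numbers, not something the tool documents), in Python:

    # Implied Tech Center baseline per statute = examiner rate minus the displayed delta.
    rates  = {"101": 14.1, "103": 49.1, "102": 14.7, "112": 10.7}
    deltas = {"101": -25.9, "103": 9.1, "102": -25.3, "112": -29.3}
    implied = {s: round(rates[s] - deltas[s], 1) for s in rates}
    print(implied)  # {'101': 40.0, '103': 40.0, '102': 40.0, '112': 40.0}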

Office Action

§103 §112 §DP
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . DETAILED ACTION Claims 1-18 are pending. Claims 1 and 17-18 are independent. This Application was published as U.S. 2025/0069592. Apparent priority: 24 August 2023 (provisional). Claims of the instant Application are rejected over the claims of US 18/417,632 published as 20250069591 under Obviousness Double Patenting and a Terminal Disclaimer is required over the term of this application. Claim Objections Claim 3 is objected to because of informalities that may be addressed with the following suggested amendments: 3. The method of claim 1, wherein audio classified as third audio class is further classified into two separate audio classes resulting in a 3-stage hierarchical classifier and classification into 4 audio classes. Appropriate correction is required. 35 U.S.C. 112(f) Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: each and every limitation in Claims 17-18, except for the method step that is occurring in midst of system/device Claims. Claims 17-18 are directed to system and device categories, respectively. The limitations of these Claims need to be system/device limitations. The Claims include the following: 17. A system for hierarchical audio classification, wherein the said system for hierarchical audio classification comprises: at least two separate AI models comprising of at least two independent Long Short-Term Memory (LSTM) neural networks … at least another class transition AI neural network … inputting, into each neural network, …. [Note that this is a method step in a system claim.] at least a first audio classifier for … at least a second audio classifier for … an AI class transition detector for …. … 18. A device for hierarchical audio classification, wherein the said device for hierarchical audio classification comprises: at least two separate AI models comprising of at least two independent Long Short-Term Memory (LSTM) neural networks … at least another class transition AI neural network …. inputting, into each neural network, …[Note that this is a method step in a system claim.] 
at least a second audio classifier for … an AI class transition detector for … … These limitations (model, classifier, detector, neural network) are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says: For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts. Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory and possibly transducers such as microphones and displays or to a combination of software and hardware. PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “microphone” or “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 17-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Please refer to the 112(f) discussion above. First, these are system/device Claims that include a method step of “inputting ….” A machine or manufacture Claim needs to comprise of system/device/hardware limitations. A method step can be included to describe the function and operation of one of the hardware limitations but not as a whole limitation. Presence of the process step of “inputting” as a stand-alone limitation of a system and a device Claim creates ambiguity as to the intent of the drafter and the category of the Claim and results in indefiniteness. Remove the process step or fold it into a hardware limitation as part of the operation of that limitation. The main limitations have to be system or device components not a process step. Second, when a limitation is stated in the means-plus-function format, the Specification must provide the corresponding structure for the performance of the function and a search of the instant Specification did not readily yield corresponding structures for the models, classifiers, and detectors, or the neural networks of these Claims. 
To overcome the rejection, either point to the location in the Specification that clearly defines the means-plus-function generic placeholders that are used as the limitations of Claims 17-18, or modify the language of these Claims to conform to either Machine/Manufacture or Process claim language as the statute (35 U.S.C. 101) sets forth. For example, “models” and “neural networks” that are claimed as limitations of these two Claims are generally either computer programs or mathematical constructs neither of which qualify as an element of a device or a system. One straight-forward way to conform the language to a single statutory category is to claim a system as a combination of processors and memory that is used for performing the steps of a process. No new matter may be introduced. Claim 11 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 11 recites: 11. The method of claim 1, further comprising low pass filtering with cut-off 2.5 Khz and 4 Khz on a speech sample for audio classification. Low-pass filtering is specified with one value. Which is it? LPF with cutoff of 4 or LPF with cutoff of 2.5? A bandpass filter has a lower and a higher cutoff. Was this intended as a BPF? Or, cut-off is either 2.5 or 4? For applying art the Claim is interpreted as a LPF with a cutoff of 4KHz. Double Patenting The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969). A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. 
The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp. Claims 1-18 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims of U.S. Application No. 18/417,632 as shown below. Although the claims at issue are not identical, they are not patentably distinct from each other because of the following mapping: Instant Application Reference U.S. Application 18/417,632 1. A method for hierarchical audio classification, said method for hierarchical audio classification comprising: at least two classification stages; training and generating at least two independent Long Short-Term Memory (LSTM) neural networks, one for each classification stage, by an audio database tagged into audio classes comprising at least a background noise audio class, and at least a second audio class and at least a third audio class; training another class transition neural network based on a class transition tagged database; for each new audio frame input, inputting, into the at least two independent LSTM and the class transition neural network, a plurality of audio frame features; determining position of a possible audio class transition by using the said class transition neural network over a slice consisting of a plurality of consecutive audio frame features; classifying the incoming audio signal into either an intelligible audio class or the background noise class at a decision time resolution higher than the slice duration, in a first stage of the at least two stage classifier by using first of the at least two independent LSTM networks configured to run as a stateful predictor; further classifying the incoming audio signal, upon detecting the intelligible audio class in the first stage of the classifier, into either the second audio class and the third audio at a decision time resolution higher than the slice duration, in a second stage of the at least two stage classifier by using second of the at least two independent LSTM networks configured to run as a stateful predictor; and, performing a final classification of the incoming audio signal into the at least 3 audio classes at a decision time resolution higher than the slice duration; wherein the states of stateful LSTM predictors are reset based on class transient location derived using class transition neural network. 1. 
A method for hierarchical audio classification, said method for hierarchical audio classification comprising: at least two classification stages; training and generating at least two independent Long Short-Term Memory (LSTM) neural networks, one for each classification stage, by an audio database tagged into audio classes comprising at least a background noise audio class, at least a second audio class, and at least a third audio class; training another class transition neural network based on a class transition tagged database; for each new audio frame input, inputting, into the at least two independent LSTM and the class transition neural network, a plurality of audio frame features; determining position of a possible audio class transition by using the said class transition neural network over a slice consisting of a plurality of consecutive audio frame features; classifying the incoming audio signal into either an intelligible audio class or the background noise class at a decision time resolution higher than the slice duration, in a first stage of the at least two stage classifier by using first of the at least two independent LSTM networks; further classifying the incoming audio signal, upon detecting the intelligible audio class in the first stage of the classifier, into either the second audio class and the third audio class, in a second stage of the at least two stage classifier by using second of the at least two independent LSTM networks; and, performing a final classification of the incoming audio signal into the at least 3 audio classes at a decision time resolution higher than the slice duration; wherein the accuracy of the each of the at least two classification stages is improved using the determination of the position of possible transition by the transition detector neural network. 2. The method of claim 1, wherein the second audio class is a speech class and the third audio class is a music class. 2. The method of claim 1, wherein the second audio class is a speech class and the third audio class is a music class. 3. The method of claim 1, wherein audio classified as third audio class is further classified into two separate audio classes resulting in resulting in a 3-stage hierarchical classifier and classification into 4 audio classes. 3. The method of claim 1, wherein audio classified as third audio class is further classified into two separate audio classes resulting in resulting in a 3-stage hierarchical classifier and classification into 4 audio classes. 4. The method of claim 3, where the 4 audio classes are a background noise audio class, a speech audio class, a vocal music audio class, and a non-vocal music audio class. 4. The method of claim 3, wherein the 4 audio classes are a background noise audio class, a speech audio class, a vocal music audio class, and a non-vocal music audio class. 5. The method of claim 1, wherein the large tagged database is created by assigning, by integer encoding or one-hot encoding, a plurality of labels to a plurality of audio data. 5. The method of claim 1, wherein the large tagged database is created by assigning, by integer encoding or one-hot encoding, a plurality of labels to a plurality of audio data. 6. The method of claim 1, wherein the plurality of frame features consist of at least 20 features. 6. The method of claim 1, wherein the plurality of frame features consist of at least 20 features. 7. The method of claim 1 wherein each of the audio frames features is normalized to have mean 0 and standard deviation 1. 7. 
The method of claim 1 wherein each of the audio frames features is normalized to have mean 0 and standard deviation 1. 8. The method of claim 4, wherein the at least 20 audio frame features inculcate both temporal and frequency domain information. 8. The method of claim 4, wherein the at least 20 audio frame features inculcate both temporal and frequency domain information. 9. The method as claimed of claim 1, wherein the incoming audio signal is in 44100 Hz sample rate, 16 bit-depth, mono channel PCM WAVE format. 9. The method of claim 1, wherein the incoming audio signal is in 44100 Hz sample rate, 16 bit-depth, mono channel PCM WAVE format. 10. The method of claim 1, further comprising removing silence from a clean speech for converting a large duration of silence present in the clean speech to a small duration. 10. The method of claim 1, further comprising removing silence from a clean speech for converting a large duration of silence present in the clean speech to a small duration. 11. The method of claim 1, further comprising low pass filtering with cut-off 2.5 Khz and 4 Khz on a speech sample for audio classification. 11. The method of claim 1, comprising low pass filtering with cut-off 2.5 Khz and 4 Khz on a speech sample for audio classification. 12. The method of claim 1, wherein the audio frame slice is 64 frames. 12. The method of claim 1, wherein the audio frame slice is 64 frames. 13. The method of claim 10, wherein the desired decision time resolution is once every 16 audio frames. 13. The method of claim 10, wherein the desired decision time resolution is once every 16 audio frames. 14. The method of claim 1, wherein the first stage comprises two layers of LSTM having one dense layer and the input to the neural network is audio slice of 64 frames, each having 24 features. 14. The method of claim 1, wherein the first stage comprises two layers of LSTM having one dense layer and the input to the neural network is audio slice of 64 frames, each having 24 features. 15. The method of claim 1, wherein the input to the neural network in the second stage is audio slice of 64 frames each having 62 features. 15. The method of claim 1, wherein the input to the neural network in the second stage is an audio slice of 64 frames each having 62 features. 16. The method of claim 1, wherein the method uses a two-stage hierarchical binary classifier with two independent Long Short-Term Memory (LSTM) networks. 16. The method of claim 1, wherein the method uses a two-stage hierarchical binary classifier with two independent Long Short-Term Memory (LSTM) networks. Claim 17 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Claim 14 is a device claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 
Claims 1-3, 9, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Casey (U.S. 20010044719) in view of Cox (U.S. 20230368000) and Renner (U.S. 20210321150). Regarding Claim 1, Casey teaches: 1. A method for hierarchical audio classification, said method for hierarchical audio classification comprising: at least two classification stages; [Casey, Figure 7 and 12 both show at least two classification stages and Casey is directed to a hierarchical classification methods which includes as many stages as the input ontology of classes includes. “[0109] FIG. 12 shows a sound recognition classifier that uses a single database 1200 for all the necessary components of the classifier. The sound recognition classifier describes relationships between a number of probability models thus defining an ontology of classifiers. For example, a hierarchical recognizer can classify broad sound classes, such as animals, at the root nodes and finer classes, such as dogs:bark and cats:meow, at leaf nodes as described for FIGS. 6 and 7. This scheme defines mapping between an ontology of classifiers and a taxonomy of sound categories using the graph's descriptor scheme structure to enable hierarchical sound models to be used for extracting category descriptions for a given taxonomy.”] training and generating at least two independent Long Short-Term Memory (LSTM) neural networks, one for each classification stage, by an audio database tagged into audio classes comprising at least a background noise audio class, and at least a second audio class and at least a third audio class; [Casey trains HMM for each category of sound using labeled data stored in audio databases and it names easily more than 3 audio classes in various parts of its disclosure. “… Hidden Markov models are trained with sets of known features and stored in a database. The database can then be indexed by sets of unknown features to select or recognize like acoustic signals.” Abstract. Figure 15 shows more than two HMM models that are each trained to detect a particular class of sound. Figure 14 shows a process for training the HMM. “[0110] FIG. 13 shows a system 1300 for building a database of models. … The system extracts audio features from the files, and trains a hidden Markov model with these features. The system also uses a directory of sound examples for each sound class. The hierarchical directory structure defines an ontology that corresponds to a desired taxonomy. One hidden Markov model is trained for each of the directories in the ontology.” “[0133] Model Acquisition and Training” to [0138] . “[0141] FIG. 16 shows classification performance for ten sound classes 1601-1610, respectively: bird chirps, applause, dog barks, explosions, foot steps, glass breaking, gun shots, gym shoes, laughter, and telephones. Performance of the system was measured against a ground truth using the label of the source sound as specified by a professional sound-effect library.….”] training another class transition neural network based on a class transition tagged database; [Casey mentions and uses a transition matrix to take into account transition from one state to another: “13. 
The method of claim 5 wherein the temporal features describe a trajectory of the spectral features over time, and further comprising: partitions the acoustic signal generated by a particular source into a finite number of states based on the corresponding spectral features; representing each state by a continuous probability distribution; representing the temporal features by a transition matrix to model probabilities of transitions to a next state given a current state.”] for each new audio frame input, inputting, into the at least two independent LSTM and the class transition neural network, a plurality of audio frame features; [Casey, Figure 15, the “audio query 1501” is being subjected to feature extraction and the features are input to the various independent HMM models. “[0140] FIG. 15 shows an automatic extraction system 1500 for indexing sound in a database using pre-trained classifiers saved as DDL files. An unknown sound is read from a media source format, such as a WAV file 1501. The unknown sound is spectrum projected 1520 as described above. The projection, that is, the set of features is then used to select 1530 one of the HMMs from the database 1200. … Each sound is then indexed by its category, model reference and the model state path and the descriptors are written to a database in DDL format. The indexed database 1599 can then be searched to find matching sounds using any of the stored descriptors as described above, for example, all dog barkings. The substantially similar sounds can then be presented in a result list 1560.” ] determining position of a possible audio class transition by using the said class transition neural network over a slice consisting of a plurality of consecutive audio frame features; [Casey teaches the use of a transition matrix but does not elaborate the detection of the transition regions.] classifying the incoming audio signal into either an intelligible audio class or the background noise class at a decision time resolution higher than the slice duration, in a first stage of the at least two stage classifier by using first of the at least two independent LSTM networks configured to run as a stateful predictor; [Casey, Figures 16 and 17 show classification of incoming sound and Casey also teaches that it separates background noise from the other sounds. “[0145] As shown in FIG. 17 in simplified form, a sound query is presented to the system 1700 using the sound model state path description 1710 in DDL format. The system reads the query and populates internal data structures with the description information. This description is matched 1550 to descriptions taken from the sound database 1599 stored on disk. The sorted result list 1560 of closest matches is returned.” “10. The method of claim 6 wherein the categories include environmental sounds, background noises, sound effects, sound textures, animal sounds, speech, non-speech utterances, and music.”] further classifying the incoming audio signal, upon detecting the intelligible audio class in the first stage of the classifier, into either the second audio class and the third audio at a decision time resolution higher than the slice duration, in a second stage of the at least two stage classifier by using second of the at least two independent LSTM networks configured to run as a stateful predictor; and, [Casey, Figures 6-7 and 12 all show a hierarchical classification where the sounds of one class are further classified into sub-classes. “[0064] As shown in FIG. 
6 for a simple taxonomy 600, a description scheme (DS) is used for naming sound categories. As an example, the sound of a dog barking can be given the qualitative category label "Dogs" 610 with "Bark" 611 as a sub-category. In addition, "Woof" 612 or "Howl" 613 can be desirable sub-categories of "Dogs." The first two sub-categories are closely related, but the third is an entirely different sound event. …” “[0069] As shown in FIG. 7, categories can be combined by the relational links into a classification scheme 700 to make a richer taxonomy; for example, "Barks" 611 is a sub-category of "Dogs" 610 which is a sub-category of "Pets" 701; as is the category "Cats" 710. Cats 710 has the sound categories "Meow" 711 and "purr" 712. The following is an example of a simple classification scheme for "Pets" containing two categories: "Dogs" and "Cats".”] performing a final classification of the incoming audio signal into the at least 3 audio classes at a decision time resolution higher than the slice duration; [Casey, Figures 16 and 17 show classification of incoming sound and Figure 15 shows more than 3 classes of sound each represented by a separate HMM model. The classification decision is made per frame which is 10ms whereas the slice duration is the window duration which is 30ms such that the decision time resolution is every 10ms as opposed to the longer slice duration of 30ms which means that the resolution of decision (every frame) is higher than the resolution of the slice (every few frames): “[0086] The base features are derived from an audio spectrum envelope extraction process as described above. The audio spectrum projection descriptor is a container for dimension-reduced features that are obtained by projection of a spectrum envelope against a set of basis functions, also described above. For example, the audio spectrum envelope is extracted by a sliding window FFT analysis, with a resampling to logarithmic spaced frequncy bands. In the preferred embodiment, the analysis frame period is 10 ms. However, a sliding extraction window of 30 ms duration is used with a Hamming window….”] wherein the states of stateful LSTM predictors are reset based on class transient location derived using class transition neural network. [Casey mentions but does not elaborate on transitions.] Casey shows a hierarchical classification system for sounds where independent models are trained to detect different classes and sub-classes of sound. The models of Casey are HMMs and are not LSTMs or other type of NN. Cox teaches that LSTMs and HMMs can be used interchangeably for audio classifications and also focuses on state changes: training and generating at least two independent Long Short-Term Memory (LSTM) neural networks, one for each classification stage, by an audio database tagged into audio classes comprising at least a background noise audio class, and at least a second audio class and at least a third audio class; [Cox: “[0116] In some embodiments, as shown in FIG. 6B, FIG. 7 and/or FIG. 8, the extraction features may be analyzed with a suitable model, including a machine learning model, neural network and/or statistical model, to analyze the features. In some embodiments, the model may include a long short-term memory (LSTM) based neural network. In some embodiments, the features may be calculated on a frame-by-frame (e.g., segment by segment as described with reference to FIG. 4 above) basis and can be fed into the model for examination. 
…” “[0139] In some embodiments, the first model may include one or more cough detector models for detecting coughs. For example, the first model may include a burst classifier including, e.g., a CNN or other suitable classifier, an LSTM (e.g., as described above) or other suitable statistical model, an SVM (e.g., as described above), and/or a LSTM.” “[0100] In some embodiments, in order to effectively extract the features, the audio may be split into single cough segments. Once the audio is split, formants may then be calculated along with track length, gap length, two peak detection, F1-F3 and more. These features then are analyzed by using methods such as correlation matrices, k-means clustering and PCA to find the most important features and cluster the data. Finally, a LSTM or statistical model can be used to predict whether the features correlate to class 1 or 9.” [0172] … Clause 4. The method of clause 1, wherein the state changes is associated with at least one state comprises at least one of: [0184] an event state associated with the events of interest, [0185] a null state associated with no events, or [0186] a noise state associated with events not of interest.”] … wherein the states of stateful LSTM predictors are reset based on class transient location derived using class transition neural network. [Cox identifies state changes of HMM. Cox has 3 discrete states which means that the states are reset after each transition. The states are discussed in terms of HMMs but Cox also teaches the interchangeability of HMMs and LSTMs as shown above “… The computing system may receive a signal data signature of time-varying data, the time-varying data having an event of interest and segment the signal data signature to isolate the event of interest by utilizing a first Hidden Markov model (HMM) configured to segment the signal data signature into at least one segment of the time-varying data by identifying state changes indicative of events of interest and where the at least one segment of the time-varying data has a first length. The computing system may use a second HMM configured to segment the at least one segment into a sub-segment of the time-varying data by identifying state changes within the at least one segment.” Abstract. “[0127] In some embodiments, the segmentation engine may employ a Hidden Markov Model (HMM) as the base model in which to train the segmentation process. A SDS can be modeled as a Markov process due to the nature of a signal changing states over time. For example, there may be three states that can model a given SDS being input into the system: Instance state for signal data of interest, Silence state for no signal data or negligible signal data, and Noise state for signal data having parasitic information. In some embodiments, other states may be defined to model aspects of the SDS. In some embodiments, the Hidden Markov Model may predict changes in these states based on features of the signal. 
For example, if there is 5 seconds of silence, then a user provides an input data (e.g., a forced cough vocalization, cough, sneeze, forced breath vocalization, breath sounds, heartbeat sounds, heart rate data, or other input signal data for any suitable time-series data), the model may predict the probability of a state change from silence to an instance of signal data of interest at the 5 second mark of the SDS.”] Casey and Cox pertain to audio classification/splitting and training of models for audio classification and it would have been obvious to substitute the LSTM of Cox for HMMs of Casey considering that Cox expressly states that these models are interchangeable. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Casey discusses transition matrices but does not elaborate on transition models. Cox also discusses detecting transitions using HMMs but does not use LSTM for transition detection: “[0128] In some embodiments, the HMM may provide the best results for cases when there are clear transitions between each of the three states. A single HMM architecture may have less accuracy when one state blends into another, or there is a change between two states that is subtle enough to not be detected by a single HMM. Due to features being extracted in set time windows, there is a lack of precision when states very quickly change. An example of may be when there is a rapid sequence of peaks one after another. The states in which one peak ends, there is very brief silence, and another starts might occur within the same window, forcing the model to only predict one state for all 3 changes.” Renner is focused on detecting transitions and teaches: training another class transition neural network based on a class transition tagged database; [Renner trains a NN, that could be an LSTM, to detect transitions in the audio segments. “[0039] Transition detector neural network 206 can be trained using a training data set. The training data set can include a sequence of media content that is annotated with information specifying which frames of the sequence of media content include transitions between different content segments. Because of a data imbalance between classes of the transition detector neural network 206 (there may be far more frames that are considered non-transitions than transitions), the ground truth transitions frames can be expanded to be transition “neighborhoods”. For instance, for every ground truth transition frame, the two frames on either side can also labeled as transitions within the training data set. In some cases, some of the ground truth data can be slightly noisy and not temporally exact. Advantageously, the use of transition neighborhoods can help smooth such temporal noise.” “[0040] Training transition detector neural network 206 can involve learning neural network weights that cause transition detector neural network 206 to provide a desired output for a desired input (e.g., correctly classify audio features and video features as being indicative of a transition from a program segment to an advertisement segment).” “[0035] The configuration and structure of transition detector neural network 206 can vary depending on the desired implementation. As one example, transition detector neural network 206 can include a recurrent neural network. 
For instance, transition detector neural network 206 can include a recurrent neural network having a sequence processing model, such as stacked bidirectional long short-term memory (LSTM). As another example, transition detector neural network 206 can include a seq2seq model having a transformer-based architecture (e.g., a Bidirectional Encoder Representations from Transformers (BERT)).”] for each new audio frame input, inputting, into the at least two independent LSTM and the class transition neural network, a plurality of audio frame features; [Renner, Figure 2, “Audio Feature Extractor 200” receive a “sequence of media content” /frames and generates “Audio Features” that are input o the “Transition Detector Neural Network 206.” “[0029] … As such, the linear sequence of media content can include a sequence of frames, or images, and corresponding audio data representing program segments and/or advertisement segments….” Figure 3, left side: “audio spectrogram frame sequence” as input to “audio feature extraction” which yields the Features Fs,t1 etc. for frames St1, St2, …. ] determining position of a possible audio class transition by using the said class transition neural network over a slice consisting of a plurality of consecutive audio frame features; [Renner, Figure 3, the Probability of Transition at each time t1, t2, t3 … “P(transition ti|St1 …tN, Vt1 … tN) where S are frames of audio and V are frames of video is the output at the top of “transition detector 300.” “[0050] FIG. 3 is a conceptual illustration of an example transition detector neural network 300. As shown in FIG. 3, transition detector neural network 300 is a recurrent neural network having audio feature extraction layers 302, video feature extraction layers 304, and classification layers 306. Audio feature extraction layers 302 include one or more convolution layers and are configured to receive as input a sequence of audio features (e.g., audio spectrograms) and output computation results….” “[0051] Classification layers 306 receives concatenated features for a sequence of frames, and outputs, for each frame, a probability indicative of whether the frame is transition between different content segments. Classification layers 306 include bidirectional LSTM layers and fully convolutional neural network (FCN) layers. The probabilities determined by classification layers 306 are a function of hidden weights of the FCN layers, which can be learned during training.” See also Figure 4, 410 and Figure 2, “Transition Data” as output.] [Renner, … performing a final classification of the incoming audio signal into the at least 3 audio classes at a decision time resolution higher than the slice duration; [Renner as shown in Figure 3 outputs the classification/prediction of transition for each frame that is input whereas the “slice duration” is generally longer than one frame and is shown as 4 consecutive frames in Figure 3.] Casey/Cox and Renner pertain to audio classification and handling of the transitions between classes and it would have been obvious to replace the mentions to transitions of the combination with the more rigorous transition handling of Renner. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Regarding Claim 2, Casey teaches: 2. 
The method of claim 1, wherein the second audio class is a speech class and the third audio class is a music class. [Casey teaches classification into speech and music classes: “[0003] To date, very little work has been done on characterizing environmental and ambient sounds. Most prior art acoustic signal representation methods have focused on human speech and music. …” “[0009] In yet another application, sound representations could be used to index audio media including a wide range of sound phenomena including environmental sounds, background noises, sound effects (Foley sounds), animal sounds, speech, non-speech utterances and music. …”] Regarding Claim 3, Casey teaches: 3. The method of claim 1, wherein audio classified as third audio class is further classified into two separate audio classes resulting in resulting in a 3-stage hierarchical classifier and classification into 4 audio classes. [Casey: Figure 7 shows 3 levels of classification: 1-Pets; 2- Dogs and Cats; 3- Bark or Howl and Meow or Purr and 4 audio classes of Bark, Howl, Meow, and Purr.] Regarding Claim 9, Casey teaches: 9. The method as claimed of claim 1, wherein the incoming audio signal is in 44100 Hz sample rate, 16 bit-depth, mono channel PCM WAVE format. [Casey: “[0086] The base features are derived from an audio spectrum envelope extraction process as described above. The audio spectrum projection descriptor is a container for dimension-reduced features that are obtained by projection of a spectrum envelope against a set of basis functions, also described above. For example, the audio spectrum envelope is extracted by a sliding window FFT analysis, with a resampling to logarithmic spaced frequncy bands. In the preferred embodiment, the analysis frame period is 10 ms. However, a sliding extraction window of 30 ms duration is used with a Hamming window. The 30 ms interval is chosen to provide enough spectral resolution to roughly resolve the 62.5 Hz-wide first channel of an octave-band spectrum. The size of the FFT analysis window is the next-larger power-of-two number of samples. This means for 30 ms at 32 kHz there are 960 samples but the FFT would be performed on 1024 samples, For 30 ms at 44.1 KHz, there are 1323 samples but the FFT would be performed on 2048 samples with out-of-window samples set to 0.”] Cox teaches: wherein the incoming audio signal is in 44100 Hz sample rate, 16 bit-depth, mono channel PCM WAVE format. [Cox pertains to audio classification based on extracted features: “[0070] In some embodiments, the feature vector 206 may be used as an input to train machine-learning model(s) 113 which result in an ensemble of n classifiers 207. The ensemble of n classifiers is used to define the natural boundaries 114 in the training dataset.” “[0074] At step 401, one or more feature extraction components may ingest an unprocessed audio file. In some embodiments, the audio file may include any suitable format and/or sample rate and/or bit depth. For example, the sample rate may include, e.g., 8 kilohertz (kHz), 11 kHz, 16 kHz, 22 kHz, 44.1 kHz, 48 kHz, 88.2 kHz, 96 kHz, 176.4 kHz, 192 kHz, 352.8 kHz, 384 khz, or other suitable sample rate. For example, the bit depth may include, e.g., 16 bits, 24 bits, 32 bits, or other suitable bit depth. 
For example, the format of the audio file may include, e.g., stereo or mono audio, and/or e.g., waveform audio fie format (WAV), MP3, Windows media audio (WMA), MIDI, Ogg, pulse code modulation (PCM), audio file format (AIFF), advanced audio coding (AAC), free lossless audio codec (FLAC), Apple lossless audio codec (ALAC), or other suitable file format or any combination thereof. In some embodiments, an example embodiment that balances detail with memory and resource efficiency and availability and compatibility with commonly available equipment, the audio file may include a 48 kHz mono WAV file.” “[0066] In some embodiments, pre-processing 202 may include Stereo to Mono Compatibility which may include combining two channels of stereo information into one single mono representation. The stereo-to-mono filter may ensure that only a single perspective of the signal is being considered or analyzed at one time.”] Rationale as provided for Claim 1. Regarding Claim 16, Casey uses HMMs as its classifiers. Cox teaches: 16. The method of claim 1, wherein the method uses a two-stage hierarchical binary classifier with two independent Long Short-Term Memory (LSTM) networks. [Cox teaches a hierarchical classifier with several stages and teaches that LSTMs can be used instead of HMMs. “ [0129] In some embodiments, the problem of detecting rapid sequences of peaks may be overcome by a layered HMM architecture. An SDS is first segmented using a first layer HMM with a relatively large window size, allowing for generalizability over an entire signal. The resulting segments may then be provided to a second layer HMI with a much smaller window size than the first layer HMM, allowing for greater precision. …” [0130] In some embodiments, a mechanism to determine whether the segments from the first layer HMM are to be sent to the second layer HMM may include a duration filter. If a segment from the first layer HMI is greater than a predefined duration, it may be likely that segment includes more than a single instance of the signal data of interest. Thus, that segment may sent to the second layer HMM for fine-tuning.” “[0102] … To analyze the MFCCs, a machine learning model that is configured for time-series analysis may be employed such as, e.g., a recurrent neural network (RNN), a long short-term memory (LSTM), or other suitable machine learning model or any combination thereof.” “[0099] In some embodiments, prediction based on the extracted features (e.g., as described above with reference to FIGS. 6A and 6B) may include a suitable machine learning-based processing according to one or more machine learning models. In some embodiments, the machine learning model(s) may includ
Read full office action
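For context on the technology under examination: claim 1, quoted in the double-patenting mapping above, recites a two-stage hierarchical classifier built from independent stateful LSTM networks, plus a separate class-transition network whose output is used to reset the LSTM states, with classification decisions made at a finer time resolution than the 64-frame slice. The sketch below is a minimal PyTorch illustration of that architecture only; it is not the applicant's implementation, the networks are untrained, and the hidden sizes, threshold value, and module names are assumptions (the 24-feature and 64-frame figures come from claims 12 and 14, and the speech/music split from claim 2).

    import torch
    import torch.nn as nn

    FRAME_FEATURES = 24   # claims 12 and 14: 64-frame slices, 24 features per frame (stage 1)
    SLICE_FRAMES = 64

    class StageClassifier(nn.Module):
        """One stage of the hierarchy: an LSTM run frame by frame as a stateful predictor."""
        def __init__(self, n_features=FRAME_FEATURES, hidden=32, n_classes=2):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)
            self.state = None            # hidden/cell state carried across frames

        def reset(self):
            self.state = None            # reset at a detected class transition

        def step(self, frame):           # frame: tensor of shape (n_features,)
            out, self.state = self.lstm(frame.view(1, 1, -1), self.state)
            return self.head(out[:, -1]) # class logits for this single frame

    class TransitionDetector(nn.Module):
        """Class-transition network: scores each frame of a slice for a likely transition."""
        def __init__(self, n_features=FRAME_FEATURES, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)

        def forward(self, slice_feats):  # slice_feats: (SLICE_FRAMES, n_features)
            out, _ = self.lstm(slice_feats.unsqueeze(0))
            return torch.sigmoid(self.head(out)).squeeze()  # per-frame transition probability

    stage1 = StageClassifier()           # stage 1: background noise vs. intelligible audio
    stage2 = StageClassifier()           # stage 2: speech vs. music (claim 2)
    transition = TransitionDetector()

    @torch.no_grad()
    def classify_stream(frames, threshold=0.5):
        """Per-frame hierarchical decisions (finer than the 64-frame slice); states reset
        whenever the transition network flags a likely class change within the slice."""
        labels, buffer = [], []
        for f in frames:
            buffer.append(f)
            if len(buffer) == SLICE_FRAMES:          # transition detector runs once per slice
                if transition(torch.stack(buffer)).max() > threshold:
                    stage1.reset()                   # "states of stateful LSTM predictors are
                    stage2.reset()                   #  reset based on class transient location"
                buffer.clear()
            if stage1.step(f).argmax().item() == 0:    # stage 1 decision
                labels.append("background_noise")
            elif stage2.step(f).argmax().item() == 0:  # stage 2 decision
                labels.append("speech")
            else:
                labels.append("music")
        return labels

    # Example with random features standing in for the extracted audio-frame features.
    print(classify_stream([torch.randn(FRAME_FEATURES) for _ in range(128)])[:8])

Running each stage one frame at a time while carrying its hidden state is what the claims call a stateful predictor; the transition detector only looks at whole slices, which is why the claim language distinguishes the per-frame decision resolution from the slice duration.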

Prosecution Timeline

Jan 19, 2024: Application Filed
Oct 14, 2025: Non-Final Rejection — §103, §112, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603099
SELF-ADJUSTING ASSISTANT LLMS ENABLING ROBUST INTERACTION WITH BUSINESS LLMS
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12579482
Schema-Guided Response Generation
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12572737
GENERATIVE THOUGHT STARTERS
Granted Mar 10, 2026 (2y 5m to grant)
Patent 12537013
AUDIO-VISUAL SPEECH RECOGNITION CONTROL FOR WEARABLE DEVICES
Granted Jan 27, 2026 (2y 5m to grant)
Patent 12492008
Cockpit Voice Recorder Decoder
Granted Dec 09, 2025 (2y 5m to grant)
Study what changed to get these applications past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 99% (+31.0%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 547 resolved cases by this examiner. Grant probability derived from career allow rate.
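As a sanity check, the headline projections appear to follow by simple arithmetic from the examiner statistics quoted above; the multiplicative reading of the interview lift is an assumption about how the tool derives the 99% figure, not a documented formula:

    granted, resolved, total = 414, 547, 578
    allow_rate = granted / resolved                      # 0.757 -> shown as 76%
    interview_lift = 0.31                                # "+31.0% interview lift"
    with_interview = allow_rate * (1 + interview_lift)   # 0.991 -> shown as 99%
    pending = total - resolved                           # 31 currently pending
    print(round(100 * allow_rate), round(100 * with_interview), pending)  # 76 99 31

An additive reading (76% + 31%) would exceed 100%, so the multiplicative one is the interpretation consistent with the displayed 99%.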
