DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on March 9, 2026 has been entered.
Response to Amendment
This Office Action has been issued in response to Applicant’s Communication of amended application S/N 17/006,011 filed on March 9, 2026. Claims 1 to 27 are currently pending in the application.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 to 27 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Claims 1, 12, and 19 recite detecting events, determining possible events, updating confidence levels, and determining context data.
The limitation of detecting events, which specifically recites “detecting, based on audio data received from one or more sensors associated with one or more location labels, an audio event”, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting “by a computing device”, nothing in the claim element precludes the steps from practically being performed in a human mind. For example, but for the “by a computing device” language, “detecting”, in the context of this claim, encompasses the user realizing that an event might have happened just by listening to an audio recording. The limitation of determining events, which specifically recites “determining, based on one or more features extracted from the audio data, a plurality of possible audio events”, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting “by a machine learning algorithm”, nothing in the claim element precludes the steps from practically being performed in a human mind. For example, but for the “by a machine learning algorithm” language, “determining”, in the context of this claim, encompasses the user mentally determining possible events that might correspond to the audio recording in the previous step.
The limitation of determining events further recites “determining, based on a relationship between the one or more location labels and the plurality of possible audio events, a most-likely audio event of the plurality of possible audio events”, which is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, nothing in the claim element precludes the steps from practically being performed in a human mind. For example, “determining”, in the context of this claim, encompasses the user mentally, and with or without the aid of pen and paper, determining or selecting an event from the plurality of possible events based on a relationship between the location and the events, and designating it as a most-likely audio event. The limitation of determining events in claim 12 further recites the determination of confidence levels associated with the plurality of possible audio events, and further determining updated confidence levels based on relationships between the plurality of possible audio events and the location, which also covers performance of the limitation in the mind, but for the recitation of generic computer components, since nothing in the claim element precludes the steps from practically being performed in a human mind, as with the previously described determining steps. The same rationale applies to the determining steps recited in claim 19. If a claim limitation, under its broadest reasonable interpretation, covers mental processes but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claims recite the additional elements – “sending a notification associated with the audio event”, “receiving, by a computing device, from one or more sensors associated with one or more location labels, audio data”. The limitation “sending a notification associated with the audio event” represents insignificant extra-solution activity because it is a mere nominal addition to the claim, a generic transmission of collected and analyzed data (See MPEP 2106.05(g)). The limitation “receiving, by a computing device, from one or more sensors associated with one or more location labels, audio data” amounts to a data-gathering step, which is considered to be insignificant extra-solution activity (See MPEP 2106.05(g)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The insignificant extra-solution activity identified above, which includes the data gathering and transmission steps, is recognized by the courts as well-understood, routine, and conventional activity when it is claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i) Receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)). The claims are not patent eligible.
Claims 5 and 7 are dependent on claim 1 and include all the limitations of claim 1. Therefore, claims 5 and 7 recite the same abstract idea of claim 1. The claims recite the additional limitations of “receiving, from at least one of: a Red Green Blue Depth (RGB-D) device, a Light Detection and Ranging (LIDAR) device, or a Radio Detection and Ranging (RADAR) device, or distance data”, and “sending context data and a confidence level”, respectively, which amount to data gathering and data transmission steps, which are considered to be insignificant extra-solution activity (See MPEP 2106.05(g)), and which are recognized by the courts as well-understood, routine, and conventional activities when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i) Receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); (v) Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93)). The limitation of determining is a process that can be performed mentally. Therefore, the limitations do not amount to significantly more than the abstract idea.
Claims 2 to 4, 6, and 8 to 11 are dependent on claim 1 and include all the limitations of claim 1. Claim 27 is dependent on claim 19 and includes all the limitations of claim 19. Therefore, claims 2 to 4, 6, and 8 to 11, and 27 recite the same abstract ideas of claims 1 and 19, respectively. The claims recite additional “determining” limitations, which can be performed in the human mind and, hence, merely elaborate on the abstract idea. Therefore, these limitations do not amount to significantly more.
Claim 20 is dependent on claim 19 and includes all the limitations of claim 19. Therefore, claim 20 recites the same abstract idea of claim 19. The claim recites the additional limitations of “determining that the context data is relevant to the audio event and wherein the context data comprises information indicative of at least one of: an identity of an object, a type of activity, distance information, scene classification, time information, a time of the audio event, a volume of the audio event, or a historical trend”, where the determining limitation further elaborates on the abstract idea, since it can be performed in the human mind, and where the limitation of “wherein the context data comprises information indicative of…” ties the abstract idea to a field of use by further specifying the target data, and is simply an attempt to limit the application of the abstract idea to a particular technological environment; merely indicating a field of use or technological environment in which to apply the judicial exception does not meaningfully limit the claim (See MPEP 2106.05(h)).
Claim 21 is dependent on claim 19 and includes all the limitations of claim 19. Therefore, claim 21 recites the same abstract idea of claim 19. The claim recites the additional limitation of “determining the updated confidence level is based on a machine learning algorithm”, which is recited at a high level of generality, and amounts to no more than mere instructions to apply the exception using generic computer components, because it does no more than invoke computers or other machinery merely as a tool to perform an existing process. Additional elements that invoke computers, computer components, or other machinery in their ordinary capacity, merely as a tool, or that simply add a general-purpose computer or computer components after the fact to an abstract idea, do not integrate a judicial exception into a practical application nor provide significantly more.
Claims 25 and 26 are dependent on claim 19 and include all the limitations of claim 19. Therefore, claims 25 and 26 recite the same abstract idea of claim 19. The claims recite the additional limitations of “the audio data and the video data are each associated with a scene sensed by the one or more sensors”, and “the location is indicated by at least one of: GPS coordinates, a geographical region, or a location label”, respectively, which tie the abstract idea to a field of use by further specifying the target data, and are simply an attempt to limit the application of the abstract idea to a particular technological environment; merely indicating a field of use or technological environment in which to apply the judicial exception does not meaningfully limit the claim (See MPEP 2106.05(h)).
Additionally, the claims do not include a requirement of anything other than conventional, generic computer technology for executing the abstract idea, and therefore, do not amount to significantly more than the abstract idea.
The same rationale applies to claims 13 to 18 and 22 to 24, since they recite similar limitations.
Claims 1 to 27 are therefore not drawn to eligible subject matter as they are directed to an abstract idea without significantly more.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 4, and 6 to 27 are rejected under 35 U.S.C. 103 as being unpatentable over Mitchell et al. (U.S. Patent No. 10,878,840) hereinafter Mitchell, and further in view of SUBRAMANIAN et al. (U.S. Publication No. 2018/0225939) hereinafter Subramanian.
As to claim 1:
Mitchell discloses:
A method comprising:
detecting, by a computing device, based on audio data received from one or more sensors, an audio event [Column 2, lines 4 to 7 teach obtaining a sequence of frames of audio sound data from an audio signal, captured by a sound capturing device such as a microphone; Column 18, lines 13 to 15 teach capturing audio data, and processing the captured audio data];
determining, based on one or more features extracted from the audio data by a machine learning algorithm, a plurality of possible audio events [Column 14, lines 2 to 5 teach using a Deep Neural Network (DNN) to extract a number of acoustic features for a frame, and lines 35 to 37 teach classifying the acoustic features of the audio to classify the frame into sound classes];
determining, based on a relationship between the location and the plurality of possible audio events, a most-likely audio event of the plurality of possible audio events [Column 3, lines 32 to 41 teaches adjusting the sound class scores for multiple frames of the sequence of frames based on one or more of knowledge about a sound environment in which the audio data was captured, therefore, based on a relationship between the possible audio events and the location; Column 7, lines 17 to 25 teach a confidence value for each of a plurality of values for properties associated with the sound, where the property can be an environment, including restaurant, train station, shop, etc.; Column 8, lines 42 to 51 teach the property is associated with an environment in which the audio data was captured, which may indicate a type of environment, i.e., city, countryside, train station, kitchen, or a location such as outside, indoors, at home, office, etc.; Column 10, lines 28 to 45 teach processing the adjusted sound class scores to recognize a sound event, to generate a sound class decision for a frame, which can output a single sound class decision, and which is an indication that the frame is associated with the sound class, in other words, that the sound event that is represented by the sound class decision has occurred]; and
sending a notification associated with the audio event [Column 10, lines 60 to 64 teach generating an indication of a recognized sound event, and communicating the indication to a user device].
Mitchell does not appear to expressly disclose sensors associated with one or more location labels; a relationship between the one or more location labels and the plurality of possible audio events.
Subramanian discloses:
sensors associated with one or more location labels [Paragraph 0039 teaches sensors are labelled for identification of location within the premises; Paragraph 0043 teaches sound frames are coupled with location data];
a relationship between the one or more location labels and the plurality of possible audio events [Paragraph 0047 teaches rule set that define an association between sound signal attributes and location, time of the day, etc., where rules can be applied to a location, i.e., a playground and a specific time range, to determine the probability of occurrence of a sound event, therefore, determining possible audio events based on a relationship between the location labels and the audio events].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Mitchell, by incorporating sensors associated with one or more location labels; a relationship between the one or more location labels and the plurality of possible audio events, as taught by Subramanian [Paragraphs 0039, 0043, 0047], because the applications are directed to improvements in monitoring systems and event detection; having rules defining relationships between location labels and audio events enables the consideration of location and context in surveillance sound systems, in order to improve recognition (See Subramanian Paras [0007], [0008]).
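For illustration only, and not as a characterization of any actual implementation of Mitchell or Subramanian (neither reference discloses source code, and the class names, location labels, and weights below are hypothetical), the mechanism relied upon above — classifier-generated sound class scores reweighted by a rule set keyed to a sensor's location label in order to select a most-likely audio event — may be sketched in Python as follows:

# Illustrative sketch only; values and names are hypothetical and are not
# drawn from Mitchell, Subramanian, or the claimed invention.

# Classifier output for one audio frame: a score per sound class
# (e.g., as produced by a DNN operating on extracted acoustic features).
frame_scores = {"baby_cry": 0.62, "dog_bark": 0.55, "glass_break": 0.48}

# Rule set associating a location label with per-class weights
# (cf. rules tying sound attributes to a location and time of day).
location_rules = {
    "nursery": {"baby_cry": 1.3, "dog_bark": 0.7, "glass_break": 1.0},
    "garage":  {"baby_cry": 0.6, "dog_bark": 1.2, "glass_break": 1.1},
}

def most_likely_event(scores, location_label, rules):
    """Reweight classifier scores using the rules for the sensor's
    location label and return the highest-scoring (most-likely) event."""
    weights = rules.get(location_label, {})
    adjusted = {cls: s * weights.get(cls, 1.0) for cls, s in scores.items()}
    event = max(adjusted, key=adjusted.get)
    return event, adjusted[event]

event, confidence = most_likely_event(frame_scores, "nursery", location_rules)
print(event, round(confidence, 2))  # baby_cry 0.81

In this sketch the location label only rescales scores already produced by the classifier, consistent with the score-adjustment reading of Mitchell combined with the label-based rules of Subramanian.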
As to claim 2:
Mitchell discloses:
determining a confidence level associated with the most-likely audio event [Column 11, lines 14 to 21 teach recognizing at least one of a non-verbal sound event, by determining sound classes and corresponding sound class scores, which are representative of a degree of affiliation of the frame with a sound class of a plurality of sound classes; Column 18, lines 22 to 27 teach determining a set of sound classes for the frame, by classifying the frame and determining a sound class score for each of the sound classes, where the score indicates that the frame represents the sound class].
As to claim 4:
Mitchell discloses:
determining context data associated with the audio event, wherein the context data comprises information indicative of at least one of: an identity of an object, a type of activity, distance information, scene classification, time information, a time of the audio event, a volume of the audio event, or a historical trend [Column 4, lines 39 to 47 teach a property may be related to the audio frame, and may relate to external factors that cannot be derived directly from the frame of audio data, where examples of such properties may be a number of occupants in house when the frame of audio data was captured, or a time of day when the frame of audio data was captured].
As to claim 6:
Mitchell discloses:
determining at least one of: a time corresponding to the audio event, a likelihood that the audio event corresponds to a sound associated with the one or more location labels, or a long term context [Column 4, lines 39 to 47 teach a property may be related to the audio frame, and may relate to external factors that cannot be derived directly from the frame of audio data, where examples of such properties may be a number of occupants in house when the frame of audio data was captured, or a time of day when the frame of audio data was captured].
As to claim 7:
Mitchell discloses:
sending the notification associated with the audio event comprises sending context data and a confidence level [Column 10, lines 16 to 22 teach using the adjusted class scores to generate indications of a level of likelihood that the frame is affiliated with at least one of a non-verbal sound event and a scene, i.e., to control a visual indicator, where different adjusted sound class scores could correspond to red, amber, or green display].
As to claim 8:
Mitchell discloses:
determining at least one of: a source of audio associated with the audio data [Column 2, lines 34 to 38 teach sound classes can be representative of, indicators of, or associated with, non-verbal sound events or scenes, for example a sound class may be “baby crying”, “dog barking” or “female speaking”].
As to claim 9:
Mitchell discloses:
determining one or more confidence levels associated with the plurality of possible audio events [Column 3, lines 22 to 24 teach the score represents the level of affiliation that a frame has with a sound class; Column 6, lines 47 to 65 teach determining a confidence level for each sound class].
As to claim 10:
Mitchell discloses:
determining an updated confidence level, wherein determining the updated confidence level comprises increasing, based on a change in context data, the confidence level [Column 5, lines 1 to 5 teach behavior of the (adjusted) sound class scores can be dynamically adapted to the property changes in the environment; Column 15, lines 9 to 12 teach reweighting one or more scores, where for example, for sound recognition in busy homes, the scores for any sound class related to speech events and/or scenes would be weighted up].
As to claim 11:
Mitchell discloses:
determining the updated confidence level comprises one or more of: decreasing, based on context data not being indicative of an object that originates a sound corresponding to the audio event, the confidence level; and increasing, based on the context data being indicative of an object that originates a sound corresponding to the audio event, the confidence level [Column 15, lines 7 to 15 teach reweighting the sound class scores using probabilities of sound event or scene occurrence, based on location or environment, where sound classes related to speech events would be weighted up for sounds recognized in busy homes, and where sounds recognized in unoccupied homes, the scores for any sound classes related to speech would be weighted down; Column 17, lines 57 to 62 teach a sequence of “baby cry” sound class decisions can be discarded if the sequence of “baby cry” sound class decisions are collectively shorter than 116 milliseconds (which is approximately equivalent to 10 frames)].
The same rationale applies to claim 18 since it recites similar limitations.
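For illustration only (the adjustment factors and context fields below are hypothetical and are not drawn from Mitchell), the confidence adjustment addressed in the rejections of claims 10 and 11 above — increasing a confidence level when context data indicates an object that originates the sound, and decreasing it otherwise — may be sketched as:

# Illustrative sketch only; the factors and context fields are hypothetical.

def update_confidence(confidence, context_objects, originating_object,
                      up=1.25, down=0.75):
    """Increase the confidence level when the context data indicates an
    object that could originate the sound; decrease it otherwise."""
    if originating_object in context_objects:
        return min(1.0, confidence * up)
    return confidence * down

# A "dog_bark" event is more credible when a dog appears in the context data.
print(update_confidence(0.6, {"dog", "person"}, "dog"))  # 0.75
print(update_confidence(0.6, {"person"}, "dog"))         # 0.45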
As to claim 12:
Mitchell discloses:
A method comprising:
receiving, by a computing device, from one or more sensors, audio data [Column 18, lines 13 to 15 teach capturing audio data, and processing the captured audio data];
determining, based on one or more extracted features from the audio data by a machine learning algorithm, a plurality of possible audio events, and a plurality of confidence levels associated with the plurality of possible audio events [Column 11, lines 14 to 21 teach recognizing at least one of a non-verbal sound event, by determining sound classes and corresponding sound class scores, which are representative of a degree of affiliation of the frame with a sound class of a plurality of sound classes; Column 14, lines 2 to 5 teach using a Deep Neural Network (DNN) to extract a number of acoustic features for a frame, and lines 35 to 38 teach classifying the audio frames by determining a set of sound classes and a score that the frame represents the sound class, therefore, confidence levels associated with the possible audio events; Column 18, lines 22 to 27 teach determining a set of sound classes for the frame, by classifying the frame and determining a sound class score for each of the sound classes, where the score indicates that the frame represents the sound class, therefore, determining confidence levels associated with the plurality of possible audio events];
determining, based on a relationship between the plurality of possible audio events and the location, a plurality of updated confidence levels [Column 3, lines 32 to 41 teaches adjusting the sound class scores for multiple frames of the sequence of frames based on one or more of knowledge about a sound environment in which the audio data was captured, therefore, based on a relationship between the possible audio events and the location; Column 8, lines 42 to 51 teach the property is associated with an environment in which the audio data was captured, which may indicate a type of environment, i.e., city, countryside, train station, kitchen, or a location such as outside, indoors, at home, office, etc.; Column 15, lines 7 to 15 teach reweighting the sound class scores using probabilities of sound event or scene occurrence, based on location or environment, where sound classes related to speech events would be weighted up for sounds recognized in busy homes, and where sounds recognized in unoccupied homes, the scores for any sound classes related to speech would be weighted down, in other words, the scores associated with possible events are weighted up or down based on the environment on which they were recognized, therefore, the location or environment does influence the scores]; and
causing, based on an updated confidence level of the plurality of updated confidence levels satisfying a threshold, a notification associated with an audio event of the plurality of possible audio events to be sent [Column 10 lines 11 to 19 teaches adjusted sound class scores for the frames are an indication of a level of likelihood that the frame is associated with the sound event or scene; Column 16, lines 42 to 44 teaches scores may then be compared to thresholds to make class decisions for the frame, where there are thresholds for each class and score; Column 18, lines 37 to 39 teach in response to recognizing a sound event, output a communication to a user device].
Mitchell does not appear to expressly disclose sensors associated with one or more location labels; a relationship between the plurality of possible audio events and the one or more location labels.
Subramanian discloses:
sensors associated with one or more location labels [Paragraph 0039 teaches sensors are labelled for identification of location within the premises; Paragraph 0043 teaches sound frames are coupled with location data];
a relationship between the plurality of possible audio events and the one or more location labels [Paragraph 0047 teaches rule set that define an association between sound signal attributes and location, time of the day, etc., where rules can be applied to a location, i.e., a playground and a specific time range, to determine the probability of occurrence of a sound event, therefore, determining possible audio events based on a relationship between the location labels and the audio events].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Mitchell, by incorporating sensors associated with one or more location labels; a relationship between the plurality of possible audio events and the one or more location labels, as taught by Subramanian [Paragraphs 0039, 0043, 0047], because the applications are directed to improvements in monitoring systems and event detection; having rules defining relationships between location labels and audio events enables the consideration of location and context in surveillance sound systems, in order to improve recognition (See Subramanian Paras [0007], [0008]).
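For illustration only (the threshold values and the notification call below are hypothetical and are not drawn from the cited references), the final steps mapped for claim 12 above — comparing updated confidence levels to per-class thresholds and causing a notification to be sent when a threshold is satisfied — may be sketched as:

# Illustrative sketch only; thresholds and the notification mechanism are
# hypothetical and not drawn from the cited references.

THRESHOLDS = {"glass_break": 0.7, "baby_cry": 0.6}  # per-class thresholds

def notify_if_confident(updated_confidences, thresholds, send):
    """Send a notification for each possible audio event whose updated
    confidence level satisfies (meets or exceeds) its threshold."""
    for event, confidence in updated_confidences.items():
        if confidence >= thresholds.get(event, 1.0):
            send(f"Detected {event} (confidence {confidence:.2f})")

notify_if_confident({"glass_break": 0.82, "baby_cry": 0.41},
                    THRESHOLDS, send=print)  # only glass_break is reported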
As to claim 13:
Mitchell discloses:
detecting a change in context data associated with the plurality of possible audio events, wherein the context data comprises information indicative of at least one of: an identity of an object, a type of activity, distance information, scene classification, time information, a time of the plurality of possible audio events, a volume of the plurality of possible audio events, or a historical trend [Column 5, lines 1 to 5 teach behavior of the (adjusted) sound class scores can be dynamically adapted to the property changes in the environment; Column 4, lines 45 to 47 teach properties may be a number of occupants in house when the frame of audio data was captured, or a time of day when the frame of audio data was captured].
As to claim 14:
Mitchell discloses:
determining a most-likely audio event of the plurality of possible audio events [Column 10, lines 28 to 45 teach processing the adjusted sound class scores to recognize a sound event, to generate a sound class decision for a frame, which can output a single sound class decision, and which is an indication that the frame is associated with the sound class, in other words, that the sound event that is represented by the sound class decision has occurred].
As to claim 15:
Mitchell discloses:
determining at least one of: a likelihood that the plurality of possible audio events correspond to a sound associated with the one or more location labels, a long term context, or a source of audio associated with the audio data [Column 4, lines 39 to 47 teach a property may be related to the audio frame, and may relate to external factors that cannot be derived directly from the frame of audio data, where examples of such properties may be a number of occupants in house when the frame of audio data was captured, or a time of day when the frame of audio data was captured].
As to claim 16:
The combination of Mitchell and Subramanian discloses:
determining, based on the one or more location labels, the plurality of updated confidence levels comprises one or more of: decreasing a first confidence level of the plurality of confidence levels based on the relationship between the one or more location labels and a first audio event of the plurality of possible audio events or increasing a second confidence level based on the relationship between the one or more location labels and a second audio event of the plurality of possible audio events [Mitchell - Column 15, lines 7 to 15 teach reweighting the sound class scores using probabilities of sound event or scene occurrence, based on location or environment, where sound classes related to speech events would be weighted up for sounds recognized in busy homes, and where sounds recognized in unoccupied homes, the scores for any sound classes related to speech would be weighted down; Column 17, lines 57 to 62 teach a sequence of “baby cry” sound class decisions can be discarded if the sequence of “baby cry” sound class decisions are collectively shorter than 116 milliseconds (which is approximately equivalent to 10 frames); Subramanian - Paragraph 0039 teaches sensors are labelled for identification of location within the premises; Paragraph 0043 teaches sound frames are coupled with location data].
As to claim 17:
Mitchell discloses:
decreasing, based on a change in context data, a confidence level of the plurality of confidence levels [Column 15, lines 12 to 15 teach for sound recognition in unoccupied homes, the scores for any sound class related to speech events and/or scenes would be weighted down; Column 17, lines 57 to 62 teach a sequence of “baby cry” sound class decisions can be discarded if the sequence of “baby cry” sound class decisions are collectively shorter than 116 milliseconds (which is approximately equivalent to 10 frames)].
As to claim 19:
Mitchell discloses:
A method comprising:
determining, by a computing device, based on one or more features extracted from audio data received from one or more sensors, an audio event [Column 2, lines 4 to 7 teach obtaining a sequence of frames of audio sound data from an audio signal, captured by a sound capturing device such as a microphone; Column 18, lines 13 to 15 teach capturing audio data, and processing the captured audio data; Column 14, lines 2 to 5 teach using a Deep Neural Network (DNN) to extract a number of acoustic features for a frame, and lines 35 to 37 teach classifying the acoustic features of the audio to classify the frame into sound classes];
determining, based on the location, a confidence level associated with the audio event [Column 3, lines 32 to 41 teaches adjusting the sound class scores for multiple frames of the sequence of frames based on one or more of knowledge about a sound environment in which the audio data was captured, therefore, based on the location, a confidence level];
determining, based on the one or more features extracted from the audio data, context data associated with the audio data received from the one or more sensors [Claim 1, lines 60 to 64 teach a sound scene may be an environment characterized by a set of expected sounds or sound types, where the scene may be recognized by recognizing and processing a number of audio events, and which may be indicative of a particular context; Column 8, lines 65 to 67 teach a property may be associated with an audio feature of an environment in which the frame of audio data was captured];
determining, based on a relationship between the location and the context data, an updated confidence level [Column 7, lines 17 to 25 teach a confidence value for each of a plurality of values for properties associated with the sound, where the property can be an environment, including restaurant, train station, shop, etc.; Column 8, lines 42 to 51 teach the property is associated with an environment in which the audio data was captured, which may indicate a type of environment, i.e., city, countryside, train station, kitchen, or a location such as outside, indoors, at home, office, etc.; Column 10, lines 28 to 45 teach processing the adjusted sound class scores to recognize a sound event, to generate a sound class decision for a frame, which can output a single sound class decision, and which is an indication that the frame is associated with the sound class, in other words, that the sound event that is represented by the sound class decision has occurred; Column 15, lines 7 to 15 teach reweighting the sound class scores using probabilities of sound event or scene occurrence, based on location or environment, where sound classes related to speech events would be weighted up for sounds recognized in busy homes, and where sounds recognized in unoccupied homes, the scores for any sound classes related to speech would be weighted down];
determining, based on the updated confidence level satisfying a threshold, that the audio event is accurately classified [Column 10 lines 11 to 19 teaches adjusted sound class scores for the frames are an indication of a level of likelihood that the frame is associated with the sound event or scene; Column 16, lines 42 to 44 teaches scores may then be compared to thresholds to make class decisions for the frame, where there are thresholds for each class and score]; and
based on determining that the audio event is accurately classified, sending a notification of the audio event to a user device [Column 18, lines 37 to 39 teach in response to recognizing a sound event, output a communication to a user device].
Mitchell does not appear to expressly disclose sensors associated with one or more location labels; a relationship between the one or more location labels and the context data.
Subramanian discloses:
sensors associated with one or more location labels [Paragraph 0039 teaches sensors are labelled for identification of location within the premises; Paragraph 0043 teaches sound frames are coupled with location data];
a relationship between the one or more location labels and the context data [Paragraph 0034 teaches context builder provides the knowledge component which is the contextual information relevant at that point in time for the ambient environment to determine the event; Paragraph 0047 teaches rule set that define an association between sound signal attributes and location, time of the day, etc., where rules can be applied to a location, i.e., a playground and a specific time range, to determine the probability of occurrence of a sound event, therefore, determining possible audio events based on a relationship between the location labels and context data].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Mitchell, by incorporating sensors associated with one or more location labels; a relationship between the one or more location labels and the context data, as taught by Subramanian [Paragraphs 0034, 0039, 0043, 0047], because the applications are directed to improvements in monitoring systems and event detection; having rules defining relationships between location labels and audio events enables the consideration of location and context in surveillance sound systems, in order to improve recognition (See Subramanian Paras [0007], [0008]).
The same rationale applies to claims 20 and 22 to 24, since they recite limitations similar to those recited by the dependent claims above.
As to claim 21:
Mitchell discloses:
determining the updated confidence level is based on a machine learning algorithm [Column 3, lines 59 to 61 teach scores output by a DNN (or other machine learning model) may be adjusted based on some form of knowledge other than the frame of audio data].
As to claim 25:
Mitchell discloses:
the audio data is associated with a scene sensed by the one or more sensors [Column 3, line 30 teaches determining more than one sound event or scene associated with a frame].
As to claim 26:
Mitchell as modified by Subramanian discloses:
the one or more location labels are determined based on one or more of: GPS coordinates, a geographical region, or object recognition [Paragraph 0034 teaches location data may be gathered in absolute terms like GPS or could be inferred approximately based on the sounds transmitted from different sensors].
As to claim 27:
Mitchell discloses:
determining that the context data matches the audio event [Column 16, lines 43 to 45 teach when a baby cry score is above the threshold then the decision for that frame is baby cry, otherwise the decision is “not a baby”, therefore, determining that the context data matches the audio event].
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Mitchell et al. (U.S. Patent No. 10,878,840) hereinafter Mitchell, in view of SUBRAMANIAN et al. (U.S. Publication No. 2018/0225939) hereinafter Subramanian, and further in view of Amini et al. (U.S. Publication No. 2019/0289263) hereinafter Amini.
As to claim 3:
Mitchell discloses all the limitations as set forth in the rejections of claim 1 above, but does not appear to expressly disclose determining, based on object recognition of video data received from the one or more sensors, an object associated with the audio event.
Amini discloses:
determining, based on object recognition of video data received from the one or more sensors, an object associated with the audio event [Paragraph 0083 teaches content recognition process may apply computer vision techniques to detect physical objects captured in the content; Paragraph 0084 teaches classifying detected objects, by processing the video content to identify instances of various classes of physical objects occurring in the captured video of the surveilled environment].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Mitchell, by determining, based on object recognition of video data received from the one or more sensors, an object associated with the audio event, as taught by Amini [Paragraphs 0083, 0084], because the applications are directed to improvements in monitoring systems and event detection; performing object recognition of video from the sensors enables detecting events with higher accuracy and generating notifications automatically and in real-time, increasing the effectiveness of security systems (See Amini Paras [0025], [0026]).
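For illustration only (the detector interface and the object-to-sound mapping below are hypothetical and are not drawn from Amini), the limitation addressed above — using object recognition on video data to identify an object associated with a detected audio event — may be sketched as:

# Illustrative sketch only; the mapping and detector output are hypothetical.

# Hypothetical mapping from visually detected object classes to the audio
# events those objects could originate.
SOUND_SOURCES = {"dog": {"dog_bark"}, "person": {"speech"}, "window": {"glass_break"}}

def object_for_audio_event(detected_objects, audio_event, sources=SOUND_SOURCES):
    """Return the first visually detected object that could have
    originated the detected audio event, or None if there is no match."""
    for obj in detected_objects:
        if audio_event in sources.get(obj, set()):
            return obj
    return None

# detected_objects would come from an object detector run on the video data.
print(object_for_audio_event(["person", "dog"], "dog_bark"))  # dog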
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Mitchell et al. (U.S. Patent No. 10,878,840) hereinafter Mitchell, in view of SUBRAMANIAN et al. (U.S. Publication No. 2018/0225939) hereinafter Subramanian, and further in view of Gilmartin et al. (U.S. Publication No. 2018/0374325) hereinafter Gilmartin.
As to claim 5:
Mitchell discloses all the limitations as set forth in the rejections of claim 1 above, but does not appear to expressly disclose receiving, from at least one of: a Red Green Blue Depth (RGB-D) device, a Light Detection and Ranging (LIDAR) device, or a Radio Detection and Ranging (RADAR) device, or distance data.
Gilmartin discloses:
receiving, from at least one of: a Red Green Blue Depth (RGB-D) device, a Light Detection and Ranging (LIDAR) device, or a Radio Detection and Ranging (RADAR) device, or distance data [Paragraph 0015 teaches video camera may use an RGB sensor; Paragraph 0016 teaches the camera may further include a depth sensing camera based on LIDAR, to determine a distance to an object in the field of view of the camera].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Mitchell, by receiving, from at least one of: a Red Green Blue Depth (RGB-D) device, a Light Detection and Ranging (LIDAR) device, or a Radio Detection and Ranging (RADAR) device, or distance data, as taught by Gilmartin [Paragraphs 0015, 0016], because the applications are directed to improvements in monitoring systems and event detection; incorporating a LIDAR, RGB-D, or RADAR device to obtain distance information enables the generation of a complete 3D representation of the monitored environment, thereby improving the accuracy of event detection (See Gilmartin Para [0016]).
Response to Arguments
This is in response to arguments filed on March 9, 2026. Arguments have been carefully and fully considered.
Claim Rejections - 35 USC § 101
Applicants’ arguments have been carefully and respectfully considered, but are not persuasive.
In regards to claim 1, Applicant argues that “the claims do not recite a mental process”, and more specifically, that “Amended claim 1 recites "determining, based on one or more features extracted from the audio data by a machine learning algorithm, a plurality of possible audio events." This limitation cannot be performed in the human mind at least because machine learning algorithms require computational processing through trained models that operate on data in ways that are fundamentally beyond human mental capacity”.
In response to the preceding argument, Examiner respectfully disagrees, and respectfully submits that adding a “computer-aided” limitation, computer components, or a machine learning algorithm recited at a high level of generality without significantly more, to a claim covering an abstract concept, is insufficient to render a claim eligible where the claims are silent as to how the computer aids the method, the extent to which a computer aids the method, or the significance of the computer to the performance of the method, and amounts to merely saying “apply-it”. In order for a machine to add significantly more, it must “play a significant part in permitting the claimed method to be performed, rather than function solely as an obvious mechanism for permitting a solution to be achieved more quickly” (See, e.g., Versata Development Group v. SAP America, 793 F.3d 1306, 1335, 115 USPQ2d 1681, 1702 (Fed. Cir. 2015); See MPEP 2106.05(f)(II)(v) Requiring the use of software to tailor information and provide it to the user on a generic computer, Intellectual Ventures I LLC v. Capital One Bank (USA), 792 F.3d 1363, 1370-71, 115 USPQ2d 1636, 1642 (Fed. Cir. 2015)).
In regards to claim 1, Applicant argues that “amended claim 1 further recites that the audio data is received from “one or more sensors associated with one or more location labels.” This ties the claim to a specific technical implementation involving sensors with associated location metadata, not merely abstract data processing”.
In response to the preceding argument, Examiner respectfully disagrees, and respectfully submits that, as further explained in the rejections above, the “receive” limitation, as presently presented in the claims, is an additional limitation not part of the abstract idea, and is considered insignificant extra-solution activity. The fact that the audio data is received from “sensors associated with one or more location labels” does not change the fact that the limitation is directed to data gathering or transmitting steps, which are considered insignificant extra-solution activity.
As to claim 1, Applicant argues that “the claim recites "determining, based on a relationship between the one or more location labels and the plurality of possible audio events, a most-likely audio event of the plurality of possible audio events." This determination leverages the technical infrastructure of location-labeled sensors to refine the audio event classification. The relationship between location labels and possible audio events represents a technical correlation that is established and maintained through computational systems, not through human mental processes”.
In response to the preceding argument, Examiner respectfully disagrees, and respectfully points out that as presently presented, the determination limitations based on relationships between data, are in other words data comparisons that can be performed in the human mind, with the aid of pen and paper. Therefore, the limitations are part of the abstract idea.
In regards to claim 1, Applicant argues that “any alleged mental process is integrated into a practical application because the claims provide an improvement to technology”, more specifically, that “the Background of the instant application identifies the area for improvement reciting, "[a]s the amount of information being monitored via sensors has increased, the burden of generating pertinent notifications to users of a monitoring sensor has increased. Many different types of sounds may be sensed by the monitoring sensor. The quantity of different types of sounds being monitored increases the complexity of classifying captured audio at a high confidence level."”, thus, the present claims “provide a technical solution to the technical problem of properly identifying audio events in otherwise ambiguous audio data”.
In response to the preceding argument, Examiner respectfully disagrees, and respectfully submits that it is not clear, from the Applicant’s argument, what specific improvement in the functioning of a computer, or to another technology or technical field, is achieved by the claimed invention. Furthermore, it is also not apparent from the Applicant’s argument how such an improvement correlates with the claim language as presently presented. Based on Applicant’s arguments, it appears that the improvement is directed to improving the identification of audio events; however, it is not clear how improving audio event detection constitutes an improvement in the technology or the computer. It is important to keep in mind that an improvement in the judicial exception itself is not an improvement in technology. For example, providing an accurate classification of data even when the input data includes high amounts of data, or ambiguous audio data, may improve event detection and alerts, but does not improve computers or technology.
In regards to claim 1, Applicant argues that “These features are not merely generic recitations of computer hardware or steps. Instead, they are unconventional features, which when considered as a whole amount to "significantly more" than any alleged judicial exception”.
In response to the preceding argument, Examiner respectfully disagrees, and respectfully submits that it is not clear which unconventional steps or features recited in the claims amount to significantly more than the abstract idea. As further discussed in the 101 rejections above, the additional limitations as recited in the claims, as presently presented, are considered by the courts as insignificant extra-solution activity, and are not sufficient to amount to significantly more than the abstract idea. Furthermore, even when taken as a whole, the claims are directed to collection, analysis, and transmission of certain results of the collection and analysis, where the analysis is recited at a high level of generality such that it can be performed in the human mind. Additionally, the Applicant has not demonstrated that a special purpose machine is required to carry out the claimed invention. A special purpose processor involves more than a processor only broadly applying the abstract idea and/or performing conventional functions. Therefore, adding a “computer-aided” limitation to a claim covering an abstract concept, without significantly more, is insufficient to render a claim eligible where the claims are silent as to how the computer aids the method, the extent to which a computer aids the method, or the significance of the computer to the performance of the method. In order for a machine to add significantly more, it must “play a significant part in permitting the claimed method to be performed, rather than function solely as an obvious mechanism for permitting a solution to be achieved more quickly”. (See, e.g., Versata Development Group v. SAP America, 793 F.3d 1306, 1335, 115 USPQ2d 1681, 1702 (Fed. Cir. 2015); See MPEP 2106.05(f)(II)(v) Requiring the use of software to tailor information and provide it to the user on a generic computer, Intellectual Ventures I LLC v. Capital One Bank (USA), 792 F.3d 1363, 1370-71, 115 USPQ2d 1636, 1642 (Fed. Cir. 2015)). The claims are directed to an abstract idea without significantly more, under the “Mental Processes” grouping of abstract ideas, as further detailed in the rejections above. The rejections under 35 U.S.C. 101 are hereby maintained.
Claim Rejections - 35 USC § 102/103
Applicant’s arguments have been respectfully and carefully considered, but are moot in view of new grounds of rejections, as necessitated by the amendments.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAQUEL PEREZ-ARROYO whose telephone number is (571)272-8969. The examiner can normally be reached Monday - Friday, 8:00am - 5:30pm, Alt Friday, EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sherief Badawi can be reached at 571-272-9782. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RAQUEL PEREZ-ARROYO/Primary Examiner, Art Unit 2169