Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claim Rejections - 35 USC § 112
1. The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 13 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 13 recites the limitation "the comparing" in line 1. There is insufficient antecedent basis for this limitation in the claim.
Claim Rejections - 35 USC § 103
2. Claims 1-2, 4-8, 13, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over submitted prior art Horling (US 2018/0330589) in view of Calvo et al. (CN 110168570 A) and Pronovost (US Patent 12,535,780).
As to claim 1, Horling teaches a method comprising:
receiving, by a control system, sensor data from each of a plurality of sensors in an environment, the plurality of sensors corresponding to a plurality of devices in the environment, the sensor data including microphone data (Fig. 2, voice-activated devices 180, [0007] – microphone and other sensors are enabled, [0039] – voice-activated devices can leverage the sensors and outputs of interconnected devices, [0071] – voice-activated device 180 includes at least one or more microphones, a speaker, a processor, and memory storing at least one program for execution by the processor);
for each device:
producing, by the control system (voice assistance server 212 processes audio inputs collected by voice-activated devices 180, which can leverage the sensors, and one or more content hosts 21 ([0082]));
producing, by the control system (the home assistance device obtains (806Z) one or more monitoring criteria; the home assistance device obtains the monitoring criteria from a server system (164) or from a local database (device data 350) ([0172])); and
controlling, by the control system, the operation of at least one device of the plurality of devices in the environment based, at least in part, on the one or more output analytics, wherein the controlling involves controlling at least one of a loudspeaker operation or a microphone operation ([0178] – control by comparing the monitored audio with the monitoring criteria; [0039] – voice-activated devices can leverage the sensors and outputs of interconnected devices; [0071] – voice-activated device 180 includes at least one or more microphones, a speaker, a processor, and memory storing at least one program for execution by the processor).
Horling does not explicitly teach an embedding vector, inputting, by the control system, the device-wise context vectors into a machine learning model, wherein the machine learning model includes an attention mechanism; predicting, by the machine learning model, one or more output analytics.
Calvo teaches that a user computing device obtains sensor data from a plurality of sensors, the sensor data indicating one or more measured parameters of the physical environment of the sensors, the sensors including an audio sensor (e.g., a microphone); the computing device inputs the sensor data from the plurality of sensors into a machine-learned virtual sensor model and receives, as an output of the virtual sensor model, a virtual sensor output vector (abstract and throughout the CN). A sensor output prediction model has been trained to receive sensor data from a plurality of sensors, the sensor data from each sensor indicating one or more measured parameters of the physical environment of the sensor; in response to the received sensor data from the plurality of sensors, the sensor output prediction model has been trained to output one or more predictions of future sensor output (at least summary of the invention, 2nd paragraph). The sensor output prediction model 252 can output a plurality of sensor output prediction vectors simultaneously (e.g., when the future time information 260 identifies a plurality of different future times) and/or iteratively (e.g., each iteration provides a new sensor data vector 254-258 as input, and the sensor output prediction model 252 outputs a new sensor output prediction vector 264) (Exemplary Method, 9th paragraph above).
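For illustration only (hypothetical code, not drawn from Calvo), the iterative prediction scheme described above, in which each iteration feeds a new sensor data vector to the model and collects a new sensor output prediction vector, might be sketched as follows; the toy decay model is an assumed stand-in for the trained prediction model:

```python
def predict_next(sensor_vector, decay=0.9):
    # toy stand-in for a trained sensor output prediction model:
    # each predicted reading decays toward zero
    return [decay * x for x in sensor_vector]

current = [1.0, 2.0]   # e.g., two sensor readings at the current time
predictions = []
for _ in range(3):     # iteratively predict three future time steps
    current = predict_next(current)
    predictions.append(current)
```

Each pass reuses the previous prediction as the next input, matching the iterative mode Calvo describes alongside the simultaneous multi-time-step mode.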
Pronovost teaches a transformer-based machine learning model comprising an encoder for determining an input embedding (e.g., a vector) that represents characteristics of an input token, where the attention mechanism of the transformer-based ML model may determine how important or relevant another (or the same) input token is to that token (col. 2, lines 33-38); in a transformer-based ML architecture, input embeddings may be associated with position embeddings; as an example, input tokens may represent physical objects in an environment, patches in a top-down representation of a scene, groups of data points in sensor data, pixels in an image, and the like, and position embeddings corresponding to the input embeddings may represent spatial positions of the input tokens from which the input embeddings are generated (col. 3, lines 21-29); and an attention mechanism of the transformer-based machine-learned model (col. 34, lines 37-40; claims 5 & 18).
It would have been obvious before the effective filing date of the claimed invention to incorporate the teachings of Calvo and Pronovost into the teachings of Horling for the purpose of predicting sensor output from multiple sensors using deep machine learning and the attention mechanism of a transformer-based machine learning model.
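For illustration only (this sketch is not taken from any cited reference), the query/key/value attention computation that Pronovost describes, in which scores are dot products of a query with keys, scaled and softmax-normalized into weights over value vectors, might be sketched as follows; all names and toy vectors are hypothetical:

```python
import math

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, keys, values):
    # score each key against the query, scaled by sqrt(dimension)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # weighted sum of the value vectors
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# toy example: the query matches the first key more closely,
# so the output leans toward the first value vector
out = scaled_dot_product_attention([1.0, 0.0],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
```

In a full transformer the query, key, and value vectors would be produced by multiplying input embeddings with learned weight matrices, as the citation from Pronovost (col. 2, lines 51-58) notes.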
As to claim 2, Horling teaches the method of claim 1 wherein the controlling involves controlling one or more of an automatic speech recognition process, an acoustic scene analysis process, a talker identification process, or a sound event classification process (Fig. 8, 804; abstract; [0019-0020, 0178, 0182] – switching from the assistant mode to operating in the monitoring mode to obtain a classification of the sound and, based on the sound having a first sound classification, emitting a first simulated occupant response of a plurality of simulated occupant responses via one or more speakers).
As to claim 4, Calvo teaches the method of claim 1 wherein one or more aspects of the method is implemented via a trained neural network (summary of the invention, 2nd – 7th paragraphs – a machine-learned sensor output prediction model has been trained to receive sensor data from a plurality of sensors).
As to claim 5, Pronovost teaches the method of claim 4 wherein the trained neural network comprises a trained attention-based neural network (col. 2, lines 51-58 - Standard attention mechanisms, including self-attention and cross-attention, may be applied at the encoder, decoder, or both, in a transformer-based ML architecture; in a transformer-based ML architecture, computation of attention is expressed in terms of query, key, and value vectors, where the vectors are obtained by a vector-to-matrix multiplication of input embeddings with weight matrices which may be learned during a training phase).
As to claim 6, Pronovost teaches the method of claim 1, wherein producing the device-wise context vector involves integrating each of a plurality of input embedding vectors corresponding to at least one multi-sensor device (col. 2, lines 51-58 - Standard attention mechanisms, including self-attention and cross-attention, may be applied at the encoder, decoder, or both, in a transformer-based ML architecture; in a transformer-based ML architecture, computation of attention is expressed in terms of query, key, and value vectors, where the vectors are obtained by a vector-to-matrix multiplication of input embeddings with weight matrices which may be learned during a training phase). And Calvo teaches that the sensor output prediction model 252 can output a plurality of sensor output prediction vectors simultaneously (e.g., when the future time information 260 identifies a plurality of different future times) and/or iteratively (e.g., each iteration provides a new sensor data vector 254-258 as input, and the sensor output prediction model 252 outputs a new sensor output prediction vector 264) (Exemplary Method, 9th paragraph above).
As to claim 7, Pronovost teaches the method of claim 6, wherein the control system is configured to implement a multi-channel neural context encoder for integrating each of the plurality of input embedding vectors (col. 2, lines 51-58 - Standard attention mechanisms, including self-attention and cross-attention, may be applied at the encoder, decoder, or both, in a transformer-based ML architecture; in a transformer-based ML architecture, computation of attention is expressed in terms of query, key, and value vectors, where the vectors are obtained by a vector-to-matrix multiplication of input embeddings with weight matrices which may be learned during a training phase).
As to claim 8, Pronovost teaches the method of claim 7, wherein the multi-channel neural context encoder comprises a trained attention based neural network (col. 2, lines 51-58 - Standard attention mechanisms, including self-attention and cross-attention, may be applied at the encoder, decoder, or both, in a transformer-based ML architecture. In examples of a transformer-based ML architecture, computation of attention is expressed in terms of query, key, and value vectors, where the vectors are obtained by a vector-to-matrix multiplication of input embeddings with weight matrices which may be learned during a training phase).
As to claim 13, Pronovost teaches the method of claim 1, wherein the comparing is performed by a multi-device context module that comprises one or more attention-based neural networks (col. 2, lines 51-58 - Standard attention mechanisms, including self-attention and cross-attention, may be applied at the encoder, decoder, or both, in a transformer-based ML architecture; in a transformer-based ML architecture, computation of attention is expressed in terms of query, key, and value vectors, where the vectors are obtained by a vector-to-matrix multiplication of input embeddings with weight matrices which may be learned during a training phase); and Calvo teaches that, in determining the loss function, ground truth sensor data of the second portion (e.g., a remaining portion) is compared with the virtual sensor output vector that the virtual sensor model attempts to predict (summary, 12th paragraph), and that data from at least some sensors of the plurality of sensors is more recently detected than data from the others, such that a virtual sensor output can be updated based on the most recently detected sensor output, using learned correlations and other relations among the plurality of sensors to predict updates for the other sensor output values (summary, 10th paragraph).
Claim 16 is rejected for the same reasons discussed above with respect to claim 1. Furthermore, Horling teaches one or more processors and memory coupled to the one or more processors ([0021-0022, 0169], claims 1 and 19).
Claim 18 is rejected for the same reasons discussed above with respect to claim 1. Furthermore, Horling teaches one or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to implement the method of claim 1 ([0022, 0095, 0127, 0169]).
3. Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over submitted prior art Horling, Calvo and Pronovost in view of Gautam et al. (2022/0414381).
As to claim 3, Horling teaches generating one or more current output analytics tokens ([0178] – obtaining a classification of the sound; the output analytics token corresponds to the output of the classification of the sound according to the monitoring criteria); and Pronovost teaches a transformer-based machine learning model comprising an encoder for determining an input embedding (e.g., a vector) that represents characteristics of an input token, where the attention mechanism of the transformer-based ML model may determine how important or relevant another (or the same) input token is to that token (col. 2, lines 33-38), and the embedding may be a high-dimensional vector or tensor that represents the object data in the embedding space (col. 14, lines 8-10). Horling, Calvo, and Pronovost do not explicitly discuss obtaining, by the control system, one or more prior analytics output tokens within the length of a context window. However, Horling teaches obtaining input sound for some length of time ([0011]) and receipt of voice inputs for at least a predefined amount of time ([0085]).
Gautam teaches improving upon prior techniques that only analyze video for emotions or low-level audio-visual features, through a number of optimizations. For example, the audio recommendation system is a neural network architecture that can process spatial and temporal video features simultaneously and generate an audio vector (or audio embedding) based on them, which can then be used to retrieve matching background tracks based on closest Euclidean distance. The audio recommendation system further trains an audio reasoning module using backpropagation to minimize the Euclidean distance between actual (ground truth) audio vectors and generated audio vectors and to maximize the distance between generated audio vectors and the audio vectors of unrelated audio sequences via triplet loss ([0021]).
It would have been obvious before the effective filing date of the claimed invention to incorporate the teachings of Gautam into the teachings of Horling, Calvo, and Pronovost for the purpose of maximizing the distance between generated audio vectors and the audio vectors of unrelated audio sequences via triplet loss.
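For illustration only (hypothetical code, not from Gautam), the triplet-loss criterion described in Gautam's paragraph [0021], which penalizes a generated (anchor) audio vector for being far from its ground-truth (positive) vector or too close to an unrelated (negative) vector, can be sketched as:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # loss is zero once the negative is farther from the anchor
    # than the positive by at least the margin
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Backpropagating this loss, as Gautam describes, pulls generated audio vectors toward their ground-truth vectors while pushing them away from unrelated ones.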
4. Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over submitted prior art Horling, Calvo, and Pronovost in view of Li (CN 118197420 A).
As to claim 14, Horling, Calvo, and Pronovost do not explicitly discuss the method of claim 13, wherein the multi-device context module is configured to implement at least one of a scaled dot product attention process or a multi-head attention process.
Li teaches that the self-attention (SA) mechanism uses some parts of a data sample to predict other parts of the sample, and calculates a weighted sum of the input feature vectors in dot-product form. Multi-head attention (MSA) is a core component capable of separating the input into a plurality of small parts, then calculating the scaled dot product of each part in parallel, and finally concatenating all the attention outputs to obtain the final result (under 7. B network processing).
It would have been obvious before the effective filing date of the claimed invention to incorporate the teachings of Li into the teachings of Horling, Calvo, and Pronovost for the purpose of allowing the model to learn sequences and location information in different representations of subspaces.
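For illustration only (hypothetical code, not from Li), the multi-head scheme Li describes, separating the input into parts, computing a scaled dot product for each part, and concatenating the attention outputs, might be sketched as follows; all names and toy vectors are assumptions:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sdp_attention(q, keys, values):
    # scaled dot-product attention over one head's slice of the vectors
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def multi_head_attention(q, keys, values, heads):
    # split the input into `heads` parts, attend per part, concatenate results
    d = len(q)
    head_dim = d // heads
    out = []
    for h in range(heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        out.extend(sdp_attention(q[s], [k[s] for k in keys], [v[s] for v in values]))
    return out
```

In practice each head would also apply its own learned projection matrices; the sketch shows only the split/attend/concatenate structure described in the citation.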
5. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over submitted prior art Horling, Calvo, Pronovost, and Gautam in view of Ku et al. (CN 108352157 A).
As to claim 15, Horling, Calvo, Pronovost, and Gautam do not explicitly discuss the method of claim 3, wherein the one or more output analytics tokens comprise one or more prior analytics output tokens corresponding to an active noise cancellation process.
Ku teaches that one or more estimated instantaneous phase values can be generated during operation of the adaptive filter, independent of any prior model of the secondary path; the one or more estimated instantaneous phase values can be produced using an unsupervised learning process; a control signal may be generated based on the output of the adaptive filter, wherein the control signal causes generation of an anti-noise signal configured to reduce the influence of a noise signal, which can be generated by a vehicle engine; a first plurality of values can further be updated based on an error signal, the error signal generated based on residual noise, and the residual noise produced by the anti-noise signal at least partially canceling the noise signal; the active noise cancellation system may include one or more acoustic transducers for generating the anti-noise signal and one or more microphones for sensing the residual noise caused by the anti-noise signal at least partially canceling the noise signal; the transfer function can be represented as a matrix, wherein a given element of the matrix indicates the secondary path between a particular acoustic transducer of the one or more acoustic transducers and a particular microphone of the one or more microphones (summary of the invention, 5th paragraph).
It would have been obvious before the effective filing date of the claimed invention to incorporate the teachings of Ku into the teachings of Horling, Calvo, Pronovost, and Gautam for the purpose of sensing residual noise caused by the anti-noise signal at least partially canceling the noise signal.
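For illustration only (hypothetical code, not from Ku), the adaptive-filter behavior described above, in which the filter is updated from an error signal representing residual noise until the anti-noise estimate cancels the noise, can be sketched with a single-weight least-mean-squares (LMS) loop; the signals and step size are toy assumptions:

```python
mu = 0.2       # adaptation step size (assumed)
w = 0.0        # single adaptive filter weight, initially untrained
reference = [1.0, -0.5, 0.8, 0.3, -1.0]   # toy noise-correlated reference signal

for step in range(200):
    x = reference[step % len(reference)]
    d = x              # toy "noise at the error microphone" (equals the reference)
    y = w * x          # anti-noise estimate from the adaptive filter
    e = d - y          # residual noise after cancellation
    w += mu * e * x    # LMS weight update driven by the error signal
```

As the weight converges, the residual error shrinks, mirroring how the sensed residual noise drives the adaptive filter updates in Ku's active noise cancellation system.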
Allowable Subject Matter
6. Claim 9 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim (claim 1) and any intervening claims (claim 6). Claims 10-12 are objected to because they depend on objected-to claim 9.
Response to Arguments
7. Applicant’s arguments with respect to claims 1-16 and 18 have been considered but are moot because of the new grounds of rejection.
Conclusion
8. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
9. Any inquiry concerning this communication or earlier communications from the examiner should be directed to QUYNH H NGUYEN whose telephone number is (571)272-7489. The examiner can normally be reached Monday-Thursday 7:30AM-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ahmad Matar can be reached on 571-272-7488. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/QUYNH H NGUYEN/Primary Examiner, Art Unit 2693