Last updated: May 29, 2026

Application No. 18/892,272

SPEECH DIALOG SYSTEM AND RECIPROCITY ENFORCED NEURAL RELATIVE TRANSFER FUNCTION ESTIMATOR

Non-Final OA §103

Filed: Sep 20, 2024
Priority: Jun 30, 2022 (continuation of 17/855,554, issued as US 12,142,293)
Examiner: AGAHI, DARIOUSH
Art Unit: 2656
Tech Center: 2600 (Communications)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (85%), with a +30.2% interview lift. A written response may suffice.

Based on 174 resolved cases, 2023–2026

Examiner Intelligence

AGAHI, DARIOUSH

Career Allowance Rate: 85% (above average)
148 granted / 174 resolved
+23.1% vs TC avg

Interview Lift: +30.2% (strong), measured on resolved cases without vs. with an interview

Typical timeline
Avg Prosecution: 2y 7m
Currently pending: 16

Career history
Total Applications: 195 (across all art units)

Statute-Specific Performance

§101: 7.5% (-32.5% vs TC avg)
§103: 89.9% (+49.9% vs TC avg)
§102: 1.1% (-38.9% vs TC avg)
§112: 0.8% (-39.2% vs TC avg)

Black line = Tech Center average estimate • Based on career data from 174 resolved cases
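The four deltas above can be cross-checked against one another. The following is a minimal sketch, assuming each "vs TC avg" figure is a simple percentage-point difference (the page does not say how the deltas are computed); the variable names are illustrative only.

```python
# Statute-specific rates and deltas as listed above (percent, percentage points).
# Assumption: "vs TC avg" means examiner rate minus Tech Center average.
stats = {
    "101": (7.5, -32.5),
    "103": (89.9, 49.9),
    "102": (1.1, -38.9),
    "112": (0.8, -39.2),
}

# Implied Tech Center baseline = examiner rate minus the stated delta.
implied_baseline = {
    statute: round(rate - delta, 1) for statute, (rate, delta) in stats.items()
}

for statute, baseline in implied_baseline.items():
    print(f"Section {statute}: implied TC average = {baseline}%")
```

Under that assumption every row implies the same 40.0% baseline, which is consistent with the chart note describing the Tech Center figure as a single estimate.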

Office Action

§103

DETAILED ACTION
This office action is in response to Applicant’s submission filed on 9/20/2024. This is a CON case based on Application 17855554 (issued as US12142293). Claims 1-20 are pending in the application of which Claims 1, 9, and 17 are independent and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Specification
The specification is objected to because of the following informalities:
Par. 0075, line 3 recites: “… If a source if placed directly in front of the array, ....”. The second “if” (underlined in the original action) should be changed to “is”.
Appropriate correction is required.





Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 4, 9, 12, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella et al. (US9886948B1) (herein "Garimella"), and in further view of Lu et al. (US20230124006A1) (herein "Lu").

Regarding claims 1, 9, and 17 Garimella teaches [A system comprising: a processor; and a memory storing instructions that control the processor to perform operations of: - claim 1], [A method comprising: - claim 9], and [A computer-readable storage device storing instructions that upon execution by a processor perform operations of: - claim 17] (Garimella, Col. 7, ll. 4-11:” … a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives, of a computing system, such as a server computing device 500 of the speech processing system 100, as shown in FIG. 5. When the process 200 is initiated, the executable program instructions can be loaded into memory, such as RAM, and executed by one or more processors of the computing system.”, and Col. 11, ll. 60-67:” The memory 510 may include computer program instructions that the processing unit 502 executes in order to implement one or more embodiments. The memory 510 generally includes volatile memory, such as RAM, and/or other non-transitory computer-readable media. The memory 510 can store an operating system 512 that provides computer program instructions for use by the processing unit 502 in the general administration and operation of the computing device 500.”)
receiving multichannel speech data; (Garimella, Col. 4, ll. 66-67:” The extracted feature streams shown in FIG. 1 may correspond to streams from different channels, different microphones, etc.”, and Col. 7, ll. 12-13:” … the system executing the process 200 can receive input, such as audio input of a user utterance, …”).
processing the spectral embedding vector and one or more of the multichannel speech data into a Neural Transfer Function (NTF); and (Garimella, Col. 4, line 66 – Col. 5, line 8:” The extracted feature streams shown in FIG. 1 may correspond to streams from different channels, different microphones, etc. As one example, the extracted feature streams shown in FIG. 1 may correspond to a stream 122 of LFBE feature vectors and a stream 124 of i-vector feature vectors. Illustratively, feature vectors in stream 122 may be LFBE feature vectors of a current or most recent utterance, and may be extracted from an audio signal and provided to a neural network [NTF] 114 in a real time or substantially real time manner to perform speech recognition.”)
recognizing an utterance using the spectral embedding vector and the NTF. (Garimella, Col. 6, ll. 7-13:” The ASR module 110 can further process output from the neural network to determine a word or subword unit for a portion of audio input. The ASR module 110 can then process a set of words or subword units (such as a lattice or n-best list of candidate transcriptions) using a language model 116 to generate speech recognition results, such as a most likely transcription of the utterance in the audio signal.“)
Garimella does not teach, however, Lu teaches encoding, from the multichannel speech data, a reference channel into a spectral embedding vector; (Lu, Par. 0004:” … determine spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation of the audio data, the spectral embeddings including a first frequency class token (FCT); determine each vector of a second FCT by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; …”).
Lu is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella further in view of Lu to encode, from the multichannel speech data, a reference channel into a spectral embedding vector. Motivation to do so would improve performance of the transformer-based neural network model by reducing the number of parameters (Lu, Par. 0039).

Regarding claims 4, and 12, Garimella, as modified above, further teaches wherein the NTF does not include a relative transfer function (RTF) criterion and a residual. (Garimella, Col. 4, line 66 – Col. 5, line 8:” The extracted feature streams shown in FIG. 1 may correspond to streams from different channels, different microphones, etc. As one example, the extracted feature streams shown in FIG. 1 may correspond to a stream 122 of LFBE feature vectors and a stream 124 of i-vector feature vectors. Illustratively, feature vectors in stream 122 may be LFBE feature vectors of a current or most recent utterance, and may be extracted from an audio signal and provided to a neural network [NTF] 114 in a real time or substantially real time manner to perform speech recognition.”) Note: a neural transfer function is a non-linear squashing function for network activation, while a relative transfer function is a ratio of system responses between sensors/transducer/microphones, etc.
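The examiner's note contrasts two senses of "transfer function". The following is a minimal sketch of that distinction as the note states it, assuming the logistic sigmoid as the squashing function and made-up single-frequency microphone responses. All names and values here are illustrative and are not drawn from Garimella or the application.

```python
import cmath
import math


def sigmoid(x: float) -> float:
    """Logistic squashing function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relative_transfer_function(h_ref: complex, h_other: complex) -> complex:
    """Ratio of two microphones' frequency responses at one frequency bin."""
    return h_other / h_ref


# Hypothetical single-bin acoustic responses from one source to two microphones.
h_mic1 = cmath.rect(1.0, 0.0)           # reference mic: unit gain, zero phase
h_mic2 = cmath.rect(0.5, -math.pi / 4)  # second mic: attenuated, phase-shifted

rtf_1_to_2 = relative_transfer_function(h_mic1, h_mic2)

print(sigmoid(0.0))                              # activation applied to a scalar
print(abs(rtf_1_to_2), cmath.phase(rtf_1_to_2))  # gain and phase between mics
```

The first quantity is a per-unit nonlinearity inside a network; the second is a complex ratio describing how one sensor's signal relates to another's, which is the sense the note attributes to a relative transfer function.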


Claims 2, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella, and Lu, and in further view of Matheja et al. (US20200380967A1) (herein "Matheja").

Regarding claims 2, 10, and 18, Garimella, as modified above, teaches the system, the method, and the storage device of claims 1, 9, and 17 respectively.
Garimella, as modified above, does not teach, however, Matheja teaches wherein the multichannel speech data is received from a first microphone in a first zone and a second microphone in a second zone, (Matheja, Par. 0016:” … a speech dialog system that includes a first microphone, a second microphone, a processor and a memory. The first microphone captures first audio from a first spatial zone, and produces a first audio signal. The second microphone that captures second audio from a second spatial zone, and produces a second audio signal.”)
wherein the instructions further control the processor to perform operations of: receiving zone activity information in at least one of the first zone or the second zone; and (Matheja, Par. 0018:” detects, from the first audio signal and the second audio signal, speech activity in at least one of the first spatial zone or the second spatial zone, thus yielding processed audio; …”).
identifying from which of the first zone or the second zone the recognized utterance originated. (Matheja, Par. 0019:” determines from which of the first zone or the second zone the processed audio originated, thus yielding zone activity information;”).
Matheja is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella, as modified above, further in view of Matheja so that the multichannel speech data is received from a first microphone in a first zone and a second microphone in a second zone, and so that the instructions further control the processor to perform operations of: receiving zone activity information in at least one of the first zone or the second zone; and identifying from which of the first zone or the second zone the recognized utterance originated. Motivation to do so would be to perform a zone-dedicated speech dialog based on the recognized utterance and the zone decision (Matheja, Par. 0024).

Claims 3, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella, and Lu, and in further view of Eubank et al. (US20210329405A1) (herein "Eubank").

Regarding claims 3, and 11, Garimella, as modified above, teaches the system, and the method  of claims 1, and 9 respectively.
Garimella, as modified above, does not teach, however, Eubank teaches wherein the spectral embedding vector is processed by a trained neural spatial and residual encoder (NSRE) without explicit decoding. (Eubank, Par. 0040:” … can include a reverberation extractor [residual information] 310 that removes reverberant components from input audio signals to extract a direct component [spatial information]. The input audio signals can be generated by microphones in the physical environment, and processed into frequency domain audio signals. The extractor can remove the reverberant component from the audio signals, outputting a direct component. The direct component can be subtracted from the input audio signals by a subtractor 311 to extract the reverberant component. Like the input audio signals, the direct component and reverberant component can also be in the frequency domain. These can be fed as inputs to a trained neural network 312 (e.g., a convolutional neural network) which can then generate measured acoustic parameters based on the direct component and reverberant component. … In one aspect, the reverberation extractor can include a multi-channel dereverberator that performs linear dereverberation on each input processed audio signal to output a dereverberated direct component. In one aspect, the reverberation extractor can include a parametric multi-channel Wiener filter (PMWF) that applies the filter parameters to the input signals and outputs a dereverberated and de-noised direct component.”) Note: Per as filed specification, Par. 0076:” … the Neural Spatial and Residual Encoder (NSRE) 311 can be a single neural network that performs the spatial and residual information extraction of multi-channel networks. In an implementation, spatial and residual information extraction is for a collection of two channel networks, …”.
Eubank is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella, as modified above, further in view of Eubank so that the spectral embedding vector is processed by a trained neural spatial and residual encoder (NSRE) without explicit decoding. Motivation to do so would be to improve the processing efficiency and reduce unwanted artifacts (Eubank, Par. 0057).


Claims 5, 6, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella, and Lu, and in further view of Giri et al. (US20170094421) (herein "Giri").

Regarding claims 5, and 13, Garimella, as modified above, teaches the system, and the method  of claims 1, and 9 respectively.
Garimella, as modified above, does not teach, however, Giri teaches extracting relative transfer functions (RTFs) from the multichannel speech data. (Giri, Par.  0007:” The use of a dynamic Relative Transfer Function (RTF) between two or more microphones [multichannel speech data] may be useful in multi-microphone speech processing applications. …. The use of an efficient and fast dynamic RTF estimation algorithm using short burst of noisy, reverberant mic recordings, which will be robust to head movements (e.g., microphone positions) may provide more accurate RTFs which may lead to a significant performance increase.”)
Giri is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella, as modified above, further in view of Giri to extract relative transfer functions (RTFs) from the multichannel speech data. Motivation to do so would improve speech intelligibility and speech quality in the presence of environmental changes (Giri, Par. 0007).

Regarding claims 6, and 14, Garimella, as modified above, teaches the system, and the method  of claims 5, and 13 respectively.
Garimella, as modified above, does not teach, however, Giri further teaches wherein the RTFs map signals of different microphones to each other, the mappings satisfying reciprocity. (Giri, Par. 0007:” The use of a dynamic Relative Transfer Function (RTF) between two or more microphones may be useful in multi-microphone speech processing applications.”, and Par. 0010:” … System 100 includes a first transducer 102 and a second transducer 104, where each transducer converts an audio source into an audio signal. … Hearing device 106 may include transducers 102 and 104 within a common housing, such as two microphones within a pair of hearing aids or within a set of headphones. Hearing device 106 uses the received audio signals to determine an estimated Relative Transfer Function (RTF). To determine the RTF, the hearing device 106 iteratively determines a Relative Impulse Response (ReIR) point estimate until the ReIR point estimate converges, and then estimates the RTF based on the converged ReIR point estimate. .... the latest RTF estimate may be used in response to a packet drop or missing audio.”) Note: By definition the relative transfer function is defined between two (or more) microphones which establishes/maps signal between microphones, and as microphones are considered to be linear passive networks/components, the relationship between input and output is symmetric due to the principle of reciprocity.  
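The reciprocity point in the note can be illustrated numerically. This is a sketch under one reading of "reciprocity", namely that the mapping from one microphone to another is the inverse of the mapping in the opposite direction. The office action does not define the term, and the per-bin responses below are hypothetical.

```python
import cmath

# Hypothetical per-frequency-bin responses from one source to two microphones.
h_mic1 = [cmath.rect(1.0, 0.1 * k) for k in range(4)]
h_mic2 = [cmath.rect(0.8, -0.3 * k) for k in range(4)]

# RTF as a ratio of responses: maps the mic-1 signal onto the mic-2 signal.
rtf_12 = [b / a for a, b in zip(h_mic1, h_mic2)]
# The reverse mapping, from mic 2 back to mic 1.
rtf_21 = [a / b for a, b in zip(h_mic1, h_mic2)]

# Under the inverse-pair reading, the product is 1 in every bin.
reciprocity_error = max(abs(f * g - 1.0) for f, g in zip(rtf_12, rtf_21))
print(f"max |H12 * H21 - 1| = {reciprocity_error:.2e}")
```

RTFs defined as ratios satisfy this property by construction. An estimator that predicts each direction independently would not, so the constraint would have to be imposed separately in that case.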


 
Claims 7, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella, Lu, and Giri, and in further view of Kinoshita et al. (US5982903A) (herein "Kinoshita").

Regarding claims 7, 15, and 19, Garimella, as modified above, teaches the system, the method, and the storage device of claims 5, 13 and 17 respectively.
[claim 19 only] Garimella, as modified above, does not teach, however, Giri teaches extracting relative transfer functions (RTFs) from the multichannel speech data. (Giri, Par.  0007:” The use of a dynamic Relative Transfer Function (RTF) between two or more microphones [multichannel speech data] may be useful in multi-microphone speech processing applications. …. The use of an efficient and fast dynamic RTF estimation algorithm using short burst of noisy, reverberant mic recordings, which will be robust to head movements (e.g., microphone positions) may provide more accurate RTFs which may lead to a significant performance increase.”)
Giri is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella, as modified above, further in view of Giri to extract relative transfer functions (RTFs) from the multichannel speech data. Motivation to do so would improve speech intelligibility and speech quality in the presence of environmental changes (Giri, Par. 0007).
Garimella, as modified above, does not teach, however, Kinoshita teaches providing, using a distribution of coefficients of the RTFs, information about location of a sound source. (Kinoshita, Col. 14, ll. 29-37:” … The sound source 11 is placed at each of the 24 locations and the head related transfer functions ht(t) and hr(t) are measured for each subject. In the case of measuring the transfer functions st(t) and sr(t) according to Eqs. (3A') and (3b'), the output characteristic sp(t) of each sound source (loudspeaker) 11 should also be measured in advance. For instance, the numbers of coefficients composing the sound localization transfer functions st(t) and sr(t) are each set at 2048.”)
Kinoshita is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garimella, as modified above, further in view of Kinoshita to provide, using a distribution of coefficients of the RTFs, information about location of a sound source. Motivation to do so would allow reduction of the number of variables indicating characteristics dependent on the direction of the sound source (Kinoshita, Col. 7, ll. 16-18).


Claims 8, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Garimella, Lu, Giri, and Kinoshita, and in further view of Serackis et al. (US20230067081A1) (herein "Serackis").

Regarding claims 8, 16, and 20, Garimella, as modified above, teaches the system, the method, and the storage device of claims 7, 15 and 19 respectively.
Garimella, as modified above, does not teach, however, Serackis teaches training a machine learning system to learn mapping between shape of the RTFs across channels and a relative direction of the sound source. (Serackis, Par. 0012:” … head tracking is used for selection of the desired HRTF, which is related to head orientation with respect to the sound source, the more common tracking solution is based on inertial sensor signal analysis. Here the accelerometer signal or fusion of accelerometer and gyroscope signals are used to estimate head orientation and motion. A pre-estimated bank of HRTF is used to prepare signal filter coefficients for audio signal processing. The HRTF bank stores transfer functions, pre-estimated for discrete head orientation (azimuth and elevation) angle step. … The head orientation is estimated using face landmarks tracking in 3D instead of using motion sensor signals. The head orientation is estimated from 3D marker cloud. The markers are the face landmarks, estimated and tracked using trained machine learning models. … the required HRFT is predicted by a digital twin, which is based on a flexible approximator, based on machine learning models, trained to approximate input-output mapping. Here in this application, the inputs are the 3D marker cloud which includes face landmark 3D coordinates. The outputs are the coefficients of the predicted desired HRTF function.”)
Serackis is considered to be analogous to the claimed invention because it is in the same field of endeavor. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Garimella, as modified above, with Serackis. As implied in Serackis, one of ordinary skill would have been motivated to combine the teachings because it would accurately simulate how sound behaves in a 3D environment relative to a listener's ears.

 


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Jang et al. (US20240161372A1) teaches in Par. 0077:” The speaker encoder 510 may display spectrograms corresponding to various speech data and embedding vectors corresponding to the spectrograms in a vector space. The speaker encoder 510 may input a first spectrogram generated from the speech data of the deceased person to the trained artificial neural network model and output an embedding vector of speech data most similar to the speech data of the deceased person in the vector space as the speaker embedding vector. That is, the trained artificial neural network model may receive the first spectrogram as an input and generate the embedding vector matching a specific point in the vector space.”
Examiner's Note: Examiner has cited particular columns and line numbers and/or paragraph numbers in the references applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested from the applicant in preparing responses, to fully consider the references in entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
In the case of amending the Claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification which dictate(s) the structure relied on for proper interpretation and also to verify and ascertain the metes and bounds of the claimed invention.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






DARIOUSH AGAHI, P.E.
Primary Examiner

/DARIOUSH AGAHI/Primary Examiner, Art Unit 2656


Prosecution Timeline

Sep 20, 2024: Application Filed
Apr 15, 2026: Non-Final Rejection mailed (§103)
May 26, 2026: Interview Requested

Precedent Cases

Applications granted by this same examiner with similar technology

18/102,483 (Patent 12639515): ELECTRONIC APPARATUS RECOMMENDING CONTENT-BASED SEARCH TERMS AND CONTROL METHOD THEREOF. 3y 4m to grant; granted May 26, 2026.

18/395,319 (Patent 12639526): METHODS AND APPARATUS TO SELF-GUARDRAIL LARGE LANGUAGE MODEL RESPONSES. 2y 5m to grant; granted May 26, 2026.

18/442,982 (Patent 12639512): SYSTEMS AND METHODS FOR SEEDED NEURAL TOPIC MODELING. 2y 3m to grant; granted May 26, 2026.

18/497,721 (Patent 12639361): ISSUE HANDLING USING UNSUPERVISED MACHINE LEARNING. 2y 6m to grant; granted May 26, 2026.

18/628,373 (Patent 12609134): ONSET ZONE DETECTION USING COHERENT FOCUSING SUMMATION OVER MULTIPLE GEOMETRIC POSITIONS. 2y 0m to grant; granted Apr 21, 2026.

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 85%
With Interview (+30.2%): 99%
Median Time to Grant: 2y 7m (~11m remaining)
PTA Risk: Low

Based on 174 resolved cases by this examiner. Grant probability derived from career allowance rate.