Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-17 are pending. Claims 1 and 11 are independent.
This Application was published as US 20230326478.
Apparent priority is 04/06/2022.
Examiner’s Note
Should the Applicant incorporate the complete concept illustrated in Figure 7 into the independent claims (1 and 11), the Examiner would consider allowing the application, subject to a further search.
Response to Amendments
Amendments to claim 4 overcome the objection.
Response to Arguments
Provisional Application
Arguments regarding the support provided by the provisional application are not persuasive. Applicant cites the Abstract; however, the Abstract merely describes prior art, not the instant invention. Further, the prior art discussed in the Abstract mentions isolating only the vocals or speakers specifically, not other sounds. Throughout the rest of the provisional application, the language and examples consistently apply specifically to people, e.g., spoken language or gender. While the term "source" is used, there is no indication that the invention described in the provisional application is able to extract anything other than a voice. However, for the purposes of compact prosecution, references prior to the date of the provisional application are relied upon.
35 USC 102/103
Applicant's arguments with respect to combining the digital representation with intermediate outputs of intermediate layers of the neural network have been fully considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant’s arguments regarding Sound Source Extraction are not persuasive. In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., identifying and extracting a target sound source) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Claim 1 merely requires extracting all sound signals corresponding to the source. The claim does not require the system to identify the source or extract the source itself.
In the examples given in Fig. 2 of Ochiai, extracting all telephone sounds reads on extracting all sounds from the telephone source. Further, there is no requirement that the identifier is determined by the system; if a user determined the identifiers of the source, the claims would still be read upon. For example, using Fig. 2 of Ochiai, the user can determine that the desired source is a person who is coughing and laughing, and choose these identifiers accordingly. The process is the same for both Ochiai's system and the instant invention: the user must adequately describe the desired source so that it is uniquely identified.
Therefore, the rejection is maintained.
Examiner still suggests that incorporating limitations (if supported by the original disclosure) such as “identify a sound source based on the digital representation, and execute a neural network trained to extract every sound signal originating from the identified sound source” would help clarify the distinction between Ochiai and the claimed invention.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1, 4-5, 7-11, 14-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai et al. (Listen to What You Want: Neural Network-based Universal Sound Selector) in view of Kong et al. (“Source Separation with Weakly Labelled Data: An Approach to Computational Auditory Scene Analysis”).
Regarding claim 1, Ochiai discloses:
A sound processing system to extract a target sound signal, ("a neural network-based AE sound selection approach, called Sound Selector, which directly extracts the desired AE sound from a mixture of AEs" Pg. 1, Section 1 – the desired AE sound is a source.)
the sound processing system comprising: at least one processor; and memory having instructions stored thereon that, when executed by the at least one processor, cause the sound processing system to: ("computational and memory costs" Pg. 3, Section 4.1 – implicit that processor and memory are part of the system.)
collect a mixture of sound signals; ("we created datasets of simulated sound event mixtures based on the Freesound Dataset Kaggle 2018 corpus (FSD) [6], which contains audio clips from 41 diverse AE classes, such as human sounds, object sounds, musical instruments, etc" Pg. 3, Section 4.2)
collect a query identifying the target sound source to be extracted from the mixture of sound signals, the query comprising one or more identifiers; extract from the query, each identifier of the one or more identifiers, ("user-specified target AE classes" Pg. 1, Section 2 - Figure 1 shows an example where the query is extracted to select "knock" and "telephone" as the target source.)
said each identifier being present in a predetermined set of one or more identifiers, ("For each mixture, three AE classes {n1, n2, n3} were pre-defined." Pg. 3, Section 4.2 – the AE classes are identifiers)
each identifier defining at least one of mutually inclusive and mutually exclusive characteristics of the target sound source; (As shown in Figs. 1 and 2 of Ochiai, the target sound source could be a sound signal which contains only laughter and cough sounds. In this example, both laughter sounds and cough sounds are characteristics of the target sound signal. Both can be present in the target sound signal; therefore, they are mutually inclusive characteristics of that signal. Additionally, mutually exclusive sounds could be selected, such as telephone and meow.)
determine any logical operators connecting the extracted one or more identifiers; transform the extracted one or more identifiers and any logical operators into a digital representation predetermined for querying the mixture of sound signals; ("This formalization corresponds to having a target-class vector o set to a n-hot vector, where the n elements that correspond to the target AE classes are 1 and the others are 0." Pg. 2, Section 2.2 - This is a logical AND between the identifiers. The target sound must meet each AE class.)
execute a neural network trained to extract all the sound signals corresponding to the target sound source that is identified by the digital representation, from the mixture of sound signals, by combining the digital representation with intermediate outputs of intermediate layers of the neural network processing the mixture of sound signals, wherein the neural network is trained with machine learning to extract different sound signals identified in a predetermined set of digital representations; ("In this paper, we propose a neural network-based AE sound selection approach, called Sound Selector, which directly extracts the desired AE sound from a mixture of AEs given a onehot vector representing the class of interest." Pg. 1, Section 1; Pg. 2, Section 2.3 further describes the training. Figure 1 further shows that the digital representation (o) is combined with intermediate outputs of intermediate layers. In the example of Fig. 2, all the sound signals corresponding to the selected target sound source are extracted.)
and output the extracted target sound source. ("we output a signal that consists of the sum of all the AEs from these classes" Pg. 1, Section 1)
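As an illustrative aside to the mapping above (not part of the record), the n-hot target-class vector o described by Ochiai can be sketched as follows; the class list and helper name here are hypothetical, whereas Ochiai's FSD setup uses 41 AE classes:

```python
# Sketch of Ochiai's n-hot target-class vector (illustrative only).
# AE_CLASSES is a hypothetical, abbreviated class vocabulary.
AE_CLASSES = ["knock", "telephone", "cough", "laughter", "meow"]

def make_target_vector(selected, classes=AE_CLASSES):
    """Return an n-hot vector: 1 for each user-specified target AE class, else 0."""
    return [1 if c in selected else 0 for c in classes]

# Selecting "knock" and "telephone" (the Fig. 1 example) yields a 2-hot vector.
o = make_target_vector({"knock", "telephone"})
```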
Ochiai does not explicitly disclose combining the digital representation with intermediate outputs of intermediate layers of the neural network. (Ochiai discloses combining it with the intermediate output of a single layer of the neural network.)
Kong discloses combining the digital representation with intermediate outputs of intermediate layers of the neural network. (“In addition to the anchor segment, a condition vector is used as an extra input to control what source to separate.” Pg. 3, para 1; “The condition vectors is mapped to embedding vectors by a learnable matrix. The embedding vectors are added to after each ReLU operation in all layers as a bias.” Pg. 4, para 2)
Ochiai and Kong are considered analogous art to the claimed invention because they disclose methods for source separation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Ochiai by adding the conditioning vector after each layer. Doing so would have been beneficial in order to control which sources to separate (Kong, pg. 4, para 2). This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
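For illustration only (not part of the claim mapping), the Kong-style conditioning relied on above, in which an embedding of the condition vector is added as a bias after each layer's ReLU, can be sketched as follows; the layer widths, weights, and embedding matrix are hypothetical:

```python
import numpy as np

def conditioned_forward(x, cond, weights, embed):
    """Pass x through linear + ReLU layers, adding an embedding of the
    condition vector after each ReLU as a bias (per Kong, pg. 4, para 2)."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)  # linear layer followed by ReLU
        x = x + embed @ cond        # condition embedding added as a bias
    return x

# Hypothetical shapes: three layers of width 8, a 5-class condition vector.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
embed = rng.standard_normal((8, 5)) * 0.1
cond = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # n-hot condition (two target classes)
y = conditioned_forward(rng.standard_normal(8), cond, weights, embed)
```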
Regarding claim 4, Ochiai discloses: The sound processing system of claim 1, wherein the one or more identifiers are combined using any of the determined logical operators to extract the target sound source having mutually inclusive and exclusive characteristics, wherein any of the determined logical operators comprise at least one of: NOT operator, AND operator, and OR operator, wherein NOT operator is used with any single identifier of the one or more identifiers. ("This formalization corresponds to having a target-class vector o set to a n-hot vector, where the n elements that correspond to the target AE classes are 1 and the others are 0." Pg. 2, Section 2.2 - This is a logical AND between the identifiers.)
Regarding claim 5, Ochiai discloses: The sound processing system of claim 1, wherein the neural network is trained using the predetermined set of digital representations of a plurality of combinations of identifiers in the predetermined set of one or more identifiers. ("We assume that a set of input and target features {y, o, {xn} N n=1} is available for training the model" Pg. 2, Section 2.3 - (o) represents the predefined vector of identifiers; “Mix 3-5 contain the AEs of three, four, or five classes.” Pg. 3, Section 4.1 – The mixes 3-5 have a different number of AEs (identifiers) and therefore have a plurality of combinations.)
Regarding claim 7, Ochiai discloses: The sound processing system of claim 1, wherein the digital representation is represented by at least one of: a one hot conditional vector, a multi-hot conditional vector, and text description. ("o is a one-hot vector" Page 2, Section 2.1).
Regarding claim 8, Ochiai discloses: The sound processing system of claim 1, wherein the intermediate layers of the neural network comprise one or more intertwined blocks, wherein each of the one or more intertwined blocks comprise at least one of: a feature encoder, a conditioning network, a separation network, and a feature decoder, wherein the conditioning network comprises a feature-invariant linear modulation (FiLM) layer that takes as an input the mixture of sound signals and the digital representation and modulates the input into the conditioning input, wherein the FiLM layer processes the conditioning input and sends the processed conditioning input to the separation network. ("an AE-class embedding layer generates target-class embedding c ∈ R D×1, which provides an encoded representation of the target AE class." Page 2, Figure 1; Section 2.1 - Figure 1 shows multiple blocks. Section 2.1 describes the blocks which include at least a feature encoder.)
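For context on the FiLM layer recited in claim 8, a generic feature-wise linear modulation (not Ochiai's architecture) predicts a scale and a shift from the conditioning input and applies them channel-wise; the linear maps and dimensions below are hypothetical:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return gamma * features + beta

# Hypothetical conditioning: gamma/beta derived from the digital
# representation (here a multi-hot vector) by hypothetical linear maps.
rng = np.random.default_rng(1)
o = np.array([1.0, 0.0, 1.0])  # digital representation (multi-hot)
Wg = rng.standard_normal((4, 3))
Wb = rng.standard_normal((4, 3))
gamma, beta = Wg @ o, Wb @ o
out = film(rng.standard_normal(4), gamma, beta)
```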
Regarding claim 9, Ochiai discloses: The sound processing system of claim 8, wherein the separation network comprises a convolution block layer that utilizes the conditioning input to separate the target sound source from the mixture of sound signals, wherein the separation network is configured to produce a latent representation of the target sound source. ("an AE-class embedding layer generates target-class embedding c ∈ R D×1 , which provides an encoded representation of the target AE class" Page 2, Figure 1; Section 2.1 - Figure 1 shows the convolution blocks. Section 2.1 describes the embedding which produces a latent representation)
Regarding claim 10, Ochiai discloses: The sound processing signal of claim 8, wherein the feature decoder converts a latent representation of the target sound source produced by the separation network into an audio waveform. ("passed to the upper blocks of the sound extraction network to output only the sounds from the target AE class." Page 2, Figure 1; Section 2.1 - Figure 1 shows 1d-deconv which decodes the encoded representation into a sound output.)
Regarding claim 11, arguments analogous to claim 1 are applicable.
Regarding claim 14, arguments analogous to claim 4 are applicable.
Regarding claim 15, arguments analogous to claim 5 are applicable.
Regarding claim 16, Ochiai discloses: The computer-implemented method of claim 11, further comprising: generating one or more queries associated with the mutually inclusive and exclusive characteristics of the target sound source during training of the neural network. ("To realize the proposed multi-class simultaneous extraction, we dynamically generated target-class vector o" Pg. 2, Section 2.3 – the vector o is a query, and it is described as generated during training procedure.)
Regarding claim 17, arguments analogous to claim 8 are applicable.
Claim(s) 2, 3, and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 1 above, and further in view of CHO et al. (US 20090150146 A1).
Regarding claim 2, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein sound signals in the mixture of sound signals are collected from a plurality of sound sources with facilitation of one or more microphones, wherein each sound source of the plurality of sound sources corresponds to at least one of a speaker, a person or an individual, an industrial equipment, a vehicle, or a natural sound. Neither does Kong.
Cho discloses: wherein sound signals in the mixture of sound signals are collected from a plurality of sound sources with facilitation of one or more microphones, ("a signal separator which separates mixed signals input through a plurality of microphone into sound-source signals" [0013]) wherein each sound source of the plurality of sound sources corresponds to at least one of a speaker, a person or an individual, an industrial equipment, a vehicle, or a natural sound. ("information representing that the target speech is a speech of a specific speaker." [0026])
Ochiai, Kong, and Cho are considered analogous art to the claimed invention because they discuss methods of separating target sounds using neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Cho to use a plurality of microphones and use a person as a sound source. This would have been beneficial because interference signals can be reduced or removed using a microphone array (Cho [0008]) and so that it could be used for speech recognition (Cho [0006]).
Regarding claim 3, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein the predetermined set of one or more identifiers is associated with a plurality of sound sources, wherein the each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one of: a loudest sound source identifier, quietest sound source identifier, a farthest sound source identifier, a nearest sound source identifier, a female speaker identifier, a male speaker identifier, and a language specific sound source identifier. Neither does Kong.
Cho discloses: wherein the predetermined set of one or more identifiers is associated with a plurality of sound sources, wherein the each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one of: a loudest sound source identifier, quietest sound source identifier, a farthest sound source identifier, a nearest sound source identifier, a female speaker identifier, a male speaker identifier, and a language specific sound source identifier. ("the target speech extraction of the target speech extractor 120 may be performed by taking into consideration information representing that the target speech is a male (or female) speech " [0026])
Ochiai, Kong, and Cho are considered analogous art to the claimed invention because they discuss methods of separating target sounds using neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Cho to use a gender identifier. This would have been beneficial because providing additional information on gender can allow the extractor to have higher reliability (Cho [0027]).
Regarding claim 12, arguments analogous to claim 2 are applicable.
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 1 above, and further in view of Shanahan et al. (US 20070136336 A1).
Regarding claim 6, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein the neural network is trained using a positive example selector and a negative example selector to extract the target sound signal. Neither does Kong.
Shanahan discloses: wherein the neural network is trained using a positive example selector and a negative example selector to extract the target sound signal. ("Assembling positive and negative examples for a training set is well known to those of ordinary skill in the art" [0039])
Ochiai, Kong, and Shanahan are considered analogous art to the claimed invention because they discuss methods of training machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Shanahan to use positive and negative examples in the training data. This would have been a known method with predictable results.
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 11 above, and further in view of Chang et al. (MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition), JUN (US 20230119203 A1), Cho, and KIM et al. (US 20230169988 A1).
Regarding claim 13, Ochiai discloses: The sound processing system of claim 11, wherein the predetermined set of one or more identifiers are associated with a plurality of sound sources. (Figure 1 shows the plurality of sound sources in vector o.)
Ochiai does not disclose: wherein each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one loudest sound source identifier, quietest sound source identifier, farthest sound source identifier, nearest sound source identifier, female speaker identifier, male speaker identifier, and language specific sound source identifier. Neither does Kong.
Chang discloses: loudest sound source identifier, quietest sound source identifier, ("we sort the multi-speaker data in ascending order of SNR between the loudest and quietest speaker" Pg. 3, Section 2.2)
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Chang to use a loudest and quietest source identifier. This would have been beneficial because recognition accuracy is different at different energy levels (Chang Pg. 3, Section 2.2).
Jun discloses: farthest sound source identifier, nearest sound source identifier, ("display the position information of other persons who reproduce the sound sources on a screen of the user terminal in the order of their distance from the closest to the farthest" [0023])
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Jun to use a nearest and farthest source identifier. This would have been beneficial in order to make use of position information (Jun [0023]).
Cho discloses: female speaker identifier, male speaker identifier, ("the target speech extraction of the target speech extractor 120 may be performed by taking into consideration information representing that the target speech is a male (or female) speech " [0026])
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Cho to use a gender identifier. This would have been beneficial because providing additional information on gender can allow the extractor to have higher reliability (Cho [0027]).
Kim discloses: language specific sound source identifier. ("An apparatus for processing speech data may include a processor configured to: separate speech signals from an input speech; identify a language of each of the speech signals that are separated from the input speech; extract speaker embeddings from the speech signals based on the language of each of the speech signals" Abstract)
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Kim to use a language identifier. This would have been beneficial in order to increase accuracy using acoustic characteristics of different languages (Kim [0003]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached on 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JON CHRISTOPHER MEIS/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654