Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-17 are pending. Claims 1 and 11 are independent.
This Application was published as US 20230326478.
Apparent priority is 04/06/2022.
Examiner’s Note
Should the Applicant incorporate the complete concept illustrated in Figure 7 into the independent claims (1 and 11), the Examiner would consider allowing the application, subject to a further search.
Response to Amendments
Amendments to claim 4 overcome the objection.
Response to Arguments
Provisional Application
Arguments regarding the support provided by the provisional application are not persuasive. Applicant cites the Abstract; however, the Abstract merely describes prior art, not the instant invention. Further, the prior art discussed in the Abstract mentions isolating only the vocals or speakers specifically, not other sounds. Throughout the rest of the provisional application, the language and examples consistently apply specifically to people, e.g., spoken language or gender. While the term "source" is used, there is no indication that the invention described in the provisional application is able to extract anything other than a voice. However, for the purposes of compact prosecution, references prior to the date of the provisional application are relied upon.
35 USC 102/103
Applicant's arguments with respect to combining the digital representation with intermediate outputs of intermediate layers of the neural network have been fully considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant’s arguments regarding Sound Source Extraction are not persuasive. In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., identifying and extracting a target sound source) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Claim 1 merely requires extracting all sound signals corresponding to the source. The claim does not require the system to identify the source or extract the source itself.
In the examples given in Fig. 2 of Ochiai, extracting all telephone sounds reads on extracting all sounds from the telephone source. Further, there is no requirement that the identifier is determined by the system; if a user determined the identifiers of the source, the claims would still be read upon. For example, using Fig. 2 of Ochiai, the user can determine that the desired source is a person who is coughing and laughing, and choose these identifiers accordingly. The process is the same for both Ochiai's system and the instant invention: the user must adequately describe the desired source so that it is uniquely identified.
Therefore, the rejection is maintained.
Examiner still suggests that incorporating limitations (if supported by the original disclosure) such as “identify a sound source based on the digital representation, and execute a neural network trained to extract every sound signal originating from the identified sound source” would help clarify the distinction between Ochiai and the claimed invention.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1, 4-5, 7-11, 14-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai et al. (Listen to What You Want: Neural Network-based Universal Sound Selector) in view of Kong et al. (“Source Separation with Weakly Labelled Data: An Approach to Computational Auditory Scene Analysis”).
Regarding claim 1, Ochiai discloses:
A sound processing system to extract a target sound signal, ("a neural network-based AE sound selection approach, called Sound Selector, which directly extracts the desired AE sound from a mixture of AEs" Pg. 1, Section 1 – the desired AE sound is a source.)
the sound processing system comprising: at least one processor; and memory having instructions stored thereon that, when executed by the at least one processor, cause the sound processing system to: ("computational and memory costs" Pg. 3, Section 4.1 – implicit that processor and memory are part of the system.)
collect a mixture of sound signals; ("we created datasets of simulated sound event mixtures based on the Freesound Dataset Kaggle 2018 corpus (FSD) [6], which contains audio clips from 41 diverse AE classes, such as human sounds, object sounds, musical instruments, etc" Pg. 3, Section 4.2)
collect a query identifying the target sound source to be extracted from the mixture of sound signals, the query comprising one or more identifiers; extract from the query, each identifier of the one or more identifiers, ("user-specified target AE classes" Pg. 1, Section 2 - Figure 1 shows an example where the query is extracted to select "knock" and "telephone" as the target source.)
said each identifier being present in a predetermined set of one or more identifiers, ("For each mixture, three AE classes {n1, n2, n3} were pre-defined." Pg. 3, Section 4.2 – the AE classes are identifiers)
each identifier defining at least one of mutually inclusive and mutually exclusive characteristics of the target sound source; (As shown in Figs. 1 and 2 of Ochiai, the target sound source could be a sound signal which contains only laughter and cough sounds. In this example, both laughter sounds and cough sounds are characteristics of the target sound signal. Both can be present in the target sound signal; therefore, they are mutually inclusive characteristics of that signal. Additionally, mutually exclusive sounds could be selected, such as telephone and meow.)
determine any logical operators connecting the extracted one or more identifiers; transform the extracted one or more identifiers and any logical operators into a digital representation predetermined for querying the mixture of sound signals; ("This formalization corresponds to having a target-class vector o set to a n-hot vector, where the n elements that correspond to the target AE classes are 1 and the others are 0." Pg. 2, Section 2.2 - This is a logical AND between the identifiers. The target sound must meet each AE class.)
execute a neural network trained to extract all the sound signals corresponding to the target sound source that is identified by the digital representation, from the mixture of sound signals, by combining the digital representation with intermediate outputs of intermediate layers of the neural network processing the mixture of sound signals, wherein the neural network is trained with machine learning to extract different sound signals identified in a predetermined set of digital representations; ("In this paper, we propose a neural network-based AE sound selection approach, called Sound Selector, which directly extracts the desired AE sound from a mixture of AEs given a onehot vector representing the class of interest." Pg. 1, Section 1; Pg. 2, Section 2.3 further describes the training. Figure 1 further shows that the digital representation (o) is combined with intermediate outputs of intermediate layers. In the example of Fig. 2, all the sound signals corresponding to the selected target sound source are extracted.)
and output the extracted target sound source. ("we output a signal that consists of the sum of all the AEs from these classes" Pg. 1, Section 1)
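As an illustrative aside to the mapping above (not part of the record), the n-hot target-class vector o described by Ochiai can be sketched as follows; the class list and helper name here are hypothetical, whereas Ochiai's FSD setup uses 41 AE classes:

```python
# Sketch of Ochiai's n-hot target-class vector (illustrative only).
# AE_CLASSES is a hypothetical, abbreviated class vocabulary.
AE_CLASSES = ["knock", "telephone", "cough", "laughter", "meow"]

def make_target_vector(selected, classes=AE_CLASSES):
    """Return an n-hot vector: 1 for each user-specified target AE class, else 0."""
    return [1 if c in selected else 0 for c in classes]

# Selecting "knock" and "telephone" (the Fig. 1 example) yields a 2-hot vector.
o = make_target_vector({"knock", "telephone"})
```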
Ochiai does not explicitly disclose combining the digital representation with intermediate outputs of intermediate layers of the neural network. (Ochiai discloses combining it with the intermediate output of a single layer of the neural network.)
Kong discloses combining the digital representation with intermediate outputs of intermediate layers of the neural network. (“In addition to the anchor segment, a condition vector is used as an extra input to control what source to separate.” Pg. 3, para 1; “The condition vectors is mapped to embedding vectors by a learnable matrix. The embedding vectors are added to after each ReLU operation in all layers as a bias.” Pg. 4, para 2)
Ochiai and Kong are considered analogous art to the claimed invention because they disclose methods for source separation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Ochiai by adding the conditioning vector after each layer. Doing so would have been beneficial in order to control which sources to separate (Kong, pg. 4, para 2). This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
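For illustration only (not part of the claim mapping), the Kong-style conditioning relied on above, in which an embedding of the condition vector is added as a bias after each layer's ReLU, can be sketched as follows; the layer widths, weights, and embedding matrix are hypothetical:

```python
import numpy as np

def conditioned_forward(x, cond, weights, embed):
    """Pass x through linear + ReLU layers, adding an embedding of the
    condition vector after each ReLU as a bias (per Kong, pg. 4, para 2)."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)  # linear layer followed by ReLU
        x = x + embed @ cond        # condition embedding added as a bias
    return x

# Hypothetical shapes: three layers of width 8, a 5-class condition vector.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
embed = rng.standard_normal((8, 5)) * 0.1
cond = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # n-hot condition (two target classes)
y = conditioned_forward(rng.standard_normal(8), cond, weights, embed)
```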
Regarding claim 4, Ochiai discloses: The sound processing system of claim 1, wherein the one or more identifiers are combined using any of the determined logical operators to extract the target sound source having mutually inclusive and exclusive characteristics, wherein any of the determined logical operators comprise at least one of: NOT operator, AND operator, and OR operator, wherein NOT operator is used with any single identifier of the one or more identifiers. ("This formalization corresponds to having a target-class vector o set to a n-hot vector, where the n elements that correspond to the target AE classes are 1 and the others are 0." Pg. 2, Section 2.2 - This is a logical AND between the identifiers.)
Regarding claim 5, Ochiai discloses: The sound processing system of claim 1, wherein the neural network is trained using the predetermined set of digital representations of a plurality of combinations of identifiers in the predetermined set of one or more identifiers. ("We assume that a set of input and target features {y, o, {xn} N n=1} is available for training the model" Pg. 2, Section 2.3 - (o) represents the predefined vector of identifiers; “Mix 3-5 contain the AEs of three, four, or five classes.” Pg. 3, Section 4.1 – The mixes 3-5 have a different number of AEs (identifiers) and therefore have a plurality of combinations.)
Regarding claim 7, Ochiai discloses: The sound processing system of claim 1, wherein the digital representation is represented by at least one of: a one hot conditional vector, a multi-hot conditional vector, and text description. ("o is a one-hot vector" Page 2, Section 2.1).
Regarding claim 8, Ochiai discloses: The sound processing system of claim 1, wherein the intermediate layers of the neural network comprise one or more intertwined blocks, wherein each of the one or more intertwined blocks comprise at least one of: a feature encoder, a conditioning network, a separation network, and a feature decoder, wherein the conditioning network comprises a feature-invariant linear modulation (FiLM) layer that takes as an input the mixture of sound signals and the digital representation and modulates the input into the conditioning input, wherein the FiLM layer processes the conditioning input and sends the processed conditioning input to the separation network. ("an AE-class embedding layer generates target-class embedding c ∈ R D×1, which provides an encoded representation of the target AE class." Page 2, Figure 1; Section 2.1 - Figure 1 shows multiple blocks. Section 2.1 describes the blocks which include at least a feature encoder.)
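For context on the FiLM layer recited in claim 8, a generic feature-wise linear modulation (not Ochiai's architecture) predicts a scale and a shift from the conditioning input and applies them channel-wise; the linear maps and dimensions below are hypothetical:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return gamma * features + beta

# Hypothetical conditioning: gamma/beta derived from the digital
# representation (here a multi-hot vector) by hypothetical linear maps.
rng = np.random.default_rng(1)
o = np.array([1.0, 0.0, 1.0])  # digital representation (multi-hot)
Wg = rng.standard_normal((4, 3))
Wb = rng.standard_normal((4, 3))
gamma, beta = Wg @ o, Wb @ o
out = film(rng.standard_normal(4), gamma, beta)
```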
Regarding claim 9, Ochiai discloses: The sound processing system of claim 8, wherein the separation network comprises a convolution block layer that utilizes the conditioning input to separate the target sound source from the mixture of sound signals, wherein the separation network is configured to produce a latent representation of the target sound source. ("an AE-class embedding layer generates target-class embedding c ∈ R D×1 , which provides an encoded representation of the target AE class" Page 2, Figure 1; Section 2.1 - Figure 1 shows the convolution blocks. Section 2.1 describes the embedding which produces a latent representation)
Regarding claim 10, Ochiai discloses: The sound processing signal of claim 8, wherein the feature decoder converts a latent representation of the target sound source produced by the separation network into an audio waveform. ("passed to the upper blocks of the sound extraction network to output only the sounds from the target AE class." Page 2, Figure 1; Section 2.1 - Figure 1 shows 1d-deconv which decodes the encoded representation into a sound output.)
Regarding claim 11, arguments analogous to claim 1 are applicable.
Regarding claim 14, arguments analogous to claim 4 are applicable.
Regarding claim 15, arguments analogous to claim 5 are applicable.
Regarding claim 16, Ochiai discloses: The computer-implemented method of claim 11, further comprising: generating one or more queries associated with the mutually inclusive and exclusive characteristics of the target sound source during training of the neural network. ("To realize the proposed multi-class simultaneous extraction, we dynamically generated target-class vector o" Pg. 2, Section 2.3 – the vector o is a query, and it is described as generated during training procedure.)
Regarding claim 17, arguments analogous to claim 8 are applicable.
Claim(s) 2, 3, and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 1 above, and further in view of CHO et al. (US 20090150146 A1).
Regarding claim 2, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein sound signals in the mixture of sound signals are collected from a plurality of sound sources with facilitation of one or more microphones, wherein each sound source of the plurality of sound sources corresponds to at least one of a speaker, a person or an individual, an industrial equipment, a vehicle, or a natural sound. Neither does Kong.
Cho discloses: wherein sound signals in the mixture of sound signals are collected from a plurality of sound sources with facilitation of one or more microphones, ("a signal separator which separates mixed signals input through a plurality of microphone into sound-source signals" [0013]) wherein each sound source of the plurality of sound sources corresponds to at least one of a speaker, a person or an individual, an industrial equipment, a vehicle, or a natural sound. ("information representing that the target speech is a speech of a specific speaker." [0026])
Ochiai, Kong, and Cho are considered analogous art to the claimed invention because they discuss methods of separating target sounds using neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Cho to use a plurality of microphones and use a person as a sound source. This would have been beneficial because interference signals can be reduced or removed using a microphone array (Cho [0008]) and so that it could be used for speech recognition (Cho [0006]).
Regarding claim 3, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein the predetermined set of one or more identifiers is associated with a plurality of sound sources, wherein the each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one of: a loudest sound source identifier, quietest sound source identifier, a farthest sound source identifier, a nearest sound source identifier, a female speaker identifier, a male speaker identifier, and a language specific sound source identifier. Neither does Kong.
Cho discloses: wherein the predetermined set of one or more identifiers is associated with a plurality of sound sources, wherein the each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one of: a loudest sound source identifier, quietest sound source identifier, a farthest sound source identifier, a nearest sound source identifier, a female speaker identifier, a male speaker identifier, and a language specific sound source identifier. ("the target speech extraction of the target speech extractor 120 may be performed by taking into consideration information representing that the target speech is a male (or female) speech " [0026])
Ochiai, Kong, and Cho are considered analogous art to the claimed invention because they discuss methods of separating target sounds using neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Cho to use a gender identifier. This would have been beneficial because providing additional information on gender can allow the extractor to have higher reliability (Cho [0027]).
Regarding claim 12, arguments analogous to claim 2 are applicable.
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 1 above, and further in view of Shanahan et al. (US 20070136336 A1).
Regarding claim 6, Ochiai discloses: The sound processing system of claim 1.
Ochiai does not disclose: wherein the neural network is trained using a positive example selector and a negative example selector to extract the target sound signal. Neither does Kong.
Shanahan discloses: wherein the neural network is trained using a positive example selector and a negative example selector to extract the target sound signal. ("Assembling positive and negative examples for a training set is well known to those of ordinary skill in the art" [0039])
Ochiai, Kong, and Shanahan are considered analogous art to the claimed invention because they discuss methods of training machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ochiai in view of Kong with the teaching of Shanahan to use positive and negative examples in the training data. This would have been a known method with predictable results.
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ochiai in view of Kong as applied to claim 11 above, and further in view of Chang et al. (MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition), JUN (US 20230119203 A1), Cho, and KIM et al. (US 20230169988 A1).
Regarding claim 13, Ochiai discloses: The sound processing system of claim 11, wherein the predetermined set of one or more identifiers are associated with a plurality of sound sources. (Figure 1 shows the plurality of sound sources in vector o.)
Ochiai does not disclose: wherein each of the one or more identifiers in the predetermined set of one or more identifiers comprises at least one loudest sound source identifier, quietest sound source identifier, farthest sound source identifier, nearest sound source identifier, female speaker identifier, male speaker identifier, and language specific sound source identifier. Neither does Kong.
Chang discloses: loudest sound source identifier, quietest sound source identifier, ("we sort the multi-speaker data in ascending order of SNR between the loudest and quietest speaker" Pg. 3, Section 2.2)
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Chang to use a loudest and quietest source identifier. This would have been beneficial because recognition accuracy is different at different energy levels (Chang Pg. 3, Section 2.2).
Jun discloses: farthest sound source identifier, nearest sound source identifier, ("display the position information of other persons who reproduce the sound sources on a screen of the user terminal in the order of their distance from the closest to the farthest" [0023])
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Jun to use a nearest and farthest source identifier. This would have been beneficial in order to make use of position information (Jun [0023]).
Cho discloses: female speaker identifier, male speaker identifier, ("the target speech extraction of the target speech extractor 120 may be performed by taking into consideration information representing that the target speech is a male (or female) speech " [0026])
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Cho to use a gender identifier. This would have been beneficial because providing additional information on gender can allow the extractor to have higher reliability (Cho [0027]).
Kim discloses: language specific sound source identifier. ("An apparatus for processing speech data may include a processor configured to: separate speech signals from an input speech; identify a language of each of the speech signals that are separated from the input speech; extract speaker embeddings from the speech signals based on the language of each of the speech signals" Abstract)
Ochiai, Kong, Chang, Jun, Cho, and Kim are considered analogous art to the claimed invention because they discuss methods of identifying target sounds. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Kim to use a language identifier. This would have been beneficial in order to increase accuracy using acoustic characteristics of different languages (Kim [0003]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached on 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JON CHRISTOPHER MEIS/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654