DETAILED ACTION
This communication is in response to the Application filed on 08/26/2024. Claims 1-20 are pending and have been examined. Claims 1 and 11 are independent. This Application was published as U.S. Pub. No. 20250201266A1.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08/26/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Priority
Acknowledgment is made of applicant’s claim for foreign priority based on application KR 10-2023-00180557, filed in the Korean Intellectual Property Office (KIPO) on 12/13/2023, and receipt of a certified copy thereof.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claims 1 and 11,
Claims 1 and 11 recite a method and a device, respectively, for recognizing an emotion of a vehicle occupant, which fall under the statutory categories of process and machine, respectively (Step 1: Yes).
The claims recite the limitations “(a) acquiring data…”, “(b) preparing, from the acquired data, a first type of input data…”, “(c) preparing, from the acquired data, a second type of input data…”, “(d) inputting the first type…”, and “(e) providing…”. Except for the recitation of an emotion classification model, limitations (b) and (c) can be performed in the human mind or with pen and paper. The claims, under their broadest reasonable interpretation, cover the concept of a person listening to the speech and noise in the vehicle and recognizing the emotional states of the occupants (see MPEP 2106.04(a)(2), subsection III).
Under their broadest reasonable interpretation when read in light of the specification, the actions recited in limitations (b) and (c) encompass mental processes practically performed in the human mind. Accordingly, the claims recite an abstract idea (Step 2A, Prong One).
The judicial exception is not integrated into a practical application. In particular, limitation (d) recites the additional element of “an emotion classification model,” and Claim 11 additionally recites “one or more processors” and “a storage medium,” but these elements are recited at a high level of generality (i.e., the combination of hardware and software is a generic computing device, and the generic computer components perform generic computer functions such as processing and storing data from a given input), such that they amount to no more than mere instructions to apply the exception using a generic computer component.
The claims also recite additional limitations (a), (d), and (e). Limitations (a), (d), and (e) are recited at a high level of generality and amount to mere data gathering and output, which is a form of insignificant extra-solution activity. Each of these additional limitations is no more than mere instructions to apply the exception using a generic computer component, or generally links the use of the judicial exception to a particular technological environment or field of use.
Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, and the claims are therefore directed to the judicial exception (Step 2A: Yes).
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because they do not include subject matter that could not be performed by a human. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computing elements to perform the claimed steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
As noted previously, the claims as a whole merely generally link the use of the aforementioned concept to a particular technological environment or field of use. Thus, even when viewed as a whole, nothing in the claims adds significantly more (i.e., an inventive concept) to the abstract idea. The claims are not patent eligible (Step 2B: No).
Regarding Dependent Claims 2-10, 12-20,
Claims 2-10 and 12-20 depend from claims 1 and 11, include all the limitations of those claims, and further limit the elements of Claims 1 and 11. Therefore, the dependent claims recite the same abstract idea. The dependent claims recite the additional limitations of a long short-term memory (LSTM) model and a convolutional neural network (CNN) model (claims 2 and 12); an LSTM layer, a flatten layer, a rectified linear unit (ReLU) layer, a dropout layer, and a batch norm layer (claims 3 and 13); a linear layer, a one-dimensional convolutional blocks (Conv1D blocks) layer, a flatten layer, and a batch normalization layer (claims 4 and 14); a concatenation layer and a softmax layer (claims 5 and 15); a problem-agnostic speech encoder (PASE) neural network model (claims 6 and 16); a SincNet layer, seven convolutional blocks (Conv blocks) layer, a one-dimensional convolution (Conv1D) layer, a batch normalization layer, and a flatten layer (claims 7 and 17); a one-dimensional convolution (Conv1D) layer, a batch normalization layer, and a parametric rectified linear unit (PReLU) layer (claims 8 and 18); and a decoder (claims 9 and 19). These additional limitations are no more than mere instructions to apply the exception using a generic computer component, generally link the use of the judicial exception to a particular technological environment or field of use, constitute insignificant extra-solution activity, or are well-understood, routine, and conventional activities previously known to the industry.
No additional elements beyond the use of generic computing elements are claimed; therefore, the judicial exception is not integrated into a practical application, nor are the claim elements sufficient to amount to significantly more than the judicial exception. Therefore, the dependent claims are not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 9-16 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lei et al. (CN115035916A, hereinafter "Lei") in view of Zhao et al. (CN114495915A, hereinafter "Zhao").
Regarding Claim 1,
Lei discloses a method of recognizing emotion of a vehicle occupant (Lei, pg.1, ll.17, "…a method for recognizing noisy speech emotion based on deep learning."), the method comprising:
acquiring data in which speech of a vehicle occupant and noise of a vehicle are mixed (pg.1, ll.52, "…S1. Acquire audio data and perform preprocessing to obtain preprocessed data");
preparing, from the acquired data, a second type of input data in the form of a Mel-Spectrogram (Lei, pg.2, ll.2-14, "…S2. Extract the Mel-spectrogram features and time-frequency features of the preprocessed data");
Lei discloses a method for recognizing emotion in noisy speech based on deep learning using learnable hybrid (or mixed) features (pg.2, ll.14-39, "…a method for recognizing emotion in noisy speech based on deep learning, using learnable mixed features as modeling input..."), but does not explicitly disclose "a first type of input data in a form of a latent vector."
However, Zhao, in the analogous field of endeavor, discloses preparing, from the acquired data, a first type of input data in a form of a latent vector (pg.5, ll.5-8, "…S101. Acquire a first feature and a second feature of the sample audio..."; Fig.2, pg.6, ll.47-pg.7, ll.22, "…S101 of the method includes: acquiring a second feature of the sample audio…a multi-dimensional vector can be extracted from the audio, and the feature values in the vector can represent a second feature of the audio...");
inputting the first type of input data and the second type of input data into an emotion classification model (Zhao, pg.8, ll.7-17, "…S301. Input the first feature and the second feature into an encoder for encoding processing, so as to realize emotional feature decoupling. The encoder can extract emotion features based on the first feature and the second feature, so as to more accurately perform emotion recognition training..."); and
providing, based on an output of the emotion classification model, a result of classifying an emotion of the vehicle occupant (Zhao, pg.12, ll.26-37, "…S601. Acquire a first feature and a second feature of the audio to be recognized, where the first feature is used to represent a feature related to the waveform of the audio to be recognized, and the second feature is used to represent a speaker related to the audio to be recognized the relevant characteristics; S602, input the first feature and the second feature into a speech emotion recognition model to perform emotion category recognition, and obtain a first recognition result...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the noisy-speech emotion recognition method of Lei by incorporating the second feature vector (e.g., a speaker feature vector) and the combination of CNN and LSTM neural networks of Zhao, with a reasonable expectation of success, to improve the recognition of speaker emotion in the presence of noise by applying various non-waveform features (Zhao, pgs.1-3, Background and Summary).
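For illustration only (not part of the prosecution record), the following minimal Python sketch shows one way a Mel-spectrogram input of the kind recited in limitation (c) could be prepared using the librosa library; the file name and all parameter values are hypothetical assumptions, not taken from Lei or the claims.

```python
# Hypothetical sketch: preparing a Mel-spectrogram input from in-cabin audio.
import librosa
import numpy as np

y, sr = librosa.load("cabin_audio.wav", sr=16000)  # speech mixed with vehicle noise (assumed file)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)     # log-scaled Mel-spectrogram
print(log_mel.shape)                               # (n_mels, n_frames)
```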
Regarding Claim 2,
Lei in view of Zhao discloses the method of claim 1, wherein the emotion classification model includes an ensemble model for a long short-term memory (LSTM) model processing the first type of input data (Zhao, pg.7, "…S202, extracting the second feature from the sample audio by using a speaker classification model. For example, a speaker classification model can be called a speaker feature extraction model, and the model can include an LSTM (Long-Short Term Memory) layer, a linear mapping layer (may be referred to as a linear layer), a fully connected layer...") and a convolutional neural network (CNN) model processing the second type of input data (Zhao, Fig.14, pg.19, ll.11-29, "…SincNet (Optimized Convolutional Neural Network) to extract emotional features from front end features, etc.").
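For illustration only, a minimal sketch of an ensemble pairing an LSTM model over the first type of input data (a latent-vector sequence) with a CNN model over the second type of input data (a Mel-spectrogram); all dimensions and layer choices are assumptions for illustration, not the architecture disclosed by Lei or Zhao.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an LSTM + CNN ensemble emotion classifier.
class EmotionEnsemble(nn.Module):
    def __init__(self, latent_dim=256, n_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, 128, batch_first=True)  # LSTM model: first input type
        self.cnn = nn.Sequential(                                # CNN model: second input type
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),                                        # -> (B, 16*8*8)
        )
        self.classifier = nn.Linear(128 + 16 * 8 * 8, n_emotions)

    def forward(self, latent_seq, mel):
        # latent_seq: (B, T, latent_dim); mel: (B, 1, n_mels, n_frames)
        _, (h, _) = self.lstm(latent_seq)                # final hidden state
        fused = torch.cat([h[-1], self.cnn(mel)], dim=1)
        return self.classifier(fused)                    # emotion logits
```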
Regarding Claim 3,
Lei in view of Zhao discloses the method of claim 1, further comprising processing the first type of input data along a first path including an LSTM layer, a flatten layer, a rectified linear unit (ReLU) layer, a dropout layer, and a batch norm layer (Zhao, "…the model can include an LSTM (Long-Short Term Memory) layer, a linear mapping layer (may be referred to as a linear layer), a fully connected layer..."; Fig.15, pg.21, ll.31-pg.22, ll.10, "…encoder includes a weighted average layer, a connection layer (Concat), 3 layers of convolution regularization layer (Conv1D+BatchNormalization), 2 layers of BLSTM layer and downsampling layer (Downsampler)..."; Zhao discloses the flatten layer in the text sentiment classification module (Fig.14), and, in addition, it is construed that a fully connected layer or connection layer would require a flatten layer; Lei, pg.7, ll.30, "…σ represents the ReLU activation function").
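For illustration only, a minimal sketch of a first path in the layer order recited in claim 3 (LSTM, flatten, ReLU, dropout, batch norm); the sequence length and feature sizes are assumptions.

```python
import torch.nn as nn

# Hypothetical sketch of claim 3's first path over the latent-vector input.
class FirstPath(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, steps=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # LSTM layer
        self.flatten = nn.Flatten()                              # flatten layer
        self.relu = nn.ReLU()                                    # ReLU layer
        self.dropout = nn.Dropout(0.5)                           # dropout layer
        self.bn = nn.BatchNorm1d(hidden * steps)                 # batch norm layer

    def forward(self, x):            # x: (B, steps, feat_dim)
        out, _ = self.lstm(x)        # (B, steps, hidden)
        return self.bn(self.dropout(self.relu(self.flatten(out))))
```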
Regarding Claim 4,
Lei in view of Zhao discloses the method of claim 3, further comprising processing the second type of input data along a second path including a linear layer, a one-dimensional convolutional blocks (Conv1D blocks) layer, a flatten layer, and a batch normalization layer (Zhao, Fig.15, pg.21, ll.31-pg.22, ll.10, "…encoder includes a weighted average layer, a connection layer (Concat), 3 layers of convolution regularization layer (Conv1D+BatchNormalization)"; Zhao discloses the linear layer in the decoder, pg.22, "…linear layer (Liner) and 5 layers of convolutional regularization layer (Conv1D+BatchNormalization)...").
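For illustration only, a minimal sketch of a second path in the layer order recited in claim 4 (linear, Conv1D blocks, flatten, batch normalization); the input shape and channel counts are assumptions.

```python
import torch.nn as nn

# Hypothetical sketch of claim 4's second path over the Mel-spectrogram input.
class SecondPath(nn.Module):
    def __init__(self, n_mels=64, n_frames=100):
        super().__init__()
        self.linear = nn.Linear(n_mels, n_mels)       # linear layer
        self.conv_blocks = nn.Sequential(             # Conv1D blocks layer
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.flatten = nn.Flatten()                   # flatten layer
        self.bn = nn.BatchNorm1d(64 * n_frames)       # batch normalization layer

    def forward(self, mel):                 # mel: (B, n_frames, n_mels)
        x = self.linear(mel).transpose(1, 2)          # -> (B, n_mels, n_frames)
        return self.bn(self.flatten(self.conv_blocks(x)))
```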
Regarding Claim 5,
Lei in view of Zhao discloses the method of claim 4.
Zhao further discloses inputting a first result from the first path and a second result from the second path into a concatenation layer, combining them, and then passing the combination through a softmax layer to output a third result of the classification of the emotion of the vehicle occupant (Fig.15, pg.16, ll.2-8, "…In the first connection (concatenation) layer, the output feature of the weighted averaging layer is spliced with the second feature to obtain the first spliced feature..."; Fig.5, pg.11, ll.25-38, "…After inputting the output feature of the encoder into the emotion recognition classifier, the emotion classification result can be obtained through an activation function such as softmax. Using the emotion category recognition results of purer emotion features...").
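For illustration only, a minimal sketch of the combination recited in claim 5: the two path results are concatenated and passed through a softmax layer to produce emotion class probabilities; dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of claim 5's concatenation + softmax head.
class ClassifierHead(nn.Module):
    def __init__(self, dim_first, dim_second, n_emotions=4):
        super().__init__()
        self.fc = nn.Linear(dim_first + dim_second, n_emotions)

    def forward(self, first_result, second_result):
        fused = torch.cat([first_result, second_result], dim=1)  # concatenation layer
        return torch.softmax(self.fc(fused), dim=1)              # softmax layer
```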
Regarding Claim 6,
Lei in view of Zhao discloses the method of claim 1, further comprising deriving the latent vector from a problem-agnostic speech encoder (PASE) neural network model trained with a dataset in which speech data of the vehicle occupant and noise data of the vehicle are synthesized (Lei, pg.1, ll.17, "…a method for recognizing noisy speech emotion based on deep learning."; pg.1, ll.52, "…S1. Acquire audio data and perform preprocessing to obtain preprocessed data"; "acquiring audio data" is broadly interpreted as acquiring real or synthetic audio data for training and inference).
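For illustration only, a minimal sketch of how a training example might be synthesized by mixing occupant speech with vehicle noise at a chosen signal-to-noise ratio, as the claimed training dataset contemplates; the function name and SNR value are assumptions.

```python
import numpy as np

# Hypothetical sketch: synthesize a noisy training utterance at a target SNR.
def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                  # align lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12         # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise                 # e.g., mix_at_snr(s, n, snr_db=5.0)
```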
Regarding Claim 9,
Lei in view of Zhao discloses the method of claim 6.
Zhao further discloses, using a decoder of the PASE neural network model, decoding the latent vector into a worker associated with a plurality of feature points (Fig.15, pg.22, ll.12-35, "…d: decoder (i.e., Fig.15, 1501d)"; Fig.4, pg.9, ll.6-pg.10, ll.2, "…S402, the output feature of the encoder, the second feature and the phoneme feature are input into the decoder for decoding processing...").
Regarding Claim 10,
Lei in view of Zhao discloses the method of claim 9.
Zhao further discloses wherein the plurality of feature points includes at least two of a log power spectrum (LPS) feature point, a mel-frequency cepstral coefficients (MFCC) feature point, a chroma feature point, a spectral feature point, and a temporal feature point (Zhao, pg.6, ll.47-pg.7, ll.22, "…the type of input audio features (i.e., feature values of the latent vector, which will be reconstructed by the decoder) may include MFCC (Mel-Frequency Cepstral Coefficients, Mel frequency cepstral coefficients), PLP (Perceptual linear prediction, perceptual linear prediction) or Fbank (FilterBank, filter bank-based features)...").
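For illustration only, a minimal sketch extracting several of the feature types listed in claim 10 using the librosa library; the file name and parameters are assumptions.

```python
import librosa
import numpy as np

# Hypothetical sketch: extracting LPS, MFCC, chroma, spectral, and temporal features.
y, sr = librosa.load("occupant_speech.wav", sr=16000)
lps = np.log(np.abs(librosa.stft(y, n_fft=512)) ** 2 + 1e-10)  # log power spectrum (LPS)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # MFCC feature points
chroma = librosa.feature.chroma_stft(y=y, sr=sr)               # chroma feature points
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)       # a spectral feature point
zcr = librosa.feature.zero_crossing_rate(y)                    # a temporal feature point
```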
Claim 11 is a device claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Zhao discloses the device comprising: one or more processors; and a storage medium storing computer-readable instructions that, when executed by the one or more processors, enable the one or more processors to (Zhao, Fig.17, pg.26, ll.8-pg.28, ll.11, "…the device 1700 includes a computing unit 1701 that can be executed according to a computer program stored in a read only memory (ROM) 1702 or a computer program loaded from a storage unit 1708 into a random access memory (RAM) 1703..."):
…
Rationale for combination is similar to that provided for Claim 1.
Claim 12 is a device claim with limitations similar to the limitations of Claim 2 and is rejected under similar rationale.
Claim 13 is a device claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Claim 14 is a device claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
Claim 15 is a device claim with limitations similar to the limitations of Claim 5 and is rejected under similar rationale.
Claim 16 is a device claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.
Claim 19 is a device claim with limitations similar to the limitations of Claim 9 and is rejected under similar rationale.
Claim 20 is a device claim with limitations similar to the limitations of Claim 10 and is rejected under similar rationale.
Claims 7-8 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Lei in view of Zhao, further in view of Cai et al. ("An emotional EEG signal classification research based on deep learning," 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), IEEE, 2022, hereinafter "Cai").
Regarding Claim 7,
Lei in view of Zhao discloses the method of claim 6, wherein an encoder of the PASE neural network model includes seven convolutional blocks (Conv blocks) layer, a one-dimensional convolution (Conv1D) layer, a batch normalization layer, and a flatten layer (see the discussion of claim 3 above).
However, neither Lei nor Zhao explicitly discloses a SincNet layer in the encoder. Cai, in the analogous field of endeavor, discloses a SincNet layer (Cai, Fig.2, pg.323, "…In the final convolutional structure of the SincNet-E model, a convolutional layer, a normalization layer, a Leaky ReLU activation function layer, and a random dropout discard layer are defined as a convolutional group. In this way, the convolutional structure can be considered as a superposition of two pooling layers alternating with three convolutional groups...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the noisy-speech emotion recognition method of Lei in view of Zhao by incorporating the SincNet layer, with its bandpass filter bank, of Cai, with a reasonable expectation of success, to improve the classification performance of the deep learning algorithm when dealing with a large data volume, without the need for manual feature extraction (Cai, Abstract and Introduction).
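For illustration only, a simplified sketch of a SincNet-style layer of the kind Cai describes: each output channel is a learnable bandpass filter parameterized only by its low cutoff and bandwidth, realized as a difference of two windowed sinc low-pass kernels. This is not Cai's implementation; the initialization values are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of a SincNet-style learnable bandpass filter bank.
class SincConv1d(nn.Module):
    def __init__(self, out_channels=16, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.low_hz = nn.Parameter(torch.linspace(30.0, 4000.0, out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 2000.0))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)              # kernel time axis (seconds)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                       # x: (B, 1, samples)
        low = torch.abs(self.low_hz).unsqueeze(1)               # (C, 1) low cutoffs (Hz)
        high = low + torch.abs(self.band_hz).unsqueeze(1)       # (C, 1) high cutoffs (Hz)
        t = self.t.unsqueeze(0)                                 # (1, K)
        # difference of two low-pass sinc kernels = a bandpass filter
        band_pass = (2 * high * torch.sinc(2 * high * t)
                     - 2 * low * torch.sinc(2 * low * t)) * self.window
        return F.conv1d(x, band_pass.unsqueeze(1))              # (B, C, samples - K + 1)
```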
Regarding Claim 8,
Lei in view of Zhao, further in view of Cai, discloses the method of claim 7, wherein the convolution block layer includes a one-dimensional convolution (Conv1D) layer, a batch normalization layer, and a parametric rectified linear unit (PReLU) layer (i.e., both PReLU and Leaky ReLU address the "dying ReLU" problem of the ReLU activation function with respect to negative inputs; the Examiner interprets the choice between them as an implementation choice, and the difference is well known to those skilled in the art).
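For illustration only, a short sketch of the Examiner's point that Leaky ReLU and PReLU differ only in whether the negative-side slope is fixed or learned; at the same initial value they compute identical outputs.

```python
import torch
import torch.nn as nn

# Leaky ReLU: fixed negative-side slope. PReLU: learnable negative-side slope.
x = torch.tensor([-2.0, -1.0, 0.0, 1.0])
leaky = nn.LeakyReLU(negative_slope=0.01)   # slope is a fixed hyperparameter
prelu = nn.PReLU(init=0.01)                 # slope is a trainable parameter
print(leaky(x))   # tensor([-0.0200, -0.0100,  0.0000,  1.0000])
print(prelu(x))   # identical at initialization; the slope is then learned in training
```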
Claim 17 is a device claim with limitations similar to the limitations of Claim 7 and is rejected under similar rationale.
Claim 18 is a device claim with limitations similar to the limitations of Claim 8 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhao et al. ("Speech emotion recognition using deep 1D & 2D CNN LSTM networks," Biomedical Signal Processing and Control 47 (2019): 312-323) discloses 1D & 2D CNN LSTM networks, which learn hierarchical local and global features to recognize speech emotion, whereas most data models can extract only low-level features to classify emotion, and most previous DBN-based or CNN-based algorithmic models can learn only one type of emotion-related feature to recognize emotion.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE whose telephone number is (703)756-5597. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BHAVESH MEHTA can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JANGWOEN LEE/Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656