DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-36 are pending and have been examined.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 06/19/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-36 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claims 1, 13, and 25, the limitations of capturing audio data, extracting acoustic features, extracting neural network embedding features, and fusing, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind and/or with pen and paper but for the recitation of generic computer components. More specifically, the limitations read on the mental process of a human receiving written numerical data representative of audio data and performing a series of calculations, including the use of two specific sets of equations, which results in a desired value indicative of specific information. The first and second neural network models read on sets of equations developed to use specific input values to produce desired output values once the calculations are complete. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind and/or with pen and paper but for the recitation of generic computer components, then it falls within the "Mental Processes" grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application because the recitation of a smart wearable device in claim 1, a smart wearable device and computer readable storage mediums in claim 13, and a smart wearable device, memory, and processor in claim 25 reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted in view of [0068-79] of the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generalized computer components to perform the capturing, extracting, and fusing steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
With respect to claims 2, 14, and 26, the claims recite detecting one or more speaker change points, which reads on a human performing calculations using the second set of equations, where the resulting value indicates whether the numerical data shows that the person speaking has changed. No additional limitations are present.
With respect to claims 3, 15, and 27, the claims recite detecting said one or more speaker change points within a range of 10-50 second intervals, which reads on a human performing calculations on sections of data representing 10-50 seconds of time. No additional limitations are present.
With respect to claims 4, 6, 16, 18, 28, and 30, the claims recite specific features of the first and second neural networks, which read on a human using a specific set of equations with specific features pertaining to each stage of the process. No additional limitations are present.
With respect to claims 5, 17, and 29, the claims recite that the acoustic features correspond to spectrogram features, which reads on a human using calculations to extract specific values from the numerical data. No additional limitations are present.
With respect to claims 7, 19, and 31, the claims recite that the device is a smartwatch, which reads on a generalized computer component as per [0068-79] of the specification.
With respect to claims 8, 20, and 32, the claims recite capturing the audio data via an adaptive sampling strategy, which reads on a human selecting which values of the numerical data to use for the following calculations. No additional limitations are present.
With respect to claims 9, 21, and 33, the claims recite that the audio data is temporarily retained, which reads on a human tearing up and discarding the documents containing the numerical data once they are no longer needed. No additional limitations are present.
With respect to claims 10, 22, and 34, the claims recite capturing inertial data, extracting inertial features, and extracting neural network embedding features, which reads on a human receiving numerical data representative of inertial values and performing a series of calculations, including the use of a specific set of equations, which results in a desired value indicative of specific information. No additional limitations are present.
With respect to claims 11, 12, 23, 24, 35, and 36, the claims recite combining said extracted embedding features, using concatenation or cross-attention (claims 12, 24, and 36), which reads on a human combining the results of two specific sets of calculations using a specific mathematical process. No additional limitations are present.
These claims likewise fail to integrate the judicial exception into a practical application and fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 5, 7, 9, 13-15, 17, 19, 21, 25-27, 29, 31, and 33 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Neckermann et al. (US PG Pub No. 2021/0407520), hereinafter Neckermann.
Regarding claims 1, 13, and 25, Neckermann teaches
(claim 1) A method for detecting inter-person conversations using a smart wearable device (a method, where the user device is a smart watch, smart glasses, or wearable computer, i.e. using a smart wearable device [0019],[0043]), the method comprising:
(claim 13) A computer program product for detecting inter-person conversations using a smart wearable device, the computer program product comprising one or more computer readable storage mediums having program code embodied therewith, the program code comprising programming instructions for (the system includes storage for information including software program instructions, where the system includes a user device that is a smart watch, smart glasses, or wearable computer, i.e. using a smart wearable device [0019],[0043],[0102]):
(claim 25) A smart wearable device (a user device that is a smart watch, smart glasses, or wearable computer [0043]), comprising:
(claim 25) a memory for storing a computer program for detecting inter-person conversations (computer useable instructions stored on a computer storage media [0019],[0043],[0102]); and
(claim 25) a processor connected to said memory, wherein said processor is configured to execute program instructions of the computer program comprising (a processor executing instructions stored in memory [0019],[0043],[0102]):
capturing audio data on said smart wearable device (the shared audio source, which can be a smart watch, smart glasses, or wearable computer, i.e. smart wearable device, includes a microphone that converts utterances into audio signals, i.e. capturing audio data [0028],[0043],[0045],[0052]);
extracting acoustic features from said captured audio data (the audio source transmits the auditory sound information to the server, i.e. captured audio data, where the voice recognition component uses a machine learning model to extract amplitude, frequency and/or wavelength values, i.e. extracting acoustic features [0045],[0052],[0089]);
extracting neural network embedding features from said extracted acoustic features using a first neural network model (the incoming utterance has features extracted to determine corresponding values, i.e. from said extracted acoustic features, and the utterance is converted into a feature vector corresponding to voice utterance features, i.e. extracting…embedding features, where the feature vectors are part of a feature space that is the output of the voice recognition model, such as an embedding layer that uses deep learning to determine embeddings of features, where the machine learning model can be a neural network, i.e. extracting neural network embedding features…using a first neural network model [0088-9],[0115-6],[0138-40],[0143]); and
fusing said extracted neural network embedding features into a second neural network model to perform user conversation inference (the embedding layer may be prior to the LSTM, i.e. second neural network model, and the LSTM attributes the utterances to specific users, i.e. to perform user conversation inference, using the embedding of the feature vector in feature space, i.e. fusing said extracted neural network embedding features [0089],[0109-10],[0115-6]).
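For orientation only, the claimed two-model arrangement can be pictured as a conventional embed-then-classify pipeline. The following is a minimal PyTorch sketch under that reading; the class names, layer choices, and dimensions are hypothetical illustrations, not the claimed invention and not Neckermann's implementation.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Stand-in for the "first neural network model": acoustic features in, embedding out."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.proj = nn.Linear(16 * 8 * 8, emb_dim)

    def forward(self, spec):                 # spec: (batch, 1, freq_bins, frames)
        return self.proj(self.conv(spec).flatten(1))

class InferenceNet(nn.Module):
    """Stand-in for the "second neural network model": embeddings in, conversation score out."""
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, emb_seq):              # emb_seq: (batch, windows, emb_dim)
        out, _ = self.lstm(emb_seq)
        return torch.sigmoid(self.head(out[:, -1]))  # P(conversation)
```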
Regarding claims 2, 14, and 26, Neckermann teaches claims 1, 13, and 25, and further teaches
detecting one or more speaker change points in said captured audio data corresponding to a boundary of speech turns for different speakers in a conversation using said second neural network model (the LSTM, i.e. using said second neural network model, attributes the utterances to specific users, i.e. to perform user conversation inference, using the embedding of the feature vector of the utterance in feature space, i.e. detecting one or more speaker change points in said captured audio data, where a structured transcript report includes who has spoken and/or who is currently speaking, i.e. corresponding to a boundary of speech turns for different speakers in a conversation [0024-6],[0089],[0109-10],[0115-6]).
Regarding claims 3, 15, and 27, Neckermann teaches claims 1, 13, and 25, and further teaches
detecting said one or more speaker change points within a range of 10-50 second intervals (windows of 10 second utterances are identified as attributed to a specific user, i.e. a range of 10-50 second intervals, where different speakers are identified along with when they spoke, including when a different person is speaking, i.e. detecting said one or more speaker change points [0024-6],[0059],[0089],[0109-10],[0115-6]).
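For illustration, one common realization of windowed change-point detection is to compare embeddings of consecutive fixed-length windows and flag a boundary when their similarity drops. The sketch below assumes NumPy, one embedding per 10-second window, and a hypothetical cosine-similarity threshold; none of these details are drawn from the claims or from Neckermann.

```python
import numpy as np

def change_points(embeddings, threshold=0.7):
    """embeddings: (n_windows, dim) array, one row per 10-second window."""
    points = []
    for i in range(1, len(embeddings)):
        a, b = embeddings[i - 1], embeddings[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < threshold:          # dissimilar neighbors -> likely new speaker
            points.append(i)         # boundary between windows i-1 and i
    return points
```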
Regarding claims 5, 17, and 29, Neckermann teaches claims 1, 13, and 25, and further teaches
said acoustic features correspond to spectrogram features which are inputted to both said first and second neural network models (the audio source transmits the auditory sound information to the server, where the voice recognition component uses a machine learning model to extract amplitude, frequency, i.e. acoustic features correspond to spectrogram features, and/or wavelength values [0045],[0052],[0089], where the incoming utterance has features extracted to determine corresponding values, and the utterance is converted into a feature vector corresponding to voice utterance features, where the feature vectors are part of a feature space that is the output of the voice recognition model, such as an embedding layer that uses deep learning to determine embeddings of features, where the machine learning model can be a neural network, i.e. inputted to…said first neural network model [0088-9],[0115-6],[0138-40],[0143], and where the embedding layer may be prior to the LSTM, i.e. second neural network model, and the LSTM attributes the utterances to specific users, using the embedding of the feature vector in feature space, i.e. inputted to…said…second neural network model [0089],[0109-10],[0115-6]).
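As background on the mapped feature type, a spectrogram is the magnitude of a short-time Fourier transform of the audio. A minimal sketch follows purely to fix terminology; the frame and hop sizes are hypothetical.

```python
import numpy as np

def spectrogram(audio, frame=512, hop=256):
    """Return an (n_frames, frame // 2 + 1) magnitude spectrogram."""
    frames = [audio[i:i + frame] * np.hanning(frame)
              for i in range(0, len(audio) - frame, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))
```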
Regarding claims 7, 19, and 31, Neckermann teaches claims 1, 13, and 25, and further teaches
said smart wearable device is a smartwatch (a user device that is a smart watch, smart glasses, or wearable computer [0043]).
Regarding claims 9, 21, and 33, Neckermann teaches claims 1, 13, and 25, and further teaches
said audio data is temporarily retained in order to preserve privacy of an owner of said audio data (in order to deal with data privacy, a particular quantity of voice data, i.e. in order to preserve privacy of an owner of said audio data, is discarded after a particular time period, i.e. temporarily retained [0028],[0037],[0043],[0045],[0052]).
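For illustration of such time-bounded retention, a minimal sketch follows; the 30-second retention window, the class name, and the use of a deque are assumptions for the sketch, not details taken from Neckermann.

```python
import time
from collections import deque

class RetentionBuffer:
    """Holds audio chunks only for a fixed window, then discards them."""
    def __init__(self, retain_seconds=30.0):
        self.retain_seconds = retain_seconds
        self._chunks = deque()                    # (timestamp, chunk) pairs

    def add(self, chunk):
        now = time.monotonic()
        self._chunks.append((now, chunk))
        # Discard anything older than the retention window.
        while self._chunks and now - self._chunks[0][0] > self.retain_seconds:
            self._chunks.popleft()
```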
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 4, 16, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Neckermann, in view of Zhang et al. (U.S. PG Pub No. 2025/0104727), hereinafter Zhang, and further in view of Muller et al. (U.S. PG Pub No. 2022/0059083), hereinafter Muller.
Regarding claims 4, 16, and 28, Neckermann teaches claims 2, 14, and 26.
While Neckermann provides the use of an LSTM, Neckermann does not specifically teach the use of a bi-LSTM and fully-connected layers, and thus does not teach
said second neural network model comprises…bi-directional long short-term memory layers and three fully-connected layers.
Zhang, however, teaches said second neural network model comprises…bi-directional long short-term memory layers and three fully-connected layers (fused features are fed into a second set of layers to output information related to speech, i.e. second neural network, including a BiLSTM, i.e. comprises…bi-directional long short-term memory layers, followed by three fully connected layers [0043-4]).
Neckermann and Zhang are analogous art because they are from a similar field of endeavor, processing speech where multiple people may be talking using neural networks. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the LSTM teachings of Neckermann with the use of a BiLSTM followed by three fully-connected layers as taught by Zhang. It would have been obvious to combine the references to improve speech quality and speech separation in multi-speaker environments (Zhang [0057]).
While Neckermann in view of Zhang provides the use of a BiLSTM, Neckermann in view of Zhang does not specifically teach the use of two BiLSTMs, and thus does not teach
said second neural network model comprises two bi-directional long short-term memory layers.
Muller, however, teaches said second neural network model comprises two bi-directional long short-term memory layers (the main task performance network, which may perform dialogue processing and speaker identification after acoustic features are extracted, i.e. second neural network model, may comprise two BiLSTM layers, i.e. comprises two bi-directional long short-term memory layers Fig. 8,[0031],[0042],[0055]).
Neckermann, Zhang, and Muller are analogous art because they are from a similar field of endeavor, processing speech for different purposes, such as during a dialog, using neural networks. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the BiLSTM teachings of Neckermann, as modified by Zhang, with the use of two BiLSTM layers to perform dialogue processing and speaker identification as taught by Muller. It would have been obvious to combine the references to enable better performance in a multilingual speech environment than an individually trained monolingual network (Muller Abstract).
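To make the claimed arrangement concrete, a minimal PyTorch sketch of two stacked bidirectional LSTM layers followed by three fully-connected layers is given below; all dimensions are hypothetical, and the sketch is not asserted to match the networks of Zhang or Muller.

```python
import torch.nn as nn

class SecondModel(nn.Module):
    """Two stacked BiLSTM layers, then three fully-connected layers."""
    def __init__(self, in_dim=128, hidden=64, n_classes=2):
        super().__init__()
        # num_layers=2 stacks two bidirectional LSTM layers.
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):                # x: (batch, time, in_dim)
        out, _ = self.bilstm(x)
        return self.fc(out[:, -1])       # classify from the final time step
```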
Claims 6, 10-12, 18, 22-24, 30, and 34-36 are rejected under 35 U.S.C. 103 as being unpatentable over Neckermann, in view of Zhang.
Regarding claims 6, 18, and 30, Neckermann teaches claims 1, 13, and 25.
While Neckermann provides using a neural network to provide feature vectors, Neckermann does not specifically teach that the neural network comprises three 2D convolutional layers, and thus does not teach
said first neural network model comprises three two-dimensional convolutional layers.
Zhang, however, teaches said first neural network model comprises three two-dimensional convolutional layers (time-frequency information representative of a speech signal, such as a spectrogram, is input into a series of speech feature embedding layers, i.e. first neural network model, including two 2D convolutional layers at the beginning and a last 2D convolutional layer at the end, i.e. comprises three two-dimensional convolutional layers Fig. 3A,[0038-41]).
Neckermann and Zhang are analogous art because they are from a similar field of endeavor, processing speech where multiple people may be talking using neural networks. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Neckermann regarding using a neural network to provide feature vectors with the use of three 2D convolutional layers for feature embedding as taught by Zhang. It would have been obvious to combine the references to improve speech quality and speech separation in multi-speaker environments (Zhang [0057]).
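For reference, a three-layer 2D convolutional stack of the kind claimed can be written in a few lines; the channel counts below are hypothetical and are not taken from Zhang's Fig. 3A.

```python
import torch.nn as nn

# Illustrative "first neural network model": three 2D convolutional layers
# over a single-channel spectrogram input.
first_model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
)
```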
Regarding claims 10, 22, and 34, Neckermann teaches claims 1, 13, and 25, and further teaches
capturing inertial data on said smart wearable device (the user device may be a smart watch, smart glasses, or wearable computer, i.e. smart wearable device, where additional user data may be collected via one or more sensors, including gyroscope data, accelerometer data, and data associated with the user such as motion, orientation, and position, i.e. capturing inertial data [0043],[0050],[0052]).
While Neckermann provides capturing sensor data associated with user motion, Neckermann does not specifically teach extracting features from the inertial data, and thus does not teach
extracting inertial features from said captured inertial data; and
extracting neural network embedding features from said extracted inertial features using a third neural network model.
Zhang, however, teaches extracting inertial features from said captured inertial data (the first input comprises time frequency information representative of the ultrasound, i.e. extracting inertial features, which includes the Doppler shift that corresponds to the velocity of the speaker’s articulatory gestures, i.e. captured inertial data Fig. 1,[0032],[0035-9],[0041]); and
extracting neural network embedding features from said extracted inertial features using a third neural network model (a ML model includes a subnetwork that provides feature embedding of the received ultrasound time frequency information, i.e. extracting neural network embedding features from said extracted inertial features, where the subnetwork includes multiple convolution layers, i.e. using a third neural network model Fig. 2,[0032],[0035-9],[0041]).
Neckermann and Zhang are analogous art because they are from a similar field of endeavor, processing speech where multiple people may be talking using neural networks. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Neckermann regarding capturing sensor data associated with user motion with the use of feature embedding from ultrasound data corresponding to the velocity of the speaker’s articulatory gestures as taught by Zhang. It would have been obvious to combine the references to improve speech quality and speech separation in multi-speaker environments (Zhang [0057]).
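For illustration, a "third" embedding network over inertial channels might look like the following one-dimensional convolutional stack; the six IMU channels and layer sizes are assumptions, and Zhang's corresponding subnetwork operates on ultrasound rather than IMU data.

```python
import torch.nn as nn

# Hypothetical "third neural network model": embed accelerometer/gyroscope
# streams (6 channels) into a fixed-length feature vector.
third_model = nn.Sequential(
    nn.Conv1d(6, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),    # -> (batch, 64) embedding
)
```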
Regarding claims 11, 23, and 35, Neckermann in view of Zhang teaches claims 10, 22, and 34, and Zhang further teaches
combining said extracted neural network embedding features from said extracted inertial features with said extracted neural network embedding features from said extracted acoustic features (the output of the ultrasound and noisy spectrogram subnetworks, respectively the U-features and S-features, i.e. said extracted neural network embedding features from said extracted inertial features…said extracted neural network embedding features from said extracted acoustic features, are concatenated and fed into a self-attention fusion layer that fuses the U-features and S-features, i.e. combining…with Fig. 3A,[0035-9],[0042-3]).
The motivation to combine is the same as previously presented.
Regarding claims 12, 24, and 36, Neckermann in view of Zhang teaches claims 11, 23, and 35, and Zhang further teaches
said extracted neural network embedding features from said extracted inertial features are combined with said extracted neural network embedding features from said extracted acoustic features using concatenation or cross-attention (the output of the ultrasound and noisy spectrogram subnetworks, respectively the U-features and S-features, i.e. said extracted neural network embedding features from said extracted inertial features…said extracted neural network embedding features from said extracted acoustic features, are concatenated and fed into a self-attention fusion layer that fuses the U-features and S-features, i.e. combined with…using concatenation Fig. 3A,[0035-9],[0042-3]).
The motivation to combine is the same as previously presented.
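To make the concatenation-plus-attention fusion concrete, a minimal sketch follows; the use of nn.MultiheadAttention as the self-attention fusion layer and all dimensions are assumptions for the sketch, not Zhang's exact architecture.

```python
import torch
import torch.nn as nn

# Two embedding streams (e.g., acoustic and inertial), fused by concatenation
# and then passed through a self-attention layer.
acoustic = torch.randn(4, 10, 128)                 # (batch, time, features)
inertial = torch.randn(4, 10, 128)

fused = torch.cat([acoustic, inertial], dim=-1)    # concatenation: (4, 10, 256)
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
out, _ = attn(fused, fused, fused)                 # self-attention over fused features
```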
Claims 8, 20, and 32 are rejected under 35 U.S.C. 103 as being unpatentable over Neckermann, in view of Ryan et al. (U.S. PG Pub No. 2015/0269954), hereinafter Ryan.
Regarding claims 8, 20, and 32, Neckermann teaches claims 1, 13, and 25.
While Neckermann provides receiving speech through a microphone, Neckermann does not specifically teach the use of an adaptive sampling strategy, and thus does not teach
said audio data is captured via an adaptive sampling strategy.
Ryan, however, teaches said audio data is captured via an adaptive sampling strategy (the sampling rate of the microphone when capturing the audio signal, i.e. audio data is captured, is adaptive, i.e. captured via an adaptive sampling strategy [0016],[0018],[0042],[0048]).
Neckermann and Ryan are analogous art because they are from a similar field of endeavor, processing voice data for speech recognition. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Neckermann regarding receiving speech through a microphone with the use of an adaptive sampling rate for the microphone as taught by Ryan. It would have been obvious to combine the references to reduce the power consumption of the microphone by adaptively changing the sampling rate based on noise conditions (Ryan [0016]).
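For illustration, an adaptive sampling strategy can be as simple as selecting the microphone sampling rate from an ambient-noise estimate; the thresholds and rates below are hypothetical and are not taken from Ryan.

```python
def choose_sample_rate(noise_rms: float) -> int:
    """Pick a sampling rate (Hz) from an estimated ambient-noise level."""
    if noise_rms < 0.01:      # quiet: a low rate may suffice, saving power
        return 8_000
    elif noise_rms < 0.05:    # moderate noise
        return 16_000
    else:                     # noisy: sample at the highest supported rate
        return 44_100
```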
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICOLE A K SCHMIEDER/Primary Examiner, Art Unit 2659