DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions
Claims 6-9 are withdrawn from further consideration pursuant to 37 CFR 1.142(b) as being drawn to a nonelected system, there being no allowable generic or linking claim. Election was made without traverse in the reply filed on 12/05/2025.
Priority
Acknowledgment is made of applicant's claim for foreign priority based on an application filed in China on 01/18/2024. It is noted, however, that applicant has not filed a certified copy of the CN202410080108.X application as required by 37 CFR 1.55.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-5 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Lipsmeier et al. (US 2023/0172526 A1) in view of Hao et al. (US 2022/0280098 A1).
Regarding claim 1, Lipsmeier et al. (‘526) teach a method for early diagnosis of Parkinson's disease (“Parkinson’s disease” see [0105]) based on multimodal deep learning (“deep learning models” see [0208]), comprising: (1) acquiring audio data of a to-be-diagnosed subject while performing a speech task (“identifying a plurality of segments of the voice recording” see [0007]); (2) preprocessing the audio data to extract a plurality of audio segments, and calculating a Mel-spectrogram of each of the plurality of audio segments (“computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments” see [0007]); and (3) inputting the Mel-spectrogram into a multimodal deep learning model to output a classification result for Parkinson's disease early diagnosis of the to-be-diagnosed subject, wherein the multimodal deep learning model comprises a local feature extraction module, an audio feature extraction module, a feedforward network and a cross-attention module (“classifying the subject as belonging to one of the plurality of UHDRS dysarthria score classes” see [0123]); wherein step (3) is performed through steps of: (3.1) extracting audio features from the Mel-spectrogram through the audio feature extraction module (“computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments” see [0007]); and (3.2) inputting the audio features to the feedforward network, and inputting the audio features to the cross-attention module to learn a cross-modal attention weight (“threshold may be determined as a weighted average of the relative energy values assumed to correspond to signal and the relative energy values assumed to correspond to background noise” see [0042]); performing feature fusion based on the cross-modal attention weight to obtain multimodal features; and outputting the classification result based on the multimodal features (“determining the speech rate associated with the voice recording comprises computing the total number of words in the recording and dividing the total number of words by the length of the recording” see [0048]).
Lipsmeier et al. fail to explicitly teach wherein each of the plurality of audio segments corresponds to a synchronized one among the plurality of video segments. However, Hao et al. (‘098), from the same field of endeavor, do teach a plurality of audio segments corresponding to a synchronized one among the plurality of video segments (“Parkinson's symptoms assessor 134 may convey the one or more Parkinson's symptoms assessments to the user in the form of audio, video, text, or any other manner” see [0051]); extracting a face image sequence from each of the plurality of video segments (“video processing to extract features from collected video of an individual moving their face” see [0038]); and inputting the face image sequence as part of the multimodal features (“assess Parkinson's disease symptoms” see [0053]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lipsmeier et al. with the features of Hao et al. for the benefit of more accurate Parkinson's disease symptom identification (see Lipsmeier et al. [0009]).
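For illustration only (not part of the claim mapping or the record), the Mel-spectrogram computation recited in step (2) of claim 1 can be sketched in NumPy. All parameter values (16 kHz sample rate, 512-point FFT, 40 Mel bands, etc.) are assumptions for the sketch and are not drawn from the claims or the cited references:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal and apply a Hann window to each frame.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    # Log-compressed Mel energies: shape (n_frames, n_mels).
    return np.log(power @ fb.T + 1e-10)

# One second of a synthetic 440 Hz tone standing in for an audio segment.
seg = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = mel_spectrogram(seg)
```

The resulting time-by-frequency array is the kind of input a VGGish-style audio feature extractor, as recited in claim 3, would consume.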
Regarding claim 2, Lipsmeier et al. (‘526) in view of Hao et al. (‘098) teach the method of claim 1, wherein the local feature extraction module comprises a visual front-end network and a visual temporal network; the visual front-end network is based on ShuffleNet-V2, and further comprises a two-dimensional (2D) convolution module; the visual front-end network is configured to encode the face image sequence into a frame-based embedding sequence; and the visual temporal network consists of a video temporal convolution module, and is configured to capture facial motion visual features in different time intervals; and the step of extracting the visual features from the face image sequence through the local feature extraction module comprises: extracting facial visual features from each frame of the face image sequence through the visual front-end network, and extracting the visual features from the facial visual features through the visual temporal network, wherein the visual features are time-correlated (see Hao et al. [0036]).
Regarding claim 3, Lipsmeier et al. (‘526) in view of Hao et al. (‘098) teach the method of claim 1, wherein the audio feature extraction module is a VGGish network provided with a convolution module; the audio feature extraction module is configured to extract the audio features at different time intervals from the plurality of audio segments; and the step of extracting the audio features from the Mel-spectrogram through the audio feature extraction module comprises: inputting the Mel-spectrogram into the audio feature extraction module, and extracting the audio features through the VGGish network, wherein the audio features are time-correlated (see Lipsmeier et al. [0007]).
Regarding claim 4, Lipsmeier et al. (‘526) in view of Hao et al. (‘098) teach the method of claim 1, wherein step (3.2) comprises: after the visual features and the audio features pass through the feedforward network, inputting the visual features and the audio features into the cross-attention module with the visual features as key vectors and value vectors and the audio features as query vectors to learn the cross-modal attention weight, and acquiring visual feature-enhanced audio features based on the cross-modal attention weight; inputting the visual features and the audio features into the cross-attention module with the audio features as the key vectors and the value vectors and the visual features as the query vectors to learn the cross-modal attention weight, and acquiring audio feature-enhanced visual features based on the cross-modal attention weight; fusing the visual feature-enhanced audio features with the audio features to obtain first fused features; fusing the audio feature-enhanced visual features with the visual features to obtain second fused features; and concatenating the first fused features with the second fused features to obtain the multimodal features (see Hao et al. [0045]).
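For illustration only (not part of the claim mapping), the bidirectional cross-attention fusion recited in claim 4 can be sketched with scaled dot-product attention in NumPy. The residual-style fusion by addition and the mean-pooling before concatenation are assumptions of the sketch; the claim recites "fusing" and "concatenating" without specifying the operations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: the query modality attends over the other.
    d = query.shape[-1]
    weights = softmax(query @ key.T / np.sqrt(d))  # cross-modal attention weight
    return weights @ value

rng = np.random.default_rng(0)
T_a, T_v, d = 8, 6, 16                 # audio steps, video steps, feature dim
audio = rng.standard_normal((T_a, d))  # audio features from the feedforward net
visual = rng.standard_normal((T_v, d)) # visual features from the feedforward net

# Audio queries attend over visual keys/values, and vice versa.
vis_enhanced_audio = cross_attention(audio, visual, visual)   # (T_a, d)
aud_enhanced_visual = cross_attention(visual, audio, audio)   # (T_v, d)

# Fuse each enhanced stream with its original modality, then concatenate.
first_fused = vis_enhanced_audio + audio        # (T_a, d)
second_fused = aud_enhanced_visual + visual     # (T_v, d)
multimodal = np.concatenate([first_fused.mean(0), second_fused.mean(0)])  # (2d,)
```

The concatenated vector corresponds to the "multimodal features" from which the classification result is produced in step (3.2).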
Regarding claim 5, Lipsmeier et al. (‘526) in view of Hao et al. (‘098) teach the method of claim 1, wherein the multimodal deep learning model is trained through steps of: collecting a plurality of sets of audio-visual data of a plurality of test subjects while performing the speech task, wherein the plurality of test subjects comprise Parkinson's disease patients and healthy subjects; performing disease severity evaluation according to a unified Parkinson's disease rating scale (UPDRS) to annotate and score the plurality of sets of audio-visual data; constructing a training data set based on the plurality of sets of annotated audio-visual data; and, based on the training data set, training the multimodal deep learning model by means of a cross-entropy loss and a stochastic gradient descent optimizer until a preset number of iterations is reached (see Lipsmeier et al. [0123]-[0126]).
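For illustration only (not part of the claim mapping), the training scheme recited in claim 5 (cross-entropy loss, gradient-descent updates, a preset number of iterations) can be sketched for a linear classifier head on fused features. The random features, two-class labels, learning rate, and iteration count are all assumptions of the sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 32))   # fused multimodal features per subject
y = rng.integers(0, 2, size=64)     # 0 = healthy subject, 1 = Parkinson's patient
W = np.zeros((32, 2))               # classifier head weights
lr, n_iters = 0.1, 200              # learning rate and preset iteration count

for _ in range(n_iters):
    probs = softmax(X @ W)
    # Gradient of the mean cross-entropy loss for a linear softmax head.
    grad = X.T @ (probs - np.eye(2)[y]) / len(y)
    W -= lr * grad                  # gradient-descent update

# Mean cross-entropy after training (ln 2 ~= 0.693 before training).
loss = -np.log(softmax(X @ W)[np.arange(64), y] + 1e-12).mean()
```

Training stops after the preset number of iterations, matching the stopping condition recited in the claim.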
Regarding claim 10, the claim is rejected mutatis mutandis in view of the rejection of claim 1 above by Lipsmeier et al. (‘526) in view of Hao et al. (‘098), including at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising the operations described in relation to the disclosed methods (see Lipsmeier et al. [0130]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK REMALY whose telephone number is (571)270-1491. The examiner can normally be reached Mon - Fri 9:00 - 6:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Christopher Koharski can be reached at (571) 272-7230. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARK D REMALY/Primary Examiner, Art Unit 3797