Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant’s amendment filed on 12/15/2025 overcomes the following objection(s) and/or rejection(s):
Claims 1, 2, 12, 13, 17, and 18 are amended and claim 7 is canceled.
Claims 1-6 and 8-20 are pending.
The objection to the specification has been withdrawn as the title has been amended.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-6 and 8-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-6 and 8-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The limitation "and adjusting the initial audio information based on a disfluency detection model and a text correction model to obtain the audio information in the target video, wherein the text correction model is any model for correcting text information that is grammatically fluent but is semantically disfluent by determining whether text conforms to context based on abstract information of the text and abstract information of context" is not clear. It is unclear what "context information" is being referred to in the claim, as it could be referring to context from the video (i.e. concepts in previous frames) or context from the text itself (repeated words or grammar). Furthermore, it is also unclear where "abstract information" is being derived from. Is it from metadata, is it from the model, or is it being extracted from the target video by giving the model before speech recognition? This is unclear. Corresponding claims 12 and 17 are rejected for the same reasons. Claims 2-6, 8-11, 13-16, and 18-20 are rejected based on dependency. The examiner has interpreted this limitation to claim a disfluency detection model and a text correction model.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-6, 8-10, and 12-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Qu et al (CN 109359636 A) in view of Goel et al (US 20190294668 A1) and Lin (CN 111833853 A).
Regarding claim 1, Qu et al teaches a method, comprising: extracting at least two types of modal information including at least audio information from a received target video, wherein extracting the audio information from the received target video comprises (Fig 2 and Page 5, the server first adopts the image classification model 211 for the original video 20 from the image dimension, the audio dimension, and the text dimension. i.e. extracting at least two types of modal information);
extracting, based on a preset feature extraction model, at least two modal features corresponding to the at least two types of modal information (Fig 2 and Page 5-6, image frame extraction and classification are performed to obtain image classification result 212; audio feature extraction and classification is performed on audio of original video 20 by audio classification model 221 to obtain audio classification result 222; text of original video 20 is adopted by text classification model 231 The description information is used for text feature extraction and classification to obtain a text classification result 232. i.e. extracting at least two modal features corresponding to the at least two types of modal information); and fusing the at least two modal features to obtain a target feature of the target video (Fig 2 and Page 5, further, the server combines the image classification result 212, the audio classification result 222, and the text classification result 232 to obtain the target classification result 24 of the original video 20, and further determines the original video according to the probability corresponding to each category indicated by the target classification result 24. i.e. fusing the at least two modal features to obtain a target feature - combining classification result to obtain the target classification result obtaining a target feature).
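For illustration only, the combination of per-modality classification results described in the cited passage of Qu et al can be sketched as follows. The averaging rule, the function names, the category labels, and the probabilities are assumptions made for this sketch and are not taken from the cited passage, which states only that the image, audio, and text classification results are combined into a target classification result.

```python
from typing import Dict, List


def fuse_classification_results(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-category probabilities across modalities.

    Averaging is a hypothetical fusion rule chosen for this sketch.
    A category missing from one modality counts as probability 0.0.
    """
    if not results:
        raise ValueError("at least one classification result is required")
    categories = set()
    for result in results:
        categories.update(result)
    return {
        category: sum(r.get(category, 0.0) for r in results) / len(results)
        for category in categories
    }


def top_category(fused: Dict[str, float]) -> str:
    """Return the category with the highest fused probability."""
    return max(fused, key=fused.get)


# Hypothetical per-modality outputs for one video.
image_result = {"sports": 0.7, "music": 0.3}
audio_result = {"sports": 0.4, "music": 0.6}
text_result = {"sports": 0.6, "music": 0.4}

target_result = fuse_classification_results([image_result, audio_result, text_result])
```

Here `target_result` plays the role of the target classification result, and `top_category(target_result)` selects the category with the highest combined probability.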
Qu et al does not teach inputting the target video into a speech recognition model to obtain initial audio information in the target video; and adjusting the initial audio information based on a disfluency detection model and a text correction model to obtain the audio information in the target video, wherein the text correction model is any model for correcting text information that is grammatically fluent but is semantically disfluent by determining whether text conforms to context based on abstract information of the text and abstract information of context.
In a similar field of endeavor, Goel et al teaches a speech recognition model, including correcting text and processing the text to generate keywords that represent context of the multimedia presented in a form of audio portions - the limitation: inputting the target video into a speech recognition model to obtain initial audio information in the target video; and adjusting the initial audio information based on a disfluency detection model and a text correction model to obtain the audio information in the target video (Para 45, the keyword/keyphrase generation unit 202 can be configured to generate the keywords and/or keyphrases by analyzing the contents of the multimedia. The keyword/keyphrase generation unit 202 extracts the audio portions from the multimedia using an automatic speech recognition (ASR) method. The keyword/keyphrase generation unit 202 uses a matrix of machine learned contextual data models to convert the audio portions into a transcript/textual summary. The textual summary can be structured text or unstructured text and with or without errors. The matrix of the machine learned contextual models represents an arrangement of different types of machine learnt data models carrying the context of the multimedia, which can be used by different methods in their independent capacities to produce outputs in a form of new data. Further, the keyword/keyphrase generation unit 202 performs a post processing of the transcript for filtering errors and other corrections to obtain structured text using the at least one machine learnt model. The structured text can be a punctuated, ordered text without any errors. 
The keyword/keyphrase generation unit 202 further processes the structured text automatically to generate the keywords and/or the keyphrases that represent context of the content of the multimedia presented in a form of the audio portions), but does not explicitly teach wherein the text correction model is any model for correcting text information that is grammatically fluent but is semantically disfluent by determining whether text conforms to context based on abstract information of the text and abstract information of context. Qu et al also does not teach wherein the text correction model is any model for correcting text information that is grammatically fluent but is semantically disfluent by determining whether text conforms to context based on abstract information of the text and abstract information of context.
In a similar field of endeavor, Lin teaches, wherein the text correction model is any model for correcting text information that is grammatically fluent but is semantically disfluent by determining whether text conforms to context based on abstract information of the text and abstract information of context (Page 7, as shown in Figure 4, the recognized text obtained by performing speech recognition for an exemplary speech does not contain punctuation marks, and there are unfluent text components such as modal particles, repeated words, words indicating correction, words indicating the restart of sentences, etc. Therefore, it is necessary to add punctuation marks in the recognized text and remove the text components that are not fluent, while retaining the removal traces of the text components that are not fluent, in order to extract the text features of the speech. i.e. grammatically correct the audio information, in order to extract text features of the speech. Furthermore, Page 7 states, the extracted text features may include voice-corresponding disfluent features, keyword features, semantic features, and pragmatic features. Among them, the feature of disfluentness specifically includes the text disfluentness score, which can be determined according to the ratio between the number of disfluent components contained in the recognized text and the total number of words contained in the recognized text. The proportion of unfluent text is smaller, so the score for unfluent text is also smaller. In an embodiment, the text disfluent score may also be obtained according to a preset machine learning model. i.e. 
disfluentness includes scoring the text with a disfluentness score, which is based on voice-corresponding disfluent features, keyword features, semantic features, and pragmatic features (abstract information of the text and abstract information of the context), and is obtained using a machine learning model (considered determining whether text conforms to context based on abstract information of the text and abstract information of the context). Also see Pages 8-9).
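As a minimal sketch of the ratio-based score quoted from Lin above (number of disfluent components divided by the total number of words in the recognized text), consider the following. The filler-word list, the rule that an immediately repeated word counts as disfluent, and the function name are assumptions made for this sketch; Lin's own criteria for identifying disfluent components are not reproduced here.

```python
from typing import FrozenSet, Iterable

# Hypothetical filler words; not taken from Lin.
FILLERS: FrozenSet[str] = frozenset({"um", "uh", "er"})


def disfluency_score(tokens: Iterable[str], fillers: FrozenSet[str] = FILLERS) -> float:
    """Return disfluent components / total words for a recognized text.

    A token is counted as disfluent if it is a filler word or if it
    immediately repeats the previous token (assumed rule for this sketch).
    """
    words = [token.lower() for token in tokens]
    if not words:
        return 0.0
    disfluent = 0
    previous = None
    for word in words:
        if word in fillers or word == previous:
            disfluent += 1
        previous = word
    return disfluent / len(words)


# Seven words, of which "um", the repeated "I", and "uh" are counted.
score = disfluency_score("um I I want to uh go".split())
```

A text with a smaller share of disfluent components receives a smaller score, consistent with the quoted passage.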
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date to incorporate the teachings of Qu et al (CN 109359636 A) in view of Goel et al (US 20190294668 A1) and Lin (CN 111833853 A) so that the method adjusts the initial audio information based on a disfluency detection model and a text correction model to obtain the audio information. Doing so would allow for generating contextual data elements from a comprehensive analysis of multimedia to identify intent of a user when consuming the multimedia (Para 24, Goel et al). Doing so would also improve the accuracy of the smart scoring system for scoring the user’s spoken language (Page 1, Lin).
Regarding claim 2, Qu et al teaches the method according to claim 1, wherein extracting the at least two types of modal information from the received target video comprises: extracting audio information in the target video from the received target video; extracting text information in the target video from the received target video; and extracting image information in the target video from the received target video (Fig 2 and Page 5-6, image frame extraction and classification are performed to obtain image classification result 212; audio feature extraction and classification is performed on audio of original video 20 by audio classification model 221 to obtain audio classification result 222; text of original video 20 is adopted by text classification model 231 The description information is used for text feature extraction and classification to obtain a text classification result 232. i.e. extracting audio, text, and video information from the target video).
Regarding claim 3, Qu et al teaches the method according to claim 2, wherein extracting, based on the preset feature extraction model, the at least two modal features corresponding to the at least two types of modal information comprises: extracting a speech feature of the audio information based on a preset speech feature extraction model; extracting a text feature of the text information based on a preset text feature extraction model; and extracting an image feature of the image information based on a preset image feature extraction model (Fig 2 and Page 5-6, image frame extraction and classification are performed to obtain image classification result 212; audio feature extraction and classification is performed on audio of original video 20 by audio classification model 221 to obtain audio classification result 222; text of original video 20 is adopted by text classification model 231 The description information is used for text feature extraction and classification to obtain a text classification result 232. i.e. extracting audio, text, and video information from the target video based on respective target models).
Regarding claim 4, Qu et al teaches the method according to claim 3, wherein extracting the image information in the target video from the received target video comprises: extracting a target object and/or video frame picture information in the target video from the received target video (Page 5-6 step 302, and further, Page 8-9 and step 302B, RGB image frames are classified by the residual network and the RGB classifier in the first classification model to obtain a first image classification result, and the RGB classifier is used to classify based on the static image features. i.e. extracting a target object and/or video frame picture information (can be RGB difference frame in step 302C or fine-grained classification in step 302E) from the received target video).
Regarding claim 5, Qu et al teaches the method according to claim 4, wherein extracting the image feature of the image information based on the preset image feature extraction model comprises: extracting an object feature of the target object based on a first preset image feature extraction model, and/or extracting a picture feature of the video frame picture information based on a second preset image feature extraction model (Page 5-6 step 302, and further, Page 8-10 and step 302B, RGB image frames are classified by the residual network and the RGB classifier in the first classification model to obtain a first image classification result, and the RGB classifier is used to classify based on the static image features. i.e. extracting a target object and/or video frame picture information (can be RGB difference frame in step 302C or fine-grained classification in step 302E) from the received target video).
Regarding claim 6, Qu et al teaches the method according to claim 5, wherein fusing the at least two modal features to obtain the target feature of the target video comprises: fusing the speech feature, the text feature, the object feature, and the picture feature to obtain the target feature of the target video (Page 10-11, in this embodiment, while lifting the overall feature of the image, the server performs fine-grained image feature extraction on the RGB image frame through the target detection network, and fuses the extracted fine-grained image features for classification, thereby further improving the accuracy of the image classification result. And comprehensive. i.e. fusing the modal features includes the fine-grained image feature extraction (picture feature) along with the visual feature, text feature, and speech feature).
Regarding claim 8, Qu et al does not teach the method according to claim 2, wherein extracting the text information in the target video from the received target video comprises: extracting a target video frame from the received target video in a preset extraction manner; inputting the target video frame into a text recognition model to obtain initial text information in the target video;
and adjusting the initial text information based on a disfluency detection model and a text correction model to obtain the text information in the target video.
In a similar field of endeavor, Goel et al teaches, the method according to claim 2, wherein extracting the text information in the target video from the received target video comprises: extracting a target video frame from the received target video in a preset extraction manner; inputting the target video frame into a text recognition model to obtain initial text information in the target video (Para 46, the keyword/keyphrase generation unit 202 can be further configured to use an Optical Character Recognition (OCR) method to identify text portions from the extracted video portions (including image frames) of the multimedia);
and adjusting the initial text information based on a disfluency detection model and a text correction model to obtain the text information in the target video (Para 46, the keyword/keyphrase generation unit 202 processes the identified text portions to generate the keywords and/or keyphrases that represent context of the content presented in a form of the text/visual content (of the image frames). Further, in Para 47, The summary generation unit 204 can be configured for generating the summary for the contents of the multimedia. The summary can include at least one of a text summary and a video/visual summary. The summary generation unit 204 can process the structured text generated associated with the textual summary (generated by the keyword/keyphrase generation unit 202) using at least one of an extractive technique and an abstractive technique to generate the text summary. i.e. the disfluency detection model can be considered structuring the text generated with the textual summary using an abstractive technique to generate the text summary. The summary model can be considered disfluency/correction and creating brevity from the extracted text features).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date to incorporate the teachings of Qu et al (CN 109359636 A) in view of Goel et al (US 20190294668 A1) and Lin (CN 111833853 A) so that the method adjusts the initial text information based on a disfluency detection model and a text correction model to obtain the text information in the target video. Doing so would allow for generating contextual data elements from a comprehensive analysis of multimedia to identify intent of a user when consuming the multimedia (Para 24, Goel et al).
Regarding claim 9, Qu et al teaches the method according to claim 5, wherein extracting the target object and/or video frame picture information in the target video from the received target video comprises: extracting a target video frame from the received target video in a preset extraction manner (Page 8, Step 302A, determining an original image frame extracted from the target video as an RGB image frame);
inputting the target video frame into an object recognition model to obtain the target object in the target video and attribute information of the target object; and/or inputting the target video frame into an image recognition model to obtain the video frame picture information in the target video (Page 5-6 step 302, and further, Page 8-10 and step 302B, RGB image frames are classified by the residual network and the RGB classifier in the first classification model to obtain a first image classification result, and the RGB classifier is used to classify based on the static image features. i.e. extracting a target object and/or video frame picture information (can be RGB difference frame in step 302C or fine-grained classification in step 302E) from the received target video).
Regarding claim 10, Qu et al teaches the method according to claim 9, wherein extracting the object feature of the target object based on the first preset image feature extraction model comprises: inputting the target object in the target video and the attribute information of the target object into the first preset image feature extraction model to extract the object feature of the target object (Page 5-6, correspondingly, after the server extracts the image frame from the target video, the image frame is input into the first classification model, and the image features of the image frame are extracted by the deep learning network in the first classification model, and the image features are further classified by the classifier, thereby obtaining image classification results. i.e. inputting the target object in the target video frame and the attribute to further extract the object feature (classification results)).
Regarding claim 12, claim 12 is rejected for the same reasons as claim 1. Claim 12 further recites a computer device, comprising: a processor; and a memory, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to (Qu et al, Page 2, another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, the at least one instruction, the at least one program The set of codes or sets of instructions are executed by the processor to implement a video classification device as described in the above aspects).
Regarding claim 13, claim 13 is rejected for the same reasons as claim 2. Claim 13 further recites a computer device, comprising: a processor; and a memory, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to (Qu et al, Page 2, another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, the at least one instruction, the at least one program The set of codes or sets of instructions are executed by the processor to implement a video classification device as described in the above aspects).
Regarding claim 14, claim 14 is rejected for the same reasons as claim 3. Claim 14 further recites a computer device, comprising: a processor; and a memory, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to (Qu et al, Page 2, another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, the at least one instruction, the at least one program The set of codes or sets of instructions are executed by the processor to implement a video classification device as described in the above aspects).
Regarding claim 15, claim 15 is rejected for the same reasons as claim 4. Claim 15 further recites a computer device, comprising: a processor; and a memory, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to (Qu et al, Page 2, another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, the at least one instruction, the at least one program The set of codes or sets of instructions are executed by the processor to implement a video classification device as described in the above aspects).
Regarding claim 16, claim 16 is rejected for the same reasons as claim 5. Claim 16 further recites a computer device, comprising: a processor; and a memory, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to (Qu et al, Page 2, another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, the at least one instruction, the at least one program The set of codes or sets of instructions are executed by the processor to implement a video classification device as described in the above aspects).
Regarding claim 17, claim 17 is rejected for the same reasons as claim 12.
Regarding claim 18, claim 18 is rejected for the same reasons as claim 13.
Regarding claim 19, claim 19 is rejected for the same reasons as claim 14.
Regarding claim 20, claim 20 is rejected for the same reasons as claim 15.
Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Qu et al (CN 109359636 A) in view of Goel et al (US 20190294668 A1), Lin (CN 111833853 A), and Ye et al (CN 110737801 A).
Regarding claim 11, Qu et al, Goel et al, and Lin do not teach the method according to claim 1, wherein fusing the at least two modal features to obtain the target feature of the target video comprises: separately encoding the at least two modal features, and fusing the at least two encoded modal features to obtain the target feature of the target video.
In a similar field of endeavor, Ye et al teaches the method according to claim 1, wherein fusing the at least two modal features to obtain the target feature of the target video comprises: separately encoding the at least two modal features, and fusing the at least two encoded modal features to obtain the target feature of the target video (Page 5-6, for the audio file, obtain the corresponding mel spectrogram. Feature extraction. NetVlad (Net Vector of locally aggregated descriptors) clusters and encodes the extracted vector to obtain the audio feature vector. NetVlad can save the distance between each feature point and its nearest cluster center. i.e. Pages 5-6 discuss encoding two encoded modal features (audio, video, or text feature) with separate encoders. See Page 7 regarding fusing of audio, video, or text feature).
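For illustration only, the "separately encode, then fuse" arrangement discussed above can be sketched as follows. The mean-pooling encoder is a simple placeholder and is not NetVLAD; concatenation as the fusion step, the function names, and the feature values are likewise assumptions made for this sketch and are not taken from Ye et al.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[Sequence[float]]], Vector]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Placeholder encoder: average frame-level features into one vector."""
    if not frames:
        raise ValueError("at least one feature frame is required")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def encode_and_fuse(
    features: Dict[str, Sequence[Sequence[float]]],
    encoders: Dict[str, Encoder],
) -> Vector:
    """Encode each modality with its own encoder, then concatenate the results.

    Modalities are processed in sorted name order so the layout of the
    fused vector is deterministic.
    """
    fused: Vector = []
    for name in sorted(features):
        fused.extend(encoders[name](features[name]))
    return fused


# Hypothetical frame-level features for two modalities.
modal_features = {
    "audio": [[1.0, 3.0], [3.0, 5.0]],
    "text": [[2.0], [4.0]],
}
modal_encoders = {"audio": mean_pool, "text": mean_pool}

target_feature = encode_and_fuse(modal_features, modal_encoders)
```

Each modality keeps its own entry in `modal_encoders`, so a different encoder could be substituted per modality without changing the fusion step.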
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date to incorporate the teachings of Qu et al (CN 109359636 A) in view of Goel et al (US 20190294668 A1) and Lin (CN 111833853 A) and Ye et al (CN 110737801 A) so that fusing the at least two modal features to obtain the target feature of the target video comprises separately encoding the at least two modal features. Doing so would provide a content classification method, device, computer device, and storage medium for the problems of fineness and accuracy of the content categories obtained by the above classification (Ye et al., Page 1).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
US 20190378501 A1
H. Li, J. Zhu, C. Ma, J. Zhang and C. Zong, "Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video," in IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 996-1009, 1 May 2019, doi: 10.1109/TKDE.2018.2848260
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACK PETER KRAYNAK whose telephone number is (703)756-1713. The examiner can normally be reached Monday - Friday 7:30 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACK PETER KRAYNAK/Examiner, Art Unit 2668
/UTPAL D SHAH/Primary Examiner, Art Unit 2668