DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments filed 12/01/2025 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5 and 7-20 are rejected under 35 U.S.C. 103 as being unpatentable over Gautam et al. (US 2022/0414381 A1 – hereinafter Gautam), Evans et al. (US 2022/0335972 A1 – hereinafter Evans), and Chaudhuri et al. (US 10,381,022 B1 – hereinafter Chaudhuri).
Regarding claim 1, Gautam discloses a method, comprising: obtaining a video comprising a plurality of frames (Fig. 2; [0052] – obtaining an input video comprising a plurality of frames); identifying a particular frame of the plurality of frames ([0025] – identifying a particular frame to be extracted from the plurality of frames); obtaining one or more representations associated with the particular frame from a model (Fig. 2; [0026]-[0028] – obtaining an audio vector), the model trained using inputs from one or more of a plurality of data pairs, text representations, and image representations ([0040] – the model trained using inputs from at least image representations, i.e., the training video input); generating, by the model, one or more recommended audio segments associated with the particular frame based on a similarity of the one or more representations to one or more database audio segments stored in a database (Fig. 2; [0038] – generating one or more audio sequences as audio recommendations associated with the particular frame for the video input based on the similarity of the audio vector 216 to one or more database audio segments stored in database 220); and combining the particular audio segment with the particular frame ([0083] – embedding the particular audio segment with the particular frame of the input video sequence).
However, Gautam does not disclose causing a graphical user interface (GUI) to display one or more recommended audio segments associated with the particular frame; obtaining an additional text representation that includes a label for the particular frame; updating the one or more recommended audio segments associated with the particular frame based on the additional text representation; causing the GUI to display the updated one or more recommended audio segments associated with the particular frame; and obtaining a selection of a particular audio segment from the updated one or more recommended audio segments.
Evans discloses causing a graphical user interface (GUI) to display one or more recommended audio segments associated with the particular frame (Fig. 4 – displaying a GUI comprising one or more recommended audio segments associated with a particular frame of a video, i.e., a particular frame in the scene); obtaining, from a user input, an additional text representation ([0109]; Fig. 4 – a user enters additional text/search terms); updating one or more recommended audio segments associated with a particular frame based on the additional text representation ([0109]; Fig. 4 – with an audio element type selection control being selected, performing the search across relevant sources to provide recommended audio segments represented by corresponding thumbnails, and displaying the search results to the user via a GUI as shown in Fig. 4; the audio segments found in the search results are associated with at least a particular frame of a video scene); causing the GUI to display the updated one or more recommended audio segments associated with the particular frame ([0109]-[0111]; Fig. 4 – displaying the search results to the user via a GUI as shown in Fig. 4); and obtaining a selection of a particular audio segment from the updated one or more recommended audio segments (Fig. 4; [0112] – obtaining a selected audio segment from the search results to add to the scene).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of Evans into the method taught by Gautam to enhance the user interface of the method and to allow the user to add additional audio segments into a scene.
However, Gautam and Evans do not disclose the additional text representation including a label for the particular frame.
Chaudhuri discloses obtaining an additional text representation that includes a label for a particular frame (column 3, line 60 – column 4, line 7; column 8, lines 24-38 – obtaining additional text representation as a search query that includes a label ‘birthday’ for a particular video frame, and presenting audio segments as search results based on such a search query).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of Chaudhuri into the method taught by Gautam and Evans to quickly search audio segments for a particular video frame.
Regarding claim 2, Gautam in view of Evans and Chaudhuri also discloses the method of claim 1, wherein the one or more recommended audio segments are recommended in view of a prior particular audio selection (Fig. 2 – in view of a prior selection of the audio vector 216).
Regarding claim 3, Gautam in view of Evans and Chaudhuri also discloses the method of claim 1, wherein one or more of the plurality of data pairs, the text representations, and the image representations are mean pooled together prior to being the inputs to the model.
Official Notice is taken that mean pooling prior to inputting to a model is well known in the art.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate mean pooling of image representations into the method taught by Gautam, Evans, and Chaudhuri to enable more efficient and effective model training.
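For illustration only of the well-known mean pooling technique noted above (this sketch is not drawn from any applied reference; the array values and dimensionality are assumptions chosen for illustration), multiple representations may be averaged element-wise before being supplied to a model:

```python
import numpy as np

# Assumed example inputs: a text embedding and an image embedding of equal
# dimensionality (4-dimensional vectors chosen purely for illustration).
text_repr = np.array([0.2, 0.4, 0.1, 0.3])
image_repr = np.array([0.6, 0.0, 0.3, 0.5])

# Mean pooling: element-wise average of the stacked representations,
# producing a single fused vector that can serve as the model input.
pooled_input = np.mean(np.stack([text_repr, image_repr]), axis=0)
print(pooled_input)  # [0.4 0.2 0.2 0.4]
```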
Regarding claim 4, Gautam in view of Evans and Chaudhuri also discloses the method of claim 1, wherein the one or more representations are individually assigned a weight based on a representation type ([0026]-[0028]; Fig. 2 – at least an audio vector is assigned a weight of 1).
Regarding claim 5, see the teachings of Gautam in view of Evans and Chaudhuri as discussed in claim 4 above, in which Gautam also discloses that the user can input configurable parameters ([0057]). However, Gautam, Evans, and Chaudhuri do not disclose that, in response to a second user input, the weight is adjusted.
Receiving a user input to adjust a weight assigned to an input of a system is well known in the art.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate receiving a second user input to adjust the weight into the method of claim 4 above to allow the user to correct the influence of the representation if he or she finds it to be incorrect.
Regarding claim 7, see the teachings of Gautam, Evans, and Chaudhuri as discussed in claim 1 above. However, Gautam, Evans, and Chaudhuri do not disclose that the one or more representations include text and are obtained from the particular frame via optical character recognition.
Deriving representations in the form of text obtained via optical character recognition (OCR) of a particular video frame is well known in the art, e.g., from captions, subtitles, etc.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate generating representations in the form of text obtained via OCR of a particular video frame into the method taught by Gautam, Evans, and Chaudhuri to enrich the sources used to generate the representations.
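By way of an illustrative sketch only (the use of the pytesseract library, the example file name, and the chosen frame index are assumptions rather than teachings of any applied reference), text may be derived from a particular video frame via OCR as follows:

```python
import cv2
import pytesseract

# Assumed example: open a video, seek to a particular frame index, and run OCR
# on that frame to obtain a text representation (e.g., captions or subtitles).
video = cv2.VideoCapture("input_video.mp4")
video.set(cv2.CAP_PROP_POS_FRAMES, 120)  # frame index chosen for illustration
ok, frame = video.read()

if ok:
    # pytesseract accepts a NumPy image array; convert BGR to RGB first.
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_text = pytesseract.image_to_string(rgb_frame)
    print(frame_text.strip())
```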
Regarding claim 8, see the teachings of Gautam, Evans, and Chaudhuri as discussed in claim 1 above. However, Gautam, Evans, and Chaudhuri do not disclose that the similarity between the one or more representations and the database audio segments is maximized using a gradient-descent based machine learning technique.
Minimizing a cost function using a gradient-descent based machine learning technique is well known in the art.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate minimizing a cost function using a gradient-descent based machine learning technique, thereby maximizing the similarity between the one or more representations and the database audio segments, into the method taught by Gautam, Evans, and Chaudhuri to improve model performance.
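Purely as an illustrative sketch of the well-known gradient-descent approach (PyTorch is an assumed framework; the tensor names, dimensions, learning rate, and iteration count are hypothetical), a cost defined as the negative cosine similarity can be minimized, thereby maximizing the similarity between a frame representation and a database audio embedding:

```python
import torch

torch.manual_seed(0)

# Hypothetical fixed target: a database audio-segment embedding.
audio_embedding = torch.randn(16)

# Hypothetical learnable frame representation, updated by gradient descent.
frame_repr = torch.randn(16, requires_grad=True)
optimizer = torch.optim.SGD([frame_repr], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    # Cost = negative cosine similarity; minimizing it maximizes similarity.
    cost = -torch.nn.functional.cosine_similarity(frame_repr, audio_embedding, dim=0)
    cost.backward()
    optimizer.step()

print(float(-cost))  # approaches 1.0 as the representations align
```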
Regarding claim 9, see the teachings of Gautam, Evans, and Chaudhuri as discussed in claim 1 above. However, Gautam, Evans, and Chaudhuri do not disclose that the similarity between the one or more representations and the database audio segments is measured using one of a cosine similarity or a negative mean standard error similarity.
Using at least a cosine similarity to measure the similarity of data is well known in the art.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate using a cosine similarity to measure the similarity between the one or more representations and the database audio segments into the method taught by Gautam, Evans, and Chaudhuri because the cosine similarity method is well known for its effectiveness.
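As a minimal illustrative sketch (the example vectors are assumptions and not drawn from the applied references), a cosine similarity, and for comparison a negative mean-squared-error style measure, can each be computed as follows:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors normalized by the product of their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def negative_mse_similarity(a, b):
    # Larger (less negative) values indicate more similar vectors.
    return -np.mean((a - b) ** 2)

representation = np.array([0.2, 0.4, 0.1, 0.3])
audio_segment_embedding = np.array([0.25, 0.35, 0.15, 0.30])

print(cosine_similarity(representation, audio_segment_embedding))
print(negative_mse_similarity(representation, audio_segment_embedding))
```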
Regarding claim 10, see the teachings of Gautam, Evans, and Chaudhuri as discussed in claim 1 above, in which Gautam also discloses that the database audio segments are precomputed and stored in the database ([0037]; Fig. 2; Fig. 4).
However, Gautam, Evans, and Chaudhuri do not disclose that the database audio segments are searched using a nearest neighbor search.
Searching using a nearest neighbor search is well known in the art.
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate a nearest neighbor search to search the database audio segments into the method taught by Gautam, Evans, and Chaudhuri because nearest neighbor search is well known for its efficiency and ease of implementation.
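For illustration only (the library choice, database size, and embedding dimensionality are assumptions), a nearest neighbor search over precomputed database audio-segment embeddings might be sketched as:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical precomputed database of 1,000 audio-segment embeddings (64-D).
database_audio_embeddings = rng.normal(size=(1000, 64))

# Fit a nearest neighbor index over the precomputed embeddings.
index = NearestNeighbors(n_neighbors=5, metric="cosine")
index.fit(database_audio_embeddings)

# Query with a representation for the particular frame; retrieve the five
# most similar database audio segments as recommendation candidates.
frame_representation = rng.normal(size=(1, 64))
distances, indices = index.kneighbors(frame_representation)
print(indices[0])
```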
Regarding claim 11, Evans in view of Gautam and Chaudhuri also discloses the method of claim 1, wherein the selection of the particular audio segment relative to the one or more recommended audio segments is in response to a second user input (Fig. 4; [0112]). The motivation for incorporating the teachings of Evans into the method has been discussed in claim 1 above.
Regarding claim 12, Evans in view of Gautam and Chaudhuri also discloses the method of claim 1, wherein the particular frame is selected by a second user input relative to the video displayed in a graphical user interface (Figs. 4-5 – a particular frame is selected by selection of the scene within which the particular frame is located). The motivation for incorporating the teachings of Evans into the method has been discussed in claim 1 above.
Claim 13 is rejected for the same reason as discussed in claim 1 above in view of Gautam also disclosing a system (Fig. 2; Fig. 4), comprising: a model (Fig. 2; Fig. 4 – any of audio classification model and video classification model); a database (Figs. 2, 4 – database 220); a processor, operable to perform the recited method ([0098]).
Claim 14 is rejected for the same reason as discussed in claim 2 above.
Claim 15 is rejected for the same reason as discussed in claim 3 above.
Claim 16 is rejected for the same reason as discussed in claim 4 above.
Claim 17 is rejected for the same reason as discussed in claim 8 above.
Claim 18 is rejected for the same reason as discussed in claim 9 above.
Claim 19 is rejected for the same reason as discussed in claim 11 above.
Claim 20 is rejected for the same reason as discussed in claim 12 above.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Gautam, Evans, and Chaudhuri as applied to claims 1-5 and 7-20 above, and further in view of Oplustil Gallegos et al. (US 2024/0087558 A1 – hereinafter Oplustil Gallegos).
Regarding claim 6, see the teachings of Gautam, Evans, and Chaudhuri as discussed in claim 1 above. However, Gautam, Evans, and Chaudhuri do not disclose that a data pair of the plurality of data pairs is a text-audio pair.
Oplustil Gallegos discloses a data pair of the plurality of data pairs is a text-audio pair ([0275]).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of Oplustil Gallegos into the method taught by Gautam, Evans, and Chaudhuri to enrich the sources to generate the representations.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUNG Q DANG whose telephone number is (571)270-1116. The examiner can normally be reached IFT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Thai Q Tran can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HUNG Q DANG/Primary Examiner, Art Unit 2484