Last updated: May 04, 2026

Application No. 18/789,429

VIDEO-TO-MUSIC MACHINE LEARNING MODEL

Non-Final OA §102 §103

Filed

Jul 30, 2024

Examiner

KANG, ANNABELLE

Art Unit

2695

Tech Center

2600 — Communications

Assignee

Lemon Inc.

OA Round

1 (Non-Final)

Interview Optional

Interview lift (-8.3%) is below the 15.0% threshold. A written response is recommended.

Based on 16 resolved cases, 2023–2026

Examiner Intelligence

KANG, ANNABELLE

Grants 81% — above average

Career Allowance Rate

13 granted / 16 resolved

+19.3% vs TC avg

Interview Lift: -8.3% (minimal), comparing resolved cases with and without an interview

Typical timeline

2y 8m

Avg Prosecution

23 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§101: 6.9% (-33.1% vs TC avg)

§103: 56.1% (+16.1% vs TC avg)

§102: 31.8% (-8.2% vs TC avg)

§112: 5.2% (-34.8% vs TC avg)

Based on career data from 16 resolved cases

Office Action

§102 §103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1 and 11 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang (CN 118366416 A, hereinafter “Wang”).

Regarding claim 1, Wang teaches a computing system comprising: one or more processing devices configured to: receive an input video including a plurality of frames (see pg. 1 ¶5-6, pg. 4 ¶8: inputting at least one set of image features (associated with the visual frame) and motion features into a pretrained vector quantization (VQ) generation model);
at a video-to-music machine learning model including a video encoder and an autoregressive decoder (see pg. 4 ¶9-10, pg. 7 ¶4-9: a pretrained visual encoder used in the training process of the VQ generation model, and a Jukebox decoder (a technique used in machine learning models, particularly in the context of GANs)):
compute a plurality of video feature tensors at the video encoder based at least in part on the input video (see pg. 4 ¶5: normalizing each of the video units in the time dimension to obtain a standard video unit, a process that inherently leads toward a tensor representation);
and autoregressively generate a plurality of music tokens at the autoregressive decoder based at least in part on the video feature tensors (see pg. 7 ¶3-12: obtaining a vector quantized (VQ) representation sequence for audio, which converts continuous audio signals into a discrete sequence of symbols or tokens), wherein the video-to-music machine learning model has been trained using:
a training data set including a plurality of training input pairs that each include a training input video and respective training background music (see pg. 4-5: building a training sample dataset including one set of sample motion features extracted from a sample video in time sequence and at least one set of background music matching the sample video, as sample audio);
and a loss function including a video-music contrastive loss term and an autoregressive loss term (see pg. 5-6: a loss function, with a feature matching loss function as the optimization objective function);
convert the music tokens into background music associated with the input video (see pg. 7 ¶10, pg. 8 ¶5-6: modeling the correlation between video and music, and generating background music consistent with the video by synthesizing the VQ representation sequence into audio);
and output the background music (see pg. 7 ¶10, pg. 8 ¶5-6: synthesizing the VQ representation sequence into audio as the background music for the video to be processed, and outputting it).

Regarding claim 11, the claimed limitations are those of a method claim directly corresponding to system claim 1; claim 11 is therefore rejected for substantially similar reasons as claim 1, as discussed above.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2, 5-10, 12, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (CN 118366416 A, hereinafter “Wang”) in view of Sanil (US 20170024614 A1, hereinafter “Sanil”).

Regarding claim 2, Wang teaches that the autoregressive loss term includes a video-music alignment measure that is computed for each of the training input pairs at least in part by: computing a plurality of music beat locations within the training background music; computing a plurality of video beat locations within the training input video; and computing the video-music alignment (see pg. 7 ¶13 – pg. 8 ¶1: a beat coverage score and a beat hit score, which evaluate how the rhythm of the music corresponds to the beats of a video).
Wang does not mention a weighting factor. However, it would have been obvious to a person of ordinary skill in the art to evaluate the video-music alignment using a weighting factor. Official notice is taken that it is well known to use a weighting factor to ensure that results accurately represent the data set being evaluated. A weighting factor assigns a weight, or importance, to a given data point within a group, and applying one here would not yield unexpected results.

Regarding claim 5, Wang does not explicitly teach that the one or more processing devices are configured to compute the video-music alignment weighting factor at least in part by determining whether, for each of the video beat locations, the training background music includes a music beat location within a predefined temporal distance of that video beat location.
However, it would have been obvious to a person of ordinary skill in the art to compute a video-music alignment weighting factor by determining, for each video beat, whether a music beat occurs within a predefined temporal distance, because this is a standard and well-known approach to quantifying temporal correspondence between rhythmic events, and applying this weighting factor to guide music alignment is one way of implementing it.
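For illustration only, the check recited in claim 5 can be sketched as follows. The Office Action does not reproduce the application's formula for the weighting factor, so turning the per-beat checks into a hit fraction is an assumption, and the name `beat_hit_fraction` and its parameters are hypothetical.

```python
from bisect import bisect_left


def beat_hit_fraction(video_beats, music_beats, tolerance):
    """Fraction of video beats with a music beat within `tolerance` seconds.

    Assumption: the alignment score is a simple hit fraction; the
    application's actual weighting formula is not given in the source.
    """
    if not video_beats:
        return 0.0
    music = sorted(music_beats)
    hits = 0
    for t in video_beats:
        i = bisect_left(music, t)
        # The nearest music beat is either just before or just after t.
        neighbours = music[max(i - 1, 0):i + 1]
        if any(abs(m - t) <= tolerance for m in neighbours):
            hits += 1
    return hits / len(video_beats)


score = beat_hit_fraction([1.0, 2.0, 3.0, 4.0], [1.05, 2.5, 3.9], tolerance=0.15)
```

In this toy input, the video beats at 1.0 s and 4.0 s each have a music beat within 0.15 s, while those at 2.0 s and 3.0 s do not.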

Regarding claim 6, Wang does not explicitly teach that the one or more processing devices are configured to compute the music beat locations at least in part by: processing the training background music at a pretrained music tokenizer model to obtain a plurality of training music tokens; and performing onset detection on the training music tokens to identify the music beat locations.
However, it would have been obvious to a person of ordinary skill in the art to process the training background music at a pretrained music tokenizer model to obtain music tokens and to perform onset detection to identify music beat locations, because tokenizing musical input and detecting onsets are standard techniques for extracting temporal structure from music, and applying them to determine beats in Wang’s system is a predictable and known method.

Regarding claim 7, Wang teaches that the video-music contrastive loss term is computed between: aggregated video representations computed by applying mean pooling to respective training video feature tensors computed from the training input videos; and aggregated music representations computed by applying mean pooling to estimated music tokens predicted at the video-to-music machine learning model (see pg. 5 ¶7-15: performing global average pooling and global maximum pooling).
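As a rough sketch of the limitation recited in claim 7, the following shows mean pooling over the time axis followed by a contrastive comparison of pooled video and music vectors. Neither the claim language nor the cited passage of Wang specifies the form of the contrastive loss, so the InfoNCE-style, video-to-music-only formulation, the cosine similarity, and the `temperature` value are all assumptions, and every function name is hypothetical.

```python
import math


def mean_pool(sequence):
    """Average a list of equal-length feature vectors over the time axis."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(video_vecs, music_vecs, temperature=0.1):
    """InfoNCE-style loss (assumed form): video i should match music i."""
    total = 0.0
    for i, v in enumerate(video_vecs):
        logits = [cosine(v, m) / temperature for m in music_vecs]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += log_denominator - logits[i]
    return total / len(video_vecs)


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # two time steps, two features
matched = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is small when each pooled video vector is most similar to its own paired music vector and large when the pairs are swapped.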

Regarding claim 8, Wang does not explicitly teach that the video encoder includes a plurality of spatial down sampling blocks, or that, at each of the spatial down sampling blocks, the one or more processing devices are configured to spatially downscale a respective intermediate video representation computed at the video encoder.
However, it would have been obvious to a person of ordinary skill in the art to include down sampling blocks in the video encoder. The examiner takes official notice that down sampling is notoriously well known in the art as a way to reduce spatial resolution, decrease computational cost, and retain essential visual features for further processing. Applying such blocks to Wang’s encoder would have been an obvious way to process video efficiently without altering its functionality.
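A minimal sketch of spatial down sampling, assuming 2x2 average pooling on a single-channel grid. The Office Action does not say which down sampling operation the application or Wang uses, so the pooling choice and the name `downsample_2x` are illustrative assumptions.

```python
def downsample_2x(frame):
    """Halve height and width of a 2-D grid by averaging each 2x2 block.

    A trailing odd row or column is dropped.
    """
    height, width = len(frame), len(frame[0])
    return [
        [
            (frame[r][c] + frame[r][c + 1]
             + frame[r + 1][c] + frame[r + 1][c + 1]) / 4
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


small = downsample_2x([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
])
```

Each output cell summarizes four input cells, which is the reduction in spatial resolution and computation that the rejection refers to.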

Regarding claim 9, Wang does not explicitly teach that the spatial down sampling blocks are interspersed among a plurality of transformer blocks.
However, it would have been obvious to a person of ordinary skill in the art to perform down sampling at intermediate stages. The examiner takes official notice that doing so to reduce computational load while preserving essential features is notoriously well known in the art. Arranging down sampling within an encoder is standard practice in video processing, and applying this modification would have been obvious and would yield no unexpected results.

Regarding claim 10, Wang does not explicitly teach that the autoregressive decoder includes a plurality of causal attention blocks alternating with a plurality of multi-head attention blocks.
However, it would have been obvious to a person of ordinary skill in the art for an autoregressive decoder to include a plurality of causal attention blocks alternating with a plurality of multi-head attention blocks. The examiner takes official notice that alternating attention patterns are standard in sequence modeling for capturing both past dependencies and contextual information, and are notoriously well known in the art. Hence, applying this structure in Wang’s decoder would yield no unexpected results.
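To make the term "causal attention" concrete, the sketch below computes attention weights in which position i can attend only to positions 0 through i. It shows the masking step alone, on precomputed scores. The alternating multi-head attention blocks recited in the claim are not shown, and the function name is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future positions (j > i) masked."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # positions 0..i only
        peak = max(visible)               # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


w = causal_attention_weights([
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
])
```

The large scores in the upper triangle have no effect on the result, which is what lets a decoder generate tokens one at a time without seeing later ones.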

Regarding claims 12 and 15-19, the claimed limitations are those of method claims directly corresponding to system claims 2 and 5-9; they are therefore rejected for substantially similar reasons as claims 2 and 5-9, as discussed above.

Regarding claim 20, the claimed limitations directly correspond to those of claims 1, 2, and 5; claim 20 is therefore rejected for substantially similar reasons as claims 1, 2, and 5, as discussed above.

Claims 3-4 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (CN 118366416 A, hereinafter “Wang”) in view of Sanil (US 20170024614 A1, hereinafter “Sanil”).

Regarding claim 3, Wang teaches video beat locations (see pg. 7 ¶13) but does not explicitly teach that the video beat locations are computed at least in part by: computing a plurality of optical flow magnitudes of respective training frame pairs included in the training input video; and computing the video beat locations within the training input video based at least in part on the optical flow magnitudes.
However, Sanil teaches computing a plurality of optical flow magnitudes of respective training frame pairs included in the training input video, and computing the video beat locations within the training input video based at least in part on the optical flow magnitudes (see [0065]-[0072]: calculating the optical flow (measured motion) between frames).
Wang and Sanil are considered to be analogous to the claimed invention because both are in the field of analyzing video content in motion. It would have been obvious to a person of ordinary skill in the art to modify Wang to determine video beat locations using optical flow magnitudes as taught by Sanil, because optical flow provides a known and predictable technique for measuring frame-to-frame motion, and video beats in dance or motion-based content correspond to changes in motion intensity. The combination applies a known motion analysis technique to obtain timing information required by Wang, yielding no unpredictable results.
	
Regarding claim 4, Wang is silent as to the video beat locations being local maxima of the optical flow magnitudes.
However, Sanil teaches that the video beat locations are local maxima of the optical flow magnitudes (see [0065]-[0072]: local maxima are identified, and the segments of the video corresponding to the local maxima may be selected as hotspots, which in this case are the beat locations).
Wang and Sanil are considered to be analogous to the claimed invention because both are in the field of analyzing video content in motion. It would have been obvious to a person of ordinary skill in the art to determine video beat locations as the local maxima of optical flow magnitudes, because peaks in motion intensity naturally correspond to beat points, and using known maxima of optical flow to identify temporal events in video is a predictable application of Sanil’s teachings.
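The limitation discussed for claims 3 and 4 can be sketched as peak picking over a series of optical flow magnitudes. The sketch assumes the magnitudes have already been computed, one per consecutive frame pair, by some optical flow estimator that is not shown. The strict-inequality peak test, the index-to-time mapping, and the function names are illustrative assumptions rather than anything stated in Wang, Sanil, or the application.

```python
def local_maxima(magnitudes):
    """Indices whose value is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i - 1] < magnitudes[i] > magnitudes[i + 1]
    ]


def beat_times(magnitudes, fps):
    """Convert peak indices to seconds.

    Assumption: magnitudes[i] belongs to frame pair (i, i + 1) and is
    timestamped at frame i.
    """
    return [i / fps for i in local_maxima(magnitudes)]


flow = [0.1, 0.9, 0.2, 0.3, 0.8, 0.4, 0.4]
peaks = local_maxima(flow)
times = beat_times(flow, fps=2.0)
```

Endpoints and flat plateaus are not treated as peaks in this version; a practical detector would likely also apply smoothing or a minimum peak height.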

Regarding claims 13-14, the claimed limitations are those of method claims directly corresponding to system claims 3-4; they are therefore rejected for substantially similar reasons as claims 3-4, as discussed above.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNABELLE KANG whose telephone number is (571)270-3403. The examiner can normally be reached Monday-Thursday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin, can be reached at 571-272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANNABELLE KANG/Examiner, Art Unit 2695   

/VIVIAN C CHIN/Supervisory Patent Examiner, Art Unit 2695

Prosecution Timeline

Jul 30, 2024

Application Filed

Mar 21, 2026

Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/227,038

Patent 12610194

VIBRATION APPARATUS AND APPARATUS INCLUDING THE SAME

2y 8m to grant Granted Apr 21, 2026

18/330,327

Patent 12604141

ULTRA-LOW FREQUENCY SOUND COMPENSATION METHOD AND SYSTEM BASED ON HAPTIC FEEDBACK, AND COMPUTER-READABLE STORAGE MEDIUM

2y 10m to grant Granted Apr 14, 2026

18/080,608

Patent 12581255

SYSTEMS AND METHODS FOR ASSESSING HEARING HEALTH BASED ON PERCEPTUAL PROCESSING

3y 3m to grant Granted Mar 17, 2026

18/313,349

Patent 12556868

Speaker

2y 9m to grant Granted Feb 17, 2026

18/067,545

Patent 12549895

DYNAMIC WIND DETECTION FOR ADAPTIVE NOISE CANCELLATION (ANC)

3y 1m to grant Granted Feb 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2

Grant Probability: 81%

With Interview (-8.3%): 73%

Median Time to Grant: 2y 8m (~11m remaining)

PTA Risk: Low

Based on 16 resolved cases by this examiner. Grant probability derived from career allowance rate.
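The headline figures above are consistent with simple arithmetic on the examiner's resolved cases, sketched below. Treating the interview lift as an additive change in percentage points is an assumption; the page does not state how the 73% figure was derived.

```python
granted, resolved = 13, 16

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100 * granted / resolved

# Interview lift as reported on the page, assumed to be in percentage points.
interview_lift = -8.3

# Projected grant probability if an interview is held.
with_interview = allowance_rate + interview_lift
```

Rounded to whole percentages, these values reproduce the 81% grant probability and the 73% with-interview figure shown in the projections.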