Last updated: May 29, 2026

Application No. 17/809,046

Systems and Methods for Video Genre Classification

Final Rejection §103

Filed

Jun 27, 2022

Examiner

REINIER, BARBARA DIANE

Art Unit

2682

Tech Center

2600 — Communications

Assignee

Microsoft Technology Licensing, LLC

OA Round

4 (Final)

Interview Optional

+9.5% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has an 80% grant rate with a +9.5% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 641 resolved cases, 2023–2026

Examiner Intelligence

REINIER, BARBARA DIANE

Grants 80% — above average

Career Allowance Rate

511 granted / 641 resolved

+17.7% vs TC avg

Moderate +10% lift

+9.5%

Interview Lift

resolved cases with interview

Typical timeline

2y 7m

Avg Prosecution

10 currently pending

Career history

664

Total Applications

across all art units

Statute-Specific Performance

§101

8.1%

-31.9% vs TC avg

§103

65.8%

+25.8% vs TC avg

§102

11.0%

-29.0% vs TC avg

§112

14.6%

-25.4% vs TC avg

Based on career data from 641 resolved cases. Comparisons are against an estimated Tech Center average.

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-3, 5, 7-13, 15-20 and 22-24 have been considered but are moot because of the new ground of rejection based on the amendment. Although the applicant asserts that the previously cited prior art does not teach particular features, the applicant neither discusses nor challenges the prior art further.
The 35 USC 112(f) interpretation of claims 15-20 is maintained as previously presented.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-3, 7-11, 13, 15-20 and 22-24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gaur et al., (US Pub No. 20200275158) in view of Lin et al., (US Pub No. 20220014807) and in further view of Gupta et al., (US Pub No. 20230017489).
Claim 8: Gaur discloses a system comprising:
a processing system; and a memory coupled to the processing system, the memory comprising computer executable instructions that, when executed by the processing system [processor and memory, p0019-0020], perform operations comprising:
receiving video content [the media device 110 may receive media content 122 from one or more content delivery networks (CDNs) 120 … media content 122 may include television shows, movies, and/or other media content created by a third-party content creator or provider (e.g., television network, production studio, streaming service, and the like), p0021];
generating sampled data by sampling a plurality of sliding windows of the video content, the plurality of sliding windows comprising audio data and video data [video and audio frames of received content … images 402 may include pixel data to be displayed (e.g., as video content) over the duration of the K frames [i.e., sliding windows] … audio frames 404 may include audio data to be output (e.g., as audio content) over the duration of the K frames [i.e., sliding windows], p0043 & p0073];
identifying a set of audio features by analyzing the audio data, wherein the set of audio features comprises audio information for a set of speakers represented in the sampled data [e.g., horror movies often contain scenes with blood, screams [e.g., speaker(s)], and/or tense music. Thus, in order to identify a scene as a horror scene, the label detection module 232 may be configured to detect blood (e.g., from received video frames), screams (e.g., from received video and/or audio frames), and tense music (e.g., from receive audio frames) in the received content items 201, p0043];
identifying a set of video features by analyzing the video data, wherein the set of video features comprises video information for a set of objects in the sampled data [e.g., horror movies often contain scenes with blood [e.g., object(s)], screams, and/or tense music. Thus, in order to identify a scene as a horror scene, the label detection module 232 may be configured to detect blood (e.g., from received video frames), screams (e.g., from received video and/or audio frames), and tense music (e.g., from receive audio frames) in the received content items 201, p0043];
detecting a first genre and a second genre for the video content using the set of audio features and the set of video features [the neural network 230 may be configured to perform deep content tagging in various forms of media (e.g., by identifying one or more content genres associated with individual portions or segments of the media), p0046];
generating indexed video content by indexing the video content based on the first genre and the second genre, wherein indexing the video content comprises adding one or more keywords associated with the first genre and the second genre to the video content such that the one or more keywords are displayed with the video content during playback of the video content [each content item and/or scene may be tagged or labeled with one or more associated genres using locally-generated genre information (e.g., by the neural network application 114) … the genre map 202 may include a listing of one or more scenes (e.g., in the form of a timeline) detected in the content item 201 and an indication of one or more genres associated with each scene … the genre map repository 240 may be categorized or indexed based on the content items 201 … the media playback interface 250 may generate an interactive output 203 based on the content items 201 and genre maps 202 … output 203 may be displayed, via the display interface 260 … the scene selection module 254 may tag or label each content item 201 with one or more associated genres using the genre map 202 associated with the content item 201, p0035, p0047-0049 & p0053];
selecting a subset of Artificial Intelligence (AI) models from a set of AI models based on the first genre and the second genre, wherein a first AI model of the subset of AI models is configured to evaluate the first genre and a second AI model of the subset of AI models is configured to evaluate the second genre [CNNs 410(1)-410(4) are configured to infer genre information from a number (K) of frames of media content. For example, each of the CNNs 410(1)-410(4) may be an example embodiment of the deep content genre tag generator 300 of FIG. 3. Thus, each of the CNNs 410(1)-410(4) may generate a respective genre tag 412-418 based on a different media component 402-408 of the K frames, p0072]; and
evaluating the indexed video content using the subset of AI models, wherein evaluating the indexed video content comprises generating insights for the video content using the subset of AI models [the genre map 422 is generated by aggregating individual genre tags 412-418 produced by respective CNNs 410(1)-410(4), p0077].
Although Gaur discloses displaying content based at least in part on genre during playback [the media playback interface 250 may generate an interactive output 203 based on the content items 201 and genre maps 202 … media playback interface 250 is configured to render the content items 201 for display while providing a user interface through which the user may control, navigate, or otherwise manipulate playback of the content items 201 based, at least in part, on the genre maps 202, p0049], Gaur does not appear to explicitly disclose where the one or more keywords are displayed with the video content during playback of the video content.
Lin discloses in a related system from the same field of endeavor [Abstract] that the one or more keywords are displayed with the video content [generate corresponding length of text caption by acquiring the length information of the text caption to be generated, so as to meet the needs of different application scenarios, p0155].
It would have been obvious to persons of ordinary skill in the art before the effective filing date of the invention to have modified Gaur to support where the one or more keywords are displayed with the video content as disclosed by Lin because it allows videos to be accessible to viewers who are deaf, hard of hearing, or watching without sound.
Although Lin strongly suggests where captioned text (i.e., keywords) is output (i.e., displayed) for a video or image [p0046], Lin does not appear to explicitly discuss playback of the video content.
Gupta discloses in a related system [Abstract] where one or more keywords are displayed with the video content during playback of the video content [Based on the retrieved metadata and the image analysis of the scene 201, the media player application may identify six frames of interest 206, 208, 210, 214, 216, and metadata 218, 220, 222, and 224 associated with some of the frames … e.g., the media player application may include a text overlay of the metadata 222 in the generated preview image 128 (e.g., “Spiderman takes Captain America's shield”), p0033-0035].
It would have been obvious to persons of ordinary skill in the art before the effective filing date of the invention to have modified Gaur in view of Lin to support where one or more keywords are displayed with the video content during playback of the video content as disclosed by Gupta because it allows the user to more easily identify a point of interest during a playback operation as discussed by Gupta in at least paragraphs 0035-0037.
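For orientation, two of the operations recited in claim 8 (sampling a plurality of sliding windows, and selecting a subset of AI models based on the detected genres) can be sketched as follows. This is only an illustrative sketch under stated assumptions; it is neither the applicant's implementation nor the architecture disclosed by Gaur. The names `sliding_windows`, `MODEL_REGISTRY`, and `select_models`, the window size and stride, and the placeholder "insights" strings are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence

def sliding_windows(frames: Sequence, size: int, stride: int) -> List[Sequence]:
    """Return fixed-size windows over `frames`, advancing by `stride` each step."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, stride)]

# Hypothetical registry: one evaluator per genre. Real models would be
# trained networks; these stand-ins just return a placeholder string.
MODEL_REGISTRY: Dict[str, Callable[[Sequence], str]] = {
    "horror": lambda video: "horror insights",
    "comedy": lambda video: "comedy insights",
    "sports": lambda video: "sports insights",
}

def select_models(genres: Iterable[str]) -> Dict[str, Callable[[Sequence], str]]:
    """Pick the subset of registered models matching the detected genres."""
    return {g: MODEL_REGISTRY[g] for g in genres if g in MODEL_REGISTRY}

# Ten dummy frames, windows of four frames advancing two frames at a time.
frames = list(range(10))
windows = sliding_windows(frames, size=4, stride=2)

# Assume genre detection (not shown) returned these two genres.
selected = select_models(("horror", "comedy"))
insights = {genre: model(frames) for genre, model in selected.items()}
```

The sketch makes the distinction in the claim language concrete: the genres drive which models are chosen, whereas the cited passage of Gaur describes each CNN operating on a different media component of the K frames.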

Claim 9: Gaur in view of Lin and Gupta discloses the system of claim 8, wherein the video content is received via an upload to a web-based platform [Examples of media devices may include, but are not limited to, personal computing devices (e.g., desktop computers, laptop computers, netbook computers, tablets, web browsers, e-book readers, and personal digital assistants (PDAs)), data input devices (e.g., remote controls and mice), data output devices (e.g., display screens and printers), remote terminals, kiosks, video game machines (e.g., video game consoles, portable gaming devices, and the like), communication devices (e.g., cellular phones such as smart phones), media devices (e.g., recorders, editors, and players such as televisions, set-top boxes, music players, digital photo frames, and digital cameras), and the like, p0019].  

Claim 10: Gaur in view of Lin and Gupta discloses the system of claim 8, wherein each sliding window of the plurality of sliding windows has a predetermined size [each scene classification 303 may include one or more elementary labels 302 from the last K frames 301 of media content … multiple genres may be inferred from the elementary labels 302 and/or scene classification 303 associated with a given set of frames, p0067-0068].  

Claim 11: Gaur in view of Lin and Gupta discloses the system of claim 8, wherein the audio data and video data are comprised in keyframes of the video content [the spatio-temporal filter 320 may detect scene transitions and/or boundaries based, at least in part, on the continuity of video and/or audio data in consecutive frames 301 of media content, p0067].  

Claim 13: Gaur in view of Lin and Gupta discloses the system of claim 8, further comprising calculating a probability factor using the set of audio features and the set of video features, wherein detecting the first genre and the second genre comprises using the probability factor [the genre detector 330 may use one or more neural network models to infer a probability or likelihood of the scene classification 303 matching each of a plurality of pre-identified genres … the genre detector 330 may associated multiple genres with any given scene, p0068-0070].
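One way to read the "probability factor" of claim 13 is as a per-genre probability computed from combined audio and video scores, from which the two most likely genres are taken. The sketch below assumes a softmax over summed scores; that choice, the function names, and the example scores are hypothetical and are not drawn from the application or from Gaur.

```python
import math
from typing import Dict, Tuple

def genre_probabilities(audio_scores: Dict[str, float],
                        video_scores: Dict[str, float]) -> Dict[str, float]:
    """Combine per-genre audio and video scores and normalize with a softmax."""
    genres = sorted(set(audio_scores) | set(video_scores))
    if not genres:
        raise ValueError("at least one genre score is required")
    combined = {g: audio_scores.get(g, 0.0) + video_scores.get(g, 0.0) for g in genres}
    peak = max(combined.values())  # subtract the peak for numerical stability
    exps = {g: math.exp(s - peak) for g, s in combined.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}

def top_two(probs: Dict[str, float]) -> Tuple[str, str]:
    """Return the two genres with the highest probability."""
    if len(probs) < 2:
        raise ValueError("need at least two genres")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[1][0]

# Made-up scores for illustration only.
audio = {"horror": 2.0, "comedy": 0.5}
video = {"horror": 1.5, "thriller": 1.0, "comedy": 0.2}
probs = genre_probabilities(audio, video)
first_genre, second_genre = top_two(probs)
```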

Claims 1-3: the methods herein have been executed or performed by the systems of claims 8-10 and are therefore likewise rejected.

Claim 7: Gaur in view of Lin and Gupta discloses the method of claim 1, further comprising applying a pipeline of video analysis models based on the first genre and the second genre [CNNs 410(1)-410(4) are configured to infer genre information from a number (K) of frames of media content. For example, each of the CNNs 410(1)-410(4) may be an example embodiment of the deep content genre tag generator 300 of FIG. 3. Thus, each of the CNNs 410(1)-410(4) may generate a respective genre tag 412-418 based on a different media component 402-408 [interpreted to be indicative of some type of pipeline] of the K frames, p0072].

Claims 15-19: the systems herein have been executed or performed by the systems of claims 8-11 and 13 and are therefore likewise rejected.

Claim 20: Gaur in view of Lin and Gupta discloses the system of claim 15, wherein a determined set of video analysis models is applied to the video content based on the first genre and the second genre [CNNs 410(1)-410(4) are configured to infer genre information from a number (K) of frames of media content. For example, each of the CNNs 410(1)-410(4) may be an example embodiment of the deep content genre tag generator 300 of FIG. 3. Thus, each of the CNNs 410(1)-410(4) may generate a respective genre tag 412-418 based on a different media component 402-408 of the K frames, p0072].  

Claim 22: the system herein has been executed or performed by the system of claim 8 and is therefore likewise rejected.

Claim 23: Gaur in view of Lin and Gupta discloses the method of claim 22, wherein selecting the set of AI models comprises: configuring a set of parameters for each AI model in the set of AI models based on at least one of the first genre and the second genre, the set of AI models comprising a plurality of AI models; and evaluating the video content using each AI model in the set of AI models based on a respective set of parameters for each AI model [CNNs 410(1)-410(4) are configured to infer genre information from a number (K) of frames of media content. For example, each of the CNNs 410(1)-410(4) may be an example embodiment of the deep content genre tag generator 300 of FIG. 3. Thus, each of the CNNs 410(1)-410(4) may generate a respective genre tag 412-418 based on a different media component 402-408 of the K frames, p0072-0073].

Claim 24: the system herein has been executed or performed by the system of claim 8 and is therefore likewise rejected.

Claim(s) 5 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gaur et al., (US Pub No. 20200275158) in view of Lin et al., (US Pub No. 20220014807) and in further view of Gupta et al., (US Pub No. 20230017489) and Jasinschi et al., (US Pub No. 20020159750).
Claim 12: Gaur in view of Lin and Gupta discloses the system of claim 8, wherein the set of audio features comprises a number of speakers in the video content [e.g., horror movies often contain scenes with blood, screams [e.g., speaker(s)], and/or tense music. Thus, in order to identify a scene as a horror scene, the label detection module 232 may be configured to detect blood (e.g., from received video frames), screams (e.g., from received video and/or audio frames), and tense music (e.g., from receive audio frames) in the received content items 201, p0043].  
Neither Gaur, Lin, nor Gupta appears to explicitly disclose a number of speakers.
Jasinschi discloses in a related system from the same field of endeavor [Abstract] wherein the set of audio features comprises a number of speakers in the video content [In the audio domain, for each twenty-two (22) ms temporal window "a segment" classification is realized between silence, noise, speech, music, speech plus noise, speech plus speech, and speech plus music categories … a change in speaker [indicative of more than one speaker] or subject [i.e., video feature] could indicate a significant change in the video content information, it may be desirable to divide the video segments in such as way as to respect speaker change information, p0025 & p0029].
It would have been obvious to persons of ordinary skill in the art before the effective filing date of the invention to have included in Gaur in view of Lin and Gupta the support wherein the set of audio features comprises a number of speakers in the video content as taught by Jasinschi because it allows for better discernment of the genre of the media as discussed by Jasinschi in at least paragraph 0029.

Claim 5: the method herein has been executed or performed by the system of claim 12 and is therefore likewise rejected.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Zhang et al., Chinese Pub No. 113537371, discloses a machine learning classifier pipeline.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
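The reply periods above can be worked through for this action's mailing date, which the prosecution timeline lists as May 15, 2026. The sketch below is a rough calendar calculation only: the `add_months` helper is hypothetical, and the calculation does not adjust for weekends or federal holidays, does not model the case where an advisory action is mailed late, and is no substitute for docketing the dates against the rules.

```python
import calendar
from datetime import date

def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`,
    clamping to the last day of the target month when needed."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

mailing_date = date(2026, 5, 15)  # "Final Rejection mailed" per the timeline

two_month_mark = add_months(mailing_date, 2)        # first-reply window for the advisory-action rule
shortened_period_end = add_months(mailing_date, 3)  # end of the shortened statutory period
statutory_maximum = add_months(mailing_date, 6)     # latest possible date with extensions

for label, when in [("Two-month mark", two_month_mark),
                    ("Shortened period ends", shortened_period_end),
                    ("Six-month maximum", statutory_maximum)]:
    print(f"{label}: {when.isoformat()}")
```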
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BARBARA D REINIER whose telephone number is (571)270-5082. The examiner can normally be reached M-Tu 10am - 6pm.
Examiner interviews are available via telephone and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu can be reached at 571-272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/BARBARA D REINIER/Primary Examiner, Art Unit 2682

Prosecution Timeline

Sep 18, 2025

Request for Continued Examination

Sep 30, 2025

Response after Non-Final Action

Nov 18, 2025

Non-Final Rejection mailed — §103

Jan 22, 2026

Interview Requested

Feb 03, 2026

Examiner Interview Summary

Feb 03, 2026

Applicant Interview (Telephonic)

Feb 18, 2026

Response Filed

May 15, 2026

Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/764,707

Patent 12632951

METHOD FOR DETECTING DEFECT AND METHOD FOR TRAINING MODEL

4y 1m to grant Granted May 19, 2026

18/456,552

Patent 12634405

IMAGE PROCESSING APPARATUS, PRINTING SYSTEM, AND IMAGE PROCESSING METHOD

2y 8m to grant Granted May 19, 2026

17/978,384

Patent 12602910

METHOD FOR DETECTING DEFECT AND METHOD FOR TRAINING MODEL

3y 5m to grant Granted Apr 14, 2026

18/063,492

Patent 12542859

METHOD OF DETERMINING THE CONCENTRATION OF AN ANALYTE IN A SAMPLE OF A BODY FLUID USING A CAMERA AND A COLOR REFERENCE CARD

3y 1m to grant Granted Feb 03, 2026

17/857,098

Patent 12536685

IMAGE FEATURE MATCHING METHOD, COMPUTER DEVICE, AND STORAGE MEDIUM

3y 6m to grant Granted Jan 27, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

5-6

Expected OA Rounds

80%

Grant Probability

89%

With Interview (+9.5%)

2y 7m (~0m remaining)

Median Time to Grant

High

PTA Risk

Based on 641 resolved cases by this examiner. Grant probability derived from career allowance rate.
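The projected figures can be reproduced from the counts shown earlier on this page (511 granted out of 641 resolved). The sketch below assumes the interview lift is added in percentage points to the unrounded allowance rate and that results are rounded to whole percents; the page does not state its exact method, so treat this as one plausible reading.

```python
granted = 511                # granted cases, from the career allowance figures
resolved = 641               # resolved cases, from the same figures
interview_lift_points = 9.5  # interview lift, assumed additive in percentage points

allowance_rate = 100.0 * granted / resolved
with_interview = allowance_rate + interview_lift_points

print(f"Grant probability: {round(allowance_rate)}%")
print(f"With interview:    {round(with_interview)}%")
```

Under these assumptions the rounded values agree with the 80% and 89% shown in the projections.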