DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendments, filed 12/29/2025, have been entered and made of record. Claims 1, 9, and 17 have been amended. Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments in the Remarks filed on 12/29/2025 have been considered but are moot in view of the new ground(s) of rejection.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Mahyar in view of Mitra and Akella
Claims 1-3, 5, 9-11, 13, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Mahyar et al. (USPN 10,455,297; hereinafter Mahyar) in view of Mitra et al. (USPubN 2023/0262237; hereinafter Mitra) further in view of Akella et al. (USPubN 2019/0139441; hereinafter Akella).
As per claim 1, Mahyar teaches a method comprising:
identifying video features and audio features of an audio-video program (“content may be separated into segments using one or more video processing algorithms, text processing algorithms, and/or audio processing algorithms to identify and/or determine scenes that may take place in various portions of the content. The identified segments may be analyzed to determine the most important portions of the segments (which may or may not be the entire segment). Importance or relevance to a certain theme may be determined using one or more scores generated for the segment. For example, a video score may be generated for the segment or for portions of the segment indicative of various activities or events that occur in the video, objects that appear, and other video-based features of the video content” in Col. 2 lines 53-67, Col. 3 lines 1-3);
processing the video features and the audio features using a semantic video cut machine learning model that is trained to (i) segment the audio-video program into multiple scenes and (ii) cluster the scenes based on one or more user preferences (“This disclosure relates to, among other things, devices, systems, methods, computer-readable media, techniques, and methodologies for customized video content summary generation and presentation. Certain embodiments may automatically generate custom content summaries for digital content, such as video content (e.g., movies, television programs, streaming content, etc.), based at least in part on user data, such as user preference data indicative of types of content the user prefers to consume. Certain embodiments generate content summaries using, for example, analysis of audio, video, and text (e.g., closed captioning, synopses, metadata, etc.) components of digital content, such as a movie, video, other digital content. Certain embodiments determine aspects of content summaries, such as length and/or scene selection, using machine learning and other methodologies. In some instances, machine learning (e.g., deep neural networks, long short term memory units and/or recurrent neural networks, etc.) may be used to identify various elements of content that may increase a rate of conversion for users that consume a content summary” in Col. 4 lines 3-22, “Embodiments of the disclosure include systems and methods to automatically generate customized video content summaries that may account for specific user preferences and/or interests. Certain embodiments generate content summaries by automatically separating content into semantic clips and clustering and labeling some or each semantic clip into actions and adverbs. Semantic clips may be sequential or non-sequential portions of content, such as a movie, that may narrate a continuous story. For example, non-stop and/or ongoing conversation between two people, non-stop and/or ongoing car chasing, fighting, etc. Content may include one or more semantic clips that are happening in different locations and/or times with different people. The length of each semantic clip can be anywhere from a few seconds to more than 10 minutes” in Col. 2 lines 38-52); and
generating an audio-video summarization of the audio-video program using a subset of the scenes based on the one or more user preferences (Col. 2 lines 38-52, “The detected segmentation data 380 may be input at a summary generation engine 390 and/or one or more summary/trailer generation module(s). The summary generation engine 390 may be configured to generate one or more content summaries using the segmentation data 380. For example, the summary generation engine 390 may generate a first content summary that may be an action-themed content summary using action-themed segments of the content. Action-themed segments may be segments that include certain types of objects (e.g., cars, guns, etc.), certain types of sounds (e.g., explosions, gunshots, etc.), certain types of human poses (e.g., fighting, etc.), and/or other types of indicators of action themes” in Col. 13 lines 62-67 and Col. 14 lines 1-8); and
generating an enhanced audio-video summarization based on second video features and second audio features of the audio-video summarization (“the summary selection engine 392 may be configured to select and/or customize content summaries … the summary selection engine 392 and/or the summary generation engine 390 may optionally be in communication with region policy data 396 and/or region constraint data 398. Optional region policy data 396 may include data such as aggregate consumer preferences for content for users associated with a particular region. Regions may be geographic regions, and may be defined by country, zip code, continent, geographic territory, and/or by a different suitable metric. For example, the region policy data 396 may indicate that users from certain Asian countries prefer viewing content with red colors, such as red clothing, etc. In another example, users from North America may prefer viewing content with purple colors, or with certain types of themes, and so forth. The region policy data 396 may therefore be used during selection and/or ranking of segments to determine relevance to a particular region, or the region that a particular user is associated with. Optional region constraint data 398 may include specific rules or other data indicative of themes, colors, scenes, and/or other information indicative of content that is disliked generally in a certain region. For example, while certain regions may be relatively more open to viewing gun violence, other regions may dislike such content. The region constraint data 398 may therefore include data related to certain regions and content that may be disliked by users associated with the region. Accordingly, certain embodiments may determine region policy data associated with a region for which the video summary is to be generated, determine region constraint data associated with the region, and may determine that a certain segment of content is more relevant than the second segment using at least the user preference data, the region policy data, and/or the region constraint data” in Col. 15 lines 5-53).
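As an illustration of the segment-scoring and preference-based selection workflow quoted above from Mahyar, the following is a minimal Python sketch; the data structures and function names are hypothetical and do not represent Mahyar's disclosed implementation:

    # Hypothetical illustration of preference-weighted scene selection.
    from dataclasses import dataclass, field

    @dataclass
    class Scene:
        start: float               # scene start time, in seconds
        end: float                 # scene end time, in seconds
        theme_scores: dict = field(default_factory=dict)  # e.g., {"action": 0.9}

    def score_scene(scene, user_preferences):
        # Weight each per-theme segment score by the user's preference for
        # that theme, mirroring the "video score"/relevance idea quoted above.
        return sum(user_preferences.get(theme, 0.0) * value
                   for theme, value in scene.theme_scores.items())

    def select_summary_scenes(scenes, user_preferences, max_duration):
        # Rank scenes by preference-weighted score, greedily fill the summary
        # up to the target duration, then restore narrative order.
        ranked = sorted(scenes, key=lambda s: score_scene(s, user_preferences),
                        reverse=True)
        chosen, used = [], 0.0
        for scene in ranked:
            length = scene.end - scene.start
            if used + length <= max_duration:
                chosen.append(scene)
                used += length
        return sorted(chosen, key=lambda s: s.start)

    scenes = [Scene(0, 30, {"action": 0.9}), Scene(30, 90, {"dialogue": 0.8}),
              Scene(90, 120, {"action": 0.4})]
    summary = select_summary_scenes(scenes, {"action": 1.0}, max_duration=60.0)
    # -> the two action scenes (60 seconds total), in narrative order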
Mahyar is silent about generating an enhanced audio-video summarization using a frame interpolation machine learning model comprising an encoder and a decoder, and a flow estimator configured to process embeddings and generate optical flow estimations.
Mitra teaches generating an enhanced audio-video summarization using a frame interpolation machine learning model comprising an encoder and a decoder (“first subset of frames 1200 may include R-frames and two I-frames (e.g., key-frames), I₁ and I₂. The R-frames may be interpolated using I₁ and I₂. In some embodiments, a machine learning apparatus according to the present disclosure may include context network 1225. Context network 1225 (e.g., context network C: I → {f^(1), f^(2), . . . }) may be pre-trained to extract context feature maps f^(l) of various spatial resolutions. In some embodiments, context network 1225 may be a U-Net. A U-Net is a CNN based on a fully convolutional network in which a large number of upscaling feature channels propagate context information to higher resolution layers. In some embodiments, the U-Net may be fused with individual layers of the convolution-LSTM layers by concatenating corresponding U-Net features of a same spatial resolution before each convolution-LSTM layer” in Para.[0118], “The term “compressed frame features” refers to a compressed representation of video data generated by an encoder network” in Para.[0029], “The term “reconstructed frames” refers to images (e.g., frames of a video) that have been reconstructed by an image generation network based on compressed frame features i.e., a decoder network of a generative adversarial network (GAN)” in Para.[0030]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mahyar with the above teachings of Mitra in order to improve the accuracy of the resulting summary so that the viewer can enjoy the output media.
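As an illustration of an encoder-decoder frame interpolation model of the kind at issue, the following is a minimal PyTorch sketch; the architecture, layer sizes, and names are hypothetical simplifications and do not represent Mitra's disclosed context-network/U-Net design:

    # Hypothetical encoder-decoder frame interpolator (PyTorch sketch).
    import torch
    import torch.nn as nn

    class FrameInterpolator(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder: embed the two bracketing key frames (2 RGB frames = 6 channels).
            self.encoder = nn.Sequential(
                nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Decoder: reconstruct the intermediate frame from the embedding.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, frame1, frame2):
            # Concatenate key frames along the channel axis, encode, then decode.
            embedding = self.encoder(torch.cat([frame1, frame2], dim=1))
            return self.decoder(embedding)

    # Usage: interpolate a frame between two 256x256 RGB key frames.
    f1, f2 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
    mid_frame = FrameInterpolator()(f1, f2)  # shape: (1, 3, 256, 256)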
Akella teaches a flow estimator configured to process embeddings and generate optical flow estimations (“The dense optical flow computation unit 210 can be configured to estimate an optical flow, which is a two-dimension (2D) vector field where each vector is a displacement vector showing the movement of points from a first frame to a second frame. The CNNs 220 can receive the stream of frame-based sensor data 250 and the optical flow estimated by the dense optical flow computation unit 210. The CNNs 220 can be applied to video frames to create a digest of the frames. The digest of the frames can also be referred to as the embedding vector. The digest retains those aspects of the frame that help in identifying actions, such as the core visual clues that are common to instances of the action in question” in Para.[0045]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mahyar and Mitra with the above teachings of Akella in order to improve quality and significantly optimize image processing.
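As an illustration of a flow estimator that processes embeddings to produce a dense optical flow field (a per-pixel (dx, dy) displacement vector, consistent with the 2D vector field Akella describes in Para.[0045]), the following is a minimal PyTorch sketch; the module and its dimensions are hypothetical:

    # Hypothetical flow-estimator head: encoder embedding -> dense flow field.
    import torch
    import torch.nn as nn

    class FlowEstimator(nn.Module):
        def __init__(self, in_channels=64):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 2, 3, padding=1),  # 2 channels = (dx, dy) per pixel
            )

        def forward(self, embedding):
            # embedding: (N, C, H, W) features; output: (N, 2, H, W) flow field.
            return self.head(embedding)

    flow = FlowEstimator()(torch.rand(1, 64, 48, 48))  # -> shape (1, 2, 48, 48)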
As per claim 2, Mahyar, Mitra and Akella teach all of the limitations of claim 1.
Mahyar teaches wherein generating the audio-video summarization comprises: providing the subset of the scenes to a generative machine learning model; and generating artificial video and audio content based on the subset of the scenes using the generative machine learning model, the audio-video summarization comprising the artificial video and audio content (“customized video content summary generation and presentation. Certain embodiments may automatically generate custom content summaries for digital content, such as video content (e.g., movies, television programs, streaming content, etc.), based at least in part on user data, such as user preference data indicative of types of content the user prefers to consume. Certain embodiments generate content summaries using, for example, analysis of audio, video, and text (e.g., closed captioning, synopses, metadata, etc.) components of digital content, such as a movie, video, other digital content. Certain embodiments determine aspects of content summaries, such as length and/or scene selection, using machine learning and other methodologies. In some instances, machine learning (e.g., deep neural networks, long short term memory units and/or recurrent neural networks, etc.) may be used to identify various elements of content that may increase a rate of conversion for users that consume a content summary” in Col. 4 lines 3-22, “The text processing module(s) 340 may include one or more natural language processing modules or algorithms and may be configured to detect or determine the presence of features such as certain words or phrases, themes, sentiment, topics, and/or other features. The text processing module(s) 340 may be configured to perform semantic role labeling, semantic parsing, or other processes configured to assign labels to words or phrases in a sentence that indicate the respective word or phrase's semantic role in a sentence, such as object, result, subject, goal, etc. Semantic role labeling may be a machine learning or artificial intelligence based process.” in Col. 13 lines 13-26).
As per claim 3, Mahyar, Mitra and Akella teach all of the limitations of claim 1.
Mahyar teaches further comprising: identifying the second video features and the second audio features of the audio-video summarization; and generating at least one of enhanced video content and enhanced audio content based on the second video features and the second audio features; wherein the enhanced audio-video summarization comprises at least one of the enhanced video content and the enhanced audio content (Col. 15 lines 5-53).
As per claim 5, Mahyar, Mitra and Akella teach all of the limitations of claim 3.
Mahyar is silent about wherein the frame interpolation machine learning model is trained to generate one or more video effects based on at least one of: (i) one or more frames of the audio-video program or the audio-video summarization and (ii) motion and optical flow estimation associated with the audio-video program or the audio-video summarization.
Mitra teaches wherein the frame interpolation machine learning model is trained to generate one or more video effects based on at least one of: (i) one or more frames of the audio-video program or the audio-video summarization and (ii) motion and optical flow estimation associated with the audio-video program or the audio-video summarization (“first subset of frames 1200 may include R-frames and two I-frames (e.g., key-frames), I₁ and I₂. The R-frames may be interpolated using I₁ and I₂. In some embodiments, a machine learning apparatus according to the present disclosure may include context network 1225. Context network 1225 (e.g., context network C: I → {f^(1), f^(2), . . . }) may be pre-trained to extract context feature maps f^(l) of various spatial resolutions. In some embodiments, context network 1225 may be a U-Net. A U-Net is a CNN based on a fully convolutional network in which a large number of upscaling feature channels propagate context information to higher resolution layers. In some embodiments, the U-Net may be fused with individual layers of the convolution-LSTM layers by concatenating corresponding U-Net features of a same spatial resolution before each convolution-LSTM layer” in Para.[0118], “an LSTM network includes a cell, an input gate, an output gate and a forget gate. The cell stores values for a certain amount of time, and the gates dictate the flow of information into and out of the cell. LSTM networks may be used for making predictions based on series data where there can be gaps of unknown size between related information in the series. LSTM networks can help mitigate vanishing gradients and exploding gradients when training an RNN. In the convolutional-LSTM network, an input to each LSTM cell is a hidden state of a previous layer and an output of a convolution network for each feature map (reduced embedding)” in Para.[0079], Para.[0084]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mahyar with the above teachings of Mitra in order to improve the accuracy of the resulting summary so that the viewer can enjoy the output media.
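As an illustration of a slow-motion style video effect produced by inserting interpolated intermediate frames, the following is a minimal Python sketch; a simple linear blend stands in for a learned interpolator's output, and all names are hypothetical:

    # Hypothetical slow-motion effect via frame interpolation: insert
    # (factor - 1) intermediate frames between each consecutive pair.
    import numpy as np

    def interpolate(frame_a, frame_b, t):
        # Placeholder for a learned frame interpolator evaluated at time t.
        return ((1.0 - t) * frame_a + t * frame_b).astype(frame_a.dtype)

    def slow_motion(frames, factor=2):
        out = []
        for a, b in zip(frames, frames[1:]):
            out.append(a)
            for k in range(1, factor):
                out.append(interpolate(a, b, k / factor))
        out.append(frames[-1])
        return out

    clip = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(8)]
    slowed = slow_motion(clip, factor=4)  # ~4x the frames -> ~4x slower playback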
As per claim 9, the limitations of claim 9 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 10, the limitations of claim 10 have been discussed in the rejection of claim 2 and are rejected under the same rationale.
As per claim 11, the limitations of claim 11 have been discussed in the rejection of claim 3 and are rejected under the same rationale.
As per claim 13, the limitations of claim 13 have been discussed in the rejection of claim 5 and are rejected under the same rationale.
As per claim 17, Mahyar teaches a non-transitory machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device to perform operations (Col. 30 lines 30-50); the other limitations of claim 17 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 18, the limitations of claim 18 have been discussed in the rejection of claim 2 and are rejected under the same rationale.
As per claim 19, the limitations of claim 19 have been discussed in the rejection of claim 3 and are rejected under the same rationale.
Mahyar in view of Mitra, Akella and Kumar
Claims 4 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Mahyar et al. (USPN 10,455,297; hereinafter Mahyar) in view of Mitra et al. (USPubN 2023/0262237; hereinafter Mitra) further in view of Akella et al. (USPubN 2019/0139441; hereinafter Akella) further in view of Kumar et al. (USPubN 2022/0159086; hereinafter Kumar).
As per claim 4, Mahyar, Mitra and Akella teach all of the limitations of claim 3.
Mahyar, Mitra and Akella are silent about wherein: the enhanced video content comprises at least one of: video content having a slow motion video effect or video content having a 360 video effect; and the enhanced audio content comprises at least one of: enhanced background music from the audio-video program or machine-generated background music.
Kumar teaches wherein: the enhanced video content comprises at least one of: video content having a slow motion video effect or video content having a 360 video effect; and the enhanced audio content comprises at least one of: enhanced background music from the audio-video program or machine-generated background music (“Some attributes that may significantly impact popularity include filter (m), genre (n), speed (p), embed background music (q) and element of unexpectedness (u). In one embodiment, the content popularity application recommends some options using the one or more of the attributes to modify the content item using the attributes to increase popularity of the content item. One option includes using the filter, which is a design overlay that can be added on top of the content item in order to enhance the appearance of content item on the content item. For example, a beauty filter in Snapchat. Another option includes using the genre, which is a type of content item on the content item. In one embodiment, multiple combination of genre affects popularity of the content item. For example, a combination of pop and rap music is very much liked by many users in social media. Another option includes using the speed of the content item in the content item. For example, frames of a video content item moving at a combination of slow and normal speed as per scene in the video content item impacts popularity of the content item. In one example, if the scene is an old movie in which two lovers meet after 10 years, such scene may be edited to include a mix of slow and normal motion to increase the surprise motion and suspense element in order to increase attention of the audience. Another option includes using the embed background music, which includes music playing in the background of the content item. Such background music engages the audience attention on the content item in the content item” in Para.[0026]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mahyar, Mitra and Akella with the above teachings of Kumar in order to improve the user experience.
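As an illustration of Kumar's example of mixing slow and normal playback speeds across scenes, the following is a minimal Python sketch with hypothetical names:

    # Hypothetical per-scene playback-speed plan, echoing Kumar's example of
    # mixing slow and normal motion across scenes.
    def plan_playback(scene_ids, slow_scene_ids, slow_speed=0.5):
        # Returns (scene_id, speed) pairs, where 1.0 is normal speed and
        # values below 1.0 denote slow motion.
        return [(sid, slow_speed if sid in slow_scene_ids else 1.0)
                for sid in scene_ids]

    plan = plan_playback(scene_ids=[0, 1, 2, 3], slow_scene_ids={2})
    # -> [(0, 1.0), (1, 1.0), (2, 0.5), (3, 1.0)]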
As per claim 12, the limitations of claim 12 have been discussed in the rejection of claim 4 and are rejected under the same rationale.
Mahyar in view of Mitra, Akella and Varshney
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mahyar et al. (USPN 10,455,297; hereinafter Mahyar) in view of Mitra et al. (USPubN 2023/0262237; hereinafter Mitra) further in view of Akella et al. (USPubN 2019/0139441; hereinafter Akella) further in view of Varshney et al. (USPubN 2023/0403441; hereinafter Varshney).
As per claim 8, Mahyar, Mitra and Akella teach all of the limitations of claim 1.
Mahyar, Mitra and Akella are silent about wherein: the semantic video cut machine learning model is configured to identify locations in the audio-video program for ads; and generating the audio-video summarization comprises generating the audio-video summarization with one or more ads at one or more of the identified locations, the one or more ads based on one or more additional user preferences.
Varshney teaches wherein: the semantic video cut machine learning model is configured to identify locations in the audio-video program for ads; and generating the audio-video summarization comprises generating the audio-video summarization with one or more ads at one or more of the identified locations, the one or more ads based on one or more additional user preferences (“a summary generator module 706 may be configured to receive key frame identification-based summary using a pre-trained deep neural network model. The summary generator module 706 may be configured to share the key frame identification-based summary with a creative modifier 708 and then a relevance computer 710. The creative modifier 708 may be configured to make one or more creative modifications in the selected/identified advertisement(s) based on one or more parameters of user profile. For example, in addition to user preference of a preferred viewing duration of ads, the user's ethnicity may be identified as “Asian” from the user profile store. To present advertisements for enhanced user interaction, the creative modifier 708 may be configured to modify advertisements to include one or more aspects associated with user's ethnicity. For instance, the main actor in the selected/identified advertisement may be identified that may be of non-Asian ethnicity, and an Asian face may be selected from a database for swapping with the non-Asian face in the originally identified ad” in Para.[0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mahyar, Mitra and Akella with the above teachings of Varshney in order to improve the user experience.
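As an illustration of inserting preference-matched ads at identified locations in a summary, the following is a minimal Python sketch; the matching heuristic and names are hypothetical and do not represent Varshney's implementation:

    # Hypothetical ad insertion at model-identified insertion points; ads are
    # matched to user-preference tags.
    def choose_ad(ads, user_prefs):
        # Pick the ad whose tags overlap the user's preference tags the most.
        return max(ads, key=lambda ad: len(set(ad["tags"]) & user_prefs))

    def insert_ads(summary_segments, insertion_points, ads, user_prefs):
        timeline, points = [], set(insertion_points)
        for i, segment in enumerate(summary_segments):
            timeline.append(("content", segment))
            if i in points:  # an identified ad location follows this segment
                timeline.append(("ad", choose_ad(ads, user_prefs)["name"]))
        return timeline

    ads = [{"name": "sneaker_ad", "tags": ["sports"]},
           {"name": "soundtrack_ad", "tags": ["music"]}]
    print(insert_ads(["scene_a", "scene_b"], [0], ads, {"sports"}))
    # -> [('content', 'scene_a'), ('ad', 'sneaker_ad'), ('content', 'scene_b')]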
As per claim 16, the limitations of claim 16 have been discussed in the rejection of claim 8 and are rejected under the same rationale.
Allowable Subject Matter
Claims 6, 7, 14, 15, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SUNGHYOUN PARK whose telephone number is (571)270-1333. The examiner can normally be reached M-Thur, 6:00 am-4:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI Q TRAN can be reached at (571)272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SUNGHYOUN PARK/Examiner, Art Unit 2484