Last updated: May 29, 2026
Application No. 18/569,566
METHOD, APPARATUS, DEVICE AND STORAGE MEDIUM FOR VIDEO PROCESSING

Final Rejection §103
Filed: Dec 12, 2023
Priority: Sep 29, 2021 (CN 202111154001.8, +2 more)
Examiner: CLOTHIER, MATTHEW MORRIS
Art Unit: 2614
Tech Center: 2600 (Communications)
Assignee: BEIJING ZITIAO NETWORK TECHNOLOGY CO., LTD.
OA Round: 2 (Final)
Interview Optional

Interview lift (+0.0%) is below the 15.0% threshold. A written response is recommended.
Based on 3 resolved cases, 2023–2026
Examiner Intelligence

CLOTHIER, MATTHEW MORRIS
Grants 100% — above average
Career Allowance Rate
3 granted / 3 resolved
+38.0% vs TC avg
Interview Lift: +0.0% (minimal)
Fast prosecutor
2y 1m
Avg Prosecution
14 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§103
96.7%
+56.7% vs TC avg
§102
3.3%
-36.7% vs TC avg
Tech Center averages are estimates. Based on career data from 3 resolved cases.
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
1.	This action is in response to the amendment filed on 12/17/2025.  Claims 1, 6-7, 10, 12, 14, 16-17, 20, and 24-25 have been amended.  Applicant’s remarks did not indicate claims 17 and 20 as being amended.  In addition, claim 20 was not properly indicated as amended in the claims.  Applicant’s amendments to the drawings have overcome the objection previously set forth in the Non-Final Office Action mailed 9/25/2025.

Response to Arguments
2.	Applicant’s arguments are directed toward the claim amendments filed on 12/17/2025.  Regarding claim 1, and similarly claims 24-25, Applicant’s arguments that the prior art does not teach the limitations: “performing target object segmentation on the original image to obtain a target object image and a background image;”, “adjusting a size of the target object image according to a plurality of different adjustment proportions to obtain a plurality of adjusted target object images, wherein each of the plurality of adjusted target object images depicts the target object image at a different size that corresponds to one of the plurality of different adjustment proportions;”, “fusing each of the plurality of adjusted target object images with the background image to obtain a plurality of target images, wherein each of the plurality of target images comprises one of the plurality of adjusted target object images fused with the background image;” and “performing audio and video encoding on the plurality of target images and the accented audio to obtain a video.” have been fully considered but are moot in view of the new grounds of rejection.  Claim 1 is now rejected over the combination of Kim, Miyasaka, and Lin.

3.	Additionally, regarding claim 1, and similarly claims 24-25, Applicant’s arguments that the prior art does not teach the limitations: “acquiring an original image and an original audio;” and “performing accent recognition on the original audio to obtain an accented audio;” have been fully considered but are not persuasive.  Specifically, Kim discloses the following: “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.” (Kim, col. 11, lines 38-44) and “For example, the processor 140 may determine whether the volume of the audio increases or decreases suddenly or whether a waveform corresponding to a specified sound (e.g., a shout, a clap, or a specific phrase (e.g., “goal”, “shoot”, “homerun”, etc.)) is detected, and include audio information in the additional information according the determination result.” (Kim, col. 5, lines 51-57).  As noted in paragraph [0053] of applicant’s specification: “The accent may be understood as a note with a strong sense of rhythm.”  Kim is able to perform “accent recognition” by determining the beat of background music (or potentially by detecting a sudden increase or decrease in audio volume, as disclosed in Kim, col. 5, lines 51-57) and uses that beat for various effects.  The disclosed effects, such as “a fast change effect of zoom-in/zoom-out may be provided based on a face position …” would not be possible without the detection of the music’s beat.
In addition, since Kim discloses a process in which beat detection of the background music is necessary for the various effects, performing that detection amounts to “accent recognition”: it recognizes or classifies, and thereby “obtains,” the music as an “accented audio.”  Thus, Kim discloses this limitation of claim 1.
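For illustration only, the kind of detection discussed above (a sudden increase in audio volume treated as an accent) can be sketched as follows. This is a minimal sketch under stated assumptions, not Kim's implementation and not the applicant's: the function names, the non-overlapping framing, and the `jump_ratio` threshold are all hypothetical.

```python
from typing import List


def frame_energies(samples: List[float], frame_size: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(x * x for x in frame) / frame_size)
    return energies


def find_accents(samples: List[float], frame_size: int = 4,
                 jump_ratio: float = 4.0, floor: float = 1e-6) -> List[int]:
    """Return indices of frames whose energy rises sharply over the previous frame.

    Assumption: an "accent" is approximated as an energy jump of at least
    jump_ratio; a real beat tracker would be considerably more involved.
    """
    energies = frame_energies(samples, frame_size)
    accents = []
    for i in range(1, len(energies)):
        previous = max(energies[i - 1], floor)  # avoid division by zero on silence
        if energies[i] / previous >= jump_ratio:
            accents.append(i)
    return accents


# Quiet passage with two loud bursts; the bursts fall in frames 2 and 5.
samples = [0.1] * 8 + [0.9] * 4 + [0.1] * 8 + [0.9] * 4
accent_frames = find_accents(samples, frame_size=4)
```

Only sudden increases are flagged here; a sudden decrease, which Kim also mentions, would need the inverse comparison.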
4.	Regarding the arguments with respect to claims 2-12, 14-17, and 19-20, these claims depend on independent claim 1.  Applicant does not argue anything other than independent claim 1, and similarly claims 24-25.  The limitations in those claims, and their combination, have previously been established and explained.

Claim Rejections - 35 USC § 103
5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


6.	Claims 1-4, 7, 10-11, 17, and 24-25 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), and further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin").

7.	As per claim 1, Kim discloses: A method for video processing, comprising:
	acquiring an original image and an original audio; (Kim, col. 5, lines 47-55, “Additionally, the processor 140 may generate additional information including a time that satisfies a specified condition and corresponding audio information, which are included in audio information obtained when an image is captured. For example, the processor 140 may determine whether the volume of the audio increases or decreases suddenly or whether a waveform corresponding to a specified sound ... is detected …”)
	performing target object segmentation on the original image to obtain a target object image and a [[background image;]] (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.”)
	performing accent recognition on the original audio to obtain an accented audio; (See Kim, col. 11, lines 38-44 rejection below.)
adjusting a size of the target object image according to a plurality of different adjustment proportions to obtain a plurality of adjusted target object images, wherein each of the plurality of adjusted target object images [[depicts the target object image at a different size that corresponds to one of the plurality of different adjustment proportions;]] (Kim, col. 11, lines 38-44, “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.”; Examiner’s note: As disclosed by Kim the “fast change effect of zoom-in/zoom-out may be provided based on a face position ...” and col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.”; Examiner’s note:  As Kim discloses in col. 8, lines 10-26, a face image can be selected as “an important object” and is used for the segmented target object image.  
Also, the zoom-in/zoom-out effect disclosed by Kim provides a plurality of different adjustment proportions in accordance with the beat of the background music as it progresses in time.)
	fusing each of the plurality of adjusted target object images with the [[background image]] to obtain a plurality of target images, wherein each of the plurality of target images comprises one of the plurality of adjusted target object images fused with the [[background image;]] and (Kim, col. 11, lines 32-44, “Effects may be applied with respect to data selected for generating a highlight video (e.g., some durations of image or video data) ... For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.”; Examiner’s note: The video effects disclosed by Kim act as a fusion process bringing images together.)
	performing audio and video encoding on the plurality of target images and the accented audio to obtain a video. (Kim, col. 4, lines 47-51, “Herein, the buffer 130 may correspond to a storage space where image data encoded or being encoded is temporarily stored, before being stored as a video file (e.g., an mp4 file) in the memory 150.” and col. 11, lines 28-31, “Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.” and col. 11, lines 38-39, “For example, in accordance with the beat of background music …”)

8.	Kim doesn't explicitly disclose but Miyasaka discloses: [[performing target object segmentation on the original image to obtain a target object image and a]] background image;
	[[fusing each of the plurality of adjusted target object images with the]] background image [[to obtain a plurality of target images, wherein each of the plurality of target images comprises one of the plurality of adjusted target object images fused with the]] background image; [[and]] (Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face.)

9.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method for video processing of Kim to include Miyasaka's disclosure of segmenting out a background image from a target in an image. The motivation for this modification could have been to separate the background from a target object so that various processing or special effects could be performed, such as isolating a target object by itself or applying special effects to only the background or only the target object.

10.	Kim in view of Miyasaka doesn't explicitly disclose but Lin discloses: [[adjusting a size of the target object image according to a plurality of different adjustment proportions to obtain a plurality of adjusted target object images, wherein each of the plurality of adjusted target object images]] depicts the target object image at a different size that corresponds to one of the plurality of different adjustment proportions; (Lin, [0097], “Furthermore, as illustrated in FIG. 9, the digital segmentation system 900 also includes the digital image manager 910. ... Furthermore, the digital image manager 910 can receive selection information identifying an object, and accordingly, modify the digital image (e.g., the identified object pixels or background pixels). ... Similarly, the digital image manager 910 can move, resize, rotate, or orient a segmented object portrayed in a digital image. Similarly, the digital image manager 910 can isolate, cut, and paste a segmented object portrayed in a digital image. Moreover, the digital image manager 910 can delete or remove a segmented object (or background) in a digital image.” and [0036], “By utilizing enhanced object detection, deep learning techniques, and an iterative optimization process, the digital segmentation system detects an object portrayed in a digital image quickly and accurately.” and [0037], “As used herein, the term “object” refers to a portrayal of a person or thing in digital visual media (digital images or digital videos).”; Examiner’s note: As disclosed by Lin, the digital segmentation system is able to identify objects and then “move, resize, rotate, or orient a segmented object” in any way.  Thus, for each target object image, this can provide a plurality of “different size(s) that corresponds to one of the plurality of different adjustment proportions.”)

11.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method for video processing of Kim in view of Miyasaka to include Lin's disclosure of generating different sizes of a target object, each corresponding to one of the plurality of different adjustment proportions. The motivation for this modification could have been to use the variety of sizes to make a dynamic and interesting video in which the target object size changes.  This would make the video potentially more entertaining and attractive to watch.
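For illustration only, the resize-and-fuse limitations of claim 1 discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any cited reference: images are plain nested lists of grayscale values, the segmented target is a cutout in which `None` marks pixels outside the object, resizing is nearest-neighbor, and the cutout is pasted centered on the background. All names are hypothetical.

```python
from typing import List, Optional

Image = List[List[int]]
Cutout = List[List[Optional[int]]]  # None marks pixels outside the target object


def resize_nearest(cutout: Cutout, proportion: float) -> Cutout:
    """Scale the cutout by `proportion` using nearest-neighbor sampling."""
    src_h, src_w = len(cutout), len(cutout[0])
    dst_h = max(1, round(src_h * proportion))
    dst_w = max(1, round(src_w * proportion))
    return [
        [cutout[min(src_h - 1, int(y / proportion))][min(src_w - 1, int(x / proportion))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def fuse(background: Image, cutout: Cutout, top: int, left: int) -> Image:
    """Paste the object pixels of `cutout` onto a copy of `background`."""
    fused = [row[:] for row in background]
    for dy, row in enumerate(cutout):
        for dx, pixel in enumerate(row):
            y, x = top + dy, left + dx
            if pixel is not None and 0 <= y < len(fused) and 0 <= x < len(fused[0]):
                fused[y][x] = pixel
    return fused


def make_target_images(background: Image, cutout: Cutout,
                       proportions: List[float]) -> List[Image]:
    """One fused image per adjustment proportion, object centered on the background."""
    bg_h, bg_w = len(background), len(background[0])
    targets = []
    for proportion in proportions:
        scaled = resize_nearest(cutout, proportion)
        top = (bg_h - len(scaled)) // 2
        left = (bg_w - len(scaled[0])) // 2
        targets.append(fuse(background, scaled, top, left))
    return targets


background = [[0] * 6 for _ in range(6)]
target_object = [[9, 9], [9, 9]]
frames = make_target_images(background, target_object, [1.0, 2.0])
```

Each output image shares the same background and differs only in the size of the pasted object, which is the relationship the "plurality of target images" limitation describes.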

12.	As per claim 2, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 1, wherein the original audio matches the original image. (Kim, col. 5, lines 47-55, “Additionally, the processor 140 may generate additional information including a time that satisfies a specified condition and corresponding audio information, which are included in audio information obtained when an image is captured. For example, the processor 140 may determine whether the volume of the audio increases or decreases suddenly or whether a waveform corresponding to a specified sound ... is detected …”)

13.	As per claim 3, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, wherein the acquiring the original audio comprises:
	acquiring the original audio according to a selection operation of a user; or (Miyasaka, page 2, ¶ [0033]-[0034], “A music input device 12 is for inputting music required for generating a scenario, and it is, for example, a memory card reader. An input device 13 is used by a user to perform various operations of the system, and it is, for example, a keyboard and a mouse.”; Examiner's note:  By "inputting music" via a "music input device," a user is able to "perform various operations of the system" specifically for "generating a scenario" by selecting the music from the "music input device.")
	recognizing type information of the original image; and
	acquiring the original audio based on the type information. (Kim, col. 10, lines 17-18, “In operation 505, the electronic device 100 may determine metadata to be used for generating a highlight video.” and col. 11, lines 17-30, “When metadata is determined, a predefined theme may be applied or a theme may be applied according to information of metadata. For example, when there are many dynamic images among images corresponding to metadata, a dynamic theme may be applied. Additionally, when an image is captured at a specific time (e.g., Christmas, New Year, Thanksgiving, Easter, Halloween, etc.), a theme fit for a corresponding season may be applied. Additionally, when there are many bright atmospheric images or dark atmospheric images according to an analysis result of an image, a theme according thereto may be applied. Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.”)
14.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Lin to include Miyasaka's disclosure of acquiring original audio through a selection from a user. The motivation for this modification could have been to provide an additional method of acquiring original audio for generating a video.

15.	As per claim 4, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, wherein the performing target object segmentation on the original image to obtain the target object image and the background image comprises:
	performing human figure recognition on the original image;
	in response to determining that a human figure is recognized, determining the recognized human figure as a target object; (Kim, col. 8, lines 10-14, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information.” and col. 8, lines 36-39, “An important object in an image is not limited to a person or a person's face. For example, the processor 140 may configure a specific image (e.g., a vehicle) as an important object in advance.” and col. 6, lines 1-4, “Additionally, when facial recognition or object recognition occurs in an image, the processor 140 may include information that the facial recognition and/or object recognition occurs ...”)
	in response to determining that no human figure is recognized, performing subject object recognition on the original image, and determining the recognized subject object as a target object; and (Kim, col. 8, lines 36-39, “An important object in an image is not limited to a person or a person's face. For example, the processor 140 may configure a specific image (e.g., a vehicle) as an important object in advance.” and col. 6, lines 1-4, “Additionally, when facial recognition or object recognition occurs in an image, the processor 140 may include information that the facial recognition and/or object recognition occurs ...”)
	segmenting the target object from the background to obtain the target object image and the background image. (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face.)
In addition, the rationale for modifying is the same as for claim 1 above.
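For illustration only, the selection order recited in claim 4 (a recognized human figure first, otherwise a recognized subject object) can be sketched as follows. This is a minimal sketch under stated assumptions: the two detectors are hypothetical callables supplied by the caller, and none of the cited references is asserted to implement the logic this way.

```python
from typing import Any, Callable, Optional, Tuple

Detector = Callable[[Any], Optional[Any]]  # returns a detected region or None


def choose_target(image: Any, detect_person: Detector,
                  detect_subject: Detector) -> Optional[Tuple[str, Any]]:
    """Prefer a recognized human figure; fall back to a subject object."""
    person = detect_person(image)
    if person is not None:
        return ("person", person)
    subject = detect_subject(image)  # only consulted when no person is found
    if subject is not None:
        return ("subject", subject)
    return None  # nothing to segment in this image


# Stand-in detectors for demonstration; real ones would analyze pixel data.
no_person = lambda image: None
finds_vehicle = lambda image: "vehicle-region"
target = choose_target("frame-0", no_person, finds_vehicle)
```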

16.	As per claim 7, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, wherein the accented audio comprises an accent start point and an accent end point; and
	the encoding the plurality of target images and the accented audio to obtain the video comprises:
	aligning a first frame in the plurality of target images with the accent start point, and aligning a last frame in the plurality of target images with the accent end point; and (Miyasaka, page 4, ¶ [0066], “All the beats extracted from the music are not used. A pattern is determined, and images are assigned in accordance with the pattern. For example, the structure of the music is separated every eight beats, and an image is displayed at the first beat, the third beat, the sixth up-beat and the eighth beat, as shown in FIG. 8. By increasing and repeating such a pattern, it is possible to avoid monotonous image switching. By combining the examples described above, images are assigned to all the components of the music.”)
	performing audio and video encoding on the plurality of aligned target images and accented audio to obtain the video. (Kim, col. 4, lines 47-51, “Herein, the buffer 130 may correspond to a storage space where image data encoded or being encoded is temporarily stored, before being stored as a video file (e.g., an mp4 file) in the memory 150.” and col. 11, lines 28-31, “Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.” and col. 11, lines 38-39, “For example, in accordance with the beat of background music …”)

17.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Lin to include the disclosure of aligning frames of target images with accent start and end points of Miyasaka. The motivation for this modification could have been to allow a series of images and the accented audio to line up with the rhythm of the music to make a generated video more impactful on a user watching it.
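For illustration only, the alignment recited in claim 7 (first frame at the accent start point, last frame at the accent end point) can be sketched as follows. This is a minimal sketch under stated assumptions: times are in seconds, intermediate frames are spaced evenly, and the function name is hypothetical. Neither the claim nor Miyasaka specifies how intermediate frames are timed.

```python
from typing import List


def frame_timestamps(frame_count: int, accent_start: float,
                     accent_end: float) -> List[float]:
    """Presentation times that pin the first and last frames to the accent span."""
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    if accent_end < accent_start:
        raise ValueError("accent_end must not precede accent_start")
    if frame_count == 1:
        return [accent_start]
    # Assumption: even spacing between the two pinned endpoints.
    step = (accent_end - accent_start) / (frame_count - 1)
    return [accent_start + i * step for i in range(frame_count)]


# Five target images spread over an accent that runs from 2.0 s to 4.0 s.
times = frame_timestamps(5, 2.0, 4.0)
```

An encoder would then use these times as presentation timestamps when multiplexing the images with the audio track.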

18.	As per claim 10, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 1, wherein the original image is a video frame corresponding to a video clip extracted from the original video;
	the original audio matches the original video; and (Kim, col. 5, lines 47-55, “Additionally, the processor 140 may generate additional information including a time that satisfies a specified condition and corresponding audio information, which are included in audio information obtained when an image is captured. For example, the processor 140 may determine whether the volume of the audio increases or decreases suddenly or whether a waveform corresponding to a specified sound ... is detected …”)
	the method further comprises:
	acquiring the original video; (Kim, col. 1, lines 15-17, “The present disclosure relates generally to a technique for generating video content by using media content such as an image or a video.” and col. 9, lines 59-62, “That is, a media library of the electronic device 100 may include a plurality of events and each event may include a plurality of contents (e.g., content files such as video files and/or image files).”)
	extracting the video clip meeting set conditions from the original video; (Kim, col. 1, lines 30-24, “Similarly, when a conventional electronic device is used to generate a highlight video from an original video, an analysis on original video data is required in order to determine a section of the original video to be used for the highlight video.” and col. 12, lines 55-62, “In FIG. 6, an example is provided where a specific section of a video file is played when a highlight video generated from the video file is played. That is, the specific section of the video file is used to generate the highlight video. However, in another example, according to metadata determined based on additional information, an image (e.g., I-frame) corresponding to a specific time point (e.g., PTS) may be used for generating a highlight video.”)
	the performing target object segmentation on the original image to obtain the target object image and a background image comprises:
	performing target object segmentation on each video frame of the video clip respectively to obtain target object images and background images respectively corresponding to the plurality of video frames; (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face.; and page 4, ¶ [0079], “At step S23, the motion picture generation device 15 generates a motion picture from the acquired images and music on the basis of the description in the scenario. 
Specifically, for all the acquired image data, the motion picture generation device 15 performs image processing in consideration of effect or change at certain moments and overlaps the image data in accordance with an order to create frame images. By connecting the frame images in order of time, a motion picture is created.”)
	the adjusting the size of the target object image according to different adjustment proportions comprises:
	sequentially adjusting sizes of the target object images in the plurality of video frames according to different adjustment proportions; and (Kim, col. 11, lines 38-44, “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.”)
	the performing audio and video encoding on the plurality of target images and the accented audio comprises:
	performing audio and video encoding on the plurality of target frames and the accented audio to obtain the video, the plurality of target frames being obtained by fusing the adjusted target object images in the plurality of video frames with the corresponding background image. (Kim, col. 4, lines 47-51, “Herein, the buffer 130 may correspond to a storage space where image data encoded or being encoded is temporarily stored, before being stored as a video file (e.g., an mp4 file) in the memory 150.” and col. 11, lines 28-31, “Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.” and col. 11, lines 38-39, “For example, in accordance with the beat of background music …”)
In addition, the rationale for modifying is the same as for claim 1 above.

19.	As per claim 11, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 10, wherein the acquiring the original audio comprises:
	acquiring the original audio matching the original video according to a selection operation of a user; or (Miyasaka, page 2, ¶ [0033]-[0034], “A music input device 12 is for inputting music required for generating a scenario, and it is, for example, a memory card reader. An input device 13 is used by a user to perform various operations of the system, and it is, for example, a keyboard and a mouse.”; Examiner's note:  By "inputting music" via a "music input device," a user is able to "perform various operations of the system" specifically for "generating a scenario" by selecting the music from the "music input device." and Kim, col. 1, lines 30-24, “Similarly, when a conventional electronic device is used to generate a highlight video from an original video, an analysis on original video data is required in order to determine a section of the original video to be used for the highlight video.” and col. 10, lines 5-9, “A user may select a specific event for generating/playing a highlight video on a specific event. Alternatively, regardless of an event, a user may select content stored in the memory 150 and command the generation of a highlight video based on the selected content.”)
	recognizing type information of the original video; and
	acquiring the original audio matching the original video based on the type information. (Kim, col. 10, lines 17-18, “In operation 505, the electronic device 100 may determine metadata to be used for generating a highlight video.” and col. 11, lines 17-30, “When metadata is determined, a predefined theme may be applied or a theme may be applied according to information of metadata. For example, when there are many dynamic images among images corresponding to metadata, a dynamic theme may be applied. Additionally, when an image is captured at a specific time (e.g., Christmas, New Year, Thanksgiving, Easter, Halloween, etc.), a theme fit for a corresponding season may be applied. Additionally, when there are many bright atmospheric images or dark atmospheric images according to an analysis result of an image, a theme according thereto may be applied. Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.”)
In addition, the rationale for modifying is the same as for claim 3 above.

20.	As per claim 14, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 10, wherein the performing target object segmentation on each video frame of the video clip respectively to obtain the target object images and background images respectively corresponding to the plurality of video frames comprises:
	performing human figure recognition on each video frame of the video clip;
	determining the recognized human figure as a target object, in response to determining that a human figure is recognized; (Kim, col. 8, lines 10-14, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information.” and col. 8, lines 36-42, “An important object in an image is not limited to a person or a person's face. For example, the processor 140 may configure a specific image (e.g., a vehicle) as an important object in advance. When video capturing starts, after an important object is configured, the processor 140 may also include a high priority indication with respect to an object recognized ...” and col. 6, lines 1-4, “Additionally, when facial recognition or object recognition occurs in an image, the processor 140 may include information that the facial recognition and/or object recognition occurs ...”)
	performing subject object recognition on each video frame of the video clip, and determining the recognized subject object as a target object, in response to determining that no human figure is recognized; and (Kim, col. 8, lines 36-42, “An important object in an image is not limited to a person or a person's face. For example, the processor 140 may configure a specific image (e.g., a vehicle) as an important object in advance. When video capturing starts, after an important object is configured, the processor 140 may also include a high priority indication with respect to an object recognized ...”)
	segmenting the target object from the background to obtain the target object images and the background images respectively corresponding to the plurality of video frames. (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face.)
In addition, the rationale for modifying is the same as for claim 1 above.
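
For illustration only, the fallback order recited in claim 14 (a recognized human figure is the target object; otherwise a recognized subject object is) can be sketched as below. The names `Detection` and `select_target`, the "person" label, and the confidence score are hypothetical and are not taken from Kim, Miyasaka, or Lin; the sketch assumes some upstream detector has already produced the detections.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    label: str    # hypothetical class label, e.g. "person" or "vehicle"
    score: float  # hypothetical detector confidence in [0, 1]


def select_target(detections: List[Detection]) -> Optional[Detection]:
    """Prefer a recognized human figure; otherwise fall back to the
    highest-scoring subject object. Returns None if nothing was detected."""
    people = [d for d in detections if d.label == "person"]
    if people:
        return max(people, key=lambda d: d.score)
    if detections:
        return max(detections, key=lambda d: d.score)
    return None
```

Under these assumptions, a frame containing both a vehicle and a person yields the person as the target object, and a frame with no person yields the highest-scoring remaining object.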

21.	As per claim 17, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 10, wherein the accented audio comprises an accent start point and an accent end point; and
	the performing audio and video encoding on the plurality of target frames and the accented audio to obtain the video comprises:
	aligning a first frame in the plurality of target frames with the accent start point, and aligning a last frame in the plurality of target frames with the accent end point; and (Miyasaka, page 4, ¶ [0066], “All the beats extracted from the music are not used. A pattern is determined, and images are assigned in accordance with the pattern. For example, the structure of the music is separated every eight beats, and an image is displayed at the first beat, the third beat, the sixth up-beat and the eighth beat, as shown in FIG. 8. By increasing and repeating such a pattern, it is possible to avoid monotonous image switching. By combining the examples described above, images are assigned to all the components of the music.” and page 4, ¶ [0079], “At step S23, the motion picture generation device 15 generates a motion picture from the acquired images and music on the basis of the description in the scenario. Specifically, for all the acquired image data, the motion picture generation device 15 performs image processing in consideration of effect or change at certain moments and overlaps the image data in accordance with an order to create frame images. By connecting the frame images in order of time, a motion picture is created.”)
	performing audio and video encoding on the aligned video frames and accented audio to obtain the video. (Kim, col. 4, lines 47-51, “Herein, the buffer 130 may correspond to a storage space where image data encoded or being encoded is temporarily stored, before being stored as a video file (e.g., an mp4 file) in the memory 150.” and col. 11, lines 28-31, “Herein, a theme may include an acoustic effect, e.g., background music, in addition to a visual effect applied to the highlight video playback.” and col. 11, lines 38-39, “For example, in accordance with the beat of background music …”)
In addition, the rationale for modifying is the same as for claim 7 above.
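
For illustration only, the alignment limitation of claim 17 (first frame at the accent start point, last frame at the accent end point) can be sketched as an assignment of presentation times. The function name `frame_timestamps` and the even spacing of the intermediate frames are assumptions; neither the claim nor the cited references specifies how intermediate frames are spaced.

```python
from typing import List


def frame_timestamps(num_frames: int, accent_start: float, accent_end: float) -> List[float]:
    """Assign a presentation time (seconds) to each target frame so that the
    first frame lands on the accent start point and the last frame on the
    accent end point. Intermediate frames are spaced evenly (an assumption)."""
    if num_frames < 1:
        raise ValueError("at least one frame is required")
    if accent_end < accent_start:
        raise ValueError("accent end point precedes accent start point")
    if num_frames == 1:
        return [accent_start]
    step = (accent_end - accent_start) / (num_frames - 1)
    times = [accent_start + i * step for i in range(num_frames)]
    times[-1] = accent_end  # pin the last frame exactly, avoiding float drift
    return times
```

The resulting timestamps would then be handed to whatever encoder multiplexes the frames with the accented audio; that step is outside this sketch.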

22.	Claim 24 is similar in scope to claim 1 except for additional limitations that Kim in view of Miyasaka, and further in view of Lin discloses: An electronic device, comprising:
	one or more processing devices;
	a storage device configured to store one or more programs, wherein the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement operations comprising: (Kim, col. 23, lines 4-12, “At least part of a device (e.g., modules or functions thereof) or a method (e.g., operations) according to this disclosure, for example, as in a form of a programming module, may be implemented using an instruction stored in computer-readable storage media. When at least one processor executes an instruction, the at least one processor may perform a function corresponding to the instruction. For example, the non-transitory computer-readable storage media may include the memory 830.” and col. 22, lines 46-48, “At least part of the program module 1010 may include a module, a program, a routine, sets of instructions, or a process to perform at least one function, for example.”)
23.	Claim 25, which is similar in scope to claim 24, is thus rejected under the same rationale as described above.

24.	Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin"), and further in view of Huang et al. (US-12183315-B2, hereinafter "Huang").

25.	As per claim 5, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, wherein the performing accent recognition on the original audio to obtain the accented audio comprises: (See rejection for claim 2 above.)

26.	Kim in view of Miyasaka, and further in view of Lin doesn't explicitly disclose but Huang discloses: denoising the original audio; (Huang, col. 9, lines 50-56, “The original audio data is filtered by using a target frequency range. … The computer device adopts the target frequency range, which can effectively filter out the low-frequency audio and noise that the human ear cannot hear …”)
	detecting a note start point of the denoised original audio to obtain a note start point; (Huang, Fig. 5C-5D; col. 3, lines 57-62, “FIG. 5C is a schematic diagram of determining a musical note starting point according to a target time point according to an embodiment of this application. FIG. 5D is a schematic diagram of determining a musical note starting point according to a target time point according to an embodiment of this application.” and col 16, lines 33-38, “In an implementation, musical note starting points may be screened to supplement the stress points in the target stress point set. The computer device may extract a musical note starting point of at least one musical note from the target audio data according to a musical note starting point detection algorithm (such as the librosa.onset algorithm).”)
	detecting a peak of the denoised original audio by using a peak-picking algorithm to obtain a peak point satisfying a preset condition; and (Huang, Fig. 5B; col. 3, lines 54-56, “FIG. 5B is a schematic diagram of acquiring a plurality of peaks from time points according to an embodiment of this application.” and col. 20, line 65-col. 21, line 13, “In an implementation, the acquiring unit 701 is configured to acquire initial energy change values of time points in the target audio data; and the processing unit 702 is further configured to: determine a plurality of peaks from the initial energy change values of the time points, each peak referring to an initial energy change value of a peak time point in the target audio data, and the peak time point satisfying the following condition: the initial energy change value of the peak time point being greater than an initial energy change value of each of two time points respectively on left and right sides of the peak time point and adjacent to the peak time point; and normalize the initial energy change value of the target time point by using a mean of the plurality of peaks to obtain the audio energy change value of the target time point.”)
	determining the accented audio based on the peak point and the note start point. (Huang, col. 10, lines 29-37, “The computer device determines a starting stress point and an ending stress point from the initial stress point set. The starting stress point refers to the earliest stress point in the initial stress point set. The ending stress point refers to the latest stress point in the initial stress point set. The computer device determines a starting arrangement position of the starting stress point in the target audio data and an end arrangement position of the ending stress point in the target audio data.” and col. 5, lines 28-39, “It can be learned from the above description that different types of audio data may be recognized adaptively through the audio detection solution; and the initial stress points such as the time points with local maximum energy, volume, and timbre or the time points that suddenly change are recognized from the audio data, and the accuracy verification is performed on the initial stress points by further using the correlation between the adjacent time points and the initial stress points, so that the extraction accuracy of stress points can be effectively improved, thereby providing a target stress point set accurate to the frame level (that is, the time point level).”)

27.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Miyasaka, and further in view of Lin to include Huang's disclosure of denoising the original audio and detecting the note start points and peak points in the audio to determine the accented audio start point. The motivation for this modification could have been to process the original audio to detect where the beats and rhythms are located, so that the audio can be used to generate a video whose frame images line up with those beats, making the video more impactful.
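
For illustration only, the peak condition quoted from Huang (a value greater than both of its adjacent neighbors, followed by normalization by the mean of the peaks) can be sketched as below. The function names and the plain-list input are assumptions; the sketch does not reproduce Huang's denoising, note start point detection, or stress point verification.

```python
from typing import List


def pick_peaks(values: List[float]) -> List[int]:
    """Return indices whose value is strictly greater than the values at the
    adjacent time points on the left and on the right."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def normalize_by_peak_mean(values: List[float]) -> List[float]:
    """Divide every value by the mean of the detected peaks. If there are no
    peaks, or their mean is zero, the values are returned unchanged."""
    peaks = pick_peaks(values)
    if not peaks:
        return list(values)
    mean_peak = sum(values[i] for i in peaks) / len(peaks)
    if mean_peak == 0:
        return list(values)
    return [v / mean_peak for v in values]
```

A "preset condition" as recited in claim 5 could then be applied to the normalized values, for example a minimum height; any such threshold would be a further assumption not stated in the quoted passages.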

28.	Claim 15, which is similar in scope to claims 5 and 10, is thus rejected under the same rationale as described above.  In addition, the rationale for modifying is the same as for claim 5 above.

29.	Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin"), and further in view of Li (CN-110197459-A).

30.	As per claim 6, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, wherein the adjusting the size of the target object image according to the plurality of different adjustment proportions to obtain the plurality of adjusted target object images comprises:
	determining the number of images which are required based on a duration of the accented audio; (Miyasaka, page 4, ¶ [0066], “All the beats extracted from the music are not used. A pattern is determined, and images are assigned in accordance with the pattern. For example, the structure of the music is separated every eight beats, and an image is displayed at the first beat, the third beat, the sixth up-beat and the eighth beat, as shown in FIG. 8. By increasing and repeating such a pattern, it is possible to avoid monotonous image switching. By combining the examples described above, images are assigned to all the components of the music.”)
	adjusting the size of the target object image respectively according to the plurality of different adjustment proportions to obtain adjusted target object images … (Kim, col. 11, lines 38-44, “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.”)

31.	Kim in view of Miyasaka, and further in view of Lin doesn't explicitly disclose but Li discloses: determining a change mode of the adjustment proportion based on the number of images to obtain the plurality of different adjustment proportions, the change mode comprising a change trend and a change step length; and
… corresponding to the number of images. (Li, page 11, ¶ [0107], “According to an optional implementation of the embodiments of the present disclosure, the speed threshold may be determined by identifying the target object from the plurality of images, calculating a size of the recognized target object image, and determining a speed threshold corresponding to the plurality of images according to the size ratio of the target in the plurality of images.” and page 8, ¶ [0072], “The target object is a specific object defined in the present disclosure, and the target object has a certain shape.” and page 8, ¶ [0079], “Specifically, the determination may be made based on the state change of the plurality of images on the time series. When it is detected that the state of the target object on the plurality of images is converted from the motion state to the static state, the graphic instruction represented by the first graphic region in the stationary state is parsed into the operation instruction of the target object. The operation instruction may be expressed in a plurality of manners, and the form of the operation instruction may include, but is not limited to, at least one of the following: a number, a text, a symbol, a level signal, or the like.” and page 8, ¶ [0077], “Alternatively, the movement speed is determined according to the movement distance and the play time difference between the image and the target image, and if the movement speed is greater than or equal to the preset speed threshold, it is determined that the action state corresponding to the image is the motion state.”)

32.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Miyasaka, and further in view of Lin to include Li's disclosure of, upon acquiring a number of frame images, determining a change mode of the adjustment proportion based on the number of images to obtain a plurality of different adjustment proportions. The motivation for this modification could have been to adjust the parameters of a generated video based on the total number and size of the frame images and to determine how long various effects last, creating a more impactful and dynamic video depending on the music or the image size/sequence.
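
For illustration only, the claim 6 limitations (a number of images derived from the accented audio duration, and a change mode comprising a change trend and a change step length) can be sketched as below. The frame rate, the start and end proportions, and the linear step are assumptions chosen for the example; they are not taken from the claims or from Kim, Miyasaka, Lin, or Li.

```python
from typing import List


def required_image_count(accent_duration_s: float, fps: float = 30.0) -> int:
    """Number of images needed to cover the accented audio at an assumed
    frame rate (fps is a hypothetical parameter)."""
    return max(1, round(accent_duration_s * fps))


def adjustment_proportions(count: int, start: float = 1.0, end: float = 1.5) -> List[float]:
    """Produce `count` different proportions from `start` to `end`.
    The sign of the step is the change trend (growing or shrinking) and its
    magnitude is the change step length; a linear step is an assumption."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [start]
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]
```

Under these assumptions, a 0.5-second accent at 30 frames per second requires 15 images, and each image receives its own proportion, so that more images yield a smaller step length between consecutive proportions.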

33.	As per claim 16, Kim in view of Miyasaka, further in view of Lin, and further in view of Li discloses: The method of claim 10, wherein the sequentially adjusting sizes of the target object images in the plurality of video frames according to different adjustment proportions comprises:
	acquiring the number of video frames contained in the video clip; (Kim, col. 1, lines 30-24, “Similarly, when a conventional electronic device is used to generate a highlight video from an original video, an analysis on original video data is required in order to determine a section of the original video to be used for the highlight video.” and Li, page 4, ¶ [0011]-[0013], “According to a specific implementation of the embodiments of the present disclosure, the obtaining a plurality of images including a target object displayed on an interactive interface includes: collecting video content in the interaction interface to obtain a video file including a plurality of video frames; selecting one or more video frames from the video file to form a plurality of images including the target object.”)
	determining a change mode of the adjustment proportion according to the number of video frames, and acquiring adjustment proportions corresponding to the number of video frames, the change mode comprising a change trend and a change step length; and (Li, page 11, ¶ [0107], “According to an optional implementation of the embodiments of the present disclosure, the speed threshold may be determined by identifying the target object from the plurality of images, calculating a size of the recognized target object image, and determining a speed threshold corresponding to the plurality of images according to the size ratio of the target in the plurality of images.” and page 8, ¶ [0072], “The target object is a specific object defined in the present disclosure, and the target object has a certain shape.” and page 8, ¶ [0079], “Specifically, the determination may be made based on the state change of the plurality of images on the time series. When it is detected that the state of the target object on the plurality of images is converted from the motion state to the static state, the graphic instruction represented by the first graphic region in the stationary state is parsed into the operation instruction of the target object. The operation instruction may be expressed in a plurality of manners, and the form of the operation instruction may include, but is not limited to, at least one of the following: a number, a text, a symbol, a level signal, or the like.” and page 8, ¶ [0077], “Alternatively, the movement speed is determined according to the movement distance and the play time difference between the image and the target image, and if the movement speed is greater than or equal to the preset speed threshold, it is determined that the action state corresponding to the image is the motion state.”)
	adjusting the sizes of the target object images in the plurality of video frames sequentially according to the adjustment proportions corresponding to the number of video frames. (Kim, col. 11, lines 38-44, “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.”)
In addition, the rationale for modifying is the same as for claim 6 above.

34.	Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin"), and further in view of Yan et al. (WO-2022/222842-A1, hereinafter "Yan").

35.	As per claim 8, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 2, prior to performing audio and video encoding on the plurality of target images and the accented audio, the method further comprising:
	extracting a target region from the plurality of target images, … (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face.)
	executing at least one of the following processing on the target region:
	randomly zooming in the target region, randomly zooming out the target region, or performing mirror image rotation on the target region. (Kim, col. 11, lines 38-44, “For example, in accordance with the beat of background music, an effect may be provided (changed), an image collage effect may be provided, a fast change effect of zoom-in/zoom-out may be provided based on a face position, or an image effect (e.g., an effect such as gif, a slow motion effect, an overlapping effect, etc.) for continuous pictures may be provided.” and col. 13, lines 35-39, “Accordingly, when playing the highlight video in operation 509, the electronic device 100 may differently set playback conditions of a playback duration with metadata and a playback without metadata.” and col. 16, lines 36-37, “The processor 820 may vary a playback condition relating to the at least one image and perform playback.”  and col. 13, lines 19-22, “Alternatively, if metadata for generating a highlight video is insufficient or some durations in a highlight video are configured to be filled with random images ...”)

36.	Kim in view of Miyasaka, and further in view of Lin doesn't explicitly disclose but Yan discloses: … the target region containing some or all pixel points of the target object, and a center point of the target region being a pixel point of the target object; and (Yan, page 1, [0007], “In the method, a semantic segmentation is performed on any frame image in a dynamic image to obtain an image segmentation mask, the dynamic image includes a plurality of objects, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects.” and page 2, [0018], “That is, by scanning each pixel point in the image segmentation mask, pixel points whose pixel values are pixel values corresponding to the selected moving object are determined ...” and page 3, [0025], “It should be noted that the specified position in the image area where the moving object is located may be a position with the smallest coordinate, or may be a position with the largest coordinate, or may be a position of the geometric center point.”)

37.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Miyasaka, and further in view of Lin to include Yan's disclosure of the target region containing some or all pixel points of the target object, and a center point of the target region being a pixel point of the target object. The motivation for this modification could have been to provide more precise and finer control over how pixels of a target object are processed for a generated video based on the total number of pixels and its center point.
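
For illustration only, a target region whose center point is a pixel point of the target object can be sketched as below, together with a simple mirror operation. The names `choose_center`, `crop`, and `mirror`, the use of a binary mask as nested lists, and the choice of the target pixel nearest the centroid are all assumptions; Yan is cited only for a segmentation mask and a specified position such as the geometric center point. The horizontal flip is one possible reading of the claimed mirror image processing, not a statement of what the claim requires.

```python
from typing import List, Tuple

Grid = List[List[int]]


def choose_center(mask: Grid) -> Tuple[int, int]:
    """Pick the target pixel (mask value != 0) closest to the centroid of all
    target pixels, so the chosen center is always a pixel of the target."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no target pixels")
    mean_r = sum(r for r, _ in pixels) / len(pixels)
    mean_c = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - mean_r) ** 2 + (p[1] - mean_c) ** 2)


def crop(image: Grid, center: Tuple[int, int], half: int) -> Grid:
    """Cut a square region of side 2*half+1 around `center`, clipped at the
    image borders (clipping may leave the center off the region's middle)."""
    r, c = center
    top, left = max(0, r - half), max(0, c - half)
    return [row[left:c + half + 1] for row in image[top:r + half + 1]]


def mirror(region: Grid) -> Grid:
    """Flip the region left to right."""
    return [row[::-1] for row in region]
```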

38.	Claim 19, which is similar in scope to claims 8 and 10, is thus rejected under the same rationale as described above.  In addition, the rationale for modifying is the same as for claim 8 above.

39.	Claims 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin"), further in view of Yan et al. (WO-2022/222842-A1, hereinafter "Yan"), further in view of Li et al. (NPL: "Depth-Wise Asymmetric Bottleneck With Point-Wise Aggregation Decoder for Real-Time Semantic Segmentation in Urban Scenes," in IEEE Access, 2020, hereinafter "Li-NPL"), and further in view of Wang et al. (NPL: "Lednet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation," 2019 IEEE International Conference on Image Processing (ICIP), 2019, hereinafter "Wang").

40.	As per claim 9, Kim in view of Miyasaka, further in view of Lin, and further in view of Yan discloses: The method of claim 2, wherein the performing target object segmentation on the original image to obtain the target object image and the background image comprises:
	inputting the original image into an image segmentation model to obtain the target object image and the background image, wherein the image segmentation model comprises: (Kim, col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face; and Yan, page 1, [0007], “In the method, a semantic segmentation is performed on any frame image in a dynamic image to obtain an image segmentation mask, the dynamic image includes a plurality of objects, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects.” and page 2, [0018], “That is, by scanning each pixel point in the image segmentation mask, pixel points whose pixel values are pixel values corresponding to the selected moving object are determined ...”)

41.	Kim in view of Miyasaka, further in view of Lin, and further in view of Yan doesn't explicitly disclose but Li-NPL discloses: … a deep separable convolutional network; (Li-NPL, Fig. 3 (c); page 27498, ¶ 4, “DAB module follows the bottleneck structure similar to ResNet [32].” and page 27498, ¶ 4, “We use 3 × 3 convolution at the beginning of each DAB module. Here we clarify the reason: although a 1 × 1 convolution has fewer parameters than 3 × 3 convolution, the intention of ResNet [32] is to make a deep model ...” and, Fig. 3 (c); page 27496, ¶ 4, “The overall architecture is presented in Fig. 2. More specifically, we design a depth-wise asymmetric bottleneck to extract dense feature under a large receptive field, which has common advantages of both dilated convolution and depth-wise separable convolution ...”)
	the deep separable convolutional network comprises a first channel convolutional sub-network, a deep convolutional sub-network, a second channel convolutional sub-network … (Li-NPL, Fig. 3 (c); page 27498, ¶ 7, “To extract local information, we use a simple 3 × 3 depth-wise convolution in the first branch. For further reducing the computation, referred by ERFNet [20] non-bottleneck-1D module (Fig. 3(b)), we apply convolution factorization to depth-wise convolution. Namely, a standard n×n depth-wise convolution is substituted for an n×1 depth-wise convolution followed by a 1 × n depth-wise convolution.”)
	the first channel convolutional sub-network comprises a first channel convolutional layer, a nonlinear activation layer and a linear transformation layer;
	the deep convolutional sub-network comprises a deep convolutional layer, a nonlinear activation layer and a linear transformation layer;
	the second channel convolutional sub-network comprises a second channel convolutional layer, a nonlinear activation layer and a linear transformation layer; and (Li-NPL, Fig. 3 (c); page 27498, ¶ 7, “To extract local information, we use a simple 3 × 3 depth-wise convolution in the first branch. For further reducing the computation, referred by ERFNet [20] non-bottleneck-1D module (Fig. 3(b)), we apply convolution factorization to depth-wise convolution. Namely, a standard n×n depth-wise convolution is substituted for an n×1 depth-wise convolution followed by a 1 × n depth-wise convolution.” and page 27498, ¶ 1, “In our DAB module, we adopt pre-activation scheme [33] and batch normalization [34] is used before every non-linear function. Referring to ENet [6], we use PReLU [35] as non-linearity function, as PReLU achieves slightly better performance than ReLU due to the shallow network model.”)

42.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Miyasaka, further in view of Lin, and further in view of Yan to include Li-NPL's disclosure of an image segmentation model whose components include a deep separable convolutional network. The motivation for this modification could have been to provide a machine learning model that is able to simplify complicated image segmentations by generating them automatically after the model has been trained.
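
For illustration only, the computational saving commonly associated with depth-wise separable convolution, and with the n×1 followed by 1×n factorization quoted from Li-NPL, can be shown by counting weights. The function names are hypothetical, biases and normalization parameters are ignored, and the channel counts in the usage note are example values, not figures from Li-NPL.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depth-wise convolution (one filter per input
    channel) followed by a 1 x 1 point-wise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out


def factorized_depthwise_params(k: int, channels: int) -> int:
    """Weights when a k x k depth-wise convolution is replaced by a k x 1
    depth-wise convolution followed by a 1 x k depth-wise convolution."""
    return 2 * k * channels
```

With a 3 × 3 kernel and 64 input and output channels, the standard convolution has 36,864 weights, the depth-wise separable form has 4,672, and factorizing the depth-wise stage reduces its share from 576 weights to 384.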

43.	Kim in view of Miyasaka, further in view of Lin, further in view of Yan, and further in view of Li-NPL doesn't explicitly disclose but Wang discloses: a channel exchange network, a channel split network and … (Wang, Fig. 2 (d); page 2, ¶ 3, “To balance performance and efficiency given limited computational budgets, we introduce two simple operators, called channel split and shuffle, in residual layer. We refer to this proposed module as split-shuffle-non-bottleneck (SS-nbt), as depicted in Figure 2 (d).”)
… and a channel merge layer; (Wang, Fig. 2 (d); page 2, ¶ 3, “Motivated from [12, 18], a split-transform-merge strategy is employed in the designment of our SS-nbt, approaching the representational power of large and dense layers, but at a considerably lower computational complexity.” and page 2, ¶ 3, “To avoid pointwise convolution, the transformation is performed using a set of specialized 1D filters (e.g., 1 × 3, 3 × 1), and the convolutional outputs of two branches are merged using concatenation so that the number of channels keeps the same. To facilitate training, the stacked output is added with input through the branch of identity mapping.”)
	the channel exchange network, the channel split network, the first channel convolutional sub-network, the deep convolutional sub-network, the second channel convolutional sub-network and the channel merge layer are sequentially connected; and (Wang, Fig. 2 (d); page 2, ¶ 3, “To balance performance and efficiency given limited computational budgets, we introduce two simple operators, called channel split and shuffle, in residual layer. We refer to this proposed module as split-shuffle-non-bottleneck (SS-nbt), as depicted in Figure 2 (d).” and page 2, ¶ 3, “Motivated from [12, 18], a split-transform-merge strategy is employed in the designment of our SS-nbt, approaching the representational power of large and dense layers, but at a considerably lower computational complexity.” and page 2, ¶ 3, “To avoid pointwise convolution, the transformation is performed using a set of specialized 1D filters (e.g., 1 × 3, 3 × 1), and the convolutional outputs of two branches are merged using concatenation so that the number of channels keeps the same. To facilitate training, the stacked output is added with input through the branch of identity mapping.” and page 3, ¶ 3, “Besides SS-nbt unit, the encoder also includes down-sampling unit, which is performed by stacking two parallel outputs of a single 3 × 3 convolution with stride 2 and a Max-pooling. Downsampling enables more deeper network to gather context, while at the same time helps to reduce computation.”)
	an output of the channel split network is in skip connection with an input of the channel merge layer; (Wang, Fig. 2 (d); page 1, ¶ 4-page 2, ¶ 1, “Following CFP, the core unit of encoder is a novel residual module that leverages skip connections and convolutions with channel split and shuffle. While the skip connections allow the convolutions to learn residual functions that facilitate training, the split and shuffle operations enhance the information exchange within the feature channels while maintaining similar computational costs compared to 1D factorized convolutions.”)
	the deep convolutional layer is composed of a plurality of parallel convolutional kernels. (Wang, Fig. 2 (d); page 3, ¶ 3, “Besides SS-nbt unit, the encoder also includes down-sampling unit, which is performed by stacking two parallel outputs of a single 3 × 3 convolution with stride 2 and a Max-pooling. Downsampling enables more deeper network to gather context, while at the same time helps to reduce computation.” and page 3, ¶ 3, “Moreover, the usage of dilated convolutions [14, 23] allows our architecture to have large receptive field, leading to an improvement in accuracy. Compared to the use of larger kernel sizes, this technique has been proven more effective in terms of computational cost and parameters.”)
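
The split-transform-merge structure quoted from Wang above can be sketched schematically as follows. This is plain Python, not Wang's SS-nbt implementation: each channel is reduced to a flat list of values, the `transform_left` and `transform_right` callables stand in for the 1D factorized convolutions, all function names are hypothetical, and placing the shuffle after the residual addition is an assumption of this sketch.

```python
def channel_split(channels):
    """Split the channel list into two halves, one per branch."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_merge(left, right):
    """Merge by concatenation so the number of channels stays the same."""
    return left + right


def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: reshape, transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def split_shuffle_block(channels, transform_left, transform_right):
    """Split, transform each branch, merge, add the identity input, shuffle."""
    left, right = channel_split(channels)
    merged = channel_merge([transform_left(ch) for ch in left],
                           [transform_right(ch) for ch in right])
    # Identity mapping: the block input is added to the stacked branch output.
    residual = [[a + b for a, b in zip(out_ch, in_ch)]
                for out_ch, in_ch in zip(merged, channels)]
    return channel_shuffle(residual)
```

The shuffle step is what lets information from the two branches mix in the next block, since each branch otherwise only ever sees its own half of the channels.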

44.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 2 of Kim in view of Miyasaka, further in view of Lin, further in view of Yan, and further in view of Li-NPL to include the disclosure of Wang regarding additional components of an image segmentation model, including a channel exchange network, a channel split network, a skip connection, a merge layer, and parallel convolutional kernels. The motivation for this modification could have been to further optimize the image segmentation model to increase its ability to process data faster and more accurately.

45.	Claim 20 is similar in scope to claims 9 and 10 except for additional limitations that Kim in view of Miyasaka, further in view of Lin, further in view of Yan, further in view of Li-NPL, and further in view of Wang discloses: The method of claim 10, wherein the performing target object segmentation on each video frame of the video clip respectively to obtain target object images and background images respectively corresponding to the plurality of video frames comprises:
	inputting each video frame of the video clip respectively into an image segmentation model to obtain the target object images and the background images corresponding to the plurality of video frames, wherein the image segmentation model comprises: (Kim, col. 1, lines 30-24, “Similarly, when a conventional electronic device is used to generate a highlight video from an original video, an analysis on original video data is required in order to determine a section of the original video to be used for the highlight video.” and col. 8, lines 10-26, “The processor 140 may also select an important object (e.g., an important person) based on internal/external information of the electronic device 100 at the time of capturing, and include information on the important object in the additional information. Through such additional information, the electronic device 100 may generate a highlight image based on an important target.  For example, the processor 140 may analyze information (e.g., a face image) on a frequently contacted user or a user added to favorites among users included in a contact list of the electronic device 100 in advance. Alternatively, the processor 140 may perform analysis while an image is captured. 
When face recognition on the important object occurs in an image, the processor 140 may include an image of the recognized face in addition to a high priority indication in the additional information.” and Miyasaka, page 4, ¶ [0064], “For an image group to which images with a face extracted therein belong, a display effect, such as zoom-out or zoom-in of the face, is provided synchronically with the beats or the up beats.”; Examiner’s note: With the “face extracted” and a specific “zoom-out” or “zoom-in” display effect of the face, the face is separated from the rest of the image, creating an image background along with the target object, the face; and Yan, page 1, ¶ [0007], “In the method, a semantic segmentation is performed on any frame image in a dynamic image to obtain an image segmentation mask, the dynamic image includes a plurality of objects, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects.” and page 2, ¶ [0018], “That is, by scanning each pixel point in the image segmentation mask, pixel points whose pixel values are pixel values corresponding to the selected moving object are determined ...”)
In addition, the rationale for modifying is the same as for claims 8, 9, and 10 as shown above.
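
To make the mask scan cited from Yan more concrete, the following minimal sketch splits each frame into a target object image and a background image by checking every pixel of a segmentation mask against a selected label. It assumes each frame is a 2D grid of pixel values and that a mask of integer labels has already been produced by some segmentation model; the function names, the `target_label` parameter, and the `fill` value are hypothetical and do not come from any cited reference.

```python
def split_by_mask(frame, mask, target_label, fill=0):
    """Return (object_image, background_image) for one frame.

    Pixels whose mask label equals target_label go to the object image;
    all other pixels go to the background image. Vacated pixels get `fill`.
    """
    object_image, background_image = [], []
    for frame_row, mask_row in zip(frame, mask):
        object_image.append(
            [p if m == target_label else fill
             for p, m in zip(frame_row, mask_row)])
        background_image.append(
            [fill if m == target_label else p
             for p, m in zip(frame_row, mask_row)])
    return object_image, background_image


def split_clip(frames, masks, target_label):
    """Apply the split to every frame of a clip, one mask per frame."""
    return [split_by_mask(frame, mask, target_label)
            for frame, mask in zip(frames, masks)]
```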

46.	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-10708650-B2, hereinafter "Kim") in view of Miyasaka et al. (US-2008/0055469-A1, hereinafter "Miyasaka"), further in view of Lin et al. (US-2017/0287137-A1, hereinafter "Lin"), and further in view of Sundareson (US-11023732-B2).

47.	As per claim 12, Kim in view of Miyasaka, and further in view of Lin discloses: The method of claim 10, wherein the extracting the video clip meeting set conditions from the original video to obtain the video clip comprises: (See rejection for claim 10 above.)

48.	Kim in view of Miyasaka, and further in view of Lin doesn't explicitly disclose but Sundareson discloses: acquiring feature vectors of each video frame in the original video; (Sundareson, col. 3, lines 58-63, “Once training data, which includes video clips based on durations of predicted interest, has been generated, the training data may be used to train a classification model to classify video clips in an unsupervised manner. To do so, features (e.g., feature vectors) of the video clips may be generated on a per-frame basis.”)
	clustering the feature vectors to obtain a plurality of clustered initial video clips; and (Sundareson, col. 3, line 61-col. 4, line 2, “To do so, features (e.g., feature vectors) of the video clips may be generated on a per-frame basis. For example, a trained feature extractor of a trained image classifier may be used to generate the feature vectors. As the feature extractor has learned which image features are important to classification in a general sense, the features may be clustered into meaningful classes without relying on overly specific game content, such as identifying particular objects.” and col. 1, lines 53-58, “Systems and methods are disclosed that identify video clips of interest in an unsupervised manner that is game-agnostic, and that group the interesting video clips into classes by clustering. Other video clips can then be classified based on comparing the video clips to the clusters.”)
	extracting a video clip meeting the set conditions respectively from the plurality of initial video clips based on the feature vectors to obtain the video clip. (Sundareson, col. 1, lines 53-58, “This unsupervised approach may be used to select clips to train a classification model or to identify potentially interesting clips from a gameplay video.” and col. 3, lines 39-47, “For example, a time segment having data points (e.g., continuously) above a threshold (e.g., an average of values of the interest level algorithm) may be identified as correlating with a potentially highlight-worthy clip. Once the time segments are identified, video clips of the gameplay session correlating with these time segments may be generated, and these video clips may represent durations of predicted interest.”)
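
The thresholding idea in the last Sundareson passage quoted above (time segments whose data points stay continuously above an average value) can be sketched as follows. This is only an illustration under simple assumptions: the interest values are given as a plain list with one score per time step, the threshold is the arithmetic mean, and the function name is hypothetical.

```python
def segments_above_average(values):
    """Return (start, end) index pairs, end exclusive, for runs above the mean."""
    if not values:
        return []
    threshold = sum(values) / len(values)
    segments = []
    start = None
    for i, value in enumerate(values):
        if value > threshold and start is None:
            start = i                      # a run above the threshold begins
        elif value <= threshold and start is not None:
            segments.append((start, i))    # the run ended at the previous index
            start = None
    if start is not None:
        segments.append((start, len(values)))
    return segments
```

Each returned pair marks a candidate time segment from which a clip could then be cut.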

49.	Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 10 of Kim in view of Miyasaka, and further in view of Lin to include the disclosure of Sundareson regarding acquiring feature vectors for video frames and clustering the vectors to determine and obtain a video clip meeting certain conditions. The motivation for this modification could have been to provide a means of classifying video clips so that various clips can be associated with each other or a particular clip can be isolated in order to generate videos in which the clips are relevant to the desired set conditions.
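
For readers less familiar with clustering per-frame feature vectors, a minimal sketch follows. It uses a basic k-means loop with caller-supplied starting centroids and then groups consecutive frames that share a cluster label into candidate clips. Neither the claims nor Sundareson specify this particular algorithm; k-means, the function names, and the grouping step are assumptions chosen only to illustrate the general idea.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(vectors, centroids, iterations=10):
    """Assign each vector to its nearest centroid, refining centroids each pass."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_consecutive(labels):
    """Turn per-frame labels into (start, end, label) runs, end exclusive."""
    clips = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            clips.append((start, i, labels[start]))
            start = i
    return clips
```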




Conclusion
50.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
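
The date arithmetic in the paragraph above can be sketched as follows. This is an illustration only: it uses the Apr 29, 2026 mailing date shown in the prosecution timeline on this page as sample input, the function names are hypothetical, and it does not account for weekends, federal holidays, or extension fees.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add calendar months, clamping the day to the end of shorter months."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shortened_period_end(mailing, first_reply=None, advisory=None):
    """End of the shortened statutory period as described in the paragraph above.

    Default: three months from mailing. If a first reply was filed within two
    months and the advisory action is mailed after the three-month date, the
    period ends on the advisory mailing date, capped at six months from mailing.
    """
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    replied_in_time = (first_reply is not None
                       and first_reply <= add_months(mailing, 2))
    if replied_in_time and advisory is not None and advisory > three_months:
        return min(advisory, six_months)
    return three_months


mailing_date = date(2026, 4, 29)  # sample input taken from the timeline on this page
print(shortened_period_end(mailing_date))  # three-month date
print(add_months(mailing_date, 6))         # six-month statutory limit
```

With that sample mailing date, the three-month date is Jul 29, 2026 and the six-month limit is Oct 29, 2026.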

51.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW CLOTHIER whose telephone number is (571)272-4667. The examiner can normally be reached Mon-Fri 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at (571)272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MATTHEW CLOTHIER/Examiner, Art Unit 2614
/KENT W CHANG/Supervisory Patent Examiner, Art Unit 2614
Prosecution Timeline

Dec 12, 2023: Application Filed
Sep 25, 2025: Non-Final Rejection mailed (§103)
Dec 17, 2025: Response Filed
Apr 29, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/583,705 (Patent 12530842): AIRBORNE LiDAR POINT CLOUD FILTERING METHOD DEVICE BASED ON SUPER-VOXEL GROUND SALIENCY. 1y 11m to grant; granted Jan 20, 2026.
18/393,677 (Patent 12499800): IN-VEHICLE DISPLAY DEVICE. 1y 12m to grant; granted Dec 16, 2025.
Based on the 2 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 100%
With Interview (+0.0%): 99%
Median Time to Grant: 2y 1m (~0m remaining)
PTA Risk: Moderate
Based on 3 resolved cases by this examiner. Grant probability derived from career allowance rate.