Prosecution Insights
Last updated: April 19, 2026
Application No. 18/429,179

AI-ENHANCED VIDEO EDITING WITH INTERMEDIATE DATA MODEL REPRESENTATION AND WEB-BASED INTERFACE

Status: Final Rejection (§103)
Filed: Jan 31, 2024
Examiner: NAZAR, AHAMED I
Art Unit: 2178
Tech Center: 2100 — Computer Architecture & Software
Assignee: Jobpixel Inc.
OA Round: 4 (Final)
Grant Probability: 53% (Moderate)
Expected OA Rounds: 5-6
Median Time to Grant: 3y 11m
Grant Probability With Interview: 88%

Examiner Intelligence

Career Allow Rate: 53% — grants 202 of 378 resolved cases (-1.6% vs TC avg)
Interview Lift: +35.1% for resolved cases with an interview (strong)
Avg Prosecution: 3y 11m typical timeline; 29 applications currently pending
Career History: 407 total applications across all art units

Statute-Specific Performance

§101: 9.2% (-30.8% vs TC avg)
§103: 59.7% (+19.7% vs TC avg)
§102: 15.3% (-24.7% vs TC avg)
§112: 9.6% (-30.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 378 resolved cases.
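
The headline figures above are internally consistent and can be reproduced with simple arithmetic; notably, every statute row implies the same 40% Tech Center average estimate (e.g., 59.7 − 19.7 = 40.0). The sketch below shows how the dashboard plausibly derives these numbers. The formulas, especially the interview-lift baseline, are assumptions for illustration, not the vendor's stated method.

// Sketch (TypeScript): how the dashboard's figures appear to be derived.
// The definitions here are assumptions, not the vendor's documented formulas.
const granted = 202;
const resolved = 378;

const careerAllowRate = (100 * granted) / resolved; // ≈ 53.4%, shown as 53%

// Interview lift: allow rate among resolved cases with an interview, minus
// the career rate. The displayed 88% / +35.1% pair implies a value near this:
const withInterviewRate = 88.5;                        // hypothetical underlying value
const interviewLift = withInterviewRate - careerAllowRate; // ≈ +35.1

// Statute-specific delta vs. the Tech Center average estimate:
const vsTcAvg = (examinerRate: number, tcAvgEstimate: number) =>
  examinerRate - tcAvgEstimate;
console.log(vsTcAvg(59.7, 40.0)); // §103: +19.7, matching the table above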

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application is being examined under the pre-AIA first to invent provisions.

Response to Amendment

This communication is responsive to the amendment filed 8/27/2025. Claims 1, 9, and 17 have been amended, and no claims have been added or canceled. In light of Applicant’s arguments, the previous claim rejections based on 35 USC 103 with respect to claims 1-20 have been withdrawn. Claims 1-20 are pending, with claims 1, 9, and 17 as independent claims.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and

(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “means for selecting a set of video clips”, “means for generating a natural language prompt”, “means for providing the generated natural language prompt as input”, “means for constructing a project data model”, and “means for rendering a dynamic and interactive web-based user interface” in claim 17, and “means for pre-processing the video clips” in claim 18.

Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. Specifically, paragraph [0108] and corresponding fig. 10 cover instructions 1016 being stored in memory 1030 and executed by processors 1010. However, there is no specific structure in the specification for each of the “means for” limitations above.

If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

The specification appears to provide sufficient structure, material, or acts for performing the claimed functions indicated above. The memory 1030 appears to store instructions or functions 1016, and the processor 1010 appears to execute instructions or functions 1016, as indicated in [0109-0110] and corresponding fig. 10.

[0109] The machine 1000 may include processors 1010, memory 1030, and I/O components 1050, which may be configured to communicate with each other such as via a bus 1002. In an example embodiment, the processors 1010 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an ASIC, a Radio-Frequency Integrated Circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 1012 and a processor 1014 that may execute the instructions 1016.

[0110] The memory 1030 may include a main memory 1032, a static memory 1034, and a storage unit 1036, all accessible to the processors 1010 such as via the bus 1002. The main memory 1032, the static memory 1034, and storage unit 1036 store the instructions 1016 embodying any one or more of the methodologies or functions described herein.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al. (US 2014/0328570, published 11/06/2014, hereinafter Cheng) in view of Lee et al. (US 2024/0362272, filed 4/27/2023, hereinafter Lee), further in view of Buckley et al. (US 2025/0021769, filed 7/13/2023, hereinafter Buckley).

Claim 1. A computer-implemented method for generating a video editing project, the method comprising:

with selection criteria received via a user interface of a web-based application, selecting a set of video clips from a collection of pre-processed video clips, each preprocessed video clip being associated with metadata comprising text corresponding with speech, the text derived from applying a speech-to-text algorithm to an audio track of the video clip and associated with timing data indicating a temporal occurrence of the speech within the video clip;

Cheng teaches in [0017] “Any video of the input 102 may include or have associated therewith an audio soundtrack and/or a speech transcript, where the speech transcript may be generated by, for example, an automated speech recognition (ASR) module of the computing system 100.” And in [0050] “The event detection module 228 identifies the event description 106 to the salient activity detector module 234. The salient activity detector module 234 uses the salient event criteria 138 to evaluate the event description 106 and/or the detected features 220, 222, 224, 226 of the multimedia input 102 to determine the salient activities associated with the event description 106 or with the multimedia input 102 more generally… The salient event criteria 138 may also include salient event criteria that is specified by or derived from user inputs 256. For example, the interactive storyboard module 124 may determine salient event criteria based on user inputs received by the editing module 126.” And in [0051] “the salient event detection criteria may include data values and/or algorithm parameters that indicate a particular combination of features 220, 222, 224, 226 that is associated with a salient activity. Each or any of the salient event criteria 138 may have associated therewith one or more saliency indicators 238. The saliency indicators 238 can be used by the computing system 100 to select or prioritize the salient event segments 112.” And in [0057] “the system 100 identifies the salient event segments 112 in the multimedia input file(s). To do this, the system 100 executes one or more of the feature detection algorithms 130 using the salient activity detection criteria determined at block 328. The system 100 also uses the saliency indicators 238, if any, to filter or prioritize the salient event segments 112 (block 334). At block 336, the computing system 100 generates the visual presentation 120 (e.g., a video clip or "montage"), and/or the NL description 122, using, e.g., the templates 142, 144 as described above (block 338). At block 340, the system 100 presents the visual presentation 120 and/or the NL description 122 to the end user using an interactive storyboard type of user interface mechanism.” And in [0059] “The system 100 analyzes the video using feature detection algorithms and semantic reasoning as described above. Time-dependent output of the system 100's semantic analysis of the detected features is shown by the graphics 512, 514, 516, 518. In the graphic 512, the portion 520 represents a salient event segment 112 of the video 510. The system 100 has determined, using the techniques described above, that the salient event segment 520 depicts the salient activity of blowing out candles. Similarly, the system 100 has identified, using the techniques described above, salient event segments 522, 524, each of which depicts a person singing, and a salient event segment 526, which depicts a person opening a box.” And in [0060] “One usage scenario of the technology disclosed herein provides a fully-automated video creation service with a "one-click" sharing capability… the service takes one or more input video files, which can be e.g. raw footage captured by smartphone, and identifies a most-relevant event type for the uploaded video, e.g. birthday party, wedding, football game. The service identifies the event using feature recognition algorithms, such as described above… the step of event identification can be done manually, such as by the user selecting an event type from a menu or by typing in keywords… the event type corresponds to a stored template identifying the key activities typically associated with that event type… for a birthday party, associated activities would include blowing out candles, singing the Happy Birthday song, opening gifts, posing for pictures, etc. Using feature detection algorithms for complex activity recognition, such as those described above and in the aforementioned priority patent applications, the service automatically identifies segments (sequences of frames) within the uploaded file that depict the various key activities or moments associated with the relevant event type.” (emphasis added).

Examiner Note: a user may select an event type or type in keywords as selection criteria. Based on the selected event type or typed keywords, the system identifies video segments or video clips, from multimedia input 102, that are relevant to the selected criteria. The multimedia content understanding module 104 comprises feature detection modules 202, wherein each module, from the feature detection modules, processes a feature in the multimedia input 102. For example, visual feature detection module 212 detects visual feature 220 such as someone eating cake, whereas audio feature detection module 214 detects audio feature 222 such as cheering sound or speech or singing, as indicated in [0016] and [0030].

generating, [using a first generative language model comprising a prompt generator and an output analyzer module, a dynamically constructed prompt] for use as input to [a second generative language model], the prompt including i) a natural language instruction derived from the selection criteria that directs the second generative language model to identify timing data for salient snippets relevant to the selection criteria, and ii) a context portion that includes the metadata from the selected video clips, wherein [the prompt is formatted for processing by a transformer-based large language model (LLM)];

Cheng teaches in [0018-0023] “The illustrative event description 106 generated by the understanding module 104 indicates an event type or category, such as "birthday party," "wedding," "soccer game," "hiking trip," or "family activity." The event description 106 may be embodied as, for example, a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag). Alternatively or in addition, the event description 106 may be embodied as structured data, e.g., a data type or data structure including semantics, such as "Party(retirement)," "Party(birthday)," "Sports_Event(soccer)," "Performance(singing)," or "Performance(dancing)."… The multimedia content understanding module 104 applies a number of different feature detection algorithms 130 to the multimedia input 102, using a multimedia content knowledge base 132, and generates an event description 106 based on the output of the algorithms 130… Once the salient activities are determined… identify particular portions or segments of the multimedia input 102 that depict those salient activities… if the computing system 100 determines that the multimedia input 102 depicts a birthday party (the event), the illustrative multimedia content understanding module 104 accesses the multimedia content knowledge base 132 to determine the constituent activities that are associated with a birthday party (e.g., blowing out candles, etc.), and selects one or more of the feature detection algorithms 130 to execute on the multimedia input 102 to look for scenes in the input 102 that depict those constituent activities. The understanding module 104 executes the selected algorithms 130 to identify salient event segments 112 of the input 102, such that the identified salient event segments 112 each depict one (or more) of the constituent activities that are associated with the birthday party.” And in [0025-0026] “The visual presentation generator module 116 of the output generator module 114 automatically extracts (e.g., removes or makes a copy of) the salient event segments 112 from the input 102 and incorporates the extracted segments 112 into a visual presentation 120, such as a video clip (e.g., a "highlight reel") or multimedia presentation, using a presentation template 142. In doing so, the visual presentation generator module 116 may select the particular presentation template 142 to use to create the presentation 120 based on a characteristic of the multimedia input 102, the event description 106, user input, domain-specific criteria, and/or other presentation template selection criteria… The natural language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106, including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144. In doing so, the natural language presentation generator module 118 may select the particular natural language template 144 to use to create the NL description 122 based on a characteristic of the multimedia input 102, the event description 106, user input, domain-specific criteria, and/or other NL template selection criteria.” And in [0035] “The auto-suggest module 162 evaluates the user inputs over time, compares the user inputs to the event descriptions 106 and/or the NL descriptions 122 (using, e.g., a matching algorithm), determines if any user inputs match any of the event descriptions 106 or NL descriptions 122, and, if an input matches an event description 106 or an NL description 122, generates an image suggestion, which suggests the relevant images/videos 102, 120 in response to the user input based on the comparison of the description(s) 106, 122 to the user input. For example, if the auto-suggest module 162 detects a textual description input as a wall post to a social media page or a text message, the auto-suggest module 162 looks for images/videos in the collection 150 or stored in other locations, which depict visual content relevant to the content of the wall post or text message. If the auto-suggest module 162 determines that an image/video 102, 120 contains visual content that matches the content of the wall post or text message, the auto-suggest module 162 displays a thumbnail of the matching image/video as a suggested supplement or attachment to the wall post or text message.” (emphasis added)

Examiner Note: let’s assume that the multimedia content understanding module 104 is a prompt generator. Module 104 would generate event description 106 and salient event segment 112 such that the event description 106 may be utilized by the system to determine salient activities. Thus, the event description 106 may be embodied as a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag), or as structured data, e.g., a data type or data structure including semantics, such as "Party(retirement)" or "Party(birthday)", with transition phrases serving as timing data. A second input may be user inputs (as the context portion) received by the editing module 126 to be matched with event descriptions 106, as indicated in [0035].

Cheng does not explicitly teach generating, using a first generative language model comprising a prompt generator and an output analyzer module, a dynamically constructed prompt… the prompt is formatted for processing by a transformer-based large language model (LLM). However, Lee, in an analogous art, teaches in [0022] “The video analysis system 130 also generates a set of prompt embeddings representing at least a portion of the query in a latent space. The set of prompt embeddings and a set of input embeddings are combined to generate an input tensor. The video analysis system 130 applies at least a component of a machine-learned decoder (e.g., machine-learned alignment model or LLM) to the input tensor to generate an output including a set of output embeddings. The video analysis system 130 converts the set of output embeddings into a response based on the content of the video clip. The video analysis system 130 provides the response to a user of the client device.” And in [0068-0070] “As shown in FIG. 2, a video clip includes a sequence of frames, Frame 1, . . . , Frame N from time stamps 16:05 to 17:18… The visual encoder 210 may be configured as a convolutional neural network (CNN), transformer architecture, or any other model configured to process images or frame data… As shown in FIG. 2, the audio signals from the video clip captures the sounds occurring in the hospital room for time stamps 16:05 to 17:18. The audio encoder 214 may be configured as a transformer architecture or any other model configured to process soundwave data… the text from the video clip captures the transcribed text of the people in the video talking in the hospital room for time stamps 16:05 to 17:18. The text encoder 216 may be configured as a transformer architecture or any other model configured to process text data.” And in [0072] “The video analysis system 130 also receives a user query and generates a prompt to the decoder 230… The prompt includes at least a portion of the user query but may differ from the user query by including additional context or instructions to the decoder 230 that was not included in the original query from the user. The video analysis system 130 converts the prompt into a set of prompt embeddings 225 in a latent space. In one embodiment, the query may include different data modalities, including text data, visual data (e.g., images), and the like. The set of prompt embeddings 225 are generated by at least applying a visual encoder to the visual data in the prompt, applying an audio encoder to the audio data in the prompt, and applying a text encoder to the text data in the prompt”. (emphasis added)

Examiner Note: the video analysis system (or the encoder 210) receives as input a sequence of video frames (generating visual embeddings that encode the visual information in the frames), an audio signal (generating audio embeddings that encode the audio information), and one or more tokens representing text (generating a set of text embeddings that encode the textual information), wherein the visual, audio, and text embeddings may be the generated prompt used as input to decoder 230 (the large language model).
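
To ground the claim language the examiner is mapping here, the sketch below shows one way the recited "dynamically constructed prompt" could be assembled: a natural language instruction derived from the selection criteria, plus a context portion carrying the clips' speech-to-text metadata with timing data. This is a minimal illustration only; every name and the output format are assumptions, not disclosures from the application or the cited references.

// Hypothetical sketch of the claimed prompt construction (claim 1): an
// instruction derived from the user's selection criteria plus a context
// portion containing per-clip transcript metadata with timing data.
interface TranscriptSegment {
  text: string;      // speech-to-text output for a span of the clip
  startSec: number;  // timing data: where the speech begins
  endSec: number;    // timing data: where the speech ends
}

interface ClipMetadata {
  clipId: string;
  transcript: TranscriptSegment[];
}

function buildPrompt(selectionCriteria: string, clips: ClipMetadata[]): string {
  const instruction =
    `Identify salient snippets relevant to "${selectionCriteria}". ` +
    `Reply with JSON: [{"clipId": string, "startSec": number, "endSec": number}].`;
  const context = clips
    .map(clip =>
      clip.transcript
        .map(seg => `[${clip.clipId} ${seg.startSec}s-${seg.endSec}s] ${seg.text}`)
        .join("\n"))
    .join("\n");
  return `${instruction}\n\nTranscript metadata:\n${context}`;
}

The resulting string would then be handed to the second model, which is what the later "API-based integration service" limitation contemplates.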
Similarly, Cheng teaches in [0019] “The event description 106 may be embodied as, for example, a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag).” And in [0024-0026] “The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof… The natural language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106, including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144… An example of a natural language description 122 for a highlight reel of a child's birthday party, which may be output by the NL generator module 118, may include: "Child's birthday party, including children playing games followed by singing, blowing out candles, and eating cake."”

Clearly, Cheng’s understanding module acts as an encoder that generates at least one prompt (event description and salient event segments such as 520, 522, 524, 526, wherein each salient event segment may include a timestamp indicating the beginning and ending of each salient event segment, as shown in fig. 5 of Cheng) to be used as input to generator module 114, which may act as a decoder that transforms the generated prompt into the output “highlight reel” by extracting the salient event segments 520, 522, 524, and 526 from video 510 and incorporating them into the “highlight reel”, see [0059], with transition phrases, see [0062], as timing data for the different segments in the highlight reel.

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Cheng with the teaching of Lee because “a user may submit a query to the video analysis system of “a guy in a red shirt playing tennis in a court” to request videos that include a man in a red shirt playing tennis in a court. The video analysis system performs a relevance analysis and identifies videos that include segments that relate to the query, for example, videos that includes a man in a red shirt playing tennis in a court.” Lee [Background].

providing the generated natural language prompt as input to the second generative language model via an API-based integration service, wherein the second generative language model comprising an LLM that processes the prompt using a transformer-based neural network architecture and generates output comprising data for constructing a project data model, the data including timing data for salient snippets within the selected video clips;

Cheng teaches in [0021] “The computing system 100 uses the event description 106 and the knowledge base 132 to determine one or more "salient" activities that are associated with the occurrence of the detected event… the computing system 100 may access salient event criteria 138 and/or the mapping 140 of the knowledge base 132. The illustrative salient event criteria 138 indicate one or more criteria for determining whether an activity is a salient activity in relation to one or more events… the salient event criteria 138 identify salient activities and the corresponding feature detection information that the computing system 100 needs in order to algorithmically identify those salient activities in the input 102 (where the feature detection information may include, for example, parameters of computer vision algorithms 130)… the salient event criteria 138 includes saliency indicators 238 (FIG. 2), which indicate, for particular salient activities, a variable degree of saliency associated with the activity as it relates to a particular event… A saliency indicator 238 may be… a priority, a weight or a rank that can be used to arrange or prioritize the salient event segments 112.” And in [0025] “The visual presentation generator module 116 of the output generator module 114 automatically extracts (e.g., removes or makes a copy of) the salient event segments 112 from the input 102 and incorporates the extracted segments 112 into a visual presentation 120, such as a video clip (e.g., a "highlight reel") or multimedia presentation, using a presentation template 142.” And in [0026] “The natural language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106, including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144.” And in [0028] “a presentation template 142 specifies, for a particular event type, the type of content to include in the visual presentation 120, the number of salient event segments 112, the order in which to arrange the segments 112, (e.g., chronological or by subject matter), the pace and transitions between the segments 112, the accompanying audio or text, and/or other aspects of the visual presentation 120.” (emphasis added)

Examiner Note: the multimedia content understanding module 104 generates the event description 106 and/or salient event segments, representing metadata describing audio such as shouting, singing, etc. in multimedia input 102, and salient event segments 112, which are input to the output generator module 114. The output generator module 114 receives the event description 106 and/or salient event segments 112 as input and outputs the visual presentation 120 and/or natural language description 122 of the salient event segments using template 142 and/or template 144, in which the segments are chronologically arranged with accompanying audio or text. Accordingly, the output generator module 114 appears to function similarly to the claimed generative language model.

constructing a non-destructive project data model from the output of the second generative language model, wherein the project data model includes references to one or more of the selected video clips without modifying any original video clips and specifies a beginning point and an ending point for the salient snippets identified by the second generative language model, wherein the project data model enables iterative editing and refinement prior to final video rendering;

Cheng teaches in [0015 and 0025] “the illustrative computing system 100 can, among other things, help users quickly and easily locate "salient activities" in lengthy and/or large volumes of video footage, so that the most important or meaningful segments can be extracted, retained and shared. The computing system 100 can compile the salient event segments into a visual presentation 120 (e.g., a "highlight reel" video clip) that can be stored and/or shared over a computer network… The visual presentation generator module 116 of the output generator module 114 automatically extracts (e.g., removes or makes a copy of) the salient event segments 112 from the input 102 and incorporates the extracted segments 112 into a visual presentation 120, such as a video clip (e.g., a "highlight reel") or multimedia presentation, using a presentation template 142.” And in [0028] “a presentation template 142 specifies, for a particular event type, the type of content to include in the visual presentation 120, the number of salient event segments 112, the order in which to arrange the segments 112, (e.g., chronological or by subject matter), the pace and transitions between the segments 112, the accompanying audio or text, and/or other aspects of the visual presentation 120… The output generator module 114 or more specifically, the visual presentation generator module 116, formulates the visual presentation 120 according to the system- or user-selected template 142 (e.g., by inserting the salient event segments 112 extracted from the multimedia input 102 into appropriate "slots" in the template 142).” And in [0050] “The salient event criteria 138 also includes salient activity templates 254. The templates 254 may include portions of the presentation templates 142. For example, a presentation template 142 may specify a list of activities that are considered to be "salient" for a particular type of video montage or other visual presentation 120. The salient event criteria 138 may also include salient event criteria that is specified by or derived from user inputs 256. For example, the interactive storyboard module 124 may determine salient event criteria based on user inputs received by the editing module 126.” And in [0061] “the user can also change the beginning and/or ending frames of a segment by selecting the segment for editing, previewing it along with neighboring frames from the original footage, and using interactive controls to mark desired start and end frames for the system.” And in [0062] “the automated creation of the highlight clip can be performed by the system 100 in real time, on live video (e.g., immediately after a user has finished filming, the system 100 can initiate the highlight creation processes). In the interactive embodiments, users can add transition effects between segments, where the transition effects may be automatically selected by the service and/or chosen by the user.” (emphasis added).

Examiner Note: the insertion of selected segments in chronological order using a template meets constructing a project data model, wherein transition effects may be inserted between the segments to indicate the beginning and ending of a segment. It is clear that the project data model may be a highlight reel or video clip compiled from salient event segments copied from the input video 102, which indicates that the input video has not been changed (non-destructive); instead, copies are made of selected portions of the input video 102.

and rendering a dynamic and interactive web-based user interface [implemented using HTML5, CSS3 and JavaScript] that visually represents the project data model, the user interface providing a timeline view of the video editing project and enabling user interaction for editing and refining the video project based on the project data model, wherein the user interface includes a snippet selector module displaying additional LLM-recommended video snippets not initially included in the project data model and an asset library module containing editing assets pre-selected by a generative language model based on relevance to the video project;

Cheng teaches in [0026 and 0028] “The natural language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106, including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144… a presentation template 142 specifies, for a particular event type, the type of content to include in the visual presentation 120, the number of salient event segments 112, the order in which to arrange the segments 112, (e.g., chronological or by subject matter), the pace and transitions between the segments 112, the accompanying audio or text, and/or other aspects of the visual presentation 120.” And in [0061-0062] “the service displays to the user an interactive storyboard, with thumbnails or other icons representing each of the segments along a timeline, and a visual indication of the segments that the system tentatively proposes to use in the highlight clip. If there is video from multiple sources for the same activities, then multiple corresponding rows of segments can be displayed, with visual indication of which segments from each row are to be used in the edited clip. An example situation in which this may occur is a child's birthday party, where both parents and other relatives (e.g., grandparents) may be taking video of the child's party. In this case, the system 100 can select salient event segments from the different videos in the group of videos taken by the family members and merge them into a highlight clip. In this embodiment, the user can modify the content of the highlight clip interactively by selecting different segments to use, with the interactive storyboard interface. In some embodiments, the user can also change the beginning and/or ending frames of a segment by selecting the segment for editing, previewing it along with neighboring frames from the original footage, and using interactive controls to mark desired start and end frames for the system. Once the user review/editing is complete, the service constructs a highlight clip by splicing segments together in accordance with the user's edits. As above, the user can preview the clip and decide to save or share the clip, for example… users can add transition effects between segments, where the transition effects may be automatically selected by the service and/or chosen by the user.” (emphasis added).

Examiner Note: the project data model may be a child's birthday party displaying selected segments, a highlight clip along a timeline, recommended by the system, wherein the user can modify (refine) the content of the highlight clip. The additional LLM-recommended video snippets may be transition effects to be added between the selected salient event segments for presentation 120.

Cheng discloses in [0033] “The output generator module 114 interfaces with an interactive storyboard module 124 to allow the end user to modify the (machine-generated) visual presentation 120 and/or the (machine-generated) NL description 122, as desired… The interactive storyboard module 124 presents the salient event segments 112 using a storyboard format that enables the user to intuitively review, rearrange, add and delete segments of the presentation 120 (e.g. by tapping on a touchscreen of the HCI subsystem 638). When the user's interaction with the presentation 120 is complete, the interactive storyboard module 124 stores the updated version of the presentation 120 in computer memory (e.g., a data storage 620).” (emphasis added).

Cheng does not explicitly teach a user interface implemented using HTML5, CSS3 and JavaScript that visually represents the project data model. However, Buckley, in an analogous art, discloses in [0080] “The application windows 146 may include browser windows 159 and browser tabs 162 rendered by a browser application 108. A browser application 108 is a web browser configured to access information on the Internet. The browser application 106 may launch one or more browser tabs 162 in the context of one or more browser windows 159 on a display 138 of the user device 102. A browser tab 162 may display content (e.g., web content) associated with a web document (e.g., webpage, PDF, images, videos, etc.) and/or an application 106 such as a web application, progressive web application (PWA), and/or extension… An extension adds a feature or function to the browser application 108. In some examples, an extension may be HTML, CSS, and/or JavaScript based (for browser-based extensions).” And in [0089-0090] “the task assistant 110 may transmit one or more prompts 114 to one or more language model 152 that causes multiple language models 152 to generate source code 118 for the computer task 122. The language models 152 may be associated with different applications 106. For example, a language model 152-1 may be associated with the operating system 105 and/or the browser application 108, and a language model 152-2 may be associated with another application 106 (e.g., webpage, web application, or native application). In some examples, the language model 152-2 is trained to understand the structure of application 106, including any interfaces associated with the application 106.” (emphasis added)

Examiner Note: the use of HTML5, CSS, and JavaScript would allow the user to structure and/or restructure the highlight reel by using HTML for structuring, CSS for styling, and JavaScript for user interaction with the presentation.
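
For readers unfamiliar with the "non-destructive project data model" limitation at the center of this claim, a hypothetical sketch of such a structure may help: the model holds only references into unmodified source clips, with beginning and ending points, so editing rewrites the model rather than the video files. Nothing below is taken from the specification or the references; every name is illustrative.

// Hypothetical sketch of a non-destructive project data model: only
// references and in/out points are stored; source clips are never modified.
interface Snippet {
  clipId: string;    // reference to an original, unmodified video clip
  startSec: number;  // beginning point of the salient snippet
  endSec: number;    // ending point of the salient snippet
}

interface ProjectDataModel {
  timeline: Snippet[];    // ordered snippets shown in the timeline view
  recommended: Snippet[]; // extra LLM-suggested snippets (snippet selector module)
  assets: string[];       // editing assets pre-selected for relevance (asset library)
}

// Iterative refinement is a pure transformation of the model, e.g. moving
// the ending point of the snippet at position i:
function trimSnippet(model: ProjectDataModel, i: number, endSec: number): ProjectDataModel {
  return {
    ...model,
    timeline: model.timeline.map((s, j) => (j === i ? { ...s, endSec } : s)),
  };
}

Only at final render would such a model be resolved against the source files, which is why the original clips stay untouched throughout editing.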
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Cheng with the teaching of Buckley because “In order to group work tabs in a new browser window, a user may launch a new browser window, visually identify each tab to determine which tab relates to the user's work, and then move the user's work tabs to the new browsing window. In some examples, the user device includes a tab search feature that allows users to search for a specific tab or page among their open tabs, making it easier to locate and switch to a particular tab without having to manually navigate through many tabs.” Lee [Background and 0056].

Claims 2 and 10. The rejection of the computer-implemented method of claim 1 is incorporated, further comprising pre-processing the video clips to include text corresponding with objects depicted in the video clips, the text derived from applying one or more computer vision algorithm to the video clips, wherein the metadata associated with each pre-processed video clip includes this text, and wherein the generative language model utilizes the text to enhance the identification of salient snippets based on both the speech and depicted objects within the video clips;

Cheng teaches in [0019-0021] “To generate the event description 106, the illustrative multimedia content understanding module 104 accesses one or more feature models 134 and/or concept models 136. The feature models 134 and the concept models 136 are embodied as software, firmware, hardware, or a combination thereof, e.g., a knowledge base, database, table, or other suitable data structure or computer programming construct. The models 134, 136 correlate semantic descriptions of features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low level features detected by the algorithms 130 with semantic descriptions of those sets of features (e.g., "object," "person," "face," "ball," "vehicle," etc.). Similarly, the concept model 136 may define relationships between sets of features detected by the algorithms 130 and higher-level "concepts," such as people, objects, actions and poses (e.g., "sitting," "running," "throwing," etc.). The semantic descriptions of features and concepts that are maintained by the models 134, 136 may be embodied as natural language descriptions and/or structured data. As described in more detail below with reference to FIG. 4, a mapping 140 of the knowledge base 132 indicates relationships between various combinations of features, concepts, events, and activities.” And in [0022] “The mapping 140 of the knowledge base 132 links activities with events, so that, once the event description 106 is determined, the understanding module 104 can determine the activities that are associated with the event description 106 and look for those activities in the input 102.” (emphasis added).

Examiner Note: the system generates event description 106 as text corresponding to events or activities detected in the video content by modules such as the visual feature module 212, e.g., a loud sound followed by a bright light, a person running, a person giving a speech, people drinking, etc.; see [0047].

Claims 3, 11, and 19. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the natural language prompt characterizes content desirable to a user as specified in the selection criteria, the prompt comprising one or more of a topic, a theme, a subject matter, a sentiment, specific keywords or phrases, questions or answers, narrative elements, or actionable content;

Cheng teaches in [0060] “the step of event identification can be done manually, such as by the user selecting an event type from a menu or by typing in keywords. In this embodiment, the event type corresponds to a stored template identifying the key activities typically associated with that event type. For example, for a birthday party, associated activities would include blowing out candles, singing the Happy Birthday song, opening gifts, posing for pictures, etc.” (emphasis added).

wherein the generative language model identifies salient snippets from the selected video clips that correspond to the characterized content;

Cheng teaches in [0060] “Using feature detection algorithms for complex activity recognition, such as those described above and in the aforementioned priority patent applications, the service automatically identifies segments (sequences of frames) within the uploaded file that depict the various key activities or moments associated with the relevant event type. The service automatically creates a highlight clip by splicing together the automatically-identified segments. The user can review the clip and instruct the service to save the clip, download it, and/or post/share it on a desired social network or other site.” (emphasis added).

Claims 4 and 12. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the selection criteria received via the user interface further include one or more of the following: video clip tags, folder hierarchy, source-based selection, date and time filters, content analysis metrics, user engagement data, quality and resolution specifications, or custom queries, which collectively or individually contribute to the selection of the set of video clips from the collection;

Cheng teaches in [0053] “the system 100 receives one or more input files, e.g. a multimedia input 102. The input file(s) can be embodied as, for example, raw video footage or digital pictures captured by smartphone or other personal electronics device. The input file(s) may be stored on a local computing device and/or a remote computing device (e.g., in a personal cloud, such as through a document storing application like DROPBOX). Thus, the input file(s) may be received by a file uploading, file transfer, or messaging capability of the end user's computing device and/or the computing system 100 (e.g., the communication subsystems 644, 672).” (emphasis added).

Claims 5 and 13. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the selection criteria received via the user interface further include a desired length for a final video, and wherein a default rate of speech is applied to the text associated with the selected video clips to determine a duration of each snippet to be included in the video project, such that the cumulative length of all selected snippets approximates the desired video length;

Cheng teaches in [0056] “The system 100 may also use the saliency indicators 238 at block 326 to prioritize the salient activities so that, for example, if a template 142 specifies a limitation on the length or duration of the visual presentation 120, segments of the input that depict the higher priority salient activities can be included in the presentation 120 and segments that depict lower priority activities may be excluded from the presentation 120.” (emphasis added).

Claims 6, 14, and 20. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the selection criteria received via the user interface for selecting the set of video clips from the collection include at least one or more of the following, expressed in the alternative or in any combination: video clip tags corresponding to topics or descriptive elements; an indicator of one or more folders, wherein video clips are organized within a folder hierarchy; selection based on a source of the video clips; and filter selections based on date and time of video clip creation or modification, source, or other content analysis metrics that categorize the video clips according to predefined parameters;

Cheng teaches in [0053] “the system 100 receives one or more input files, e.g. a multimedia input 102. The input file(s) can be embodied as, for example, raw video footage or digital pictures captured by smartphone or other personal electronics device. The input file(s) may be stored on a local computing device and/or a remote computing device (e.g., in a personal cloud, such as through a document storing application like DROPBOX). Thus, the input file(s) may be received by a file uploading, file transfer, or messaging capability of the end user's computing device and/or the computing system 100 (e.g., the communication subsystems 644, 672).” (emphasis added)

Examiner Note: the source of the video footage may be a local computing device or a personal cloud.

Claims 7 and 15. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the project data model facilitates the presentation of user interface elements representing additional relevant video snippets that are not initially included in the video project, allowing the user to preview and select these snippets for addition to or replacement of existing snippets within the project, thereby providing an advantage of identifying and presenting potential content options within the user interface without formally incorporating them into the project data model until selected by the user;

Cheng teaches in [0033] “The output generator module 114 interfaces with an interactive storyboard module 124 to allow the end user to modify the (machine-generated) visual presentation 120 and/or the (machine-generated) NL description 122, as desired… The editing module 126 displays the elements of the visual presentation 120 on a display device (e.g., a display device 642, FIG. 6) and interactively modifies the visual presentation 120 in response to human-computer interaction (HCI) received by a human-computer interface device (e.g., a microphone 632, the display device 642, or another part of an HCI subsystem 638). The interactive storyboard module 124 presents the salient event segments 112 using a storyboard format that enables the user to intuitively review, rearrange, add and delete segments of the presentation 120 (e.g. by tapping on a touchscreen of the HCI subsystem 638).” (emphasis added).

Claims 8 and 16. The rejection of the computer-implemented method of claim 1 is incorporated, wherein the user interface enables the user to specify additional editing parameters for the video editing project, including but not limited to desired video clip length, transition effects between clips, background music selection, overlay graphics, and text annotations, which are incorporated into the project data model to guide the rendering of the web-based user interface and a final video editing workflow;

Cheng teaches in [0061] “the system may identify more salient event segments than it actually proposes to use in a highlight clip, e.g., due to limits on total clip length, uncertainty about which segments are best, redun…

Prosecution Timeline

Jan 31, 2024: Application Filed
May 08, 2024: Non-Final Rejection — §103
Jun 13, 2024: Interview Requested
Jun 25, 2024: Applicant Interview (Telephonic)
Jun 26, 2024: Examiner Interview Summary
Aug 16, 2024: Response Filed
Sep 13, 2024: Final Rejection — §103
Sep 24, 2024: Interview Requested
Dec 06, 2024: Examiner Interview Summary
Dec 17, 2024: Response after Non-Final Action
Feb 10, 2025: Request for Continued Examination
Feb 12, 2025: Response after Non-Final Action
Feb 21, 2025: Non-Final Rejection — §103
Aug 27, 2025: Response Filed
Nov 25, 2025: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12564342: METHODS, SYSTEMS, AND DEVICES FOR THE DIAGNOSIS OF BEHAVIORAL DISORDERS, DEVELOPMENTAL DELAYS, AND NEUROLOGIC IMPAIRMENTS (granted Mar 03, 2026; 2y 5m to grant)
Patent 12548333: DYNAMIC NETWORK QUANTIZATION FOR EFFICIENT VIDEO INFERENCE (granted Feb 10, 2026; 2y 5m to grant)
Patent 12549503: INFORMATION INTERACTION METHOD AND APPARATUS, AND ELECTRONIC DEVICE (granted Feb 10, 2026; 2y 5m to grant)
Patent 12539042: Multi-Modal Imaging System and Method Therefor (granted Feb 03, 2026; 2y 5m to grant)
Patent 12541546: LOSSLESS SUMMARIZATION (granted Feb 03, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 53%
With Interview: 88% (+35.1%)
Median Time to Grant: 3y 11m
PTA Risk: High

Based on 378 resolved cases by this examiner. Grant probability derived from career allow rate.
