Prosecution Insights
Last updated: April 19, 2026
Application No. 18/666,415

ATTENTION-BASED VIDEO TOKEN GENERATION

Status: Non-Final OA — §101, §103
Filed: May 16, 2024
Examiner: HSU, JONI
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 87% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 2y 9m
Grant Probability with Interview: 95%

Examiner Intelligence

Career Allow Rate: 87% (741 granted / 848 resolved) — above average, +25.4% vs Tech Center average
Interview Lift: +7.2% (moderate), measured across resolved cases with an interview
Typical Timeline: 2y 9m average prosecution; 34 applications currently pending
Career History: 882 total applications across all art units

Statute-Specific Performance

§101: 8.4% (-31.6% vs TC avg)
§103: 59.7% (+19.7% vs TC avg)
§102: 11.4% (-28.6% vs TC avg)
§112: 3.1% (-36.9% vs TC avg)

Tech Center averages shown for comparison. Based on career data from 848 resolved cases.

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on May 22, 2025 was filed after the filing date of the application on May 16, 2024. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter. Claim 20 is directed to a computer storage medium. According to MPEP 2106.03 II, when the broadest reasonable interpretation (BRI) encompasses transitory forms of signal transmission, a rejection under 35 U.S.C. 101 as failing to claim statutory subject matter would be appropriate. Thus, a claim to a computer readable medium that can be a compact disc or a carrier wave covers a non-statutory embodiment and therefore should be rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. Applicant's disclosure describes "The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them" ([94], p. 24) without excluding signals and carrier waves. Thus, the BRI encompasses a signal, and Claim 20 is rejected under 35 U.S.C. 101. The Examiner suggests amending Claim 20 to instead recite a "non-transitory computer storage medium".

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1, 2, 5, 10, 19, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) in view of Shazeer (US 20200089755A1).

As per Claim 1, Ramsauer teaches a computer-implemented method for generating an output comprising an output video, the method comprising: obtaining a model input; processing the model input to generate an input sequence of embeddings that represents the model input (at step 1202, where the design generation system 120 receives the model prompt, [0288], at step 1204, the model prompt (the design description and the [GEN] token) is passed to the LLM 302, [0289], at step 1206, the LLM 302 takes the design description and the [GEN] token as an input prompt and predicts the next token in the sequence, [0290], step 1208 where the next token is selected, [0293], step 1210, where the LLM 302 determines whether the next token is a special token, [0294], step 1210, if the LLM 302 determines that the next token is a special token, method step 1211 is executed, this method step is described with reference to Fig. 13, [0298], Fig. 13 shows that step 1302 is project token to comparison space to generate vector embedding); autoregressively generating, by processing the input sequence of embeddings using an autoregressive token generation neural network (LLM 302 then replaces the vector embedding of the predicted next token with the mapped vector embedding of the closest matching design asset to generate a replacement token at step 1312, the method then proceeds to step 1214, where this replacement token is provided to the input of the LLM 302 as the previous token so that it can recommence the autoregressive token generation, [0308], the method then proceeds to step 1214, where the image replacement token and aspect ratio token are provided to the input of the LLM 302 so that it can recommence the autoregressive token generation, that is, thereafter, the LLM 302 predicts the bounding box tokens for the image element based on the image embedding and the aspect ratio token, [0386], LLM 302 may have any other architecture, such as a neural network architecture, [0087]), a combined output sequence that comprises a plurality of output sequences of tokens from a unified vocabulary of tokens, wherein each output sequence of tokens corresponds to a respective output modality of tokens from a set of a plurality of modalities that includes a video modality and one or more other modalities; and generating a model output that includes a video output of the video modality and a respective output for each of the one or more other modalities (LLM 302 to predict output tokens sequentially, LLM 302 to predict additional tokens that are distinct from the regular vocabulary tokens of the LLM, [0086], using an autoregressive multi-modal machine learning system that generates design tokens sequentially, [0336], large language model (LLM) that is grounded to a multi-modal domain, enabling the model to process and generate arbitrary interleaved other-modality data and text data, this is achieved by fine-tuning weights for newly added vocabulary and the input and output layers of the design generation system 120 to enable cross-modality interactions, [0084], system that is capable of processing and integrating information from multiple modalities, the modalities may be distinct types of data such as video, [0041]).

However, Ramsauer does not teach for each output sequence of tokens, decoding the sequence of tokens using a decoder neural network corresponding to the modality of the output sequence to generate an output of the modality of the output sequence. Shazeer, however, teaches this limitation (receive as input a sequence of tokens from a token vocabulary, and map the sequence of tokens to a predetermined dimensionality, the predetermined dimensionality dependent on a dimension of the decoder neural networks, [0011], each output modality neural network is configured to map data outputs of the unified representation space received from the decoder neural network, decoder data output 116, to mapped data outputs of one of the multiple modalities, each output modality neural network is specific to a respective modality and defines transformations between the unified representation and the modality, [0047]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer so that for each output sequence of tokens, decoding the sequence of tokens using a decoder neural network corresponding to the modality of the output sequence to generate an output of the modality of the output sequence because Shazeer suggests that this ensures that the output is of the correct modality [0047].
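For practitioners mapping the Claim 1 language onto an implementation: the recited loop is the familiar decoder-only pattern in which one model emits tokens from a unified vocabulary and each per-modality subsequence is handed to its own decoder. The following is a minimal editorial sketch only; the vocabulary partition, the greedy decoding, and every identifier (MODALITY_RANGES, model.embed, the decoder interface) are illustrative assumptions, not taken from the application or the cited references.

```python
# Sketch: one autoregressive model emits tokens from a unified vocabulary;
# the combined output is split into per-modality runs, each decoded by a
# modality-specific decoder. All names and ranges are hypothetical.
import torch

VOCAB_SIZE = 70_000
MODALITY_RANGES = {                 # hypothetical vocabulary partition
    "text":  range(0, 32_000),
    "video": range(32_000, 64_000),
    "audio": range(64_000, 70_000),
}
EOS = 69_999

def modality_of(token_id: int) -> str:
    for name, ids in MODALITY_RANGES.items():
        if token_id in ids:
            return name
    raise ValueError(token_id)

@torch.no_grad()
def generate(model, input_embeddings, decoders, max_tokens=256):
    """model maps an embedding sequence to next-token logits and exposes an
    embed() method; decoders maps a modality name to a decoder network."""
    generated = []
    seq = input_embeddings                    # (1, seq_len, d_model)
    for _ in range(max_tokens):
        logits = model(seq)[:, -1]            # (1, VOCAB_SIZE)
        next_id = int(logits.argmax(-1))      # greedy for simplicity
        if next_id == EOS:
            break
        generated.append(next_id)
        # Feed the new token back in: this is the autoregressive step.
        seq = torch.cat([seq, model.embed(torch.tensor([[next_id]]))], dim=1)
    # Split the combined sequence into per-modality runs, then decode each
    # run with the decoder corresponding to its modality.
    outputs, run, run_mod = [], [], None
    for tok in generated + [None]:            # None flushes the last run
        mod = modality_of(tok) if tok is not None else None
        if mod != run_mod and run:
            outputs.append((run_mod, decoders[run_mod](run)))
            run = []
        run_mod = mod
        if tok is not None:
            run.append(tok)
    return outputs
```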
As per Claim 2, Ramsauer teaches wherein obtaining the model input comprises receiving a respective input for each of one or more input modalities from a set of a plurality of input modalities, the plurality of input modalities comprising one or more of text, image, video, or audio modality inputs (large language model (LLM) that is grounded to a multi-modal domain, enabling the model to process and generate arbitrary interleaved other-modality data and text data, this is achieved by fine-tuning weights for newly added vocabulary and the input and output layers of the design generation system 120 to enable cross-modality interactions, [0084], system that is capable of processing and integrating information from multiple modalities, the modalities may be distinct types of data such as text, images, audio, video, [0041]).

As per Claim 5, Ramsauer teaches wherein the model input comprises one or more of image, video, or audio modality inputs, and wherein processing the one or more of the image, video, or audio modality inputs to generate an input sequence of embeddings that represents the one or more of the image, video or audio modality inputs further comprises: processing each modality input of the one or more of the image, video, or audio modality inputs using a respective encoder model corresponding to the modality of the modality input to generate a respective sequence of token embeddings from the modality input (for example, a font encoder 304B, may be utilized to generate vector embeddings for any fonts present in the input, a media encoder 304A may be utilized to generate vector embeddings for any media items in the input, each type of media item—images, videos, and audio—may have its own separate encoders, [0090], generating, using the encoders, the vector embeddings for the special tokens, [0422]).
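The encoder side recited in Claim 5 is the mirror image of the decoding step above: each modality input runs through its own encoder, and the resulting embedding sequences are concatenated into one input sequence. A minimal sketch, with the encoder class, dimensions, and modality names all assumed for illustration:

```python
# Sketch: per-modality encoders produce token embeddings that are
# concatenated into a single input sequence of embeddings.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a modality-specific encoder (image/video/audio)."""
    def __init__(self, in_dim, d_model):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
    def forward(self, x):            # x: (1, n_tokens, in_dim)
        return self.proj(x)          # (1, n_tokens, d_model)

d_model = 512
encoders = {
    "image": ToyEncoder(768, d_model),
    "video": ToyEncoder(1024, d_model),
    "audio": ToyEncoder(128, d_model),
}

def embed_inputs(inputs):
    """inputs: dict mapping modality name -> raw features (1, n, in_dim)."""
    parts = [encoders[m](x) for m, x in inputs.items()]
    return torch.cat(parts, dim=1)   # one input sequence of embeddings

seq = embed_inputs({
    "image": torch.randn(1, 64, 768),
    "audio": torch.randn(1, 50, 128),
})
print(seq.shape)  # torch.Size([1, 114, 512])
```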
As per Claim 10, Ramsauer teaches wherein the autoregressive token generation neural network (302) has been trained, the training comprising: pretraining the autoregressive token generation neural network on one or more multimodal generative tasks by prepending a task token from a set of corresponding task tokens indicative of using the model input for training a particular generative task object to each input sequence of embeddings, wherein each corresponding task token is used to condition the output in accordance with each multimodal generative task; and fine-tuning the autoregressive token generation neural network based at least on one of the multimodal generative tasks (auto-regressive pre-trained large language model (LLM), [0006], LLM 302 is pre-trained to predict output tokens sequentially, token refers to the basic unit of input and output that the design generation system processes during training, tokens usually represent various linguistic elements that the model has been pre-trained on, LLM 302 is also trained to predict additional tokens that are distinct from the regular vocabulary tokens of the LLM, these additional tokens are called special tokens, special tokens include tokens related to non-text modalities, [0086], LLM 302 may have any other architecture, such as a neural network architecture, [0087]).

As per Claim 19, Claim 19 is similar in scope to Claim 1, except that Claim 19 is directed to a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method of Claim 1. Ramsauer teaches a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method (computer processing system, including: one or more processing units; and one or more non-transitory computer-readable storage media storing instructions, which when executed by the one or more processing units, cause the one or more processing units to perform a method as described above, [0007]). Thus, Claim 19 is rejected under the same rationale as Claim 1.

As per Claim 20, Claim 20 is similar in scope to Claim 19, and therefore is rejected under the same rationale.
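The task-token conditioning recited in Claim 10 amounts to prepending a task identifier to each training sequence so that the model's output distribution is conditioned on the task. A minimal sketch; the token names and task set are hypothetical:

```python
# Sketch: a task token prepended to each input sequence conditions the
# model on the generative task during multi-task pretraining.
TASK_TOKENS = {
    "text_to_video": "<t2v>",
    "video_captioning": "<cap>",
    "audio_to_video": "<a2v>",
}

def build_training_example(task: str, input_tokens: list[str]) -> list[str]:
    # Prepending the task token is what conditions the output on the task.
    return [TASK_TOKENS[task]] + input_tokens

print(build_training_example("text_to_video", ["a", "cat", "on", "skis"]))
# ['<t2v>', 'a', 'cat', 'on', 'skis']
```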
Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Wang (US 20250094713A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. However, Ramsauer and Shazeer do not teach wherein obtaining the model input comprises: obtaining one or more of pixel masks or monocular depth maps of a first video frame in a video modality input. However, Wang teaches wherein obtaining the model input comprises: obtaining one or more of pixel masks or monocular depth maps of a first video frame in a video modality input (second data modalities such as videos may be used as a second data modality, since the videos may all be represented as images (point cloud data may be converted into a depth map), these data modalities may be collectively denoted as an image modality, [0068]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer so that obtaining the model input comprises: obtaining one or more of pixel masks or monocular depth maps of a first video frame in a video modality input as suggested by Wang. It is well-known in the art to use depth maps to model 3D shapes.

Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Hinz (US 20240320872A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. However, Ramsauer and Shazeer do not teach wherein the model input comprises a text modality input, and wherein processing the text modality input to generate an input sequence of embeddings that represents the text modality input comprises: processing the text modality input using a text encoder to generate a sequence of text embeddings; and mapping the text embeddings in the sequence of text embeddings to a subset of the embeddings in the input sequence of embeddings. However, Hinz teaches wherein the model input comprises a text modality input, and wherein processing the text modality input to generate an input sequence of embeddings that represents the text modality input comprises: processing the text modality input using a text encoder to generate a sequence of text embeddings (text encoder trained to encode a text prompt to obtain the text embedding, [0038]); and mapping the text embeddings in the sequence of text embeddings to a subset of the embeddings in the input sequence of embeddings (maps a text embedding of the text prompt and an image embedding of the image prompt to a joint embedding space, [0041]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer so that the model input comprises a text modality input, and wherein processing the text modality input to generate an input sequence of embeddings that represents the text modality input comprises: processing the text modality input using a text encoder to generate a sequence of text embeddings; and mapping the text embeddings in the sequence of text embeddings to a subset of the embeddings in the input sequence of embeddings because Hinz suggests that this way, the image generation model generates images that accurately reflect a textual description and stylistic input from an image condition [0032].
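The Claim 4 mapping step is, in essence, a learned projection from the text encoder's embedding space into the embedding space the token generator consumes. A minimal sketch, with all dimensions and the encoder stand-in assumed:

```python
# Sketch: text-encoder embeddings are projected into the model's joint
# embedding space so they can occupy a subset of the input sequence.
import torch
import torch.nn as nn

text_dim, d_model = 768, 512
text_encoder = nn.Embedding(32_000, text_dim)   # stand-in for a real encoder
to_joint_space = nn.Linear(text_dim, d_model)   # learned mapping

token_ids = torch.tensor([[17, 942, 5]])        # hypothetical token ids
text_embeddings = text_encoder(token_ids)       # (1, 3, 768)
mapped = to_joint_space(text_embeddings)        # (1, 3, 512)
print(mapped.shape)
```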
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Barbieri (US 20230262293A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 5. Ramsauer teaches wherein processing each modality input of the one or more of the image, video, or audio modality inputs using a respective encoder model corresponding to the modality of the modality input to generate a respective sequence of token embeddings from the modality input [0090, 0422]. However, Ramsauer and Shazeer do not teach encoding the video modality input comprising encoding each of a plurality of segments of the video using a temporally-consistent visual tokenizer; or encoding the image modality input as a single video frame using the temporally-consistent visual tokenizer. However, Barbieri teaches encoding the video modality input comprising encoding each of a plurality of segments of the video using a temporally-consistent visual tokenizer; or encoding the image modality input as a single video frame using the temporally-consistent visual tokenizer (autoencoder 103, special VID token 134, textual embedding, to help generate temporally consistent videos, [0073]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer to include encoding the video modality input comprising encoding each of a plurality of segments of the video using a temporally-consistent visual tokenizer; or encoding the image modality input as a single video frame using the temporally-consistent visual tokenizer because Barbieri suggests that this generates temporally consistent videos, so that objects, textures, lighting, and motion patterns appear stable from frame to frame, rather than jittering, drifting, or flickering [0073].

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1), Shazeer (US 20200089755A1), and Barbieri (US 20230262293A1) in view of Pan (US 20070067166A1). Ramsauer, Shazeer, and Barbieri are relied upon for the teachings as discussed above relative to Claim 6. Ramsauer teaches wherein processing each modality input of the one or more of the image, video, or audio modality inputs using a respective encoder model corresponding to the modality of the modality input to generate a respective sequence of token embeddings from the modality input [0090, 0422]. However, Ramsauer, Shazeer, and Barbieri do not teach encoding the audio modality input using a residual vector quantizer to generate one or more vectors from a set of vector codebooks, each codebook specifying a respective frequency of the audio modality input. However, Pan teaches encoding the audio modality input using a residual vector quantizer to generate one or more vectors from a set of vector codebooks, each codebook specifying a respective frequency of the audio modality input (vector quantization for audio encoding comprises: filtering an input audio signal so as to gain a time-frequency filter coefficient and outputting a filtered signal, dividing vectors of the filtered signal in a time-frequency plane so as to gain a vector combination, quantizing the selected vectors and calculating a residual error of quantization, and transmitting a quantized codebook information as a side-information of an encoder to an audio decoder to quantize and encode the residual error of quantization, [0007], calculate the energy and the values of each order difference of each selected point from the codebook according to the index, obtain the location information of the vector quantization in the time-frequency plane from the code stream, [0071]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer, Shazeer, and Barbieri to include encoding the audio modality input using a residual vector quantizer to generate one or more vectors from a set of vector codebooks, each codebook specifying a respective frequency of the audio modality input because Pan suggests that this increases encoding efficiency [0006].
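Residual vector quantization, the technique named in Claim 7, is a well-known staged scheme: each stage quantizes the residual left by the previous stage against its own codebook. A minimal sketch of the generic technique (not Pan's specific time-frequency codec); codebook sizes and the per-codebook frequency-band interpretation are assumptions:

```python
# Sketch: residual vector quantization. Each stage picks the nearest
# codeword to the remaining residual, so reconstruction error shrinks
# as more codebooks (stages) are added.
import torch

def rvq_encode(x, codebooks):
    """x: (d,) input vector. codebooks: list of (K, d) tensors.
    Returns one code index per codebook plus the final residual."""
    residual = x.clone()
    indices = []
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin().item()  # nearest codeword
        indices.append(idx)
        residual = residual - cb[idx]      # next stage quantizes the leftover
    return indices, residual

def rvq_decode(indices, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indices))

torch.manual_seed(0)
codebooks = [torch.randn(256, 16) for _ in range(4)]   # 4 stages
x = torch.randn(16)
codes, res = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, float((x - x_hat).norm()))
```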
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Fayyaz (US 20250053748A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. Ramsauer teaches wherein autoregressively generating the output sequence of tokens comprises: generating a sequence of video modality tokens (autoregressive token generation, [0308], special token that includes special characters that indicate that it is a special token, VIDEO, [0154]). However, Ramsauer and Shazeer do not teach generating a sequence of video modality tokens comprising a sequence of image modality tokens with corresponding audio modality tokens. However, Fayyaz teaches generating a sequence of video modality tokens comprising a sequence of image modality tokens with corresponding audio modality tokens (any combination of image-based tokens, video-based tokens, audio-based tokens, [0027]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer to include generating a sequence of video modality tokens comprising a sequence of image modality tokens with corresponding audio modality tokens because Fayyaz suggests that this is useful for processing different modalities [0027].
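One plausible realization of the Claim 8 token layout is to interleave each frame's image tokens with the audio tokens for the same time slice. A minimal sketch; the interleaving scheme itself is an assumption, not something the references specify:

```python
# Sketch: a video-modality token sequence built from per-frame image
# tokens followed by their corresponding audio tokens.
def interleave_video_tokens(frame_tokens, audio_tokens):
    """frame_tokens: list of per-frame image-token lists.
    audio_tokens: list of per-frame audio-token lists (same length)."""
    sequence = []
    for img, aud in zip(frame_tokens, audio_tokens, strict=True):
        sequence.extend(img)   # image tokens for this frame...
        sequence.extend(aud)   # ...followed by its corresponding audio
    return sequence

print(interleave_video_tokens([[1, 2], [3, 4]], [[9], [8]]))
# [1, 2, 9, 3, 4, 8]
```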
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1), Shazeer (US 20200089755A1), and Fayyaz (US 20250053748A1) in view of Barbieri (US 20230262293A1) and Yuan (US 20250173825A1). Ramsauer, Shazeer, and Fayyaz are relied upon for the teachings as discussed above relative to Claim 8. However, Ramsauer, Shazeer, and Fayyaz do not teach further comprising generating a sequence of high-resolution image modality tokens from the image modality tokens, wherein generating a sequence of high-resolution image modality tokens comprises using a non-autoregressive bidirectional transformer. However, Barbieri teaches further comprising generating a sequence of high-resolution image modality tokens from the image modality tokens, wherein generating a sequence of high-resolution image modality tokens comprises using a non-autoregressive bidirectional transformer (142) (non-autoregressive generation pipeline with a bidirectional transformer is applied, [0021], image-based models capable of achieving improved resolutions, [0003], train the non-autoregressive BERT module 142 on video tokens, [0031]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer, Shazeer, and Fayyaz to include generating a sequence of high-resolution image modality tokens from the image modality tokens, wherein generating a sequence of high-resolution image modality tokens comprises using a non-autoregressive bidirectional transformer because Barbieri suggests that this improves video quality and consistency (Abstract). However, Ramsauer, Shazeer, Fayyaz, and Barbieri do not teach windowed local-attention, cross-attending the super-resolution image modality tokens with the image modality tokens along each of a spatial vertical, spatial horizontal, and temporal axis; and self-attending the super-resolution image modality tokens. However, Yuan teaches windowed local-attention, cross-attending the super-resolution image modality tokens with the image modality tokens along each of a spatial vertical, spatial horizontal, and temporal axis; and self-attending the super-resolution image modality tokens (cross-attention fusion module in the super-resolution diffusion model may resize all cross-attention accumulated weight maps in the image generation process, establish the cross-attention weight maps of each token corresponding to different positions of the image and incorporates them into the feature output by the self-attention module, thereby realizing the guidance of the semantic information to the features in the super-resolution diffusion model, [0185], extract horizontal and vertical relative features, and thus can correct the generated texture misalignment, horizontal or vertical texture, thereby realizing the high-quality super-resolution processing process, [0177]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer, Shazeer, Fayyaz, and Barbieri to include windowed local-attention, cross-attending the super-resolution image modality tokens with the image modality tokens along each of a spatial vertical, spatial horizontal, and temporal axis; and self-attending the super-resolution image modality tokens because Yuan suggests that this reduces the size and the time consumed for the model that obtains high-resolution images [0006].
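The axis-by-axis attention recited in Claim 9 can be pictured as ordinary attention applied along one dimension of a (time, height, width) token grid at a time, with all other positions treated as batch. A minimal sketch; the single-head, un-windowed simplification and all shapes are assumptions, not the claimed or cited architectures:

```python
# Sketch: cross-attention along one axis of a (B, T, H, W, D) token grid
# at a time (temporal, vertical, horizontal), followed by self-attention.
import torch
import torch.nn.functional as F

def axial_attention(q_grid, kv_grid, axis):
    """q_grid, kv_grid: (B, T, H, W, D). Attend along `axis` (1, 2, or 3),
    treating every other position as part of the batch."""
    q = q_grid.movedim(axis, -2)           # sequence dim = chosen axis
    kv = kv_grid.movedim(axis, -2)
    shape = q.shape
    q = q.reshape(-1, shape[-2], shape[-1])
    kv = kv.reshape(-1, shape[-2], shape[-1])
    out = F.scaled_dot_product_attention(q, kv, kv)
    return out.reshape(shape).movedim(-2, axis)

B, T, H, W, D = 1, 4, 8, 8, 32
sr = torch.randn(B, T, H, W, D)            # super-resolution tokens
lr = torch.randn(B, T, H, W, D)            # low-res tokens, upsampled to match
for axis in (1, 2, 3):                     # temporal, vertical, horizontal
    sr = axial_attention(sr, lr, axis)     # cross-attention per axis
sr = axial_attention(sr, sr, 3)            # self-attention (one axis shown)
print(sr.shape)
```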
Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Jin (US 20240420458A1) and Caba Heilbron (US 20230325685A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 10. However, Ramsauer and Shazeer do not teach further comprising processing a training set of model inputs comprising one or more of a plurality of labelled image-text pairs. However, Jin teaches further comprising processing a training set of model inputs comprising one or more of a plurality of labelled image-text pairs (training image-text pair, image that is pre-labeled with a corresponding text may be used as an image-text pair, [0050]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer to include processing a training set of model inputs comprising one or more of a plurality of labelled image-text pairs because Jin suggests that this way, the accuracy and efficiency of a cross-modal data processing task is improved [0026]. However, Ramsauer, Shazeer, and Jin do not teach processing a training set of model inputs comprising a plurality of unlabeled video-only data items. However, Caba Heilbron teaches processing a training set of model inputs comprising a plurality of unlabeled video-only data items (untrained model is then trained using the unlabeled videos and text labels provided as inputs to the pre-trained image-text classification model, [0028]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer, Shazeer, and Jin to include processing a training set of model inputs comprising a plurality of unlabeled video-only data items because Caba Heilbron suggests that this way, it can accurately classify a video input [0004].

Claim(s) 14-15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Cambronero Sanchez (US 20250278573A1). As per Claim 14, Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 10. However, Ramsauer and Shazeer do not expressly teach further comprising processing the model input in accordance with sequentially chaining two or more multimodal generative tasks. However, Cambronero Sanchez teaches further comprising processing the model input in accordance with sequentially chaining two or more multimodal generative tasks (performing certain types of multimodal tasks which require comprehension of information from a single modality at a time and require the performance of a predetermined sequence of steps to generate an output, [0001]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer to include processing the model input in accordance with sequentially chaining two or more multimodal generative tasks because Cambronero Sanchez suggests that it is well-known in the art that this is needed to perform certain types of multimodal tasks [0001].

As per Claim 15, Ramsauer teaches performing a first multimodal generative task by prepending a first corresponding task token for the first multimodal generative task to the model input; generating a first model output using the first corresponding task token; performing a second multimodal generative task by prepending a second corresponding task token for the second multimodal generative task to the first model output; and generating a second model output using the second corresponding task token (LLM 302 to predict output tokens sequentially, LLM 302 to predict additional tokens that are distinct from the regular vocabulary tokens of the LLM, [0086], using an autoregressive multi-modal machine learning system that generates design tokens sequentially, [0336], large language model (LLM) that is grounded to a multi-modal domain, enabling the model to process and generate arbitrary interleaved other-modality data and text data, this is achieved by fine-tuning weights for newly added vocabulary and the input and output layers of the design generation system 120 to enable cross-modality interactions, [0084], system that is capable of processing and integrating information from multiple modalities, the modalities may be distinct types of data such as video, [0041]). However, Ramsauer and Shazeer do not expressly teach sequentially chaining two or more multimodal generative tasks. However, Cambronero Sanchez teaches sequentially chaining two or more multimodal generative tasks [0001]. This would be obvious for the reasons given in the rejection for Claim 14.
Claim(s) 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Graciarena (US 20240147025A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. Ramsauer teaches wherein generating the model output that includes the video modality and the one or more other modalities [0086, 0336, 0084, 0041]. However, Ramsauer and Shazeer do not teach wherein generating the model output that includes the video modality and the one or more other modalities comprises generating a stylized video output. However, Graciarena teaches wherein generating the model output that includes the one or more other modalities comprises generating a stylized video output (separate the modality feature vectors into classes based on determined characteristics (video styles) of the modality feature vectors, [0029]). Thus, this teaching of the stylized video output from Graciarena can be implemented into the device of Ramsauer so that generating the model output that includes the video modality and the one or more other modalities comprises generating a stylized video output. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer so that generating the model output that includes the video modality and the one or more other modalities comprises generating a stylized video output as suggested by Graciarena. It is well-known in the art to generate stylized video for many applications such as animation and games.

Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Song (US 20250029384A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. Ramsauer teaches wherein generating the model output that includes the video modality and the one or more other modalities [0086, 0336, 0084, 0041]. However, Ramsauer and Shazeer do not teach wherein generating the model output that includes the video modality and the one or more other modalities comprises generating an inpainted video output. However, Song teaches wherein generating the model output that includes the one or more other modalities comprises generating an inpainted video output (inpainted video is output, [0142], inpainting process is implemented by a neural network, [0087]). Thus, this teaching of the inpainted video output from Song can be implemented into the device of Ramsauer so that generating the model output that includes the video modality and the one or more other modalities comprises generating an inpainted video output. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer so that generating the model output that includes the video modality and the one or more other modalities comprises generating an inpainted video output as suggested by Song. It is well-known in the art that inpainting fills in missing or unwanted parts of an image.
Claim(s) 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ramsauer (US 20250251853A1) and Shazeer (US 20200089755A1) in view of Palczewski (US 20250063136A1). Ramsauer and Shazeer are relied upon for the teachings as discussed above relative to Claim 1. Ramsauer teaches wherein generating the model output that includes the video modality and the one or more other modalities [0086, 0336, 0084, 0041]. However, Ramsauer and Shazeer do not teach generating the model output that includes the video modality and the one or more other modalities comprises generating an outpainted video output. However, Palczewski teaches generating the model output that includes one or more other modalities comprises generating outpainted video output (electronic device 101 can use any suitable machine learning model, electronic device 101 outputs outpainted video 265, [0069]). Thus, this teaching of the outpainted video output from Palczewski can be implemented into the device of Ramsauer so that generating the model output that includes the video modality and the one or more other modalities comprises generating an outpainted video output. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramsauer and Shazeer to include generating the model output that includes the video modality and the one or more other modalities comprises generating an outpainted video output as suggested by Palczewski. It is well-known in the art that outpainting is an AI-powered image editing technique that extends an image beyond its original borders, generating new, contextually relevant content to seamlessly expand the scene, allowing users to change aspect ratios, add visual elements, and create wider images by describing what should be added in text prompts.

Allowable Subject Matter

Claims 12-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter: The prior art, taken singly or in combination, does not teach or suggest the combination of all the limitations of Claim 12 and base Claim 1 and intervening Claims 10-11, and in particular, does not teach wherein the plurality of labelled image-text pairs includes a first number of model inputs and the plurality of unlabeled video-only data items includes a second number of model inputs, and wherein the first number is greater than the second number. Claim 13 depends from Claim 12, and therefore also contains allowable subject matter.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONI HSU whose telephone number is (571)272-7785. The examiner can normally be reached M-F 10am-6:30pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kee Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JONI HSU/
Primary Examiner, Art Unit 2611

Prosecution Timeline

May 16, 2024 — Application Filed
Jan 14, 2026 — Non-Final Rejection (§101, §103)
Apr 09, 2026 — Interview Requested
Apr 13, 2026 — Applicant Interview (Telephonic)
Apr 13, 2026 — Examiner Interview Summary

Precedent Cases

Applications granted by this examiner involving similar technology

Patent 12592028 — METHODS AND DEVICES FOR IMMERSING A USER IN AN IMMERSIVE SCENE AND FOR PROCESSING 3D OBJECTS (granted Mar 31, 2026; 2y 5m to grant)
Patent 12586306 — METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR MODELING OBJECT (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586260 — CREATING IMAGE ENHANCEMENT TRAINING DATA PAIRS (granted Mar 24, 2026; 2y 5m to grant)
Patent 12581168 — A METHOD FOR A MEDIA FILE GENERATING AND A METHOD FOR A MEDIA FILE PROCESSING (granted Mar 17, 2026; 2y 5m to grant)
Patent 12561850 — IMAGE GENERATION WITH LEGIBLE SCENE TEXT (granted Feb 24, 2026; 2y 5m to grant)

Based on this examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 87%
With Interview: 95% (+7.2%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 848 resolved cases by this examiner. Grant probability derived from career allow rate.
