Prosecution Insights
Last updated: April 18, 2026
Application No. 18/790,535

AUTOMATIC ANNOTATION OF THREE-DIMENSIONAL SHAPE DATA FOR TRAINING TEXT TO 3D GENERATIVE AI SYSTEMS AND APPLICATIONS

Non-Final OA §103
Filed
Jul 31, 2024
Examiner
SALVUCCI, MATTHEW D
Art Unit
2613
Tech Center
2600 — Communications
Assignee
Nvidia Corporation
OA Round
1 (Non-Final)
Grant Probability
72% (Favorable)
OA Rounds
1-2
To Grant
2y 12m
With Interview
99%

Examiner Intelligence

Career Allow Rate
72% (above average; 348 granted / 485 resolved; +9.8% vs TC avg)
Interview Lift
+28.5% across resolved cases with interview (strong)
Avg Prosecution
2y 12m (17 applications currently pending)
Total Applications
502 across all art units
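
As a quick check, the career allow rate shown above follows directly from the resolved-case counts:

$$\frac{348\ \text{granted}}{485\ \text{resolved}} \approx 0.718 \approx 72\%$$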

Statute-Specific Performance

§101
4.6% (-35.4% vs TC avg)
§103
60.8% (+20.8% vs TC avg)
§102
17.0% (-23.0% vs TC avg)
§112
14.3% (-25.7% vs TC avg)
Tech Center averages are estimates • Based on career data from 485 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Menashof et al. (US Pub. 2025/0148753), hereinafter Menashof, in view of Greasley (US Pub. 2021/0397842).

Regarding claim 1, Menashof discloses a method comprising: generating, based at least on one or more language models processing first data associated with one or more images depicting one or more shapes (Paragraph [0005]: the encoder includes a large language model (LLM) or other natural language model to generate descriptors of the pivot images. For example, the descriptors can include natural language descriptions of the objects, backgrounds, interactions, and concepts included in the pivot images): one or more first captions describing the one or more shapes (Paragraph [0024]: the encoder extracts every tenth frame of the video and causes a first generative machine model to generate a descriptor for the extracted frames (e.g., a natural language description of the soccer ball and the location). In this example, the extracted frames are selected based on an interval of time (e.g., every tenth frame), although, as described in greater detail below, other algorithms can be used for selecting frames of the video to be extracted); and one or more second captions describing the one or more shapes (Paragraph [0074]: neural network 202 generates one or more text descriptions 204 of a scene (e.g., sentences). In at least one embodiment, neural network 202 generates one or more text descriptions 204 of one or more scenes creating description set; Paragraph [0128]: generated back prompts 303a-303n are text that includes a word, phrase, or sentence that describes a background. In at least one embodiment, for example, neural network 301 generates the background prompts 303a-303n “in snow, in forest, . . . in autumn.” Neural network 301 combines the subjects of input 301 and the one or more background prompts 303a-303n into a output vector space (e.g., a subject-background prompt vector space). 
When combing the subjects with the background prompts, neural network 302 learns the connection between the text describing subject of the subject set 203 and the one or more text descriptions of the one or more background prompts 303a-303n); and generating second data that associates the one or more shapes with the one or more first captions and the one or more second captions (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Menashof does not explicitly disclose the one or more first captions associated with one or more first numbers of words that are less than a threshold number of words; or the one or more second captions being associated with one or more second numbers of words that are equal to or greater than the threshold number of words. However, Greasley teaches generating a text description for images (Paragraphs [0038]-[0039]), further comprising: the one or more first captions associated with one or more first numbers of words that are less than a threshold number of words (Paragraph [0038]: if the user 106 indicates a preference for a short listening experience, the electronic device 102 and/or the controller 104 may generate a terse text description, e.g., with a low word count, for example, less than a threshold number of words. 
On the other hand, if the user 106 indicates a preference for a longer listening experience, the electronic device 102 and/or the controller 104 may generate a more verbose text description, e.g., with a high word count, for example, greater than a threshold number of words); and the one or more second captions being associated with one or more second numbers of words that are equal to or greater than the threshold number of words (Paragraph [0039]: if the user 106 is moving at a low velocity or is stationary, the electronic device 102 and/or the controller 104 may generate a text description with a high word count (e.g., greater than a threshold number of words). If the user 106 is moving at a high velocity, the electronic device 102 and/or the controller 104 may generate a text description with a low word count (e.g., less than a threshold number of words) or may increase the speed (e.g., words per minute) of the narration to keep pace with the movement of the user 106; Paragraph [0053]: length of the text description may be determined based on the velocity of the user. For example, if the user is stationary or moving at a low velocity, the audio output generator 224 may generate a verbose text description with a high word count (e.g., greater than a threshold number of words). If the user is moving at a high velocity, the audio output generator 224 may generate a terse text description with a low word count (e.g., less than a threshold number of words). In some implementations, the audio output generator 224 selects a depth to which the ontology 226 is traversed). Greasley teaches that this will allow for user preference of descriptions to be realized (Paragraph [0038]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Menashof with the features of above as taught by Greasley so as to allow for user preference of descriptions to be realized as presented by Greasley. Regarding claim 2, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses further comprising: generating one or more first embeddings associated with the one or more first captions and one or more second embeddings associated with the one or more second captions, wherein the second data associates the one or more shapes with the one or more first embeddings and the one or more second embeddings (Paragraph [0102]: each word or character in the input(s) 601 is mapped into the input embedding 602 in parallel or at the same time, unlike existing long short-term memory (LSTM) models, for example. The input embedding 602 maps a word to a feature vector representing the word. But the same word (for example, “apple”) in different sentences may have different meanings (for example, phone versus fruit). This is why a positional encoder 604 can be implemented. A positional encoder 604 is a vector that gives context to words (for example, “apple”) based on a position of a word in a sentence. 
For example, with respect to a message “I just sent the document,” because “I” is at the beginning of a sentence, embodiments can indicate a position in an embedding closer to “just,” as opposed to “document.” Some embodiments use a sine/cosine function to generate the positional encoder vector using the following two example equations; Paragraph [0110]: the initial embedding (for example, the input embedding 602) is constructed from three vectors: the token embeddings, the segment or context-question embeddings, and the position embeddings. In some embodiments, the following functionality occurs in the pre-training phase. The token embeddings are the pre-trained embeddings. The segment embeddings are the sentence numbers (that includes the input [s] 601) that is encoded into a vector (for example, first sentence, second sentence, and so forth, assuming a top-down and right-to-left approach). The position embeddings are vectors that represent the position of a particular word in such a sentence that can be produced by positional encoder 604. When these three embeddings are added or concatenated together, an embedding vector is generated that is used as input into the encoder/decoder block(s) 606. The segment and position embeddings are used for temporal ordering since all of the vectors are fed into the encoder/decoder block(s) 606 simultaneously, and language models need some sort of order preserved). Regarding claim 3, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses further comprising: generating one or more embeddings associated with one or more combinations of the one or more first captions and the one or more second captions, wherein the second data associates the one or more shapes with the one or more embeddings (Paragraphs [0102]-[0103]: each word or character in the input(s) 601 is mapped into the input embedding 602 in parallel or at the same time, unlike existing long short-term memory (LSTM) models, for example. The input embedding 602 maps a word to a feature vector representing the word. But the same word (for example, “apple”) in different sentences may have different meanings (for example, phone versus fruit). This is why a positional encoder 604 can be implemented. A positional encoder 604 is a vector that gives context to words (for example, “apple”) based on a position of a word in a sentence. For example, with respect to a message “I just sent the document,” because “I” is at the beginning of a sentence, embodiments can indicate a position in an embedding closer to “just,” as opposed to “document.” Some embodiments use a sine/cosine function to generate the positional encoder vector using the following two example equations… passing the input(s) 601 through the input embedding 602 and applying the positional encoder 604, the output is a word embedding feature vector, which encodes positional information or context based on the positional encoder 604. These word embedding feature vectors are then passed to the encoder and/or decoder block(s) 606, where it goes through a multi-head attention layer 606-1 and a feedforward layer 606-2. 
The multi-head attention layer 606-1 is generally responsible for focusing or processing certain parts of the feature vectors representing specific portions of the input(s) 601 by generating attention vectors; Paragraph [0110]: the initial embedding (for example, the input embedding 602) is constructed from three vectors: the token embeddings, the segment or context-question embeddings, and the position embeddings. In some embodiments, the following functionality occurs in the pre-training phase. The token embeddings are the pre-trained embeddings. The segment embeddings are the sentence numbers (that includes the input [s] 601) that is encoded into a vector (for example, first sentence, second sentence, and so forth, assuming a top-down and right-to-left approach). The position embeddings are vectors that represent the position of a particular word in such a sentence that can be produced by positional encoder 604. When these three embeddings are added or concatenated together, an embedding vector is generated that is used as input into the encoder/decoder block(s) 606. The segment and position embeddings are used for temporal ordering since all of the vectors are fed into the encoder/decoder block(s) 606 simultaneously, and language models need some sort of order preserved). Regarding claim 4, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses wherein the generating the one or more first captions and the one or more second captions is further based at least on the one or more machine learning models processing third data representative of one or more descriptions associated with the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. 
For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model). Regarding claim 5, Menashof, in view of Greasley teaches the method of claim 4, Menashof discloses wherein the one or more descriptions may include at least one of: one or more categories associated with the one or more shapes; or one or more tags indicating one or more characteristics associated with the one or more shapes (Fig. 4; Paragraph [0075]: FIG. 4 is a flow diagram showing a method 400 for generating pivot images and descriptors for compressed video data, in accordance with at least one embodiment. The method 400 can be performed, for instance, by the encoder 124 of the video compression tool 104 of FIG. 1. As shown at block 402, the system implementing the method 300 obtains a video frame. In various embodiments, a video is obtained, and individual frames of the video (e.g., video frames) are extracted and/or processed. At block 404, the system implementing the method 400 determines and/or detects objects in the video frame. For example, as described above, an object detection model takes the video frame as an input and outputs data associated with objects in the video frame (e.g., labels, confidence intervals, tags, names, type information etc.)). Regarding claim 6, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses wherein the generating the one or more first captions and the one or more second captions is further based at least on the one or more machine learning models processing third data representative of an output format, the output format associated with generating both the one or more first captions and the one or more second captions for the one or more shapes (Paragraph [0020]: compression techniques rely on eliminating or reducing redundant data and are limited by various constraints including video and audio fidelity. In addition, the focus on maintaining fidelity limits the effectiveness of compression techniques. In addition, video compression is necessary in order to share or otherwise transmit videos on the Internet because compression reduces the amount of data that is needed to stream or send the video to the viewer, and network bandwidth is a limited resource. One way to address this issue is by using a video coding format (e.g., a video compression format), which is a content representation format for storage or transmission of digital video content (e.g., in a data file or bitstream). 
Typically these formats use a video compression algorithm, such as discrete cosine transform (DCT) coding or motion compensation; Paragraph [0112]: after pre-training is performed, the encoder/decoder block(s) 606 performs prompt engineering or fine-tuning on a variety of QA data sets by converting different QA formats into a unified sequence-to-sequence format. For example, some embodiments perform the QA task by adding a new question-answering head or encoder/decoder block, just the way a masked language model head is added (in pre-training) for performing an MLM task, except that the task is a part of prompt engineering or fine-tuning. This includes the encoder/decoder block(s) 606 processing the inputs (e.g., pivot images 142 and/or descriptors 144 of FIG. 1) in order to make the predictions and generate a prompt response, as indicated in 604. Prompt engineering, in some embodiments, is the process of crafting and optimizing text prompts for language models to achieve desired outputs. In other words, prompt engineering comprises a process of mapping prompts (for example, a question) to the output (for example, an answer) that it belongs to for training. For example, if a user asks a model to generate a poem about a person fishing on a lake, the expectation is it will generate a different poem each time. Users may then label the output or answers from best to worst. Such labels are an input to the model to make sure the model is giving a more human-like or best answers, while trying to minimize the worst answers (for example, via reinforcement learning). In some embodiments, a “prompt” as described herein includes one or more of: a request (for example, a question or instruction [for example, “write a poem”]), target content, and one or more examples, as described herein). Regarding claim 7, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses further comprising: generating, based at least on image data representative of the one or more images, one or more input tokens, wherein the first data represents the one or more input tokens (Paragraph [0060]: inference phase can be divided into two stages: a prompt stage and an auto-regressive stage. The prompt stage can include receiving and processing input as a batch of new tokens as part of the same inference. The prompt stage may operate based on a Key-Value (KV) cache technique, where a KV cache is created for tokens in a batch. During the prompt stage, the input is being digested. The auto-regressive state can include using the model to generate the tokens one-by-one, based on previous tokens, relying on reading the KV cache of previously-processed tokens, and adding the data of the new of only new tokens to the KV cache. This auto-regressive stage includes the model generating a response to the input from the prompt stage). Regarding claim 8, Menashof, in view of Greasley teaches the method of claim 1, Menashof discloses further comprising: receiving third data representative of one or more poses associated with the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. 
In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model); and generating, based at least on the one or more poses, the one or more images to represent the one or more shapes from the perspective of a canonical viewpoint (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. 
For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Regarding claim 9, Menashof, in view of Greasley teaches the method of claim 8, Menashof discloses further comprising three-dimensional information associated with a first portion of the one or more shapes and two-dimensional information associated with a second portion of the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model), and wherein the generating the one or more images comprises: generating one or more first images based at least on the one or more poses, the three-dimensional information, and one or more first camera parameters, the one or more first images depicting the first portion of the one or more shapes from the canonical viewpoint (Fig. 2; Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). 
In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraph [0121]: I/O components 720 can provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs are transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on the computing device 700. In one example, the computing device 700 is equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, red-green-blue (RGB) camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 700 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of the computing device 700 to render immersive augmented reality or virtual reality); determining, based at least on the one or more first camera parameters, one or more second parameters for rendering the one or more shapes (Paragraph [0048]: the descriptors 144 include data that can be provided to the decoder 128 or component thereof (e.g., a generative machine learning model 130) to generate or otherwise reconstruct the video 126B. In one example, the descriptors 144 are generated based at least in part on the video 126A and/or pivot images 142. In an embodiment, an LLM or other machine learning model (e.g., GPT, bidirectional encoder representations from transformers [BERT], CNN, etc.) takes, as an input, the video 126A and/or pivot images 142 and outputs a natural language description that is used by the encoder 124 to generate the descriptors 144. 
For example, the encoder 124, after extracting a particular pivot image, provides the particular pivot image as an input to the LLM, and the LLM then generates a prompt (e.g., natural language description) that guides a generative machine learning model of the decoder 128 to an image. In other example, the video 126A and the pivot images 142 are provided to the LLM and the LLM generates the descriptors 144 based at least in part on frames of the video 126A between the pivot images 142. In a specific example, the video 126A includes a background with two trees and a cat running between the trees, the pivot images 142 include a first image of the cat at the first tree and a second image of the cat at the second tree. Continuing this specific example, the descriptors 144 include a description of the conceptual elements of the video 126A (e.g., the cat running from the first tree to the second tree). In various embodiments, the descriptors 144 are modified or otherwise used to generate a set of prompts for the generative model 130; Paragraphs [0076]-[0077]: the system implementing the method 400 compares the number of objects detected in a previous frame to the number of objects detected in the video frame. Furthermore, in another example, the system implementing the method 400 detects other changes of state, such as position, size, shape, orientation, or other aspects of an object that can convey conceptual information in a video. If no change of state is detected, the system implementing the method 400 continues to block 408. At block 408, the system implementing the method 400 determines whether additional video frames are present in the video. If there are additional video frames (e.g., the video has not terminated), the system implementing the method 400 continues the method 400 at block 402 and obtains the next video frame. If no additional video frames are included in the video, the method 400 continues to block 412 described below…block 406 above, if a change of state is detected, the system implementing the method 400 continues to block 410 and selects the video frame as a pivot image. For example, the video frame is extracted and stored as a pivot image. At block 412, the system implementing the method 400 generates descriptors. As described above, in various embodiments, the descriptors include textual data such as natural language descriptions associated with the pivot images and/or video. For example, the pivot images are provided to an LLM that generates natural language descriptions of the pivot images and/or conceptual elements connecting or otherwise associated with the pivot images. At block 414, the system implementing the method 400 provides the pivot images and descriptors. For example, the pivot images and descriptors are transmitted to a decoder executed by a user device); and generating, based at least on the two-dimensional information and the one or more second camera parameters, one or more second images that depict the second portion of the one or more shapes from the canonical viewpoint (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. 
In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Regarding claim 10, Menashof discloses a system comprising: one or more processors (Fig. 7) to: generate, using one or more language models and based at least on first data associated with one or more images of one or more objects (Paragraph [0005]: the encoder includes a large language model (LLM) or other natural language model to generate descriptors of the pivot images. For example, the descriptors can include natural language descriptions of the objects, backgrounds, interactions, and concepts included in the pivot images): one or more first captions associated with the one or more shapes (Paragraph [0024]: the encoder extracts every tenth frame of the video and causes a first generative machine model to generate a descriptor for the extracted frames (e.g., a natural language description of the soccer ball and the location). In this example, the extracted frames are selected based on an interval of time (e.g., every tenth frame), although, as described in greater detail below, other algorithms can be used for selecting frames of the video to be extracted); and one or more second captions associated with the one or more shapes (Paragraph [0074]: neural network 202 generates one or more text descriptions 204 of a scene (e.g., sentences). In at least one embodiment, neural network 202 generates one or more text descriptions 204 of one or more scenes creating description set; Paragraph [0128]: generated back prompts 303a-303n are text that includes a word, phrase, or sentence that describes a background. In at least one embodiment, for example, neural network 301 generates the background prompts 303a-303n “in snow, in forest, . . . in autumn.” Neural network 301 combines the subjects of input 301 and the one or more background prompts 303a-303n into a output vector space (e.g., a subject-background prompt vector space). 
When combing the subjects with the background prompts, neural network 302 learns the connection between the text describing subject of the subject set 203 and the one or more text descriptions of the one or more background prompts 303a-303n); and generate second data that associates the one or more images with the one or more first captions and the one or more second captions (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Menashof does not explicitly disclose the one or more first captions being associated with one or more first lengths; or the one or more second captions being associated with one or more second lengths that is different than the one or more first lengths. However, Greasley teaches generating a text description for images (Paragraphs [0038]-[0039]), further comprising: the one or more first captions being associated with one or more first lengths (Paragraph [0038]: if the user 106 indicates a preference for a short listening experience, the electronic device 102 and/or the controller 104 may generate a terse text description, e.g., with a low word count, for example, less than a threshold number of words. On the other hand, if the user 106 indicates a preference for a longer listening experience, the electronic device 102 and/or the controller 104 may generate a more verbose text description, e.g., with a high word count, for example, greater than a threshold number of words); and the one or more second captions being associated with one or more second lengths that is different than the one or more first lengths (Paragraph [0039]: if the user 106 is moving at a low velocity or is stationary, the electronic device 102 and/or the controller 104 may generate a text description with a high word count (e.g., greater than a threshold number of words). 
If the user 106 is moving at a high velocity, the electronic device 102 and/or the controller 104 may generate a text description with a low word count (e.g., less than a threshold number of words) or may increase the speed (e.g., words per minute) of the narration to keep pace with the movement of the user 106; Paragraph [0053]: length of the text description may be determined based on the velocity of the user. For example, if the user is stationary or moving at a low velocity, the audio output generator 224 may generate a verbose text description with a high word count (e.g., greater than a threshold number of words). If the user is moving at a high velocity, the audio output generator 224 may generate a terse text description with a low word count (e.g., less than a threshold number of words). In some implementations, the audio output generator 224 selects a depth to which the ontology 226 is traversed). Greasley teaches that this will allow for user preference of descriptions to be realized (Paragraph [0038]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Menashof with the features of above as taught by Greasley so as to allow for user preference of descriptions to be realized as presented by Greasley. Regarding claim 11, Menashof, in view of Greasley teaches the system of claim 10, Menashof discloses wherein the one or more processors are further to: generate one or more first embeddings associated with the one or more first captions and one or more second embeddings associated with the one or more second captions, wherein the second data associates the one or more shapes with the one or more first embeddings and the one or more second embeddings (Paragraph [0102]: each word or character in the input(s) 601 is mapped into the input embedding 602 in parallel or at the same time, unlike existing long short-term memory (LSTM) models, for example. The input embedding 602 maps a word to a feature vector representing the word. But the same word (for example, “apple”) in different sentences may have different meanings (for example, phone versus fruit). This is why a positional encoder 604 can be implemented. A positional encoder 604 is a vector that gives context to words (for example, “apple”) based on a position of a word in a sentence. For example, with respect to a message “I just sent the document,” because “I” is at the beginning of a sentence, embodiments can indicate a position in an embedding closer to “just,” as opposed to “document.” Some embodiments use a sine/cosine function to generate the positional encoder vector using the following two example equations; Paragraph [0110]: the initial embedding (for example, the input embedding 602) is constructed from three vectors: the token embeddings, the segment or context-question embeddings, and the position embeddings. In some embodiments, the following functionality occurs in the pre-training phase. The token embeddings are the pre-trained embeddings. The segment embeddings are the sentence numbers (that includes the input [s] 601) that is encoded into a vector (for example, first sentence, second sentence, and so forth, assuming a top-down and right-to-left approach). The position embeddings are vectors that represent the position of a particular word in such a sentence that can be produced by positional encoder 604. 
When these three embeddings are added or concatenated together, an embedding vector is generated that is used as input into the encoder/decoder block(s) 606. The segment and position embeddings are used for temporal ordering since all of the vectors are fed into the encoder/decoder block(s) 606 simultaneously, and language models need some sort of order preserved). Regarding claim 12, Menashof, in view of Greasley teaches the system of claim 10, wherein the one or more processors are further to: generate one or more embeddings associated with one or more combinations of the one or more first captions and the one or more second captions, wherein the second data associates the one or more shapes with the one or more embeddings (Paragraphs [0102]-[0103]: each word or character in the input(s) 601 is mapped into the input embedding 602 in parallel or at the same time, unlike existing long short-term memory (LSTM) models, for example. The input embedding 602 maps a word to a feature vector representing the word. But the same word (for example, “apple”) in different sentences may have different meanings (for example, phone versus fruit). This is why a positional encoder 604 can be implemented. A positional encoder 604 is a vector that gives context to words (for example, “apple”) based on a position of a word in a sentence. For example, with respect to a message “I just sent the document,” because “I” is at the beginning of a sentence, embodiments can indicate a position in an embedding closer to “just,” as opposed to “document.” Some embodiments use a sine/cosine function to generate the positional encoder vector using the following two example equations… passing the input(s) 601 through the input embedding 602 and applying the positional encoder 604, the output is a word embedding feature vector, which encodes positional information or context based on the positional encoder 604. These word embedding feature vectors are then passed to the encoder and/or decoder block(s) 606, where it goes through a multi-head attention layer 606-1 and a feedforward layer 606-2. The multi-head attention layer 606-1 is generally responsible for focusing or processing certain parts of the feature vectors representing specific portions of the input(s) 601 by generating attention vectors; Paragraph [0110]: the initial embedding (for example, the input embedding 602) is constructed from three vectors: the token embeddings, the segment or context-question embeddings, and the position embeddings. In some embodiments, the following functionality occurs in the pre-training phase. The token embeddings are the pre-trained embeddings. The segment embeddings are the sentence numbers (that includes the input [s] 601) that is encoded into a vector (for example, first sentence, second sentence, and so forth, assuming a top-down and right-to-left approach). The position embeddings are vectors that represent the position of a particular word in such a sentence that can be produced by positional encoder 604. When these three embeddings are added or concatenated together, an embedding vector is generated that is used as input into the encoder/decoder block(s) 606. The segment and position embeddings are used for temporal ordering since all of the vectors are fed into the encoder/decoder block(s) 606 simultaneously, and language models need some sort of order preserved). 
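
For context on the positional-encoding passages quoted above (Paragraphs [0102]-[0103] of Menashof): the Office Action refers to "two example equations" for the sine/cosine positional encoder but does not reproduce them. The standard sinusoidal formulation, which the quoted description appears to track, is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Here pos is the token position and i indexes the embedding dimension; this is the conventional transformer formulation, offered only as an assumption about what the elided equations likely resemble, not as a quotation from the reference.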
Regarding claim 13, Menashof, in view of Greasley teaches the system of claim 10, Menashof discloses wherein the generation of the one or more first captions and the one or more second captions is further based at least on the one or more machine learning models processing third data representative of one or more descriptions associated with the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model). Regarding claim 14, Menashof, in view of Greasley teaches the system of claim 10, Menashof discloses wherein the generation of the one or more first captions and the one or more second captions is further based at least on the one or more machine learning models processing third data representative of an output format, the output format associated with generating both the one or more first captions and the one or more second captions for the one or more shapes (Paragraph [0020]: compression techniques rely on eliminating or reducing redundant data and are limited by various constraints including video and audio fidelity. In addition, the focus on maintaining fidelity limits the effectiveness of compression techniques. 
In addition, video compression is necessary in order to share or otherwise transmit videos on the Internet because compression reduces the amount of data that is needed to stream or send the video to the viewer, and network bandwidth is a limited resource. One way to address this issue is by using a video coding format (e.g., a video compression format), which is a content representation format for storage or transmission of digital video content (e.g., in a data file or bitstream). Typically these formats use a video compression algorithm, such as discrete cosine transform (DCT) coding or motion compensation; Paragraph [0112]: after pre-training is performed, the encoder/decoder block(s) 606 performs prompt engineering or fine-tuning on a variety of QA data sets by converting different QA formats into a unified sequence-to-sequence format. For example, some embodiments perform the QA task by adding a new question-answering head or encoder/decoder block, just the way a masked language model head is added (in pre-training) for performing an MLM task, except that the task is a part of prompt engineering or fine-tuning. This includes the encoder/decoder block(s) 606 processing the inputs (e.g., pivot images 142 and/or descriptors 144 of FIG. 1) in order to make the predictions and generate a prompt response, as indicated in 604. Prompt engineering, in some embodiments, is the process of crafting and optimizing text prompts for language models to achieve desired outputs. In other words, prompt engineering comprises a process of mapping prompts (for example, a question) to the output (for example, an answer) that it belongs to for training. For example, if a user asks a model to generate a poem about a person fishing on a lake, the expectation is it will generate a different poem each time. Users may then label the output or answers from best to worst. Such labels are an input to the model to make sure the model is giving a more human-like or best answers, while trying to minimize the worst answers (for example, via reinforcement learning). In some embodiments, a “prompt” as described herein includes one or more of: a request (for example, a question or instruction [for example, “write a poem”]), target content, and one or more examples, as described herein). Regarding claim 15, Menashof, in view of Greasley teaches the system of claim 10, Greasley discloses wherein: the one or more first lengths are associated with one or more first numbers of words that are less than a threshold number of words (Paragraph [0038]: if the user 106 indicates a preference for a short listening experience, the electronic device 102 and/or the controller 104 may generate a terse text description, e.g., with a low word count, for example, less than a threshold number of words. On the other hand, if the user 106 indicates a preference for a longer listening experience, the electronic device 102 and/or the controller 104 may generate a more verbose text description, e.g., with a high word count, for example, greater than a threshold number of words); and the one or more second lengths are associated with one or more second numbers of words that are equal to or greater than the threshold number of words (Paragraph [0039]: if the user 106 is moving at a low velocity or is stationary, the electronic device 102 and/or the controller 104 may generate a text description with a high word count (e.g., greater than a threshold number of words). 
If the user 106 is moving at a high velocity, the electronic device 102 and/or the controller 104 may generate a text description with a low word count (e.g., less than a threshold number of words) or may increase the speed (e.g., words per minute) of the narration to keep pace with the movement of the user 106; Paragraph [0053]: length of the text description may be determined based on the velocity of the user. For example, if the user is stationary or moving at a low velocity, the audio output generator 224 may generate a verbose text description with a high word count (e.g., greater than a threshold number of words). If the user is moving at a high velocity, the audio output generator 224 may generate a terse text description with a low word count (e.g., less than a threshold number of words). In some implementations, the audio output generator 224 selects a depth to which the ontology 226 is traversed). Regarding claim 16, Menashof, in view of Greasley teaches the system of claim 10, Menashof discloses wherein the one or more processors are further to: receive third data representative of one or more poses associated with the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. 
For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model); and generate, based at least on the one or more poses, the one or more images to represent the one or more shapes from a canonical viewpoint (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Regarding claim 17, Menashof, in view of Greasley teaches the system of claim 16, Menashof discloses wherein the one or more processors are further to obtain three-dimensional information associated with a first portion of the one or more shapes and two-dimensional information associated with a second portion of the one or more shapes (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. 
In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraphs [0086]-[0087]: the operations further comprise causing, at the decoder, a third machine learning model to generate a reconstructed video by at least providing as a first input to the third machine learning model the pivot image and the descriptor, where the third machine learning model uses the pivot image and at least a portion of the descriptor to output a second plurality of images that are combined to generate the reconstructed video…the first machine learning model comprises a large language model, the second machine learning model comprises a neural network, and the third machine learning model comprises a diffusion model), and wherein the generation of the one or more images comprises: generating one or more first images based at least on the one or more poses, the three-dimensional information, and one or more first camera parameters, the one or more first images depicting the first portion of the one or more shapes from the canonical viewpoint (Fig. 2; Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. 
In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set; Paragraph [0121]: I/O components 720 can provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs are transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on the computing device 700. In one example, the computing device 700 is equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, red-green-blue (RGB) camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 700 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of the computing device 700 to render immersive augmented reality or virtual reality); determining, based at least on the one or more first camera parameters, one or more second parameters for rendering the one or more shapes (Paragraph [0048]: the descriptors 144 include data that can be provided to the decoder 128 or component thereof (e.g., a generative machine learning model 130) to generate or otherwise reconstruct the video 126B. In one example, the descriptors 144 are generated based at least in part on the video 126A and/or pivot images 142. In an embodiment, an LLM or other machine learning model (e.g., GPT, bidirectional encoder representations from transformers [BERT], CNN, etc.) takes, as an input, the video 126A and/or pivot images 142 and outputs a natural language description that is used by the encoder 124 to generate the descriptors 144. For example, the encoder 124, after extracting a particular pivot image, provides the particular pivot image as an input to the LLM, and the LLM then generates a prompt (e.g., natural language description) that guides a generative machine learning model of the decoder 128 to an image. In other example, the video 126A and the pivot images 142 are provided to the LLM and the LLM generates the descriptors 144 based at least in part on frames of the video 126A between the pivot images 142. In a specific example, the video 126A includes a background with two trees and a cat running between the trees, the pivot images 142 include a first image of the cat at the first tree and a second image of the cat at the second tree. Continuing this specific example, the descriptors 144 include a description of the conceptual elements of the video 126A (e.g., the cat running from the first tree to the second tree). 
In various embodiments, the descriptors 144 are modified or otherwise used to generate a set of prompts for the generative model 130; Paragraphs [0076]-[0077]: the system implementing the method 400 compares the number of objects detected in a previous frame to the number of objects detected in the video frame. Furthermore, in another example, the system implementing the method 400 detects other changes of state, such as position, size, shape, orientation, or other aspects of an object that can convey conceptual information in a video. If no change of state is detected, the system implementing the method 400 continues to block 408. At block 408, the system implementing the method 400 determines whether additional video frames are present in the video. If there are additional video frames (e.g., the video has not terminated), the system implementing the method 400 continues the method 400 at block 402 and obtains the next video frame. If no additional video frames are included in the video, the method 400 continues to block 412 described below…block 406 above, if a change of state is detected, the system implementing the method 400 continues to block 410 and selects the video frame as a pivot image. For example, the video frame is extracted and stored as a pivot image. At block 412, the system implementing the method 400 generates descriptors. As described above, in various embodiments, the descriptors include textual data such as natural language descriptions associated with the pivot images and/or video. For example, the pivot images are provided to an LLM that generates natural language descriptions of the pivot images and/or conceptual elements connecting or otherwise associated with the pivot images. At block 414, the system implementing the method 400 provides the pivot images and descriptors. For example, the pivot images and descriptors are transmitted to a decoder executed by a user device); and generating, based at least on the two-dimensional information and the one or more second camera parameters, one or more second images that represent the one or more shapes from the canonical viewpoint (Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. 
For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Regarding claim 18, Menashof, in view of Greasley teaches the system of claim 10, Menashof discloses wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs) (Paragraph [0005]: the encoder includes a large language model (LLM) or other natural language model to generate descriptors of the pivot images; a system for performing operations using one or more visual language models (VLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources. Regarding claim 19, Menashof discloses one or more processors (Fig. 7) comprising: processing circuitry to associate one or more shapes with one or more first captions and one or more second captions (Paragraph [0024]: the encoder extracts every tenth frame of the video and causes a first generative machine model to generate a descriptor for the extracted frames (e.g., a natural language description of the soccer ball and the location). In this example, the extracted frames are selected based on an interval of time (e.g., every tenth frame), although, as described in greater detail below, other algorithms can be used for selecting frames of the video to be extracted; Paragraph [0074]: neural network 202 generates one or more text descriptions 204 of a scene (e.g., sentences). In at least one embodiment, neural network 202 generates one or more text descriptions 204 of one or more scenes creating description set; Paragraph [0128]: generated back prompts 303a-303n are text that includes a word, phrase, or sentence that describes a background. In at least one embodiment, for example, neural network 301 generates the background prompts 303a-303n “in snow, in forest, . . . in autumn.” Neural network 301 combines the subjects of input 301 and the one or more background prompts 303a-303n into a output vector space (e.g., a subject-background prompt vector space). 
When combing the subjects with the background prompts, neural network 302 learns the connection between the text describing subject of the subject set 203 and the one or more text descriptions of the one or more background prompts 303a-303n), wherein the one or more first captions and the one or more second captions are determined based at least on one or more language models processing data associated with one or more images depicting the one or more shapes (Paragraph [0005]: the encoder includes a large language model (LLM) or other natural language model to generate descriptors of the pivot images. For example, the descriptors can include natural language descriptions of the objects, backgrounds, interactions, and concepts included in the pivot images; Paragraph [0024]: the encoder extracts every tenth frame of the video and causes a first generative machine model to generate a descriptor for the extracted frames (e.g., a natural language description of the soccer ball and the location). In this example, the extracted frames are selected based on an interval of time (e.g., every tenth frame), although, as described in greater detail below, other algorithms can be used for selecting frames of the video to be extracted; Paragraph [0074]: neural network 202 generates one or more text descriptions 204 of a scene (e.g., sentences). In at least one embodiment, neural network 202 generates one or more text descriptions 204 of one or more scenes creating description set; Paragraph [0128]: generated back prompts 303a-303n are text that includes a word, phrase, or sentence that describes a background. In at least one embodiment, for example, neural network 301 generates the background prompts 303a-303n “in snow, in forest, . . . in autumn.” Neural network 301 combines the subjects of input 301 and the one or more background prompts 303a-303n into a output vector space (e.g., a subject-background prompt vector space). When combing the subjects with the background prompts, neural network 302 learns the connection between the text describing subject of the subject set 203 and the one or more text descriptions of the one or more background prompts 303a-303n; Paragraph [0075]: neural network 205 generates images for an image set 206 and connects a subject from each text description data set to said images forming text image pairs (e.g., to generate training data). In at least one embodiment, neural network 205 receives outputs such as text descriptions 204 and generates an image set 205. In at least one embodiment, neural network 205 includes a stable diffusion model (e.g., SDXL), generative artificial intelligence model, diffusion neural network, or text to image generating neural network model. In at least one embodiment, neural network 205 receives as input subject text description data set and generates a collage or set of images 206 corresponding to one or more subjects of the subject description data set (e.g., based on an input prompt to generate four images corresponding to text inputs). In at least one embodiment, for example, second neural network 206 parses each subject of a subject set …from the subject text description data set and generates a collage or set of images 206 for image set…each of the images that make up the image set 206 are images of the input 201 subject in different poses. In at least one embodiment, image set 206 includes one or more images of each subject in different poses and may also include the differently posed subjects over one or more different backgrounds. 
For example, for the subject “fox” in the subject text description data set, neural network 205 generates a fox having different poses and different backgrounds making up the image set 203…for the subject “dog.” In at least one embodiment, this is repeated for each subject in the subject set and for each text description of the description set). Menashof does not explicitly disclose one or more first captions that are associated with a first length and one or more second captions that are associated with a second length. However, Greasley teaches generating a text description for images (Paragraphs [0038]-[0039]), further comprising: one or more first captions that are associated with a first length and one or more second captions that are associated with a second length (Paragraph [0038]: if the user 106 indicates a preference for a short listening experience, the electronic device 102 and/or the controller 104 may generate a terse text description, e.g., with a low word count, for example, less than a threshold number of words. On the other hand, if the user 106 indicates a preference for a longer listening experience, the electronic device 102 and/or the controller 104 may generate a more verbose text description, e.g., with a high word count, for example, greater than a threshold number of words; Paragraph [0039]: if the user 106 is moving at a low velocity or is stationary, the electronic device 102 and/or the controller 104 may generate a text description with a high word count (e.g., greater than a threshold number of words). If the user 106 is moving at a high velocity, the electronic device 102 and/or the controller 104 may generate a text description with a low word count (e.g., less than a threshold number of words) or may increase the speed (e.g., words per minute) of the narration to keep pace with the movement of the user 106; Paragraph [0053]: length of the text description may be determined based on the velocity of the user. For example, if the user is stationary or moving at a low velocity, the audio output generator 224 may generate a verbose text description with a high word count (e.g., greater than a threshold number of words). If the user is moving at a high velocity, the audio output generator 224 may generate a terse text description with a low word count (e.g., less than a threshold number of words). In some implementations, the audio output generator 224 selects a depth to which the ontology 226 is traversed). Greasley teaches that this will allow for user preference of descriptions to be realized (Paragraph [0038]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Menashof with the features of above as taught by Greasley so as to allow for user preference of descriptions to be realized as presented by Greasley. 
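For readers less familiar with the dual-caption limitation the examiner is mapping here, the following is a minimal sketch of the Menashof-plus-Greasley combination as the rejection characterizes it: one captioning pass produces a terse "first" caption below a word-count threshold and a second pass produces a verbose "second" caption at or above it. The `describe_shape_image` helper and the threshold value are hypothetical placeholders, not anything disclosed in the application or the cited references.

```python
# Illustrative sketch only -- not the applicant's method or the cited references' code.
# It approximates the claim 15 / claim 19 mapping discussed above: a terse "first"
# caption under a word-count threshold and a verbose "second" caption at or above it.
# `describe_shape_image` stands in for whatever language model produces the text.

from dataclasses import dataclass

WORD_THRESHOLD = 20  # assumed value; the claims only require *some* threshold


@dataclass
class CaptionPair:
    short_caption: str   # "first caption" -- fewer than WORD_THRESHOLD words
    long_caption: str    # "second caption" -- WORD_THRESHOLD words or more


def describe_shape_image(image_id: str, verbose: bool) -> str:
    """Hypothetical stand-in for an LLM/VLM that captions a rendered shape."""
    if verbose:
        return (
            "A rendered three-dimensional chair with four cylindrical legs, "
            "a slatted wooden backrest, and a flat square seat, viewed from a "
            "canonical three-quarter angle under neutral studio lighting."
        )
    return "A wooden chair with four legs and a slatted back."


def make_caption_pair(image_id: str) -> CaptionPair:
    short = describe_shape_image(image_id, verbose=False)
    long = describe_shape_image(image_id, verbose=True)
    # Enforce the length relationship the claims recite: the first caption stays
    # under the threshold, the second meets or exceeds it.
    assert len(short.split()) < WORD_THRESHOLD <= len(long.split())
    return CaptionPair(short_caption=short, long_caption=long)


if __name__ == "__main__":
    pair = make_caption_pair("shape_0001")
    print(len(pair.short_caption.split()), "words |", pair.short_caption)
    print(len(pair.long_caption.split()), "words |", pair.long_caption)
```

The sketch says nothing about patentability; it is only meant to make the recited length relationship between the two captions concrete.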
Regarding claim 20, Menashof, in view of Greasley teaches the one or more processors of claim 19, Menashof discloses wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs) (Paragraph [0005]: the encoder includes a large language model (LLM) or other natural language model to generate descriptors of the pivot images); a system for performing operations using one or more visual language models (VLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI whose telephone number is (571)270-5748. The examiner can normally be reached M-F: 7:30-4:00PT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU can be reached at (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /MATTHEW SALVUCCI/Primary Examiner, Art Unit 2613
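As context for the "pivot image" citations that recur throughout the §103 analysis (for example, the method 400 flow quoted against claim 17), here is a rough sketch, under assumed data structures, of the change-of-state frame selection Menashof is cited for: frames are retained as pivots when the detected scene state changes, and descriptors are then generated for those pivots. The `changed` predicate and the `describe_pivots` helper are illustrative stand-ins, not Menashof's actual implementation.

```python
# Illustrative sketch only -- it paraphrases the "method 400" flow quoted from
# Menashof (blocks 402-414): walk the video frame by frame, keep a frame as a
# "pivot image" when a change of state is detected, then generate descriptors.

from typing import Callable, Iterable, List, Optional, Tuple


def select_pivot_frames(
    frames: Iterable[Tuple[int, dict]],
    changed: Callable[[dict, dict], bool],
) -> List[Tuple[int, dict]]:
    """Return (index, frame_state) pairs where a change of state is detected."""
    pivots: List[Tuple[int, dict]] = []
    previous: Optional[dict] = None
    for index, state in frames:
        # Block 406: compare the current frame's detected state against the
        # previous frame; block 410: keep the frame as a pivot image on change.
        if previous is None or changed(previous, state):
            pivots.append((index, state))
        previous = state
    return pivots


def describe_pivots(pivots: List[Tuple[int, dict]]) -> List[str]:
    """Hypothetical stand-in for the LLM that writes descriptors (block 412)."""
    return [f"frame {i}: {state['objects']}" for i, state in pivots]


if __name__ == "__main__":
    # Toy data mirroring the cited example of a cat moving between two trees.
    toy_video = [
        (0, {"objects": ["cat at first tree"]}),
        (1, {"objects": ["cat at first tree"]}),
        (2, {"objects": ["cat between trees"]}),
        (3, {"objects": ["cat at second tree"]}),
    ]
    pivots = select_pivot_frames(toy_video, lambda a, b: a["objects"] != b["objects"])
    for line in describe_pivots(pivots):
        print(line)
```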

Prosecution Timeline

Jul 31, 2024
Application Filed
Mar 30, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications with similar technology granted by this same examiner

Patent 12597198
RAY TRACING METHOD AND APPARATUS BASED ON ATTENTION FOR DYNAMIC SCENES
2y 5m to grant • Granted Apr 07, 2026
Patent 12597207
Camera Reprojection for Faces
2y 5m to grant • Granted Apr 07, 2026
Patent 12579753
Phased Capture Assessment and Feedback for Mobile Dimensioning
2y 5m to grant • Granted Mar 17, 2026
Patent 12561899
Vector Graphic Parsing and Transformation Engine
2y 5m to grant • Granted Feb 24, 2026
Patent 12548256
IMAGE PROCESSING APPARATUS FOR GENERATING SURFACE PROFILE OF THREE-DIMENSIONAL GEOMETRIC MODEL, CONTROL METHOD THEREFOR, AND STORAGE MEDIUM
2y 5m to grant • Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
72%
Grant Probability
99%
With Interview (+28.5%)
2y 12m
Median Time to Grant
Low
PTA Risk
Based on 485 resolved cases by this examiner. Grant probability derived from career allow rate.
