DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 7 is objected to because of the following informalities: the claim recites “at least on remaining layer,” which appears to be a typographical error for “at least one remaining layer.” Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 4-5, 8 and 9 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding claim 4, the claim recites “adjusting the portions according to a parameter,” where “the portions” finds antecedent basis in “portions of the subject masks are removed when the subject masks are combined.” The claim does not specify when such “adjusting the portions” occurs, nor is the timing implicitly evident. Thus “adjusting the portions,” when given its broadest reasonable interpretation, could be read to mean that portions that have already been removed in the previous step are adjusted according to a parameter, such that further operations are performed on removed portions. Another valid reading, however, is that the portions that are to be removed in the previous step are adjusted, which would also be adjusting the portions according to a parameter but would refer to adjusting functionally different data. Here, “the language of the claim is such that a person of ordinary skill in the art could not interpret the metes and bounds of the claim so as to understand how to avoid infringement” and “a claim is indefinite when the boundaries of the protected subject matter are not clearly delineated and the scope is unclear” (see MPEP 2173.02(II)). The boundaries of the protected subject matter are not clearly delineated and the scope is unclear because it cannot be determined which of the functionally different claim interpretations defines the scope of the claimed subject matter. In the interest of compact prosecution, the Examiner will interpret the claim in line with the second reading above, i.e., as “adjusting the portions to be removed when the subject masks are combined, according to a parameter.”
Regarding claim 5, the claim recites “wherein the two or more text descriptions” in line 1. There is insufficient antecedent basis for this limitation in the claim: claim 1 previously recites “a text description” singularly, and “two or more prompts,” which are not “two or more text descriptions.” Thus it is unclear whether a new element of text descriptions is meant to be introduced or whether this should refer to the “two or more prompts.” In the interest of compact prosecution, the Examiner will interpret the claim as if it recites “for the two or more prompts,” which could include multiple text descriptions but is not required to.
Claim 8 recites the limitation “the subject masks for the two or more text descriptions” in line 4. There is insufficient antecedent basis for this limitation in the claim: claim 1 previously recites “a text description” singularly, and “two or more prompts,” which are not “two or more text descriptions.” Thus it is unclear whether a new element of text descriptions is meant to be introduced or whether this should refer to the “two or more prompts.” In the interest of compact prosecution, the Examiner will interpret the claim as if it recites “for the two or more prompts,” which could include multiple text descriptions but is not required to.
Additionally, claim 8 recites the limitation “compute the cross-image aligned data” in lines 4-5. There is insufficient antecedent basis for this limitation in the claim. There is no antecedent basis for “cross-image aligned data,” making it unclear what such data refers to: it could refer to an unclaimed element, could correspond to a subset of the cross-image consistency data, or could be equivalent to the cross-image consistency data. In the interest of compact prosecution, the Examiner will interpret the limitation as if it recites “to compute the cross-image consistency data.”
Additionally, claim 8 recites the limitation “the intermediate data produced for each pair of the two or more prompts.” There is insufficient antecedent basis for this limitation in the claim. The claim previously recites only producing intermediate data generally, not that there is specific intermediate data produced by any step which is “produced for each pair of the two or more prompts.” Thus it is again unclear what is being referred to by such data for each pair, as it could refer to a separate process that generates such intermediate data for pairs of prompts or could refer more generally to any data produced for any combination of multiple prompts. In the interest of compact prosecution, the Examiner will interpret the claim as if it recites “the intermediate data produced for the two or more prompts.”
Claim 9, a broader version of claim 8, similarly recites the limitation “the intermediate data produced for each pair of the two or more prompts” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. The claim previously recites only producing intermediate data generally, not that there is specific intermediate data produced by any step which is “produced for each pair of the two or more prompts.” Thus it is again unclear what is being referred to by such data for each pair, as it could refer to a separate process that generates such intermediate data for pairs of prompts or could refer more generally to any data produced for any combination of multiple prompts. In the interest of compact prosecution, the Examiner will interpret the claim as if it recites “the intermediate data produced for the two or more prompts.”
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1-4, 7-14, 17, and 19-22 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Khachatryan et al. (“Khachatryan”).
Regarding claim 1, Khachatryan teaches a computer-implemented method, comprising (see Khachatryan, the sections cited for each limitation below):
receiving a text description related to a subject with two or more prompts describing scenes for generation of two or more images depicting the subject (note that here a “text description related to a subject” could be considered an element received, and this could be received “with two or more prompts describing scenes,” such that a text description and two or more prompts functioning as recited are required and such receiving of these is for generation of two or more images depicting the subject; this would include a text description related to a subject that functions to provide two or more prompts describing scenes for generation of images depicting the subject (a prompt being any type of input that causes or prompts a response from any other element of the system) and could also include a text description and some separate functional two or more prompts; see Khachatryan, Section 1, teaching addressing the “problem of zero-shot, ‘training free’ text-to-video synthesis, which is the task of generating videos from textual prompts without requiring any optimization or fine-tuning,” where the “approach is to modify a pre-trained text-to-image model (e.g., Stable Diffusion), enriching it with temporally consistent generation,” and where, as further explained in section 3.2, this text-to-video (T2V) synthesis is “given a text description τ and a positive integer m ∈ N, the goal is to design a function F that outputs video frames V ∈ R^(m×H×W×3) (for predefined resolution H × W) that exhibit temporal consistency,” such that this generates multiple video frame images corresponding to “m,” where the text description related to the subject is “τ” and the text description functions as two or more prompts describing scenes depicting the subject, such as for example in figure 2 with the example text description and two or more prompts such as “A horse is galloping on the street,” where “horse” and/or “the street” are subjects to be depicted in the images and comprise two or more prompts describing scenes for generation of the two or more images of the video depicting the subject);
generating intermediate data associated with the two or more images by processing the text description with two or more prompts by at least one layer of a neural network according to pre-trained weights (see Khachatryan, sections 3-3.1 teaching use of the pre-trained Stable Diffusion model, which models the noise with a UNet-like neural network composed of convolutional and attention blocks, where as in section 3.3.2, to “leverage the power of cross-frame attention and at the same time exploit a pretrained SD without retraining, we replace each of its self-attention layers with a cross-frame attention, with the attention for each frame being on the first frame,” where “in the original SD UNet architecture … each self-attention layer takes a feature map…linearly projects it into query, key, value features Q, K, V ∈ R^(h×w×c), and computes the layer output,” which is adapted instead to “replace each self-attention layer with a cross-frame attention of each frame on the first frame,” where “each attention layer receives m inputs” and “the linear projection layers produce m queries, keys, and values Q^(1:m), K^(1:m), V^(1:m), respectively,” such that “using cross frame attention, the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames,” such that here intermediate data associated with the two or more images corresponds to any data generated by any layer of the pre-trained UNet of the model and could correspond to the feature map produced by an attention or convolutional layer of the SD UNet architecture, to the linearly projected versions of these maps into the Q, K, V features, or to the latent codes processed and generated in sequence);
computing cross-image consistency data specific to the subject using the intermediate data (note that “cross-image consistency data specific to the subject” is any data, metric, mathematical value, or data structure or organization that reflects a relationship, similarity, or constraint between the generation paths of the two images, and such computing is any function that derives comparative or shared data regarding the subject across multiple images, where using the intermediate data to compute such data includes any direct or indirect use of such intermediate data and does not exclude other inputs; see Khachatryan, teaching as in section 3.3.2, to “leverage the power of cross-frame attention and at the same time exploit a pretrained SD without retraining, we replace each of its self-attention layers with a cross-frame attention, with the attention for each frame being on the first frame,” where “in the original SD UNet architecture … each self-attention layer takes a feature map…linearly projects it into query, key, value features Q, K, V ∈ R^(h×w×c), and computes the layer output,” which is adapted instead to “replace each self-attention layer with a cross-frame attention of each frame on the first frame,” where “each attention layer receives m inputs” and “the linear projection layers produce m queries, keys, and values Q^(1:m), K^(1:m), V^(1:m), respectively,” such that “using cross frame attention, the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames,” where these “queries, keys, and values Q^(1:m), K^(1:m), V^(1:m)” are intermediate values as explained above, which are used to compute this cross-image consistency data specific to the subject); and
processing the cross-image consistency data by at least one remaining layer of the neural network according to the pre-trained weights to generate the two or more images (see Khachatryan, figure 2, where it can be seen that the output of the Cross-Frame Attention, which is cross-image consistency data, is passed forward through a Feed-Forward Network (FFN) and additional transformer blocks, which are all remaining layers of the neural network processed according to the pre-trained weights of the SD model being used, where “the latent codes are passed to our modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame k = 1,...,m. By using cross-frame attention, the appearance and the identity of the foreground object are preserved throughout the sequence,” such that here the cross-image consistency data is processed by the remaining layers of the diffusion model and then passed as well to remaining layers of the neural network such as the neural network of the “chosen autoencoder” as introduced in section 3.1).
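For illustration only, and not as a characterization of Khachatryan’s released code, the following is a minimal sketch of the cross-frame attention mechanism as the Examiner reads section 3.3.2: each frame’s queries attend to the first frame’s keys and values. All shapes and names are assumptions.

    import torch

    def cross_frame_attention(q, k, v):
        # q, k, v: (m, tokens, c) -- one query/key/value set per frame, as
        # produced by the linear projection layers of each attention block.
        k0 = k[0:1].expand_as(k)   # every frame attends to the FIRST frame's keys
        v0 = v[0:1].expand_as(v)   # ...and values, carrying appearance/identity over
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k0.transpose(-2, -1) * scale, dim=-1)
        return attn @ v0           # per the Examiner's reading, cross-image consistency data

Under this reading, the output for every frame is computed from the first frame’s projected features, which is why the appearance and identity of the subject are carried over from frame to frame.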
Regarding claim 2, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein computing the cross-image consistency data comprises: computing subject masks localizing the subject in the intermediate data (see Khachatryan, figure 2 teaching “employ salient object detection to obtain for each frame k a mask Mk indicating the foreground pixels” and “for the background (using the mask Mk),” where as in section 3.3.3 it is taught “to obtain a corresponding foreground mask Mk for each frame k” and “Background smoothing is achieved by a convex combination between the actual latent code…and the warped latent code…on the background” as in equation 9, where a foreground mask and a background mask derived from the foreground mask localize the subject in the intermediate data, as the mask is used to inform the local areas where attention should be paid to the subject and foreground mask to influence the cross-image consistency data); and
combining the subject masks to compute the cross-image consistency data (see Khachatryan, figure 2 and section 3.3.3 as explained above where the subject masks are combined as seen in equation 9 so that they influence the cross-image consistency data ensuring attention is paid to the masked regions in the proper amount).
Regarding claim 3, Khachatryan teaches all that is required as applied to claim 2 above and further teaches wherein portions of the subject masks are removed when the subject masks are combined (see Khachatryan, figure 2 and section 3.3.3 as explained in the rejection of claim 2 above where as in equation 9 it can be seen that portions of the foreground mask are removed to determine the background mask when combining the influence of both regions such that these portions removed are portions of the subject masks).
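As an illustration of the mask combination the Examiner reads into equation 9 for claims 2-3 (a hedged sketch; the variable names, the blending weight, and the exact form are assumptions, not Khachatryan’s code), the foreground mask keeps the actual latent code while the background region blends actual and warped latent codes:

    import torch

    def background_smoothing(x_actual, x_warped, mask, alpha=0.6):
        # mask: 1 on foreground (subject) pixels, 0 on background; alpha assumed.
        blended_bg = alpha * x_warped + (1.0 - alpha) * x_actual  # convex combination
        return mask * x_actual + (1.0 - mask) * blended_bg       # combine via masks

The (1.0 - mask) term illustrates the Examiner’s reading for claim 3: the foreground portions of the subject mask are removed to form the background region when the masks are combined.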
Regarding claim 4, as rendered definite as explained above, Khachatryan teaches all that is required as applied to claim 3 above and further teaches adjusting the portions to be removed when the subject masks are combined, according to a parameter (note that the claim does not specify when such adjusting takes place nor what parameter the adjusting is according to; see Khachatryan, figure 2, where a first example of the adjusting of the portions to be removed according to a parameter would be the text prompt parameter defining the subject, which defines the foreground mask and thereby adjusts the portions to be removed, compared to using another text prompt parameter that would create a different foreground mask to be removed from the background mask; and as in section 3.3.3, another example of the adjusting of the portions to be removed according to a parameter would be according to the “salient object detection” model parameters that segment the subjects, such as to “apply (an in-house solution for) salient object detection [40] to the decoded images to obtain a corresponding foreground mask Mk for each frame k”).
Regarding claim 7, Khachatryan teaches all that is required as applied to claim 2 above and further teaches aligning common features within the cross-image consistency data before processing the cross-image consistency data by the at least one remaining layer (see Khachatryan, section 3.3.2 and figure 2 teaching the “specified motion field results for each frame k in a warping function” and “By enhancing the latent codes with motion dynamics, we determine the global scene and camera motion and achieve temporal consistency in the background and the global scene” and “the latent codes are passed to our modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame k = 1,...,m. By using cross-frame attention, the appearance and the identity of the foreground object are preserved throughout the sequence” and “using cross frame attention, the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames,” such that the cross-frame attention mechanism aligns common features within the cross-image consistency data using equation 9).
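A rough sketch of the motion-field warping discussed for claim 7 follows, assuming a PyTorch-style latent and a sampling grid normalized to [-1, 1]; this is illustrative of the warping-function concept only, not the reference’s implementation:

    import torch
    import torch.nn.functional as F

    def warp_latent(latent, flow):
        # latent: (1, c, h, w); flow: (1, h, w, 2), offsets normalized to [-1, 1].
        _, _, h, w = latent.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # identity sampling grid
        # Shift each sampling location by the motion field so latent features
        # for a frame are aligned with the global scene/camera motion.
        return F.grid_sample(latent, base + flow, align_corners=True)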
Regarding claim 8, as rendered definite as explained above, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein computing the cross-image consistency data comprises: computing subject masks localizing the subject in the intermediate data (see Khachatryan, figure 2 teaching “employ salient object detection to obtain for each frame k a mask Mk indicating the foreground pixels” and “for the background (using the mask Mk),” where as in section 3.3.3 it is taught “to obtain a corresponding foreground mask Mk for each frame k” and “Background smoothing is achieved by a convex combination between the actual latent code…and the warped latent code…on the background” as in equation 9, where a foreground mask and a background mask derived from the foreground mask localize the subject in the intermediate data, as the mask is used to inform the local areas where attention should be paid to the subject and foreground mask to influence the cross-image consistency data);
combining the subject masks for the two or more prompts to compute the cross-image consistency data (see Khachatryan, figure 2 and section 3.3.3 as explained above, where the subject masks are combined as seen in equation 9 so that they influence the cross-image consistency data, ensuring attention is paid to the masked regions in the proper amount, and as in section 3.3.2 and figure 2 teaching the “specified motion field results for each frame k in a warping function” and “By enhancing the latent codes with motion dynamics, we determine the global scene and camera motion and achieve temporal consistency in the background and the global scene” and “the latent codes are passed to our modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame k = 1,...,m. By using cross-frame attention, the appearance and the identity of the foreground object are preserved throughout the sequence” and “using cross frame attention, the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames,” such that the cross-frame attention mechanism aligns common features within the cross-image consistency data using equation 9);
extracting a correspondence map between the intermediate data produced for the two or more prompts (see Khachatryan, figure 2 and sections 3.3-3.3.3 teaching intermediate data produced for the two or more prompts, such as the feature maps output by each layer, where a correspondence map is extracted through equation 9, which maps correspondences “using cross frame attention” so that “the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames”);
identifying common features associated with the subject within the cross-image aligned data using the correspondence map (see Khachatryan, figure 2 and sections 3-3.3.3 teaching such identification of common features associated with the subject within the cross-image aligned data, as the softmax function effectively highlights or identifies the most highly correlated features between the frames while suppressing irrelevant ones); and
exchanging the common features in the cross-image aligned data to compute the cross-image consistency data (see Khachatryan, figure 2 and sections 3-3.3.3, where the cross-frame attention module is utilized to exchange the common features in the cross-image aligned data as it enforces the “appearance and structure of the object and background as well as identities” to be “carried over from the first frame to subsequent frames,” such that this carrying over is an exchanging of common features to influence the cross-image consistency data being processed step by step).
Regarding claim 9, as rendered definite as explained above, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein computing the cross-image consistency data comprises: extracting a correspondence map between the intermediate data produced for the two or more prompts (see Khachatryan, figure 2 and sections 3.3-3.3.3 teaching intermediate data produced for the two or more prompts, such as the feature maps output by each layer, where a correspondence map is extracted through equation 9, which maps correspondences “using cross frame attention” so that “the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames”);
identifying common features associated with the subject within the cross-image aligned data using the correspondence map (see Khachatryan, figure 2 and sections 3-3.3.3 teaching such identification of common features associated with the subject within the cross-image aligned data, as the softmax function effectively highlights or identifies the most highly correlated features between the frames while suppressing irrelevant ones); and
exchanging the common features in the cross-image aligned data to compute the cross-image consistency data (see Khachatryan, figure 2 and sections 3-3.3.3, where the cross-frame attention module is utilized to exchange the common features in the cross-image aligned data as it enforces the “appearance and structure of the object and background as well as identities” to be “carried over from the first frame to subsequent frames,” such that this carrying over is an exchanging of common features to influence the cross-image consistency data being processed step by step).
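To picture the Examiner’s reading for claims 8-9 that the softmax attention matrix functions as a correspondence map and the multiplication by the values functions as an exchange of common features, consider the following sketch (shapes and names are assumptions, not the reference’s code):

    import torch

    def correspond_and_exchange(q_k, k_1, v_1):
        # q_k: frame-k queries; k_1, v_1: first-frame keys/values; each (tokens, c).
        scale = q_k.shape[-1] ** -0.5
        corr = torch.softmax(q_k @ k_1.T * scale, dim=-1)  # correspondence map:
        # row i weights how strongly token i of frame k matches each first-frame token
        return corr @ v_1  # first-frame (common) features carried into frame k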
Regarding claim 10, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein the pre-trained weights are determined by training the neural network to generate independent images in response to training text descriptions (see Khachatryan, Abstract and Section 1, teaching “a new task, zero-shot text-to-video generation, and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g. Stable Diffusion)” and “generating videos from textual prompts without requiring any optimization or fine-tuning. A key concept of our approach is to modify a pre-trained text-to-image model (e.g., Stable Diffusion), enriching it with temporally consistent generation. By building upon already trained text-to-image models, our method takes advantage of their excellent image generation quality and enhances their applicability to the video domain without performing additional training” and see sections 3-3.3 teaching use of “Stable Diffusion,” where Stable Diffusion comprises a neural network that has been “pre-trained” with weights determined by training the neural network to generate independent images in response to training text descriptions).
Regarding claim 11, Khachatryan teaches all that is required as applied to claim 1 above and further teaches receiving a definition of the subject and processing the definition along with the text description and prompts to generate the intermediate data (see Khachatryan, section 3.4 teaching a conditioning function that receives a subject definition through a “ControlNet,” which “enables to condition the generation process using edges, pose, semantic masks, image depths, etc” and “ControlNet creates a trainable copy of the encoder (including the middle blocks) of the UNet…while additionally taking the input xt and a condition c, and adds the outputs of each layer to the skip connections of the original UNet. Here c can be any type of condition, such as edge map, scribbles, pose (body landmarks), depth map, segmentation map, etc.,” where these additional conditions help to define the subject and are used with the text descriptions, which function as prompts based on their content; additionally and alternatively, Khachatryan also teaches receiving such a definition of the subject as in section 3.5 teaching that the approach “enables the adoption of any SD-based text-guided image editing algorithm to the video domain without any training or fine-tuning. Here we take the text-guided image editing method Instruct-Pix2Pix and combine it with our approach. More precisely, we change the self-attention mechanisms in Instruct-Pix2Pix to cross frame attentions according to Eq. 8. Our experiments show that this adaptation significantly improves the consistency of the edited videos,” such that here the original input video acts as a definition of the subject and is received along with the text and prompts to generate the intermediate data in the SD UNet during inference).
Regarding claim 12, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein the definition includes a noise seed that is extracted from a real image depicting the subject (here a “noise seed” in this context is some latent noise tensor (some type of Gaussian noise) that serves as the starting point for the backward generation process of diffusion, and extracted from a real image would mean applying a process to derive this noise tensor from actual pixel data rather than purely random sampling; see Khachatryan, section 3.5 as explained above, where the subject definition comes from the video, which is a real image depicting the subject, and as in section 3.1 it is confirmed how Stable Diffusion works, where “in SD, the function…is modeled as a neural network with a UNet-like [30] architecture composed of convolutional and (self- and cross-) attentional blocks. xT is called the latent code of the signal x0 and there is a method [3] to apply a deterministic forward process to reconstruct the latent code xT given a signal x0. This method is known as DDIM inversion. Sometimes for simplicity, we will call xt, t = 1,...,T also the latent codes of the initial signal x0,” where “x0” is defined as the encoded “latent tensor of an input image,” and thus the real image is the input image depicting the subject, the noise seed is the latent code xT, and when the video frames are used as input along with the text, the DDIM inversion “deterministic forward process to reconstruct the latent code xT given a signal x0” runs the diffusion process in reverse, adding noise deterministically and extracting the specific noise seed xT corresponding to the real image, where this noise seed is used to start the generation process along with the text instruction and prompts).
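The deterministic “noise seed” extraction discussed above can be pictured with the following DDIM-inversion sketch; the scheduler details, argument names, and the unet interface are the Examiner’s assumptions for illustration, not Khachatryan’s implementation:

    import torch

    @torch.no_grad()
    def ddim_invert(x0, unet, alphas_cumprod, timesteps, cond):
        # x0: encoded latent of the real image; timesteps run forward, 0 -> T.
        x = x0
        for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
            a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
            eps = unet(x, t_prev, cond)                       # predicted noise
            x0_hat = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
            x = a_t.sqrt() * x0_hat + (1 - a_t).sqrt() * eps  # step toward x_T
        return x  # deterministic latent code ("noise seed") for x0

Because every step is deterministic, the returned latent reproduces the real image when the generation process is run forward from it, which is why it can serve as a subject-specific noise seed.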
Regarding claim 13, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein the text description is separate from the two or more prompts (note that above in claim 1 the text description was not required to be separate from the two or more prompts, as the text content and arrangement serves to prompt the generative model in numerous ways; however, the instant claim requires the two or more prompts to be separate in some way from the text description, and Khachatryan also teaches such an embodiment, as in section 3.4 teaching a conditioning function that receives a subject definition through a “ControlNet,” which “enables to condition the generation process using edges, pose, semantic masks, image depths, etc” and “ControlNet creates a trainable copy of the encoder (including the middle blocks) of the UNet…while additionally taking the input xt and a condition c, and adds the outputs of each layer to the skip connections of the original UNet. Here c can be any type of condition, such as edge map, scribbles, pose (body landmarks), depth map, segmentation map, etc.,” where these additional conditions function as two or more prompts that are separate from the text description; additionally and alternatively, Khachatryan also teaches receiving two or more prompts as in section 3.5 teaching that the approach “enables the adoption of any SD-based text-guided image editing algorithm to the video domain without any training or fine-tuning. Here we take the text-guided image editing method Instruct-Pix2Pix and combine it with our approach. More precisely, we change the self-attention mechanisms in Instruct-Pix2Pix to cross frame attentions according to Eq. 8. Our experiments show that this adaptation significantly improves the consistency of the edited videos,” such that here the image frames and each pixel of the input video are two or more prompts received with the text description).
Regarding claim 14, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein the text description is included in at least one of the two or more prompts (see Khachatryan, Section 1, teaching addressing the “problem of zero-shot, ‘training free’ text-to-video synthesis, which is the task of generating videos from textual prompts without requiring any optimization or fine-tuning,” where the “approach is to modify a pre-trained text-to-image model (e.g., Stable Diffusion), enriching it with temporally consistent generation,” and where, as further explained in section 3.2, this text-to-video (T2V) synthesis is “given a text description τ and a positive integer m ∈ N, the goal is to design a function F that outputs video frames V ∈ R^(m×H×W×3) (for predefined resolution H × W) that exhibit temporal consistency,” such that this generates multiple video frame images corresponding to “m,” where the text description related to the subject is “τ” and the text description functions as two or more prompts describing scenes depicting the subject, such as for example in figure 2 with the example text description and two or more prompts such as “A horse is galloping on the street,” where “horse” and/or “the street” are subjects to be depicted in the images and comprise two or more prompts describing scenes for generation of the two or more images of the video depicting the subject).
Regarding claim 17, Khachatryan teaches all that is required as applied to claim 1 above and further teaches wherein at least one of the steps of receiving, generating, computing, or processing is performed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle (see Khachatryan, Abstract, teaching “code is publicly available” and section 4.1 teaching “Implementation,” where they “take the Stable Diffusion code with its pre-trained weights…and implement our modifications,” where such implementation of computer code is performed by a computer executing the code, and such a computer is a machine that is thus used for testing or certifying the SD neural-network-based system on the machine running the code; note further that section 2.2 teaches “our approach…does not require massive computing power or dozens of GPUs,” meaning that at least less-than-massive computing power from some computing machine is used to test the network, where in context it is of course understood that the process cannot be implemented without a machine to perform the computations).
Regarding claim 19, the instant claim recites a “system” comprising two functionally defined devices: “a memory that stores pre-trained weights for a neural network” and “a processor that is connected to the memory, wherein the processor is configured to” perform the same functions as those recited, addressed, and rejected with respect to the computer-implemented method of claim 1. Thus the only difference is that the pre-trained weights are associated with a “memory” and the steps are performed by a processor connected to that memory. Khachatryan already teaches such pre-trained weights and a computer implementation using them to perform the recited functions, as explained in the rejection of claim 1. Khachatryan also teaches that such pre-trained weights are stored in a memory, given that these weights are digital data structures comprising millions of numerical values: to be available for use they must necessarily be stored in some memory so that the processor, which is necessary to calculate with these millions of numerical values, can retrieve the weights and output the computer-generated results seen in the figures, for example (see Khachatryan, Abstract, “code is publicly available” at a website providing access to memory with such code, where the code relies on access to and use of the pre-trained weights of the Stable Diffusion model, and this is also the code implemented to obtain the results in Section 4 showing the result of such a system, where as explained above a computer of some sort is required to execute the method and generate such results). In light of this, the limitations of claim 19 correspond to the limitations of claim 1; thus it is rejected on the same grounds as claim 1.
Regarding claim 20, in view of the analysis above, it can be seen that the limitations of claim 20 correspond to the limitations of claim 2; thus they are rejected on the same grounds as claim 2.
Regarding claim 21, the instant claim is directed toward an apparatus in the form of “non-transitory computer-readable media storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps,” which are the same steps as recited in claim 1. Thus, similarly to the analysis of claim 19 above, the only difference is that the pre-trained weights are associated with a memory in the form of “non-transitory computer-readable media” and the steps are performed by a processor that uses the stored computer instructions. Khachatryan already teaches such pre-trained weights and a computer implementation using them to perform the recited functions, as explained in the rejection of claim 1. Khachatryan also teaches that such pre-trained weights are stored in a memory, given that these weights are digital data structures comprising millions of numerical values: to be available for use they must necessarily be stored in some memory so that the processor, which is necessary to calculate with these millions of numerical values, can retrieve the weights and output the computer-generated results seen in the figures, for example (see Khachatryan, Abstract, “code is publicly available” at a website providing access to memory with such code, where the code relies on access to and use of the pre-trained weights of the Stable Diffusion model, and this is also the code implemented to obtain the results in Section 4 showing the result of such a system, where as explained above a computer of some sort is required to execute the method and generate such results). In light of this, the limitations of claim 21 correspond to the limitations of claim 1; thus it is rejected on the same grounds as claim 1.
Regarding claim 22, in view of the analysis above, it can be seen that the limitations of claim 22 correspond to the limitations of claim 2; thus they are rejected on the same grounds as claim 2.
Claim(s) 1, 5 and 6 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Guo et al. (“Guo”).
Regarding claim 1, Guo teaches a computer-implemented method, comprising (note the method is addressed by the rejection of the method steps below; see Guo, page 2, noting that “code and pre-trained weights” are available for the project, such that the method is implemented as computer code, as would also be apparent to one of ordinary skill in the art recognizing that such utilization of a neural network and generation of image pixel data requires computer implementation):
receiving a text description related to a subject with two or more prompts describing scenes for generation of two or more images depicting the subject (note that here a “text description related to a subject” could be considered an element received, and this could be received “with two or more prompts describing scenes,” such that a text description and two or more prompts functioning as recited are required and such receiving of these is for generation of two or more images depicting the subject; this would include a text description related to a subject that functions to provide two or more prompts describing scenes for generation of images depicting the subject (a prompt being any type of input that causes or prompts a response from any other element of the system) and could also include a text description and some separate functional two or more prompts; see Guo, section 3.2 teaching “a T2I model is personalized for a specific 2D anime style. In that case, the corresponding animation generator should be capable of generating animation clips of that style with proper motions, such as foreground/background segmentation, character body movements, etc,” where this is done by choosing “to separately train a generalizable motion modeling module and plug it into the personalized T2I at inference time. By doing so, we avoid specific tuning for each personalized model and retain their knowledge by keeping the pre-trained weights unchanged,” such that as seen in figure 2 this corresponds to the input of the Personalized T2I taking text descriptions of subjects to generate related images, and in this case video images, where as in figure 4 and section 4.2 various subject definitions comprising a text description of a subject are obtained, such as the string “A racoon is playing the electronic guitar” or “photo of coastline, rocks, storm weather, wind, waves, lightning,” and this text string including a definition of the subject is used as a prompt, where the T2I front end of the inference portion as in figure 2 receives this string, which serves to prompt the model to generate two or more images depicting the subject, such that the model is “an animation generator” and can “produce diverse and personalized animated images via iteratively denoise process,” where for example figure 4 shows two or more images depicting the subject);
generating intermediate data associated with the two or more images by processing the text description with two or more prompts by at least one layer of a neural network according to pre-trained weights (note that intermediate data is broadly any internal or hidden data representation, such as latent codes or feature maps, or may be any data produced in the interval between reception of the prompt data and the ultimate output data, so long as it is generated in association with the two or more images by processing the text description and prompts by a layer of a neural network having pre-trained weights; thus see Guo, section 3.1 and figure 2 teaching the text-to-image generator as “Stable Diffusion,” where “SD is based on the Latent Diffusion Model (LDM), which executes the denoising process in the latent space of an autoencoder, namely E(·) and D(·), implemented as VQ-GAN or VQ-VAE pre-trained on large image datasets,” and uses “a modified UNet that incorporates four downsample/upsample blocks and one middle block, resulting in four resolution levels within the networks’ latent space. Each resolution level integrates 2D convolution layers as well as self- and cross-attention mechanisms. Text model τθ(·) is implemented using the CLIP [19] ViT-L/14 text encoder,” and as in section 3.3, “we transform each 2D convolution and attention layer in the original image model into spatial only pseudo-3D layers by reshaping the frame axis into the batch axis and allowing the network to process each frame independently,” and the inserted module “operates across frames in each batch to achieve motion smoothness and content consistency in the animation clips,” where in the motion module “the spatial dimensions height and width of the feature map z will first be reshaped to the batch dimension, resulting in batch × height × width sequences at the length of frames,” where these feature maps generated by the SD model based on the text prompts are intermediate data associated with the two or more images, generated by processing the text description and prompts by at least one layer of the neural network according to pre-trained weights, as “modules are inserted between the pre-trained image layers” as in figure 3 and as in section 3.2 “we avoid specific tuning for each personalized model and retain their knowledge by keeping the pre-trained weights unchanged,” such that here the intermediate data generated by the neural network of the SD model, such as the UNet, is according to pre-trained weights, as no training occurs at inference time and all models have pre-trained weights);
computing cross-image consistency data specific to the subject using the intermediate data (note that “cross-image consistency data” is extremely broad: for example, it does not specify what images the “cross-images” necessarily correspond to, nor is the manner of determining consistency limited (other than that it must broadly be “using the intermediate data” in any manner); thus, for example, any data that relates to the manner in which any aspect related to the subject is consistent or not consistent across two images, or data which functions to provide or ensure consistency across images specific to a subject, would be cross-image consistency data; note further that the manner of computing the cross-image consistency data specific to the subject “using the intermediate data” is not limited, and thus if the computation of such cross-image consistency data can be seen to use the intermediate data, and the subject definition encoded within it, in any manner, then such computing can be seen as “using” such data; thus see Guo, section 3.3, in which the motion modeling module computes such cross-image consistency data specific to the subject using “vanilla temporal transformers” “to enable efficient information exchange across frames” to compute attention maps used to ensure cross-image consistency with regard to the subject of the prompt and the motion of the subject in the prompt, where the “inserted motion module operates across frames in each batch to achieve motion smoothness and content consistency in the animation clips” and “When passing through our motion module, the spatial dimensions height and width of the feature map z will first be reshaped to the batch dimension, resulting in batch × height × width sequences at the length of frames. The reshaped feature map will then be projected and go through several self-attention blocks,” and this “enables the module to capture the temporal dependencies between features at the same location across the temporal axis,” such that this is the computing of cross-image consistency data specific to the subject, which uses the intermediate data of the feature maps to interrelate features relating to the subject and its position and appearance so that the subject appearance and motion appear consistent across images, where as in section 3.3, z corresponds to the feature map with features specific to the subject definition, and these features of the subject are projected and go through several self-attention blocks where z = Attention(Q, K, V) = Softmax(QK^T / √d) · V, such that these attention maps are consistency data and, using self-attention, QK^T functions as cross-image consistency data, since Q and K are projections of the intermediate data z and z is the feature map generated from the text prompt subject definition, so the calculation of QK^T thus uses both the intermediate data z and the subject definition encoded within z); and
processing the cross-image consistency data by at least one remaining layer of the neural network according to the pre-trained weights to generate the two or more images (see Guo, section 3 and figures 2 and 3, teaching the cross-image consistency data generated by the motion module is processed by at least one remaining layer of the neural network according to the pretrained weights where the Decoder and subsequent Unet blocks as in figure 2 are remaining layers of the neural network which process the data using the cross-image consistency data from the motion module explained above utilizing the pre-trained weights of the SD model for the decoding, and this processing is to generate a video comprising two or more images such as the “animated images” generated “via iterative denoise process” as seen in figures 2 and 4, with figure 4 showing other examples of video comprising two or more images related to prompts input to the model).
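For illustration of the reshaping-and-temporal-attention operation the Examiner cites from Guo section 3.3, consider the following hedged sketch; the module structure, shapes, and names are assumptions rather than Guo’s released code, and the channel count is assumed divisible by the head count:

    import torch
    import torch.nn as nn

    class TemporalSelfAttention(nn.Module):
        def __init__(self, c, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

        def forward(self, z):
            # z: (batch, frames, c, h, w) feature map from the pre-trained image layers.
            b, f, c, h, w = z.shape
            # Move spatial dims to the batch axis: (b*h*w, frames, c) sequences,
            # so attention runs along the frame axis at each spatial location.
            seq = z.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
            out, _ = self.attn(seq, seq, seq)  # attend across frames per location
            return out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

This matches the cited description that the module “captures the temporal dependencies between features at the same location across the temporal axis.”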
Regarding claim 2, Guo teaches all that is required as applied to claim 1 above and further teaches wherein computing the cross-image consistency data comprises: computing subject masks localizing the subject in the intermediate data (note that a “subject mask” is broadly interpreted to refer to any data structure (such as a vector, map, tensor, etc.) that localizes, highlights, or designates features of the subject within the intermediate data; see Guo, section 3.3, where the Q and K vectors serve as subject masks given that, in such an attention mechanism, the Query vector localizes the features to be attended to (those of the subject here) in the current frame, and the Key vector localizes the corresponding features in other frames, allowing the module “to capture the temporal dependencies between features at the same location across the temporal axis”); and combining the subject masks to compute the cross-image consistency data (see Guo, section 3.3 teaching the calculation of QK^T, where this dot product combines the localization vectors (functioning as the subject masks) to compute the full attention matrix, where the attention matrix is the cross-image consistency data representing the consistency relationships between the localized features across frames).
Regarding claim 5, as rendered definite as explained above, Guo teaches all that is required as applied to claim 2 above and further teaches wherein the two or more prompts are related to an additional subject (see Guo, figure 4 and section 4.2, where the two or more prompts correspond to the different text strings and their combination, where for example “1girl,” “cherry blossoms,” and “pink flowers” all function as prompts related to additional subjects, and where the “motion modeling module can understand the textual prompt and assign appropriate motions to each pixel, such as the motion of sea waves (3rd row) and the leg motion of the Pallas’s cat (7th row). We also find that our method can distinguish major subjects from foreground and background in the picture, creating a feeling of vividness and realism. For instance, the character and background blossoms in the first animation move separately, at different speeds, and with different blurring strengths,” such that for example the “character” and “background blossoms” are a subject and an additional subject coming from the two or more prompts) and further comprising: computing additional subject masks localizing the additional subject in the intermediate data (see Guo, section 3.3 as explained in the rejection of claim 2, where the Q and K vectors serve as subject masks given that, in such an attention mechanism, the Query vector localizes the features to be attended to (those of the subject here) in the current frame, and the Key vector localizes the corresponding features in other frames, allowing the module “to capture the temporal dependencies between features at the same location across the temporal axis,” and each subject has such a subject mask determined, which localizes that subject in the intermediate data such that during denoising the proper attention is paid to the proper local areas of the image according to the cross-attention and motion modules); and
merging the subject masks and the additional subject masks before the combining (see Guo, where as above the subject masks are combined to generate the cross-image consistency data as in section 3.3 and figure 3, where the multiplication by V combines the masks to generate the cross-attention data as in equation 4, and prior to this combining of the masks, they are merged according to the “Softmax” operation, which mathematically merges their relative weights before they are multiplied (combined) with the trailing V matrix to output the final cross-image consistency data).
Regarding claim 6, Guo teaches all that is required as applied to claim 2 above and further teaches interpolating between the cross-image consistency data produced by combining the subject masks and denoised intermediate data before processing the cross-image consistency data by the at least one remaining layer (note that interpolating is interpreted as any use of surrounding values to determine some unknown value between two values; see Guo, section 3.3, where such interpolation is performed with respect to the “residual connection” and “zero-initialized output projection” within the motion module, where the denoised intermediate data is the input feature map z entering the module and the cross-image consistency data is the output of the self-attention block, such as the projected vectors going through the attention phase, and as the architecture has “a residual connection so that the motion module is an identity mapping” and as it is taught to “zero initialize the output projection layer,” the final output is calculated as a weighted combination of the intermediate data (through the identity path) and the consistency data (through the attention path), which determines a new feature value that lies between the original static representation and the motion-adjusted representation in the latent feature space and thus interpolates between the two data sources before the combined result is processed by the remaining at least one layer, such as the final output layers of the UNet of the diffusion model).
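The residual-plus-zero-initialized-projection behavior discussed for claim 6 can be sketched as follows (illustrative assumptions only): at initialization the projection outputs zeros, so the module is an identity mapping, and after training the output lies between the original features and the motion-adjusted features.

    import torch
    import torch.nn as nn

    class ZeroInitResidual(nn.Module):
        # out = z + proj(attn(z)); proj starts at zero, so out == z initially.
        def __init__(self, c, attn):
            super().__init__()
            self.attn = attn                      # any module mapping (batch, tokens, c) to itself
            self.proj_out = nn.Linear(c, c)
            nn.init.zeros_(self.proj_out.weight)  # zero-initialized output projection
            nn.init.zeros_(self.proj_out.bias)

        def forward(self, z):                     # z: (batch, tokens, c)
            return z + self.proj_out(self.attn(z))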
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 15-16 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Khachatryan in view of Ceylon Aksit et al. (“Aksit”).
Regarding claim 15, Khachatryan teaches all that is required as applied to claim 1 above but fails to specifically teach wherein at least one of the steps of receiving, generating, computing, or processing is performed on a server or in a data center to generate the two or more images, and the two or more images are streamed to a user device. Rather, while Khachatryan’s method is and must be computer-implemented, such an arrangement with a server and user device is not explicitly detailed. It is noted, however, that a “server” as recited need not be a remote server, and streaming to a user device does not necessarily require streaming over a network to a remote device. Regardless, Khachatryan is silent with respect to implementing the method using any particular arrangement of servers or data centers responsible for certain steps and streaming such data to a user device. Thus Khachatryan stands as a base device upon which the claimed invention can be seen as an improvement through such an arrangement defining processing at a server and streaming outputs to a user device, which would give the benefits of distributed computing to the system and user.
In the same field of endeavor relating to utilizing image diffusion models and their neural networks for generating video images relating to text prompts guiding generation of video (see Aksit, paragraphs 0026-0035 teaching “the video editing system of the present disclosure enables editing of a video using a pre-trained image diffusion model and a prompt (e.g., text-based editing instruction) without any additional training. In some embodiments, the input video clip is inverted and edited based on a textual prompt received from a user or other entity” and “image generation model 110 may be implemented as a neural network. A neural network may include a machine-learning model that can be tuned (e.g., trained) based on training input to approximate unknown functions. In particular, a neural network can include a model of interconnected digital neurons that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For instance, the neural network includes one or more machine learning algorithms. In other words, a neural network is an algorithm that implements deep learning techniques, i.e., machine learning that utilizes a set of algorithms to attempt to model high-level abstractions in data. In various embodiments, the image generation model 110 may execute in a neural network manager 112. The neural network manager 112 may include an execution environment for the image generation model, providing any needed libraries, hardware resources, software resources, etc. for the image generation model” and “Edited video 126 has had its appearance edited based on the input prompt 104 without requiring any per-scene training by the image generation model and also maintains appearance and temporal consistency”), Aksit teaches that receiving, generating, computing, or processing related to the generation of some generated images from text prompts is performed on a server or in a data center (see Aksit, paragraphs 0026-0035 and figure 1, teaching “a video editing system 100 can enable video editing using an image generation model, such as Stable Diffusion, or other image diffusion-based model(s). The video editing system 100 may be implemented as a standalone system, such as an application executing on a client computing device, server computing device, or other computing device. In some embodiments, the video editing system may be implemented as a tool incorporated into another system, service, application, etc. to provide image diffusion-based video editing. The video editing system 100 may be implemented in a user device, in a service provider device as part of a cloud computing model, or other device which may receive input videos and return output videos” and see also paragraphs 0067-0081 and figures 9-11 teaching “the components 902-910 of the video editing system 900 may, for example, be implemented as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components 902-910 of the video editing system 900 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the components 902-910 of the video editing system 900 may be implemented as one or more web-based applications hosted on a remote server. 
Alternatively, or additionally, the components of the video editing system 900 may be implemented in a suite of mobile device applications or “apps.”” and “the video editing system 900 can be implemented as a single system. In other embodiments, the video editing system 900 can be implemented in whole, or in part, across multiple systems. For example, one or more functions of the video editing system 900 can be performed by one or more servers, and one or more functions of the video editing system 900 can be performed by one or more client devices. The one or more servers and/or one or more client devices may generate, store, receive, and transmit any type of data used by the video editing system 900, as described herein” and finally “the one or more client devices can include or implement at least a portion of the video editing system 900. In other implementations, the one or more servers can include or implement at least a portion of the video editing system 900. For instance, the video editing system 900 can include an application running on the one or more servers or a portion of the video editing system 900 can be downloaded from the one or more servers. Additionally or alternatively, the video editing system 900 can include a web hosting application that allows the client device(s) to interact with content hosted at the one or more server(s)” where the computing devices utilized for the processing and computing may be “GPUs,” such that a user device can be streamed the output by sending it through one of the remote-server, cloud-based, or data-center-based implementations, where any of the steps may be performed by any combination of the device architecture), and the images are streamed to a user device (see Aksit, paragraphs 0067-0089 and figures 9-11 and paragraphs 0098-0106 as explained above teaching for example “the components 902-910 of the video editing system 900 may, for example, be implemented as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components 902-910 of the video editing system 900 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the components 902-910 of the video editing system 900 may be implemented as one or more web-based applications hosted on a remote server. Alternatively, or additionally, the components of the video editing system 900 may be implemented in a suite of mobile device applications or “apps” and “cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly” such that a user device can be streamed the output by sending it through one of the remote-server, cloud-based, or data-center-based implementations, where any of the steps may be performed by any combination of the device architecture).
Thus Aksit teaches known techniques applicable to the base system of Khachatryan, which is ready for improvement through known solutions for implementing video diffusion networks based on input prompts and pre-trained weights.
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Khachatryan by applying the known techniques of Aksit, as doing so would be no more than application of a known technique to a base device ready for improvement, which would yield predictable results and an improved system. The predictable result of the combination would be that the video generation method of Khachatryan is utilized and implemented as in Aksit, where at least one of the above claimed steps is implemented on a server or data center and the result is then streamed to a user device. This would result in an improved system, as it would allow a user of any end user device to avoid the intensive video generation steps requiring specialized hardware while leveraging the massive computing power of servers running specialized hardware not available to user devices, as would be apparent to one having ordinary skill in the art and as is suggested by Aksit (see Aksit, paragraph 0099 teaching “cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly”).
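By way of illustration of the server-and-streaming arrangement discussed above, the following is a minimal, hypothetical Python sketch. It is not drawn from Khachatryan or Aksit; the Flask endpoint, the generate_images stub, and the header-plus-payload framing are all assumptions made solely for illustration, with the diffusion model itself stubbed out.

import io
import json

from flask import Flask, Response, request
from PIL import Image

app = Flask(__name__)

def generate_images(prompts):
    # Stub standing in for a text-to-image diffusion model; a real
    # implementation would run denoising steps on data-center GPUs.
    for prompt in prompts:
        yield prompt, Image.new("RGB", (512, 512), color="gray")

@app.route("/generate", methods=["POST"])
def generate():
    prompts = request.get_json()["prompts"]

    def stream():
        # Emit each image as soon as it is ready instead of waiting for
        # the whole batch: a JSON header line, then the raw PNG bytes.
        for prompt, image in generate_images(prompts):
            buf = io.BytesIO()
            image.save(buf, format="PNG")
            payload = buf.getvalue()
            header = json.dumps({"prompt": prompt, "bytes": len(payload)})
            yield header.encode("utf-8") + b"\n" + payload + b"\n"

    return Response(stream(), mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)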
Regarding claim 16, Khachatryan teaches all that is required as applied to claim 1 above but fails to specifically teach wherein at least one of the steps of receiving, generating, computing, or processing is performed within a cloud computing environment. While Khachatryan’s method is and must be computer-implemented, no such arrangement or mention of a cloud computing environment is detailed. Thus Khachatryan stands as a base device upon which the claimed invention can be seen as an improvement through utilization of cloud computing to perform one of the recited steps of the method.
In the same field of endeavor relating to utilizing image diffusion models and their neural networks for generating video images relating to text prompts guiding generation of video (see Aksit, paragraphs 0026-0035 teaching “the video editing system of the present disclosure enables editing of a video using a pre-trained image diffusion model and a prompt (e.g., text-based editing instruction) without any additional training. In some embodiments, the input video clip is inverted and edited based on a textual prompt received from a user or other entity” and “image generation model 110 may be implemented as a neural network. A neural network may include a machine-learning model that can be tuned (e.g., trained) based on training input to approximate unknown functions. In particular, a neural network can include a model of interconnected digital neurons that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For instance, the neural network includes one or more machine learning algorithms. In other words, a neural network is an algorithm that implements deep learning techniques, i.e., machine learning that utilizes a set of algorithms to attempt to model high-level abstractions in data. In various embodiments, the image generation model 110 may execute in a neural network manager 112. The neural network manager 112 may include an execution environment for the image generation model, providing any needed libraries, hardware resources, software resources, etc. for the image generation model” and “Edited video 126 has had its appearance edited based on the input prompt 104 without requiring any per-scene training by the image generation model and also maintains appearance and temporal consistency”), Aksit teaches that at least one of the steps of receiving, generating, computing, or processing related to the generation of some first generated video shot is performed within a cloud computing environment (see Aksit, paragraphs 0026-0035 and figure 1, teaching “a video editing system 100 can enable video editing using an image generation model, such as Stable Diffusion, or other image diffusion-based model(s). The video editing system 100 may be implemented as a standalone system, such as an application executing on a client computing device, server computing device, or other computing device. In some embodiments, the video editing system may be implemented as a tool incorporated into another system, service, application, etc. to provide image diffusion-based video editing. The video editing system 100 may be implemented in a user device, in a service provider device as part of a cloud computing model, or other device which may receive input videos and return output videos” and see also paragraphs 0067-0081 and figures 9-11 teaching “the components 902-910 of the video editing system 900 may, for example, be implemented as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components 902-910 of the video editing system 900 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the components 902-910 of the video editing system 900 may be implemented as one or more web-based applications hosted on a remote server. 
Alternatively, or additionally, the components of the video editing system 900 may be implemented in a suite of mobile device applications or “apps.”” and “the video editing system 900 can be implemented as a single system. In other embodiments, the video editing system 900 can be implemented in whole, or in part, across multiple systems. For example, one or more functions of the video editing system 900 can be performed by one or more servers, and one or more functions of the video editing system 900 can be performed by one or more client devices. The one or more servers and/or one or more client devices may generate, store, receive, and transmit any type of data used by the video editing system 900, as described herein” and finally “the one or more client devices can include or implement at least a portion of the video editing system 900. In other implementations, the one or more servers can include or implement at least a portion of the video editing system 900. For instance, the video editing system 900 can include an application running on the one or more servers or a portion of the video editing system 900 can be downloaded from the one or more servers. Additionally or alternatively, the video editing system 900 can include a web hosting application that allows the client device(s) to interact with content hosted at the one or more server(s)” where the computing devices utilized for the processing and computing may be “GPUs,” such that at least one of the recited steps may be performed within a cloud computing environment, with any of the steps performed by any combination of the device architecture). Thus Aksit teaches known techniques applicable to the base system of Khachatryan, which is ready for improvement through known solutions for implementing video diffusion networks based on input prompts and pre-trained weights.
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Khachatryan by applying the known techniques of Aksit, as doing so would be no more than application of a known technique to a base device ready for improvement, which would yield predictable results and an improved system. The predictable result of the combination would be that the video generation method of Khachatryan is implemented using the architectures of Aksit, where at least one of the above claimed steps is implemented in a cloud computing environment with results provided to the user through the cloud computing architecture chosen by the system designer. This would result in an improved system, as it would allow a user of any end user device to avoid the intensive video generation steps requiring specialized hardware while leveraging the massive computing power of servers running specialized hardware not available to user devices, as would be apparent to one having ordinary skill in the art and as is suggested by Aksit (see Aksit, paragraph 0099 teaching “cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly”).
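Continuing the hypothetical sketch above from the user-device side, the following minimal Python client consumes results from a cloud-hosted endpoint as they arrive. The URL and the framing format are placeholders matching the earlier sketch, not anything taught by Khachatryan or Aksit.

import json

import requests  # any HTTP client would do; assumed available

def receive_generated_images(prompts, url="http://cloud.example.invalid/generate"):
    # The URL is a placeholder; stream=True lets the device consume each
    # image as the cloud service finishes it rather than buffering all.
    with requests.post(url, json={"prompts": prompts}, stream=True) as resp:
        resp.raise_for_status()
        raw = resp.raw
        while True:
            header_line = raw.readline()
            if not header_line:
                break
            header = json.loads(header_line)
            png_bytes = raw.read(header["bytes"])
            raw.readline()  # consume the trailing newline after the payload
            yield header["prompt"], png_bytes

if __name__ == "__main__":
    for prompt, png in receive_generated_images(["a red fox", "a red fox at night"]):
        print(f"received {len(png)} bytes for prompt {prompt!r}")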
Regarding claim 18, Khachatryan teaches all that is required as applied to claim 1 above but fails to specifically teach wherein at least one of the steps of receiving, generating, computing, or processing is performed on a virtual machine comprising a portion of a graphics processing unit. While Khachatryan’s method is and must be computer-implemented, no such arrangement or mention of a virtual machine comprising a portion of a graphics processing unit performing at least one of the steps is disclosed. Thus Khachatryan stands as a base device upon which the claimed invention can be seen as an improvement through utilization of a virtual machine comprising a portion of a graphics processing unit as recited, which would give the benefits of virtualization as understood by one having ordinary skill in the art.
In the same field of endeavor relating to utilizing image diffusion models and their neural networks for generating video images relating to text prompts guiding generation of video (see Aksit, paragraphs 0026-0035 teaching “the video editing system of the present disclosure enables editing of a video using a pre-trained image diffusion model and a prompt (e.g., text-based editing instruction) without any additional training. In some embodiments, the input video clip is inverted and edited based on a textual prompt received from a user or other entity” and “image generation model 110 may be implemented as a neural network. A neural network may include a machine-learning model that can be tuned (e.g., trained) based on training input to approximate unknown functions. In particular, a neural network can include a model of interconnected digital neurons that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For instance, the neural network includes one or more machine learning algorithms. In other words, a neural network is an algorithm that implements deep learning techniques, i.e., machine learning that utilizes a set of algorithms to attempt to model high-level abstractions in data. In various embodiments, the image generation model 110 may execute in a neural network manager 112. The neural network manager 112 may include an execution environment for the image generation model, providing any needed libraries, hardware resources, software resources, etc. for the image generation model” and “Edited video 126 has had its appearance edited based on the input prompt 104 without requiring any per-scene training by the image generation model and also maintains appearance and temporal consistency”), Aksit teaches that at least one of the steps of receiving, generating, computing, or processing is performed on a virtual machine comprising a portion of a graphics processing unit (see Aksit, paragraphs 0026-0035 and figure 1, teaching “a video editing system 100 can enable video editing using an image generation model, such as Stable Diffusion, or other image diffusion-based model(s). The video editing system 100 may be implemented as a standalone system, such as an application executing on a client computing device, server computing device, or other computing device. In some embodiments, the video editing system may be implemented as a tool incorporated into another system, service, application, etc. to provide image diffusion-based video editing. The video editing system 100 may be implemented in a user device, in a service provider device as part of a cloud computing model, or other device which may receive input videos and return output videos” and see also paragraphs 0067-0081 and figures 9-11 teaching “the components 902-910 of the video editing system 900 may, for example, be implemented as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components 902-910 of the video editing system 900 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the components 902-910 of the video editing system 900 may be implemented as one or more web-based applications hosted on a remote server.
Alternatively, or additionally, the components of the video editing system 900 may be implemented in a suite of mobile device applications or “apps.”” and “the video editing system 900 can be implemented as a single system. In other embodiments, the video editing system 900 can be implemented in whole, or in part, across multiple systems. For example, one or more functions of the video editing system 900 can be performed by one or more servers, and one or more functions of the video editing system 900 can be performed by one or more client devices. The one or more servers and/or one or more client devices may generate, store, receive, and transmit any type of data used by the video editing system 900, as described herein” and finally “the one or more client devices can include or implement at least a portion of the video editing system 900. In other implementations, the one or more servers can include or implement at least a portion of the video editing system 900. For instance, the video editing system 900 can include an application running on the one or more servers or a portion of the video editing system 900 can be downloaded from the one or more servers. Additionally or alternatively, the video editing system 900 can include a web hosting application that allows the client device(s) to interact with content hosted at the one or more server(s)” where the computing devices utilized for the processing and computing may be “GPUs,” such that the shared pool of computing resources can be “rapidly provisioned via virtualization,” where these resources may be “GPUs” that are “securely divided between multiple customers,” thus provisioning to the multiple users a virtual machine with access to a portion or slice of that graphics processing unit resource). Thus Aksit teaches known techniques applicable to the base system of Khachatryan, which is ready for improvement through known solutions for implementing video diffusion networks based on input prompts and pre-trained weights.
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Khachatryan by applying the known techniques of Aksit, as doing so would be no more than application of a known technique to a base device ready for improvement, which would yield predictable results and an improved system. The predictable result of the combination would be that the video generation method of Khachatryan is utilized and implemented as in Aksit, where at least one of the above claimed steps is performed on a virtual machine comprising a portion of a graphics processing unit, allowing the end user device to avoid processing such data. This would result in an improved system, as it would allow a user of any end user device to avoid the intensive video generation steps requiring specialized hardware while leveraging the massive computing power of servers running specialized hardware not available to user devices, as would be apparent to one having ordinary skill in the art and as is suggested by Aksit (see Aksit, paragraph 0099 teaching “cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly”).
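As a purely illustrative sketch of the virtual-machine-with-GPU-portion arrangement, the following assumes an NVIDIA MIG-capable GPU and PyTorch; the MIG UUID is a placeholder, and nothing here is drawn from Khachatryan or Aksit. Restricting CUDA_VISIBLE_DEVICES to a single MIG slice is one known way a virtualized worker can be granted only a portion of a physical GPU.

import os

# Expose only one MIG slice to this process (placeholder UUID); the variable
# must be set before any CUDA-using framework initializes to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-00000000-placeholder"

import torch  # imported after the environment variable is set

def report_visible_devices():
    # Confirms which (fractional) GPU resources this worker was granted.
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            print(index, torch.cuda.get_device_name(index))
    else:
        print("no CUDA device (or MIG slice) visible to this process")

if __name__ == "__main__":
    report_visible_devices()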
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-7, 10 and 15-22 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-5 and 9-12 of copending Application No. 18650761 (co-pending conflicting application). Although the claims at issue are not identical, they are not patentably distinct from each other as explained below.
Pending Application No. 18650726, Claim 1:
A computer-implemented method, comprising: receiving a text description related to a subject with two or more prompts describing scenes for generation of two or more images depicting the subject;
generating intermediate data associated with the two or more images by processing the text description with two or more prompts by at least one layer of a neural network according to pre-trained weights;
computing cross-image consistency data specific to the subject using the intermediate data; and
processing the cross-image consistency data by at least one remaining layer of the neural network according to the pre-trained weights to generate the two or more images.
Co-Pending Application No. 18650761, Claim 1:
A computer-implemented method, comprising: obtaining a subject definition comprising a text description of a subject;
receiving a first prompt describing a first scene for generation of a first shot comprising two or more images depicting the subject;
generating intermediate data associated with the first shot by processing the first prompt by at least one layer of a neural network according to pre-trained weights;
computing cross-image consistency data specific to the subject using the intermediate data and the subject definition; and
processing the cross-image consistency data by at least one remaining layer of the neural network according to the pre-trained weights to generate a video comprising at least the first shot.
Thus it can be seen that conflicting claim 1 is a slightly different and, with regard to necessitating generation of “video,” narrower version of pending claim 1, such that pending claim 1 is effectively a genus to the species recited in claim 1 of the copending application. The differences between the claims are that the conflicting claims require generating a video, whereas the pending claims require generation of two or more images but not necessarily a video. Furthermore, the conflicting claims require a first prompt while the pending claims require two or more prompts, but in the context of a text description of a subject where whatever two or more prompts are related to the text description are for generation of two or more images depicting the subject. Thus, in the case that a text description of a subject as in the conflicting claims contained multiple prompts related to generating the scene, such as multiple text keywords or tokens, reception of a first prompt would also be reception of two or more prompts as in the pending claims. The more specific conflicting independent claim is therefore effectively a species to the genus of pending claim 1, such that it anticipates the limitations of pending claim 1. Independent claims 14 and 18 correspond to claim 1 as explained above and are rejected for the same reasons as claim 1.
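For purposes of illustration only, the genus recited in pending claim 1 can be sketched as the following skeletal Python pipeline. Every function, name, and data structure below is a placeholder invented for this sketch and is not drawn from either application's actual disclosure; the neural network layers are stubbed rather than implemented.

def early_layers(text_description, prompts, weights):
    # "generating intermediate data ... by at least one layer of a neural
    # network according to pre-trained weights" -- stubbed as tuples here.
    return [(text_description, prompt, weights) for prompt in prompts]

def cross_image_consistency(intermediate, subject):
    # "computing cross-image consistency data specific to the subject using
    # the intermediate data" -- shared data tying all images to one subject.
    return {"subject": subject, "per_image": intermediate}

def remaining_layers(consistency, weights):
    # "processing the cross-image consistency data by at least one remaining
    # layer ... to generate the two or more images" -- stubbed as strings.
    return [f"image of {consistency['subject']} ({prompt})"
            for _, prompt, _ in consistency["per_image"]]

def generate_images(text_description, prompts, weights="pre-trained"):
    intermediate = early_layers(text_description, prompts, weights)
    consistency = cross_image_consistency(intermediate, subject=text_description)
    return remaining_layers(consistency, weights)

print(generate_images("a red fox", ["in a forest", "on a beach"]))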
Note that the dependent claims correspond and conflict according to the following table.
Pending Application No. 18650726 | Conflicting Copending Application 18650761
2 | 2
3 | 3
6 | 4
7 | 5
10 | 9
15 | 10
16 | 11
17 | 12
18 | 13
2 | 17
2 | 19
This is a provisional nonstatutory double patenting rejection.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SCOTT E SONNERS whose telephone number is (571) 270-7504. The examiner can normally be reached Monday-Friday, 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu can be reached at (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SCOTT E SONNERS/Examiner, Art Unit 2613
/XIAO M WU/Supervisory Patent Examiner, Art Unit 2613
1 Khachatryan, Levon, et al. "Text2video-zero: Text-to-image diffusion models are zero-shot video generators." Proceedings of the IEEE/CVF International Conference on Computer Vision. March 23, 2023.
2 Guo, Yuwei, et al. "AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning." arXiv preprint arXiv:2307.04725. July 10, 2023.
3 US PGPUB No. 20250111866