DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because Fig. 4, step 430 “mode” should read “model”. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
Paragraph 55, “second denoising diffusion model 130” should read “second denoising diffusion model 140”.
Paragraph 95, the phrase “current iteration is the final iteration” is duplicated.
Appropriate correction is required.
Claim Objections
Claims 8, 19, and 20 are objected to because of the following informalities: typographical errors as follows.
Claim 8, “each respective iteration” should read “each iteration of the plurality of iterations”.
Claim 19 (line 4) and claim 20 “perform the operations” should read “perform operations”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 11 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 11 recites the limitation "at each of a plurality of iterations, updating…the final iteration" in line 7 and the last line. There is insufficient antecedent basis for this limitation in the claim, because it is unclear whether this plurality of iterations is the same instance as the plurality of iterations recited in line 13 of claim 10 or a new instance. Additionally, reciting “the final iteration” makes it unclear which final iteration is being referenced among these possibly two instances of a plurality of iterations.
Note: these claims most likely depend from an incorrect claim or are missing elements. To resolve this issue, the claim dependencies should be reviewed; the first instance of an element should be clearly introduced with “a” or “an” rather than “the,” and, if multiple instances exist, subsequent instances should be further distinguished, for example as “first,” “second,” “third,” etc.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 20 recites a computer-readable storage media. The broadest reasonable interpretation of a claim drawn to a computer readable medium (also called machine readable media and other such variations) typically covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer readable media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. 101 as covering non-statutory subject matter. The USPTO recognizes that applicants may have claims directed to computer readable media that cover signals per se, which the USPTO must reject under 35 U.S.C. 101 as covering both non-statutory subject matter and statutory subject matter. A claim drawn to such a computer readable medium that covers both transitory and non-transitory embodiments may be amended to narrow the claim to cover only statutory embodiments to avoid a rejection under 35 U.S.C. 101 by adding the limitation "non-transitory" to the claim. Such an amendment would typically not raise the issue of new matter, even when the specification is silent, because the broadest reasonable interpretation relies on the ordinary and customary meaning that includes signals per se.
Applicant’s specification recites in paragraph [0106], “Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, ….” and in paragraph [0112], “Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.”
Since Applicant’s disclosure does not limit the definition of “computer-readable storage media,” it could encompass a signal. Also, paragraph [0106] fails to explicitly exclude a transitory medium.
As an additional note, a non-transitory computer-readable medium having executable programming instructions stored thereon is considered statutory, as non-transitory computer-readable media exclude transitory data signals.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 6-10, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Molad et al. (Dreamix: Video Diffusion Models are General Video Editors), hereinafter referenced as Molad, in view of Gandelsman et al. (U.S. Patent Application Publication No. 2024/0161462), hereinafter referenced as Gandelsman.
Regarding claim 1, Molad teaches A computer-implemented method for generating an output video conditioned on an input, the method comprising: (page 8, fig. 9 (and page 7, last two paragraphs of LHC [left hand column]) shows input image(s) and a text prompt leading to an output video of a specific type); the specific types of videos generated from the text prompt show them being conditioned on an input; and generating the output video from the current intermediate representation after the final iteration (abstract teaches "first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use our general video editor to animate it. As a further application, we can use our method for subject-driven video generation"); this shows outputting the video, which would include the final current intermediate representation/image since that is the final image after the final iteration.
However, Molad fails to teach receiving a conditioning input; initializing a current intermediate representation of the output video; and at each of a plurality of iterations, updating the current intermediate representation, the updating comprising: generating a first noise output by processing a first input comprising the current intermediate representation using a first denoising diffusion model conditioned on the conditioning input, wherein the first denoising diffusion model has been trained on first training data comprising a first plurality of training videos; generating a second noise output conditioned on the conditioning input by processing a second input comprising the current intermediate representation using a second denoising diffusion model conditioned on the conditioning input, wherein the second denoising diffusion model (i) is different from the first denoising diffusion model and (ii) has been trained on second training data comprising a second plurality of training videos; generating a combined noise output by combining at least (i) the first noise output and (ii) the second noise output; and updating the current intermediate representation using the combined noise output generated for the iteration;
However, Gandelsman teaches receiving a conditioning input; (Gandelsman, fig. 1 and paragraph 25 teaches "optimization-based method including embedding the input image in a latent text embedding space of a pre-trained diffusion model"); embedding the input image in a latent text embedding space of the model shows receiving a conditioning input (the embedded input image); initializing a current intermediate representation of the output video; (Gandelsman, paragraph 38 teaches "initializes a noise map, and computes a loss function by comparing each of the set of intermediate images to the image"); comparing each of the set of intermediate images/representations means there must first be an initialized current intermediate representation of the output video; and at each of a plurality of iterations, updating the current intermediate representation, (Gandelsman, paragraph 49 teaches "Diffusion models work by iteratively adding noise to the data during a forward process and then learning to recover the data by denoising the data during a reverse…gradually add noise to the original image 405 to obtain noisy images 420 at various noise levels"); this shows multiple iterations and updating of the intermediate image/representation at each iteration, since paragraph 84 mentions intermediate images corresponding to the set of noise maps at different noise levels; the updating comprising: generating a first noise output by processing a first input comprising the current intermediate representation using a first denoising diffusion model conditioned on the conditioning input, (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model. In some cases, the operations of this step refer to, or may be performed by, a pre-trained diffusion model as described with reference to FIG. 2"); the noise map at a noise level shows the first noise output, and operations performed by the pre-trained diffusion model show this is done by processing an input (including the first input) comprising the intermediate image/representation using the pre-trained/first denoising diffusion model conditioned on the conditioning input; wherein the first denoising diffusion model has been trained on first training data comprising a first plurality of training videos (Gandelsman, paragraph 21 teaches "Diffusion models are a class of generative neural networks which can be trained to generate new data with features similar to features found in training data. In particular, diffusion models can be used to generate novel images"); when viewed in combination, these images/training data that the diffusion models are trained on (images, because the models generate data with features found in the training data and generate images) would be of the videos (and frames thereof) mentioned in the Molad abstract; generating a second noise output conditioned on the conditioning input by processing a second input comprising the current intermediate representation using a second denoising diffusion model conditioned on the conditioning input, (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model." and paragraph 49 teaches "add noise to the original image 405 to obtain noisy images 420 at various noise levels"); using the diffusion model includes the tuned/second model (mentioned in the abstract), and the noisy image at various levels means a second noise output, which is done by processing a second intermediate image/representation from the set of intermediate images; wherein the second denoising diffusion model (i) is different from the first denoising diffusion model and (ii) has been trained on second training data comprising a second plurality of training videos (Gandelsman, paragraph 75 teaches "the tuned diffusion model takes a noise map and a text embedding as input and generates one or more modified images that retain the identity or a target image while also incorporating elements described by the text or other guidance" and paragraph 112 teaches "obtain an image and a prompt for editing the image; fine-tuning a pre-trained diffusion model based on the image to obtain a tuned diffusion model"); generating modified images that retain identity, and fine-tuning based on the image/data provided, shows training the model on second training data, which when viewed in combination would be of video frames and a plurality of specific types of videos as shown in Molad on fig. 9 of page 8; also, the fine-tuned/second model is tuned and thus different from the pre-trained/first model; generating a combined noise output by combining at least (i) the first noise output and (ii) the second noise output (Gandelsman, paragraph 49 teaches "iteratively adding noise to the data during a forward process"); iteratively adding noise would mean the two noise outputs being combined; and updating the current intermediate representation using the combined noise output generated for the iteration (Gandelsman, paragraph 22 teaches "iteratively adding noise to the data...an output image is created from each of the various noise levels"); the output image from each noise level shows updating the current intermediate image/representation, and since it is done by iteratively adding noise, it uses the combined noise output generated for the iteration. Gandelsman is considered to be analogous art because it is reasonably pertinent to the problem faced by the inventor of inputting a prompt to a denoising diffusion model and using iterations alongside noise to get a desired visual output. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Molad's invention with the iteration, noise, and diffusion model techniques of Gandelsman to provide an improvement over conventional diffusion-based image generation models by producing various outputs that resemble a single target image (Gandelsman, paragraph 24). This ensures a more desired result for a user.
Regarding claim 2, the combination of Molad and Gandelsman teaches wherein the conditioning input is an input text prompt (fig. 1 shows the input prompt as "a person with a mustache"); this description is a text input for the input prompt and is part of the conditioning input. The same motivations used in claim 1 apply here in claim 2.
Regarding claim 6, the combination of Molad and Gandelsman teaches wherein combining at least (i) the first noise output and (ii) the second noise output comprises: computing a weighted sum of (i) the first noise output and (ii) the second noise output (Gandelsman, paragraph 37 teaches "a first weight for a loss function is used for pre-training the diffusion model 235 and a second weight for the loss function that is different from the first weight is used for fine-tuning the diffusion model"); the loss weights affect how much influence each model's outputs have, thus leading to a weighted sum of the first and second noise outputs when the noise outputs are combined as explained above. The same motivations used in claim 1 apply here in claim 6.
Regarding claim 7, the combination of Molad and Gandelsman teaches wherein updating the current intermediate representation for the respective iteration further comprises: generating a third noise output by processing a third input comprising the current intermediate representation using the second denoising diffusion model, (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model." and paragraph 49 teaches "add noise to the original image 405 to obtain noisy images 420 at various noise levels"); using the diffusion model includes the tuned/second model (mentioned in the abstract), and the noisy image at various levels means a third noise output, which is done by processing a third intermediate image/representation from the set of intermediate images during a third iteration; wherein the third noise output is not conditioned on the conditioning input (page 7, right column, second to last full paragraph teaches "Directly mapping the text prompt to a video, without conditioning on the input video using Imagen-Video."); this shows being able to map a text prompt to a video without conditioning; this means that when the third iteration is the last one, the third noise output would not be conditioned on the conditioning input, since that would lead to a more efficient design due to more direct and streamlined generation for the last/third iteration; and wherein generating the combined noise output comprises: combining (i) the first noise output, (ii) the second noise output, and (iii) the third noise output (Gandelsman, paragraph 49 teaches "iteratively adding noise to the data during a forward process"); iteratively adding noise would mean the three noise outputs being combined. The same motivations used in claim 1 apply here in claim 7.
Regarding claim 8, the combination of Molad and Gandelsman teaches wherein each respective iteration corresponds to a respective noise level, (Gandelsman, paragraph 22 teaches "iteratively adding noise to the data" and paragraph 79 teaches "t represents a step in the sequence of transitions associated with different noise levels, The reverse diffusion process 1010 outputs x.sub.t−1, such as second intermediate image 1025 iteratively until x.sub.T is reverted back to x.sub.0, the original image 1030"); this shows iterating to add noise, with each step in the sequence (thus each iteration) associated with a noise level; and wherein each of the first and the second input further comprises data specifying the respective noise level (Gandelsman, paragraph 102 teaches "include adding noise at the different noise levels to the image to obtain a plurality of noisy images"); adding noise at different noise levels involves each input having data specifying the respective noise level to ensure the right amount of noise is added. The same motivations used in claim 1 apply here in claim 8.
Regarding claim 9, the combination of Molad and Gandelsman teaches wherein for each iteration other than the last iteration, updating the current intermediate representation comprises: performing diffusion sampling to an input comprising the current intermediate representation and the combined noise output (Gandelsman, paragraph 53 teaches "diffusion models are based on a neural network architecture known as a U-Net...input features 505 using an initial neural network layer 510 (e.g., a convolutional network layer) to produce intermediate features 515. The intermediate features 515 are then down-sampled using a down-sampling layer 515 such that the features have a resolution less than the initial resolution and a number of channels greater than the initial number of channels"); this shows diffusion sampling performed on an input, which is used to create intermediate features (the current intermediate representation) that have the aforementioned noise; also, one of ordinary skill in the art would understand that in some configurations there would be a maximum number of iterations due to computational resource limits, leaving the last iteration out from being updated. The same motivations used in claim 1 apply here in claim 9.
Regarding claim 10, Molad teaches A computer-implemented method for generating a domain-specific output video conditioned on an input, the method comprising: (page 8, fig. 9 (and page 7, last two paragraphs of LHC [left hand column]) shows input image(s) and a text prompt leading to an output video of a specific type); the specific types of videos show them being domain-specific; and generating the output domain-specific video from the current intermediate representation after the final iteration (abstract teaches "first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use our general video editor to animate it. As a further application, we can use our method for subject-driven video generation"); this shows outputting the video, which is still of the specific subject/type and would include the final current intermediate representation/image since that is the final image after the final iteration.
However, Molad fails to teach obtaining model parameter values for a pre-trained denoising diffusion model, wherein the pre-trained denoising diffusion model (i) is configured to process an intermediate input to generate a noise output, and (ii) has been trained on first training data comprising a plurality of training videos; receiving a request for generating a video of a particular type; obtaining model parameter values for a domain-specific denoising diffusion model, wherein the domain-specific denoising diffusion model has been trained on second training data comprising a plurality of domain-specific training videos of the particular type; receiving a conditioning input; initializing a current intermediate representation of the domain-specific output video; at each of a plurality of iterations, updating the current intermediate representation, the updating comprising: generating a first noise output by processing a first input comprising the current intermediate representation using the pre-trained denoising diffusion model conditioned on the conditioning input; generating a second noise output by processing a second input comprising the current intermediate representation using the domain-specific denoising diffusion model conditioned on the conditioning input; generating a combined noise output by combining at least (i) the first noise output and (ii) the second noise output; and updating the current intermediate representation using the combined noise output generated for the iteration;
However, Gandelsman teaches obtaining model parameter values for a pre-trained denoising diffusion model, (Gandelsman, paragraph 33 teaches "Database 120 may contain…model parameters" and paragraph 71 teaches "the parameters of the pre-trained diffusion model are kept unchanged"); the database containing parameters (and values thereof) shows that they can be obtained, and since the pre-trained diffusion model performs denoising in paragraph 77, it is a pre-trained denoising diffusion model; wherein the pre-trained denoising diffusion model (i) is configured to process an intermediate input to generate a noise output, (Gandelsman, fig. 11 and paragraph 84 teach "operation 1110, the system generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model"); the intermediate input is the intermediate images with different amounts of noise, and the noise maps at different levels are the generated noise output; and (ii) has been trained on first training data comprising a plurality of training videos (Gandelsman, paragraph 21 teaches "Diffusion models are a class of generative neural networks which can be trained to generate new data with features similar to features found in training data. In particular, diffusion models can be used to generate novel images"); when viewed in combination, these images/training data that the diffusion models are trained on (images, because the models generate data with features found in the training data and generate images) would be of the videos (and frames thereof) mentioned in the Molad abstract; receiving a request for generating a video of a particular type (Gandelsman, abstract teaches "Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high resolution information that it synthesized to align with the guiding text prompt" and page 2, first full paragraph, teaches "user provides an input video and a text prompt which describes the desired attributes of the resulting video (Fig. 1)."); aligning with a guiding text prompt, which is provided by the user, shows receiving a request for generating a video, and the attributes described in the text prompt make the video of a particular type; obtaining model parameter values for a domain-specific denoising diffusion model, (Gandelsman, paragraph 25 teaches "fine-tuning the pre-trained diffusion model to generate a tuned diffusion model, and generating images similar to the user-provided image in terms of content, composition, and style using the tuned diffusion model. The tuned diffusion model can perform various edits base within a single framework and generate variations of the input image" and paragraph 85 teaches "the parameters of the pre-trained diffusion model are fine-tuned"); the fine-tuned model is the domain-specific denoising diffusion model since it is tuned for the specific type of image/frame of video, and the parameters being fine-tuned shows obtaining the corresponding model parameter values as well; wherein the domain-specific denoising diffusion model has been trained on second training data comprising a plurality of domain-specific training videos of the particular type (Gandelsman, paragraph 75 teaches "the tuned diffusion model takes a noise map and a text embedding as input and generates one or more modified images that retain the identity or a target image while also incorporating elements described by the text or other guidance" and paragraph 112 teaches "obtain an image and a prompt for editing the image; fine-tuning a pre-trained diffusion model based on the image to obtain a tuned diffusion model"); generating modified images that retain identity, and fine-tuning based on the image/data provided, shows training the model on second training data, which when viewed in combination would be of video frames and a plurality of specific types of videos as shown in Molad on fig. 9 of page 8; receiving a conditioning input (Gandelsman, paragraph 25 teaches "optimization-based method including embedding the input image in a latent text embedding space of a pre-trained diffusion model"); embedding the input image in a latent text embedding space of the model shows receiving a conditioning input (the embedded input image); initializing a current intermediate representation of the domain-specific output video (Gandelsman, paragraph 38 teaches "initializes a noise map, and computes a loss function by comparing each of the set of intermediate images to the image"); comparing each of the set of intermediate images/representations means there must first be an initialized current intermediate representation of the domain-specific / specific type of output video; at each of a plurality of iterations, updating the current intermediate representation, (Gandelsman, paragraph 49 teaches "Diffusion models work by iteratively adding noise to the data during a forward process and then learning to recover the data by denoising the data during a reverse…gradually add noise to the original image 405 to obtain noisy images 420 at various noise levels"); this shows multiple iterations and updating of the intermediate image/representation at each iteration, since paragraph 84 mentions intermediate images corresponding to the set of noise maps at different noise levels; the updating comprising: generating a first noise output by processing a first input comprising the current intermediate representation using the pre-trained denoising diffusion model conditioned on the conditioning input (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model. In some cases, the operations of this step refer to, or may be performed by, a pre-trained diffusion model as described with reference to FIG. 2"); the noise map at a noise level shows the first noise output, and operations performed by the pre-trained diffusion model show this is done by processing an input (including the first input) comprising the intermediate image/representation using the pre-trained denoising diffusion model conditioned on the conditioning input; generating a second noise output by processing a second input comprising the current intermediate representation using the domain-specific denoising diffusion model conditioned on the conditioning input (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model." and paragraph 49 teaches "add noise to the original image 405 to obtain noisy images 420 at various noise levels"); using the diffusion model includes the specific-type/domain-specific model, and the noisy image at various levels means a second noise output, which is done by processing a second intermediate image/representation from the set of intermediate images; generating a combined noise output by combining at least (i) the first noise output and (ii) the second noise output (Gandelsman, paragraph 49 teaches "iteratively adding noise to the data during a forward process"); iteratively adding noise would mean the two noise outputs being combined; and updating the current intermediate representation using the combined noise output generated for the iteration (Gandelsman, paragraph 22 teaches "iteratively adding noise to the data...an output image is created from each of the various noise levels"); the output image from each noise level shows updating the current intermediate image/representation, and since it is done by iteratively adding noise, it uses the combined noise output generated for the iteration. The same motivations used in claim 1 apply here in claim 10.
Regarding claim 16, method claim 16 recites limitations similar to those of method claim 6, and thus is rejected under a similar rationale.
Regarding claim 17, method claim 17 recites limitations similar to those of method claim 7, and thus is rejected under a similar rationale.
Regarding claim 18, method claim 18 recites limitations similar to those of method claim 8, and thus is rejected under a similar rationale.
Regarding claim 19, system claim 19 recites limitations similar to those of method claim 1, and thus is rejected under a similar rationale. In addition, Gandelsman, fig. 2 teaches a system 200 with a computer/processing unit 205 and a memory/storage device 210 that stores instructions to be executed by the processor. The same motivations used in claim 1 apply here in claim 19.
Regarding claim 20, computer-readable storage media claim 20 recites similar limitations as method claim 1, and thus is rejected under similar rationale. In addition, Gandelsman, paragraph 5 teaches “computer readable medium for image generation” and paragraph 7 teaches “one or more memories including instructions executable by the one or more processors”. The same motivations used in claim 1 apply here in claim 20.
Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Molad and Gandelsman as applied to claims 1 and 10 above, and further in view of Li et al. (U.S. Patent Application Publication No. 2021/0124881), hereinafter referenced as Li.
Regarding claim 3, the combination of Molad and Gandelsman fails to teach wherein the first denoising diffusion model has a greater number of model parameters than the second denoising diffusion model.
However, Li teaches wherein the first denoising diffusion model has a greater number of model parameters than the second denoising diffusion model (Li, paragraph 47 teaches "model that has been trained and has the number of model parameters which is greater than that of the current intermediate teacher model under training."); this shows the first/trained model has a greater number of parameters than the second/under-training/domain-specific model. Li is considered to be analogous art because it is reasonably pertinent to the problem faced by the inventor of a trained model having more parameters. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Molad and Gandelsman with the number-of-parameters techniques of Li to efficiently compress complex neural network models, reducing model storage overhead and improving model inference speed while minimizing the damage to translation quality (Li, paragraph 4). This would be done by having fewer parameters in the second model.
Regarding claim 13, method claim 13 recites similar limitations as method claim 3, and thus is rejected under similar rationale.
Claims 4-5, 11-12 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Molad and Gandelsman as applied to claims 1 and 10 above, and further in view of Chen et al. (U.S. Patent Application Publication No. 2023/0252300), hereinafter referenced as Chen.
Regarding claim 4, the combination of Molad and Gandelsman fails to teach wherein the first denoising diffusion model has been trained on a greater number of training videos than the second denoising diffusion model.
However, Chen teaches wherein the first denoising diffusion model has been trained on a greater number of training videos than the second denoising diffusion model (Chen, fig. 8 teaches the source model (offline trained) on the left being trained with large offline data, while the right side shows repeated model refinement of the target model with small data, e.g., a few video frames); the source model is trained, thus acts as the pre-trained/first model, and is trained with more training videos than the target model, which is being refined and thus acts as the second/domain-specific/tuned model. Chen is considered to be analogous art because it is reasonably pertinent to the problem faced by the inventor of training models on video content to generate video content. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Molad and Gandelsman with the training video techniques of Chen to improve the coding efficiency of the neural networks utilized for a video coding system (Chen, paragraph 63) and to reduce the network complexity such that the network may better overfit the small amount of data within each refinement period (Chen, paragraph 89). This would be done by having the two models and tuning one with the particular type of videos to ensure the other model can be reused.
Regarding claim 5, the combination of Molad, Gandelsman and Chen teaches wherein the second denoising diffusion model has been trained on training videos belonging to a particular type (Chen, paragraph 65 teaches "the offline trained models are refined by online training, e.g., online trained for adapting to specific video content"); the refined/second/tuned model being online trained for adapting to specific video content shows it is trained on videos belonging to a particular type; and the first denoising diffusion model has been trained on more than one type of training videos (Chen, fig. 8, left side shows source models (offline trained) being trained with large offline data, e.g., videos); "videos" in plural form from large data means more than one type of training video was used to train the first/pre-trained model. The same motivations used in claim 4 apply here in claim 5.
Regarding claim 11, the combination of Molad, Gandelsman and Chen teaches further comprising: receiving a request for generating a video of a second particular type (Molad, page 8, fig. 9, last row of images shows a prompt "A streching caterpillar crawling on soft leaves"); this is a user prompt which relates to a second request for a video of a particular type (a caterpillar video); obtaining model parameter values for a second domain-specific denoising diffusion model (Chen, fig. 8, right side teaches target models (online refinement a second time) in the repeated model refinement process, and Gandelsman, paragraph 85 teaches "the parameters of the pre-trained diffusion model are fine-tuned"); another refinement means a second tuned/denoising diffusion model, and the parameters being fine-tuned shows that corresponding model parameters are obtained for it as well; wherein the domain-specific denoising diffusion model has been trained on third training data comprising a plurality of domain-specific training videos of the second particular type (Chen, fig. 8, right side teaches another set of small data, e.g., a few video frames); this shows the tuned, domain-specific denoising diffusion model associated with such is also trained on third data which would have domain-specific videos of a particular type; initializing the current intermediate representation of the domain-specific output video (Gandelsman, paragraph 38 teaches "initializes a noise map, and computes a loss function by comparing each of the set of intermediate images to the image"); comparing each of the set of intermediate images/representations means there must first be an initialized current intermediate representation of the domain-specific / specific type of output video; at each of a plurality of iterations, updating the current intermediate representation based on a third noise output generated using the pre-trained denoising diffusion model and a fourth noise output generated using the second domain-specific denoising diffusion model (Gandelsman, paragraph 84 teaches "generates a set of intermediate images corresponding to the set of noise maps at different noise levels using the diffusion model. In some cases, the operations of this step refer to, or may be performed by, a pre-trained diffusion model as described with reference to FIG. 2", paragraph 49 teaches "add noise to the original image 405 to obtain noisy images 420 at various noise levels", and paragraph 22 teaches "iteratively adding noise to the data...an output image is created from each of the various noise levels"); iteratively adding noise shows updating the current intermediate representation at each iteration; since the noise maps at different levels may be generated by the pre-trained diffusion model, the third noise output is generated by that model, and since using the diffusion model includes the specific type / domain-specific model, the noisy image at various levels means a fourth noise output generated by that model as well, the iterations thus constituting the plurality of iterations updating the current intermediate representation based on such; and generating a second output domain-specific video from the current intermediate representation after the final iteration (Molad, abstract teaches "first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use our general video editor to animate it. As a further application, we can use our method for subject-driven video generation" and page 8, fig. 9, last row of images shows a second output of a domain-specific video); this shows outputting the video, which is still of the specific subject/type and would include the final current intermediate representation/image since that is the final image after the final iteration; also, the second output of the domain-specific video shown would be from the current intermediate representation after the final iteration of Gandelsman when viewed in combination. The same motivations used in claim 4 apply here in claim 11.
Regarding claim 12, the combination of Molad, Gandelsman and Chen teaches wherein obtaining the model parameter values for the domain-specific denoising diffusion model comprises: obtaining the second training data comprising the plurality of domain-specific training videos of the particular type (Chen, paragraph 10 teaches "bitstream including the encoded video information and online trained parameters, and obtaining the decoded video information by decoding with the bitstream", fig. 8 teaches small data, and paragraph 65 teaches "online trained for adapting to specific video content"); encoding the trained parameters shows this is for obtaining parameters, and the small data (second training data) is for the domain-specific/tuned denoising diffusion model and comprises a few video frames, thus videos of a specific content/type (domain-specific videos of the particular type); and training the domain-specific denoising diffusion model on the second training data (Chen, fig. 8, right side shows repeated model refinement using the small data); this shows the small/second data is used to train the tuned/domain-specific denoising diffusion model. The same motivations used in claim 4 apply here in claim 12.
Regarding claim 14, method claim 14 recites similar limitations as method claim 4, and thus is rejected under similar rationale.
Regarding claim 15, the combination of Molad, Gandelsman and Chen teaches wherein the pre-trained denoising diffusion model has been trained on more than one type of training videos (Chen, fig. 8, left side shows source models (offline trained) being trained with large offline data, e.g., videos); "videos" in plural form from large data means more than one type of training video was used to train the first/pre-trained model. The same motivations used in claim 4 apply here in claim 15.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Li et al. (U.S. Patent Application Publication No. 2021/0264106), Abstract, teaches two models trained on two datasets and text-to-content generation from such.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NAUMAN U AHMAD whose telephone number is (703)756-5306. The examiner can normally be reached Monday - Friday 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung can be reached at (571) 272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/N.U.A./Examiner, Art Unit 2611
/KEE M TUNG/Supervisory Patent Examiner, Art Unit 2611