Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1 – 20 are pending in this application. Claims 1, 9 and 17 are independent.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 – 6, 9 – 12, 14 and 17 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Saharia, Chitwan (US-20230377226-A1, hereinafter referred to as Chitwan).
Regarding independent claim 1, Chitwan teaches:
A computer-implemented method (e.g., FIG. 1B of Chitwan) comprising: generating, utilizing one or more encoder neural networks (e.g., encoder neural network 110 (FIG. 1A) of Chitwan), a sequence of embeddings (e.g., contextual embeddings 104 (FIG. 1A) of Chitwan) comprising a prompt embedding representing a text prompt (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) and an object text embedding (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) representing a phrase indicating an object in the text prompt (See at least Chitwan, ¶ [0065]; FIGS. 1 – 3; "…text encoder neural network 110 is configured to process the text prompt 102 to generate a set of contextual embeddings of the text prompt 102…The contextual embeddings 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100…"); generating, utilizing the one or more encoder neural networks (e.g., neural network image encoder of Chitwan), a visual embedding (e.g., visual embedding of Chitwan) representing an object image corresponding to the object (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…"); determining a modified sequence of embeddings (e.g., sequence 121 (FIGS. 1A, 2A, 3A, & 6A) of Chitwan) by replacing the object text embedding with the visual embedding in the sequence of embeddings (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct (or replace) contextual embeddings 104 once encoded into visual embeddings…"); and generating, utilizing a generative neural network (e.g., generative neural networks (GNNs) 120 (FIG. 1A) of Chitwan), a synthetic digital image from the modified sequence of embeddings comprising the visual embedding (See at least Chitwan, ¶ [0027, 0028]; FIGS. 
1 – 3; "…each GNN in the sequence can be independently optimized by the training engine to impart certain properties to the GNN, e.g., particular output resolutions, fidelity, perceptual quality, efficient decoding (or denoising), fast sampling, reduced artifacts, etc…", "…To provide high fidelity text-to-image synthesis (or a synthetic digital image) with a high degree of text-image alignment, the system can use a pre-trained text encoder neural network to process a text prompt and generate a set (or sequence) of contextual embeddings of the text prompt…The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and images generated at inference…").
Chitwan teaches the subject matter of the claimed inventive concept as expressed in the rejections above. However, these teachings appear in separate embodiments of Chitwan.
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chitwan's separate embodiments. Doing so would ensure that a high-resolution image can be generated without requiring a single neural network to generate the image at the desired output resolution directly, because cascading GNNs significantly improves their sample quality and compensates for any artifacts generated at lower resolutions, e.g., distortions, checkerboard artifacts, etc., as discussed in Chitwan (see ¶ [0026]). The combination thereby achieves the predictable result of improving the overall efficiency and speed of the system, with a reasonable expectation of success, while enabling others skilled in the art to best utilize the invention along with various implementations and modifications as are suited to the particular use contemplated.
Regarding independent claim 9, Chitwan teaches:
A system (e.g., FIG. 1A of Chitwan) comprising: one or more memory devices; and one or more processors configured to cause the system to: generate, utilizing one or more encoder neural networks (e.g., encoder neural network 110 (FIG. 1A) of Chitwan), a sequence of embeddings (e.g., contextual embeddings 104 (FIG. 1A) of Chitwan) comprising a prompt embedding (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) representing a first phrase indicating a first object in a text prompt (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) and a second object text embedding representing a second phrase indicating a second object in the text prompt (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) (See at least Chitwan, ¶ [0065]; FIGS. 1 – 3; "…text encoder neural network 110 is configured to process the text prompt 102 to generate a set of contextual embeddings of the text prompt 102…The contextual embeddings 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100…"); generate, utilizing the one or more encoder neural networks (e.g., neural network image encoder of Chitwan), a first visual embedding (e.g., visual embedding of Chitwan) representing a first object image corresponding to the first object (e.g., TEXT prompt 102: “A dragon fruit.” (FIG. 1A) of Chitwan) and a second visual embedding (e.g., visual embedding of Chitwan) representing a second object image corresponding to the second object (e.g., TEXT prompt 102: “karate belt in the snow.” (FIG. 1A) of Chitwan) (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…"); determine a modified sequence of embeddings (e.g., sequence 121 (FIGS. 1A, 2A, 3A, & 6A) of Chitwan) by replacing, in the sequence of embeddings, the first object text embedding with the first visual embedding and the second object text embedding with the second visual embedding (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct (or replace) contextual embeddings 104 once encoded into visual embeddings…"); and generate, utilizing a generative neural network (e.g., generative neural networks (GNNs) 120 (FIG. 
1A) of Chitwan), a synthetic digital image from the modified sequence of embeddings comprising the first visual embedding and the second visual embedding (See at least Chitwan, ¶ [0027, 0028]; FIGS. 1 – 3; "…each GNN in the sequence can be independently optimized by the training engine to impart certain properties to the GNN, e.g., particular output resolutions, fidelity, perceptual quality, efficient decoding (or denoising), fast sampling, reduced artifacts, etc…", "…To provide high fidelity text-to-image synthesis (or a synthetic digital image) with a high degree of text-image alignment, the system can use a pre-trained text encoder neural network to process a text prompt and generate a set (or sequence) of contextual embeddings of the text prompt…The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and images generated at inference…").
Chitwan teaches the subject matter of the claimed inventive concept as expressed in the rejections above. However, these teachings appear in separate embodiments of Chitwan.
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chitwan's separate embodiments. Doing so would ensure that a high-resolution image can be generated without requiring a single neural network to generate the image at the desired output resolution directly, because cascading GNNs significantly improves their sample quality and compensates for any artifacts generated at lower resolutions, e.g., distortions, checkerboard artifacts, etc., as discussed in Chitwan (see ¶ [0026]). The combination thereby achieves the predictable result of improving the overall efficiency and speed of the system, with a reasonable expectation of success, while enabling others skilled in the art to best utilize the invention along with various implementations and modifications as are suited to the particular use contemplated.
Regarding independent claim 17, Chitwan teaches:
A non-transitory computer readable medium (e.g., machine-readable storage device of Chitwan) storing executable instructions which, when executed by at least one processing device, cause the at least one processing device to perform operations comprising: determining, by parsing a text prompt for generating or modifying a digital image (e.g., processing the input text prompt using a text encoder neural network to generate a final output image that depicts a scene that is described by the input text prompt of Chitwan), a plurality of phrases corresponding to a plurality of objects (See at least Chitwan, ¶ [0029]; FIGS. 1 – 3; "…The system can process the contextual embeddings using the sequence of GNNs to generate a final output image depicting the scene that is described by the text prompt…"); generating, utilizing one or more encoder neural networks (e.g., encoder neural network 110 (FIG. 1A) of Chitwan), a sequence of embeddings (e.g., contextual embeddings 104 (FIG. 1A) of Chitwan) comprising a prompt embedding representing a text prompt (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) and an object text embedding (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan) representing a phrase indicating an object in the text prompt (See at least Chitwan, ¶ [0065]; FIGS. 1 – 3; "…text encoder neural network 110 is configured to process the text prompt 102 to generate a set of contextual embeddings of the text prompt 102…The contextual embeddings 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100…"); generating, utilizing the one or more encoder neural networks (e.g., neural network image encoder of Chitwan), a visual embedding (e.g., visual embedding of Chitwan) representing an object image corresponding to the object (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…"); determining a modified sequence of embeddings (e.g., sequence 121 (FIGS. 1A, 2A, 3A, & 6A) of Chitwan) by replacing the object text embedding with the visual embedding in the sequence of embeddings (See at least Chitwan, ¶ [0069]; FIGS. 
1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct (or replace) contextual embeddings 104 once encoded into visual embeddings…"); and generating, utilizing a generative neural network (e.g., generative neural networks (GNNs) 120 (FIG. 1A) of Chitwan), a synthetic digital image from the modified sequence of embeddings comprising the visual embedding (See at least Chitwan, ¶ [0027, 0028]; FIGS. 1 – 3; "…each GNN in the sequence can be independently optimized by the training engine to impart certain properties to the GNN, e.g., particular output resolutions, fidelity, perceptual quality, efficient decoding (or denoising), fast sampling, reduced artifacts, etc…", "…To provide high fidelity text-to-image synthesis (or a synthetic digital image) with a high degree of text-image alignment, the system can use a pre-trained text encoder neural network to process a text prompt and generate a set (or sequence) of contextual embeddings of the text prompt…The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and images generated at inference…").
Chitwan teaches the subject matter of the claimed inventive concept as expressed in the rejections above. However, these teachings appear in separate embodiments of Chitwan.
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Chitwan's separate embodiments. Doing so would ensure that a high-resolution image can be generated without requiring a single neural network to generate the image at the desired output resolution directly, because cascading GNNs significantly improves their sample quality and compensates for any artifacts generated at lower resolutions, e.g., distortions, checkerboard artifacts, etc., as discussed in Chitwan (see ¶ [0026]). The combination thereby achieves the predictable result of improving the overall efficiency and speed of the system, with a reasonable expectation of success, while enabling others skilled in the art to best utilize the invention along with various implementations and modifications as are suited to the particular use contemplated.
Regarding dependent claim 2, Chitwan teaches:
wherein generating the sequence of embeddings comprises: determining, from the text prompt, a plurality of phrases indicating a plurality of objects in the text prompt (See at least Chitwan, ¶ [0130]; FIGS. 1 – 3; "…The initial generative neural network processes the contextual embeddings to generate, as output, an initial output image having an initial resolution (234)…"); generating, based on the plurality of phrases, the object text embedding representing the phrase indicating the object (See at least Chitwan, ¶ [0130]; FIGS. 1 – 3; "…The initial generative neural network processes the contextual embeddings to generate, as output, an initial output image having an initial resolution (234)…"); and generating, based on the plurality of phrases (e.g., TEXT prompt 102: “A dragon fruit wearing karate belt in the snow.” (FIG. 1A) of Chitwan), an additional object text embedding representing an additional phrase indicating an additional object (e.g., See final image 108 (FIG. 1A) of Chitwan) (See at least Chitwan, ¶ [0070]; FIGS. 1 – 3; "…The final image 108 depicts the scene described by the text prompt 102 and is output by the system 100 with a final resolution custom-character. For example, as shown in FIG. 1A, the final image 108 depicts a dragon fruit wearing a karate belt in the snow. Accordingly, the final image 108 is accurately captioned by the corresponding text prompt 102 in FIG. 1A…").
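As an illustrative aside (a hypothetical sketch only; the keyword heuristic and the assumed phrase vocabulary are not Applicant's or Chitwan's parsing method), determining a plurality of phrases indicating objects in a text prompt can be pictured as follows.

    # Hypothetical illustration of determining object phrases from a text prompt;
    # the phrase vocabulary and substring heuristic are assumptions for illustration only.
    PROMPT = "A dragon fruit wearing karate belt in the snow."
    OBJECT_PHRASES = ["dragon fruit", "karate belt", "snow"]  # assumed object vocabulary

    def determine_object_phrases(prompt, vocabulary):
        # Return the vocabulary phrases that appear in the prompt (toy "parsing").
        lowered = prompt.lower()
        return [phrase for phrase in vocabulary if phrase in lowered]

    phrases = determine_object_phrases(PROMPT, OBJECT_PHRASES)
    # Each phrase would then be encoded into its own object text embedding.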
Regarding dependent claim 3, Chitwan teaches:
wherein generating the visual embedding comprises generating the object image comprising an example object (e.g., image sample of Chitwan) based on the phrase indicating the object (See at least Chitwan, ¶ [0086, 0089]; FIGS. 1 – 3; "…a GNN 120 can process a conditioning input c and sample a latent from the prior distribution z ~ p_θ(z|c)…the GNN 120 can sample an image from the conditional distribution... use an algorithm to choose from multiple samples of images…").
Regarding dependent claim 4, Chitwan teaches:
generating an additional object image comprising an additional example object based on an additional phrase indicating an additional object in the text prompt (See at least Chitwan, ¶ [0050]; FIGS. 1 – 3, 7; "…FIG. 7 shows various images generated from text prompts by an image generation system…"); and generating an additional visual embedding representing the additional object image corresponding to the additional object (See at least Chitwan, ¶ [0050, 0064, 0065]; FIGS. 1 – 3, 7; "…FIG. 7 shows various images generated from text prompts by an image generation system…", "…the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others…", "…The text encoder 110 is configured to process the text prompt 102 to generate a set of contextual embeddings (u) of the text prompt 102…").
Regarding dependent claim 5, Chitwan teaches:
wherein determining the modified sequence of embeddings comprises: determining a position of the object text embedding in the sequence of embeddings (See at least Chitwan, ¶ [0065, 0149]; FIGS. 1 – 3, 7; "…The text encoder 110 is configured to process the text prompt 102 to generate a set of contextual embeddings (u) of the text prompt 102…", "…One or more of the DBlocks 510 and UBlocks 520 can be conditioned on the contextual embeddings (u) 104 via an attention mechanism (e.g., cross-attention) using one or more self-attention layers. Alternatively or in addition, one or more of the DBlocks 510 and UBlocks 520 can be conditioned on the contextual embeddings 104…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); removing the object text embedding (e.g., by replacing text prompts 102 with noise inputs 114 of Chitwan) from the sequence of embeddings (See at least Chitwan, ¶ [0164]; FIGS. 1 – 3, 7; "…the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs 114…"); and inserting the visual embedding into the sequence of embeddings at the position (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3, 7; "…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct contextual embeddings 104 once encoded into visual embeddings…").
Regarding dependent claim 6, Chitwan teaches:
wherein: generating the sequence of embeddings comprises generating the object text embedding in a feature space (e.g., a scene or environment of Chitwan) (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3, 7; "…the post-processor 130 can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…"); and generating the visual embedding comprises generating the visual embedding in the feature space of the object text embedding (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3, 7; "…the post-processor 130 can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct contextual embeddings 104 once encoded into visual embeddings…").
Regarding dependent claim 10, Chitwan teaches:
generate the sequence of embeddings by generating, utilizing the one or more encoder neural networks (e.g., encoder neural network 110 (FIG. 1A) of Chitwan), a prompt embedding representing the text prompt in a feature space corresponding to the first object text embedding and the second object text embedding (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3, 7; "…the post-processor 130 can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct contextual embeddings 104 once encoded into visual embeddings…").
Regarding dependent claim 11, Chitwan teaches:
generate the sequence of embeddings by parsing the text prompt to: determine the first object and one or more visual attributes of the first object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); and determine the second object and one or more visual attributes of the second object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…").
Regarding dependent claim 12, Chitwan teaches:
determine, based on the first phrase, the first object image comprising a first example object including the one or more visual attributes of the first object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); and determine, based on the second phrase, the second object image comprising a second example object including the one or more visual attributes of the second object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…").
Regarding dependent claim 14, Chitwan teaches:
wherein the one or more processors are further configured to generate the synthetic digital image by providing the modified sequence of embeddings with a noised image embedding to the generative neural network (See at least Chitwan, ¶ [0164, 0167]; FIGS. 1 – 3, 7; "…the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs 114…", "…A training engine (e.g., the training engine 300 of FIG. 3A) can train the sequence 121 to generate output images R from noise in a similar manner as a text. Training involves slight modifications to the training regime outlined in FIG. 3A since the training set generally includes unlabeled images as opposed to labelled text-image pairs…").
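As an illustrative aside (a hypothetical sketch only; the single toy denoising step, dimensions, and weights are assumptions and do not represent Chitwan's GNN 120 or Applicant's implementation), providing a noised image embedding together with the modified sequence of embeddings to a generative network can be pictured as follows.

    # Hypothetical sketch of conditioning a generator on a noised image embedding
    # together with the modified embedding sequence; all values are illustrative stand-ins.
    import numpy as np

    rng = np.random.default_rng(1)
    DIM = 8  # toy embedding dimension

    modified_sequence = [rng.standard_normal(DIM) for _ in range(9)]  # stands in for the modified sequence
    clean_image_embedding = rng.standard_normal(DIM)
    noised_image_embedding = 0.7 * clean_image_embedding + 0.3 * rng.standard_normal(DIM)  # toy noising

    def generator_step(noised_embedding, conditioning_sequence):
        # Toy "denoising" step: nudge the noised embedding toward the conditioning mean.
        context = np.mean(conditioning_sequence, axis=0)
        return 0.5 * noised_embedding + 0.5 * context

    synthetic_image_embedding = generator_step(noised_image_embedding, modified_sequence)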
Regarding dependent claim 18, Chitwan teaches:
wherein: parsing the text prompt comprises: determining a first phrase corresponding to a first object of the digital image and one or more visual attributes of the first object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); and determining a second phrase corresponding to a second object of the digital image and one or more visual attributes of the second object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); and generating the sequence of embeddings comprises generating a first object text embedding representing the first phrase and a second object text embedding representing the second phrase (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…").
Regarding dependent claim 19, Chitwan teaches:
wherein generating the plurality of visual embeddings comprises: determining a first object image comprising a first example object including the one or more visual attributes of the first object (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); determining a second object image comprising a second example object including the one or more visual attributes of the second object; and generating, utilizing the one or more encoder neural networks, a first visual embedding representing the first object image and a second visual embedding representing the second object image (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…").
Regarding dependent claim 20, Chitwan teaches:
wherein determining the modified sequence of embeddings comprises: determining locations corresponding to the plurality of object text embeddings in the sequence of embeddings (See at least Chitwan, ¶ [0060, 0149]; FIGS. 1 – 3, 7; "…the text prompt can describe a mood that the scene should evoke, e.g., “happiness is a sunny day”, or “fear of the unknown”. In general, text prompts can include any text, whether it is descriptive of visual attributes or not…", "…one or more of the DBlocks 510 and UBlocks 520 can condition on other visual features that are expected in the output image, e.g., relating to specific colors or textural properties, or locations of objects, all of which can be obtained by the training engine 300 from the training images…"); and replacing the plurality of object text embeddings with the plurality of visual embeddings at the locations in the sequence of embeddings (See at least Chitwan, ¶ [0069]; FIGS. 1 – 3; "…In some implementations…The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), and/or an image encoder to perform such classification and can determine if the output image 106 accurately depicts the scene described by the text prompt 102 by encoding the output image 106 into a set of visual embeddings and comparing it with the contextual embeddings 104…the sequence 121 may be trained (e.g., by a training engine) to generate output images 106 from text-based training sets (as opposed to only labelled text-image training sets) by generating output images 106 that faithfully reconstruct (or replace) contextual embeddings 104 once encoded into visual embeddings…").
Allowable Subject Matter
Dependent claims 7, 13, and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Claims 8 and 16 are likewise objected to as allowable by virtue of their dependency from claims 7 and 15, respectively.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure. See the Notice of References Cited (PTO-892).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IDOWU O OSIFADE, whose telephone number is (571) 272-0864. The Examiner can normally be reached Monday through Friday, 8:00 am to 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the Examiner's Supervisor, ANDREW MOYER, can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov.
Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at (866) 217 – 9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call (800) 786 – 9199 (IN USA OR CANADA) or (571) 272 – 1000.
/IDOWU O OSIFADE/Primary Examiner, Art Unit 2675