Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/05/2024, 05/16/2024, 09/19/2024, and 05/14/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 21-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Miranda (The Illustrated VQGAN, hereinafter Miranda).
Examiner notes that Miranda references a Google Colab Notebook (VQGAN-CLIP); though examiner does not directly cite the Notebook, examiner is reading Miranda in light of the content and code of the Notebook for the current 102 rejection and the later 103 rejection.
Regarding Claim 21, Miranda discloses
A computer-implemented method to perform text-to-image generation, the method comprising: obtaining, by a computing system comprising one or more computing devices, a natural language input descriptive of desired image content (
[media_image1.png: 571 × 653 greyscale screenshot reproduced from Miranda]
);
processing, by the computing system, the natural language input with a text encoder portion of a machine-learned code prediction model to generate a text embedding (P. 1 Para. 2: “In essence, the way they work is that VQGAN generates the images, while CLIP judges how well an image matches our text prompt.”, the CLIP model judges the similarity between text and image based on the embeddings of both; as such, creating a text embedding is an inherent step);
processing, by the computing system, the text embedding with an autoregressive code selection portion of the machine-learned code prediction model to autoregressively predict a sequence of predicted codes from a quantization codebook that contains a plurality of candidate codes (
[media_image2.png: 575 × 648 greyscale screenshot reproduced from Miranda]
);
processing, by the computing system, the sequence of quantized codes with a machine-learned image decoder to generate a plurality of synthesized image patches that form a synthesized image (
[media_image3.png: 192 × 665 greyscale screenshot reproduced from Miranda]
);
wherein the synthesized image depicts the desired image content (
[media_image1.png: 571 × 653 greyscale screenshot reproduced from Miranda]
).
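The mapping above reads Miranda on CLIP "judging" how well an image matches the text prompt. A minimal sketch of that judging step follows; the embedding vectors below are placeholders standing in for the outputs of CLIP's text and image encoders (they are illustrative assumptions, not values from Miranda or the Notebook).

```python
import numpy as np

def cosine_similarity(a, b):
    """Score how well an image embedding matches a text embedding,
    the comparison CLIP performs when judging a candidate image."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings standing in for CLIP encoder outputs.
text_embedding = np.array([0.2, 0.9, 0.1])
matching_image_embedding = np.array([0.25, 0.85, 0.12])   # depicts the prompt
mismatched_image_embedding = np.array([0.9, -0.1, 0.4])   # does not

# VQGAN-CLIP iteratively updates the generated image to raise this score.
assert cosine_similarity(text_embedding, matching_image_embedding) > \
       cosine_similarity(text_embedding, mismatched_image_embedding)
```

The comparison is only possible because both the text and the image have first been mapped into a shared embedding space, which is the basis for the inherency position taken above.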
Regarding Claim 22, dependent upon claim 21, Miranda discloses everything regarding claim 21. Miranda further discloses
one or more of the text encoder portion of the machine-learned code prediction model, the autoregressive code selection portion of the machine-learned code prediction model, and the machine-learned image decoder are configured to perform one or more self-attention operations (P. 15 Para. 1: “Lastly, we discussed how the codebook was trained together with the two models. We started with training the GAN, then followed with training the Transformer. We learned that the GAN is composed of an encoder-decoder network as its generator, and that the Transformer uses a sliding-attention window when sampling images:”).
Regarding Claim 23, dependent upon claim 21, Miranda discloses everything regarding claim 21. Miranda further discloses
one or more of the text encoder portion of the machine-learned code prediction model, the autoregressive code selection portion of the machine-learned code prediction model, and the machine-learned image decoder comprise transformer neural networks (
[media_image4.png: 426 × 711 greyscale screenshot reproduced from Miranda]
).
Regarding Claim 24, dependent upon claim 21, Miranda discloses everything regarding claim 21. Miranda further discloses
one or both of the machine-learned image decoder and the codebook were jointly learned with an image encoder model (P. 15 Para. 1: “Lastly, we discussed how the codebook was trained together with the two models. We started with training the GAN, then followed with training the Transformer. We learned that the GAN is composed of an encoder-decoder network as its generator, and that the Transformer uses a sliding-attention window when sampling images:”).
Regarding Claim 25, dependent upon claim 21, Miranda discloses everything regarding claim 21. Miranda further discloses
the text encoder portion of the machine-learned code prediction model was pre-trained on a pre-training task (P. 1 Para. 2: “In essence, the way they work is that VQGAN generates the images, while CLIP judges how well an image matches our text prompt.”, For the CLIP model to perform the judging, it requires the model to have been pre-trained on other tasks.).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 7, 9-12, 15 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao (Improved Transformer for High-Resolution GANs, hereinafter Zhao) in view of Esser (Taming Transformers for High-Resolution Image Synthesis, hereinafter Esser).
Regarding claim 1, Zhao discloses
A computer-implemented method to perform vector quantization of imagery, the method comprising:… processing, by the computing system, the plurality of input image patches with a machine-learned image encoder to generate a plurality of image tokens in a latent space, wherein the plurality of image tokens correspond to the plurality of input image patches, and wherein the machine-learned image encoder performs one or more self-attention operations to process the plurality of input image patches to generate the plurality of image tokens in the latent space (Figure 2: “The different stages of multi-axis self-attention for a [4, 4, C] input with the block size of b = 2. The input is first blocked into 2 x 2 non-overlapping [2, 2, C] patches”, P. 3-4: the input feature is first divided into non-overlapping blocks where each block can be considered as a local patch);
mapping, by the computing system, the plurality of image tokens to a plurality of quantized codes contained in a quantization codebook that contains a plurality of candidate codes (Figure 1, P. 4 Para. 1: “We enhance the local self-attention of Nested Transformer by the proposed multi-axis blocked self-attention that can produce a richer feature representation by explicitly considering local (within blocks) as well as global (across blocks) relations. We denote the overall architecture of these stages as multi-axis Nested Transformer.”; the process in Figure 1 goes through attention layers, which looks at the latent embedding for best match. Eventually, patches of images are generated and fused to form the final output image); and
providing, by the computing system, the plurality of quantized codes as an encoded version of the image (Figure 1 Output Image).
However Zhao does not explicitly disclose
obtaining, by a computing system comprising one or more computing devices, a plurality of input image patches of an image.
Esser teaches
obtaining, by a computing system comprising one or more computing devices, a plurality of input image patches of an image (Figure 2).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhao with the image encoding and other aspects of Esser, as both Zhao and Esser generate images patch by patch, and Zhao takes a latent input that can be created using Esser’s encoder.
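The quantization step mapped above (image tokens matched to the nearest entries of a codebook) can be sketched in a few lines; this is a generic vector-quantization illustration under simplified assumptions, not Zhao's or Esser's actual implementation.

```python
import numpy as np

def quantize(tokens, codebook):
    """Map each image token to its nearest codebook entry (the standard
    VQ step): return the chosen code indices and the quantized vectors."""
    # Squared Euclidean distance from every token to every candidate code.
    d = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)            # nearest code per token
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # candidate codes
tokens = np.array([[0.1, -0.1], [0.9, 1.2]])               # encoder outputs
idx, z_q = quantize(tokens, codebook)
assert list(idx) == [0, 1]   # each token snaps to its nearest code
```

The discrete indices `idx` are the "quantized codes" provided as the encoded version of the image; the vectors `z_q` are what a decoder would consume.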
Regarding Claim 2, dependent upon claim 1, Zhao in view of Esser teaches everything regarding claim 1.
Zhao further discloses
the machine-learned image encoder comprises a vision transformer model (Figure 1: Low-Resolution Stages: Multi-Axis Nested Transformer).
Regarding Claim 3, dependent upon claim 1, Zhao in view of Esser teaches everything regarding claim 1.
Zhao further discloses
the machine-learned image encoder performs one of the one or more self-attention operations on the plurality of input image patches (Figure 2: “The different stages of multi-axis self-attention for a [4, 4, C] input with the block size of b = 2. The input is first blocked into 2 x 2 non-overlapping [2, 2, C] patches”, P. 3-4: the input feature is first divided into non-overlapping blocks where each block can be considered as a local patch).
Regarding Claim 4, dependent upon claim 1, Zhao in view of Esser teaches everything regarding claim 1.
Zhao further discloses
processing, by the computing system, the plurality of quantized codes with a machine-learned image decoder to generate a plurality of synthesized image patches that form a synthesized image (Figure 1);
Esser further teaches
evaluating, by the computing system, a loss function that provides a loss value based at least in part on the synthesized image; and modifying, by the computing system, one or more of the machine-learned image encoder, the machine-learned image decoder, and the plurality of candidate codes based at least in part on the loss function (P. 4 Para. 2: “Backpropagation through the non-differentiable quantization operation in Eq. (3) is achieved by a straight-through gradient estimator, which simply copies the gradients from the decoder to the encoder [3], such that the model and codebook can be trained end-to-end via the loss function.”).
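Esser's quoted straight-through gradient estimator is conventionally written as z_q = z + sg(q(z) − z), where sg blocks gradients. A minimal numeric sketch follows; the rounding quantizer is a stand-in for the codebook lookup, and `stop_gradient` is the identity here because plain numpy has no autograd (both are illustrative assumptions).

```python
import numpy as np

def stop_gradient(x):
    # In an autograd framework this would block gradients (e.g. detach);
    # numerically it is the identity, which suffices for this sketch.
    return x

def round_code(z):
    """Stand-in non-differentiable quantizer (rounding replaces the
    nearest-codebook lookup)."""
    return np.round(z)

z = np.array([0.4, 1.7, 2.2])                 # encoder output
z_q = z + stop_gradient(round_code(z) - z)    # straight-through form

# Forward pass: z_q equals the quantized value exactly ...
assert np.allclose(z_q, np.array([0.0, 2.0, 2.0]))
# ... while d(z_q)/dz = 1 through the non-blocked term, so the decoder's
# gradient is copied unchanged onto the encoder, as the quote describes.
```

This is why the model and codebook can be trained end to end even though the quantization itself has zero gradient almost everywhere.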
Regarding Claim 5, dependent upon claim 4, Zhao in view of Esser teaches everything regarding claim 4.
Zhao further discloses
the machine-learned image decoder comprises a vision transformer model (Figure 1: Low-Resolution Stages: Multi-Axis Nested Transformer).
Regarding Claim 7, dependent upon claim 1, Zhao in view of Esser teaches everything regarding claim 1.
Zhao further discloses
after projecting the image tokens to the lower-dimensional space, mapping, by the computing system, the plurality of image tokens to the plurality of quantized codes contained in the quantization codebook (Figure 1: Latent Embedding, P. 4 Para. 1: “We enhance the local self-attention of Nested Transformer by the proposed multi-axis blocked self-attention that can produce a richer feature representation by explicitly considering local (within blocks) as well as global (across blocks) relations. We denote the overall architecture of these stages as multi-axis Nested Transformer.”, P. 5 Para. 4: “Formally, let X_l be the first-layer feature representation of the l-th stage. The input latent code z is first projected into a 2D spatial embedding Z with the resolution of H_Z × W_Z and dimension of C_Z by a linear function. X_l is then treated as the query and Z as the key and value. We compute their cross-attention following the update rule: X_l' = MHA(X_l, Z + P_Z), where MHA represents the standard mul[ti]-head self-attention, X_l' is the output, and P_Z is the learnable positional encoding having the same shape as Z. Note that Z is shared across all stages.”).
Esser further teaches
projecting, by the computing system, the plurality of image tokens to a lower-dimensional space (Figure 2: Codebook and the result of the encoder).
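The update rule quoted from Zhao, X_l' = MHA(X_l, Z + P_Z), can be illustrated with a single-head sketch; Zhao uses multi-head attention with learned projections, so the shapes and the absence of projection matrices here are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_l, Z, P_Z):
    """Single-head stand-in for MHA(X_l, Z + P_Z): X_l supplies the
    queries; Z plus its positional encoding P_Z supplies keys and values."""
    KV = Z + P_Z
    scores = X_l @ KV.T / np.sqrt(X_l.shape[-1])  # scaled dot-product
    return softmax(scores) @ KV                   # X_l'

rng = np.random.default_rng(0)
X_l = rng.standard_normal((4, 8))    # stage features (queries)
Z = rng.standard_normal((16, 8))     # projected latent code
P_Z = rng.standard_normal((16, 8))   # positional encoding, same shape as Z
X_out = cross_attention(X_l, Z, P_Z)
assert X_out.shape == X_l.shape      # the output keeps the query shape
```

Note how Z (plus P_Z) serves as key and value for every stage, consistent with the quote's remark that Z is shared across all stages.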
Regarding Claim 9, dependent upon claim 1, Zhao in view of Esser teaches everything regarding claim 1.
Zhao further discloses
autoregressively predicting, by the computing system using a machine-learned code prediction model, a plurality of predicted codes from the quantization codebook based at least in part on one or more of the plurality of quantized codes (Figure 1 Output of image patches);
processing, by the computing system, the plurality of predicted codes with a machine-learned image decoder to generate a plurality of synthesized image patches that form a synthesized image (Figure 1 Both of the Repeat x M).
Regarding Claim 10, dependent upon claim 9, Zhao in view of Esser teaches everything regarding claim 9.
Esser further teaches
evaluating, by the computing system, a code prediction loss function that evaluates a negative log-likelihood based on the plurality of predicted codes (
[media_image5.png: 60 × 322 greyscale equation image]
);
modifying, by the computing system, one or more parameters of the machine-learned code prediction model based on the code prediction loss function (P. 4 Section 3.2: “the transformer learns to predict the distribution of possible next indices,”).
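The negative log-likelihood objective mapped above (the transformer predicting the distribution of possible next codebook indices) can be sketched as follows; the logits and index values are illustrative, not taken from the reference.

```python
import numpy as np

def code_nll(logits, target_indices):
    """Negative log-likelihood of the ground-truth next codebook indices
    under the model's predicted distribution (a sketch of the
    autoregressive code-prediction loss)."""
    # Softmax over the codebook entries at each sequence position.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    picked = probs[np.arange(len(target_indices)), target_indices]
    return float(-np.log(picked).mean())

logits = np.array([[4.0, 0.0, 0.0],    # position 1: strongly predicts code 0
                   [0.0, 0.0, 4.0]])   # position 2: strongly predicts code 2
# Correct targets yield a lower loss than incorrect ones.
assert code_nll(logits, np.array([0, 2])) < code_nll(logits, np.array([1, 1]))
```

Minimizing this loss is what trains the code prediction model's parameters, as recited in the limitation.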
Regarding Claim 11, dependent upon claim 9, Zhao in view of Esser teaches everything regarding claim 9.
Esser further teaches
autoregressively predicting, by the computing system using the machine-learned code prediction model, the plurality of predicted codes comprises conditioning, by the computing system, the machine-learned code prediction model with auxiliary conditioning data descriptive of one or more desired characteristics of the synthesized image (P. 4 Conditioned Synthesis: “In many image synthesis tasks a user demands control over the generation process by providing additional information from which an example shall be synthesized. This information, which we will call c, could be a single label describing the overall image class or even another image itself.”).
Regarding Claim 12, dependent upon claim 11, Zhao in view of Esser teaches everything regarding claim 11.
Esser further teaches
the auxiliary conditioning data comprises a class label descriptive of a desired class of the synthesized image (P. 4 Conditioned Synthesis: “In many image synthesis tasks a user demands control over the generation process by providing additional information from which an example shall be synthesized. This information, which we will call c, could be a single label describing the overall image class or even another image itself.”).
Regarding Claim 15, dependent upon claim 11, Zhao in view of Esser teaches everything regarding claim 11.
Esser further teaches
extracting, by the computing system, one or more intermediate features from the machine-learned code prediction model; and predicting, by the computing system, a class label for the image based at least in part on the intermediate features (P. 4 Learning a Perceptually Rich Codebook: “More specifically, we replace the L2 loss used in [63] for L_rec by a perceptual loss and introduce an adversarial training procedure with a patch-based discriminator D [25] that aims to differentiate between real and reconstructed images”, the real-versus-reconstructed determination constitutes the classes).
Regarding claim 18, Zhao discloses
A computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store a machine-learned image processing model comprising:…a quantization portion configured to quantize the one or more image tokens into one or more quantized codes selected from a codebook (Figure 1, P. 4 Para. 1: “We enhance the local self-attention of Nested Transformer by the proposed multi-axis blocked self-attention that can produce a richer feature representation by explicitly considering local (within blocks) as well as global (across blocks) relations. We denote the overall architecture of these stages as multi-axis Nested Transformer.”; P. 5 Para. 4: “Formally, let X_l be the first-layer feature representation of the l-th stage. The input latent code z is first projected into a 2D spatial embedding Z with the resolution of H_Z × W_Z and dimension of C_Z by a linear function. X_l is then treated as the query and Z as the key and value. We compute their cross-attention following the update rule: X_l' = MHA(X_l, Z + P_Z), where MHA represents the standard mul[ti]-head self-attention, X_l' is the output, and P_Z is the learnable positional encoding having the same shape as Z. Note that Z is shared across all stages.”);
a code prediction portion configured to predict one or more predicted quantized codes from the codebook based at least in part on the one or more quantized codes (Figure 1 Both of the Repeat x M, P. 4 Para. 1: “We denote the overall architecture of these stages as multi-axis Nested Transformer.”; P. 5 Para. 4: “Formally, let X_l be the first-layer feature representation of the l-th stage. The input latent code z is first projected into a 2D spatial embedding Z with the resolution of H_Z × W_Z and dimension of C_Z by a linear function. X_l is then treated as the query and Z as the key and value. We compute their cross-attention following the update rule: X_l' = MHA(X_l, Z + P_Z), where MHA represents the standard mul[ti]-head self-attention, X_l' is the output, and P_Z is the learnable positional encoding having the same shape as Z. Note that Z is shared across all stages.”).
However Zhao does not explicitly disclose
an encoder portion configured to encode one or more input image patches into one or more image tokens in a latent space; a discriminative prediction portion configured to generate one or more discriminative predictions for the input image patches based at least in part on data extracted from the code prediction portion.
Esser teaches
an encoder portion configured to encode one or more input image patches into one or more image tokens in a latent space (Figure 2 CNN Encoder);
a discriminative prediction portion configured to generate one or more discriminative predictions for the input image patches based at least in part on data extracted from the code prediction portion (Figure 2 CNN Discriminator).
Regarding Claim 19, dependent upon claim 18, Zhao in view of Esser teaches everything regarding claim 18.
Zhao further discloses
a decoder portion configured to generate reconstructed image patches based on the one or more quantized codes or to generate synthetic image patches based at least in part on the one or more predicted quantized codes (Figure 1).
Regarding Claim 20, dependent upon claim 18, Zhao in view of Esser teaches everything regarding claim 18.
Esser further teaches
the one or more discriminative predictions comprise image classification predictions (Figure 2 CNN Discriminator predicting real or fake patch by patch).
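Esser's patch-based discriminator, as mapped above, emits a grid of real/fake predictions, one per patch, rather than a single image-level decision. A minimal sketch follows; the linear per-patch scoring is a placeholder standing in for Esser's convolutional discriminator, and the parameter values are illustrative assumptions.

```python
import numpy as np

def patch_discriminator(image, patch=4, w=1.0, b=-0.5):
    """Per-patch real/fake logits: split the image into non-overlapping
    patches and score each patch independently (placeholder linear
    scoring in place of a convolutional PatchGAN discriminator)."""
    H, W = image.shape
    gh, gw = H // patch, W // patch
    grid = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    patch_means = grid.mean(axis=(1, 3))   # one statistic per patch
    return w * patch_means + b             # (gh, gw) grid of logits

img = np.ones((8, 8))                      # toy 8x8 "image"
logits = patch_discriminator(img)
assert logits.shape == (2, 2)              # one prediction per patch
```

Each entry of the output grid is a classification prediction for its patch, which is the per-patch real/fake reading applied to claim 20.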
Claims 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao (Improved Transformer for High-Resolution GANs, hereinafter Zhao) in view of Esser (Taming Transformers for High-Resolution Image Synthesis, hereinafter Esser) and Miranda (The Illustrated VQGAN, hereinafter Miranda).
Regarding Claim 13, dependent upon claim 11, Zhao in view of Esser teaches everything regarding claim 11.
However Zhao in view of Esser does not explicitly teach
the auxiliary conditioning data comprises natural language text tokens.
Miranda teaches
the auxiliary conditioning data comprises natural language text tokens (
[media_image1.png: 571 × 653 greyscale screenshot reproduced from Miranda]
).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhao in view of Esser to use text as the condition for image generation and other aspects of Miranda, as Miranda provides the Colab Notebook and an explanation of VQGAN+CLIP, and the VQGAN described by Miranda is Esser’s model.
Regarding Claim 14, dependent upon claim 13, Zhao in view of Esser and Miranda teaches everything regarding claim 13.
Miranda further teaches
processing, by the computing system, the natural language text tokens with a text encoder portion of the machine-learned code prediction model to generate a text embedding (P. 1 Para. 2: “In essence, the way they work is that VQGAN generates the images, while CLIP judges how well an image matches our text prompt.”, CLIP model judges the similarity between text and image based on the embedding of both. As such, creating a text embedding is an inherent step); and
providing, by the computing system, the text embedding as an input to an autoregressive code selection portion of the machine-learned code prediction model to autoregressively predict the plurality of predicted codes (
[media_image2.png: 575 × 648 greyscale screenshot reproduced from Miranda]
).
Allowable Subject Matter
Claims 26-30 are allowed.
Claims 6 and 8 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Relevant Prior Art Directed to State of Art
WANG et al. (US 2024/0112341 A1, hereinafter Wang) is prior art not applied in the rejection(s) above. Wang discloses that obtaining a synthetic histochemically stained image from a multiplexed immunofluorescence (MPX) image may include producing an N-channel input image based on information from each of M channels of an MPX image of a tissue section, where M and N are positive integers and N is less than or equal to M, and generating a synthetic image by processing the N-channel input image using a generator network trained on a training data set that includes a plurality of pairs of images.
Radford et al. (Learning Transferable Visual Models From Natural Language Supervision, hereinafter Radford) is prior art not applied in the rejection(s) above. Radford discloses the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks.
Isola et al. (Image-to-Image Translation with Conditional Adversarial Networks, hereinafter Isola) is prior art not applied in the rejection(s) above. Isola discloses a method for synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images and other tasks.
Ding et al. (CogView: Mastering Text-to-Image Generation via Transformers, hereinafter Ding) is prior art not applied in the rejection(s) above. Ding discloses a 4-billion-parameter Transformer with a VQ-VAE tokenizer for text-to-image generation.
Ramesh et al. (Zero-Shot Text-to-Image Generation, hereinafter Ramesh) is prior art not applied in the rejection(s) above. Ramesh discloses a simple approach for text-to-image generation based on a transformer that autoregressively models the text and image tokens as a single stream of data.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSHUA CHEN whose telephone number is (703)756-5394. The examiner can normally be reached M-Th 9:30 am - 4:30 pm ET and F 9:30 am - 2:30 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, STEPHEN R KOZIOL can be reached at (408)918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J. C./Examiner, Art Unit 2665
/Stephen R Koziol/Supervisory Patent Examiner, Art Unit 2665