Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority based on an application filed in Singapore on Jul. 18, 2023.
Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 6/26/24 and 2/6/26 are being considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2 and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al. (US 2024/0404243 A1) in view of Liu et al. (US 2022/0300764 A1).
As to Claim 1, Zhao teaches A method comprising:
generating a plurality of sample images using a trained generative model (Zhao discloses “In ANNs, a hidden (or intermediate) layer includes hidden nodes and is located between an input layer and an output layer. Hidden layers perform nonlinear transformations of inputs entered into the network. Each hidden layer is trained to produce a defined output that contributes to a joint output of the output layer of the neural network” in [0044]; “In some examples, the nodes are aggregated into layers. Different layers perform different transformations on their inputs” in [0045], see also down-sampling in [0059] and “a generative adversarial network (GAN) or a diffusion model” in [0067]; “encoding the text using a multimodal encoder to obtain a predicted text embedding, wherein encoding the text comprises generating a plurality of multi-head attention (MHA) text outputs corresponding to a plurality of different text scales, respectively” in [0149]).
Zhao does not directly teach an “attention map.” The combination with Liu further teaches the following limitations:
for a sample image of the plurality of sample images, obtaining at least one attention map from a generative model, the at least one attention map being determined by the generative model for generating the sample image, an attention map indicating visual elements of an object within the sample image
(Zhao discloses “In some cases, a transformer includes one or more feed-forward ANNs to process the data after the application of the self-attention mechanism to allow the transformer to make predictions based on the sequence of data.” in [0052]; “According to some aspects, multimodal encoder 215 comprises the MHA module. In some cases, the MHA module is configured to generate a feature vector based on the prompt. In some cases, the MHA module comprises a set of attention layers” in [0053]; “According to some aspects, multi-scale aggregator 220 is a pyramid projection layer configured to ensemble the set of MHA outputs by combining the set of MHA outputs to generate an aggregated output” in [0054]. Liu further discloses “The training module 16 may include instructions that cause the processor(s) 12 to utilize layer 66 to compute the image-caption attention map 68… may be able to identify a location 68A within the image 32…” in [0044]); and
performing training of a target model according to unsupervised learning at least based on the plurality of sample images and attention maps for the plurality of sample images, the target model being configured to perform an image processing task (Zhao discloses “In some aspects, the multimodal encoder 215 includes at least one pre-trained encoder that is fine-tuned based on the multi-scale aggregator 220. In some cases, the pre-trained encoder is an encoder network of the one or more encoder networks of multimodal encoder 215. In some cases, the pre-trained encoder comprises a CLIP (Contrastive Language-Image Pre-Training) model” in [0060]. Here, CLIP is trained via unsupervised (contrastive) learning. Liu further discloses “As such, the system and method train the model in an unsupervised fashion using images and related captions…” in [0024].)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Zhao with the teaching of Liu so as to use attention layers to generate an attention map that identifies the location of an object within the image (Liu, [0044]).
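For illustration only, the following Python sketch (the editor's hypothetical code, not drawn from Zhao or Liu; the hook mechanism and the `attn_probs` attribute are assumptions) shows one way per-object attention maps could be collected from a text-to-image generative model while it generates a sample image:
```python
import torch

class CrossAttnHook:
    """Forward hook that records cross-attention probabilities from an
    attention module of a text-to-image generative model."""
    def __init__(self):
        self.maps = []

    def __call__(self, module, inputs, output):
        # Assumption: the module exposes its last attention weights as
        # `module.attn_probs` with shape (heads, H*W, num_text_tokens).
        probs = module.attn_probs.mean(dim=0)  # average over heads
        self.maps.append(probs.detach())       # one map per step/layer

def object_attention_map(maps, token_idx, hw):
    """Aggregate recorded maps into one (hw, hw) map for the text token
    naming an object, indicating that object's visual elements."""
    stacked = torch.stack([m[:, token_idx] for m in maps])  # (steps, H*W)
    amap = stacked.mean(dim=0).reshape(hw, hw)
    return amap / (amap.max() + 1e-8)          # normalize to [0, 1]
```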
As to Claim 2, Zhao in view of Liu teaches The method of claim 1, wherein generating the plurality of sample images comprises: generating a plurality of sample images by providing a plurality of text prompts into a trained generative model, respectively, and wherein an attention map for a sample image indicates visual elements of an object indicated by a text prompt (Zhao discloses “For example, a user provides a text prompt to the machine learning system to retrieve an image that includes a depiction of content included in the text prompt. The multimodal retrieval system generates an embedding of the text prompt using a multimodal encoder based on an aggregation of multiple attention scales applied to the text prompt” in [0024]. Liu further discloses “The image-caption attention map 68 may be able to identify a location 68A within the image 32 that relates to the location of the cat 32A…” in [0044].)
As to Claim 4, Zhao in view of Liu teaches The method of claim 1, wherein the unsupervised learning comprises contrastive learning (Zhao, [0166]), and wherein performing training of the target model comprises:
for a first sample image of the plurality of sample images that includes at least two objects, for at least one feature of the first sample image extracted by the target model, masking the at least one feature with at least two attention maps for the at least two objects, respectively, to obtain at least two masked features for the at least two objects, a masked feature comprising feature information related to an object; (Zhao, Fig 1 and [0057], [0112]. Liu, [0047] and Fig 4.);
constructing at least one positive sample pair and at least one negative sample pair from the at least two masked features, a positive sample pair comprising a pair of masked features for a same object, and a negative sample pair comprising a pair of masked features for a pair of different objects; (Zhao discloses “Contrastive learning refers to a type of machine learning in which a model is trained using the selection of positive and negative sample pairs. Contrastive learning can be used in either a supervised or an unsupervised (e.g., self-supervised) training context. A loss function for a contrastive learning model can encourage a model to generate similar results for positive sample pairs, and dissimilar results for negative sample pairs” in [0166]. Liu also discloses “The system and method described in this specification utilize a contrastive pre-training framework for training a model between images and related captions” in [0024]);
determining a contrastive loss for the first sample image based on at least one similarity between the at least one positive sample pair and at least one similarity between the at least one negative sample pair; and training the target model based on the contrastive loss (Zhao discloses “For example, in some cases, the training component determines a loss (such as a contrastive loss) using a loss function (such as a contrastive loss function)… Specifically, during each training iteration, the output of the model is compared to the known annotation information in the training data. The loss function provides a value (a "loss") for how close the predicted annotation data is to the actual annotation data” in [0165]; “A loss function for a contrastive learning model can encourage a model to generate similar results for positive sample pairs, and dissimilar results for negative sample pairs” in [0166]; see also similarity metric in [0131]. Liu also discloses “Once the contrastive loss is determined, the training module 16 may include instructions that cause the processor(s) 12 to adjust, based on the contrastive loss, the model weights 23 and/or 25 of the visual backbone model 22 and/or the textual backbone model 24, respectively. Applying the contrastive loss over the global visual and textual features (after average pooling) provides the visual backbone model 22 with a holistic sense of what objects 32A-32C are in the image 32.” in [0042].)
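For illustration only, a minimal Python sketch (hypothetical; the tensor shapes, function names, and temperature are the editor's assumptions, not the references' disclosures) of masking a feature with per-object attention maps and computing an InfoNCE-style contrastive loss over positive and negative object pairs:
```python
import torch
import torch.nn.functional as F

def masked_object_features(feat, attn_maps):
    """feat: (C, H, W) image feature; attn_maps: (K, H, W), one map per
    object. Returns (K, C) attention-pooled masked features."""
    w = attn_maps / (attn_maps.sum(dim=(1, 2), keepdim=True) + 1e-8)
    return torch.einsum('chw,khw->kc', feat, w)

def contrastive_loss(z1, z2, tau=0.1):
    """z1, z2: (K, C) masked features of the same K objects from two views.
    Row k of z1 with row k of z2 forms a positive pair; features of
    different objects form negative pairs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                        # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)           # InfoNCE
```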
As to Claim 5, Zhao in view of Liu teaches The method of claim 4, wherein masking the at least one feature with at least two attention maps for the at least two objects, respectively, to obtain at least two masked features for the at least two objects comprises:
applying a first cropping operation and a second cropping operation on the first sample image, respectively, to generate a first cropped image and a second cropped image; applying the first cropping operation and the second cropping operation on at least two attention maps for the at least two objects, respectively, to obtain a first set of cropped attention maps for the first cropped image, and a second set of cropped attention maps for the second cropped image; for a first feature of the first cropped image extracted by the target model, masking the first feature with the first set of cropped attention maps, respectively, to obtain a first set of masked features for the at least two objects; and for a second feature of the second cropped image extracted by the target model, masking the second feature with the second set of cropped attention maps, respectively, to obtain a second set of masked features for the at least two objects (Liu discloses “To determine the localization loss, the training module 16 may include instructions that cause the processor(s) 12 to temporally crop portions of the visual identifiers 36 to using a cropping function 70 to generate cropped visual identifiers that correspond to the words of the caption associated with each of the objects of the image 32. Next, the training module 16 may include instructions that cause the processor(s) 12 to render covered regions of the image 32 associated with the cropped visual identifiers to generate binary masks with a resolution R” in [0046]; “Thereafter, the training module 16 may include instructions that cause the processor(s) 12 to stack the rendered masks of all tokens together to generate a rendered attention 72 (Mk). The rendered attention 72 may include render attentions 72A, 72B, and 72C for each of the detected objects in the image 32.” in [0047]; see also Fig 4.)
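As a non-limiting sketch (hypothetical code; the box format is an assumption), the same cropping operation can be applied to a sample image and to its per-object attention maps so that the cropped features and cropped maps stay spatially aligned:
```python
import torchvision.transforms.functional as TF

def crop_view(image, attn_maps, box):
    """image: (C, H, W); attn_maps: (K, H, W); box: (top, left, h, w).
    Returns the cropped image and the identically cropped attention maps."""
    top, left, h, w = box
    return TF.crop(image, top, left, h, w), TF.crop(attn_maps, top, left, h, w)
```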
As to Claim 6, Zhao in view of Liu teaches The method of claim 1, wherein the unsupervised learning comprises masked modeling, and wherein performing training of the target model comprises: for a second sample image of the plurality of sample images, masking at least one patch in the second sample image based on the at least one attention map for the second sample image; and training the target model by performing masked modeling to reconstruct the at least one masked patch in the second sample image (Zhao discloses “In some examples, multimodal encoder 215 identifies a set of masks corresponding to the set of different scales, respectively, where the set of MHA outputs are based on the set of masks. In some aspects, each of the set of masks indicates neighboring pixels around a central pixel. In some aspects, each of the set of masks indicates neighboring words around a central word” in [0057]; “In some cases, the machine learning system separates the original attention corresponding to the feature vector f into the set of different scales by applying the set of masks” in [0107], see also [0112]. Liu also discloses “…that cause the processor(s) 12 to temporally crop portions of the visual identifiers 36 to using a cropping function 70 to generate cropped visual identifiers that correspond to the words of the caption associated with each of the objects of the image 32. Next, the training module 16 may include instructions that cause the processor(s) 12 to render covered regions of the image 32 associated with the cropped visual identifiers to generate binary masks with a resolution R.” in [0046]; “Thereafter, the training module 16 may include instructions that cause the processor(s) 12 to stack the rendered masks of all tokens together to generate a rendered attention 72 (Mk). The rendered attention 72 may include render attentions 72A, 72B, and 72C for each of the detected objects in the image 32.” in [0047].)
As to Claim 7, Zhao in view of Liu teaches The method of claim 6, wherein an attention map for a sample image comprises importance scores of visual elements within the sample image with respect to an object (Zhao discloses “In the machine learning field, an attention mechanism is a method of placing differing levels of importance on different elements of an input” in [0047]; “In some examples, a transformer uses a self-attention mechanism to iteratively determine the importance of parts of the input sequence” in [0050], see also [0051]); and
wherein masking at least one patch in the second sample image comprises: masking the at least one patch in the second sample image based on importance scores comprised in the at least one attention map for the second sample image, the at least one masked patch corresponding to higher importance scores in an attention map than unmasked patches (Zhao discloses “In some examples, multimodal encoder 215 identifies a set of masks corresponding to the set of different scales, respectively, where the set of MHA outputs are based on the set of masks” in [0057]; “In some cases, the machine learning apparatus or the multimodal encoder generates the set of MHA outputs by identifying a corresponding set of masks and applying the corresponding set of masks to the attention score matrix for the feature vector f output by the MHA module. In some cases, the set of masks includes a large-scale mask, a middle-scale mask, and a small-scale mask, and the set of MHA outputs correspondingly includes a large-scale output, a middle-scale output, and a small-scale output” in [0107]. Liu, [0047] and Fig 4.)
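For illustration only, a hypothetical sketch (not drawn from the references) of attention-guided patch masking in which patches with higher importance scores are masked in preference to less important ones:
```python
import torch

def attention_guided_mask(attn_map, patch=16, ratio=0.5):
    """attn_map: (H, W) importance scores, H and W divisible by `patch`.
    Returns a boolean (H//patch, W//patch) grid; True marks a masked patch.
    The highest-scoring patches are selected for masking."""
    H, W = attn_map.shape
    scores = attn_map.reshape(H // patch, patch, W // patch, patch).mean((1, 3))
    flat = scores.flatten()
    idx = flat.topk(int(ratio * flat.numel())).indices  # most important patches
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return mask.reshape(scores.shape)
```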
As to Claim 8, Zhao in view of Liu teaches The method of claim 6, wherein masking at least one patch in the second sample image comprises: in a first training iteration of the target model, masking a first ratio of patches among the plurality of patches in the second sample image, the first ratio being larger than a second ratio of patches masked in the second sample image in a training iteration earlier than the first training iteration (Zhao discloses “In some cases, the machine learning apparatus or the multimodal encoder generates the set of MHA outputs by identifying a corresponding set of masks and applying the corresponding set of masks to the attention score matrix for the feature vector f output by the MHA module. In some cases, the set of masks includes a large-scale mask, a middle-scale mask, and a small-scale mask, and the set of MHA outputs correspondingly includes a large-scale output, a middle-scale output, and a small-scale output” in [0107], see also Fig 4.)
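As an illustrative sketch of such a schedule (the linear ramp and its endpoints are the editor's assumptions), a later iteration can mask a larger ratio of patches than an earlier one:
```python
def mask_ratio(step, total_steps, start=0.3, end=0.75):
    """Fraction of patches to mask at a given training step; increases
    monotonically, so later iterations mask more patches than earlier ones."""
    return start + (end - start) * min(step / max(total_steps, 1), 1.0)
```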
As to Claim 9, Zhao in view of Liu teaches The method of claim 1, wherein the unsupervised learning comprises vision-and-language pretraining, and wherein performing training of the target model comprises: for a third sample image of the plurality of sample images, determining at least one location of at least one object within the third sample image based on at least one attention map for the third sample image; and performing the vision-and-language pretraining of the target model based at least in part on the third sample image and location information indicating the at least one location of the at least one object within the third sample image (Zhao discloses “In some cases, the pre-trained encoder comprises a CLIP (Contrastive Language-Image Pre-Training) model” in [0060]; “A CLIP model can be instructed in natural language to perform a variety of classification benchmarks without directly optimizing for the benchmarks' performance, in a manner building on "zero-shot" or zero-data learning” in [0061]; “According to some aspects, multimodal encoder 215 encodes a text to obtain a predicted text embedding, where encoding the text includes generating a set of multi-head attention (MHA) text outputs corresponding to a set of different text scales, respectively” in [0062]. Liu further discloses “The training module 16 may include instructions that cause the processor(s) 12 to utilize layer 66 to compute the image-caption attention map 68 as the normalized product between the transformed visual feature maps 42 zv,k and the transformed textual feature vectors 44 zv,k… The image-caption attention map 68 may be able to identify a location 68A within the image 32 that relates to the location of the cat 32A, a location 68B within the image 32 that relates to the location of the blanket 32B, and a location 68C within the image 32 that relates to the location of the books 32C” in [0044], see also Fig 4.)
As to Claim 10, Zhao in view of Liu teaches The method of claim 9, wherein performing the vision-and-language pretraining of the target model comprises:
generating a text description to describe the at least one location of at least one object within the third sample image; and performing the vision-and-language pretraining of the target model based at least in part on the third sample image, the location information, and the text description (Zhao discloses “For example, a user provides a text prompt to the machine learning system to retrieve an image that includes a depiction of content included in the text prompt. The multimodal retrieval system generates an embedding of the text prompt using a multimodal encoder based on an aggregation of multiple attention scales applied to the text prompt” in [0024]; “Likewise, second response 1025 is an example of a response including a text description of an image that is provided by the machine learning system in response to image prompt 1020” in [0148]; see also [0061, 0126] and Fig 5-6.)
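For illustration only, a hypothetical sketch (the threshold and caption template are the editor's assumptions) of deriving an object's location from its attention map and rendering that location as a text description for vision-and-language pretraining:
```python
import torch

def object_location(attn_map, thresh=0.5):
    """attn_map: (H, W). Returns an (x0, y0, x1, y1) box over the region
    whose attention exceeds a fraction of the map's maximum."""
    ys, xs = torch.nonzero(attn_map > thresh * attn_map.max(), as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

def location_caption(name, box, width, height):
    """Render a simple text description of where the object sits."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / (2 * width), (y0 + y1) / (2 * height)
    horiz = 'left' if cx < 0.33 else 'right' if cx > 0.66 else 'center'
    vert = 'top' if cy < 0.33 else 'bottom' if cy > 0.66 else 'middle'
    return f'a {name} in the {vert} {horiz} of the image'
```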
As to Claim 11, Zhao in view of Liu teaches The method of claim 1, wherein the generative model comprises a diffusion model for text-to-image generation (Zhao discloses “In some cases, the system generates an image using the embedding of the text prompt as a guidance prompt (for example, using a diffusion model or a GAN)” in [0094].)
Claim 12 recites similar limitations as claims 1 and 2 but in a system form. Therefore, the same rationale used for claims 1 and 2 is applied.
Claim 13 is rejected based upon similar rationale as Claim 2.
Claim 14 is rejected based upon similar rationale as Claim 4.
Claim 15 is rejected based upon similar rationale as Claim 5.
Claim 16 is rejected based upon similar rationale as Claim 6.
Claim 17 is rejected based upon similar rationale as Claim 7.
Claim 18 is rejected based upon similar rationale as Claim 8.
Claim 19 is rejected based upon similar rationale as Claim 9.
Claim 20 recites similar limitations as claim 1 but in a computer readable medium form. Therefore, the same rationale used for claim 1 is applied.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Liu as applied to claim 2 above, and further in view of Hou et al. (CN 115937661 A).
As to Claim 3, Zhao in view of Liu teaches The method of claim 2. The combination with Hou further teaches generating the plurality of text prompts through at least one of the following:
filling at least one of a plurality of object class names into a text template, to obtain at least one text prompt, or providing at least one of the plurality of object class names into a trained word-to-sentence model, to obtain at least one text prompt generated by the word-to-sentence model (Hou discloses “It should be noted that since the text is composed of class names placed in a predefined template, the text embedding represents the semantic information of the corresponding class” in [0045].)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Zhao and Liu with the teaching of Hou so as to place text composed of class names in a predefined template such that the text embedding represents the semantic information of the corresponding class (Hou, [0045]).
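For illustration only, a hypothetical sketch of the template-filling approach Hou describes (the specific template strings are the editor's assumptions):
```python
TEMPLATES = ['a photo of a {}.', 'a rendering of a {}.', 'an image of a {}.']

def class_prompts(class_names):
    """Fill each class name into predefined templates so that the resulting
    text (and its embedding) carries the semantic information of the class."""
    return [t.format(name) for name in class_names for t in TEMPLATES]

# e.g., class_prompts(['cat']) -> ['a photo of a cat.', 'a rendering of a cat.', ...]
```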
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE whose telephone number is (571)270-1221. The examiner can normally be reached on Monday-Friday, 8:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard can be reached on 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WEIMING HE/
Primary Examiner, Art Unit 2611