DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions
Claims 17–20 are withdrawn from further consideration pursuant to 37 CFR 1.142(b), as being drawn to a nonelected invention, there being no allowable generic or linking claim. Applicant timely traversed the restriction (election) requirement in the reply filed on December 29, 2025.
Applicant's election with traverse of claims 1–16 in the reply filed on December 29, 2025 is acknowledged. The traversal is on the grounds that the search and examination of all claims can be made without serious burden to the Examiner, and that the Restriction Requirement fails to identify any different electronic resources, different search strategies, or different search queries that would be required. This is not found persuasive because the Restriction Requirement dated October 29, 2025 (herein “Restriction”) set forth that Invention II, claims 17–20, is directed towards separate utilities such as controlling the cooking of food, generating a recipe, cleaning clothing, and generating a recommendation, all of which would require different search strategies, including different search terms and databases, versus the subject matter of claims 1–16, which is directed strictly towards a specific machine learning model absent any particular use case. Further, the Restriction noted the differences in CPC classification as well, thus establishing a serious burden of search and examination on the Examiner.
Thus, in view of the above, the restriction requirement is still deemed proper and is therefore made FINAL.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1–5, 8–11, and 14–16 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al., “Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning,” arXiv:2211.10681v1 [cs.CV], November 19, 2022, https://doi.org/10.48550/arXiv.2211.10681 (herein “Lu”) in view of Mancini et al., “Open World Compositional Zero-Shot Learning,” arXiv:2101.12609v3 [cs.CV], March 30, 2021, https://doi.org/10.48550/arXiv.2101.12609 (herein “Mancini”), cited in the IDS filed August 23, 2023.
Regarding claims 1 and 8, with claim 1 as exemplary, with deficiencies of Lu noted in square brackets [], and with substantive differences between claims 1 and 8 noted in curly brackets {}, Lu teaches {a method comprising – claim 1 / an apparatus comprising: at least one processing device configured to – claim 8} (Lu Abstract, figure 1, operations of the decomposed soft prompt guided fusion enhancing for compositional zero-shot learning model (DFSP), where page 6, implementation details, teaches the model trained and evaluated with an NVIDIA RTX 3090 GPU, a processing device):
obtaining an image, a set of attribute labels, and a set of object labels (Lu page 3, section 3.1, fig. 2, inputs to the DFSP system including a state text (attribute labels), an object text (object labels), and an image);
and performing prompt tuning of a pre-trained vision-language model comprising a first textual encoder, [a second textual encoder], and a vision encoder (Lu fig. 2, page 4, left column, section 3, page 2, section 2, DFSP including soft prompt which is a paradigm of prompt learning (prompt tuning), the DFSP including a text encoder for the state and object text and an image encoder for the input image), the pre-trained vision-language model trained during the prompt tuning to select one of the attribute labels and one of the object labels that match content contained in the image (Lu page 6, section 3.3, inference (what the trained system outputs) produces the most likely composition for an image, where a composition is a state (attribute), object pair);
wherein performing the prompt tuning comprises {the at least one processing device is configured – claim 8} for each of multiple attribute label-object label pairs: generating object textual features associated with the object label of the attribute label-object label pair using the first textual encoder (Lu page 3, fig. 2, section 3, text encoder outputs language features, including object features);
generating attribute textual features associated with the attribute label of the attribute label-object label pair using the [second] textual encoder (Lu page 3, fig. 2, section 3, text encoder outputs language features, including state (attribute) features); and
generating image features associated with the image using the vision encoder (Lu page 3, fig. 2, section 3, image encoder outputs image features from the input image);
wherein {the at least one processing device is configured to combine – claim 8} intermediate outputs from initial layers of the first textual encoder, [the second textual encoder], and the vision encoder are combined to generate layer-specific learnable prompt tokens (Lu page 3, fig. 2 (reproduced below), section 3, text features output from the text encoder (initial layers output from textual encoder) from the state text and the object text are input to further respective layers of the image processing branch (bottom branch) and the image features output from the image encoder (initial layers of vision encoder) are input to further layers of the text processing branch (top branch) – see dashed lines below crossing the features from the top branch to bottom branch and vice-versa:
[Reproduction of Lu, fig. 2]
) that are appended to inputs of specified layers in the first textual encoder, [the second textual encoder], and the vision encoder during the prompt tuning (Lu page 3, fig. 2, section 3, cross modal fusion layers for each of the text branch and the image branch, the respective branch’s fusion layers receiving both the text features and the image features and outputting one output into the Ldfm pair space, see above reproduction of fig. 2).
While Lu teaches separate text and image encoders, and that the text features for the state (attribute) are output separate from the text object features, Lu does not teach two text encoders, and thus Lu does not teach “a second textual encoder,” as claimed.
Mancini teaches a compositional zero-shot learning model with two textual encoders, including one for the object text (first textual encoder) and one for the state (attribute text) which is a second textual encoder (Mancini page 3, section 3.2, fig. 2, state embeddings (features) output from the shown φstate block).
Therefore, taking the teachings of Lu and Mancini together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include two text encoders as disclosed in Mancini at least because doing so would help to identify and isolate unfeasible distractor compositions (erroneous compositions) and thus improve model performance. See Mancini pages 2–3, section 3.1 and Abstract.
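For orientation only, the following is a minimal, non-limiting sketch of the kind of arrangement discussed in the combination above: separate object and attribute text encoders alongside a vision encoder, each a frozen transformer with learnable prompt tokens prepended to its input during prompt tuning. The PyTorch framing, module names, dimensions, and initialization below are illustrative assumptions and are not drawn from Lu or Mancini.

    # Illustrative sketch only (PyTorch framing assumed; not drawn from Lu or Mancini):
    # frozen transformer encoders with learnable prompt tokens prepended to their inputs.
    import torch
    import torch.nn as nn

    class PromptedEncoder(nn.Module):
        def __init__(self, dim: int = 512, depth: int = 2, n_prompts: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
            for p in self.backbone.parameters():   # pre-trained weights stay frozen;
                p.requires_grad = False            # only the prompt tokens are tuned
            self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: [batch, sequence length, dim]
            prompts = self.prompts.expand(tokens.shape[0], -1, -1)
            out = self.backbone(torch.cat([prompts, tokens], dim=1))
            return out.mean(dim=1)                 # pooled feature vector [batch, dim]

    object_text_encoder = PromptedEncoder()     # first textual encoder (object labels)
    attribute_text_encoder = PromptedEncoder()  # second textual encoder (attribute labels)
    vision_encoder = PromptedEncoder()          # vision encoder (image patch tokens)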
Regarding claim 2, Lu teaches wherein: the pre-trained vision-language model is trained during the prompt tuning using multiple images containing objects associated with a subset of the attribute label-object label pairs (Lu page 6, section 4.1, training datasets including images, states and objects); and the pre-trained vision-language model after the prompt tuning is configured to evaluate additional images in a zero-shot manner and identify objects within the additional images associated with attribute label-object label pairs not seen during the prompt tuning (Lu page 6, section 4.1, after training evaluation metrics including accuracy of unseen compositions (label pairs not seen during prompt tuning), where page 4, section 3.2 states that the DFSP system follows a paradigm for CZSL – compositional zero-shot learning, and thus evaluates in a zero-shot manner).
Regarding claims 3 and 9, with claim 3 as exemplary, Lu teaches wherein performing the prompt tuning further comprises: determining object scores based on similarities between the object textual features and the image features; determining attribute scores based on similarities between the attribute textual features and the image features (Lu pages 4–5, Decomposed Fusion Module section, the DFM establishes respective associations (similarity scores) of the image with the state and object in the pair-space, with probability formulas 8 and 9 respectively being the scores); and tuning prompts of the pre-trained vision-language model based on the object scores and the attribute scores (Lu page 5, equation 10, cross-entropy loss metric for optimizing (tuning) prompts of the disclosed DFSP model).
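As a rough, non-authoritative illustration of the scoring and tuning mechanism addressed in claims 3 and 9 (similarity scores between label text features and image features feeding a cross-entropy tuning objective), a short sketch under the same assumed PyTorch framing as above; the cosine normalization and temperature value are illustrative choices, not taken from Lu.

    import torch
    import torch.nn.functional as F

    def label_scores(image_feats: torch.Tensor, label_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
        # Cosine-similarity scores between image features [B, D] and label features [N, D].
        image_feats = F.normalize(image_feats, dim=-1)
        label_feats = F.normalize(label_feats, dim=-1)
        return image_feats @ label_feats.t() / temperature   # [B, N] score matrix

    # During prompt tuning, the object-score and attribute-score matrices would each feed
    # a cross-entropy objective whose gradients update only the learnable prompt tokens:
    #   obj_loss  = F.cross_entropy(label_scores(img_feats, obj_feats),  obj_targets)
    #   attr_loss = F.cross_entropy(label_scores(img_feats, attr_feats), attr_targets)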
Regarding claims 4 and 10, with claim 4 as exemplary, Lu teaches wherein: the vision encoder processes the image with learnable visual prompts added to the image during the prompt tuning (Lu figure 2, page 4, section 3.2, soft prompt module section, encoders in DFSP use a fully learnable soft prompt converted to embeddings and then a class probability pspm is calculated using the image features (processes the image) and the soft prompt converted embeddings (with learnable visual prompts added)); and the pre-trained vision-language model is trained during the prompt tuning to consider each combination of an attribute label from the set of attribute labels and an object label from the set of object labels when identifying the content contained in the image (Lu page 4, section 3.2, equation 5 providing the cross entropy loss function for optimizing the soft prompt module, including the pspm probability that is calculated from the prompt set as in equation 1, which is a set of state (attribute) labels and object labels).
Regarding claims 5 and 11, with claim 5 as exemplary, and deficiencies of Lu noted in square brackets [], Lu teaches wherein: the pre-trained vision-language model comprises a transformer-based machine learning model having [three] branches of transformer layers (Lu page 5, right column, last paragraph, the image and text encoders are all based on transformer layers, where fig. 2 illustrates at least a top branch for the text encoding, and a bottom branch for the vision encoding, thus at least two branches); a first of the branches of transformer layers implements the first textual encoder; [a second of] the branches of transformer layers implements the [second] textual encoder (Lu fig. 2, top branch with text encoder that encodes both the state (attribute) and the object input text into embeddings); and a third of the branches of transformer layers implements the vision encoder (Lu fig. 2, bottom branch with the image encoder).
While Lu teaches that the state (attribute) and object text are encoded individually and that the encoder is based on transformer layers, Lu does not explicitly teach that there is a separate branch for the state text encoding versus the object text encoding, resulting in three branches, and also a second text encoder. However, Mancini teaches a separate branch for the state text encoding (middle branch), object text encoding (bottom branch) and visual embedding encoding (top branch), making three branches, as shown in fig. 2 of Mancini, as well as a separate second textual encoder for the state (attribute) features:
[Reproduction of Mancini, fig. 2]
Therefore, taking the teachings of Lu and Mancini together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include two text encoders and three branches as disclosed in Mancini at least because doing so would help to identify and isolate unfeasible distractor compositions (erroneous compositions) and thus improve model performance. See Mancini pages 2–3, section 3.1 and Abstract.
Regarding claim 14, with deficiencies of Lu noted in square brackets [], Lu teaches a method comprising (Lu Abstract, figure 1, operations of the decomposed soft prompt guided fusion enhancing for compositional zero-shot learning model (DFSP)):
obtaining an image of an object (Lu page 3, section 3.1, fig. 2, inputs to the DFSP system including an image);
obtaining a set of attribute labels and a set of object labels (Lu page 3, section 3.1, fig. 2, inputs to the DFSP system including a state text (attribute labels), an object text (object labels));
selecting one of the attribute labels and one of the object labels associated with the object (Lu page 6, section 3.3, inference (what the trained system outputs) produces the most likely composition for an image, where a composition is a state (attribute), object pair) using a vision-language model comprising a first textual encoder, [a second textual encoder], and a vision encoder (Lu fig. 2, page 4, left column, section 3, page 2, section 2, DFSP including soft prompt which is a paradigm of prompt learning (prompt tuning), the DFSP including a text encoder for the state and object text and an image encoder for the input image);
wherein selecting the one of the attribute labels and the one of the object labels associated with the object comprises: generating object textual features associated with each of the object labels using the first textual encoder (Lu page 3, fig. 2, section 3, text encoder outputs language features, including object features);
generating attribute textual features associated with each of the attribute labels using the [second] textual encoder (Lu page 3, fig. 2, section 3, text encoder outputs language features, including state (attribute) features); and
generating image features associated with the image using the vision encoder (Lu page 3, fig. 2, section 3, image encoder outputs image features from the input image);
wherein one or more layers in the first textual encoder, one or more layers in [the second textual encoder], and one or more layers in the vision encoder are each associated with a layer-specific multi-modal shared prompt that is concatenated to an input for the layer (Lu page 3, fig. 2 (reproduced below), section 3, text features output from the text encoder (layers in the textual encoder) from the state text and the object text are input to further respective layers of the image processing branch (bottom branch) and the image features output from the image encoder (initial layers of vision encoder) are input to further layers of the text processing branch (top branch) – see dashed lines below crossing the features from the top branch to bottom branch and vice-versa, and page 3, fig. 2, section 3, teaching cross modal fusion layers (specific multi-modal shared prompt that is concatenated) for each of the text branch and the image branch, the respective branch’s fusion layers receiving both the text features and the image features and outputting one output into the Ldfm pair space, see the reproduction of fig. 2 below:
[Reproduction of Lu, fig. 2]
).
While Lu teaches separate text and image encoders, and that the text features for the state (attribute) are output separate from the text object features, Lu does not teach two text encoders, and thus Lu does not teach “a second textual encoder,” as claimed.
Mancini teaches a compositional zero-shot learning model with two textual encoders, including one for the object text (first textual encoder) and one for the state (attribute text) which is a second textual encoder (Mancini page 3, section 3.2, fig. 2, state embeddings (features) output from the shown φstate block).
Therefore, taking the teachings of Lu and Mancini together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include two text encoders as disclosed in Mancini at least because doing so would help to identify and isolate unfeasible distractor compositions (erroneous compositions) and thus improve model performance. See Mancini pages 2–3, section 3.1 and Abstract.
Regarding claim 15, Lu teaches wherein selecting the one of the attribute labels and the one of the object labels associated with the object further comprises: for each object label, determining an object score based on a similarity between the image features generated by a final layer of the vision encoder and the object textual features associated with the object label generated by a final layer of the first textual encoder (Lu pages 4–5, Decomposed Fusion Module section, the DFM establishes respective associations (similarity scores) of the image with the state and object in the pair-space, with probability formulas 8 and 9 respectively being the scores); selecting the object label associated with a highest object score (Lu page 6, section 3.3, only feasible compositions are selected (infeasible are filtered out), including object scores higher than a threshold T (associated with a highest object score)); for each attribute label, determining an attribute score based on a similarity between the image features generated by the final layer of the vision encoder and the attribute textual features associated with the attribute label generated by a final layer of the second textual encoder (Lu pages 4–5, Decomposed Fusion Module section, the DFM establishes respective associations (similarity scores) of the image with the state (attribute) and object in the pair-space, with probability formulas 8 and 9 respectively being the scores); and selecting the attribute label associated with a highest attribute score (Lu page 6, section 3.3, only feasible compositions are selected (infeasible are filtered out), including state (attribute) scores higher than a threshold T (associated with a highest attribute score)).
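For illustration only, the selection step addressed in claim 15 (choosing the object label and attribute label with the highest scores) can be sketched as a simple argmax over the per-label score vectors produced for a single image; the function and variable names below are assumptions, not taken from Lu.

    import torch

    def select_labels(object_scores: torch.Tensor, attribute_scores: torch.Tensor,
                      object_labels: list, attribute_labels: list):
        # Pick the highest-scoring object label and the highest-scoring attribute label
        # from 1-D score vectors computed for one image.
        best_object = object_labels[int(torch.argmax(object_scores))]
        best_attribute = attribute_labels[int(torch.argmax(attribute_scores))]
        return best_attribute, best_object   # e.g. ("sliced", "tomato")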
Regarding claim 16, Lu teaches the claimed selected object label and selected attribute label (Lu page 6, section 3.3, inference produces (selects) the most likely composition for an image, where a composition is a state (attribute), object pair). However, Lu does not explicitly teach the further limitation of “performing at least one action based on” the selected object label and the selected attribute label. Mancini teaches performing at least one action based on a selected composition (Mancini section 2, selection of a composition of object representations in text is used in tasks (performing an action based on the selection) such as compositional reasoning for visual question answering and modular image generation).
Therefore, taking the teachings of Lu and Mancini together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include an action performed subsequent to the composition selection as disclosed in Mancini at least because doing so would improve performance of a downstream task. See Mancini pages 2–3, section 3.1 and Abstract.
Claims 7 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Mancini, as set forth above regarding claims 1 and 8 from which claims 7 and 13 respectively depend, further in view of Misra et al., "From Red Wine to Red Tomato: Composition with Context," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 1160–1169, doi: 10.1109/CVPR.2017.129, cited in the IDS filed August 8, 2023 (herein “Misra”).
Regarding claims 7 and 13, with claim 7 being exemplary, and the deficiencies of Lu noted in square brackets [], Lu teaches wherein: the first and [second] textual encoders represent [separate machine learning classifiers] (Lu fig. 2, page 4, left column, section 3, page 2, section 2, the DFSP including a text encoder for the state and object text);
{[the machine learning classifiers] are trained jointly – claim 7 / the at least one processing device is configured to jointly train the machine learning classifiers – claim 13} using a loss function that combines losses associated with the [machine learning classifiers] (Lu page 5, left column, cross entropy loss in equation 10 combining losses from the state (attribute) text and the object text); and
[the machine learning classifiers are used to select the one of the attribute labels and the one of the object labels by identifying an attribute label-object label combination having a highest score].
As indicated above, Lu does not explicitly teach the limitations denoted in square brackets above []; however, Mancini teaches the second textual encoder (Mancini page 3, section 3.2, fig. 2, state embeddings (features) output from the shown φstate block).
Misra teaches separate machine learning classifiers and the machine learning classifiers (Misra page 1162, section 3.3, fig. 3, a separate classifier for the object (such as elephant) and for the attribute (large)). Misra further teaches the machine learning classifiers are used to select the one of the attribute labels and the one of the object labels by identifying an attribute label-object label combination having a highest score (Misra pages 1162–1163, function T (including the classifiers) is learned to produce the highest score for a pair of attribute, object values from various combinations).
Therefore, taking the teachings of Lu and Mancini together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include two text encoders as disclosed in Mancini at least because doing so would help to identify and isolate unfeasible distractor compositions (erroneous compositions) and thus improve model performance. See Mancini pages 2–3, section 3.1 and Abstract.
Further, taking the teachings of Lu as modified by Mancini and Misra together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the compositional zero-shot learning model of Lu to include having text encoders as machine learning classifiers that provide a highest scoring pair of attribute, object values as disclosed in Misra at least because doing so would provide generalization to unseen combinations of concepts with strong performance. See Misra Abstract.
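As a non-limiting sketch of the arrangement addressed in claims 7 and 13 (two classifier heads trained jointly with a combined loss, and the highest-scoring attribute label-object label combination identified at inference), under the same assumed PyTorch framing as above; the summed losses and the outer-sum pairing of scores are illustrative choices, not a statement of how Lu, Mancini, or Misra implement them.

    import torch
    import torch.nn.functional as F

    def joint_loss(attribute_scores, object_scores, attribute_targets, object_targets):
        # Combined objective: the attribute and object classifier losses are summed so
        # both heads train jointly. Scores are [B, A] and [B, O]; targets are [B].
        return (F.cross_entropy(attribute_scores, attribute_targets)
                + F.cross_entropy(object_scores, object_targets))

    def best_composition(attribute_scores: torch.Tensor, object_scores: torch.Tensor):
        # For one image (1-D score vectors), score every attribute-object combination
        # and return the indices of the combination with the highest total score.
        pair_scores = attribute_scores.unsqueeze(1) + object_scores.unsqueeze(0)  # [A, O]
        flat = int(torch.argmax(pair_scores))
        return divmod(flat, pair_scores.shape[1])   # (attribute index, object index)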
Allowable Subject Matter
Claims 6 and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Specifically, the closest cited art to the limitations of claims 6 and 12 includes Lu and Mancini as applied above to claims 1 and 5, and 8 and 11, from which claims 6 and 12 respectively depend. Lu and Mancini do not teach or suggest the aspects of claims 6 and 12 directed towards the “visual patch prompts are generated at one or more random positions in the image,” or “the initial visual and textual prompts are projected for at least one of the transformer layers in each of the branches to generate layer-specific initial prompt tokens, different branches associated with different layer-specific initial prompt tokens,” and the further limitations regarding each branch of the transformer-based machine learning model.
Further, the reference Gu et al., U.S. Patent Application Publication No. US 2022/0147838 A1, provides teachings of a vision-language application using transformer encoders, specifically a BERT transformer, with input tokens including text and image, and with random masking in the image regions, thus mapping to the claimed “one or more visual patch prompts are generated at one or more random positions in the image.” However, Gu, whether considered alone or in a combination obvious to a person having ordinary skill in the art, does not teach or suggest “the initial visual and textual prompts are projected for at least one of the transformer layers in each of the branches to generate layer-specific initial prompt tokens, different branches associated with different layer-specific initial prompt tokens,” and the further limitations regarding each branch of the transformer-based machine learning model.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Gu et al., U.S. Patent Application Publication No. US 2022/0147838 A1, teaching a vision-language application using transformer encoders, specifically BERT transformer, with input tokens including text and image, and with random masking in the image regions.
Nayak et al., "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning," arXiv:2204.03574v1 [cs.LG], April 2022, cited in the IDS filed October 2, 2025, directed towards compositional soft prompting for compositions zero-shot learning using a multi-modal embedding space.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
MICHELLE M. KOETH
Primary Examiner
Art Unit 2671
/MICHELLE M KOETH/Primary Examiner, Art Unit 2671