Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Status
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 are pending in the application.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/09/2026 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.
Response to Amendment
The amendment filed on 01/09/2026 has been entered.
Claims 1, 11, 16, and 20 have been amended.
Claims 2, 3, 8, 9, 12, 13, 18, 19, and 21 have been cancelled.
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 remain pending in the application.
The Ma prior art reference has been withdrawn in view of the affidavit under 37 CFR 1.130(a) filed 01/09/2026.
The rejection of claims 6 and 16 under 35 U.S.C. 112(b) has been withdrawn in view of the amendments.
Response to Arguments
Applicant’s arguments (remarks filed 01/09/2026) have been considered but are moot in view of the new ground(s) of rejection based on Huang (“Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning”, 2021).
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huang (“Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning”, 2021), hereinafter referred to as Huang.
Regarding claims 1, 11, and 20, Huang teaches A method comprising (Huang, abstract: “In this paper, we propose SOHO to “See Out of tHe bOx” that takes a whole image as input, and learns vision-language representation in an end-to-end manner”), at a device:
processing an image (Huang, pg 3, column 1, Section 3, ¶1: “The visual encoder takes an image as input”, which is being interpreted as “processing an image”), by a vision transformer (Huang, see image below: “Transformer” and “visual” are being interpreted as including a “vision transformer”) pretrained (Huang, see image below: “we propose a novel Masked Visual Modeling pre-training” is being interpreted as involving “pretrained”) on a predefined (Huang, see image below: “based on the virtual visual semantic labels produced by the visual dictionary” is being interpreted as involving “predefined”, which is being interpreted as existing beforehand, just as classification labels are predefined in classical machine learning algorithms) concept-feature dictionary (Huang, pg 4, column 2, Section 3.3, ¶1, reproduced below:
[media_image1.png, greyscale: excerpt of Huang, pg 4, column 2, Section 3.3, ¶1]
. “Visual dictionary” is being interpreted as involving a concept-feature dictionary as seen from the “image-text matching pre-training tasks”) that correlates image features with image concepts (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as image concept [as can be seen in Figure 3 where index 191 is the image concept “head” and index 1074 is the image concept “building”], “visual feature” is being interpreted as “image features”)
to infer an associated concept for the image (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”, which is being interpreted as involving inferring “an associated concept for the image”) that indicates a relationship (Huang, Figure 1, “Ours: A couple sit on the shore next to a boat on the sea”. “Sit on the shore” and “next to a boat” are being interpreted as indicating “a relationship”) between two or more objects as depicted in the image (Huang, Figure 1, “Ours: A couple sit on the shore next to a boat on the sea”. “Couple”, “shore”, “boat”, and “sea” are being interpreted as “two or more objects depicted in the image”), wherein the vision transformer is comprised of a tokenizer, at least one layer for generating patch embeddings (Huang, see Section 4.4 image above, “image patch”; when combined with pg 3, Figure 2, “Visual Dictionary-based embedding features”, this shows “at least one layer for generating patch embeddings”), and at least one multi-head self-attention layer (Huang, pg 9, Section A.4, Discussion, ¶1, reproduced below:
[media_image3.png, greyscale: excerpt of Huang, pg 9, Section A.4, Discussion, ¶1]
“Multi-layer Transformer” is being interpreted as involving at least one multi-head. “Self-attention mechanism” is being interpreted as being part of one of the layers, resulting in a “multi-head self-attention layer”. Further, a ResNet-101 backbone and a 12-layer Transformer are used in this prior art, as seen in pg 5, Section 4.1. As one with ordinary skill in the art would know, this contains at least one multi-head self-attention layer; see note * below); and outputting the concept inferred for the image (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”, which is being interpreted as outputting the concept inferred for the image).
*Note: Knowledge reference on the 12-layer transformer (pg 3, column 2 mentions “self-attention heads”, which is being interpreted as multi-head self-attention): https://arxiv.org/abs/1810.04805.
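For illustration of the claimed structure only, the elements identified above (a tokenizer that splits an image into patches, a layer generating patch embeddings, and a multi-head self-attention layer) can be sketched as follows. This is a minimal NumPy sketch under assumed toy dimensions, not Huang's implementation:

```python
import numpy as np

def tokenize(image, patch=4):
    """Split an HxWxC image into flattened non-overlapping patches (tokens)."""
    h, w, c = image.shape
    patches = [image[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, h, patch)
               for j in range(0, w, patch)]
    return np.stack(patches)                      # (num_tokens, patch*patch*c)

def embed(tokens, w_embed):
    """Linear projection of patch tokens into the embedding space."""
    return tokens @ w_embed                       # (num_tokens, d_model)

def multi_head_self_attention(x, num_heads=2):
    """One multi-head self-attention layer (identity Q/K/V projections assumed)."""
    n, d = x.shape
    d_head = d // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]  # per-head slice
        scores = q @ k.T / np.sqrt(d_head)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)              # (num_tokens, d_model)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                     # toy 8x8 RGB image
tokens = tokenize(image)                          # 4 tokens of length 48
embeddings = embed(tokens, rng.random((48, 16)))  # 4 x 16 patch embeddings
out = multi_head_self_attention(embeddings)
print(out.shape)                                  # (4, 16)
```

The patch size, embedding width, and head count are arbitrary assumptions chosen to keep the sketch small.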
Regarding claim 4, Huang teaches The method of claim 1, wherein the image is not labeled with the concept when received as input (Huang, pg 5, column 1, last paragraph: “Detailed comparisons of pre-training dataset usage of most VLPT works, including our train/test image and text numbers, are included in our supplementary material”. “Test image” is being interpreted as splitting the dataset into training and testing sets, as one with ordinary skill in the art would know. The testing sets are being interpreted as not labeled with the concept when received as input).
Regarding claim 5, Huang teaches The method of claim 1, wherein the vision transformer performs action prediction and object prediction utilizing the image (Huang, Figure 1 and Figure 1 text shows “Ours: A couple sit on the shore next to a boat on the sea”. “Sit” is being interpreted as “action prediction”. “Couple”, “shore”, “boat”, “sea” are being interpreted as object prediction. Figure 1 shows an image which these predictions are done. SOHO, the method, is being interpreted as involving a “vision transformer”).
Regarding claim 6, Huang teaches The method of claim 1, wherein the concept (Huang, see Figure 1 citation below, the caption is being interpreted as involving “the concept”) indicates the relationship (Huang, see Figure 1 citation below, “A couple sit on the shore next to” is being interpreted as involving a relationship) between the two or more objects (Huang, see Figure 1 citation below, “couple” and “boat” are being interpreted as “two or more objects”) as a tuple (Huang, see Figure 1 citation below. A definition of a tuple, as one with ordinary skill in the art would know, is an ordered, finite sequence of elements. The order of the caption matters and it is a finite sequence of elements.) of two objects and an associated action (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”. “Couple” and “boat” are being interpreted as examples of two objects with the associated action of “sit”).
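For illustration of this tuple interpretation only, the relationship between two objects and an associated action, drawn from the Figure 1 caption, can be sketched as follows; the hard-coded parse is a hypothetical example, not an actual extractor:

```python
def relation_tuple(subject, action, obj):
    """Return an ordered, finite (subject, action, object) triple."""
    return (subject, action, obj)

# From "A couple sit on the shore next to a boat on the sea"
rel = relation_tuple("couple", "sit", "boat")
print(rel)   # ('couple', 'sit', 'boat')
```

A Python tuple is ordered and finite, matching the definition applied above.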
Regarding claim 7, Huang teaches The method of claim 1, further comprising, at the device: performing one or more classification operations utilizing the concept (Huang, pg 7, column 2, Section 4.2.4, ¶1, reproduced below:
[media_image4.png, greyscale: excerpt of Huang, pg 7, column 2, Section 4.2.4, ¶1]
. “Three-classification” is being interpreted as one or more classification operations. “Output of the transformer” is being interpreted as “utilizing the concept”).
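For illustration of the “three-classification” interpretation only, a classification head over a transformer output vector can be sketched as follows. The weights are random stand-ins, and the class labels are assumptions drawn from typical visual-entailment setups, not quoted from Huang:

```python
import numpy as np

def classify(features, weights, classes):
    """Linear head: score each class and return the argmax label."""
    scores = features @ weights                 # (num_classes,)
    return classes[int(np.argmax(scores))]

rng = np.random.default_rng(1)
features = rng.random(16)                       # stand-in transformer output
weights = rng.random((16, 3))                   # stand-in head weights
label = classify(features, weights,
                 ["entailment", "contradiction", "neutral"])
print(label)
```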
Regarding claim 10, Huang teaches The method of claim 1, wherein each of a plurality of concepts within the dictionary is represented by a key (Huang, see image below, the “VD index”, or Visual Dictionary Index, is being interpreted as a key), and wherein each of the plurality of concepts within the dictionary is linked to a predefined (Huang, see section 3.3 image from claim 1, “based on the virtual visual semantic labels produced by the visual dictionary” is being interpreted as involving “predefined”, which is being interpreted as existing before; just as classification labels are predefined in classical machine learning algorithms) set of image features (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as image concept [as can be seen in Figure 3 where index 191 is the image concept “head” and index 1074 is the image concept “building”], “visual feature” is being interpreted as “image features”).
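For illustration of this interpretation only, a concept-feature dictionary keyed by VD index, with each concept linked to a predefined feature vector, can be sketched as follows. The keys 191 and 1074 follow the Figure 3 examples cited above, while the feature vectors and the nearest-neighbor lookup are made-up assumptions:

```python
import numpy as np

# Each entry: integer key (the "VD index") -> concept name + predefined feature.
dictionary = {
    191:  {"concept": "head",     "feature": np.array([1.0, 0.0, 0.0])},
    1074: {"concept": "building", "feature": np.array([0.0, 1.0, 0.0])},
}

def lookup(image_feature):
    """Return (key, concept) of the dictionary entry nearest in feature space."""
    key = min(dictionary,
              key=lambda k: np.linalg.norm(dictionary[k]["feature"] - image_feature))
    return key, dictionary[key]["concept"]

key, concept = lookup(np.array([0.9, 0.1, 0.0]))
print(key, concept)   # 191 head
```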
Regarding claim 22, Huang teaches The method of claim 1, wherein the vision transformer is configured to include: a global task that, during training, clusters images with the same concept together to produce semantically consistent relational representations (Huang, Figure 1, reproduced below:
[media_image5.png, greyscale: Huang, Figure 1]
. “Global context” is being interpreted as “global task”. For example, “chatting” or “next to a boat” are examples of relational representations), and a local task that, during training, guides the vision transformer to discover object-centric semantic correspondence across images (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as object-centric semantic correspondence [as can be seen in Figure 3, where index 191 is the object-centric semantic correspondence “head” and index 1074 is the object-centric semantic correspondence “building” across images]. This object-centric semantic correspondence is being interpreted as “a local task”; pg 2, column 1, second to last paragraph: “VD can be dynamically updated through our trainable CNN backbone directly from visual-language data during pretraining”. The Visual Dictionary is being interpreted as guiding the vision transformer during training).
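For illustration of the “global task” interpretation only, grouping images that share the same inferred concept can be sketched as follows; the image identifiers and concept assignments are hypothetical:

```python
from collections import defaultdict

def cluster_by_concept(image_concepts):
    """Group image ids by their inferred concept, clustering same-concept images."""
    clusters = defaultdict(list)
    for image_id, concept in image_concepts:
        clusters[concept].append(image_id)
    return dict(clusters)

clusters = cluster_by_concept([
    ("img_0", "head"), ("img_1", "building"), ("img_2", "head"),
])
print(clusters)   # {'head': ['img_0', 'img_2'], 'building': ['img_1']}
```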
Claim 14 is rejected using the same rationale as applied to claim 4 discussed above.
Claim 15 is rejected using the same rationale as applied to claim 5 discussed above.
Claim 16 is rejected using the same rationale as applied to claim 6 discussed above.
Claim 17 is rejected using the same rationale as applied to claim 7 discussed above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kim et al. (“Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)”, 2018) discloses a dictionary of concepts (Appendix A) and training using concept vectors.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNNY B DUONG whose telephone number is (571)272-1358. The examiner can normally be reached Monday - Thursday 10a-9p (ET).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.B.D./Examiner, Art Unit 2667
/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667