Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Status
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 are pending in the application.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/09/2026 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.
Response to Amendment
The amendment filed on 01/09/2026 has been entered.
Claims 1, 11, 16, and 20 have been amended.
Claims 2, 3, 8, 9, 12, 13, 18, 19, and 21 have been cancelled.
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 remain pending in the application.
The Ma prior art reference has been withdrawn in view of the affidavit under 37 CFR 1.130(a) filed 01/09/2026.
The rejection of claims 6 and 16 under 35 U.S.C. 112(b) has been withdrawn in view of the amendments.
Response to Arguments
Applicant’s arguments (remarks filed 01/09/2026) have been considered but are moot in view of the new ground(s) of rejection based on Huang (“Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning”, 2021).
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1, 4-7, 10, 11, 14-17, 20, and 22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huang (“Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning”, 2021), hereinafter referred to as Huang.
Regarding claims 1, 11, and 20, Huang teaches A method comprising (Huang, abstract: “In this paper, we propose SOHO to “See Out of tHe bOx” that takes a whole image as input, and learns vision-language representation in an end-to-end manner”), at a device:
processing an image (Huang, pg 3, column 1, Section 3, ¶1: “The visual encoder takes an image as input”, which is being interpreted as “processing an image”), by a vision transformer (Huang, see image below: “Transformer” and “visual” are being interpreted as including a “vision transformer”) pretrained (Huang, see image below: “we propose a novel Masked Visual Modeling pre-training” is being interpreted as involving “pretrained”) on a predefined (Huang, see image below: “based on the virtual visual semantic labels produced by the visual dictionary” is being interpreted as involving “predefined”, which is being interpreted as existing beforehand, just as classification labels are predefined in classical machine learning algorithms) concept-feature dictionary (Huang, pg 4, column 2, Section 3.3, ¶1, reproduced below:
[media_image1.png, greyscale: excerpt of Huang, pg 4, column 2, Section 3.3, ¶1]
. “Visual dictionary” is being interpreted as involving a concept-feature dictionary as seen from the “image-text matching pre-training tasks”) that correlates image features with image concepts (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as image concept [as can be seen in Figure 3 where index 191 is the image concept “head” and index 1074 is the image concept “building”], “visual feature” is being interpreted as “image features”)
to infer an associated concept for the image (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”, which is being interpreted as involving inferring “an associated concept for the image”) that indicates a relationship (Huang, Figure 1, “Ours: A couple sit on the shore next to a boat on the sea”. “Sit on the shore” and “next to a boat” are being interpreted as indicating “a relationship”) between two or more objects as depicted in the image (Huang, Figure 1, “Ours: A couple sit on the shore next to a boat on the sea”. “Couple”, “shore”, “boat”, and “sea” are being interpreted as “two or more objects depicted in the image”), wherein the vision transformer is comprised of a tokenizer, at least one layer for generating patch embeddings (Huang, see Section 4.4 image above, “image patch”; when combined with pg 3, Figure 2, “Visual Dictionary-based embedding features”, this shows “at least one layer for generating patch embeddings”), and at least one multi-head self-attention layer (Huang, pg 9, Section A.4, Discussion, ¶1, reproduced below:
[media_image3.png, greyscale: excerpt of Huang, pg 9, Section A.4, Discussion, ¶1]
“Multi-layer Transformer” is being interpreted as involving at least one multi-head. “Self-attention mechanism” is being interpreted as being part of one of the layers, resulting in a “multi-head self-attention layer”. Further, a ResNet-101 backbone and a 12-layer Transformer are used in this prior art, as seen in pg 5, Section 4.1. As one with ordinary skill in the art would know, this contains at least one multi-head self-attention layer; see note * below); and outputting the concept inferred for the image (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”, which is being interpreted as outputting the concept inferred for the image).
*Note: Knowledge reference on the 12-layer transformer (pg 3, column 2 mentions “self-attention heads”, which is being interpreted as multi-head self-attention): https://arxiv.org/abs/1810.04805.
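For illustration of the claimed structure only, the elements identified above (a tokenizer that splits an image into patches, a layer generating patch embeddings, and a multi-head self-attention layer) can be sketched as follows. This is a minimal NumPy sketch under assumed toy dimensions, not Huang's implementation:

```python
import numpy as np

def tokenize(image, patch=4):
    """Split an HxWxC image into flattened non-overlapping patches (tokens)."""
    h, w, c = image.shape
    patches = [image[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, h, patch)
               for j in range(0, w, patch)]
    return np.stack(patches)                      # (num_tokens, patch*patch*c)

def embed(tokens, w_embed):
    """Linear projection of patch tokens into the embedding space."""
    return tokens @ w_embed                       # (num_tokens, d_model)

def multi_head_self_attention(x, num_heads=2):
    """One multi-head self-attention layer (identity Q/K/V projections assumed)."""
    n, d = x.shape
    d_head = d // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]  # per-head slice
        scores = q @ k.T / np.sqrt(d_head)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)              # (num_tokens, d_model)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                     # toy 8x8 RGB image
tokens = tokenize(image)                          # 4 tokens of length 48
embeddings = embed(tokens, rng.random((48, 16)))  # 4 x 16 patch embeddings
out = multi_head_self_attention(embeddings)
print(out.shape)                                  # (4, 16)
```

The patch size, embedding width, and head count are arbitrary assumptions chosen to keep the sketch small.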
Regarding claim 4, Huang teaches The method of claim 1, wherein the image is not labeled with the concept when received as input (Huang, pg 5, column 1, last paragraph: “Detailed comparisons of pre-training dataset usage of most VLPT works, including our train/test image and text numbers, are included in our supplementary material”. “Test image” is being interpreted as splitting the dataset into training and testing sets, as one with ordinary skill in the art would know. The testing sets are being interpreted as not labeled with the concept when received as input).
Regarding claim 5, Huang teaches The method of claim 1, wherein the vision transformer performs action prediction and object prediction utilizing the image (Huang, Figure 1 and Figure 1 text shows “Ours: A couple sit on the shore next to a boat on the sea”. “Sit” is being interpreted as “action prediction”. “Couple”, “shore”, “boat”, “sea” are being interpreted as object prediction. Figure 1 shows an image which these predictions are done. SOHO, the method, is being interpreted as involving a “vision transformer”).
Regarding claim 6, Huang teaches The method of claim 1, wherein the concept (Huang, see Figure 1 citation below, the caption is being interpreted as involving “the concept”) indicates the relationship (Huang, see Figure 1 citation below, “A couple sit on the shore next to” is being interpreted as involving a relationship) between the two or more objects (Huang, see Figure 1 citation below, “couple” and “boat” are being interpreted as “two or more objects”) as a tuple (Huang, see Figure 1 citation below. A definition of a tuple, as one with ordinary skill in the art would know, is an ordered, finite sequence of elements. The order of the caption matters and it is a finite sequence of elements.) of two objects and an associated action (Huang, Figure 1 and Figure 1 text show “Ours: A couple sit on the shore next to a boat on the sea”. “Couple” and “boat” are being interpreted as examples of two objects with the associated action of “sit”).
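For illustration of this tuple interpretation only, the relationship between two objects and an associated action, drawn from the Figure 1 caption, can be sketched as follows; the hard-coded parse is a hypothetical example, not an actual extractor:

```python
def relation_tuple(subject, action, obj):
    """Return an ordered, finite (subject, action, object) triple."""
    return (subject, action, obj)

# From "A couple sit on the shore next to a boat on the sea"
rel = relation_tuple("couple", "sit", "boat")
print(rel)   # ('couple', 'sit', 'boat')
```

A Python tuple is ordered and finite, matching the definition applied above.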
Regarding claim 7, Huang teaches The method of claim 1, further comprising, at the device: performing one or more classification operations utilizing the concept (Huang, pg 7, column 2, Section 4.2.4, ¶1, reproduced below:
[media_image4.png, greyscale: excerpt of Huang, pg 7, column 2, Section 4.2.4, ¶1]
. “Three-classification” is being interpreted as one or more classification operations. “Output of the transformer” is being interpreted as “utilizing the concept”).
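For illustration of the “three-classification” interpretation only, a classification head over a transformer output vector can be sketched as follows. The weights are random stand-ins, and the class labels are assumptions drawn from typical visual-entailment setups, not quoted from Huang:

```python
import numpy as np

def classify(features, weights, classes):
    """Linear head: score each class and return the argmax label."""
    scores = features @ weights                 # (num_classes,)
    return classes[int(np.argmax(scores))]

rng = np.random.default_rng(1)
features = rng.random(16)                       # stand-in transformer output
weights = rng.random((16, 3))                   # stand-in head weights
label = classify(features, weights,
                 ["entailment", "contradiction", "neutral"])
print(label)
```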
Regarding claim 10, Huang teaches The method of claim 1, wherein each of a plurality of concepts within the dictionary is represented by a key (Huang, see image below, the “VD index”, or Visual Dictionary Index, is being interpreted as a key), and wherein each of the plurality of concepts within the dictionary is linked to a predefined (Huang, see section 3.3 image from claim 1, “based on the virtual visual semantic labels produced by the visual dictionary” is being interpreted as involving “predefined”, which is being interpreted as existing before; just as classification labels are predefined in classical machine learning algorithms) set of image features (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as image concept [as can be seen in Figure 3 where index 191 is the image concept “head” and index 1074 is the image concept “building”], “visual feature” is being interpreted as “image features”).
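For illustration of this interpretation only, a concept-feature dictionary keyed by VD index, with each concept linked to a predefined feature vector, can be sketched as follows. The keys 191 and 1074 follow the Figure 3 examples cited above, while the feature vectors and the nearest-neighbor lookup are made-up assumptions:

```python
import numpy as np

# Each entry: integer key (the "VD index") -> concept name + predefined feature.
dictionary = {
    191:  {"concept": "head",     "feature": np.array([1.0, 0.0, 0.0])},
    1074: {"concept": "building", "feature": np.array([0.0, 1.0, 0.0])},
}

def lookup(image_feature):
    """Return (key, concept) of the dictionary entry nearest in feature space."""
    key = min(dictionary,
              key=lambda k: np.linalg.norm(dictionary[k]["feature"] - image_feature))
    return key, dictionary[key]["concept"]

key, concept = lookup(np.array([0.9, 0.1, 0.0]))
print(key, concept)   # 191 head
```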
Regarding claim 22, Huang teaches The method of claim 1, wherein the vision transformer is configured to include: a global task that, during training, clusters images with the same concept together to produce semantically consistent relational representations (Huang, Figure 1, reproduced below:
[media_image5.png, greyscale: Huang, Figure 1]
. “Global context” is being interpreted as “global task”. For example, “chatting” or “next to a boat” are examples of relational representations), and a local task that, during training, guides the vision transformer to discover object-centric semantic correspondence across images (Huang, pg 8, Section 4.4, reproduced below:
[media_image2.png, greyscale: excerpt of Huang, pg 8, Section 4.4]
. “VD index” is being interpreted as object-centric semantic correspondence [as can be seen in Figure 3, where index 191 is the object-centric semantic correspondence “head” and index 1074 is the object-centric semantic correspondence “building” across images]. This object-centric semantic correspondence is being interpreted as “a local task”; pg 2, column 1, second to last paragraph: “VD can be dynamically updated through our trainable CNN backbone directly from visual-language data during pretraining”. The Visual Dictionary is being interpreted as guiding the vision transformer during training).
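For illustration of the “global task” interpretation only, grouping images that share the same inferred concept can be sketched as follows; the image identifiers and concept assignments are hypothetical:

```python
from collections import defaultdict

def cluster_by_concept(image_concepts):
    """Group image ids by their inferred concept, clustering same-concept images."""
    clusters = defaultdict(list)
    for image_id, concept in image_concepts:
        clusters[concept].append(image_id)
    return dict(clusters)

clusters = cluster_by_concept([
    ("img_0", "head"), ("img_1", "building"), ("img_2", "head"),
])
print(clusters)   # {'head': ['img_0', 'img_2'], 'building': ['img_1']}
```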
Claim 14 is rejected using the same rationale as applied to claim 4 discussed above.
Claim 15 is rejected using the same rationale as applied to claim 5 discussed above.
Claim 16 is rejected using the same rationale as applied to claim 6 discussed above.
Claim 17 is rejected using the same rationale as applied to claim 7 discussed above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kim et al. (“Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)”, 2018) discloses a dictionary of concepts (Appendix A) and training using concept vectors.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNNY B DUONG whose telephone number is (571)272-1358. The examiner can normally be reached Monday - Thursday 10a-9p (ET).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.B.D./Examiner, Art Unit 2667
/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667