Prosecution Insights
Last updated: April 19, 2026
Application No. 17/893,038

PERFORMING VISUAL RELATIONAL REASONING

Non-Final OA §102
Filed
Aug 22, 2022
Examiner
DUONG, JOHNNYKHOI BAO
Art Unit
2667
Tech Center
2600 — Communications
Assignee
Nvidia Corporation
OA Round
3 (Non-Final)
66%
Grant Probability
Favorable
3-4
OA Rounds
3y 8m
To Grant
99%
With Interview

Examiner Intelligence

Grants 66% — above average
66%
Career Allow Rate
37 granted / 56 resolved
+4.1% vs TC avg
Strong +33% interview lift
+32.8%
Interview Lift
resolved cases with interview
Typical timeline
3y 8m
Avg Prosecution
10 currently pending
Career history
66
Total Applications
across all art units

Statute-Specific Performance

§101
5.6%
-34.4% vs TC avg
§103
50.9%
+10.9% vs TC avg
§102
36.3%
-3.7% vs TC avg
§112
4.4%
-35.6% vs TC avg
"vs TC avg" figures are relative to a Tech Center average estimate • Based on career data from 56 resolved cases
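Reading the deltas literally, each "vs TC avg" figure is simply the examiner's statute-specific rate minus the Tech Center baseline, which puts the implied baseline near 40% for every statute shown. A minimal sketch of that arithmetic follows; the interpretation of the delta is an assumption on our part, not something the tool documents.

```python
# Hedged reconstruction: assumes "vs TC avg" = examiner rate - Tech Center average,
# so the implied TC baseline is examiner_rate - delta. Figures are percentages.
examiner_rates = {"101": 5.6, "103": 50.9, "102": 36.3, "112": 4.4}
deltas_vs_tc   = {"101": -34.4, "103": 10.9, "102": -3.7, "112": -35.6}

for statute, rate in examiner_rates.items():
    implied_tc_avg = rate - deltas_vs_tc[statute]
    print(f"§{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_avg:.1f}%")
# Each statute implies a Tech Center baseline of roughly 40.0%.
```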

Office Action

§102
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Status

Claim(s) 1, 4-7, 10, 11, 14-17, 20, 22 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huang ("Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning", 2021), hereinafter referred to as Huang.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 01/09/2026 was filed and is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Amendment

The amendment filed on 01/09/2026 has been entered. Claims 1, 11, 16, 20 were amended. Claims 2, 3, 8, 9, 12, 13, 18, 19, 21 were cancelled. Claims 1, 4-7, 10, 11, 14-17, 20, 22 remain pending in the application. The Ma prior art reference has been withdrawn due to the 130(a) affidavit filed 01/09/2026. The 112(b) rejection has been withdrawn for claims 6 and 16 due to the amendments.

Response to Arguments

Applicant's arguments (remarks filed 01/09/2026) have been considered but are moot in view of the new ground(s) of rejection in view of Huang ("Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning", 2021).

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 4-7, 10, 11, 14-17, 20, 22 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huang ("Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning", 2021), hereinafter referred to as Huang.

Regarding claims 1, 11, and 20, Huang teaches A method comprising (Huang, abstract: "In this paper, we propose SOHO to 'See Out of tHe bOx' that takes a whole image as input, and learns vision-language representation in an end-to-end manner"), at a device: processing an image (Huang, pg 3, column 1, Section 3, ¶1, "The visual encoder takes an image as input", which is being interpreted as "processing an image"), by a vision transformer (Huang, see the excerpt cited below: "Transformer" and "visual" are being interpreted as including a "vision transformer") pretrained (Huang, see the excerpt cited below: "we propose a novel Masked Visual Modeling pre-training" is being interpreted as involving "pretrained") on a predefined (Huang, see the excerpt cited below, "based on the virtual visual semantic labels produced by the visual dictionary" is being interpreted as involving "predefined", which is being interpreted as existing before, just as classification labels are predefined in classical machine learning algorithms) concept-feature dictionary (Huang, pg 4, column 2, Section 3.3, ¶1 [excerpt reproduced in the original Office Action]. "Visual dictionary" is being interpreted as involving a concept-feature dictionary as seen from the "image-text matching pre-training tasks") that correlates image features with image concepts (Huang, pg 8, Section 4.4 [excerpt reproduced in the original Office Action]. "VD index" is being interpreted as image concept [as can be seen in Figure 3, where index 191 is the image concept "head" and index 1074 is the image concept "building"], and "visual feature" is being interpreted as "image features") to infer an associated concept for the image (Huang, Figure 1 and its caption show "Ours: A couple sit on the shore next to a boat on the sea", which is being interpreted as inferring "an associated concept for the image") that indicates a relationship (Huang, Figure 1, "Ours: A couple sit on the shore next to a boat on the sea"; "sit on the shore" and "next to a boat" are being interpreted as indicating "a relationship") between two or more objects as depicted in the image (Huang, Figure 1; "couple", "shore", "boat", and "sea" are being interpreted as "two or more objects depicted in the image"), wherein the vision transformer is comprised of a tokenizer, at least one layer for generating patch embeddings (Huang, Section 4.4 excerpt, "image patch"; when combined with pg 3, Figure 2, "Visual Dictionary-based embedding features", this shows "at least one layer for generating patch embeddings"), and at least one multi-head self-attention layer (Huang, pg 9, Section A.4 Discussion, ¶1 [excerpt reproduced in the original Office Action]. "Multi-layer Transformer" is being interpreted as involving at least one multi-head layer, and the "self-attention mechanism" is being interpreted as being part of one of those layers, resulting in a "multi-head self-attention layer". Further, a ResNet-101 backbone and 12-layer Transformer are used in this prior art, as seen in pg 5, Section 4.1; as one of ordinary skill in the art would know, this contains at least one multi-head self-attention layer, see note * below); and outputting the concept inferred for the image (Huang, Figure 1 and its caption show "Ours: A couple sit on the shore next to a boat on the sea", which is being interpreted as outputting the concept inferred for the image).

*Note: Knowledge reference on the 12-layer transformer (pg 3, column 2 mentions "self-attention heads", which is being interpreted as multi-head self-attention): https://arxiv.org/abs/1810.04805.

Regarding claim 4, Huang teaches The method of claim 1, wherein the image is not labeled with the concept when received as input (Huang, pg 5, column 1, last paragraph: "Detailed comparisons of pre-training dataset usage of most VLPT works, including our train/test image and text numbers, are included in our supplementary material". "Test image" is being interpreted as splitting the dataset into training and testing sets, as one of ordinary skill in the art would know. The testing sets are being interpreted as not labeled with the concept when received as input).

Regarding claim 5, Huang teaches The method of claim 1, wherein the vision transformer performs action prediction and object prediction utilizing the image (Huang, Figure 1 and its caption: "Ours: A couple sit on the shore next to a boat on the sea". "Sit" is being interpreted as "action prediction"; "couple", "shore", "boat", and "sea" are being interpreted as object prediction. Figure 1 shows the image on which these predictions are made. SOHO, the method, is being interpreted as involving a "vision transformer").

Regarding claim 6, Huang teaches The method of claim 1, wherein the concept (Huang, see the Figure 1 citation below; the caption is being interpreted as involving "the concept") indicates the relationship (Huang, Figure 1, "A couple sit on the shore next to" is being interpreted as involving a relationship) between the two or more objects (Huang, Figure 1, "couple" and "boat" are being interpreted as "two or more objects") as a tuple (Huang, Figure 1. A tuple, as one of ordinary skill in the art would know, is an ordered, finite sequence of elements; the order of the caption matters and it is a finite sequence of elements) of two objects and an associated action (Huang, Figure 1 and its caption: "Ours: A couple sit on the shore next to a boat on the sea". "Couple" and "boat" are being interpreted as examples of two objects with the associated action of "sit").

Regarding claim 7, Huang teaches The method of claim 1, further comprising, at the device: performing one or more classification operations utilizing the concept (Huang, pg 7, column 2, Section 4.2.4, ¶1 [excerpt reproduced in the original Office Action]. "Three-classification" is being interpreted as one or more classification operations; "output of the transformer" is being interpreted as "utilizing the concept").

Regarding claim 10, Huang teaches The method of claim 1, wherein each of a plurality of concepts within the dictionary is represented by a key (Huang, Section 4.4 excerpt; the "VD index", or Visual Dictionary index, is being interpreted as a key), and wherein each of the plurality of concepts within the dictionary is linked to a predefined (Huang, see the Section 3.3 excerpt cited for claim 1, "based on the virtual visual semantic labels produced by the visual dictionary" is being interpreted as involving "predefined", which is being interpreted as existing before, just as classification labels are predefined in classical machine learning algorithms) set of image features (Huang, pg 8, Section 4.4 [excerpt reproduced in the original Office Action]. "VD index" is being interpreted as image concept [as can be seen in Figure 3, where index 191 is the image concept "head" and index 1074 is the image concept "building"], and "visual feature" is being interpreted as "image features").

Regarding claim 22, Huang teaches The method of claim 1, wherein the vision transformer is configured to include: a global task that, during training, clusters images with the same concept together to produce semantically consistent relational representations (Huang, Figure 1 [reproduced in the original Office Action]. "Global context" is being interpreted as "global task"; for example, "chatting" or "next to a boat" are examples of "relational representations"), and a local task that, during training, guides the vision transformer to discover object-centric semantic correspondence across images (Huang, pg 8, Section 4.4 [excerpt reproduced in the original Office Action]. "VD index" is being interpreted as object-centric semantic correspondence [as can be seen in Figure 3, where index 191 is the object-centric semantic correspondence "head" and index 1074 is the object-centric semantic correspondence "building" across images]. This object-centric semantic correspondence is being interpreted as "a local task". Further, pg 2, column 1, second to last paragraph: "VD can be dynamically updated through our trainable CNN backbone directly from visual-language data during pretraining"; the Visual Dictionary is being interpreted as guiding the vision transformer during training).

Claim 14 is rejected using the same rationale as applied to claim 4 discussed above.
Claim 15 is rejected using the same rationale as applied to claim 5 discussed above.
Claim 16 is rejected using the same rationale as applied to claim 6 discussed above.
Claim 17 is rejected using the same rationale as applied to claim 7 discussed above.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Kim et al. ("Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)", 2018) discloses a dictionary of concepts (Appendix A) trained using concept vectors.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNNY B DUONG whose telephone number is (571) 272-1358. The examiner can normally be reached Monday - Thursday, 10a-9p (ET). Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Matthew Bella, can be reached at (571) 272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.B.D./ Examiner, Art Unit 2667
/MATTHEW C BELLA/ Supervisory Patent Examiner, Art Unit 2667
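For orientation, the claim elements the examiner maps onto Huang (a tokenizer and patch-embedding stage, at least one multi-head self-attention layer, and a dictionary that quantizes image features into concept indices) correspond to a fairly standard pipeline. The sketch below is illustrative only and assumes a PyTorch-style implementation; it is not the applicant's claimed method or Huang's SOHO code, and the class name, dimensions, and example relationship tuple are hypothetical placeholders.

```python
# Minimal, illustrative sketch of the claim elements as characterized in the OA:
# patch tokenizer/embedding, multi-head self-attention, and a "concept-feature
# dictionary" that maps patch features to concept indices. All names, sizes, and
# concepts are hypothetical, not taken from Huang or the application.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    def __init__(self, patch=16, dim=256, heads=4, num_concepts=2048):
        super().__init__()
        # Tokenizer / patch-embedding layer: split the image into patches and project them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # At least one multi-head self-attention layer.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # "Concept-feature dictionary": one learned feature vector per concept index.
        self.dictionary = nn.Parameter(torch.randn(num_concepts, dim))

    def forward(self, image):                       # image: (B, 3, H, W)
        tokens = self.patch_embed(image)            # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        feats = self.norm(tokens + attended)        # per-patch features
        # Quantize each patch feature to its nearest dictionary entry (a concept index),
        # analogous to the "VD index" lookup the OA attributes to Huang's visual dictionary.
        b, n, d = feats.shape
        dists = torch.cdist(feats.reshape(b * n, d), self.dictionary)
        concept_ids = dists.argmin(dim=-1).reshape(b, n)
        return feats, concept_ids

# Usage sketch: infer per-patch concept indices; a downstream step could, hypothetically,
# assemble a relationship tuple of the form (object, action, object), e.g. ("couple", "sit", "shore").
model = TinyVisionEncoder()
feats, concept_ids = model(torch.randn(1, 3, 224, 224))
print(concept_ids.shape)  # torch.Size([1, 196])
```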

Prosecution Timeline

Aug 22, 2022
Application Filed
May 12, 2025
Non-Final Rejection — §102
Aug 14, 2025
Response Filed
Oct 08, 2025
Final Rejection — §102
Dec 15, 2025
Interview Requested
Jan 05, 2026
Applicant Interview (Telephonic)
Jan 05, 2026
Examiner Interview Summary
Jan 09, 2026
Request for Continued Examination
Jan 23, 2026
Response after Non-Final Action
Feb 19, 2026
Non-Final Rejection — §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586187
LESION LINKING USING ADAPTIVE SEARCH AND A SYSTEM FOR IMPLEMENTING THE SAME
2y 5m to grant • Granted Mar 24, 2026
Patent 12525024
ELECTRONIC DEVICE, METHOD, AND COMPUTER READABLE STORAGE MEDIUM FOR DETECTION OF VEHICLE APPEARANCE
2y 5m to grant • Granted Jan 13, 2026
Patent 12518510
MACHINE LEARNING FOR VECTOR MAP GENERATION
2y 5m to grant • Granted Jan 06, 2026
Patent 12498556
Microscopy System and Method for Evaluating Image Processing Results
2y 5m to grant • Granted Dec 16, 2025
Patent 12488438
DEEP LEARNING-BASED IMAGE QUALITY ENHANCEMENT OF THREE-DIMENSIONAL ANATOMY SCAN IMAGES
2y 5m to grant • Granted Dec 02, 2025
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
66%
Grant Probability
99%
With Interview (+32.8%)
3y 8m
Median Time to Grant
High
PTA Risk
Based on 56 resolved cases by this examiner. Grant probability derived from career allow rate.
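The note above says the grant probability is derived from the career allow rate, and the figures shown are consistent with a simple composition of the raw counts and the interview lift. A minimal sketch of that arithmetic follows; the direct addition of the lift and the cap near 99% are assumptions on our part, not documented by the tool.

```python
# Hedged reconstruction of the projection figures from the numbers shown above.
# Assumes grant probability ≈ career allow rate and that the interview lift is
# added directly; the 99% ceiling is an assumption.
granted, resolved = 37, 56          # from the Examiner Intelligence panel
interview_lift = 0.328              # +32.8%

allow_rate = granted / resolved                     # ≈ 0.661, shown as 66%
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"Grant probability: {allow_rate:.0%}")       # 66%
print(f"With interview:    {with_interview:.0%}")   # 99%
```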
