Prosecution Insights
Last updated: April 19, 2026
Application No. 18/319,546

SCENE UNDERSTANDING USING LANGUAGE MODELS FOR ROBOTICS SYSTEMS AND APPLICATIONS

Non-Final OA — §102, §103
Filed
May 18, 2023
Examiner
HAIDER, SYED
Art Unit
2633
Tech Center
2600 — Communications
Assignee
Nvidia Corporation
OA Round
3 (Non-Final)
83%
Grant Probability (Favorable)
3-4
OA Rounds
2y 6m
To Grant
88%
With Interview

Examiner Intelligence

Grants 83% — above average
83%
Career Allow Rate (+21.4% vs TC avg)
709 granted / 850 resolved
+4.4%
Interview Lift (minimal)
resolved cases with interview vs. without
Typical timeline
2y 6m
Avg Prosecution
35 currently pending
Career history
885
Total Applications
across all art units
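These headline figures are simple ratios over the examiner's resolved cases. Below is a minimal sketch of the arithmetic; only the 709/850 split and the +4.4% lift come from this page, and the per-group interview counts are hypothetical stand-ins:

```python
# Career allow rate: granted / resolved, from the counts shown above.
granted, resolved = 709, 850
allow_rate = granted / resolved                 # 0.834 -> shown as 83%
print(f"Career allow rate: {allow_rate:.1%}")

# Interview lift: allow rate among resolved cases with an examiner interview
# minus the rate among cases without one. The per-group counts here are
# hypothetical stand-ins; this page reports the actual lift as +4.4%.
granted_int, resolved_int = 131, 150            # hypothetical: with interview
granted_no, resolved_no = 578, 700              # hypothetical: without interview
lift = granted_int / resolved_int - granted_no / resolved_no
print(f"Interview lift: {lift:+.1%}")           # +4.8% with these stand-ins
```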

Statute-Specific Performance

§101: 5.6% (-34.4% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 22.9% (-17.1% vs TC avg)
§112: 9.2% (-30.8% vs TC avg)
Deltas measured against a Tech Center average estimate • Based on career data from 850 resolved cases
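The per-statute figures compare this examiner against an estimated Tech Center average. The sketch below back-solves those averages from the deltas shown on this page (no independent TC data is used; notably, all four recover the same 40.0% baseline):

```python
# Statute-specific rates and deltas vs. the Tech Center average, as shown above.
# The TC average estimate is recovered as rate - delta.
stats = {          # statute: (examiner_rate_pct, delta_vs_tc_pct)
    "101": (5.6, -34.4),
    "103": (54.5, +14.5),
    "102": (22.9, -17.1),
    "112": (9.2, -30.8),
}
for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta          # e.g. 103: 54.5 - 14.5 = 40.0
    print(f"§{statute}: examiner {rate:.1f}% vs TC avg {tc_avg:.1f}% ({delta:+.1f} pts)")
```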

Office Action

§102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/05/2026 has been entered.

Response to Arguments

Applicant's arguments filed on 02/05/2026 with respect to claims 1-13 and 22-24 have been considered but are not persuasive.

Regarding independent claims 1 and 22 (and their respective dependent claims), Applicant argues:

"The Office Action maps Pirk's 'query embedding' (e.g., derived from voice input such as 'retrieve the red mug,' Pirk pars. 11, 75) to the 'query' of claim 1, and maps Pirk's object detection (using model 152/object detection engine 142 to find object regions, Pirk pars. 41-43) to the 'segmenting' of claim 1. But Pirk first processes images to identify regions/bounding boxes for objects in the scene (Pirk pars. 41-43, 58-60). Only after that detection step does Pirk obtain or compute a 'query embedding' and compare it to embeddings of the already-detected regions to select a target object (Pirk pars. 11, 51, 75-77). Thus, even assuming arguendo Pirk's 'query embedding' corresponds to the 'query' of claim 1 and Pirk's object detection corresponds to 'segmenting,' Pirk's detection of object regions (pars. 41-43) is performed prior to determining any query embedding (pars. 11, 75), and Pirk does not describe using the query embedding to influence or alter detection or segmentation. Pirk therefore does not disclose processing imaging data according to the query to segment a 3D object, as recited in amended claim 1. Accordingly, Applicant respectfully submits that claim 1 (and its dependent claims), and independent claim 22 (and its dependent claims) for the same or similar reasons, is not anticipated by Pirk, and requests withdrawal of the rejection under §102(a)(1)." (See Remarks, page 10.)

Examiner respectfully disagrees. First, the Examiner notes that claims 1 and 22, as drafted, do not require all of the obtaining, segmenting, providing, receiving, and causing-performance-of-control-operations steps. For instance, currently amended claim 1 recites "one or more circuits to: obtain a query, indicating at least one of: (i) an objective or goal to be accomplished; (ii) an interaction with a three-dimensional (3D) object or between the 3D object and a second 3D object; (iii) an action to be performed; (iv) a semantic attribute of the 3D object or a part thereof; or (v) a spatial relationship between two or more parts of the 3D object." As this claim language shows, only one condition/option needs to be met in order to reject independent claims 1 and 22 (and their respective dependent claims). In this scenario the Examiner selected option (i) to reject the claims; the remaining limitations are not given patentable weight, since they relate to the other conditions/options recited in the claims, and because of the alternative language "or" those options are not required.

The Examiner nonetheless cited paragraphs for some of those limitations for completeness, and to show that even if Applicant were to change "or" to "and," the Pirk reference would still read on the rest of the claim limitations. For instance, Pirk paragraph 11 discloses: "the query embedding can be determined based on voice input and/or based on an image of the target object. For example, 'red mug' in voice input of 'retrieve the red mug' can be mapped to a given point in the embedding space (e.g., through labeling of the embedding space with semantic text labels after training). Also, for example, a user can point to a 'red mug' and provide a visual, verbal, and/or touch command to retrieve similar objects. Image(s) of the 'red mug' can be captured (or cropped from a larger image using object recognition techniques), using the user's pointing as a queue." As can be seen from this passage, the robot first receives the query ("red mug") via voice input and then captures images of the red mug in order to "retrieve the red mug." Therefore, even if Applicant were to add the argued limitations to the claim, the Pirk reference would still read on them. The Examiner suggests that Applicant remove the alternative language "or" from the claim and further elaborate on the segmentation in order to overcome the cited reference.

Claim Objections

Claim 4 is objected to because of the following informalities: claim 4, line 2, recites "the machine" but should recite "the autonomous or semi-autonomous machine." Appropriate correction is required. Similar issues exist in claims 5, 11, 13, 23, and 24, where "the machine" should be replaced with "the autonomous or semi-autonomous machine."

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-6, 11-13, and 22-24 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Pirk (US PGPUB 2021/0334599 A1).

As per claim 1, Pirk discloses at least one processor (Pirk, Fig. 1:190) comprising one or more circuits (Pirk, paragraphs 10, 56, and 74) to: obtain a query indicating at least one of: (i) an objective or goal to be accomplished (Pirk, paragraphs 11, 24-25, 37, and 75, disclose that the query embedding can be determined based on voice input and/or based on an image of an object; for example, "red mug" in voice input of "retrieve the red mug" can be mapped to a given point in the embedding space, e.g., through labeling of the embedding space with semantic text labels after training). Although the Examiner has cited related paragraphs of the cited reference for the rest of the claim limitations, in case Applicant removes the alternative claim language and incorporates those limitations into the pending claim, said limitations are not given patentable weight since they are not required by the claim as drafted. They are as follows: (ii) an interaction with a three-dimensional (3D) object or between the 3D object and a second 3D object; (iii) an action to be performed; (iv) a semantic attribute of the 3D object or a part thereof; or (v) a spatial relationship between two or more parts of the 3D object; obtain imaging data representing the 3D object (Pirk, paragraphs 11 and 40-41); process the imaging data according to the query to segment the 3D object in the imaging data (Pirk, Fig. 1:152:142, and paragraphs 7, 11, 41-43, disclose that image(s) of the "red mug" can be captured, or cropped from a larger image using object recognition techniques), the segmenting including: providing, to a model, the imaging data representing the 3D object (Pirk, Fig. 1:142:152, and paragraphs 13, 41-43); and receiving, from the model, an identification of one or more regions of the 3D object in the imaging data (Pirk, paragraph 13, discloses processing the first image using an object recognition model to identify a plurality of first object regions in the first image); and cause performance of one or more control operations of an autonomous or semi-autonomous machine to act based at least on the identification of the one or more regions of the 3D object (Pirk, paragraphs 11 and 78, disclose that the robot can be controlled to interact with a target object by: determining a query embedding, in an embedding space of the object-contrastive model; processing a robot image, from a vision component of the robot, using the object-contrastive model; determining, based on the processing, a target object in a current environment of the robot; and controlling the robot to interact with the target object).

As per claim 2, Pirk further discloses the at least one processor of claim 1, wherein the identification comprises a segmentation mask (Pirk, Fig. 2A:210A:250).

As per claim 3, Pirk further discloses the at least one processor of claim 1, wherein the identification comprises a pointwise label or set of pixels (Pirk, paragraph 11).

As per claim 4, Pirk further discloses the at least one processor of claim 1, wherein the causing performance of the one or more control operations of the machine comprises generating an instruction to interact with one or more parts of the 3D object (Pirk, paragraphs 11, 25, and 78).

As per claim 5, Pirk further discloses the at least one processor of claim 1, wherein the causing performance of the one or more control operations of the machine comprises generating an instruction to interact with the 3D object and the second 3D object (Pirk, paragraph 4, discloses generating or identifying a query embedding in an embedding space of the trained model, where the query embedding represents an embedding of rich feature(s) of target object(s) to be interacted with by the robot), based at least on one or more attributes of the 3D object or the second 3D object identified via segmenting (Pirk, paragraphs 9, 11, and 37).

As per claim 6, Pirk further discloses the at least one processor of claim 1, wherein the query is provided to the model to obtain the identification of the one or more regions (Pirk, paragraph 7).

As per claim 11, Pirk further discloses the at least one processor of claim 1, wherein the causing performance of the one or more control operations of the machine comprises generating an instruction and transmitting the instruction to a control system of the machine to cause an interaction with the 3D object based at least on the identification of the one or more regions of the 3D object (Pirk, paragraphs 4, 11, and 78).

As per claim 12, Pirk further discloses the at least one processor of claim 1, wherein obtaining the query comprises: receiving, prior to segmenting the 3D object in the imaging data, an action to be performed with respect to the 3D object (Pirk, paragraph 4, discloses a trained model used in processing vision data captured by a vision component of a robot, generating embeddings based on the processing, and controlling the robot based at least in part on the generated embeddings; for instance, some implementations can generate or identify a query embedding in an embedding space of the trained model, where the query embedding represents an embedding of rich feature(s) of target object(s) to be interacted with by the robot); and generating the query based at least on the action (Pirk, paragraphs 4, 11, and 25).

As per claim 13, Pirk further discloses the at least one processor of claim 1, wherein the at least one processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more large language models (LLMs); a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources (Pirk, paragraph 53).

As per claim 22, please see the analysis of claim 1.

As per claim 23, Pirk further discloses the method of claim 22, wherein the method further comprises transmitting the instruction to a control system capable of performing the one or more control operations using the machine (Pirk, paragraphs 4, 11, and 78).

As per claim 24, Pirk further discloses the method of claim 22, wherein the instruction is to cause the machine to interact with one or more parts of the 3D object in performing the one or more control operations (Pirk, paragraphs 11, 25, and 78).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Pirk (US PGPUB 2021/0334599 A1) in view of Ahmed (NPL document "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", hereinafter Ahmed).

As per claim 7, Pirk discloses the at least one processor of claim 1, but does not explicitly disclose that the model is updated using training data comprising natural language descriptions of relationships between a plurality of parts of the 3D object. Ahmed discloses a model updated using training data comprising natural language descriptions of relationships between a plurality of parts of the 3D object (Ahmed, Abstract, and Spatial Relation Prediction: given the input objects sequence O, the task of the object transformer encoder is to predict the spatial relationship for some annotated object pairs (oi, oj); the goal of this task is to encourage the object transformer to understand the spatial relationship between the objects in the 3D scene). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Pirk's teachings by implementing a transformer network in the system, as taught by Ahmed. The motivation would be to provide an improved system for distinguishing the target object from other objects in the scene (Ahmed, Introduction).

As per claim 8, Pirk in view of Ahmed further discloses the at least one processor of claim 7, wherein the plurality of parts of the 3D object are obtained using a dataset of 3D objects annotated with hierarchical 3D part information (Ahmed, section 3.7, Spatial Relation Prediction).

As per claim 9, Pirk in view of Ahmed further discloses the at least one processor of claim 7, wherein the natural language descriptions of the relationships are generated at least in part using a language model that produces human-like text (Ahmed, Introduction and section 3.2, disclosing that the proposed model employs a transformer architecture for each modality, i.e., for both visual and textual data).

As per claim 10, Pirk in view of Ahmed further discloses the at least one processor of claim 9, wherein the language model comprises a generative transformer network that provides the natural language descriptions in response to queries that are related to spatial relationships between the plurality of parts of the 3D object (optional limitations not given patentable weight, as explained above).

Allowable Subject Matter

Claims 14, 16-17, and 21 are allowed.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SYED Z HAIDER, whose telephone number is (571) 270-5169. The examiner can normally be reached Monday-Friday, 9-5:30 EST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, SAM K Ahn, can be reached at 571-272-3044. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SYED HAIDER/
Primary Examiner, Art Unit 2633
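The core §102 dispute above is about ordering: Applicant reads Pirk as detecting object regions first and only afterwards matching a query embedding against them, while amended claim 1 recites processing the imaging data according to the query to perform the segmenting. Below is a minimal Python sketch of the two data flows; every function and name is hypothetical (neither Pirk nor the application discloses this code), and the stand-in encoder is random, so the demo's selection is arbitrary rather than semantic:

```python
import numpy as np

def embed(x) -> np.ndarray:
    """Hypothetical stand-in encoder: maps text or an image crop to a unit
    vector in a shared embedding space (the role Pirk's object-contrastive
    model plays). Seeded per input so the demo is repeatable."""
    rng = np.random.default_rng(sum(str(x).encode()))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def detect_then_match(image, query_text, detect_regions):
    """Flow A: detect-then-match, as Applicant characterizes Pirk
    (pars. 41-43, 75-77). Regions are found with no knowledge of the query;
    the query embedding only selects among already-detected regions."""
    regions = detect_regions(image)              # query plays no role here
    q = embed(query_text)
    scores = [float(q @ embed(r)) for r in regions]
    return regions[int(np.argmax(scores))]       # best-matching region

def query_conditioned_segment(image, query_text, seg_model):
    """Flow B: query-conditioned segmentation, as recited in amended claim 1
    ("process the imaging data according to the query to segment"). The query
    embedding is an input to the segmentation model itself."""
    return seg_model(image, embed(query_text))   # mask depends on the query

if __name__ == "__main__":
    crops = ["red mug crop", "blue bowl crop"]   # hypothetical region crops
    picked = detect_then_match(None, "retrieve the red mug",
                               detect_regions=lambda img: crops)
    # With a trained encoder the red-mug crop would score highest; here the
    # random stand-in makes the choice arbitrary.
    print("detect-then-match selected:", picked)
```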

Prosecution Timeline

May 18, 2023
Application Filed
Jul 25, 2025
Non-Final Rejection — §102, §103
Sep 23, 2025
Response Filed
Nov 05, 2025
Final Rejection — §102, §103
Feb 05, 2026
Request for Continued Examination
Feb 20, 2026
Response after Non-Final Action
Mar 13, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602430
Method for Constructing Positioning DB Using Clustering of Local Features and Apparatus for Constructing Positioning DB
2y 5m to grant • Granted Apr 14, 2026
Patent 12604296
NETWORKED ULTRAWIDEBAND POSITIONING
2y 5m to grant • Granted Apr 14, 2026
Patent 12597163
Systems and Methods to Optimize Imaging Settings for a Machine Vision Job
2y 5m to grant • Granted Apr 07, 2026
Patent 12586394
METHOD, APPARATUS AND SYSTEM FOR AUTO-LABELING
2y 5m to grant • Granted Mar 24, 2026
Patent 12579676
EGO MOTION-BASED ONLINE CALIBRATION BETWEEN COORDINATE SYSTEMS
2y 5m to grant • Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
88%
With Interview (+4.4%)
2y 6m
Median Time to Grant
High
PTA Risk
Based on 850 resolved cases by this examiner. Grant probability derived from career allow rate.
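Per the note above, grant probability is derived from the career allow rate, with the interview lift added on top. A short check of the displayed numbers:

```python
# Grant probability derived from the career allow rate (709/850), per the
# note above; the "with interview" figure adds the +4.4% interview lift.
base = 709 / 850                        # 0.834 -> shown as 83%
with_interview = min(1.0, base + 0.044)
print(f"Base grant probability: {base:.0%}")            # 83%
print(f"With interview (+4.4%): {with_interview:.0%}")  # 88%
```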
