Prosecution Insights
Last updated: April 19, 2026
Application No. 18/770,252

EMBODIED INTELLIGENCE-BASED METHOD, APPARATUS, DEVICE AND MEDIUM FOR INDUSTRIAL PART SORTING PROCESSING

Non-Final OA: §102, §103, §112
Filed: Jul 11, 2024
Examiner: GAMMON, MATTHEW CHRISTOPHER
Art Unit: 3657
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: BEIHANG UNIVERSITY
OA Round: 1 (Non-Final)
Grant Probability: 65% (Moderate)
OA Rounds: 1-2
To Grant: 2y 9m
With Interview: 88%

Examiner Intelligence

Grants 65% of resolved cases.
Career Allow Rate: 65% (66 granted / 102 resolved; +12.7% vs TC avg)
Interview Lift: +23.4% for resolved cases with interview (strong lift)
Typical timeline: 2y 9m avg prosecution; 32 currently pending
Career history: 134 total applications across all art units

Statute-Specific Performance

§101: 7.4% (-32.6% vs TC avg)
§103: 32.4% (-7.6% vs TC avg)
§102: 26.8% (-13.2% vs TC avg)
§112: 31.1% (-8.9% vs TC avg)
Tech Center averages are estimates • Based on career data from 102 resolved cases
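
The deltas above can be folded back into the comparison baseline: the implied Tech Center average for each statute is the examiner's rate minus the displayed delta. Below is a minimal sketch using the figures from the table; how the dashboard defines the per-statute rates themselves is not stated, so only the arithmetic is shown.

```python
# Implied Tech Center average = examiner rate - delta vs TC avg
# (figures copied from the Statute-Specific Performance table above)
examiner_rate = {"101": 7.4, "103": 32.4, "102": 26.8, "112": 31.1}     # percent
delta_vs_tc   = {"101": -32.6, "103": -7.6, "102": -13.2, "112": -8.9}  # percentage points

for statute in examiner_rate:
    tc_avg = examiner_rate[statute] - delta_vs_tc[statute]
    print(f"§{statute}: examiner {examiner_rate[statute]:.1f}%, implied TC average {tc_avg:.1f}%")
```

Every row resolves to the same 40.0% implied baseline, which suggests the comparison uses a single Tech Center estimate rather than per-statute averages.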

Office Action

§102 §103 §112
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Claim Objections

Claims 2, 9, and 16 are objected to because of the following informalities: Regarding Claims 2, 9, and 16, the claims recite “physical environment”. It should read “a physical environment”. Appropriate correction is required.

Specification

The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.

Claim Interpretation

General Notes: Applicant makes frequent use of terms which under their plain meaning would appear to not have any particular distinction without further limitation being provided within the claims, and furthermore leave the claims as potentially disconnected and broader as they do not refer back to items which may normally as disclosed be the same. Examples include:
“feature description” introduced as part of what comprises the “part sorting description information” and “feature … of the to-be-sorted parts”.
“part sorting description information”, “sorting requirement description”, and “sorting requirement”.
“feature description” and “feature”.

Applicant has made frequent use of phrases such as “for” and “to”. In some cases, these terms may not amount to an actual positive recitation of a claim limitation and appear to only be provided for context of the actual limitations within the claim or otherwise indicate an intended result, use, or purpose of a preceding limitation. While in the interest of compact prosecution and where expedient Examiner has provided prior art which Examiner believes teaches these recitations, whether considered positively recited or not, Applicant is advised to review their claims and validate all claim limitations which Applicant desires to be clearly and positively recited. This may require amendment to more clear and precise phrasing or construction. For example, Claim 1 recites “performing … analysis processing … to obtain …”. If the act of obtaining is desired to be very clearly and explicitly claimed it would better be constructed as “obtaining … by performing … analysis processing” or similar, wherein the action is claimed, and the exact means by which the action is performed, accomplished, etc. is further recited in a clearly linked manner (e.g. “by”).

Applicant recites various “model”. However, without further limitation to these terms, there is no inherent requirement that a given model is distinct, independent, and/or separate from another.

The adjective “industrial” is frequently used before other terms. This does not appear to inherently alter the scope of the terms without further limitations being provided, which appear absent in the present claims, the term “industrial” being particularly broad and inherent under the broadest reasonable interpretation thereof through mere context of a “robotic arm” already required in the limitations of all of the claims.

The terms “feature description” and “sorting requirement” have especially broad meaning under their plain English meaning. Furthermore, even as disclosed these terms appear particularly broad.
See [0052] – [0054] of Applicant’s specification which recites: “The part sorting description information includes the feature description and the sorting requirement description of the to-be-sorted parts, which are formed by natural language descriptions. For example, the part description information is “sort all bearing parts into one box”. The feature description mainly defines an attribute of the to-be-sorted parts, such as the material, use and shape of the to-be-sorted parts. The sorting requirement description mainly refers to an instruction of where the parts are sorted, such as how many boxes the to-be-sorted parts are to be sorted into” The disclosures use of “such as” indicates that these are mere examples and non-limiting. The terms “global vision feature” and “local global feature” do not appear to be disclosed with any particularity (and consequently also not claimed with any particularity). They thus appear to be especially broad terms. Claim Rejections - 35 USC § 112(b) The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 2, 6 – 7, 8 – 14, 16, and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Regarding Claims 2 and 16, the claims recite the phrase “correspondingly” in the limitation “correspondingly, controlling the robotic arm …”. It is unclear what scope this phrase imparts to the claim. If extremely broad, such that any correspondence suffices to read on the claim, then this is already inherent to “controlling the robotic arm …”, if it indicates any particular correspondence, it should be clearly claimed as it is wholly unclear what it is. Claim 9 which otherwise matches these claims does not use this phrasing and further indicates that it is of a non-limiting nature. In the interest of compact prosecution, the limitation has been interpreted as simply reading “controlling the robotic arm” (the “correspondingly,” being removed). Regarding Claim 8, the claim is constructed as being directed towards the statutory category of a machine (MPEP 2106 relates). However, the claim recites the limitation “the processor executes the computer-executed instruction stored in the memory to:” which exhibits two issues. First, the instructions are already indicated as executed (past tense) rather than executable or similar. Second, the claim indicates what the machine does rather than what it is. “Features of an apparatus may be recited either structurally or functionally. In re Schreiber, 128 F.3d 1473, 1478, 44 USPQ2d 1429, 1432 (Fed. Cir. 1997)” and “[A]pparatus claims cover what a device is, not what a device does.” Hewlett-Packard Co.v.Bausch & Lomb Inc., 909 F.2d 1464, 1469, 15 USPQ2d 1525, 1528 (Fed. Cir. 1990) (emphasis in original)” (MPEP 2114). 
This claim construction is neither a structural or functional limitation, but clearly should be constructed as such otherwise the claim is of trivial nature, merely reciting a generic computer. Furthermore, Applicant’s dependent claims use the phrasing “the processor is further configured to” which do not exhibit this issue and follow proper form for a machine claim. However, the processor was never first configured to do anything as presently constructed in Claim 8, and thus there is nothing “further” to configure. This indicates to the Examiner that Claim 8 was originally intended to use such phrasing. In the interest of compact prosecution, the limitation is instead interpreted as reading: “the processor is configured to” or similar. The limitation of “computer-executed instructions” may similarly be further clarified through such language as “computer executable instructions” or similar. Regarding Claims 6 – 7, 9 – 14, and 20, the claims depend from claim(s) rejected above and inherit the deficiencies of said claim(s) as described above. Therefore, Claims 6 – 7, 9 – 14, and 20 are rejected under the same logic presented above. Claim Rejections - 35 USC § 102 The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claims 1 – 3, 6, 8 – 10, 13, 15 – 17, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huang et al. (Huang, Wenlong, et al. "Voxposer: Composable 3d value maps for robotic manipulation with language models." arXiv preprint arXiv:2307.05973 (2023)). Regarding Claim 1, Huang teaches: An embodied intelligence-based method for industrial part sorting processing, comprising: receiving part sorting description information inputted by a user, wherein the part sorting description information comprises a feature description and a sorting requirement description of to-be-sorted parts (See at least Page 1, Abstract “In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects” and Figure 1 wherein the first image-paired example of a user instruction is “Sort trash to blue tray”. Feature descriptions include “trash”, “blue”, and “tray”. Sorting requirement descriptions include “trash”, “blue”, and “tray”. 
See various other examples throughout, such as those in Figure 3 as necessary); performing, according to the part sorting description information, analysis processing by adopting a task instruction understanding model (See at least Large Language Model in Figure 2), to obtain a control signal for controlling a robotic arm to sort the to-be-sorted parts (See at least Figure 3 illustrating controlled robot actions), wherein the task instruction understanding model is a neural network model which is pre-trained based on a large language model (See at least Page 2, Section 2, “In this work, we leverage LLMs for zero-shot in-the-wild cost specification with superior generalization” and Page 3, Section 2, “we rely on pre-trained language models for their open-world knowledge and tackle the more challenging robotic manipulation in 3D”) for extracting a sorting requirement from text and generating a sorting control signal based on the sorting requirement (See Figure 2 wherein user input is taken which may include as shown in Figure 1 sorting requirements in text and generates code); performing, according to the control signal, analysis processing on an image of a to-be-sorted part by adopting an intelligent perception model for parts (See at least Page 4, Section 3.2, “Specifically, when an instruction is given as a comment in the code, LLMs can be prompted to 1) call perception APIs (which invoke vision language models (VLM) such as an open-vocabulary detector [13–15]) to obtain spatial-geometrical information of relevant objects, 2) generate NumPy operations to manipulate 3D arrays, and 3) pre scribe precise values at relevant locations”), to obtain a category and a grasp pose of the to-be-sorted part (See at least Pages 3 and 4, Section 3.1, “The central problem investigated in this work is to generate a motion trajectory τri for robot r and each manipulation phase described by instruction ℓi. We represent τri as a sequence of dense end-effector waypoints to be executed by an Operational Space Controller [117], where each waypoint consists of a desired 6-DoF end-effector pose, end-effector velocity, and gripper action”), wherein the image is collected by an industrial camera (See at least Page 20, Section A.4, “We mount two RGB-D cameras (Azure Kinect) at two opposite ends of the table: bottom right and top left from the top down view. At the start of each rollout, both cameras start recording and return the real-time RGB-D observations at 20 Hz” and Figure 2, in particular Cam #1 and Cam #2), and the intelligent perception model for parts is a neural network model which is pre-trained for determining a category and a grasp pose of a part based on image recognition analysis (See again at least Page 4, Section 3.2 with respect to the Vision Language Model); controlling the robotic arm to sort the to-be-sorted part according to the control signal, the category and the grasp pose of the to-be-sorted part (See at least Pages 3 and 4, Section 3.1, “The central problem investigated in this work is to generate a motion trajectory τri for robot r and each manipulation phase described by instruction ℓi. 
We represent τri as a sequence of dense end-effector waypoints to be executed by an Operational Space Controller [117], where each waypoint consists of a desired 6-DoF end-effector pose, end-effector velocity, and gripper action” and Figure 3 illustrating control for several tasks), and performing obstacle-avoidance processing through an intelligent obstacle-avoidance neural network model (See at least Page 1, Abstract, “We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations” and Page 6, Section 4, “Motion Planner. We consider only affordance and avoidance maps in the planner optimization, which finds a sequence of collision-free end-effector positions p1:N ∈ R3 using greedy search”) in a sorting process (See above context). Regarding Claim 2, Huang teaches: The method according to claim 1, wherein performing the obstacle-avoidance processing through the intelligent obstacle-avoidance neural network model in the sorting process comprises: in the sorting process, processing, by adopting the intelligent obstacle-avoidance neural network model, voxel data of different angles of physical environment around the robotic arm which is collected in real time by the industrial camera (See at least Page 20, Section A.4, “We mount two RGB-D cameras (Azure Kinect) at two opposite ends of the table: bottom right and top left from the top down view. At the start of each rollout, both cameras start recording and return the real-time RGB-D observations at 20 Hz” and Figure 2, in particular Cam #1 and Cam #2), to obtain a three-dimensional spatial structure of the physical environment around the robotic arm (See at least “3D voxel maps” of Page 2, Section 1), wherein the intelligent obstacle-avoidance neural network model is a pre-trained (See at least caption of Figure 2, “The entire process does not involve any additional training”) intelligent model for performing dynamic three-dimensional reconstruction according to voxel data (See at least Page 4 and 5, Section 3.2, “Concretely, we aim to obtain a voxel value map Vt i = VoxPoser(ot,ℓi) by prompting an LLM and executing the code via a Python interpreter, where ot is the RGB-D observation at time t and ℓi is the current instruction”); processing the three-dimensional spatial structure by adopting a robotic arm kinematics model, to obtain a motion trajectory of the robotic arm (See at least Page 6, Section 4, “Dynamics Model. 
We use the known robot dynamics model in all tasks, where it is used in motion planning for the end-effector to follow the waypoints” and/or Page 21, Section A.5, “The controller takes as input a desired end-effector 6-DoF pose, calculates a se quence of interpolated waypoints using inverse kinematics, and finally follows the waypoints using a PDcontroller”); correspondingly, controlling the robotic arm to sort the to-be-sorted part according to the control signal, the category and the grasp pose of the to-be-sorted part comprises: controlling, according to the control signal, the category and the grasp pose of the to-be-sorted part, the robotic arm to sort the to-be-sorted part following the motion trajectory of the robotic arm (See at least Pages 3 and 4, Section 3.1, “The central problem investigated in this work is to generate a motion trajectory τri for robot r and each manipulation phase described by instruction ℓi. We represent τri as a sequence of dense end-effector waypoints to be executed by an Operational Space Controller [117], where each waypoint consists of a desired 6-DoF end-effector pose, end-effector velocity, and gripper action” and Figure 3 illustrating control for several tasks). Regarding Claim 3, Huang teaches: The method according to claim 1, wherein performing, according to the part sorting description information, the analysis processing by adopting the task instruction understanding model, to obtain the control signal for controlling the robotic arm to sort the to-be-sorted parts comprises: performing understanding processing on the part sorting description information through a natural language understanding portion of the task instruction understanding model to obtain a feature and a sorting requirement of the to-be-sorted parts, and generating the control signal based on the feature and the sorting requirement of the to-be-sorted parts through a signal generation portion of the task instruction understanding model (See previous recitations and/or Figure 2 showing the flow of data and processing thereof. It is clear that natural language is provided, which is processed by an LLM which has “natural language understanding” portion(s) and is related to the part(s) in a scene and as shown with respect to Claim 1 has the features claimed. Furthermore, it is clear that signals are generated for control. In summary, this claim does not appear to significantly narrow from Claim 1 despite the length of the claim). Regarding Claim 6, Huang teaches: The method according to claim 2, wherein in the sorting process, processing, by adopting the intelligent obstacle-avoidance neural network model, the voxel data of different angles of the physical environment around the robotic arm which is collected in real time by the industrial camera, to obtain the three-dimensional spatial structure of the physical environment around the robotic arm comprises: controlling multiple industrial cameras pre-arranged around the robotic arm to scan an activity space of the robotic arm at different angles and acquire the voxel data of different angles of the physical environment around the robotic arm (See at least Page 20, Section A.4, “We mount two RGB-D cameras (Azure Kinect) at two opposite ends of the table: bottom right and top left from the top down view. 
At the start of each rollout, both cameras start recording and return the real-time RGB-D observations at 20 Hz” and Figure 2, in particular Cam #1 and Cam #2); inputting the voxel data of different angles of the physical environment around the robotic arm into the intelligent obstacle-avoidance neural network model to perform three-dimensional space feature extraction and perform three-dimensional space reconstruction according to an extracted feature, to obtain the three-dimensional space structure (See at least Page 2, Section 1, “By generating Python code to invoke perception APIs, LLMs can obtain spatial-geometric information of relevant objects or parts” and Page 6, Section 4, “VLMs and Perception. Given an object/part query from LLMs, we first invoke open-vocab detector OWL-ViT [15] to obtain a bounding box, then feed it into Segment Anything [118] to obtain a mask, and finally track the mask using video tracker XMEM [119]. The tracked mask is used with RGB-D observation to reconstruct the object/part point cloud”). Regarding Claims 8 – 10, 13, 15 – 17, and 20, the claims are directed to effectively the same subject matter as Claims 1 – 3 and 6 with respect to the application of prior art. The claims are therefore rejected under the same logic as Claims 1 – 3 and 6 above. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. Claims 5 – 6, 12 – 13, and 19 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. further in view of Jie et al. (Jie, Shibo, and Zhi-Hong Deng. "Convolutional bypasses are better vision transformer adapters." arXiv preprint arXiv:2207.07039 (2022)). 
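
The Claim 6 mapping above cites Huang's perception chain (open-vocabulary detection, segmentation, tracking, then point-cloud reconstruction from the RGB-D observation). As a rough illustration of the final step only, the sketch below back-projects a masked depth image into a camera-frame point cloud, assuming a standard pinhole model with hypothetical intrinsics; the detector, segmenter, and tracker named in the citation are treated as upstream inputs rather than implemented here.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into camera-frame 3D points.

    depth : (H, W) array of depths in meters (e.g., from an RGB-D camera)
    mask  : (H, W) boolean array from an upstream segmenter/tracker
    fx, fy, cx, cy : pinhole intrinsics (assumed known from calibration)
    """
    v, u = np.nonzero(mask & (depth > 0))   # pixel rows/cols inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                   # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) point cloud

# Hypothetical usage with synthetic data:
depth = np.full((480, 640), 0.8)            # flat scene 0.8 m from the camera
mask = np.zeros((480, 640), dtype=bool)
mask[200:280, 300:380] = True               # pretend this region is the detected part
cloud = mask_to_point_cloud(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)                          # (6400, 3)
```
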
Regarding Claim 4, Huang teaches: The method according to claim 1, wherein performing, according to the control signal, the analysis processing on the image of the to-be-sorted part by adopting the intelligent perception model for parts, to obtain the category and the grasp pose of the to-be-sorted part, wherein the image is collected by the industrial camera, comprises: controlling, according to the control signal, the industrial camera to acquire the image of the to-be-sorted part, wherein the image comprises an RGB image and a depth image (See at least Page 20, Section A.4, “We mount two RGB-D cameras (Azure Kinect) at two opposite ends of the table: bottom right and top left from the top down view. At the start of each rollout, both cameras start recording and return the real-time RGB-D observations at 20 Hz” and Figure 2, in particular Cam #1 and Cam #2); inputting the RGB image and the depth image into the intelligent perception model for parts for performing feature analysis processing, to obtain the category and the grasp pose of the to-be-sorted part (See at least Page 4 and 5, Section 3.2, “Concretely, we aim to obtain a voxel value map Vt i = VoxPoser(ot,ℓi) by prompting an LLM and executing the code via a Python interpreter, where ot is the RGB-D observation at time t and ℓi is the current instruction”); … Huang does not explicitly teach the exact nature of the Vision Langue Model (VLM) utilized instead providing several references disclosing particular models, but Jie explicitly teaches a VLM of the claimed structure below: … wherein the intelligent perception model for parts comprises a Transformer layer (See at least “Transformer Layer” in Figure 3), a convolutional neural network layer (See at least “1x1 Conv” and “3x3 Conv” in Figure 3 and Page 3, Section 3.2, “As illustrated in Figure 3 , a Convpass module consists of three convolutional layers”) and a fully connected layer (See at least Page 2, Section 2.1, “An MLP block consists of two fully-connected (FC) lay-ers” and MLP in Figure 3). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to utilize the Vision Transformer Adapter of Jie, or at least features thereof, in the VLM of Huang with a reasonable expectation of success. The components claimed are well-known and routine, Huang already indicates that they are focused on leveraging existing understanding, and Jie indicates that “Convpass benefits from the hard-coded inductive bias of convolutional layers and thus is more suitable for visual tasks” (Abstract) when comparing to other Vision Transformers (ViT) which are utilized in VLMs. 
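
To make the structure mapped for Claim 4 concrete (a Transformer layer, 1x1 and 3x3 convolutional layers, and a fully connected layer), here is a minimal PyTorch sketch that also mirrors the global/local feature split discussed for Claim 5 below; the dimensions, the way the two branches are pooled and fused, and the output heads are illustrative assumptions, not Jie's Convpass module or Applicant's claimed model.

```python
import torch
import torch.nn as nn

class PartPerceptionHead(nn.Module):
    """Illustrative head combining a Transformer layer, conv layers, and an FC layer."""

    def __init__(self, dim=256, patches=196, num_categories=10, pose_dim=6):
        super().__init__()
        self.pose_dim = pose_dim
        # Transformer layer over patch tokens (global context)
        self.transformer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Convolutional branch over the token grid (local context): 1x1 -> 3x3 -> 1x1
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim // 4, kernel_size=1),
            nn.Conv2d(dim // 4, dim // 4, kernel_size=3, padding=1),
            nn.Conv2d(dim // 4, dim, kernel_size=1),
        )
        # Fully connected layer mapping pooled features to category logits and a grasp pose
        self.fc = nn.Linear(2 * dim, num_categories + pose_dim)
        self.side = int(patches ** 0.5)

    def forward(self, tokens):                              # tokens: (B, patches, dim)
        global_feat = self.transformer(tokens).mean(dim=1)  # (B, dim) global visual feature
        grid = tokens.transpose(1, 2).reshape(tokens.size(0), -1, self.side, self.side)
        local_feat = self.conv(grid).mean(dim=(2, 3))       # (B, dim) local visual feature
        out = self.fc(torch.cat([global_feat, local_feat], dim=1))
        return out[:, :-self.pose_dim], out[:, -self.pose_dim:]  # category logits, 6-DoF pose

# Hypothetical usage with random patch tokens (e.g., from RGB and depth encoders):
head = PartPerceptionHead()
logits, pose = head(torch.randn(2, 196, 256))
print(logits.shape, pose.shape)             # torch.Size([2, 10]) torch.Size([2, 6])
```

The split matters for Claim 5, where the Transformer branch would supply the global visual feature, the convolutional branch the local visual feature, and the fully connected layer would map both to a category and a grasp pose.
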
Regarding Claim 5, the combination of Huang and Jie teaches: The method according to claim 4, Huang, in the context of the combination with Jie, further teaches: wherein inputting the RGB image and the depth image into the intelligent perception model for parts for performing the feature analysis processing, to obtain the category and the grasp pose of the to-be-sorted part comprises: performing feature extraction on the RGB image and the depth image through the Transformer layer of the intelligent perception model for parts to obtain a global visual feature of the to-be-sorted part; performing feature extraction on the RGB image and the depth image through the convolutional neural network layer of the intelligent perception model for parts to obtain a local visual feature of the to-be-sorted part; determining, according to the global visual feature and the local visual feature, the category and the grasp pose of the to-be-sorted part through the fully connected layer of the intelligent perception model for parts (See at least Page 2, Section 1, “By generating Python code to invoke perception APIs, LLMs can obtain spatial-geometric information of relevant objects or parts” and Page 6, Section 4, “VLMs and Perception. Given an object/part query from LLMs, we first invoke open-vocab detector OWL-ViT [15] to obtain a bounding box, then feed it into Segment Anything [118] to obtain a mask, and finally track the mask using video tracker XMEM [119]. The tracked mask is used with RGB-D observation to reconstruct the object/part point cloud”. The combination of Huang and Jie inherently has all information in some manner passing through the VLM pass through the structures claimed in Claim 4, and the particular nature of “performing” or “determining” is not claimed with any particularity. Furthermore, the natures of the “features” are not claimed with any particularity. A point cloud alone may be considered as inherently having both of these features, or the point cloud and mask may be considered these items, etc.). Regarding Claims 12 – 13 and 19 – 20, the claims are directed to effectively the same subject matter as Claims 5 – 6 with respect to the application of prior art. The claims are therefore rejected under the same logic as Claims 5 – 6 and 6 above. Allowable Subject Matter Claims 7 and 14 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter: The combination of limitations has not been found alone or in a reasonable combination of references found in Examiner’s search of the prior art. In particular, the limitation of (or similar) Claim 7 which reads: the fully connected layer is configured to reconstruct a grid model with a same size as an inputted voxel grid has not been found in combination with the other limitations of the claims such that a reasonable rejection may be provided at this time. The closest prior art of record is that of Huang recited above, others recited in the conclusion section below which are considered to teach many or possibly all of the present independent claims, as well as Fazlazi et al. (US 20230281961 A1) which appears to teach the remaining limitations of these claims in combination with Huang or other references. 
Furthermore, Huang appears to at least teach reducing voxel maps (See Page 5, Section 3.2 of Huang, “Additionally, because V is often sparse, we densify the voxel maps via smoothing operations, as they encourage smoother trajectories optimized by motion planners”). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ding et al. (Ding, Yan, et al. "Task and motion planning with large language models for object rearrangement. In 2023 IEEE." RSJ International Conference on Intelligent Robots and Systems (IROS)) which discloses sorting items based on natural language instructions using LLMs. Driess et al. (Driess, Danny, et al. "Palm-e: An embodied multimodal language model." arXiv preprint arXiv:2303.03378 (2023)) which discloses an embodied multimodal language model that is highly flexible to different user specified tasks and includes at least one specific example of sorting items per a user instruction to do so. Florence et al. (US 20250144795 A1) which is a generally equivalent disclosure as Driess et al. as a patent publication. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW C GAMMON whose telephone number is (571)272-4919. The examiner can normally be reached M - F 10:00 - 6:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ADAM MOTT can be reached on (571) 270-5376. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /MATTHEW C GAMMON/Examiner, Art Unit 3657 /ADAM R MOTT/Supervisory Patent Examiner, Art Unit 3657

Prosecution Timeline

Jul 11, 2024: Application Filed
Mar 25, 2026: Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594673: Method of Calibrating Manipulator, Control System and Robot System (Granted Apr 07, 2026; 2y 5m to grant)
Patent 12588646: MILKING SYSTEM COMPRISING A MILKING ROBOT (Granted Mar 31, 2026; 2y 5m to grant)
Patent 12583110: ROBOT CONTROL SYSTEM (Granted Mar 24, 2026; 2y 5m to grant)
Patent 12576523: CONTROLLING ROBOTS USING MULTI-MODAL LANGUAGE MODELS (Granted Mar 17, 2026; 2y 5m to grant)
Patent 12544926: OBJECT INTERFERENCE CHECK METHOD (Granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 65%
With Interview (+23.4%): 88%
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 102 resolved cases by this examiner. Grant probability derived from career allow rate.
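
Assuming the stated derivation (baseline from the career allow rate, interview lift added in percentage points), the headline figures reconcile as in the sketch below; this shows the arithmetic only, not the dashboard's actual model.

```python
# Career allow rate -> baseline grant probability; interview lift added in points.
granted, resolved = 66, 102            # from the Examiner Intelligence section
interview_lift_pts = 23.4              # "+23.4% Interview Lift"

baseline = 100 * granted / resolved    # 64.7%, displayed as 65%
with_interview = baseline + interview_lift_pts
print(f"baseline = {baseline:.1f}%, with interview = {with_interview:.1f}%")
# baseline = 64.7%, with interview = 88.1% (shown as 65% and 88%)
```
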
