Prosecution Insights
Last updated: April 19, 2026
Application No. 18/417,553

MODEL PREDICTIVE PATH INTEGRAL CONTROLLER GUIDED BY LARGE VISION LANGUAGE MODEL FOR INTELLIGENT AUTONOMOUS VEHICLE PATH PLANNING

Status: Final Rejection (§103)
Filed: Jan 19, 2024
Examiner: BRADY III, PATRICK MICHAEL
Art Unit: 3665
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Constructor Education And Research Genossenschaft
OA Round: 2 (Final)

Grant Probability: 56% (Moderate)
Predicted OA Rounds: 3-4
Predicted Time to Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 56% (grants 67 of 119 resolved cases; +4.3% vs TC avg)
Interview Lift: strong, +44.1% across resolved cases with an interview vs. without
Typical Timeline: 3y 2m avg prosecution; 38 applications currently pending
Career History: 157 total applications across all art units

Statute-Specific Performance

§101: 23.2% (-16.8% vs TC avg)
§103: 52.5% (+12.5% vs TC avg)
§102: 10.1% (-29.9% vs TC avg)
§112: 11.5% (-28.5% vs TC avg)

Tech Center averages are estimates; based on career data from 119 resolved cases.
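The headline figures above are straightforward to recompute. The sketch below rederives the career allow rate from the published counts (67 granted of 119 resolved); the Tech Center average and the with/without-interview split are hypothetical placeholders, since the dashboard does not expose those underlying counts.

```python
# Minimal sketch of the dashboard arithmetic. Only the 67/119 split comes from
# the widget above; tc_avg and the interview split are illustrative placeholders.
granted, resolved = 67, 119
career_allow_rate = granted / resolved                 # 0.563 -> shown as 56%
print(f"Career allow rate: {career_allow_rate:.1%}")

tc_avg = 0.52                                          # hypothetical TC 3600 average
print(f"Delta vs TC avg:   {career_allow_rate - tc_avg:+.1%}")  # ~ +4.3%

# Interview lift = allow rate with an interview minus allow rate without one.
with_granted, with_resolved = 30, 35                   # hypothetical split
without_granted = granted - with_granted
without_resolved = resolved - with_resolved
lift = with_granted / with_resolved - without_granted / without_resolved
print(f"Interview lift:    {lift:+.1%}")               # placeholders give ~ +42%;
                                                       # the dashboard reports +44.1%
```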

Office Action

§103
DETAILED ACTION

This final Action is in response to the reply filed 23 October 2025, which was in response to the non-final action dated 30 July 2025.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

Claims 1-20 are pending. Claims 1, 8, 17 and 19 have been amended.

With regard to the 35 U.S.C. 112(b) rejection of claims 8-20 (pgs. 2-3, Action), applicant’s amendments render the rejection moot. Accordingly, the rejection under 35 U.S.C. 112(b) of claims 8-20 has been withdrawn.

With regard to the 35 U.S.C. 101 rejection of claims 1-20 (pgs. 3-16, Action), applicant has amended the independent claims 1, 8 and 17 to require “controlling the autonomous vehicle to the desired state using one of the lowest cost paths”. The examiner finds that this additional element, along with the claim as a whole, is sufficient to integrate the judicial exception into a practical application. Thus, under Step 2A, Prong Two (see MPEP 2106), since the claims as a whole are found to integrate the judicial exception into a practical application, they are eligible at Pathway B, thereby concluding the eligibility analysis. Accordingly, the rejection of claims 1-20 under 35 U.S.C. 101 has been withdrawn.

With regard to the 35 U.S.C. 103 rejection of claims 1-20 (pgs. 16-43, Action), applicant’s amendments necessitated additional searching and consideration of new grounds of rejection. Accordingly, the new grounds of rejection under 35 U.S.C. 103 are: claims 1-4, 6-11, 13-15, 17, 19-20 in view of Danielczuk, Shalev, Ma and Pan; claims 5 and 12 in view of Danielczuk, Shalev, Ma, Pan and Do; and claims 16 and 18 in view of Danielczuk, Shalev, Ma, Pan and Abati, as discussed below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-4, 6-11, 13-15, 17, and 19-20 are rejected under 35 U.S.C.
103 as being unpatentable over U.S. Patent Publication Number 2022/0152826 to Danielczuk et al. (hereafter Danielczuk) in view of U.S. Patent Publication Number 2019/0291728 to Shalev-Shwartz et al. (hereafter Shalev), Y. Ma, Y. Cao, J. Sun, M. Pavone, and C. Xiao, “Dolphins: Multimodal Language Model for Driving”, 1 December 2023, arXiv:2312.00438 (https://doi.org/10.48550/arXiv.2312.00438) (hereafter Ma), and U.S. Patent Publication Number 2025/0153736 to Pan et al. (hereafter Pan).

As per claim 1, Danielczuk discloses [a] method for navigating a path by an autonomous vehicle in motion (see at least Danielczuk, Fig. 11A, B, C, D; [0156] disclosing training and deploying a deep neural network; [0173] disclosing that FIG. 11A illustrates an example of an autonomous vehicle 1100, according to at least one embodiment. In at least one embodiment, autonomous vehicle 1100 (alternatively referred to herein as “vehicle 1100”) may be, without limitation, a passenger vehicle, such as a car, a truck, a bus, and/or another type of vehicle that accommodates one or more passengers; [0176] disclosing that a steering system 1154, which may include, without limitation, a steering wheel, is used to steer vehicle 1100 (e.g., along a desired path or route) when propulsion system 1150 is operating (e.g., when vehicle 1100 is in motion); [0186] disclosing that cameras with a field of view that include portions of an environment in front of vehicle 1100 (e.g., front-facing cameras) may be used for surround view, to help identify forward facing paths and obstacles, as well as aid in, with help of one or more of controller(s) 1136 and/or control SoCs, providing information critical to generating an occupancy grid and/or determining preferred vehicle paths), the method comprising: collecting image data along the path with a camera operably coupled to the autonomous vehicle in motion (see at least Danielczuk, [0182] disclosing that FIG. 11B illustrates an example of camera locations and fields of view for autonomous vehicle 1100 of FIG. 11A, according to at least one embodiment. In at least one embodiment, cameras and respective fields of view are one example embodiment and are not intended to be limiting. For instance, in at least one embodiment, additional and/or alternative cameras may be included and/or cameras may be located at different locations on vehicle 1100; [0186] disclosing that cameras with a field of view that include portions of an environment in front of vehicle 1100 (e.g., front-facing cameras) may be used for surround view, to help identify forward facing paths and obstacles, as well as aid in, with help of one or more of controller(s) 1136 and/or control SoCs, providing information critical to generating an occupancy grid and/or determining preferred vehicle paths. In at least one embodiment, front-facing cameras may be used to perform many similar ADAS functions as LIDAR, including, without limitation, emergency braking, pedestrian detection, and collision avoidance); ... (1) ... ; ... (2) ... ; ... (3) ... , ... (4) ... ; ... (5) ... ; ... (6) ... , ... (7) ... ; passing the outputted driving instruction ... (8) ... from the LVLM to a Model Predictive Path Integral (MPPI) module operably coupled to the autonomous vehicle (see at least Danielczuk, [0113]-[0114]; [0135] disclosing that the system that performs at least a part of process 700 includes executable code to determine 704 one or more trajectories between the first state and the second state. A system may adapt an MPPI policy such that trajectories may be generated by sampling around a straight line between the first state and the second state, in which the straight line may correspond to any suitable straight line between a robot appendage and/or an object in a position indicated by the first state and the robot appendage and/or the object in a position indicated by the second state <interpreted as driving instruction>); ... (9) ... ... (10) ... ; ... (11) ... , ... (12) ... ; ... (13) ... ; assigning, by the MPPI module, a change in cost of one of the plurality of possible calculated paths based on the driving instructions from the LVLM (see at least Danielczuk, [0089] disclosing that one or more systems utilize an SGD algorithm to update parameters of a scene collision network 102 such that calculated loss for the scene collision network 102 is minimized.); ... (14) ... .

But, Danielczuk does not explicitly teach the following limitations taught in Ma: (3) passing a slice of collected image data encoded with a feature extractor to a large vision language model (LVLM) (see at least Ma, Fig. 1, caption “Dolphins overview”, disclosing that the demonstrations in Section 5 show Dolphins’s capabilities on a group of subtasks belonging to the two dimensions of holistic understanding and reasoning, and human-like capabilities. The former encompasses autonomous driving-related capabilities such as scene understanding and prediction and planning for the ego car’s behavior. The latter analyzes three human-level abilities: rapid learning and adaptation, reflection and error recovery, and interactive conversation; pg. 3, para. 2, disclosing Dolphins, which is extended from OpenFlamingo [18] with strong in-context learning capabilities. Furthermore, we employ in-context instruction tuning [56] to enhance few-shot adaptations of our model. Consequently, Dolphins is proficient in handling diverse video inputs and exhibits the capacity for rapid adaptation to unseen instructions through in-context learning; Fig. 2, illustrating a general process, and pg. 4, para. 2, disclosing that, with regard to Fig. 2, this process is divided into three steps: (1) briefly describe the content of the image. (2) identify the object in the question and describe its spatial position. (3) if the question requires reasoning, provide the reasoning process in this step. Finally, we combine the sentences generated by ChatGPT in these three steps and append “So the answer is {answer}” at the end to form a complete GCoT response. This approach involves training the model on diverse visual data with GCoT response, where it learns to articulate its reasoning process in a step-by-step manner for various scenarios and objects that might not be specific to driving but are crucial for building foundational reasoning skills; pg. 4, para. 2, 3, 4), (4) wherein the LVLM has been pretrained and wherein the LVLM has been tuned with image-pairs from driving environments (see at least Ma, pg. 4, para. 4, disclosing that grounding the model in a general image dataset with GCoT responses generated by ChatGPT, followed by a careful transfer and fine-tuning of this skill in the specific context of AD. The use of both real and synthetic AV datasets ensures a comprehensive and robust training regime, preparing the VLM to handle the intricate and varied challenges of autonomous vehicular navigation with nuanced, step-by-step reasoning ... instructions with just a handful of annotated examples in autonomous driving-related tasks; pg.
7, section 4.2); and (5) passing a text-based query related to an aspect of driving to the LVLM (see at least Ma, pg. 9, section 5.2.2, disclosing that, utilizing a set of instructions primarily derived from LINGO-1, we present Dolphins with a spectrum of queries reflective of real-world driving interactions. As shown in Figures 22, 23, and 24, the conversations are constructed to assess Dolphins’s ability to comprehend and respond to nuanced language, maintain context over multiple exchanges, and offer informative and contextually relevant responses spanning from potential hazards in the scene to ego planning and the reasoning behind the scene. The results from these interactions indicate that Dolphins possesses a robust conversational ability, distinguishing itself significantly from other contemporary driving-related Vision Language Models in terms of linguistic flexibility and contextual understanding. In the future, this could be a foundation for a human interface that builds up trust between AV and road users or its passengers) ... .

But, neither Danielczuk nor Ma explicitly teaches the following limitations taught in Shalev: (6) outputting from the LVLM driving instructions in a structured, machine-readable format (see at least Shalev, [0599] disclosing that the driving policy system (e.g., the RL system) may be implemented together with one or more of the described accident liability rules to provide a navigational system that takes into account potential accident liability when deciding on a particular navigational instruction to implement. Such rules may be applied during the planning phase; e.g., within a set of programmed instructions or within a trained model such that a proposed navigational action is developed by the system <interpreted as a structured, machine-readable format> already in compliance with the rules. For example, a driving policy module may account for or be trained with, for example, one or more navigational rules upon which RSS is based. Additionally or alternatively, the RSS safety constraint may be applied as a filter layer through which all proposed navigational actions proposed by the planning phase are tested against the relevant accident liability rules to ensure that the proposed navigational actions are in compliance; [0600] disclosing that using the image information, a driving policy may take in a plurality of inputs and output a planned navigational action for accomplishing a navigational goal of the host vehicle <interpreted as a structured, machine-readable format>. The driving policy may include a set of programmed instructions, a trained network, etc., that may receive various inputs (e.g., images from one or more cameras showing the surroundings of the host vehicle, including target vehicles, roads, objects, pedestrians, etc.; output from LIDAR or RADAR systems; outputs from speed sensors, suspension sensors, etc.; information representing one or more goals of the host vehicle—e.g., a navigational plan for delivering a passenger to a particular location, etc.). Based on the input, the processor may identify a target vehicle in the environment of the host vehicle, e.g., by analyzing camera images, LIDAR output, RADAR output, etc.); (9) wherein the MPPI module is configured to calculate a plurality of possible paths using a cost model (see at least Shalev, [0385] disclosing a “semantic” language that consists of units, measurements, and action space, and specification as to how they are incorporated into planning, sensing and actuation of the AV.
To get a sense of semantics, in this context, consider how a human taking driving lessons is instructed to think about a “driving policy.” These instructions are not geometric—they do not take the form “drive 13.7 meters at the current speed and then accelerate at a rate of 0.8 m/s²”. Instead, the instructions are of a semantic nature—“follow the car in front of you” or “overtake that car on your left.” The typical language of human driving policy is about longitudinal and lateral goals rather than through geometric units of acceleration vectors. A formal semantic language may be useful on multiple fronts connected to the computational complexity of planning that do not scale up exponentially with time and number of agents, to the manner in which safety and comfort interact, to the way the computation of sensing is defined and the specification of sensor modalities and how they interact in a fusion methodology; [0386]; [0801]; [0802] disclosing that while FIGS. 57A-57C depict a predicted path for a host vehicle passing a target vehicle, the predicted path may be associated with any navigational maneuver of the host vehicle and may relate to any obstacle (e.g., a VRU, a plurality of host vehicles, a road characteristic, etc.). As an example, a predicted path may be generated for a host vehicle entering a turn in a roadway. Any other navigational maneuver may benefit from the predicted path analysis and/or the guess function); (13) parsing the driving instructions from the LVLM and inputting the parsed driving instructions into the cost model (see at least Shalev, [0600]; [0601] disclosing that based on the information available to the driving policy module, an output may be provided in the form of one or more planned navigational actions for accomplishing a navigational goal of the host vehicle. In some embodiments, the RSS safety constraint <interpreted as the cost model> may be applied as a filter of the planned navigational actions. That is, the planned navigational action, once developed, can be tested against at least one accident liability rule (e.g., any of the accident liability rules discussed above) for determining potential accident liability for the host vehicle relative to the identified target vehicle); (14) selecting, with the MPPI module, the lowest cost path from among the possible calculated paths based on the change in cost assigned by the MPPI module (see at least Shalev, [0253] disclosing that the mapping from sensed navigational state to desires is a policy π_θ^(D) that may be learned from experience by maximizing an expected reward. The desires produced by π_0^(D) may be translated into a cost function over driving trajectories. The function π^(T), not a learned function, may be implemented by finding a trajectory that minimizes the cost subject to hard constraints on functional safety. This decomposition may ensure functional safety while at the same time providing for comfortable driving).

But, neither Danielczuk, Shalev nor Ma explicitly discloses the following limitations disclosed in Pan: (1) determining a slice of the collected image data as a sequential set of images captured over a time period (see at least Pan, [0036]; [0046] disclosing that the present disclosure introduces a Vision-Language-Planning (VLP) foundation model for autonomous driving.
In embodiments, the VLP foundation model uses contrastive learning techniques, such as those introduced in a Contrastive Language-Image Pretraining (CLIP) model; [0047] disclosing that contrastive learning is a type of machine learning where the model learns to distinguish between positive and negative pairs of data. In the context of CLIP, the “positive pair” consists of an image and a text description that are semantically related, while the “negative pair” consists of an image and a randomly selected text description that is not related. During training, CLIP is designed to encourage bringing together features from related text and images pairs into a common embedding space, while pushing unrelated pairs apart ); (2) encoding the slice with a feature extractor to generate an encoded slice of collected image data (see at least Pan, [0049] disclosing that the contrastive learning concept used in CLIP (teachings of which are included in the VLP foundation model) is illustrated in Fig. 5, generally shown as a contrastive learning model at 500. As shown, a plurality of natural language text descriptions 502 are fed into a text encoder 504, and a plurality of images 506 are fed into an image encoder 508. The model 500 then performs feature mapping, where the vectors output by the encoders are mapped to a joint embedding space.); (7) outputting from the LVLM driving instructions for the autonomous vehicle in a structured, machine-readable format (see at least Pan, [0052] disclosing that Fig. 6 illustrates a high-level overview of a planning model 600 in the end-to-end autonomous driving system 400 of FIG. 4, according to an embodiment. As illustrated, the planning features are used in predicting the future trajectory for self-driving car. ... And further, that the features extracted from the BEV can include information about the agents in the environment, such as their location, their trajectory ... which are inputs to determine how the autonomous vehicle itself should react. The extracted planning features are further sent into a trajectory regression head 608 to plan the future trajectory for the self-driving car in the next P timestamps. The trajectory regression head 608 can be used to transform the vector into another size vector. In an embodiment, the trajectory regression head 608 includes a small neural network that maps high dimensional input to the expected P timestamps trajectory. During the training process, it applies two kinds of planning losses to optimize the planning module. One is the average distance error (ADE) which aims to reduce the distance or loss between the predicted trajectory 610 and the ground truth trajectory 612, the other is the collision rate (COL) which aims to ensure the safety of the planned trajectory) (8) passing the outputted driving instruction for the autonomous vehicle from the LVLM to a Model Predictive Path Integral (MPPI) module ... (see at least Pan, [0059] disclosing that FIG. 8 illustrates a method 800 of training an autonomous driving system utilizing a vision-language planning (VLP) machine learning model, according to an embodiment; [0061] disclosing that at 816, a contrastive learning model is executed in order to derive similarities between the vision-based planning features and the text-based planning features. At 818, a predicted trajectory of the vehicle is generated based on these similarities, based on the vision-based planning features. 
Then, the model is refined, updated, and trained once again, and more feature space is determined based on the similarities.); (10) wherein the MPPI module is configured to calculate a plurality of possible paths using a cost model (see at least Pan, Claim 1, disclosing extract text-based planning features from the text information, execute a contrastive learning model to derive similarities between the vision-based planning features and the text-based planning features, and generate a predicted trajectory of the vehicle based on the similarities; Claim 2, disclosing determining a loss between the predicted trajectory of the vehicle and a ground truth trajectory of the vehicle; and repeat the steps of claim 1 until convergence to minimize the loss); (11) wherein the cost model comprises a representation of the driving environment in which the autonomous vehicle is driving (see at least Pan, claim 2, disclosing determining a loss between the predicted trajectory of the vehicle and a ground truth trajectory of the vehicle; and repeat the steps of claim 1 until convergence to minimize the loss), (12) wherein the cost model is originally generated using a prior map of the driving environment and at least one camera image (see at least Pan, [0036] disclosing that system 200 may implement a machine-learning algorithm 210 that is configured to analyze the raw source dataset 216. The raw source dataset 216 may include raw or unprocessed sensor data that may be representative of an input dataset for a machine-learning system. The raw source dataset 216 may include video, video segments, images, text-based information, audio or human speech, time series data (e.g., a pressure sensor signal over time), and raw or partially processed sensor data (e.g., radar map of objects) <interpreted as the prior map of the driving environment>); (15) iterating execution of the MPPI module using the lowest cost path as predicted state input until cost is minimized in the cost model to meet a desired state (see at least Pan, [0027] disclosing that system 100 may further comprise a processor subsystem 110 which may be configured to, during operation of the system 100, provide an iterative function as a substitute for a stack of layers of the neural network to be trained; [0038]; [0061] disclosing that at 818, a predicted trajectory of the vehicle is generated based on these similarities, based on the vision-based planning features. Then, the model is refined, updated, and trained once again, and more feature space is determined based on the similarities <interpreted as iterative execution>; claim 2); and (16) controlling the autonomous vehicle to the desired state using one of the lowest cost paths (see at least Pan, [0043] disclosing that the actuator control command may be determined such that a propulsion unit, steering unit and/or brake unit of the mobile robot may be controlled such that the mobile robot may avoid collisions with identified objects; [0061] disclosing that at 818, a predicted trajectory of the vehicle is generated based on these similarities, based on the vision-based planning features. Then, the model is refined, updated, and trained once again, and more feature space is determined based on the similarities.).

Danielczuk, Shalev, Ma and Pan are analogous art to claim 1 because they are in the same field relating to navigation techniques.
Danielczuk relates to processors or computing systems used to determine collisions between objects and a scene using various novel techniques (see at least Danielczuk, [0002]). Shalev relates to systems and methods for navigating according to potential accident liability constraints (see at least Shalev, [0002]). Ma relates to processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instruction (see at least Ma, Abstract). Pan relates to systems and methods for vision-language planning (VLP) foundation models for autonomous driving (see at least Pan, [0001]).

Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method as disclosed in Danielczuk to provide the benefit of (3) passing a slice of collected image data encoded with a feature extractor to a large vision language model (LVLM), (4) having the LVLM be pretrained and tuned with image-pairs from driving environments, and (5) passing a text-based query related to an aspect of driving to the LVLM, as disclosed in Ma, with a reasonable expectation of success. Doing so would provide the benefit of facilitating a model’s capacity to learn new tasks from several input-output examples (see at least Ma, pg. 5, last para.).

It would have been further obvious to modify the method as disclosed in Danielczuk, as modified by Ma, to provide the benefit of (6) outputting driving instructions in a structured, machine-readable format, (9) calculating a plurality of possible paths using a cost model, (13) parsing the driving instructions from the LVLM and inputting the parsed driving instructions into the cost model, and (14) selecting the lowest cost path from among the possible calculated paths based on the change in cost, as disclosed in Shalev, with a reasonable expectation of success. Doing so would improve the safety and scalability of the method (see at least Shalev, [0004]).

It would still further be obvious to modify the method as disclosed in Danielczuk, as modified by Ma and Shalev, to provide the benefit of (1) determining a slice of the collected image data as a sequential set of images captured over a time period, (2) encoding the slice with a feature extractor to generate an encoded slice of collected image data, (7) outputting from the LVLM driving instructions for the autonomous vehicle in a structured, machine-readable format, (8) passing the outputted driving instruction for the autonomous vehicle from the LVLM to a Model Predictive Path Integral (MPPI) module, (10) having the MPPI module be configured to calculate a plurality of possible paths using a cost model, (11) having the cost model comprise a representation of the driving environment in which the autonomous vehicle is driving, (12) having the cost model be originally generated using a prior map of the driving environment and at least one camera image, (15) iterating execution of the MPPI module using the lowest cost path as predicted state input until cost is minimized in the cost model to meet a desired state, and (16) controlling the autonomous vehicle to the desired state using one of the lowest cost paths, as disclosed in Pan, with a reasonable expectation of success. Doing so would provide the benefit of improving the accuracy, safety and generalization capacity of autonomous driving systems, by harnessing the power of language foundation models in tandem with advanced computer vision techniques (see at least Pan, [0024]).

As per claim 2, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 1, as shown above. Danielczuk further discloses the following limitation: wherein the text-based query is a prompt sent in accordance with a predetermined schedule (see at least Danielczuk, [0110] disclosing that an MPPI policy may be utilized, called, or otherwise queried in any suitable time intervals and at any suitable frequency, which may be variable or constant <interpreted as a predetermined schedule>; [0214] disclosing that one or more of SoC(s) 1104 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine positions and extents of objects (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and/or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison to LIDAR data for purposes of localization and/or other functions, and/or for other uses; [0023], real-time camera engine; [0274]).

As per claim 3, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 1, as shown above. Ma further discloses the following limitation: wherein the driving instructions comprise a scene description and an object description (see at least Ma, pg. 5, para. 1; pg. 7, section 5.1, disclosing that, compared to bounding boxes and tracked history for road agents categorized in a close set of labels, Dolphins should be able to understand road agents and traffic elements with comprehensive semantic attributes including: an open vocabulary semantic type (e.g., a police vehicle, a kid pedestrian, etc.); a semantic status (e.g., with right turn light on, with green light on, etc.); a behavior description if it is a dynamic road agent (e.g., turning right in slow speed, parallel parking, etc.). These comprehensive attributes are crucial for understanding the rationale behind the scene with VLM (e.g., giving road to a police vehicle with siren on, right turn light on inferring a right turn behavior, etc.)).

As per claim 4, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 1, as shown above. Ma further discloses the following limitations: wherein the LVLM is pre-trained on a dataset comprising video/image-caption pairs (see at least Ma, pg. 6, section 4.1 disclosing that another set of learned latent vectors is incorporated to function as media position embeddings, introducing essential ordering information within the few-shot prompt. The inclusion of these embeddings has led to a noteworthy enhancement in the model’s ability in video understanding. To preserve the pretraining knowledge and reduce computing consumption, we freeze both the encoders and only finetune the perceiver resampler module, gated cross-attention layers, and LoRA [71] module added to the text encoder, as shown in Figure 4), wherein the video/image-caption pairs comprise images commonly observed on roadways combined with text captions (see at least Ma, pg. 26, para.
1, disclosing that the Visual Genome dataset [80] is used as a supplement, as it has richer annotations and intersects with MSCOCO. The GQA task provides detailed object annotations but lacks captions, which presents a challenge to ChatGPT in comprehending the overall content of the image. So we organize the objects, attributes, and their relationships in the annotations into sentences, which are used to describe the relationships between two objects in the image in place of captions. After preparation, we prompt ChatGPT to follow the aforementioned three steps to generate GCoT templates step by step. The prompts can be found in Table 2. In addition, we also include LLaVA-instruct-80k [15] and SVIT [23] datasets to enhance the model’s instruction-following capability; pg. 21, Fig. 28, showing caption pairs comprising images ... with text captions).

As per claim 6, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 4, as shown above. Ma further discloses the following limitations: wherein the LVLM is tuned on a dataset comprising visual-instruction pairs (see at least Ma, pg. 5, Fig. 3, captioned “Overview of our proposed dataset”, disclosing that, compared with the previous datasets, we employ RICES (Retrieval-based In-Context Example Selection) [60] approach to choose in-context examples for each sample. Additionally, we introduce the "Detailed Conversation" task to train our model to generate detailed responses that align closely with human-preferred responses. This instruction is aimed at unlocking the latent potential of the foundation model, which has instruction fine-tuned on the dataset consisting of image-instruction-response triplets; pg. 8, section 5.1.1, Perception, disclosing Semantic attributes of road agents & traffic elements. Dolphins is able to capture various types of road agents and traffic elements with attributes (e.g., black car, red traffic light, evident in Figures 5, 7, 8 and 9)), wherein images commonly observed on roadways are linked with instructions (see at least Ma, pg. 11, caption for Fig. 7 disclosing that Fig. 7 is an example showcasing Dolphins’s capability in scenario understanding § 5.1.1. The video features an ego car driving on a snowy street. Dolphins can identify the environment in which the ego car is situated, the presence of the traffic light, and accurately determine the color of the passing vehicle; pg. 14, caption for Fig. 14 disclosing that Fig. 14 is an example showcasing Dolphins’s capability in planning § 5.1.2 with control signals. Dolphins can also understand control signals and employ them to predict the speed and turn angle of the ego car next second. Additionally, we can enable Dolphins to formulate plans for the ego car, thereby assisting the driver during driving. However, due to the lack of sufficient information, such as the map, driving destination, etc., the plan is still limited to a brief period in the future).

As per claim 7, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 6, as shown above. Ma further discloses the following limitation: wherein the instructions comprise commands to stop or to use caution (see at least Ma, pg. 16, caption for Fig. 18, captioned “An example showcasing Dolphins’s capability in planning § 5.1.2”, disclosing that the video features an ego car driving on a dark city street. Dolphins can recognize that the ego car is approaching an intersection with a red traffic light, so it plans the future behavior for the ego car, which should be to come to a stop and wait for the traffic light to turn green <interpreted as a command to stop or use caution> to pass through the intersection safely).

As per claim 8, similar to claim 1, Danielczuk discloses [a] system for navigating a path by an autonomous vehicle in motion (see at least Danielczuk, Fig. 11A, B, C, D; [0156]; [0173]; [0176]; [0186]), the system comprising: an autonomous vehicle coupled with a plurality of sensors for collecting image data from the environment (see at least Danielczuk, [0178] disclosing that controller(s) 1136 provide signals for controlling one or more components and/or systems of vehicle 1100 in response to sensor data received from one or more sensors (e.g., sensor inputs). In at least one embodiment, sensor data may be received from, for example and without limitation, global navigation satellite systems (“GNSS”) sensor(s) 1158 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 1160, ultrasonic sensor(s) 1162, LIDAR sensor(s) 1164, inertial measurement unit (“IMU”) sensor(s) 1166 (e.g., accelerometer(s), gyroscope(s), a magnetic compass or magnetic compasses, magnetometer(s), etc.), microphone(s) 1196, stereo camera(s) 1168, wide-view camera(s) 1170 (e.g., fisheye cameras), infrared camera(s) 1172, surround camera(s) 1174 (e.g., 360 degree cameras), long-range cameras (not shown in FIG. 11A), mid-range camera(s) (not shown in FIG. 11A), speed sensor(s) 1144 (e.g., for measuring speed of vehicle 1100), vibration sensor(s) 1142, steering sensor(s) 1140, brake sensor(s) (e.g., as part of brake sensor system 1146), and/or other sensor types; [0186]); ... (1) ... , ... (2) ... ; ... (3) ... ; ... (4) ... ; ... (5) ... ; ... (6) ... ; ... (7) ... ; wherein the MPPI module is configured to assign a change in cost of one of the plurality of possible paths based on the driving instructions from the LVLM (see at least Danielczuk, [0089]); ... (8) ... .

But Danielczuk does not explicitly teach the following limitations taught in Ma: (1) wherein the plurality of sensors are configured to pass a slice of the collected image data to a Large Vision Language Model (LVLM) (see at least Ma, Fig. 1, pg. 3, para. 2; Fig. 2, pg. 4, para. 2; pg. 4, para. 2, 3, 4), (2) wherein the LVLM has been pretrained with image-pairs from driving environments (see at least Ma, pg. 4, para. 4; pg. 7, section 4.2); (4) wherein the LVLM is configured to receive a text-based query related to an aspect of driving (see at least Ma, pg. 9, section 5.2.2) ... .
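Editor's note: the independent claims turn on the LVLM emitting driving instructions in a "structured, machine-readable format" that the MPPI module then parses. Neither the excerpt above nor the cited art fixes a schema, so the following Python sketch is only a plausible shape for that interface; the JSON fields, the instruction vocabulary, and the helper names are illustrative assumptions.

```python
import json
from dataclasses import dataclass

# Hypothetical LVLM reply. The schema and values are illustrative only; the
# claims recite a structured, machine-readable format without fixing one.
LVLM_REPLY = """{
  "scene": "snow-covered two-lane street, low visibility",
  "objects": [{"type": "pedestrian", "position": "right shoulder"}],
  "instruction": "use_caution",
  "max_speed_mps": 5.0
}"""

@dataclass
class DrivingInstruction:
    scene: str
    instruction: str        # e.g. "stop" or "use_caution" (cf. claims 7 and 14)
    max_speed_mps: float

def parse_driving_instruction(raw: str) -> DrivingInstruction:
    """Parse the LVLM's JSON into the fields a cost model could consume."""
    msg = json.loads(raw)
    return DrivingInstruction(msg["scene"], msg["instruction"], msg["max_speed_mps"])

print(parse_driving_instruction(LVLM_REPLY))
```

A parsed object like this is what the claims contemplate feeding into the MPPI cost model as a change-in-cost signal.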
But, neither Danielczuk nor Ma explicitly teaches the following limitations taught in Shalev: (5) output driving instructions in a structured, machine-readable format to a Model Predictive Path Integral (MPPI) control module operably coupled to the autonomous vehicle (see at least Shalev, [0599]; [0600]); (7) wherein the MPPI module is configured to calculate a plurality of possible paths using a cost model (see at least Shalev, [0385]; [0386]; [0801]; [0802]); (11) wherein the MPPI module is configured to receive structured, machine-readable driving instructions from the LVLM (see at least Shalev, [0600]; [0601]) and to (13) input parsed driving instructions into the cost model (see at least Shalev, [0253]); (14) wherein the MPPI module is configured to select a lowest cost path from among the plurality of possible paths based on the change in cost assigned by the MPPI module (see at least Shalev, [0253]) ... .

But, neither Danielczuk, Shalev nor Ma explicitly discloses the following limitations disclosed in Pan: (3) wherein the slice of the collected image data is a sequential set of images captured over a time period encoded with a feature extractor (see at least Pan, [0047]; [0049]); (6) output driving instructions for the automated vehicle in a structured, machine-readable format to a Model Predictive Path Integral (MPPI) control module operably coupled to the autonomous vehicle (see at least Pan, [0052]), (8) wherein the MPPI module is configured to calculate a plurality of possible paths using a cost model (see at least Pan, Claim 1, Claim 2), (9) wherein the cost model comprises a representation of the driving environment in which the autonomous vehicle is driving (see at least Pan, Claim 1, Claim 2), (10) wherein the cost model is originally generated using a prior map of the driving environment and at least one camera image (see at least Pan, [0036]); (12) wherein the MPPI module is configured to receive structured, machine-readable driving instructions for the autonomous vehicle from the LVLM (see at least Pan, [0052]; [0059]; [0061]), (15) wherein the MPPI module execution is iterated using the lowest cost path as predicted state input until cost is minimized in the cost model to meet a desired state (see at least Pan, [0027]; [0038]; [0061]; claim 2); and (16) wherein the autonomous vehicle is controlled to the desired state using one of the lowest cost paths (see at least Pan, [0043]; [0061]).

Danielczuk, Shalev, Ma and Pan are analogous art to claim 8 because they are in the same field relating to navigation techniques. Danielczuk relates to processors or computing systems used to determine collisions between objects and a scene using various novel techniques (see at least Danielczuk, [0002]). Shalev relates to systems and methods for navigating according to potential accident liability constraints (see at least Shalev, [0002]). Ma relates to processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instruction (see at least Ma, Abstract). Pan relates to systems and methods for vision-language planning (VLP) foundation models for autonomous driving (see at least Pan, [0001]).

Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system, as disclosed in Danielczuk, to provide the benefit of (1) having the plurality of sensors pass a slice of the collected image data to a Large Vision Language Model (LVLM), (2) having the LVLM be pretrained with image-pairs from driving environments, and (4) having the LVLM receive a text-based query related to an aspect of driving, as disclosed in Ma, with a reasonable expectation of success. Doing so would provide the benefit of facilitating a model’s capacity to learn new tasks from several input-output examples (see at least Ma, pg. 5, last para.).

It would have been further obvious to modify the system, as disclosed in Danielczuk, as modified by Ma, to provide the benefit of (5) outputting driving instructions in a structured, machine-readable format to a Model Predictive Path Integral (MPPI) control module operably coupled to the autonomous vehicle, (7) calculating a plurality of possible paths using a cost model, (11) receiving structured, machine-readable driving instructions from the LVLM, (13) inputting parsed driving instructions into the cost model, and (14) selecting a lowest cost path from among the plurality of possible paths based on the change in cost assigned by the MPPI module, as disclosed in Shalev, with a reasonable expectation of success. Doing so would improve the safety and scalability of the method (see at least Shalev, [0004]).

It would still further be obvious to modify the system as disclosed in Danielczuk, as modified by Ma and Shalev, to provide the benefit of (3) having the slice of the collected image data be a sequential set of images captured over a time period encoded with a feature extractor, (6) outputting driving instructions for the automated vehicle in a structured, machine-readable format to a Model Predictive Path Integral (MPPI) control module operably coupled to the autonomous vehicle, (8) having the MPPI module be configured to calculate a plurality of possible paths using a cost model, (9) having the cost model comprise a representation of the driving environment in which the autonomous vehicle is driving, (10) having the cost model be originally generated using a prior map of the driving environment and at least one camera image, (12) having the MPPI module be configured to receive structured, machine-readable driving instructions for the autonomous vehicle from the LVLM, (15) having the MPPI module execution be iterated using the lowest cost path as predicted state input until cost is minimized in the cost model to meet a desired state, and (16) having the autonomous vehicle be controlled to the desired state using one of the lowest cost paths, as disclosed in Pan, with a reasonable expectation of success. Doing so would provide the benefit of improving the accuracy, safety and generalization capacity of autonomous driving systems, by harnessing the power of language foundation models in tandem with advanced computer vision techniques (see at least Pan, [0024]).

As per claim 9, similar to claim 2, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. Danielczuk further discloses the following limitation: wherein the text-based query is a prompt sent in accordance with a predetermined schedule (see at least Danielczuk, [0110]; [0214]; [0023]; [0274]).
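Editor's note: claims 2 and 9 recite a text-based query sent "in accordance with a predetermined schedule," which Danielczuk [0110] is read to teach via periodic querying. A minimal sketch of such a fixed-interval prompt loop follows; the query period, prompt text, and function names are assumptions, and query_lvlm is a stand-in for a real model call.

```python
import time
from collections import deque

FRAMES: deque = deque(maxlen=16)     # rolling "slice" of recent camera frames
QUERY_PERIOD_S = 0.5                 # hypothetical predetermined schedule

def query_lvlm(frames, prompt: str) -> str:
    """Stand-in for the LVLM call; a real system would send encoded frames
    plus the text query to the model and receive structured JSON back."""
    return '{"instruction": "use_caution"}'

def run(camera, runtime_s: float = 3.0) -> None:
    """Collect frames continuously and prompt the LVLM on a fixed schedule."""
    end = time.monotonic() + runtime_s
    next_query = time.monotonic()
    while time.monotonic() < end:
        FRAMES.append(camera())                    # collect image data along the path
        if time.monotonic() >= next_query:         # predetermined schedule fires
            reply = query_lvlm(list(FRAMES),
                               "Any hazards ahead? How should the ego car respond?")
            print(reply)
            next_query += QUERY_PERIOD_S
        time.sleep(0.05)                           # ~20 Hz camera frame period

run(camera=lambda: b"jpeg-bytes")
```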
As per claim 10, similar to claim 3, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. Ma further discloses the following limitation: wherein the driving instructions comprise a scene description and an object description (see at least Ma, pg. 5, para. 1; pg. 7, section 5.1).

As per claim 11, similar to claim 4, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. Ma further discloses the following limitations: wherein the LVLM is pre-trained on a dataset comprising video/image-caption pairs (see at least Ma, pg. 6, section 4.1, and Fig. 4), wherein the video/image-caption pairs comprise images commonly observed on roadways, combined with text captions (see at least Ma, pg. 26, para. 1; pg. 21, Fig. 28).

As per claim 13, similar to claim 6, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 11, as shown above. Ma further discloses the following limitations: wherein the LVLM is tuned on a dataset comprising visual-instruction pairs (see at least Ma, pg. 5, Fig. 3), wherein visual instruction pairs comprise images commonly observed on roadways linked with instructions (see at least Ma, pg. 11, caption for Fig. 7; pg. 14, Fig. 14).

As per claim 14, similar to claim 7, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 13, as shown above. Ma further discloses the following limitation: wherein the instructions comprise commands to stop or to use caution (see at least Ma, pg. 16, Fig. 18).

As per claim 15, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. Danielczuk further discloses the following limitation: wherein the plurality of sensors comprises at least one of a camera, LiDAR, radar, or GPS (see at least Danielczuk, [0178] disclosing that controller(s) 1136 provide signals for controlling one or more components and/or systems of vehicle 1100 in response to sensor data received from one or more sensors (e.g., sensor inputs). In at least one embodiment, sensor data may be received from, for example and without limitation, global navigation satellite systems (“GNSS”) sensor(s) 1158 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 1160, ultrasonic sensor(s) 1162, LIDAR sensor(s) 1164, inertial measurement unit (“IMU”) sensor(s) 1166 (e.g., accelerometer(s), gyroscope(s), a magnetic compass or magnetic compasses, magnetometer(s), etc.), microphone(s) 1196, stereo camera(s) 1168, wide-view camera(s) 1170 (e.g., fisheye cameras), infrared camera(s) 1172, surround camera(s) 1174 (e.g., 360 degree cameras), long-range cameras (not shown in FIG. 11A), mid-range camera(s) (not shown in FIG. 11A)).

As per claim 17, similar to claims 1 and 8, Danielczuk discloses [a] method for training a Large Language Model (LLM) for trajectory calculation when integrated with an MPPI controller (see at least Danielczuk, Fig. 11A, B, C, D; [0156]; [0176]; [0186]), the method comprising: providing a first dataset of image pairs to the LLM (see at least Danielczuk, [0113]-[0114]; [0135]) ... (1) ... ; ... (2) ... , ... (3) ... ; ... (4) ... ; ... (5) ... ; ... (6) ... ; ... (7) ... ; ... (8) ... .

But, Danielczuk does not explicitly teach the following limitations taught in Ma: (1) wherein the image pairs comprise images from roadway scenarios and text labels (see at least Ma, pg. 3, Figure 3, captioned “Overview of our proposed dataset”, disclosing that, compared with the previous datasets, we employ RICES (Retrieval-based In-Context Example Selection) [60] approach to choose in-context examples for each sample. Additionally, we introduce the "Detailed Conversation" task to train our model to generate detailed responses that align closely with human-preferred responses. This instruction is aimed at unlocking the latent potential of the foundation model, which has instruction fine-tuned on the dataset consisting of image-instruction-response triplets); (2) providing a second dataset of images paired with driving instructions (similar to claim 7, see at least Ma, pg. 16, Fig. 18); (3) training the LLM on the first dataset (see at least Ma, pg. 3, Figure 3; pg. 6, Fig. 4); (4) fine-tuning the LLM on the second dataset (see at least Ma, pg. 3, Figure 3; pg. 4, para. 4); and (5) passing an image of a roadway scenario to the LLM (see at least Ma, pg. 6, Fig. 4, showing a schematic of the model architecture, and last para. of section 4.1, disclosing that another set of learned latent vectors is incorporated to function as media position embeddings, introducing essential ordering information within the few-shot prompt. The inclusion of these embeddings has led to a noteworthy enhancement in the model’s ability in video understanding. To preserve the pretraining knowledge and reduce computing consumption, we freeze both the encoders and only finetune the perceiver resampler module, gated cross-attention layers, and LoRA [71] module added to the text encoder, as shown in Figure 4; pg. 10, Fig. 5, captioned “An example showcasing Dolphins’s capability in scenario understanding § 5.1.1”, disclosing that the video features an ego car driving in a tunnel. Dolphins can identify the environment in which the ego car is situated and accurately determine the color of the front vehicle as well as infer the current time); (6) prompting the LLM with a text query related to the image from a roadway scenario (see at least Ma, pg. 9, 5.2.2).

But, neither Danielczuk nor Ma explicitly teaches the following limitations taught in Shalev: (7) receiving, by a Model Predictive Path Integral (MPPI) controller, a driving instruction in response to the text query from the LLM in a structured, machine-readable format (see at least Shalev, [0599]); (9) parsing the driving instruction for input to a cost model (see at least Shalev, [0600]; [0601]) ... .

But, neither Danielczuk, Shalev nor Ma explicitly teaches the following limitations taught in Pan: (8) receiving, by a Model Predictive Path Integral (MPPI) controller, a driving instruction for the autonomous vehicle in response to the text query from the LLM in a structured, machine-readable format (see at least Pan, [0052]; [0059]; [0061], Claim 1, Claim 2), (10) parsing the driving instruction for input to a cost model (see at least Pan, Claim 1, Claim 2), (11) wherein the cost model comprises a representation of the driving environment in which the autonomous vehicle is driving (see at least Pan, Claim 1, Claim 2), (12) wherein the cost model is originally generated using a prior map of the driving environment and at least one camera image (see at least Pan, [0036]; [0043]; [0061]); and (13) wherein the MPPI controller can be iterated to minimize cost in the cost model to meet a desired state to control the autonomous vehicle (see at least Pan, [0027]; [0038]; [0061]; claim 2).
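Editor's note: the last mapped limitation has the MPPI controller "iterated to minimize cost in the cost model to meet a desired state." A minimal sketch of that outer loop under toy assumptions (point-mass state, distance-to-goal as the cost); the per-step solve is stubbed out here, and a Monte Carlo sampler in the style the Do reference describes appears after the claim 5 discussion below.

```python
import numpy as np

def mppi_step(state: np.ndarray, goal: np.ndarray):
    """Stub for one MPPI solve: return (lowest-cost path, first control).
    See the Monte Carlo sampler sketched after the claim 5 discussion."""
    direction = goal - state
    control = 0.1 * direction / (np.linalg.norm(direction) + 1e-9)
    path = state + np.outer(np.linspace(0.1, 1.0, 10), direction)
    return path, control

def drive_to(state: np.ndarray, goal: np.ndarray, tol=0.05, max_iters=200):
    """Iterate the controller, feeding the lowest-cost path's first step back
    as the next predicted state, until cost (here, distance-to-goal) is
    minimized, i.e., the desired state is met."""
    for _ in range(max_iters):
        if np.linalg.norm(goal - state) < tol:    # desired state reached
            break
        _, control = mppi_step(state, goal)       # lowest-cost path's first step
        state = state + control                   # apply control, advance state
    return state

print(drive_to(np.zeros(2), np.array([1.0, 2.0])))   # converges near [1.0, 2.0]
```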
Danielczuk, Shalev, Ma and Pan are analogous art to claim 17 because they are in the same field relating to navigation techniques. Danielczuk relates to processors or computing systems used to determine collisions between objects and a scene using various novel techniques (see at least Danielczuk, [0002]). Shalev relates to systems and methods for navigating according to potential accident liability constraints (see at least Shalev, [0002]). Ma relates to processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instruction (see at least Ma, Abstract). Pan relates to systems and methods for vision-language planning (VLP) foundation models for autonomous driving (see at least Pan, [0001]).

Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method, as disclosed in Danielczuk, to provide the benefit of (1) having the image pairs comprise images from roadway scenarios and text labels, (2) providing a second dataset of images paired with driving instructions, (3) training the LLM on the first dataset, (4) fine-tuning the LLM on the second dataset, (5) passing an image of a roadway scenario to the LLM, and (6) prompting the LLM with a text query related to the image from a roadway scenario, as disclosed in Ma, with a reasonable expectation of success. Doing so would provide the benefit of facilitating a model’s capacity to learn new tasks from several input-output examples (see at least Ma, pg. 5, last para.).

It would have been further obvious to modify the method as disclosed in Danielczuk, as modified by Ma, to provide the benefit of (7) receiving a driving instruction in response to the text query from the LLM in a structured, machine-readable format and (9) parsing the driving instruction for input to a cost model, as disclosed in Shalev, with a reasonable expectation of success. Doing so would improve the safety and scalability of the method (see at least Shalev, [0004]).

It would still further be obvious to modify the method as disclosed in Danielczuk, as modified by Ma and Shalev, to provide the benefit of (8) receiving, by a Model Predictive Path Integral (MPPI) controller, a driving instruction for the autonomous vehicle in response to the text query from the LLM in a structured, machine-readable format, (10) parsing the driving instruction for input to a cost model, (11) having the cost model comprise a representation of the driving environment in which the autonomous vehicle is driving, (12) having the cost model be originally generated using a prior map of the driving environment and at least one camera image, and (13) having the MPPI controller be iterated to minimize cost in the cost model to meet a desired state to control the autonomous vehicle, as disclosed in Pan, with a reasonable expectation of success. Doing so would provide the benefit of improving the accuracy, safety and generalization capacity of autonomous driving systems, by harnessing the power of language foundation models in tandem with advanced computer vision techniques (see at least Pan, [0024]).

As per claim 19, similar to claim 1, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 17, as shown above. Shalev further discloses the following limitation: calculating a plurality of possible paths using a cost model (see at least Shalev, [0385]).
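Editor's note: claims 19-20 recite calculating a plurality of possible paths with a cost model and selecting the lowest-cost one, and claims 11-12 recite a cost model "originally generated using a prior map ... and at least one camera image." The toy sketch below combines both ideas; the grid size, cost values, and jittered-path sampling are assumptions, not the application's actual cost model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cost model: the low-cost base layer stands in for a prior map of the
# driving environment; the high-cost patch stands in for an obstacle detected
# in a camera image. Sizes and values are illustrative assumptions.
cost_map = np.full((20, 20), 0.1)       # prior map: free space
cost_map[8:12, 5:15] = 5.0              # camera-detected obstacle region

def path_cost(path) -> float:
    """Cost of a path = sum of the cost-map cells it visits ((row, col) pairs)."""
    return float(sum(cost_map[r, c] for r, c in path))

def sample_paths(n=50, length=19):
    """A plurality of candidate paths: straight lines across the grid with
    random vertical jitter, a crude stand-in for MPPI rollouts."""
    paths = []
    for _ in range(n):
        rows = np.clip(10 + np.cumsum(rng.integers(-1, 2, size=length)), 0, 19)
        paths.append([(int(r), c) for c, r in enumerate(rows)])
    return paths

best = min(sample_paths(), key=path_cost)   # select the lowest-cost path
print(f"lowest cost found: {path_cost(best):.1f}")
```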
As per claim 20, similar to claim 1, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 19, as shown above. Shalev further discloses the following limitation: selecting a lowest cost path from among the plurality of possible paths based on a change in cost assigned by the MPPI module (see at least Shalev, [0253]).

Claims 5 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Danielczuk, Shalev, Ma and Pan as applied to claims 1 and 8 above, and further in view of U.S. Patent Publication Number 2025/0083666 to Do et al. (hereafter Do).

As per claim 5, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 1, as shown above. But, neither Danielczuk, Shalev, Ma nor Pan explicitly teaches the following limitation taught in Do: wherein selecting the lowest cost path includes applying an optimizer using a Monte Carlo approximation (see at least Do, [0112] disclosing that the main idea of these MPPI algorithms is to transform the cost function of an optimum control problem into the expectation of all possible trajectories. This allows solving the stochastic optimum problem with a probabilistic approximation (of the Monte Carlo type) using a direct sampling of a stochastic diffusion process. The MPPI algorithm determines a control sequence that minimizes the overall cost at each iteration. This cost corresponds to the integral of each individual cost at each step in which the solution of the Hamilton-Jacobi-Bellman equation is approximated using the Feynman-Kac theorem and the KL divergence as described in the aforementioned document by Williams et al. The use of the MPPI algorithm is attractive because it consists of a derivative-free optimization method, which enables the use of non-linear and non-convex cost models and functions, and having demonstrated good performances in aggressive driving situations is therefore particularly well suited to emergency trajectories in the context of driving aid).

Danielczuk, Shalev, Ma, Pan and Do are analogous art to claim 5 because they are in the same field relating to navigation techniques. Danielczuk relates to processors or computing systems used to determine collisions between objects and a scene using various novel techniques (see at least Danielczuk, [0002]). Shalev relates to systems and methods for navigating according to potential accident liability constraints (see at least Shalev, [0002]). Ma relates to processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instruction (see at least Ma, Abstract). Pan relates to systems and methods for vision-language planning (VLP) foundation models for autonomous driving (see at least Pan, [0001]). Do relates to a method for controlling a motor vehicle allowing avoiding an obstacle (see at least Do, [0002]).

Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method as disclosed in Danielczuk, as modified by Shalev, Ma and Pan, to provide the benefit of having the selection of the lowest cost path include applying an optimizer using a Monte Carlo approximation, as disclosed in Do, with a reasonable expectation of success. Doing so would provide the benefit of minimizing the risks of impact and the risks of serious injury (see at least Do, [0002]).
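Editor's note: the Do passage quoted above describes the MPPI update as a Monte Carlo approximation: costs become an expectation over sampled trajectories, and the control sequence is updated by an exponentially weighted average (the derivative-free, Williams-style form). A self-contained sketch on a toy 1-D double integrator follows; the dynamics, horizon, temperature, and noise scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
HORIZON, SAMPLES, LAMBDA, SIGMA, GOAL = 30, 256, 1.0, 0.5, 1.0

def rollout_cost(controls: np.ndarray) -> float:
    """Trajectory cost under toy 1-D double-integrator dynamics."""
    pos = vel = 0.0
    cost = 0.0
    for u in controls:
        vel += 0.1 * u
        pos += 0.1 * vel
        cost += (pos - GOAL) ** 2 + 0.01 * u ** 2   # state error + control effort
    return cost

def mppi_update(nominal: np.ndarray) -> np.ndarray:
    """One MPPI iteration: perturb the nominal control sequence with Gaussian
    noise, score every sampled trajectory, and take the exponentially weighted
    (softmin) average of the noise -- the Monte Carlo approximation of the
    path-integral expectation described in the Do passage above."""
    noise = rng.normal(0.0, SIGMA, size=(SAMPLES, HORIZON))
    costs = np.array([rollout_cost(nominal + eps) for eps in noise])
    weights = np.exp(-(costs - costs.min()) / LAMBDA)   # derivative-free weighting
    weights /= weights.sum()
    return nominal + weights @ noise                    # weighted-average update

u = np.zeros(HORIZON)
for _ in range(20):
    u = mppi_update(u)
print(f"cost after 20 iterations: {rollout_cost(u):.3f}")
```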
As per claim 12, similar to claim 5, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. But neither Danielczuk, Shalev, Ma nor Pan explicitly teaches the following limitation taught in Do: wherein the MPPI module is configured to select the lowest cost path by applying an optimizer using a Monte Carlo approximation (see at least Do, [0112]). Danielczuk, Shalev, Ma, Pan and Do are analogous art to claim 12 for the reasons given above with respect to claim 5, and it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine them for the same reasons and with the same reasonable expectation of success: applying a Monte Carlo optimizer, as disclosed in Do, would provide the benefit of minimizing the risks of impact and the risks of serious injury (see at least Do, [0002]).

Claims 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Danielczuk, Shalev, Ma and Pan as applied to claims 8 and 17 above, and further in view of U.S. Patent Publication Number 2025/0157207 to Abati et al. (hereafter Abati).

As per claim 16, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 8, as shown above. But neither Danielczuk, Shalev, Ma nor Pan explicitly teaches the following limitation taught in Abati: wherein the LVLM is built from a generative pretrained transformer (see at least Abati, [0063], disclosing that Fig. 5 is a block diagram illustrating an example of generative data augmentation. In the example of FIG. 5, a pretrained generative model (e.g., an LDM) 502, having a set of model parameters (e.g., weights) ϕ, receives a text-based input of a single class type having a class name (e.g., "An image of a single <class-name>"). For instance, the generative model 502 may receive as input the text "Image of a single cat," to which the generative model 502 outputs an actual image of a cat. This is the process of sampling image and class (x and y) from a text2image LDM; see also [0067]; [0069]).

Danielczuk, Shalev, Ma, Pan and Abati are analogous art to claim 16 because they are in the same field relating to navigation techniques. Danielczuk relates to processors or computing systems used to determine collisions between objects and a scene using various novel techniques (see at least Danielczuk, [0002]). Shalev relates to systems and methods for navigating according to potential accident liability constraints (see at least Shalev, [0002]). Ma relates to processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instruction (see at least Ma, Abstract). Pan relates to systems and methods for vision-language planning (VLP) foundation models for autonomous driving (see at least Pan, [0001]). Abati relates to image recognition based on generative data augmentation with task-loss-guided fine-tuning (see at least Abati, [0001]).

Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method as disclosed in Danielczuk, as modified by Shalev, Ma and Pan, to provide the benefit of having the LVLM be built from a generative pretrained transformer, as disclosed in Abati, with a reasonable expectation of success. Doing so would provide the benefit of improving image recognition (see Abati, [0003]).
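For context, here is a minimal sketch of the text2image sampling Abati's paragraph [0063] describes, i.e., drawing a labeled (image, class) pair from a pretrained latent diffusion model. The Hugging Face diffusers library and the Stable Diffusion checkpoint below are stand-ins chosen for the sketch; Abati's generative model 502 and its weights ϕ are not identified, so the library, model name, and prompt wiring are all assumptions.

```python
# Hypothetical stand-in for Abati's generative data augmentation (Fig. 5):
# sample an (image, label) pair from a pretrained text2image LDM.
# Assumed setup: pip install torch diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline  # stand-in for model 502

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",          # assumed checkpoint, not Abati's
    torch_dtype=torch.float16,
).to("cuda")

def sample_labeled_image(class_name: str):
    """Sample image x with label y from the text2image LDM (per Abati, [0063])."""
    prompt = f"An image of a single {class_name}"  # template quoted in Abati
    image = pipe(prompt).images[0]
    return image, class_name                       # (x, y) training pair

x, y = sample_labeled_image("cat")                 # Abati's worked example
x.save("augmented_cat.png")
```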
As per claim 18, the combination of Danielczuk, Shalev, Ma and Pan discloses all of the limitations of claim 17, as shown above. But neither Danielczuk, Shalev, Ma nor Pan explicitly teaches the following limitation taught in Abati: wherein the LLM is a generative pretrained transformer (see at least Abati, [0063]; [0067]; [0069]). Danielczuk, Shalev, Ma, Pan and Abati are analogous art to claim 18 for the reasons given above with respect to claim 16. Therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method as disclosed in Danielczuk, as modified by Shalev, Ma and Pan, to provide the benefit of having the LLM be a generative pretrained transformer, as disclosed in Abati, with a reasonable expectation of success. Doing so would provide the benefit of improving image recognition (see Abati, [0003]).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PATRICK M. BRADY III, whose telephone number is (571) 272-7458. The examiner can normally be reached Monday - Friday, 8:00 am - 5:30 pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Helal Algahaim, can be reached at (571) 270-5227. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

PATRICK M. BRADY III
Examiner, Art Unit 3666

/PATRICK M BRADY/
Examiner, Art Unit 3666

/HELAL A ALGAHAIM/
SPE, Art Unit 3645

Prosecution Timeline

Jan 19, 2024 · Application Filed
Jul 24, 2025 · Non-Final Rejection — §103
Oct 10, 2025 · Interview Requested
Oct 16, 2025 · Examiner Interview Summary
Oct 16, 2025 · Applicant Interview (Telephonic)
Oct 23, 2025 · Response Filed
Feb 02, 2026 · Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594992 · VEHICLE STEERING CONTROL DEVICE · 2y 5m to grant · Granted Apr 07, 2026
Patent 12591236 · REMOTE SUPPORT SYSTEM AND REMOTE SUPPORT METHOD · 2y 5m to grant · Granted Mar 31, 2026
Patent 12589734 · METHOD FOR DEALING WITH OBSTACLES IN AN INDUSTRIAL TRUCK · 2y 5m to grant · Granted Mar 31, 2026
Patent 12583517 · VEHICLE STEERING CONTROL DEVICE · 2y 5m to grant · Granted Mar 24, 2026
Patent 12577755 · WORK MACHINE AND CONTROL SYSTEM · 2y 5m to grant · Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 56%
With Interview: 99% (+44.1%)
Median Time to Grant: 3y 2m
PTA Risk: Moderate
Based on 119 resolved cases by this examiner. Grant probability derived from career allow rate.
