Last updated: May 29, 2026
Application No. 18/403,234
GENERATIVE ARTIFICIAL INTELLIGENCE TO GENERATE MULTIPLE AUTONOMOUS VEHICLE FUTURE TRAJECTORIES

Non-Final OA §103
Filed: Jan 03, 2024
Examiner: ALSOMAIRY, IBRAHIM ABDOALATIF
Art Unit: 3667
Tech Center: 3600 (Transportation & Electronic Commerce)
Assignee: GM Cruise Holdings LLC
OA Round: 2 (Non-Final)
Interview Optional

Interview already conducted in this application's prosecution history. This examiner has a 42% grant rate with a +10.5% interview lift. Since an interview has already been tried, the recommended approach is a written response with narrowed claims based on precedent claim evolution patterns.
Based on 86 resolved cases, 2023–2026
Examiner Intelligence

ALSOMAIRY, IBRAHIM ABDOALATIF
Career Allowance Rate: grants 42% of resolved cases (36 granted / 86 resolved), -10.1% vs TC avg
Interview Lift: +10.5% (moderate lift), resolved cases with interview
Avg Prosecution (typical timeline): 3y 2m, 26 currently pending
Total Applications (career history): 128 across all art units
Statute-Specific Performance

§101: 0.3% (-39.7% vs TC avg)
§103: 97.7% (+57.7% vs TC avg)
§102: 1.4% (-38.6% vs TC avg)
§112: 0.6% (-39.4% vs TC avg)
Based on career data from 86 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
	This is a Final Action on the Merits. Claims 1-20 are currently pending and are addressed below.

Response to Amendments 
	The amendment filed on September 16th, 2025 has been considered and entered. Accordingly, claims 1, 3-5, 8-11, 14-17, and 20 have been amended.

Response to Arguments 
	The previous rejection of claim 10 under 35 USC 112(b) has been overcome due to the Applicant’s amendments.
	The previous rejection of claims 1-20 under 35 USC 101 has been overcome due to the Applicant’s amendments.
	The Applicant’s arguments with respect to claims 1-20 have been considered but are moot in view of the newly formulated grounds of rejection necessitated by the Applicant’s amendments.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 5-6, 9-12, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Pronovost (US 20240101157 A1) (“Pronovost”) in view of Li (US 20240124016 A1) (“Li”) in view of Muller (US 20190384303 A1) (“Muller”) in view of Lee (US 20240400095 A1) (“Lee”) in view of Xie (US 20210199442 A1) (“Xie”).

With respect to claim 1, Pronovost teaches a computer-implemented method comprising: 
receiving, at volumes and in a variety to electronically replicate a dynamic environment, real time sensor data from one or more sensors of an AV that uses some or all of the real time sensor data to self-navigate through the dynamic environment without a human driver (See at least Pronovost FIG. 3 and Paragraphs 62-63 “As depicted in FIG. 3, input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3, the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).”);
generating from the real time sensor data real time input data and providing the real time input data to a trajectory generation model, and wherein the input data comprises vector map representations, nearby actor history, and autonomous vehicle (AV) history of an AV1 (See at least Pronovost FIGS. 3, 5-9 and Paragraphs 62-63 “As depicted in FIG. 3 , input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) 
of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).” | Paragraph 142 “At operation 902, the process may include receiving, by a Transformer model, a request to generate a simulated environment that includes a vehicle and an object. For example, the vehicle computing device 804 can initiate a simulation or otherwise send an instruction requesting that the model component 830 generate object trajectories for one or more objects in an environment of a vehicle. In some examples, the Transformer model (e.g., the Transformer model 204) can receive tokens from the codebook 202. In various examples, the tokens can be based at least in part on sensor data from the perception component 822 and map data from the map(s) 828 and/or the map component 844. The vehicle computing device may be configured to receive sensor data representing one or more objects in an environment (e.g., vehicle 110). In some examples, the vehicle computing device may be configured to detect dynamic objects and/or static objects and combine the associated sensor data with map data. In some examples, the map data may represent fixed features of an environment including but not limited to crosswalks, traffic signals, school zones, and the like. The model component 830 may also or instead receive object state data such as position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object.”); 
generating, in real time map tokens from the vector map representations and generating agent tokens from the nearby actor history and the AV history wherein the map tokens and agent tokens are structured as sequential input for the trajectory generation model; inputting a concatenated set of the map tokens and the agent tokens into an encoder transformer of the trajectory generation model (See at least Pronovost Paragraphs 20-21 “In some examples, a first machine learned model comprising one or more self-attention layers (e.g., a Transformer model) can receive tokens representing an object action, an object state, etc. from the codebook and arrange the tokens in a sequence to represent object behaviors (e.g., relative to the environment, relative to the autonomous vehicle, relative to another object, and so on). The first machine learned model can, for example, determine an output by using the one or more self-attention layers to arrange tokens in order or cluster the tokens. For instance, the first machine learned model can employ an autoregressive algorithm or other techniques to sample tokens from the codebook. Generally, the arrangement, set, or cluster of tokens output by the first machine learned model represents discrete latent variables, or interactions between tokens. In some examples, the codebook can map, identify, or determine feature vectors that correspond to the discrete latent variables to represent the output of the first machine learned model as a set of feature vectors that represent continuous variables. For example, the codebook can convert the sequence of tokens that represents discrete latent variables into a new sequence of tokens that represents feature vectors. 
The feature vectors representing continuous variables can, for example, be fed into a second machine learned model (e.g., a decoder, a generator of a Generative Adversarial Network (GAN), a Graph Neural Network (GNN), a Recurrent Neural Network (RNN), another Transformer model, etc.) as input data to determine a potential action or behavior of one or more objects in an environment. In various examples, the second machine learned model can output data representing one or more of: an object trajectory, a heatmap showing a likelihood of occupancy by an object(s), object state data, or scene data usable in simulation, just to name a few. An output of the second machine learned model can be sent to the vehicle computing device for use in planning operations of a vehicle (e.g., to determine a candidate trajectory for the vehicle).”); 
outputting, by the encoder transformer, an output embedding that is representative of the dynamic environment of the AV (See at least Pronovost Paragraph 22 “In some examples, the second machine learned model can generate a simulated environment that includes one or more object trajectories based at least in part on receiving a sequence of tokens representing feature vectors, though in some examples the second machine learned model can also or instead receive a sequence of tokens representing discrete latent variables. In various examples, the second machine learned model can determine a response by the vehicle to the object trajectory in the simulated environment and control the vehicle in a real-world environment based at least in part on the response.”); 
Pronovost fails to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by a decoder of the GPT-based trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Li teaches determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding (See at least Li FIG. 6B and Paragraph 95 “As shown in FIG. 4E, the transformer model 482 may include an encoder stack having a plurality of encoders 484 (or encoding layers) coupled with a decoder stack having a plurality of decoders 486 (or decoding layers). In the example shown in FIG. 4E, the input 483 (e.g., the embedding of each individual portion of the input 483) flows through successive encoders 484, with the output of the final encoder 484 being passed to every decoder 486 in the decoder stack. For example, in some cases, each encoder 484 in the encoder stack may generate an encoding that contains information about which parts of the input 483 are relevant to each other. Moreover, the output of one encoder 484 may be passed on as an input to the next encoder 484 in the encoder stack. Accordingly, in some cases, the first encoder 484 in the encoder stack may generate a first encoding of the input 483 (e.g., the embedding of each individual portion of the input 483) while the next encoder 484 in the encoder stack may generate a second encoding of the first encoding.” | Paragraph 111-112 “At 654, a plurality of candidate trajectories for a vehicle may be generated by at least applying a plurality of planning models including a first planning model to generate a first candidate trajectory based at least on the scenario and a second planning model to generate a second candidate trajectory based on the scenario. As noted, in some example embodiments, the planning models 504 a-c may be based on different machine learning architectures. Alternatively and/or additionally, the planning models 504 a-c may be trained based on different training data. 
For example, in some cases, whereas the first planning model 504 a determines the trajectory of a vehicle in the scenario 505 based at least on an interaction between a plurality of agents while located relative to one or more lanes present in the scenario 505, the second planning model 504 b may determine the trajectory of the vehicle based at least on an interaction between the plurality of agents and the one or more lanes while the plurality of agents are located relative to the one or more lanes. That multiple planning models, including the planning models 504 a-c, are applied to determine candidate trajectories for the same scenario 505 enables the planning system 404 to leverage a wider variety of planning models, at least some of which having better performance than others when encountering certain scenarios. For instance, in some cases, the first planning model 504 a may be applied to generate a first candidate trajectory for the scenario 505 while the second planning model 504 b may be applied to generate a second candidate scenario for the scenario 505 and, in some cases, the third planning model 504 c may be applied to also generate a third candidate scenario for the scenario 505. Moreover, in some cases, each of the candidate trajectories may be represented by a graph in which a plurality of interconnected nodes correspond to a plurality of waypoints forming the trajectory. At 656, a routing model may be applied to select, based at least on the scenario, a trajectory from the plurality of candidate trajectories. In some example embodiments, the routing model 502 may ingest, as input, the scenario 505 and multiple candidate trajectories for the scenario 505 including, for example, the first candidate scenario generated by the first planning model 504 a, the second candidate trajectory generated by the second planning model 504 b, the third candidate trajectory generated by the third planning model 504 c, and/or the like. 
Furthermore, the routing model 504 may generate, based at least on the scenario 505 and the candidate trajectories, an output identifying the one of the planning models 504 a-c having the best performance amongst the planning models 504 a-c. For example, in some cases, the routing model 502 may be a graph neural network (GNN) operating on the graphs representative of the scenario 505 and each of the trajectories to determine which one of the planning models 504 a-c generated a candidate trajectory having a lowest average displacement error (ADE) compared to the other planning models.”).
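The selection criterion named in the Li passage above, lowest average displacement error (ADE) among candidate trajectories, can be illustrated with a minimal sketch. It assumes each candidate is a list of (x, y) waypoints and that a reference trajectory of the same length is available to score against. Li describes a routing model (e.g., a GNN) that identifies the best-performing planning model; the sketch below only computes ADE directly and is not Li's routing model. The names `average_displacement_error`, `select_lowest_ade`, `model_a`, and `model_b` are hypothetical.

```python
import math


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


def select_lowest_ade(candidates, reference):
    """Return (name, ade) for the candidate trajectory with the lowest ADE."""
    scores = {
        name: average_displacement_error(trajectory, reference)
        for name, trajectory in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical data: a straight reference path and two candidate trajectories.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = {
    "model_a": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away steadily
    "model_b": [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)],  # small constant offset
}
best_name, best_ade = select_lowest_ade(candidates, reference)
print(best_name, round(best_ade, 4))
```

Here `model_a` has an ADE of 1.0 and `model_b` an ADE of about 0.333, so `model_b` is selected.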
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost to include determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding, as taught by Li as disclosed above, in order to determine accurate waypoints for an autonomous vehicle (Li Paragraph 24 “In some aspects and/or embodiments, systems, methods, and computer program products described herein include and/or implement techniques for generating trajectories for navigating a vehicle in a variety of scenarios, including infrequent scenarios for which training samples are few or even nonexistent”).
Pronovost in view of Li fails to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Muller teaches determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model (See at least Muller Paragraphs 96-97 “The machine learning model(s) 108 may perform forward pass computations on the training image data 404 and/or the training sensor data 406. In some examples, the machine learning model(s) 108 may learn recommended trajectories that correspond to 2D or 3D world space coordinates (e.g., as represented by trajectory points) that correspond to recommended or predicted trajectories for each image and/or sensor data representation. The loss function(s) 316 may be used to measure loss (e.g., error) in the output trajectory data 410 and/or other output(s) 408 (e.g., as predicted by the machine learning model(s) 108) as compared to the ground truth data (e.g., the labels or annotations corresponding to the trajectory training data 418). For example, a binary cross entropy loss function, mean squared error (L2) loss function, L1 loss function, and/or other loss function type may be used as the loss function(s) 316. In some embodiments, two or more different loss functions may be used. For example, one or more loss functions may be used for each type of output 408 where there are two or more outputs 408, or two or more loss functions may be used for a single output type. Where two or more loss functions are used for a single output type (e.g., for the trajectory data 410), the loss functions may be weighted with respect to one another to generate a weighted loss function. Backward pass computations may be performed to recursively compute gradients of the loss function with respect to training parameters (e.g., weights, biases, etc.), as indicated by arrow 420. In some examples, weight and biases of the machine learning model(s) 108 may be used to compute these gradients. 
In some examples, as described herein, distance based weighted loss may be added to the loss function 316, where the loss function 316 may increasingly penalize loss at farther distances from the bottom of the image or other data representation (or from the vehicle), as described herein. A distance based weighted loss function penalizes more errors on the far distance, and therefore improves the accuracy of the predicted trajectory, or trajectory points, in the far distance.”).
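The two ideas in the Muller passage above, loss functions weighted with respect to one another and a distance-based weighting that penalizes far-distance errors more heavily, can be sketched as follows. This is an illustrative assumption-laden example, not Muller's implementation: waypoint index stands in for distance from the vehicle, the linear weight `1 + growth * i` and the 0.7/0.3 mix of L2 and L1 terms are arbitrary choices, and all function names are hypothetical.

```python
def distance_weighted_l2(predicted, target, growth=0.5):
    """Weighted mean squared error; waypoint i gets weight 1 + growth * i,
    so errors at later (farther) waypoints are penalized more."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    weights = [1.0 + growth * i for i in range(len(predicted))]
    squared_errors = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(predicted, target)
    ]
    return sum(w * e for w, e in zip(weights, squared_errors)) / sum(weights)


def l1_loss(predicted, target):
    """Unweighted mean absolute error over all waypoints."""
    abs_errors = [
        abs(px - tx) + abs(py - ty)
        for (px, py), (tx, ty) in zip(predicted, target)
    ]
    return sum(abs_errors) / len(abs_errors)


def combined_loss(predicted, target, l2_weight=0.7, l1_weight=0.3):
    """Two loss functions weighted with respect to one another."""
    return (l2_weight * distance_weighted_l2(predicted, target)
            + l1_weight * l1_loss(predicted, target))


# Hypothetical data: the same 1.0 lateral error placed near vs. far.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
near_error = [(0.0, 1.0), (1.0, 0.0), (2.0, 0.0)]
far_error = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
print(combined_loss(near_error, target), combined_loss(far_error, target))
```

With these settings an identical error costs more at the last waypoint than at the first, which is the behavior the quoted passage attributes to a distance-based weighted loss.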
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li to include determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model, as taught by Muller as disclosed above, in order to ensure optimal and accurate model training (Muller Paragraph 22 “Systems and methods are disclosed related to behavior-guided path planning in autonomous machine applications. More specifically, the present disclosure relates to path planning for autonomous vehicles using behavior observation as represented by sensor data generated by sensors of the autonomous vehicles.”).
Pronovost in view of Li in view of Muller fails to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Lee teaches that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction (See at least Lee Paragraph 10 “The ML system can provide the various types of data as input to one or more machine-learning models (hereinafter “ML models”). In some examples, the ML model(s) may be trained to output one or more unique candidate trajectories for the vehicle to follow. That is, the ML model(s) may be trained to output one or more learned trajectories that are unique in type and cause the vehicle to perform a certain type of action (e.g., lane change left, lane change right, increased assertiveness, increased passiveness, nominal trajectory, etc.)” | Paragraphs 14-15 “In some examples, the ML system may provide the various types of data as input to one or more ML models trained to output encoding(s) of the data. The ML system may encode the data and input such encoded data into one or more different ML models trained to output the learned trajectories … In some examples, the ML system may use mixture of experts ML technique to ensure that the ML model(s) are trained to output unique candidate trajectories. A mixture of experts technique includes using an ML model that includes multiple output heads (e.g., copies and/or versions of the model architecture) trained to output a diverse set of candidate trajectories. 
Specifically, the mixture of experts approach may result in different model heads being trained to output different types of trajectories that cause the vehicle to perform different actions” | Paragraph 64 “Additional examples of architectures include neural networks such as ResNet-50, ResNet-101, VGG, DenseNet, PointNet, Xception, ConvNeXt, and the like; visual transformer(s) (ViT(s)), such as a bidirectional encoder from image transformers (BEIT), visual bidirectional encoder from transformers (VisualBERT), image generative pre-trained transformer (Image GPT)”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller to include that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction, as taught by Lee as disclosed above, in order to achieve accurate vehicle waypoint prediction (Lee Paragraph 10 “As discussed throughout this disclosure, the techniques described herein may improve vehicle safety and/or driving efficiency by determining improved driving trajectories through the environment by increasing the proportion of learned trajectories to heuristic-based trajectories in a tree structure, thereby reducing computations expenses when determining a control trajectory for the vehicle to follow”).
Pronovost in view of Li in view of Muller in view of Lee fails to explicitly disclose selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Xie teaches selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time (See at least Xie Paragraph 32 “The planning stack 116 can determine how to maneuver or operate the AV 102 safely and efficiently in its environment. For example, the planning stack 116 can receive the location, speed, and direction of the AV 102, geospatial data, data regarding objects sharing the road with the AV 102 (e.g., pedestrians, bicycles, vehicles, ambulances, buses, cable cars, trains, traffic lights, lanes, road markings, etc.) or certain events occurring during a trip (e.g., emergency vehicle blaring a siren, intersections, occluded areas, street closures for construction or street repairs, double-parked cars, etc.), traffic rules and other safety standards or practices for the road, user input, and other relevant data for directing the AV 102 from one point to another. The planning stack 116 can determine multiple sets of one or more mechanical operations that the AV 102 can perform (e.g., go straight at a specified rate of acceleration, including maintaining the same speed or decelerating; turn on the left blinker, decelerate if the AV is above a threshold range for turning, and turn left; turn on the right blinker, accelerate if the AV is stopped or below the threshold range for turning, and turn right; decelerate until completely stopped and reverse; etc.), and select the best one to meet changing road conditions and events. If something unexpected happens, the planning stack 116 can select from multiple backup plans to carry out. For example, while preparing to change lanes to turn right at an intersection, another vehicle may aggressively cut into the destination lane, making the lane change unsafe. 
The planning stack 116 could have already determined an alternative plan for such an event, and upon its occurrence, help to direct the AV 102 to go around the block instead of blocking a current lane while waiting for an opening to change lanes.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller in view of Lee to include selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time, as taught by Xie as disclosed above, in order to ensure optimal vehicle control in dynamic environments (Xie Paragraph 22 “The present technology can involve the gathering and use of data available from various sources to improve quality and experience”).

With respect to claim 2, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teach that the encoder transformer comprises an early fusion transformer (See at least Pronovost Paragraphs 62-63 and Paragraphs 70-71 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like). For instance, the vector representation 322 can comprise vectors to represent features of the environment including roadway boundary vectors 328 and roadway centerline vectors 330, among others. In various examples, the computing device can implement the encoder 304 (or other machine learned model) to generate the vector representation 322 based at least in part on state data associated with the autonomous vehicle 324 and/or the object 326. The state data can include data describing an object (e.g., the pedestrian 108, the vehicle 110 in FIG. 1 ) and/or a vehicle (e.g., vehicle 102) in an environment, such as in example environment 100. The state data can include, in various examples, one or more of position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object and/or the vehicle. In some examples, vectors associated with an environment, a vehicle state, and/or an object state may be combined as the vector representation 322.”).

With respect to claim 3, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teach that the early fusion transformer is to fuse the map tokens and the agent tokens together to generate dynamic environment embeddings used to determine the sequence of AV waypoint predictions for the AV (See at least Pronovost Paragraph 53 “In some examples, the Transformer model 204 can sample tokens from the codebook 202 using a combination of the autoregressive technique for some tokens and another technique different from the autoregressive technique for some other tokens. For example, the Transformer model 204 can determine the token sequence 206 by determining two or more tokens in the token sequence 206 using an autoregressive algorithm, and determining another token in the token sequence 206 randomly or without consideration to a previously determined token. In various examples, the Transformer model 204 can be trained using training data to condition the Transformer model 204 to determine the token sequence 206 with consideration to historical object state data, scene data, environmental data, and so on. For instance, training the Transformer model based at least in part on a set of conditions, at least one condition of the set of conditions comprising a previous action, a previous position, or a previous acceleration of the object. In this way, the Transformer model 204 can output the token sequence 206 having tokens that represent potential interactions between an object and the vehicle. Additional detail of training the codebook 202 is discussed in FIG. 5 and elsewhere.”).

With respect to claim 5, and similarly claims 11 and 17, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teaches that generating the map tokens and the agent tokens comprises utilizing at least one multi-layer perceptron (MLP) to generate the map tokens and the agent tokens in real-time (See at least Pronovost FIG. 3 and Paragraph 159 “As shown, the variable autoencoder 1110 comprises an encoder 1112 and a decoder 1114, which can be trained to perform a variety of functionality as described herein. The encoder 1112 can include the functionality performed by the encoder 304 of FIG. 3, and the decoder 1114 can include the functionality of the decoder 318.”).

With respect to claim 6, and similarly claims 12 and 18, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teaches that a combination of the encoder transformer and the decoder comprises an encoder-decoder transformer (See at least Pronovost Paragraph 168 “FIG. 13 depicts the variable autoencoder 1110 associated with pixel space 1308 that includes an encoder 1310 and a decoder 1312. The encoder 1310 can be configured similar to the encoder 1112 and the decoder 1312 can be configured similar to the decoder 1114. In some examples, the encoder 1310 and the decoder 1312 can represent an RNN or a multilayer perceptron (MLP).”).

With respect to claim 9, Pronovost teaches an apparatus comprising: one or more hardware processors to:
receiving, at volumes and in a variety to electronically replicate a dynamic environment, real time sensor data from one or more sensors of an AV that uses some or all of the real time sensor data to self-navigate through the dynamic environment without a human driver (See at least Pronovost FIG. 3 and Paragraphs 62-63 “As depicted in FIG. 3, input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3, the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).”);
generating from the real time sensor data real time input data and providing the real time input data to a trajectory generation model, and wherein the input data comprises vector map representations, nearby actor history, and autonomous vehicle (AV) history of an AV (See at least Pronovost FIGS. 3, 5-9 and Paragraphs 62-63 “As depicted in FIG. 3, input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3, the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.)
of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).” | Paragraph 142 “At operation 902, the process may include receiving, by a Transformer model, a request to generate a simulated environment that includes a vehicle and an object. For example, the vehicle computing device 804 can initiate a simulation or otherwise send an instruction requesting that the model component 830 generate object trajectories for one or more objects in an environment of a vehicle. In some examples, the Transformer model (e.g., the Transformer model 204) can receive tokens from the codebook 202. In various examples, the tokens can be based at least in part on sensor data from the perception component 822 and map data from the map(s) 828 and/or the map component 844. The vehicle computing device may be configured to receive sensor data representing one or more objects in an environment (e.g., vehicle 110). In some examples, the vehicle computing device may be configured to detect dynamic objects and/or static objects and combine the associated sensor data with map data. In some examples, the map data may represent fixed features of an environment including but not limited to crosswalks, traffic signals, school zones, and the like. The model component 830 may also or instead receive object state data such as position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object.”); 
generating, in real time map tokens from the vector map representations and generating agent tokens from the nearby actor history and the AV history wherein the map tokens and agent tokens are structured as sequential input for the trajectory generation model; inputting a concatenated set of the map tokens and the agent tokens into an encoder transformer of the trajectory generation model (See at least Pronovost Paragraphs 20-21 “In some examples, a first machine learned model comprising one or more self-attention layers (e.g., a Transformer model) can receive tokens representing an object action, an object state, etc. from the codebook and arrange the tokens in a sequence to represent object behaviors (e.g., relative to the environment, relative to the autonomous vehicle, relative to another object, and so on). The first machine learned model can, for example, determine an output by using the one or more self-attention layers to arrange tokens in order or cluster the tokens. For instance, the first machine learned model can employ an autoregressive algorithm or other techniques to sample tokens from the codebook. Generally, the arrangement, set, or cluster of tokens output by the first machine learned model represents discrete latent variables, or interactions between tokens. In some examples, the codebook can map, identify, or determine feature vectors that correspond to the discrete latent variables to represent the output of the first machine learned model as a set of feature vectors that represent continuous variables. For example, the codebook can convert the sequence of tokens that represents discrete latent variables into a new sequence of tokens that represents feature vectors. 
The feature vectors representing continuous variables can, for example, be fed into a second machine learned model (e.g., a decoder, a generator of a Generative Adversarial Network (GAN), a Graph Neural Network (GNN), a Recurrent Neural Network (RNN), another Transformer model, etc.) as input data to determine a potential action or behavior of one or more objects in an environment. In various examples, the second machine learned model can output data representing one or more of: an object trajectory, a heatmap showing a likelihood of occupancy by an object(s), object state data, or scene data usable in simulation, just to name a few. An output of the second machine learned model can be sent to the vehicle computing device for use in planning operations of a vehicle (e.g., to determine a candidate trajectory for the vehicle).”); 
outputting, by the encoder transformer, an output embedding that is representative of the dynamic environment of the AV (See at least Pronovost Paragraph 22 “In some examples, the second machine learned model can generate a simulated environment that includes one or more object trajectories based at least in part on receiving a sequence of tokens representing feature vectors, though in some examples the second machine learned model can also or instead receive a sequence of tokens representing discrete latent variables. In various examples, the second machine learned model can determine a response by the vehicle to the object trajectory in the simulated environment and control the vehicle in a real-world environment based at least in part on the response.”); 
Pronovost fails to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by a decoder of the GPT-based trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Li teaches determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding (See at least Li FIG. 6B and Paragraph 95 “As shown in FIG. 4E, the transformer model 482 may include an encoder stack having a plurality of encoders 484 (or encoding layers) coupled with a decoder stack having a plurality of decoders 486 (or decoding layers). In the example shown in FIG. 4E, the input 483 (e.g., the embedding of each individual portion of the input 483) flows through successive encoders 484, with the output of the final encoder 484 being passed to every decoder 486 in the decoder stack. For example, in some cases, each encoder 484 in the encoder stack may generate an encoding that contains information about which parts of the input 483 are relevant to each other. Moreover, the output of one encoder 484 may be passed on as an input to the next encoder 484 in the encoder stack. Accordingly, in some cases, the first encoder 484 in the encoder stack may generate a first encoding of the input 483 (e.g., the embedding of each individual portion of the input 483) while the next encoder 484 in the encoder stack may generate a second encoding of the first encoding.” | Paragraph 111-112 “At 654, a plurality of candidate trajectories for a vehicle may be generated by at least applying a plurality of planning models including a first planning model to generate a first candidate trajectory based at least on the scenario and a second planning model to generate a second candidate trajectory based on the scenario. As noted, in some example embodiments, the planning models 504 a-c may be based on different machine learning architectures. Alternatively and/or additionally, the planning models 504 a-c may be trained based on different training data. 
For example, in some cases, whereas the first planning model 504 a determines the trajectory of a vehicle in the scenario 505 based at least on an interaction between a plurality of agents while located relative to one or more lanes present in the scenario 505, the second planning model 504 b may determine the trajectory of the vehicle based at least on an interaction between the plurality of agents and the one or more lanes while the plurality of agents are located relative to the one or more lanes. That multiple planning models, including the planning models 504 a-c, are applied to determine candidate trajectories for the same scenario 505 enables the planning system 404 to leverage a wider variety of planning models, at least some of which having better performance than others when encountering certain scenarios. For instance, in some cases, the first planning model 504 a may be applied to generate a first candidate trajectory for the scenario 505 while the second planning model 504 b may be applied to generate a second candidate scenario for the scenario 505 and, in some cases, the third planning model 504 c may be applied to also generate a third candidate scenario for the scenario 505. Moreover, in some cases, each of the candidate trajectories may be represented by a graph in which a plurality of interconnected nodes correspond to a plurality of waypoints forming the trajectory. At 656, a routing model may be applied to select, based at least on the scenario, a trajectory from the plurality of candidate trajectories. In some example embodiments, the routing model 502 may ingest, as input, the scenario 505 and multiple candidate trajectories for the scenario 505 including, for example, the first candidate scenario generated by the first planning model 504 a, the second candidate trajectory generated by the second planning model 504 b, the third candidate trajectory generated by the third planning model 504 c, and/or the like. 
Furthermore, the routing model 504 may generate, based at least on the scenario 505 and the candidate trajectories, an output identifying the one of the planning models 504 a-c having the best performance amongst the planning models 504 a-c. For example, in some cases, the routing model 502 may be a graph neural network (GNN) operating on the graphs representative of the scenario 505 and each of the trajectories to determine which one of the planning models 504 a-c generated a candidate trajectory having a lowest average displacement error (ADE) compared to the other planning models.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the apparatus of Pronovost to include determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding, as taught by Li as disclosed above, in order to determine accurate waypoints for an autonomous vehicle (Li Paragraph 24 “In some aspects and/or embodiments, systems, methods, and computer program products described herein include and/or implement techniques for generating trajectories for navigating a vehicle in a variety of scenarios, including infrequent scenarios for which training samples are few or even nonexistent”).
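Illustrative note: the selection criterion named in the Li passage quoted above (choosing the candidate trajectory with the lowest average displacement error, ADE) can be sketched as follows. This is a minimal sketch under stated assumptions and not Li's implementation: Li describes a routing model (e.g., a GNN) that identifies the best-performing planning model, whereas the sketch simply computes ADE against a known reference trajectory. The function names, the dictionary layout, and the use of 2D waypoints are hypothetical.

```python
import math


def average_displacement_error(candidate, reference):
    """Mean Euclidean distance between corresponding waypoints.

    Both arguments are equal-length sequences of (x, y) tuples
    (an assumed representation, not taken from the cited reference).
    """
    if not candidate or len(candidate) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(c, r) for c, r in zip(candidate, reference))
    return total / len(candidate)


def select_lowest_ade(candidates, reference):
    """Return the key of the candidate trajectory with the lowest ADE.

    `candidates` maps a hypothetical planning-model name to its trajectory.
    """
    return min(
        candidates,
        key=lambda name: average_displacement_error(candidates[name], reference),
    )
```

In this sketch a reference trajectory must be available, so it corresponds to an offline evaluation setting rather than to on-vehicle selection.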
Pronovost in view of Li fail to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Muller teaches determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model (See at least Muller Paragraphs 96-97 “The machine learning model(s) 108 may perform forward pass computations on the training image data 404 and/or the training sensor data 406. In some examples, the machine learning model(s) 108 may learn recommended trajectories that correspond to 2D or 3D world space coordinates (e.g., as represented by trajectory points) that correspond to recommended or predicted trajectories for each image and/or sensor data representation. The loss function(s) 316 may be used to measure loss (e.g., error) in the output trajectory data 410 and/or other output(s) 408 (e.g., as predicted by the machine learning model(s) 108) as compared to the ground truth data (e.g., the labels or annotations corresponding to the trajectory training data 418). For example, a binary cross entropy loss function, mean squared error (L2) loss function, L1 loss function, and/or other loss function type may be used as the loss function(s) 316. In some embodiments, two or more different loss functions may be used. For example, one or more loss functions may be used for each type of output 408 where there are two or more outputs 408, or two or more loss functions may be used for a single output type. Where two or more loss functions are used for a single output type (e.g., for the trajectory data 410), the loss functions may be weighted with respect to one another to generate a weighted loss function. Backward pass computations may be performed to recursively compute gradients of the loss function with respect to training parameters (e.g., weights, biases, etc.), as indicated by arrow 420. In some examples, weight and biases of the machine learning model(s) 108 may be used to compute these gradients. 
In some examples, as described herein, distance based weighted loss may be added to the loss function 316, where the loss function 316 may increasingly penalize loss at farther distances from the bottom of the image or other data representation (or from the vehicle), as described herein. A distance based weighted loss function penalizes more errors on the far distance, and therefore improves the accuracy of the predicted trajectory, or trajectory points, in the far distance.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the apparatus of Pronovost in view of Li to include determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model, as taught by Muller as disclosed above, in order to ensure optimal and accurate model training (Muller Paragraph 22 “Systems and methods are disclosed related to behavior-guided path planning in autonomous machine applications. More specifically, the present disclosure relates to path planning for autonomous vehicles using behavior observation as represented by sensor data generated by sensors of the autonomous vehicles.”).
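Illustrative note: the distance based weighted loss described in the Muller passage quoted above, which penalizes errors more heavily at farther distances from the vehicle, can be sketched as follows. This is a minimal sketch under stated assumptions and not Muller's implementation: the function name, the linear weighting `1 + alpha * distance`, the parameter `alpha`, the squared-error base loss, and the convention that the vehicle sits at the origin are all hypothetical choices made for illustration.

```python
import math


def distance_weighted_l2(pred, target, alpha=0.1):
    """Weighted mean of squared waypoint errors.

    Each waypoint's squared error is weighted by 1 + alpha * d, where d is
    the distance of the ground-truth waypoint from the vehicle (assumed to
    be at the origin), so far-away errors are penalized more.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    weighted_error = 0.0
    weight_sum = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        distance = math.hypot(tx, ty)          # distance from vehicle origin
        weight = 1.0 + alpha * distance        # assumed linear weighting
        weighted_error += weight * ((px - tx) ** 2 + (py - ty) ** 2)
        weight_sum += weight
    return weighted_error / weight_sum
```

With `alpha=0.0` the function reduces to an unweighted mean squared error, which makes the effect of the distance weighting easy to compare.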
Pronovost in view of Li in view of Muller fail to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Lee teaches that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction (See at least Lee Paragraph 10 “The ML system can provide the various types of data as input to one or more machine-learning models (hereinafter “ML models”). In some examples, the ML model(s) may be trained to output one or more unique candidate trajectories for the vehicle to follow. That is, the ML model(s) may be trained to output one or more learned trajectories that are unique in type and cause the vehicle to perform a certain type of action (e.g., lane change left, lane change right, increased assertiveness, increased passiveness, nominal trajectory, etc.)” | Paragraphs 14-15 “In some examples, the ML system may provide the various types of data as input to one or more ML models trained to output encoding(s) of the data. The ML system may encode the data and input such encoded data into one or more different ML models trained to output the learned trajectories … In some examples, the ML system may use mixture of experts ML technique to ensure that the ML model(s) are trained to output unique candidate trajectories. A mixture of experts technique includes using an ML model that includes multiple output heads (e.g., copies and/or versions of the model architecture) trained to output a diverse set of candidate trajectories. 
Specifically, the mixture of experts approach may result in different model heads being trained to output different types of trajectories that cause the vehicle to perform different actions” | Paragraph 64 “Additional examples of architectures include neural networks such as ResNet-50, ResNet-101, VGG, DenseNet, PointNet, Xception, ConvNeXt, and the like; visual transformer(s) (ViT(s)), such as a bidirectional encoder from image transformers (BEIT), visual bidirectional encoder from transformers (VisualBERT), image generative pre-trained transformer (Image GPT)”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the apparatus of Pronovost in view of Li in view of Muller to include that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction, as taught by Lee as disclosed above, in order to achieve accurate vehicle waypoint prediction (Lee Paragraph 10 “As discussed throughout this disclosure, the techniques described herein may improve vehicle safety and/or driving efficiency by determining improved driving trajectories through the environment by increasing the proportion of learned trajectories to heuristic-based trajectories in a tree structure, thereby reducing computations expenses when determining a control trajectory for the vehicle to follow”).
Pronovost in view of Li in view of Muller in view of Lee fail to explicitly disclose selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Xie teaches selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time (See at least Xie Paragraph 32 “The planning stack 116 can determine how to maneuver or operate the AV 102 safely and efficiently in its environment. For example, the planning stack 116 can receive the location, speed, and direction of the AV 102, geospatial data, data regarding objects sharing the road with the AV 102 (e.g., pedestrians, bicycles, vehicles, ambulances, buses, cable cars, trains, traffic lights, lanes, road markings, etc.) or certain events occurring during a trip (e.g., emergency vehicle blaring a siren, intersections, occluded areas, street closures for construction or street repairs, double-parked cars, etc.), traffic rules and other safety standards or practices for the road, user input, and other relevant data for directing the AV 102 from one point to another. The planning stack 116 can determine multiple sets of one or more mechanical operations that the AV 102 can perform (e.g., go straight at a specified rate of acceleration, including maintaining the same speed or decelerating; turn on the left blinker, decelerate if the AV is above a threshold range for turning, and turn left; turn on the right blinker, accelerate if the AV is stopped or below the threshold range for turning, and turn right; decelerate until completely stopped and reverse; etc.), and select the best one to meet changing road conditions and events. If something unexpected happens, the planning stack 116 can select from multiple backup plans to carry out. For example, while preparing to change lanes to turn right at an intersection, another vehicle may aggressively cut into the destination lane, making the lane change unsafe. 
The planning stack 116 could have already determined an alternative plan for such an event, and upon its occurrence, help to direct the AV 102 to go around the block instead of blocking a current lane while waiting for an opening to change lanes.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the apparatus of Pronovost in view of Li in view of Muller in view of Lee to include selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time, as taught by Xie as disclosed above, in order to ensure optimal vehicle control in dynamic environments (Xie Paragraph 22 “The present technology can involve the gathering and use of data available from various sources to improve quality and experience”).

With respect to claim 10, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teach that the set of tokens comprise map tokens generated from the vector map representations and agent tokens generated from the nearby actor history and the AV history (See at least Li Paragraphs 20-22), wherein the encoder transformer comprises an early fusion transformer (See at least Pronovost Paragraphs 62-63 and Paragraphs 70-71 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like). For instance, the vector representation 322 can comprise vectors to represent features of the environment including roadway boundary vectors 328 and roadway centerline vectors 330, among others. In various examples, the computing device can implement the encoder 304 (or other machine learned model) to generate the vector representation 322 based at least in part on state data associated with the autonomous vehicle 324 and/or the object 326. The state data can include data describing an object (e.g., the pedestrian 108, the vehicle 110 in FIG. 1 ) and/or a vehicle (e.g., vehicle 102) in an environment, such as in example environment 100. The state data can include, in various examples, one or more of position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object and/or the vehicle. 
In some examples, vectors associated with an environment, a vehicle state, and/or an object state may be combined as the vector representation 322.”) 
and wherein the early fusion transformer is to fuse the map tokens and the agent tokens together to generate dynamic environment embeddings used to determine the candidate sequence of AV waypoint predictions for the AV in real-time (See at least Pronovost Paragraph 53 “In some examples, the Transformer model 204 can sample tokens from the codebook 202 using a combination of the autoregressive technique for some tokens and another technique different from the autoregressive technique for some other tokens. For example, the Transformer model 204 can determine the token sequence 206 by determining two or more tokens in the token sequence 206 using an autoregressive algorithm, and determining another token in the token sequence 206 randomly or without consideration to a previously determined token. In various examples, the Transformer model 204 can be trained using training data to condition the Transformer model 204 to determine the token sequence 206 with consideration to historical object state data, scene data, environmental data, and so on. For instance, training the Transformer model based at least in part on a set of conditions, at least one condition of the set of conditions comprising a previous action, a previous position, or a previous acceleration of the object. In this way, the Transformer model 204 can output the token sequence 206 having tokens that represent potential interactions between an object and the vehicle. Additional detail of training the codebook 202 is discussed in FIG. 5 and elsewhere.”).

With respect to claim 15, Pronovost teaches a non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to:
receiving, at volumes and in a variety to electronically replicate a dynamic environment, real time sensor data from one or more sensors of an AV that uses some or all of the real time sensor data to self-navigate through the dynamic environment without a human driver (See at least Pronovost FIG. 3 and Paragraphs 62-63 “As depicted in FIG. 3 , input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) 
of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).”)
generating from the real time sensor data real time input data and providing the real time input data to a trajectory generation model, and wherein the input data comprises vector map representations, nearby actor history, and autonomous vehicle (AV) history of an AV (See at least Pronovost FIGS. 3, 5-9 and Paragraphs 62-63 “As depicted in FIG. 3 , input data 302 representing object trajectories associated with one or more objects, object state data, and scene data can be input into an encoder 304. The encoder 304 can represent a machine learned model such as a GNN, RNN, CNN, and the like, and output one or more feature vectors 306 which can be sent to a codebook 308 and a quantizer 310. In some examples, the quantizer 310 can receive the feature vectors 306 output by the encoder 304, and discretize the feature vectors 306 to output the discretized feature vectors 312. In some examples, the codebook 308 can receive the discretized feature vectors 312 while in other examples the codebook 308 can receive the feature vectors 306. A machine learned model 314 (e.g., a Transformer model) can receive tokens, discrete feature vectors, and/or continuous feature vectors from the codebook and arrange the tokens and associated feature vectors into a token sequence 316. The token sequence 316 can represent tokens arranged or clustered in a particular order. The machine learned model 314 can include one or more self-attention layers to cause the tokens to be arranged with attention to features of another token.” | Paragraph 70 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) 
of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like).” | Paragraph 142 “At operation 902, the process may include receiving, by a Transformer model, a request to generate a simulated environment that includes a vehicle and an object. For example, the vehicle computing device 804 can initiate a simulation or otherwise send an instruction requesting that the model component 830 generate object trajectories for one or more objects in an environment of a vehicle. In some examples, the Transformer model (e.g., the Transformer model 204) can receive tokens from the codebook 202. In various examples, the tokens can be based at least in part on sensor data from the perception component 822 and map data from the map(s) 828 and/or the map component 844. The vehicle computing device may be configured to receive sensor data representing one or more objects in an environment (e.g., vehicle 110). In some examples, the vehicle computing device may be configured to detect dynamic objects and/or static objects and combine the associated sensor data with map data. In some examples, the map data may represent fixed features of an environment including but not limited to crosswalks, traffic signals, school zones, and the like. The model component 830 may also or instead receive object state data such as position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object.”); 
generating, in real time map tokens from the vector map representations and generating agent tokens from the nearby actor history and the AV history wherein the map tokens and agent tokens are structured as sequential input for the trajectory generation model; inputting a concatenated set of the map tokens and the agent tokens into an encoder transformer of the trajectory generation model (See at least Pronovost Paragraphs 20-21 “In some examples, a first machine learned model comprising one or more self-attention layers (e.g., a Transformer model) can receive tokens representing an object action, an object state, etc. from the codebook and arrange the tokens in a sequence to represent object behaviors (e.g., relative to the environment, relative to the autonomous vehicle, relative to another object, and so on). The first machine learned model can, for example, determine an output by using the one or more self-attention layers to arrange tokens in order or cluster the tokens. For instance, the first machine learned model can employ an autoregressive algorithm or other techniques to sample tokens from the codebook. Generally, the arrangement, set, or cluster of tokens output by the first machine learned model represents discrete latent variables, or interactions between tokens. In some examples, the codebook can map, identify, or determine feature vectors that correspond to the discrete latent variables to represent the output of the first machine learned model as a set of feature vectors that represent continuous variables. For example, the codebook can convert the sequence of tokens that represents discrete latent variables into a new sequence of tokens that represents feature vectors. 
The feature vectors representing continuous variables can, for example, be fed into a second machine learned model (e.g., a decoder, a generator of a Generative Adversarial Network (GAN), a Graph Neural Network (GNN), a Recurrent Neural Network (RNN), another Transformer model, etc.) as input data to determine a potential action or behavior of one or more objects in an environment. In various examples, the second machine learned model can output data representing one or more of: an object trajectory, a heatmap showing a likelihood of occupancy by an object(s), object state data, or scene data usable in simulation, just to name a few. An output of the second machine learned model can be sent to the vehicle computing device for use in planning operations of a vehicle (e.g., to determine a candidate trajectory for the vehicle).”); 
outputting, by the encoder transformer, an output embedding that is representative of the dynamic environment of the AV (See at least Pronovost Paragraph 22 “In some examples, the second machine learned model can generate a simulated environment that includes one or more object trajectories based at least in part on receiving a sequence of tokens representing feature vectors, though in some examples the second machine learned model can also or instead receive a sequence of tokens representing discrete latent variables. In various examples, the second machine learned model can determine a response by the vehicle to the object trajectory in the simulated environment and control the vehicle in a real-world environment based at least in part on the response.”); 
Pronovost fails to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by a decoder of the GPT-based trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Li teaches determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding (See at least Li FIG. 6B and Paragraph 95 “As shown in FIG. 4E, the transformer model 482 may include an encoder stack having a plurality of encoders 484 (or encoding layers) coupled with a decoder stack having a plurality of decoders 486 (or decoding layers). In the example shown in FIG. 4E, the input 483 (e.g., the embedding of each individual portion of the input 483) flows through successive encoders 484, with the output of the final encoder 484 being passed to every decoder 486 in the decoder stack. For example, in some cases, each encoder 484 in the encoder stack may generate an encoding that contains information about which parts of the input 483 are relevant to each other. Moreover, the output of one encoder 484 may be passed on as an input to the next encoder 484 in the encoder stack. Accordingly, in some cases, the first encoder 484 in the encoder stack may generate a first encoding of the input 483 (e.g., the embedding of each individual portion of the input 483) while the next encoder 484 in the encoder stack may generate a second encoding of the first encoding.” | Paragraph 111-112 “At 654, a plurality of candidate trajectories for a vehicle may be generated by at least applying a plurality of planning models including a first planning model to generate a first candidate trajectory based at least on the scenario and a second planning model to generate a second candidate trajectory based on the scenario. As noted, in some example embodiments, the planning models 504 a-c may be based on different machine learning architectures. Alternatively and/or additionally, the planning models 504 a-c may be trained based on different training data. 
For example, in some cases, whereas the first planning model 504 a determines the trajectory of a vehicle in the scenario 505 based at least on an interaction between a plurality of agents while located relative to one or more lanes present in the scenario 505, the second planning model 504 b may determine the trajectory of the vehicle based at least on an interaction between the plurality of agents and the one or more lanes while the plurality of agents are located relative to the one or more lanes. That multiple planning models, including the planning models 504 a-c, are applied to determine candidate trajectories for the same scenario 505 enables the planning system 404 to leverage a wider variety of planning models, at least some of which having better performance than others when encountering certain scenarios. For instance, in some cases, the first planning model 504 a may be applied to generate a first candidate trajectory for the scenario 505 while the second planning model 504 b may be applied to generate a second candidate scenario for the scenario 505 and, in some cases, the third planning model 504 c may be applied to also generate a third candidate scenario for the scenario 505. Moreover, in some cases, each of the candidate trajectories may be represented by a graph in which a plurality of interconnected nodes correspond to a plurality of waypoints forming the trajectory. At 656, a routing model may be applied to select, based at least on the scenario, a trajectory from the plurality of candidate trajectories. In some example embodiments, the routing model 502 may ingest, as input, the scenario 505 and multiple candidate trajectories for the scenario 505 including, for example, the first candidate scenario generated by the first planning model 504 a, the second candidate trajectory generated by the second planning model 504 b, the third candidate trajectory generated by the third planning model 504 c, and/or the like. 
Furthermore, the routing model 504 may generate, based at least on the scenario 505 and the candidate trajectories, an output identifying the one of the planning models 504 a-c having the best performance amongst the planning models 504 a-c. For example, in some cases, the routing model 502 may be a graph neural network (GNN) operating on the graphs representative of the scenario 505 and each of the trajectories to determine which one of the planning models 504 a-c generated a candidate trajectory having a lowest average displacement error (ADE) compared to the other planning models.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost to include determining, by a decoder of the trajectory generation model, a candidate sequence of AV waypoint predictions for the AV based on the output embedding, as taught by Li as disclosed above, in order to determine accurate waypoints for an autonomous vehicle (Li Paragraph 24 “In some aspects and/or embodiments, systems, methods, and computer program products described herein include and/or implement techniques for generating trajectories for navigating a vehicle in a variety of scenarios, including infrequent scenarios for which training samples are few or even nonexistent”).
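For illustration only, the average displacement error (ADE) comparison mentioned in the cited Li passage can be sketched as follows. This is a minimal sketch under stated assumptions, not Li's routing model: Li describes a graph neural network that identifies the best planning model, whereas this sketch simply computes ADE against a known reference trajectory. The function names, the candidate names, and the sample waypoints are all hypothetical.

```python
import math


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


def select_lowest_ade(candidates, reference):
    """Return (name, ade) for the candidate trajectory with the lowest ADE."""
    scored = {
        name: average_displacement_error(trajectory, reference)
        for name, trajectory in candidates.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical data: a straight reference path and two candidate trajectories.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = {
    "model_a": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "model_b": [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)],
}
best_name, best_ade = select_lowest_ade(candidates, reference)
print(best_name, round(best_ade, 4))
```

A reference trajectory is generally only available during training or evaluation, which is presumably why the cited passage describes a learned model that predicts the best candidate rather than a direct computation.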
Pronovost in view of Li fail to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the GPT-based trajectory generation model; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Muller teaches determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model (See at least Muller Paragraphs 96-97 “The machine learning model(s) 108 may perform forward pass computations on the training image data 404 and/or the training sensor data 406. In some examples, the machine learning model(s) 108 may learn recommended trajectories that correspond to 2D or 3D world space coordinates (e.g., as represented by trajectory points) that correspond to recommended or predicted trajectories for each image and/or sensor data representation. The loss function(s) 316 may be used to measure loss (e.g., error) in the output trajectory data 410 and/or other output(s) 408 (e.g., as predicted by the machine learning model(s) 108) as compared to the ground truth data (e.g., the labels or annotations corresponding to the trajectory training data 418). For example, a binary cross entropy loss function, mean squared error (L2) loss function, L1 loss function, and/or other loss function type may be used as the loss function(s) 316. In some embodiments, two or more different loss functions may be used. For example, one or more loss functions may be used for each type of output 408 where there are two or more outputs 408, or two or more loss functions may be used for a single output type. Where two or more loss functions are used for a single output type (e.g., for the trajectory data 410), the loss functions may be weighted with respect to one another to generate a weighted loss function. Backward pass computations may be performed to recursively compute gradients of the loss function with respect to training parameters (e.g., weights, biases, etc.), as indicated by arrow 420. In some examples, weight and biases of the machine learning model(s) 108 may be used to compute these gradients. 
In some examples, as described herein, distance based weighted loss may be added to the loss function 316, where the loss function 316 may increasingly penalize loss at farther distances from the bottom of the image or other data representation (or from the vehicle), as described herein. A distance based weighted loss function penalizes more errors on the far distance, and therefore improves the accuracy of the predicted trajectory, or trajectory points, in the far distance.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li to include determining, by the decoder, a weighted loss corresponding to the sequence of AV waypoint predictions, the weighted loss for use in training weights and parameters of the trajectory generation model, as taught by Muller as disclosed above, in order to ensure optimal and accurate model training (Muller Paragraph 22 “Systems and methods are disclosed related to behavior-guided path planning in autonomous machine applications. More specifically, the present disclosure relates to path planning for autonomous vehicles using behavior observation as represented by sensor data generated by sensors of the autonomous vehicles.”).
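For illustration only, the distance based weighted loss described in the cited Muller passage can be sketched as follows. This is a minimal sketch under stated assumptions: the linear weighting `1 + alpha * distance`, the L1 error term, the normalization by total weight, and the name `distance_weighted_l1` are hypothetical choices and are not taken from Muller, which says only that the loss increasingly penalizes errors at farther distances.

```python
import math


def distance_weighted_l1(predicted, target, alpha=0.1):
    """Weighted mean L1 error over 2D trajectory points.

    Each point's weight grows linearly with the target point's distance
    from the origin, which stands in for the vehicle position (assumption).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("point lists must be non-empty and of equal length")
    weighted_sum = 0.0
    weight_total = 0.0
    for (px, py), (tx, ty) in zip(predicted, target):
        weight = 1.0 + alpha * math.hypot(tx, ty)
        weighted_sum += weight * (abs(px - tx) + abs(py - ty))
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical data: the same 1 m error placed at a near point and a far point.
target = [(0.0, 0.0), (10.0, 0.0)]
near_error = distance_weighted_l1([(1.0, 0.0), (10.0, 0.0)], target)
far_error = distance_weighted_l1([(0.0, 0.0), (11.0, 0.0)], target)
print(near_error, far_error)
```

Under these assumptions, an identical error costs more when it occurs at the far point, which matches the behavior the passage attributes to a distance based weighted loss.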
Pronovost in view of Li in view of Muller fail to explicitly disclose that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction; and selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Lee teaches that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction (See at least Lee Paragraph 10 “The ML system can provide the various types of data as input to one or more machine-learning models (hereinafter “ML models”). In some examples, the ML model(s) may be trained to output one or more unique candidate trajectories for the vehicle to follow. That is, the ML model(s) may be trained to output one or more learned trajectories that are unique in type and cause the vehicle to perform a certain type of action (e.g., lane change left, lane change right, increased assertiveness, increased passiveness, nominal trajectory, etc.)” | Paragraphs 14-15 “In some examples, the ML system may provide the various types of data as input to one or more ML models trained to output encoding(s) of the data. The ML system may encode the data and input such encoded data into one or more different ML models trained to output the learned trajectories … In some examples, the ML system may use mixture of experts ML technique to ensure that the ML model(s) are trained to output unique candidate trajectories. A mixture of experts technique includes using an ML model that includes multiple output heads (e.g., copies and/or versions of the model architecture) trained to output a diverse set of candidate trajectories. 
Specifically, the mixture of experts approach may result in different model heads being trained to output different types of trajectories that cause the vehicle to perform different actions” | Paragraph 64 “Additional examples of architectures include neural networks such as ResNet-50, ResNet-101, VGG, DenseNet, PointNet, Xception, ConvNeXt, and the like; visual transformer(s) (ViT(s)), such as a bidirectional encoder from image transformers (BEIT), visual bidirectional encoder from transformers (VisualBERT), image generative pre-trained transformer (Image GPT)”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller to include that the trajectory model is a generative pre-trained transformer (GPT)-based trajectory generation model, wherein the GPT-based generation model is generatively pre-trained and fine-tuned for vehicle trajectory prediction, as taught by Lee as disclosed above, in order to achieve accurate vehicle waypoint prediction (Lee Paragraph 10 “As discussed throughout this disclosure, the techniques described herein may improve vehicle safety and/or driving efficiency by determining improved driving trajectories through the environment by increasing the proportion of learned trajectories to heuristic-based trajectories in a tree structure, thereby reducing computations expenses when determining a control trajectory for the vehicle to follow”).
Pronovost in view of Li in view of Muller in view of Lee fail to explicitly disclose selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time.
Xie teaches selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time (See at least Xie Paragraph 32 “The planning stack 116 can determine how to maneuver or operate the AV 102 safely and efficiently in its environment. For example, the planning stack 116 can receive the location, speed, and direction of the AV 102, geospatial data, data regarding objects sharing the road with the AV 102 (e.g., pedestrians, bicycles, vehicles, ambulances, buses, cable cars, trains, traffic lights, lanes, road markings, etc.) or certain events occurring during a trip (e.g., emergency vehicle blaring a siren, intersections, occluded areas, street closures for construction or street repairs, double-parked cars, etc.), traffic rules and other safety standards or practices for the road, user input, and other relevant data for directing the AV 102 from one point to another. The planning stack 116 can determine multiple sets of one or more mechanical operations that the AV 102 can perform (e.g., go straight at a specified rate of acceleration, including maintaining the same speed or decelerating; turn on the left blinker, decelerate if the AV is above a threshold range for turning, and turn left; turn on the right blinker, accelerate if the AV is stopped or below the threshold range for turning, and turn right; decelerate until completely stopped and reverse; etc.), and select the best one to meet changing road conditions and events. If something unexpected happens, the planning stack 116 can select from multiple backup plans to carry out. For example, while preparing to change lanes to turn right at an intersection, another vehicle may aggressively cut into the destination lane, making the lane change unsafe. 
The planning stack 116 could have already determined an alternative plan for such an event, and upon its occurrence, help to direct the AV 102 to go around the block instead of blocking a current lane while waiting for an opening to change lanes.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller in view of Lee to include selecting, using a planning stack, the candidate sequence of AV waypoint predictions and using the selected candidate sequence of AV waypoint predictions, along with some or all of the real time sensor data, to self-navigate the AV through the dynamic environment without a driver in real-time, as taught by Xie as disclosed above, in order to ensure optimal vehicle control in dynamic environments (Xie Paragraph 22 “The present technology can involve the gathering and use of data available from various sources to improve quality and experience”).

With respect to claim 16, Pronovost in view of Li in view of Muller in view of Lee in view of Xie teach wherein the encoder transformer comprises an early fusion transformer (See at least Pronovost Paragraphs 62-63 and Paragraphs 70-71 “As shown in FIG. 3 , the environment may be represented by the vector representation 322 comprising vectors to represent objects and/or features of the environment including one or more of: an attribute (e.g., position, velocity, acceleration, yaw, etc.) of the object 326, history of the object 326 (e.g., location history, velocity history, etc.), an attribute of the autonomous vehicle 324 (e.g., velocity, position, etc.), and/or features of the environment (e.g., roadway boundary, roadway centerline, crosswalk permission, traffic light permission, and the like). For instance, the vector representation 322 can comprise vectors to represent features of the environment including roadway boundary vectors 328 and roadway centerline vectors 330, among others. In various examples, the computing device can implement the encoder 304 (or other machine learned model) to generate the vector representation 322 based at least in part on state data associated with the autonomous vehicle 324 and/or the object 326. The state data can include data describing an object (e.g., the pedestrian 108, the vehicle 110 in FIG. 1 ) and/or a vehicle (e.g., vehicle 102) in an environment, such as in example environment 100. The state data can include, in various examples, one or more of position data, orientation data, heading data, velocity data, speed data, acceleration data, yaw rate data, or turning rate data associated with the object and/or the vehicle. In some examples, vectors associated with an environment, a vehicle state, and/or an object state may be combined as the vector representation 322.”) 
and wherein the early fusion transformer is to fuse the map tokens and the agent tokens together to generate dynamic environment embeddings used to determine the sequence of AV waypoint predictions for the AV in real-time (See at least Pronovost Paragraph 53 “In some examples, the Transformer model 204 can sample tokens from the codebook 202 using a combination of the autoregressive technique for some tokens and another technique different from the autoregressive technique for some other tokens. For example, the Transformer model 204 can determine the token sequence 206 by determining two or more tokens in the token sequence 206 using an autoregressive algorithm, and determining another token in the token sequence 206 randomly or without consideration to a previously determined token. In various examples, the Transformer model 204 can be trained using training data to condition the Transformer model 204 to determine the token sequence 206 with consideration to historical object state data, scene data, environmental data, and so on. For instance, training the Transformer model based at least in part on a set of conditions, at least one condition of the set of conditions comprising a previous action, a previous position, or a previous acceleration of the object. In this way, the Transformer model 204 can output the token sequence 206 having tokens that represent potential interactions between an object and the vehicle. Additional detail of training the codebook 202 is discussed in FIG. 5 and elsewhere.”).
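For illustration only, one common reading of "early fusion" is that map tokens and agent tokens are concatenated into a single sequence before any attention is applied, so that every output embedding can draw on both token types. The sketch below assumes that reading; it is not taken from Pronovost or from the application. The identity query/key/value projections, the single attention head, and the names `self_attention` and `early_fusion` are simplifying, hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def early_fusion(map_tokens, agent_tokens):
    """Concatenate both token types first, then attend over the joint sequence."""
    return self_attention(map_tokens + agent_tokens)


# Hypothetical 2-dimensional tokens: two map tokens and one agent token.
map_tokens = [[1.0, 0.0], [0.0, 1.0]]
agent_tokens = [[1.0, 1.0]]
embeddings = early_fusion(map_tokens, agent_tokens)
print(embeddings)
```

Under these assumptions, the embedding produced for the agent token is a weighted mix of both map tokens and the agent token itself, which is the sense in which the two inputs are fused early.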

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Pronovost (US 20240101157 A1) (“Pronovost”) in view of Li (US 20240124016 A1) (“Li”) in view of Muller (US 20190384303 A1) (“Muller”) in view of Lee (US 20240400095 A1) (“Lee”) in view of Xie (US 20210199442 A1) (“Xie”) further in view of Ayvaci (US 20210295555 A1) (“Ayvaci”).

With respect to claim 4, Pronovost in view of Li in view of Muller in view of Lee in view of Xie fail to explicitly disclose that the weighted loss comprises a weighted Huber loss.
Ayvaci teaches that the weighted loss comprises a weighted Huber loss (See at least Ayvaci Paragraph 97 “The system can compare the object interaction predictions to the labels in the training examples. The system can calculate a loss which can measure the difference between the object interaction prediction and the labels in the training examples. The loss can include: (1) object detection loss, e.g., a weighted Huber loss”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller in view of Lee in view of Xie to include that the weighted loss comprises a weighted Huber loss, as taught by Ayvaci as disclosed above, in order to ensure accurate waypoint prediction (Ayvaci Paragraph 16 “In other words, the system can make accurate object interaction predictions from input sensor data without using human-programmed logic about how the input sensor data should be combined or analyzed”).
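For illustration only, a weighted Huber loss can be sketched as follows. The Huber function itself is standard (quadratic for small residuals, linear for large ones). The per-element weights, the normalization by total weight, and the names `huber` and `weighted_huber_loss` are assumptions made for this sketch; neither Ayvaci nor the claim specifies how the weights are chosen.

```python
def huber(residual, delta=1.0):
    """Standard Huber function: quadratic within delta, linear beyond it."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    return delta * (magnitude - 0.5 * delta)


def weighted_huber_loss(predicted, target, weights, delta=1.0):
    """Weighted mean of per-element Huber losses (weighting scheme is assumed)."""
    if not (len(predicted) == len(target) == len(weights)) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        w * huber(p - t, delta) for p, t, w in zip(predicted, target, weights)
    )
    return weighted / total_weight


# Hypothetical residuals of 0.5 (quadratic region) and 3.0 (linear region).
uniform = weighted_huber_loss([0.5, 3.0], [0.0, 0.0], [1.0, 1.0])
emphasized = weighted_huber_loss([0.5, 3.0], [0.0, 0.0], [1.0, 3.0])
print(uniform, emphasized)
```

Raising the weight on the element with the larger residual raises the overall loss, while the linear tail keeps a single outlier from dominating as strongly as it would under a squared error.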

Claims 7-8, 13-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Pronovost (US 20240101157 A1) (“Pronovost”) in view of Li (US 20240124016 A1) (“Li”) in view of Muller (US 20190384303 A1) (“Muller”) in view of Lee (US 20240400095 A1) (“Lee”) in view of Xie (US 20210199442 A1) (“Xie”) further in view of Clement (US 20210357762 A1) (“Clement”).

With respect to claim 7, and similarly claims 13 and 19, Pronovost in view of Li in view of Muller in view of Lee in view of Xie fail to explicitly disclose that the encoder-decoder transformer comprises the encoder transformer that encodes the map tokens through self-attention and a decoder transformer that runs masked self-attention over the agent tokens over time and provides cross-attention between encoded agent states and encoded map states.
Clement teaches that the encoder-decoder transformer comprises the encoder transformer that encodes the tokens through self-attention and a decoder transformer that runs masked self-attention over the tokens over time and provides cross-attention between encoded states (See at least Clement FIG. 3 and Paragraph 23 “An encoder uses bi-directional attention which enables the encoder to learn the relationships of the tokens/subtokens in an input sequence both before and after their occurrence. Classifiers are trained to interpret a model's internal representation into a class label. Since bi-directional attention allows the model's internal representation to depend on all other tokens, and not just the previous tokens, bi-directional attention leads to superior classification performance.” | Paragraph 42 “Each token/subtoken flows through all the decoder blocks along its own path. The masked self-attention component 242 allows the neural network 246 to focus on certain features or inputs. The inputs to the decoder block 234 are added with the positional embeddings 236 forming context tensor 238. The decoder block 240 predicts each token/subtoken ti in the target language one-by-one at each time step conditioned on all previously-generated target tokens/subtokens t1, . . . ti−1.” | Paragraphs 46-48 “FIG. 3 illustrates an exemplary configuration of an encoder-decoder neural transformer with attention. The model 300 incorporates one or more encoder blocks 212 as described above and one or more decoder blocks. In this particular transformer configuration, the encoder block 212 does not have an output layer. The output of the top encoder block is a set of attention vectors K and V 217 which is used by the encoder-decoder multi-head attention layer 302 of the decoder block 306. The input layer 204 of the encoder block 212 operates as described above. 
The decoder block 306 contains a masked multi-head attention component 242, an encoder-decoder multi-head self-attention component 302, and feed forward neural network 246. The output of multi-head attention component 242 is input into layer normalization 244, the output of the encoder-decoder multi-head self-attention component 302 is input into layer normalization 304, and the output of feed forward neural network 246 is input into layer normalization 248. The output of layer normalization 244 has a residual connection to layer normalization 304, the output of layer normalization 304 has a residual connection to layer normalization 248, and the input to the masked multi-head attention 242 has a residual connection to layer normalization 244. The masked multi-head attention component 242 receives the output embeddings of the previous timestep 234. The masked multi-head attention component 242 masks the output embeddings from future time steps. The encoder-decoder multi-head attention layer 302 receives queries from the previous decoder layer 242 and the memory keys and values 217 from the output of the encoder block 212. In this manner, the decoder block 306 can attend to every position of the input sequence”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Pronovost in view of Li in view of Muller in view of Lee in view of Xie to include that the encoder-decoder transformer comprises the encoder transformer that encodes the tokens through self-attention and a decoder transformer that runs masked self-attention over the tokens over time and provides cross-attention between encoded states, as taught by Clement as disclosed above, such that the encoder transformer encodes the map tokens and the decoder transformer runs masked self-attention over the agent tokens to provide cross-attention between encoded agent states and encoded map states, in order to ensure an accurate and efficient vehicle scene output (Clement Paragraph 16 “The reuse of the pre-trained neural transformer model with developed weights and biases for source code is a good starting point to develop different models for various software engineering tasks faster and with less computational cost and resources.”).
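The attention arrangement at issue in claim 7 (encoder self-attention over map tokens, masked decoder self-attention over agent tokens, and cross-attention from agent states to map states) can be illustrated with a minimal NumPy sketch. It is not taken from Clement, Pronovost, or the application. The function names (`attention`, `decoder_block`), the single-head formulation, the token shapes, and the omission of learned projections, positional embeddings, and the feed-forward layer are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask[i, j] == True means query i may attend to key j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_block(agent_tokens, map_memory):
    # Masked self-attention: each time step sees only itself and earlier steps.
    steps = agent_tokens.shape[0]
    causal = np.tril(np.ones((steps, steps), dtype=bool))
    self_out, self_w = attention(agent_tokens, agent_tokens, agent_tokens, causal)
    x = layer_norm(agent_tokens + self_out)  # residual connection, then normalization
    # Cross-attention: queries from agent states, keys and values from encoded map states.
    cross_out, cross_w = attention(x, map_memory, map_memory)
    x = layer_norm(x + cross_out)
    return x, self_w, cross_w

# Hypothetical toy inputs: 4 agent time steps and 6 map tokens, each of width 8.
rng = np.random.default_rng(0)
agent_tokens = rng.normal(size=(4, 8))
map_tokens = rng.normal(size=(6, 8))

# Encoder side: unmasked (bi-directional) self-attention over the map tokens.
map_memory, _ = attention(map_tokens, map_tokens, map_tokens)
out, self_w, cross_w = decoder_block(agent_tokens, map_memory)
```

In this sketch the upper triangle of `self_w` is zero, which is the effect of the causal mask, while every row of `cross_w` spreads weight over all six map positions.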

With respect to claim 8, and similarly claims 14 and 20, Pronovost in view of Li in view of Muller in view of Lee in view of Xie in view of Clement teaches that the encoder-decoder transformer outputs the candidate sequence of AV waypoint predictions in an autoregressive model (See at least Pronovost Paragraph 52 “In various examples, the codebook 202 can exchange data with a Transformer model 204 that is configured to output a token sequence 206. For example, the Transformer model 204 can sample tokens from the codebook 202 using an autoregressive technique and arrange the tokens in a sequence or set that represents potential interactions between objects and the vehicle”).
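For readers unfamiliar with the term, autoregressive generation means that each output token is drawn from a distribution conditioned on the tokens already produced. The following is a minimal sketch under stated assumptions, not the method of Pronovost or of the application: `next_token_probs` is a hypothetical stand-in for a trained transformer, and `VOCAB_SIZE` is a hypothetical codebook size.

```python
import numpy as np

VOCAB_SIZE = 6  # hypothetical codebook size, chosen only for illustration

def next_token_probs(prefix):
    # Hypothetical stand-in for a trained model: favors the token that follows
    # the most recent one, so the output depends on what was generated so far.
    logits = np.zeros(VOCAB_SIZE)
    if prefix:
        logits[(prefix[-1] + 1) % VOCAB_SIZE] = 2.0
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sample_sequence(length, rng):
    seq = []
    for _ in range(length):
        probs = next_token_probs(seq)  # condition on all previously generated tokens
        seq.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return seq

# Repeated sampling yields several candidate sequences from the same model.
rng = np.random.default_rng(0)
candidates = [sample_sequence(5, rng) for _ in range(3)]
```

Because each step samples rather than taking the single most likely token, repeated calls can return different candidate sequences.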

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
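The reply periods described above can be worked through as date arithmetic. The sketch below assumes the Dec 31, 2025 mailing date shown in the Prosecution Timeline on this page. The helper names (`add_months`, `effective_deadline`) and the advisory action date in the usage example are hypothetical, and the sketch does not account for periods ending on a weekend or federal holiday, so any computed date should be verified against the rules before it is relied on.

```python
import calendar
from datetime import date

def add_months(d, months):
    # Add calendar months, clamping to the last day of a shorter month.
    idx = d.month - 1 + months
    year = d.year + idx // 12
    month = idx % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def effective_deadline(mailing, first_reply, advisory_mailed):
    # Three-month shortened period, moved to the advisory action mailing date when a
    # first reply was filed within two months and the advisory action was mailed
    # after the three-month date; never later than six months from mailing.
    three_month = add_months(mailing, 3)
    six_month = add_months(mailing, 6)
    if first_reply <= add_months(mailing, 2) and advisory_mailed > three_month:
        return min(advisory_mailed, six_month)
    return three_month

mailing_date = date(2025, 12, 31)                 # from the Prosecution Timeline
two_month_mark = add_months(mailing_date, 2)      # 2026-02-28
shortened_deadline = add_months(mailing_date, 3)  # 2026-03-31
statutory_limit = add_months(mailing_date, 6)     # 2026-06-30

# Hypothetical scenario: reply filed Feb 25, 2026, advisory action mailed Apr 10, 2026.
example = effective_deadline(mailing_date, date(2026, 2, 25), date(2026, 4, 10))
```

In the hypothetical scenario, `example` is April 10, 2026, because the reply came within two months and the advisory action was mailed after the three-month date.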
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IBRAHIM ABDOALATIF ALSOMAIRY whose telephone number is (571)272-5653. The examiner can normally be reached M-F 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Faris Almatrahi, can be reached at 313-446-4821. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/IBRAHIM ABDOALATIF ALSOMAIRY/
Examiner, Art Unit 3667

/KENNETH J MALKOWSKI/
Primary Examiner, Art Unit 3667

        1 There is no limiting definition as to what constitutes an “autonomous vehicle (AV) history of an AV”
        2 There is no limiting definition as to what constitutes an “autonomous vehicle (AV) history of an AV”
        3 There is no limiting definition as to what constitutes an “autonomous vehicle (AV) history of an AV”
Prosecution Timeline

Jan 03, 2024: Application Filed
Jun 18, 2025: Non-Final Rejection mailed (§103)
Sep 16, 2025: Response Filed
Dec 31, 2025: Final Rejection mailed (§103)
Feb 13, 2026: Interview Requested
Feb 24, 2026: Applicant Interview (Telephonic)
Feb 25, 2026: Response after Non-Final Action
Feb 26, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Application No. | Patent No. | Title | Time to Grant | Grant Date
17/863,992 | 12602044 | VEHICLE CONTROL SYSTEM, VEHICLE CONTROL METHOD, AND VEHICLE CONTROL PROGRAM | 3y 9m | Apr 14, 2026
18/462,919 | 12578728 | AUTONOMOUS SNOW REMOVING MACHINE | 2y 6m | Mar 17, 2026
18/005,290 | 12426758 | METHOD AND APPARATUS FOR CONTROLLING ROBOT, ELECTRONIC DEVICE, AND COMPUTER-READABLE STORAGE MEDIUM | 2y 8m | Sep 30, 2025
17/437,626 | 12313379 | SYSTEM FOR NEUTRALISING A TARGET USING A DRONE AND A MISSILE | 3y 8m | May 27, 2025
17/309,127 | 12265385 | SYSTEMS, DEVICES, AND METHODS FOR MILLIMETER WAVE COMMUNICATION FOR UNMANNED AERIAL VEHICLES | 3y 11m | Apr 01, 2025

Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 42%
With Interview (+10.5%): 52%
Median Time to Grant: 3y 2m (~9m remaining)
PTA Risk: Moderate
Based on 86 resolved cases by this examiner. Grant probability derived from career allowance rate.
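
The projected percentages can be reproduced from the examiner figures reported on this page (36 granted out of 86 resolved cases, +10.5% interview lift). The sketch below assumes the lift is added as percentage points to the career allowance rate and that the result is rounded to a whole percent. That is an inference from the displayed numbers, not a documented formula.

```python
granted, resolved = 36, 86   # examiner career figures reported on this page
interview_lift_pts = 10.5    # reported interview lift, in percentage points

base_rate = 100 * granted / resolved             # career allowance rate, in percent
with_interview = base_rate + interview_lift_pts  # assumed additive lift

print(round(base_rate), round(with_interview))   # prints: 42 52
```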