Prosecution Insights
Last updated: April 19, 2026
Application No. 18/936,286

METHOD AND APPARATUS WITH VEHICLE DRIVING CONTROL

Non-Final OA: §103, §112
Filed: Nov 04, 2024
Examiner: MATTA, ALEXANDER GEORGE
Art Unit: 3668
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Samsung Electronics Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 72% (Favorable)
OA Rounds: 1-2
To Grant: 3y 0m
With Interview: 94%

Examiner Intelligence

Career Allow Rate: 72%, above average (98 granted / 137 resolved; +19.5% vs TC avg)
Interview Lift: +22.6% across resolved cases with interview (a strong lift)
Typical Timeline: 3y 0m average prosecution; 42 applications currently pending
Career History: 179 total applications across all art units

Statute-Specific Performance

§101: 8.5% (-31.5% vs TC avg)
§103: 54.2% (+14.2% vs TC avg)
§102: 13.0% (-27.0% vs TC avg)
§112: 21.7% (-18.3% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 137 resolved cases
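If the signed figure for each statute is read as the examiner's rate minus the Tech Center average estimate (the black line), which is our assumption about how the chart is constructed, the implied position of that line can be recovered from any row. A minimal sketch of that arithmetic, using only the figures shown above:

# Hypothetical reading of the statute chart: assume each signed figure equals
# (examiner rate) - (Tech Center average estimate, i.e. the black line).
rates = {"101": 8.5, "103": 54.2, "102": 13.0, "112": 21.7}        # examiner rates (%), from the page
deltas = {"101": -31.5, "103": 14.2, "102": -27.0, "112": -18.3}   # "vs TC avg" figures (%), from the page
for statute, rate in rates.items():
    implied_tc_avg = rate - deltas[statute]
    print(f"§{statute}: implied TC average {implied_tc_avg:.1f}%")  # each row implies roughly 40%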

Office Action

§103 §112
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Claim(s) 1 - 20 are pending for examination. This Action is made NON-FINAL. Specification The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed. The following title is suggested METHOD AND APPARATUS WITH VEHICLE DRIVING CONTROL USING A VECTORIZED MAP GENERATED FROM EXTRACTED BIRD’S-EYE VIEW FEATURES. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 
112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is “a communication interface configured to….” as recited in claim 19. Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. Regarding the communication interface, the specification states “FIG. 14 illustrates an example of an apparatus for controlling driving of a vehicle, according to one or more embodiments. Referring to FIG. 14, a control apparatus 1400 may include a communication interface 1410, a memory 1430 including a first neural network 1433 and a second neural network 1436, and a processor 1450. The communication interface 1410, the memory 1430, and the processor 1450 may be connected to one another via a communication bus 1405. The communication interface 1410 may receive multi-view images including image frames at consecutive time points corresponding to a driving environment of a vehicle. ” but fails to describe an structure for the communication interface. If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recites sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of the first paragraph of 35 U.S.C. 112(a): (a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention. The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 
112: The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention. Claims 19 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. In claim(s) 19 the “communication interface configured to….” is recited. The specification fails to describe the structure of the “communication interface” which invokes 112f and thus the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claim(s) 19 rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim limitation “communication interface configured to….” invokes 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The specification is devoid of any description of the structure of the “communication interface”. Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph. Applicant may: (a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph; (b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)). 
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: (a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (VectorMapNet: End-to-end Vectorized HD Map Learning, 2023, hereinafter known as Liu) in view of Afshar et al. (US 20240132112 A1, hereinafter known as Afshar). 
Regarding Claim 1, Liu teaches A method of controlling driving of a vehicle, the method comprising: receiving multi-view images comprising image frames of a driving environment of the vehicle {Section 3 “Similar to HDMapNet (Li et al., 2021), our task is to vectorize map elements using data from onboard sensors of autonomous vehicle, such as RGB cameras and/or LiDARs.” Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” } extracting bird's-eye view (BEV) features respectively corresponding to the consecutive time points for each of the image frames, and extracting map queries respectively corresponding to the consecutive time points for each of the image frames; { Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” Section 3.3 “After extracting the birds-eye view (BEV) features, VectorMapNet have to identify and abstractly represent map elements using these features. We employ a hierarchical representation for this purpose, specifically through element queries and keypoint queries, enabling us to model the nonlocal shape of map elements effectively. We leverage a variant of transformer set prediction detector (Carion et al., 2020) to achieve this goal, as it is a robust detector that eliminates the need for extra post-processing. Specifically, the detector represents map elements locations and categories by predicting their element keypoints A and class labels L from the BEV features FBEV. The detector uses learnable element … as its inputs, where d represents the hidden embedding size and Nmax is a preset constant, which is much greater than the number of map elements N in the scene. The i-th element query is composed of k element keypoint embeddings Element queries are similar to object queries used in Detection Transformer (DETR) (Carion et al., 2020), where a query represents an object. In our case, an element query represents a map element.” } generating a vectorized map by predicting and vectorizing map elements represented in the image frames, the generating based on first memory tokens stored in a memory corresponding to queries of previously-processed image frames, the BEV features, and the map queries; and {See all of section 3.4 } Liu does not teach, image frames of a driving environment of the vehicle at consecutive time points and controlling the driving of the vehicle based on the vectorized map. 
However, Afshar teaches image frames of a driving environment of the vehicle at consecutive time points {Para [0069] “In some embodiments, perception system 402 receives data associated with at least one physical object (e.g., data that is used by perception system 402 to detect the at least one physical object) in an environment and classifies the at least one physical object. In some examples, perception system 402 receives image data captured by at least one camera (e.g., cameras 202a), the image associated with (e.g., representing) one or more physical objects within a field of view of the at least one camera. In such an example, perception system 402 classifies at least one physical object based on one or more groupings of physical objects (e.g., bicycles, vehicles, traffic signs, pedestrians, and/or the like). In some embodiments, perception system 402 transmits data associated with the classification of the physical objects to planning system 404 based on perception system 402 classifying the physical objects.” Para [0158] “In some embodiments, the process 900 includes: periodically predicting a future trajectory of an agent in a current environment of the vehicle based on at least one reference path determined for the agent. For example, the perception system can perform the process 900 in a period, e.g., every 10 seconds, 20 seconds, 30 seconds, or 1 minute. In some embodiments, the perception system performs the process 900 continuously. For example, once a round of the process 900 ends, the process 900 restarts or reiterates. In some embodiments, the perception system performs the process 900 in response to a triggering event, e.g., an input from a driver.” } controlling the driving of the vehicle based on the vectorized map. {Para [0030] “For each agent, the path-based trajectory prediction can include multiple operations: 1) vectorizing map into connected lane segments; 2) sampling the vectorized map for candidate reference paths (e.g., in 8 seconds) with reachable lane segments or reachable targets (e.g., end points) of the candidate reference paths; 3) classifying a set of candidate reference paths (e.g., by predicting a discrete probability distribution over the candidate reference paths) based on defined feature vectors, including scene feature vector (e.g., agent behavior) and path feature vector (e.g., first point, middle point, last point, direction, and length of each candidate reference path); 4) making trajectory prediction with respect to one or more selected reference paths in the Frenet frame using agents feature map augmented with path information; and 5) transforming the predicted trajectories back to Cartesian co-ordinates relative to the agent to obtain multimodal predictions.” Para [0030] “Some of the advantages of these techniques are as follows. For example, the techniques predict trajectories conditioned on feature descriptors of a complete reference path from the agent's current location to the agent's goal instead of just its goal locations. This is a much more informative feature descriptor and leads to more map compliant trajectories over longer prediction horizons compared to goal based prediction. Also, the techniques use reference paths, which allow to predict trajectories in the path relative Frenet frame relative to each sampled path. Compared to the Cartesian frame with varying lane locations and curvatures, predictions in the Frenet frame can have much lower variance. 
This again leads to more map compliant trajectories that better generalize to novel scene layouts. Moreover, compared to using a rasterized HD map for its scene and reference path encoders, the techniques directly encode the scene and reference paths using polylines, making the encoders more efficient. The techniques can sample and classify variable length reference paths along each lane centerline, which provides trajectory prediction with more flexibility to predict different motion profiles along lanes. The techniques can improve path prediction and path compliance, e.g., using agent past trajectory history in the prediction. The techniques can enhance performance of prediction in multi-lane turns with better path classifier and scene upsampling. In addition to standard metrics for multimodal prediction, the techniques can enhance two map compliance metrics of the predicted trajectories (e.g., commonly used drivable area compliance metric and a new lane deviation metric), for example, by utilizing map prior knowledge (e.g., high likelihood drivable areas). Further, the techniques can improve interaction reasoning in path encoder and improve the map and agents interaction graph. The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” } It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Afshar to use the vectorized map generated from images for autonomous driving because Para [0030] “. The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” Regarding Claim 2, Lui in view of Afshar teaches The method of claim 1. Liu further teaches wherein the extracting of the BEV features and the map queries comprises: extracting image features of a perspective view (PV) corresponding to the image frames using a backbone network; transforming the image features of the PV into the BEV features; extracting the map queries at a frame level used to construct the vectorized map based on the BEV features and a query corresponding to the image frames; and outputting the BEV features and the map queries. {See section 3.2 and section 3.3 Also see Section 3 “The inputs and outputs of the mapping problem are not perfectly aligned. They exist in different view spaces (e.g. camera data is in perspective view and map elements are in BEV)” } Regarding Claim 3, Lui in view of Afshar teaches The method of claim 1. Liu further teaches wherein the generating of the vectorized map comprises: reading the first memory tokens; based on the map queries, the BEV features, and the first memory tokens, generating map tokens comprising the map elements comprised in the vectorized map and/or clip tokens comprising vectorized features corresponding to the image frames; and generating the vectorized map based on the map tokens. {See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 4, Lui in view of Afshar teaches The method of claim 3. 
Liu further teaches wherein the generating of the map tokens and/or the clip tokens comprises: generating, from the map queries and the first memory tokens, the clip tokens comprising cues for the map elements in a feature space corresponding to the image frames; updating the BEV features using the clip tokens such that the BEV features comprise hidden map elements; and generating the map tokens using the updated BEV features and the map queries. {See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 5, Lui in view of Afshar teaches The method of claim 3. Liu further teaches wherein sizes of the map queries are determined based on sizes of the clip tokens, a number of the map elements, or a number of points for each of the map elements. { See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 6, Lui in view of Afshar teaches The method of claim 4. Liu further teaches wherein the updating of the BEV features comprises: extracting a query from the BEV features; extracting a key and a value from the clip tokens; and updating the BEV features via a cross-attention network and a feed-forward network using the query, the key, and the value. { See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 7, Lui in view of Afshar teaches The method of claim 4. Liu further teaches wherein the generating of the map tokens comprises generating the map tokens from the map queries and the updated BEV features using a deformable attention network, a decoupled self-attention network, and a feed-forward network. { See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 8, Lui in view of Afshar teaches The method of claim 7. Liu further teaches wherein the generating of the map tokens comprises generating the map tokens by extracting the queries from the map queries using the deformable attention network and obtaining a value from the updated BEV features. { See figure 2 and Section 3.3 and Section 3.4 } Regarding Claim 9, Lui in view of Afshar teaches The method of claim 1. Liu further teaches wherein the generating of the vectorized map comprises generating the vectorized map by predicting the map elements represented in the image frames by a pre-trained neural network and vectorizing the map elements for each instance, and the pre-trained neural network comprises at least one of: a (2-1)-th neural network configured to read the first memory tokens from the memory or write second memory tokens to the memory; and a (2-2)-th neural network configured to generate the vectorized map corresponding to a current frame among the image frames based on the map queries, the BEV features, and the first memory tokens. {See figure 2 and Section 3.3 and Section 3.4 and Section 3.5} Regarding Claim 10, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the generating of the vectorized map comprises: writing the map tokens to the memory by the (2-1)-th neural network; and generating the vectorized map as a map token corresponding to the current frame among the map tokens passes through a prediction head. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 11, Lui in view of Afshar teaches The method of claim 9. Liu further teaches further comprising: generating the second memory tokens by writing the map tokens and the clip tokens to the memory using the (2-1)-th neural network; and outputting the second memory tokens. 
{ See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 12, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the (2-1)-th neural network is configured to preserve time information corresponding to the previous image frames by reading first memory tokens corresponding to the previous image frames to propagate the first memory tokens as an input for the (2-2)-th neural network. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 13, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the (2-1)-th neural network is configured to set intra-clip associations between the map elements by associating inter-clip information through propagation of clip tokens generated in the (2-2)-th neural network. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 14, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the (2-1)-th neural network is configured to generate the second memory tokens comprising global map information through embedding of a learnable frame and store the second memory tokens in the memory, based on the map tokens and the clip tokens generated in the (2-2)-th neural network. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 15, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the (2-1)-th neural network is configured to generate the second memory tokens by combining clip tokens, the map tokens, and the first memory tokens together. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 16, Lui in view of Afshar teaches The method of claim 9. Liu further teaches wherein the (2-2)-th neural network is configured to generate the vectorized map by outputting a map token corresponding to a current frame having a predetermined time window corresponding to lengths of the image frames, based on the first memory tokens, the BEV feature, and the map queries. { See figure 2 and Section 3.3 and Section 3.4 and Section 3.5 } Regarding Claim 17, Lui in view of Afshar teaches The method of claim 1. Liu further teaches wherein the map elements comprise a crosswalk, a road, a lane, a lane boundary, a building, a curbstone, or traffic lights comprised in the driving environment. {Section 3 “Similar to HDMapNet (Li et al., 2021), our task is to vectorize map elements using data from onboard sensors of autonomous vehicle, such as RGB cameras and/or LiDARs. These map elements include but are not limited to: Road boundaries (boundaries of roads separating roads and sidewalks, typically irregularly-shaped curves of arbitrary lengths), Lane dividers (boundaries dividing lanes on the road, usually straight lines), and Pedestrian crossings (regions with white markings indicating legal pedestrian crossing points, typically represented as polygons).” } Regarding Claim 18, Lui in view of Afshar teaches The method of claim 1. Afshar further teaches A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method {Para [0063] “In some embodiments, device 300 performs one or more processes described herein. Device 300 performs these processes based on processor 304 executing software instructions stored by a computer-readable medium, such as memory 305 and/or storage component 308. 
A computer-readable medium (e.g., a non-transitory computer readable medium) is defined herein as a non-transitory memory device. A non-transitory memory device includes memory space located inside a single physical storage device or memory space spread across multiple physical storage devices.” } Regarding Claim 19, Liu teaches An apparatus for controlling driving of a vehicle, the apparatus comprising: a communication interface configured to receive multi-view images of a driving environment of the vehicle {Section 3 “Similar to HDMapNet (Li et al., 2021), our task is to vectorize map elements using data from onboard sensors of autonomous vehicle, such as RGB cameras and/or LiDARs.” Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” } a first neural network configured to extract bird's-eye view (BEV) features respectively corresponding to the consecutive time points for each of the image frames, and configured to extract map queries respectively corresponding to the consecutive time points for each of the image frames; { Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” It is also discussed in Section 3.2 that a convolutional neural network is used Section 3.3 “After extracting the birds-eye view (BEV) features, VectorMapNet have to identify and abstractly represent map elements using these features. We employ a hierarchical representation for this purpose, specifically through element queries and keypoint queries, enabling us to model the nonlocal shape of map elements effectively. We leverage a variant of transformer set prediction detector (Carion et al., 2020) to achieve this goal, as it is a robust detector that eliminates the need for extra post-processing. Specifically, the detector represents map elements locations and categories by predicting their element keypoints A and class labels L from the BEV features FBEV. The detector uses learnable element … as its inputs, where d represents the hidden embedding size and Nmax is a preset constant, which is much greater than the number of map elements N in the scene. The i-th element query is composed of k element keypoint embeddings Element queries are similar to object queries used in Detection Transformer (DETR) (Carion et al., 2020), where a query represents an object. 
In our case, an element query represents a map element.” } a second neural network configured to generate a vectorized map by predicting and vectorizing map elements represented in the image frames, the generating based on first memory tokens stored in a memory corresponding to queries of previously-processed image, the BEV features, and the map queries; {See all of section 3.4 } Liu does not teach, receive multi-view images of a driving environment of the vehicle at consecutive time points and a processor configured to generate a control signal for driving the vehicle based on the vectorized map. However, Afshar teaches receive multi-view images of a driving environment of the vehicle at consecutive time points {Para [0069] “In some embodiments, perception system 402 receives data associated with at least one physical object (e.g., data that is used by perception system 402 to detect the at least one physical object) in an environment and classifies the at least one physical object. In some examples, perception system 402 receives image data captured by at least one camera (e.g., cameras 202a), the image associated with (e.g., representing) one or more physical objects within a field of view of the at least one camera. In such an example, perception system 402 classifies at least one physical object based on one or more groupings of physical objects (e.g., bicycles, vehicles, traffic signs, pedestrians, and/or the like). In some embodiments, perception system 402 transmits data associated with the classification of the physical objects to planning system 404 based on perception system 402 classifying the physical objects.” Para [0158] “In some embodiments, the process 900 includes: periodically predicting a future trajectory of an agent in a current environment of the vehicle based on at least one reference path determined for the agent. For example, the perception system can perform the process 900 in a period, e.g., every 10 seconds, 20 seconds, 30 seconds, or 1 minute. In some embodiments, the perception system performs the process 900 continuously. For example, once a round of the process 900 ends, the process 900 restarts or reiterates. In some embodiments, the perception system performs the process 900 in response to a triggering event, e.g., an input from a driver.” } and a processor configured to control driving of the vehicle based on the vectorized map. {Para [0030] “For each agent, the path-based trajectory prediction can include multiple operations: 1) vectorizing map into connected lane segments; 2) sampling the vectorized map for candidate reference paths (e.g., in 8 seconds) with reachable lane segments or reachable targets (e.g., end points) of the candidate reference paths; 3) classifying a set of candidate reference paths (e.g., by predicting a discrete probability distribution over the candidate reference paths) based on defined feature vectors, including scene feature vector (e.g., agent behavior) and path feature vector (e.g., first point, middle point, last point, direction, and length of each candidate reference path); 4) making trajectory prediction with respect to one or more selected reference paths in the Frenet frame using agents feature map augmented with path information; and 5) transforming the predicted trajectories back to Cartesian co-ordinates relative to the agent to obtain multimodal predictions.” Para [0030] “Some of the advantages of these techniques are as follows. 
For example, the techniques predict trajectories conditioned on feature descriptors of a complete reference path from the agent's current location to the agent's goal instead of just its goal locations. This is a much more informative feature descriptor and leads to more map compliant trajectories over longer prediction horizons compared to goal based prediction. Also, the techniques use reference paths, which allow to predict trajectories in the path relative Frenet frame relative to each sampled path. Compared to the Cartesian frame with varying lane locations and curvatures, predictions in the Frenet frame can have much lower variance. This again leads to more map compliant trajectories that better generalize to novel scene layouts. Moreover, compared to using a rasterized HD map for its scene and reference path encoders, the techniques directly encode the scene and reference paths using polylines, making the encoders more efficient. The techniques can sample and classify variable length reference paths along each lane centerline, which provides trajectory prediction with more flexibility to predict different motion profiles along lanes. The techniques can improve path prediction and path compliance, e.g., using agent past trajectory history in the prediction. The techniques can enhance performance of prediction in multi-lane turns with better path classifier and scene upsampling. In addition to standard metrics for multimodal prediction, the techniques can enhance two map compliance metrics of the predicted trajectories (e.g., commonly used drivable area compliance metric and a new lane deviation metric), for example, by utilizing map prior knowledge (e.g., high likelihood drivable areas). Further, the techniques can improve interaction reasoning in path encoder and improve the map and agents interaction graph. The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” Para [0063] “In some embodiments, device 300 performs one or more processes described herein. Device 300 performs these processes based on processor 304 executing software instructions stored by a computer-readable medium, such as memory 305 and/or storage component 308. A computer-readable medium (e.g., a non-transitory computer readable medium) is defined herein as a non-transitory memory device. A non-transitory memory device includes memory space located inside a single physical storage device or memory space spread across multiple physical storage devices.” } It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Afshar to use the vectorized map generated from images for autonomous driving because Para [0030] “. 
The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” Regarding Claim 20, Liu teaches A vehicle comprising: sensors configured to capture multi-view images comprising image frames {Section 3 “Similar to HDMapNet (Li et al., 2021), our task is to vectorize map elements using data from onboard sensors of autonomous vehicle, such as RGB cameras and/or LiDARs.” Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” } a neural network configured to extract bird's-eye view (BEV) features and map queries respectively corresponding to the consecutive time points for each of the image frames and generate a vectorized map by predicting and vectorizing map elements represented in the image frames based on first memory tokens stored in a memory corresponding to queries of previous image frames of the image frames, the BEV features, and the map queries; { Section 3.2 “The objective of BEV feature extractor is to lift various modality inputs into a canonical feature space and aggregates and align features these features into a canonical representation termed BEV features FBEV ∈ RW× H× (C1+C2) based on their coordinates, where W and H represent the width and height of the BEV feature, respectively; C1 and C2 represent the output channels of the BEV feature extracted from the two common modalities: surrounding camera images I and LiDAR points P” It is also discussed in Section 3.2 that a convolutional neural network is used Section 3.3 “After extracting the birds-eye view (BEV) features, VectorMapNet have to identify and abstractly represent map elements using these features. We employ a hierarchical representation for this purpose, specifically through element queries and keypoint queries, enabling us to model the nonlocal shape of map elements effectively. We leverage a variant of transformer set prediction detector (Carion et al., 2020) to achieve this goal, as it is a robust detector that eliminates the need for extra post-processing. Specifically, the detector represents map elements locations and categories by predicting their element keypoints A and class labels L from the BEV features FBEV. The detector uses learnable element … as its inputs, where d represents the hidden embedding size and Nmax is a preset constant, which is much greater than the number of map elements N in the scene. The i-th element query is composed of k element keypoint embeddings Element queries are similar to object queries used in Detection Transformer (DETR) (Carion et al., 2020), where a query represents an object. In our case, an element query represents a map element.” Also see all of section 3.4 } Liu does not teach, image frames at consecutive time points and a processor configured to generate a control signal for driving the vehicle based on the vectorized map. 
However, Afshar teaches image frames at consecutive time points {Para [0069] “In some embodiments, perception system 402 receives data associated with at least one physical object (e.g., data that is used by perception system 402 to detect the at least one physical object) in an environment and classifies the at least one physical object. In some examples, perception system 402 receives image data captured by at least one camera (e.g., cameras 202a), the image associated with (e.g., representing) one or more physical objects within a field of view of the at least one camera. In such an example, perception system 402 classifies at least one physical object based on one or more groupings of physical objects (e.g., bicycles, vehicles, traffic signs, pedestrians, and/or the like). In some embodiments, perception system 402 transmits data associated with the classification of the physical objects to planning system 404 based on perception system 402 classifying the physical objects.” Para [0158] “In some embodiments, the process 900 includes: periodically predicting a future trajectory of an agent in a current environment of the vehicle based on at least one reference path determined for the agent. For example, the perception system can perform the process 900 in a period, e.g., every 10 seconds, 20 seconds, 30 seconds, or 1 minute. In some embodiments, the perception system performs the process 900 continuously. For example, once a round of the process 900 ends, the process 900 restarts or reiterates. In some embodiments, the perception system performs the process 900 in response to a triggering event, e.g., an input from a driver.” } and a processor configured to control driving of the vehicle based on the vectorized map. {Para [0030] “For each agent, the path-based trajectory prediction can include multiple operations: 1) vectorizing map into connected lane segments; 2) sampling the vectorized map for candidate reference paths (e.g., in 8 seconds) with reachable lane segments or reachable targets (e.g., end points) of the candidate reference paths; 3) classifying a set of candidate reference paths (e.g., by predicting a discrete probability distribution over the candidate reference paths) based on defined feature vectors, including scene feature vector (e.g., agent behavior) and path feature vector (e.g., first point, middle point, last point, direction, and length of each candidate reference path); 4) making trajectory prediction with respect to one or more selected reference paths in the Frenet frame using agents feature map augmented with path information; and 5) transforming the predicted trajectories back to Cartesian co-ordinates relative to the agent to obtain multimodal predictions.” Para [0030] “Some of the advantages of these techniques are as follows. For example, the techniques predict trajectories conditioned on feature descriptors of a complete reference path from the agent's current location to the agent's goal instead of just its goal locations. This is a much more informative feature descriptor and leads to more map compliant trajectories over longer prediction horizons compared to goal based prediction. Also, the techniques use reference paths, which allow to predict trajectories in the path relative Frenet frame relative to each sampled path. Compared to the Cartesian frame with varying lane locations and curvatures, predictions in the Frenet frame can have much lower variance. 
This again leads to more map compliant trajectories that better generalize to novel scene layouts. Moreover, compared to using a rasterized HD map for its scene and reference path encoders, the techniques directly encode the scene and reference paths using polylines, making the encoders more efficient. The techniques can sample and classify variable length reference paths along each lane centerline, which provides trajectory prediction with more flexibility to predict different motion profiles along lanes. The techniques can improve path prediction and path compliance, e.g., using agent past trajectory history in the prediction. The techniques can enhance performance of prediction in multi-lane turns with better path classifier and scene upsampling. In addition to standard metrics for multimodal prediction, the techniques can enhance two map compliance metrics of the predicted trajectories (e.g., commonly used drivable area compliance metric and a new lane deviation metric), for example, by utilizing map prior knowledge (e.g., high likelihood drivable areas). Further, the techniques can improve interaction reasoning in path encoder and improve the map and agents interaction graph. The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” Para [0063] “In some embodiments, device 300 performs one or more processes described herein. Device 300 performs these processes based on processor 304 executing software instructions stored by a computer-readable medium, such as memory 305 and/or storage component 308. A computer-readable medium (e.g., a non-transitory computer readable medium) is defined herein as a non-transitory memory device. A non-transitory memory device includes memory space located inside a single physical storage device or memory space spread across multiple physical storage devices.” } It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Afshar to use the vectorized map generated from images for autonomous driving because Para [0030] “. The techniques can improve reaction of autonomous vehicles to surrounding environments (e.g., periodically or continuously) to achieve reliable and accurate prediction for their own route/trajectory or operation planning, which realizes safe and reliable driving.” Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Xiao et al. (US 20250085115 A1) teaches in the abstract “A computer-implemented method of trajectory prediction includes obtaining a first cross-attention between a vectorized representation of a road map near a vehicle and information obtained from a rasterized representation of an environment near the vehicle by processing through a first cross-attention stage; obtaining a second cross-attention between a vectorized representation of a vehicle history and information obtained from the rasterized representation by processing through a second cross-attention stage; operating a scene encoder on the first cross-attention and the second cross-attention; operating a trajectory decoder on an output of the scene encoder; obtaining one or more trajectory predictions by performing one or more queries on the trajectory decoder.”. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER MATTA whose telephone number is (571)272-4296. The examiner can normally be reached Mon - Fri 10:00-6:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James Lee can be reached at (571) 270-5965. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /A.G.M./Examiner, Art Unit 3668 /JAMES J LEE/Supervisory Patent Examiner, Art Unit 3668
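For readers less familiar with the architecture terminology the §103 analysis above maps onto Liu, the following is a minimal, hypothetical sketch of the kind of cross-attention update recited in claim 6: the query taken from the BEV features, the key and value taken from the clip tokens, followed by a feed-forward network. The module name, tensor shapes, and layer choices are our own illustrative assumptions; this is not the applicant's implementation or Liu's VectorMapNet code.

# Hypothetical sketch of a claim-6 style update: the query comes from the BEV
# features, the key and value come from the clip tokens, and the result passes
# through cross-attention and a feed-forward network. All names, shapes, and
# layer choices are illustrative assumptions, not the applicant's or Liu's code.
import torch
import torch.nn as nn

class BEVCrossAttentionUpdate(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, bev_features: torch.Tensor, clip_tokens: torch.Tensor) -> torch.Tensor:
        # bev_features: (batch, H*W, dim) flattened BEV grid; clip_tokens: (batch, N, dim)
        attended, _ = self.attn(query=bev_features, key=clip_tokens, value=clip_tokens)
        x = self.norm1(bev_features + attended)   # residual connection and normalization
        return self.norm2(x + self.ffn(x))        # feed-forward refinement of the BEV features

# Example: update a 50 x 50 BEV grid using 32 clip tokens
bev = torch.randn(1, 50 * 50, 256)
clips = torch.randn(1, 32, 256)
updated_bev = BEVCrossAttentionUpdate()(bev, clips)   # same shape as bev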

Prosecution Timeline

Nov 04, 2024: Application Filed
Mar 21, 2026: Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12589770
SAFETY CONTROLLER FOR AUTOMATED DRIVING
2y 5m to grant • Granted Mar 31, 2026
Patent 12570148
ACCESSORY MANAGEMENT SYSTEM THAT IDENTIFIES ACCESSORIES TO ALLOW FOR CONNECTION
2y 5m to grant • Granted Mar 10, 2026
Patent 12552253
VEHICLE AND A METHOD OF CONTROLLING A DISPLAY TO OUTPUT A VISUAL INDICATION FOR INDUCING SELECTION OF A SPECIFIC DRIVING MODE
2y 5m to grant • Granted Feb 17, 2026
Patent 12534132
SYSTEM AND METHOD FOR PROVIDING A VISUAL AID FOR STEERING ANGLE OFFSET IN A STEER-BY-WIRE SYSTEM
2y 5m to grant • Granted Jan 27, 2026
Patent 12522245
COMPUTER-IMPLEMENTED METHOD FOR MANAGING AN OPERATIONAL DESIGN DOMAIN'S EXPANSION FOR AN AUTOMATED DRIVING SYSTEM
2y 5m to grant • Granted Jan 13, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 72%
With Interview (+22.6%): 94%
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 137 resolved cases by this examiner. Grant probability derived from career allow rate.
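As a rough cross-check, these projections line up with the career figures quoted earlier on the page: 98 grants out of 137 resolved cases gives the roughly 72% baseline, and adding the +22.6% interview lift gives about the 94% shown. The sketch below reproduces that arithmetic; treating the lift as a simple additive adjustment is our assumption, not a documented methodology.

# Rough reproduction of the headline projections from the career data on this page.
# Treating the interview lift as a simple additive adjustment is an assumption.
granted, resolved = 98, 137
baseline = granted / resolved                  # about 0.715, shown on the page as 72%
interview_lift = 0.226                         # the +22.6% lift reported with interview
with_interview = baseline + interview_lift     # about 0.941, shown on the page as 94%
print(f"baseline {baseline:.0%}, with interview {with_interview:.0%}")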
