Last updated: May 29, 2026
Application No. 17/530,023
METHOD, APPARATUS, AND SYSTEM FOR DEEP LEARNING OF SPARSE SPATIAL DATA FUNCTIONS

Non-Final OA §103
Filed: Nov 18, 2021
Examiner: DEVORE, CHRISTOPHER DILLON
Art Unit: 2129
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Here Global B V
OA Round: 3 (Non-Final)
This examiner grants 50% of cases after interview (+41.7% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 10 resolved cases, 2023–2026
Examiner Intelligence

DEVORE, CHRISTOPHER DILLON
Career Allowance Rate: 50% (5 granted / 10 resolved; -5.0% vs TC avg)
Interview Lift: +41.7%
Avg Prosecution: 4y 1m
Currently pending: 17
Statute-Specific Performance

§101: 2.0% (-38.0% vs TC avg)
§103: 93.9% (+53.9% vs TC avg)
§102: 2.0% (-38.0% vs TC avg)
§112: 2.0% (-38.0% vs TC avg)
Based on career data from 10 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 08/18/2025 has been entered.
 

Response to Arguments
Remarks pages 8-9, Applicant contends:
	The 101 rejections are traversed because claim 1 is directed to a computer-implemented method that improves the functioning of neural network architectures by enabling sorting of unsorted input data.
Response:
	Applicant’s arguments regarding 101 are persuasive that the claims satisfy 101: the application is seen as describing elements of a neural network (the SortCNN layer) that improves the functioning of a machine learning computer system on unordered or sparse data, especially in regard to the use of CNN layers.
	As a result, the 101 rejections are withdrawn, as the other independent claims include the same amendments as claim 1.

Remarks pages 10-11, Applicant contends:
	Claim 1 is amended to traverse pending rejections.
Response:
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection contains elements that have not been previously examined or does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
The remarks do not appear to present full arguments regarding the amendments; the argument appears to be “The cited references fail to disclose at least the features.” Because it is uncertain which features are being referenced, the amended features are interpreted as the features indicated.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 5-12, 14-16, and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Niu et al (“A review on the attention mechanism of deep learning”), referred to as Niu in this document, and in further view of Cheng et al (US 20230035475 A1), referred to as Cheng in this document.
Regarding Claim 1:
Niu teaches:
	A computer-implemented method comprising: creating a sort convolutional neural network (SortCNN) layer comprising a multi-head cross-attention layer and one or more convolutional neural network (CNN) layers,
[Niu 2.2 A unified attention model]: "Vaswani et al. [16] proposed multi-head attention [multi-head cross-attention layer] that linearly projects the input sequence (Q, K, V) to multiple subspaces based on learnable parameters, then applies scaled dot-product attention to its representation in each subspace, and finally concatenates their output."
[Niu 4.3 Networks without RNNs]: “Gehring et al. [102] proposed an encoder-decoder architecture that relied entirely on convolutional neural networks combined with the attention mechanism [creating a sort convolutional neural network (SortCNN) layer comprising a multi-head cross-attention layer and one or more convolutional neural network (CNN) layers]. In contrast to the fact that recurrent networks maintain a hidden state of the entire past, convolutional networks do not rely on the computations of the previous time step, so that it allows parallelization on each element in a sequence.”
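
For illustration only, the multi-head attention described in the Niu quote (linear projection of Q, K, V into several subspaces, scaled dot-product attention in each subspace, then concatenation) can be sketched as follows. This is a minimal sketch, not code from Niu, Cheng, or the application: the helper names, the toy projection matrices, and the omission of any final output projection are all assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, heads):
    """heads: one (W_q, W_k, W_v) triple of projection matrices per head (hypothetical layout)."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Concatenate the per-head outputs feature-wise for each query.
    return [sum((head[i] for head in outputs), []) for i in range(len(q))]

# Toy data: 2 queries, 3 key/value pairs, 2 heads that each project onto one coordinate.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]
out = multi_head_attention(q, k, v, heads)
```

Each query yields one output row whose width is the sum of the per-head subspace widths; here that is two columns, one per head.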

wherein at least one attention head of the multi-head cross-attention layer is associated with at least one linear projection matrix that is trained to arrange and quantize an unsorted set of input entities along an axis of a query/key space into a soft sorted set of the input entities based on one or more inducing points in the query/key space

[Niu 2.2 A unified attention model]: "Vaswani et al. [16] proposed multi-head attention that linearly projects the input [wherein at least one attention head of the multi-head cross-attention layer is associated with at least one linear projection matrix] sequence (Q, K, V) to multiple subspaces [along an axis of a query/key space into a soft sorted set of the input entities based on one or more inducing points in the query/key space] based on learnable parameters [that is trained to arrange and quantize an unsorted set of input entities], then applies scaled dot-product attention to its representation in each subspace, and finally concatenates their output."
[Niu 2.2 Unified Attention Model]: “When computing the attention distribution, the neural network first encodes the source data feature as K, called a key.”
[Niu 2.2 Unified Attention Model]: “When the neural network computes context vectors, it is often necessary to introduce a new data feature representation V, called value.”
[Niu 2.2 Unified Attention Model]: “In addition, it is usually necessary to introduce a task-related representation vector q, the query”
Information from specification to help define an axis as well as what is meant by soft sorted set [Current Application 0043]: “For each head 301, there is a learnable linear projection matrix 303, which projects the ordered sequence of values (e.g., the learnable vector 311) into inducing points 307 in k-dimensional space, where k is the dimensionality of the keys 309 and queries 313. This produces an axis (e.g., axis 317a-317c - also collectively referred axes 317) in the query/key space corresponding to each head 301, which is then used to match to keys 309 to pick entities (e.g., from among the input entities 109) with associated keys 309 closest to an inducing point 307(step 405). This process approximately "soft sorts" the input entities 109 according to a learnable axis 317 and its quantization (step 203). In other words, the dimensionality of the query/key space is based on a dimensionality of the one or more queries and the one or more keys of the input entities.”
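
One possible reading of the quoted paragraph [0043] can be sketched in code, purely as an aid to following the claim language. The sketch assumes a single head, a one-row projection matrix (called `axis` here), a negative squared distance as the key-to-inducing-point compatibility, and a `temperature` parameter; none of these specifics are stated in the application or in Niu.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_sort(entities, keys, positions, axis, temperature=0.05):
    """Arrange entities along an axis by attending from inducing points to keys.

    positions: ordered sequence of scalar values (the learnable vector).
    axis: one-row projection mapping each scalar into the key space (assumed form).
    """
    # Project each ordered value into the key space to obtain the inducing points.
    inducing_points = [[p * a for a in axis] for p in positions]
    dim = len(entities[0])
    result = []
    for point in inducing_points:
        # Keys closest to the inducing point receive the largest weights.
        scores = [-sum((ki - pi) ** 2 for ki, pi in zip(key, point)) / temperature
                  for key in keys]
        weights = softmax(scores)
        result.append([sum(w * e[d] for w, e in zip(weights, entities))
                       for d in range(dim)])
    return result

# Unsorted toy input: three entities whose keys lie along the first coordinate.
entities = [[3.0], [1.0], [2.0]]
keys = [[0.9, 0.0], [-1.0, 0.0], [0.1, 0.0]]
soft_sorted = soft_sort(entities, keys, positions=[-1.0, 0.0, 1.0], axis=[1.0, 0.0])
```

With a small temperature each output is dominated by the entity whose key is nearest the corresponding inducing point, so the outputs come out approximately in order along the axis. The number of outputs equals the number of ordered positions, that is, the number of queries.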
Figure 3 of the current application is shown with Figure 16 of Niu to further support that they share structure, especially in regard to multi-head attention. Both have three linear blocks at the bottom of the figure to take in the query, key, and value inputs.
[Current Application Figure 3]


[Niu Figure 16]


A quote to reinforce that Niu shows soft attention methods [Niu 3.1 The Softness of attention]: “For the soft attention, the attention module is differentiable with respect to the inputs, so the whole system can still be trained by standard back-propagation methods.”

wherein a training of the at least one linear projection matrix comprises: initializing a learnable vector as an ordered sequence of values representing target quantization positions, wherein a number of elements in the ordered sequence of values corresponds to a number of queries to input in the multi-head cross attention layer; 

and projecting, via the linear projection matrix, the learnable vector into the one or more inducing points in the query/key space to produce the axis of the query/key space, wherein the axis is used to match one or more keys of the input entities with one or more inducing point keys of a closest inducing point of the one or more inducing points to arrange and quantize the input entities along the axis

[Niu 2.2 A unified attention model]: "Vaswani et al. [16] proposed multi-head attention that linearly projects the input sequence (Q, K, V) [wherein a training of the at least one linear projection matrix comprises: initializing a learnable vector as an ordered sequence of values] to multiple subspaces based on learnable parameters [and projecting, via the linear projection matrix, the learnable vector into the one or more inducing points in the query/key space to produce the axis of the query/key space], then applies scaled dot-product attention to its representation in each subspace, and finally concatenates their output."

[Niu 2.2 A unified attention model]: "The above is our description of the common architectures in the attention model. Here we quote from Vaswani et al. [16], the attention mechanism “can be described as mapping a query and a set of key-value pairs to an output [wherein the axis is used to match one or more keys of the input entities with one or more inducing point keys of a closest inducing point of the one or more inducing points to arrange and quantize the input entities along the axis], where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.” [wherein a number of elements in the ordered sequence of values representing target quantization positions corresponds to a number of queries to input in the multi-head cross attention layer]."

The reference notes “mapping a query and a set of key-value pairs to an output”, which means it is working on a per query basis.

wherein the soft sorted set of the input entities enables the CNN layers to operate with kernel widths greater than one by aligning neighboring entities based on similarity along the learned axis
[Niu 2.2 A unified attention model]: "Vaswani et al. [16] proposed multi-head attention that linearly projects the input sequence (Q, K, V) to multiple subspaces [wherein the soft sorted set of the input entities enables the CNN layers to operate with kernel widths greater than one by aligning neighboring entities based on similarity along the learned axis] based on learnable parameters, then applies scaled dot-product attention to its representation in each subspace, and finally concatenates their output."
The above quote is cited for Niu's teaching of soft sort. This limitation is seen as taught by the same elements that teach the creation of the soft sorted set in claim 1, because the limitation states that soft sorting “enables” CNN layers to use kernel widths greater than 1. In other words, under the limitation's own terms, teaching soft sort also teaches the enabled feature, since that feature follows from the soft sort. Given that the elements required for a soft sorted set are taught by Niu, the requirements for the enablement are satisfied. This interpretation is supported by the current specification, which notes that something considered “soft sorted” would result in the required enablement
([Current Application 0045]: “The reason for learning a "soft sort" quantization of originally unordered input entities 109 is that it now enables the use of CNN layers 107 with kernel width larger than one. That is because when the input entities 109 have been attentionally "soft sorted," the neighborhoods of the entities in the output (e.g., each set of soft-sorted entities 305) are now meaningful such that neighbor entities in the output resemble each other with in the sense of the sort axes 317. As result, each set of sort-sorted entities 305 can be projected into respective kernel widths 319a-319c (also collectively referred to as kernel widths) of a CNN layer 107. So, the cross-attentional output becomes an input to a stack of normal CNN layers 107 that can have kernel widths greater than 1, which can be set up to produce as many entities as there are outputs of the cross-attentional layer 105 using input padding appropriately.”). 
Should soft sort be interpreted as encompassing all of the limitations recited in claim 1, then the combination of Niu and Cheng is seen as teaching the limitation, as claim 1 is considered taught by Niu and Cheng.
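
The quoted paragraph [0045] describes feeding the soft-sorted output into CNN layers whose kernel width exceeds one, with input padding so that the number of outputs matches the number of inputs. A minimal sketch of such a convolution over scalar entities is given below; the function name, the zero padding, and the difference kernel are illustrative assumptions, not the application's implementation.

```python
def conv1d_same(sequence, kernel):
    """1D convolution (cross-correlation) with zero padding; output length equals input length."""
    if len(kernel) % 2 != 1:
        raise ValueError("kernel width must be odd for symmetric padding")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(sequence) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(sequence))]

# Kernel of width 3 that compares each entity's right and left neighbors.
kernel = [-1.0, 0.0, 1.0]
sorted_entities = [1.0, 2.0, 3.0, 4.0]
response = conv1d_same(sorted_entities, kernel)
```

Because a kernel of width three mixes each entity with its neighbors, its response is only meaningful when neighboring positions hold related entities. On the ordered toy input above, the interior responses are uniform, while the last response reflects the zero padding at the boundary.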

Niu does not explicitly teach:
and projecting the soft sorted set of the input entities from the multi-head cross-attention layer through the CNN layers, 
wherein the CNN layers learn one or more functions based on integrating information from the soft sorted set of the input entities as arranged and quantized by the at least one linear projection matrix
that is trained to arrange and quantize an unsorted set of input entities
to arrange and quantize the input entities along the axis
representing target quantization positions


Cheng teaches:
and projecting the soft sorted set of the input entities from the multi-head cross-attention layer through the CNN layers, 
wherein the CNN layers learn one or more functions based on integrating information from the soft sorted set of the input entities as arranged and quantized by the at least one linear projection matrix

[Cheng Figure 2]


		(The attention mechanisms in Cheng include encoder and decoder layers, where the decoder layers can receive elements from earlier decoder layers and encoder layers, such as 258. Encoders are labeled with E, while decoders are labeled with D. [cross-attention])
		Cheng Figure 2 is shown because it depicts cross-attention, both as commonly defined and as exemplified in the specification: an encoder-decoder attention setup in which inputs are received from both an encoder layer and a decoder layer.
[Current Application 0052]: “A cross-attention layer where the queries are from the corresponding encoder layer inputs with an optional learnable projection function, and the entities and keys are from the decoder SortCNN layer outputs.”

[Cheng 0083]: “Decoder block 258 fuses the output of the spatial feature transformer 240, the previous decoder block 256 and the encoder-decoder skip connection from encoder block 210-n and passes its output—a decoded feature map—to a classifier 270 that performs a further sparse convolution [and projecting the soft sorted set of the input entities from the multi-head cross-attention layer through the CNN layers] to reduce the number of feature channels to the number of target classes (e.g., 20) and thereby generate a decoded sparse tensor with class information or labels for each point [wherein the CNN layers learn one or more functions based on integrating information from the soft sorted set of the input entities as arranged and quantized by the at least one linear projection matrix. The learned function in this case is labels or labeling. The arranged and quantized part of the limitation was taught earlier by Niu.]. This can be used to create an output point cloud that has semantic segmentation information applied based on the classes.”

that is trained to arrange and quantize an unsorted set of input entities
to arrange and quantize the input entities along the axis
representing target quantization positions
[Cheng 0053]: “Several methods may be used to pre-process the 3D point clouds, with the most common ones being: cylinder voxelization to reduce the loss of quantization [that is trained to arrange and quantize an unsorted set of input entities][ to arrange and quantize the input entities along the axis] by converting Cartesian coordinates of each point in a 3D point cloud to polar coordinates [representing target quantization positions] to generate a voxel representation of the 3D point cloud; and multi-view fusion of different representations of a 3D point cloud to optimize the perception of certain objects by projecting a 3D point cloud into different representations and fusing the different representations together. For example, projecting multiple representations of a 3D point cloud to a 2D bird's-eye view image…”
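
As a rough illustration of the cylinder voxelization mentioned in Cheng [0053], the sketch below converts a Cartesian point to polar (cylindrical) coordinates and quantizes it into a voxel index. The bin sizes, the number of angular bins, and the function name are assumptions chosen for the example; Cheng does not specify them in the quoted passage.

```python
import math

def cylindrical_voxel_index(point, rho_step, phi_bins, z_step):
    """Map a Cartesian (x, y, z) point to a (radius, angle, height) voxel index."""
    x, y, z = point
    rho = math.hypot(x, y)                   # radial distance from the z axis
    phi = math.atan2(y, x) % (2 * math.pi)   # angle wrapped into [0, 2*pi)
    rho_index = int(rho // rho_step)
    phi_index = int(phi / (2 * math.pi) * phi_bins) % phi_bins
    z_index = math.floor(z / z_step)
    return (rho_index, phi_index, z_index)

# Two toy points quantized with 0.5-unit radial bins, 4 angular bins, 1-unit height bins.
first = cylindrical_voxel_index((1.0, 1.0, 0.2), rho_step=0.5, phi_bins=4, z_step=1.0)
second = cylindrical_voxel_index((-1.0, -1.0, -0.5), rho_step=0.5, phi_bins=4, z_step=1.0)
```

Points at the same distance from the sensor axis share a radial bin regardless of direction, which is the property that distinguishes this scheme from a uniform Cartesian grid.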

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Cheng to incorporate convolutional layers after attention. Niu and Cheng are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate the use of convolutional layers for parts of the model, as convolutional layers work well for higher dimensions and sparse data, which are common in point clouds ([Cheng 0054]: “A unique set of sparse 3D convolutional neural networks have been designed to account for the sparse characteristics of 3D point clouds in order to efficiently capture 3D spatial information while reducing the impact of high-dimensional computing performance degradation. Examples of sparse convolutional neural network tools that are used to process 3D point clouds include the Minkowski Engine (i.e. a software library that includes various functions and classes for building sparse convolutional neural networks and performing related operations), as well as the SpConv and Torch.Sparse software libraries.”).
One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate quantization to improve performance and support the processing of data such as point clouds ([Cheng 0053]: “and multi-view fusion of different representations of a 3D point cloud to optimize the perception of certain objects by projecting a 3D point cloud into different representations and fusing the different representations together. For example, projecting multiple representations of a 3D point cloud to a 2D bird's-eye view image can improve the detection performance of a deep neural network which performs sematic segmentation on images, but projecting multiple representations of the 3D point cloud into a 2D range map by spherical projection facilitates the detection of roads and buildings using the deep neural network which performs sematic segmentation”).

Regarding Claim 2:
The method of claim 1 is taught by Niu and Cheng.
Multi-head cross attention is taught in claim 1 by Niu and Cheng.
Niu teaches:
	wherein the multi-head cross-attention layer takes in the input entities and one or more queries to generate an output entity for each query of the one or more queries, and wherein the output entity is a weighted linear combination of the input entities based on a matching of one or more keys of the input entities to each query.
[Niu 2.2 Unified Attention Model]: "The above is our description of the common architectures in the attention model. Here we quote from Vaswani et al. [16], the attention mechanism “can be described as mapping a query and a set of key-value pairs [wherein the multi-head cross-attention layer takes in the input entities and one or more queries] to an output [to generate an output entity for each query of the one or more queries], where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value [is a weighted linear combination of the input entities] is computed by a compatibility function of the query with the corresponding key. [based on a matching of one or more keys of the input entities to each query]”."
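
To make the mapping in the quote concrete, the following sketch computes one output per query as a weighted sum of the input entities, with weights given by a compatibility function between the query and each key. The dot-product compatibility with softmax normalization and all names are assumptions made for illustration.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and every key."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, entities):
    """Return one output entity per query: a weighted linear combination of the entities."""
    dim = len(entities[0])
    outputs = []
    for query in queries:
        weights = attention_weights(query, keys)
        outputs.append([sum(w * e[d] for w, e in zip(weights, entities))
                        for d in range(dim)])
    return outputs

# Two queries attend over three input entities, so two output entities are produced.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
entities = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs = cross_attention(queries, keys, entities)
weights_for_first_query = attention_weights(queries[0], keys)
```

The number of outputs follows the number of queries, not the number of input entities, and the weights for each query sum to one.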

Regarding Claim 3:
The method of claim 2 is taught by Niu and Cheng.
Niu teaches:
wherein a dimensionality of the query/key space is based on a dimensionality of the one or more queries and the one or more keys of the input entities, and wherein the axis corresponds to one dimension of the query/key space.
[Niu 2.2 A unified attention model]: "The score function f is a crucial part of the attention model because it defines how keys and queries are matched or combined [wherein a dimensionality of the query/key space is]. In Table 1, we list some common score functions... Moreover, Vaswani et al. [16] proposed a variant of multiplicative attention by adding the scaling factor of 1/squareroot(dk), where dk is the dimension of keys."
	[Niu Table 1]


(The scaled multiplicative entry shows a score function that is based on query and key dimension [based on a dimensionality of the one or more queries and the one or more keys of the input entities]. The function depends on the key dimension d_k directly, and matrix arithmetic requires q and k to have compatible dimensions for the product q^T k to be computed. Thus it also depends on the query dimension.)
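
The dimension dependence noted above can be checked with a short sketch of the scaled multiplicative score, q^T k divided by the square root of d_k. The function name and the explicit dimension check are assumptions added for illustration.

```python
import math

def scaled_multiplicative_score(q, k):
    """Compute q^T k / sqrt(d_k); q and k must share the same dimension."""
    if len(q) != len(k):
        raise ValueError("query and key must have the same dimension")
    d_k = len(k)
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)

# With d_k = 4 the dot product 10 is scaled by 1/sqrt(4) = 1/2.
score = scaled_multiplicative_score([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

A query whose dimension differs from the key dimension cannot be scored at all, which is the sense in which the score depends on both dimensions.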
[Niu 2.2 A unified attention model]: "Vaswani et al. [16] proposed multi-head attention that linearly projects the input sequence (Q, K, V) to multiple subspaces [and wherein the axis corresponds to one dimension of the query/key space (This is the same limitation taught in claim 1. The reference quote is simply provided again for convenience.)] based on learnable parameters, then applies scaled dot-product attention to its representation in each subspace, and finally concatenates their output." 


Regarding Claim 5:
The method of claim 1 is taught by Niu and Cheng.
Niu teaches:
wherein the SortCNN layer is a component of an overall machine learning model architecture
[Niu 4.3 Networks without RNNs]: “Gehring et al. [102] proposed an encoder-decoder architecture that relied entirely on convolutional neural networks combined with the attention mechanism [wherein the SortCNN layer is a component of an overall machine learning model architecture]. In contrast to the fact that recurrent networks maintain a hidden state of the entire past, convolutional networks do not rely on the computations of the previous time step, so that it allows parallelization on each element in a sequence.”

Regarding Claim 6:
The method of claim 5 is taught by Niu and Cheng.

Cheng teaches:
wherein the input entities of the overall machine learning model architecture include a first geometric entity, 
[Cheng 0068]: “Neural network 200 accepts input data 202 at input block 204. As noted, input data 202 may be unprocessed 3D point cloud data [wherein the input entities of the overall machine learning model architecture include a first geometric entity,] for a volume, which may be pre-processed to generate a voxel-based representation.”
and wherein an output of the overall machine learning model is a second geometric entity
[Cheng 0083]: “Decoder block 258 fuses the output of the spatial feature transformer 240, the previous decoder block 256 and the encoder-decoder skip connection from encoder block 210-n and passes its output—a decoded feature map—to a classifier 270 that performs a further sparse convolution to reduce the number of feature channels to the number of target classes (e.g., 20) and thereby generate a decoded sparse tensor with class information or labels for each point. This can be used to create an output point cloud [and wherein an output of the overall machine learning model is a second geometric entity] that has semantic segmentation information applied based on the classes. The nature of the class labels may depend on the specific application. For example, in the automotive context, classes may include ground, structure, vehicle, nature, human, object and other classes, which may be further subdivided into, e.g., road, sidewalk, parking, other-ground, and so forth.”

Notes from the current application on what a geometric entity is [Current Application 0024]: “Specifically sparse geometric data such as point sets (e.g., geographic coordinates corresponding to locations, trajectories, cartographic/map features, etc.) and multi-lines in vector graphic-like representations (e.g., representing road networks, geographic areas, terrain features, etc.) are inherently challenging for deep neural networks (e.g., convolutional neural networks (CNNs)).”

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Cheng to incorporate geometric or 3D data. Niu and Cheng are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate the use of geometric or 3D data, as such data can be used to capture and detect elements to understand a scene ([Cheng 0002]: “With the advancement of technology, 3D scenes—also referred to as 3D environment—can be captured using detection and ranging (DAR) sensors, such as scanning light detection and ranging (LiDAR) sensors. Currently, 3D scenes that are captured using DAR sensors, such as LiDAR sensors, are represented by sparse 3D point clouds. The processing of sparse 3D point clouds to recognize and understand 3D scenes has proven challenging. Unlike low-dimensional 2D images, 3D point clouds lack color feature information, are sparse, and have the property of varying density, where a region near the LiDAR sensor has much greater density (i.e. the 3D point has many more points) than a region distant to the LiDAR sensor. This has made it difficult for conventional methods of processing 2D images to perform semantic segmentation processing of sparse 3D point clouds. However, 3D point clouds are informative and their precise geometric features can still be exploited to play a role in scene understanding.”).


Regarding Claim 7:
The method of claim 6 is taught by Niu and Cheng.
Cheng teaches:
wherein the first geometric entity includes a set of point coordinates in an arbitrary dimensional space; 
[Cheng 0068]: “Neural network 200 accepts input data 202 at input block 204. As noted, input data 202 may be unprocessed 3D point cloud data [wherein the first geometric entity includes a set of point coordinates in an arbitrary dimensional space] for a volume, which may be pre-processed to generate a voxel-based representation.”

and wherein the second geometric entity comprises a point, a line, a shape, or a combination thereof
[Cheng 0083]: “Decoder block 258 fuses the output of the spatial feature transformer 240, the previous decoder block 256 and the encoder-decoder skip connection from encoder block 210-n and passes its output—a decoded feature map—to a classifier 270 that performs a further sparse convolution to reduce the number of feature channels to the number of target classes (e.g., 20) and thereby generate a decoded sparse tensor with class information or labels for each point. This can be used to create an output point cloud [and wherein the second geometric entity comprises a point, a line, a shape, or a combination thereof] that has semantic segmentation information applied based on the classes. The nature of the class labels may depend on the specific application. For example, in the automotive context, classes may include ground, structure, vehicle, nature, human, object and other classes, which may be further subdivided into, e.g., road, sidewalk, parking, other-ground, and so forth.”

		The motivation to combine Niu and Cheng for claim 7 is the same as the motivation given for claim 6.

Regarding Claim 8:
The method of claim 5 is taught by Niu and Cheng.
Niu teaches:
wherein the overall machine learning model architecture comprises one or more tiers of an encoder layer and a decoder layer
[Niu 4.3 Networks without RNNs]: “Gehring et al. [102] proposed an encoder-decoder architecture [wherein the overall machine learning model architecture comprises one or more tiers of an encoder layer and a decoder layer] that relied entirely on convolutional neural networks combined with the attention mechanism. In contrast to the fact that recurrent networks maintain a hidden state of the entire past, convolutional networks do not rely on the computations of the previous time step, so that it allows parallelization on each element in a sequence.”

Regarding Claim 9:
The method of claim 8 is taught by Niu and Cheng.
Cheng teaches:
wherein the encoder layer and the decoder layer are connected with a skip connection.
[Cheng 0012]: “In some cases, the method further comprises feeding (n−1) encoder-decoder skip connection [wherein the encoder layer and the decoder layer are connected with a skip connection] outputs from a first through (n−1)th encoder blocks of the n encoder blocks to the n decoder blocks, wherein the (n−1) encoder-decoder skip connection outputs are fed to the n decoder blocks by reverse order of respective depth.”

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Cheng to incorporate a skip connection. Niu and Cheng are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate a skip connection, as a skip connection gives residual convolutions or direct input to the decoder from the encoder, which allows later layers in a model to receive data related to what was encoded ([Cheng 0082]: “The phrase ‘encoder-decoder skip connection’ as used in this context may be considered as residual convolutions wherein the feature map output of the encoder block is used as direct input to a corresponding decoder block.”).
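
The skip-connection arrangement quoted from Cheng [0012] and [0082] (n-1 encoder outputs fed directly to decoder blocks in reverse order of depth) can be sketched with toy blocks as below. The sketch assumes that the first decoder block receives only the deepest encoder output and that fusion is a simple addition; both are simplifications for illustration, and all names are hypothetical.

```python
def run_encoder_decoder(x, encoder_blocks, decoder_blocks):
    """Run n encoder blocks, then n decoder blocks fed by n-1 skip connections."""
    encoder_outputs = []
    for block in encoder_blocks:
        x = block(x)
        encoder_outputs.append(x)
    skips = encoder_outputs[:-1]  # outputs of encoder blocks 1 .. n-1
    for index, block in enumerate(decoder_blocks):
        # Assumption: the first decoder sees only the deepest encoder output;
        # later decoders take the remaining skips, deepest first.
        skip = skips.pop() if index > 0 and skips else None
        x = block(x, skip)
    return x

received = []  # records which skip each toy decoder block was given

def make_decoder(name):
    def block(x, skip):
        received.append((name, skip))
        return x if skip is None else x + skip  # toy fusion by addition
    return block

encoders = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 10]
decoders = [make_decoder("D1"), make_decoder("D2"), make_decoder("D3")]
result = run_encoder_decoder(1, encoders, decoders)
```

In this toy run the last decoder block is paired with the first encoder block's output, so the shallowest encoder features reach the final decoding step directly.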

Regarding Claim 10:
The method of claim 5 is taught by Niu and Cheng.
Niu teaches:
wherein the overall machine learning model architecture comprises at least the SortCNN layer followed by another cross-attention layer, 
[Niu 4.3 Networks without RNNs]: “Gehring et al. [102] proposed an encoder-decoder architecture [wherein the overall machine learning model architecture comprises at least the SortCNN layer] that relied entirely on convolutional neural networks combined with the attention mechanism. In contrast to the fact that recurrent networks maintain a hidden state of the entire past, convolutional networks do not rely on the computations of the previous time step, so that it allows parallelization on each element in a sequence.”

Cheng teaches:
[Cheng 0012]: “In some cases, the method further comprises feeding (n−1) encoder-decoder skip connection outputs from a first through (n−1)th encoder blocks of the n encoder blocks to the n decoder blocks [followed by another cross-attention layer], wherein the (n−1) encoder-decoder skip connection outputs are fed to the n decoder blocks by reverse order of respective depth.”
This quote from the current application shows that the present limitation appears to refer to an encoder-decoder arrangement, as taught by Cheng. [Current Application 0052]: “In one embodiment, each decoder layer can be composed of a sequence of layers from input to output (the illustrated order of the layers is provided as just one presented alternative and is not intended as a limitation, the order can be arbitrary). An example sequence uses an inverse order compared to the corresponding encoder layer as follows: One or more entity-wise dense layers; A Sort CNN layer 108; and A cross-attention layer where the queries are from the corresponding encoder layer inputs with an optional learnable projection function…”

and wherein one or more attended entities of the SortCNN layer are concatenated with one or more output entities of the another cross-attention layer
[Cheng 0113]: “If only the final (nth) sparse deconvolutional layer remains, then at 665 the most recent intermediate (i.e., (n−1)th) decoder feature map, the fused feature map and the encoder-decoder skip connection from the first encoder layer are fused [and wherein one or more attended entities of the SortCNN layer are concatenated with one or more output entities of the another cross-attention layer] and processed in a final sparse deconvolutional layer operation, represented as decoder 258 of network 200, to produce a decoded feature map. The fusing involves a concatenation operation.”

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Cheng to incorporate concatenating the outputs of attention layers. Niu and Cheng are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate concatenating the outputs of attention layers, as this captures and restores multi-scale features more effectively ([Cheng 0064]: “Linear fusion is the element-wise concatenation of the feature maps of corresponding points in the sparse tensors of the output of the two branches. If a corresponding point is not found, the feature vector is assumed to be 0 for the point, so that the feature information of the original 3D point cloud is not lost. In this way, multi-scale features in the spatial information can be captured and restored more effectively.”).
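For illustration only, the linear fusion described in Cheng 0064 (concatenating the feature vectors of corresponding points and assuming a zero vector where no corresponding point is found) can be sketched as follows. The function name linear_fusion and the dictionary-based sparse representation are assumptions made for this sketch; they are not taken from Cheng.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
SparseMap = Dict[Coord, List[float]]


def linear_fusion(a: SparseMap, b: SparseMap, dim_a: int, dim_b: int) -> SparseMap:
    """Concatenate per-point features of two sparse feature maps.

    A point present in only one map keeps its features; the missing side is
    filled with zeros so that no point of the original cloud is dropped.
    """
    fused: SparseMap = {}
    for coord in sorted(set(a) | set(b)):
        feat_a = a.get(coord, [0.0] * dim_a)
        feat_b = b.get(coord, [0.0] * dim_b)
        fused[coord] = feat_a + feat_b
    return fused


branch_a = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [3.0, 4.0]}
branch_b = {(0, 0, 0): [5.0], (2, 0, 0): [6.0]}
fused = linear_fusion(branch_a, branch_b, dim_a=2, dim_b=1)
# fused[(1, 0, 0)] == [3.0, 4.0, 0.0] and fused[(2, 0, 0)] == [0.0, 0.0, 6.0]
```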

Regarding Claim 11:
The method of claim 10 is taught by Niu and Cheng.
Cheng teaches:
wherein the one or more attended entities of the SortCNN layer are concatenated with one or more output entities of the another cross-attention layer using a learnable transformation
[Cheng 0113]: “If only the final (nth) sparse deconvolutional layer remains, then at 665 the most recent intermediate (i.e., (n−1)th) decoder feature map, the fused feature map and the encoder-decoder skip connection from the first encoder layer are fused [wherein the one or more attended entities of the SortCNN layer are concatenated with one or more output entities of the another cross-attention layer] and processed in a final sparse deconvolutional layer operation, represented as decoder 258 of network 200, to produce a decoded feature map. The fusing involves a concatenation operation.”
[Cheng 0014]: “In some cases, the method further comprises fusing the fused feature map, an output of the (n−1)th decoder block and the output of the first encoder blocks, wherein the fusing comprises concatenation followed by a convolution operation [using a learnable transformation]” 

The motivation to combine Niu and Cheng is the same as the motivation in claim 10.
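For illustration only, concatenation followed by a learnable transformation can be sketched as follows. Cheng 0014 recites a convolution after the concatenation; this sketch substitutes a per-entity linear projection (comparable to a convolution with a kernel size of one) as a stand-in. The function name concat_then_project and the weight values are hypothetical.

```python
from typing import List

Vector = List[float]


def concat_then_project(
    attended: List[Vector],
    other: List[Vector],
    weights: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Concatenate paired entity features, then apply a learned linear map.

    weights has one row per output channel; each row spans the concatenated
    feature length. In training, weights and bias would be learned.
    """
    if len(attended) != len(other):
        raise ValueError("both inputs must contain the same number of entities")
    projected = []
    for a, o in zip(attended, other):
        joined = a + o  # concatenation of the two feature vectors
        projected.append(
            [sum(w * v for w, v in zip(row, joined)) + b for row, b in zip(weights, bias)]
        )
    return projected


out = concat_then_project(
    attended=[[1.0, 2.0]],
    other=[[3.0]],
    weights=[[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]],
    bias=[0.5, 0.0],
)
# out == [[4.5, -1.0]]
```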

Regarding Claim 12:
The method of claim 5 is taught by Niu and Cheng.
Cheng teaches:
wherein the input entities of the overall machine learning model architecture include one or more observed points with approximate locations, and wherein an output of the overall machine learning model is an approximate ground truth location of a map feature.
[Cheng 0068]: “Neural network 200 accepts input data 202 at input block 204. As noted, input data 202 may be unprocessed 3D point cloud data [wherein the input entities of the overall machine learning model architecture include one or more observed points with approximate locations] for a volume, which may be pre-processed to generate a voxel-based representation.”
[Cheng 0083]: “Decoder block 258 fuses the output of the spatial feature transformer 240, the previous decoder block 256 and the encoder-decoder skip connection from encoder block 210-n and passes its output—a decoded feature map—to a classifier 270 that performs a further sparse convolution to reduce the number of feature channels to the number of target classes (e.g., 20) and thereby generate a decoded sparse tensor with class information or labels for each point. This can be used to create an output point cloud that has semantic segmentation information applied based on the classes [and wherein an output of the overall machine learning model is an approximate ground truth location of a map feature]. The nature of the class labels may depend on the specific application. For example, in the automotive context, classes may include ground, structure, vehicle, nature, human, object and other classes, which may be further subdivided into, e.g., road, sidewalk, parking, other-ground, and so forth.”

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Cheng to incorporate location data that outputs a location of a feature. Niu and Cheng are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Cheng to incorporate location data that outputs a location of a feature, as classification of features or objects is useful in contexts such as automotive ([Cheng 0083]: “This can be used to create an output point cloud that has semantic segmentation information applied based on the classes. The nature of the class labels may depend on the specific application. For example, in the automotive context, classes may include ground, structure, vehicle, nature, human, object and other classes, which may be further subdivided into, e.g., road, sidewalk, parking, other-ground, and so forth.”).
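For illustration only, the final step described in Cheng 0083 (per-point class scores turned into a labeled output point cloud) can be sketched as follows. The function name label_points and the sample scores are hypothetical, and the sketch assumes the highest-scoring class is chosen for each point, which Cheng does not state explicitly.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def label_points(
    points: Sequence[Point],
    class_scores: Sequence[Sequence[float]],
    class_names: Sequence[str],
) -> List[Tuple[Point, str]]:
    """Attach to each point the name of its highest-scoring class."""
    labeled = []
    for point, scores in zip(points, class_scores):
        best = max(range(len(scores)), key=scores.__getitem__)
        labeled.append((point, class_names[best]))
    return labeled


cloud = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5)]
scores = [[0.1, 0.9], [0.7, 0.3]]
labeled = label_points(cloud, scores, ["ground", "vehicle"])
# labeled == [((0.0, 0.0, 0.0), "vehicle"), ((1.0, 2.0, 0.5), "ground")]
```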

Regarding Claim 14:
		Niu does not explicitly teach:
An apparatus comprising: at least one processor; and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, within the at least one processor, cause the apparatus to perform at least the following
		Cheng teaches:
An apparatus comprising: at least one processor; and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, within the at least one processor, cause the apparatus to perform at least the following

[Cheng 0122]: “The functions of the various elements shown in FIG. 1, including the functional blocks labelled as “CPU” and “SPU”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor [at least one processor], the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software [the at least one memory and the computer program code configured to, within the at least one processor, cause the apparatus to perform at least the following], and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) [and at least one memory] for storing software [including computer program code for one or more programs], random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.”

		The rest of the claim is analogous to claim 1.
		The motivation to combine Niu and Cheng is the same as claim 1.

Regarding Claim 15:
		The apparatus of claim 14 is taught by Niu and Cheng.
		This claim is analogous to claim 2.

Regarding Claim 16:
		The apparatus of claim 15 is taught by Niu and Cheng.
		This claim is analogous to claim 3.


Regarding Claim 18:
		Niu does not explicitly teach:
A non-transitory computer-readable storage medium carrying one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to perform
		Cheng teaches:
A non-transitory computer-readable storage medium carrying one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to perform
[Cheng 0123]: “Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a non-transitory computer readable medium [A non-transitory computer-readable storage medium] and so executed by a computer or processor whether or not such computer or processor is explicitly shown.”
[Cheng 0122]: “The functions of the various elements shown in FIG. 1, including the functional blocks labelled as “CPU” and “SPU”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software [when executed by one or more processors, cause an apparatus to perform], and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) [and at least one memory] for storing software [carrying one or more sequences of one or more instructions which], random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.”

The rest of this claim is analogous to claim 1.
The motivation to combine is the same as the motivation in claim 1.

Regarding Claim 19:
		The non-transitory computer-readable storage medium of claim 18 is taught by Niu and Cheng.
		This claim is analogous to claim 2.

Regarding Claim 20:
		The non-transitory computer-readable storage medium of claim 19 is taught by Niu and Cheng.
		This claim is analogous to claim 3.
		

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Niu et al (“A review on the attention mechanism of deep learning”), referred to as Niu in this document, in view of Cheng et al (US 20230035475 A1), referred to as Cheng in this document, and further in view of Li et al (US 20210287430 A1), referred to as Li in this document.
Regarding Claim 13:
The method of claim 12 is taught by Niu and Cheng.
Niu does not teach:
wherein a loss function of the overall machine learning model architecture includes a Chamfer distance

Li teaches:
wherein a loss function of the overall machine learning model architecture includes a Chamfer distance
[Li 0125]: “A distance such as the Chamfer distance [wherein a loss function of the overall machine learning model architecture includes a Chamfer distance] may be utilized because the projected vertices and pixels with the same part label p in the input image may not have a strictly one-to-one correspondence.”

One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Niu and Li to incorporate Chamfer distance. Niu and Li are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Niu and Li to incorporate Chamfer distance for loss, as Chamfer distance functions on data that does not have one-to-one correspondence ([Li 0125]: “A distance such as the Chamfer distance may be utilized because the projected vertices and pixels with the same part label p in the input image may not have a strictly one-to-one correspondence.”).
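For illustration only, a Chamfer distance between two point sets can be sketched as follows. Conventions vary (squared or unsquared distances, sums or means); this sketch assumes the symmetric form that averages squared nearest-neighbor distances in each direction, and is not taken from Li or from the current application. It shows why no one-to-one correspondence is needed: each point is simply matched to its nearest neighbor in the other set.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def chamfer_distance(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Symmetric Chamfer distance using mean squared nearest-neighbor distance."""
    if not pred or not target:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # For each source point, take the squared distance to its nearest
        # destination point, then average over the source set.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


loss = chamfer_distance([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0)])
# loss == 0.5: the two sets differ in size, yet the distance is well defined
```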

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Vaswani et al (“Attention is all you need”) is relevant art, as Vaswani et al teaches attention and attention types for use in models. This is relevant for understanding the variants of attention used in the current application.
One quote in particular shows a variation of cross attention much like the one mentioned in the specification of the current application ([Vaswani et al 3.2.3 Applications of Attention in our Model]: “In ‘encoder-decoder attention’ layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].”).
Brahma et al (US 20220261590 A1) is relevant art that teaches the use of machine learning to assist in processing Lidar data. The machine learning in Brahma implements attention, projection, and concatenation.
Qi et al (US 20190147245 A1) is relevant art that teaches the use of machine learning on point clouds or sparse data where quantization, CNN layers, attention, and projections are utilized to process the data in the point clouds. Processing the data in sparse data for elements like CNN layers is very similar to the motivation behind the current application.
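For illustration only, the encoder-decoder (cross) attention quoted above from Vaswani et al, in which the queries come from the decoder side and the keys and values come from the encoder output, can be sketched as single-head scaled dot-product attention. The function name cross_attention is hypothetical, and learned projections, masking, and multiple heads are omitted.

```python
import math
from typing import List

Vector = List[float]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Each query attends over all encoder positions (keys and values)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Numerically stable softmax over the encoder positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors, one output per query.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


decoder_queries = [[0.0, 0.0]]
encoder_keys = [[1.0, 0.0], [0.0, 1.0]]
encoder_values = [[2.0], [4.0]]
attended = cross_attention(decoder_queries, encoder_keys, encoder_values)
# Equal scores give equal weights, so attended[0][0] is the mean of 2.0 and 4.0
```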

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTOPHER D DEVORE whose telephone number is (703)756-1234. The examiner can normally be reached Monday-Friday 7:30 am - 5 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/C.D.D./Examiner, Art Unit 2129


/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129
Prosecution Timeline

Nov 18, 2021: Application Filed
Dec 30, 2024: Non-Final Rejection mailed (§103)
Mar 30, 2025: Response Filed
May 16, 2025: Final Rejection mailed (§103)
Aug 18, 2025: Request for Continued Examination
Aug 28, 2025: Response after Non-Final Action
Oct 27, 2025: Non-Final Rejection mailed (§103)
Jan 27, 2026: Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

17/332,099 (Patent 12530603): OBTAINING AND UTILIZING FEEDBACK FOR AGENT-ASSIST SYSTEMS. 4y 7m to grant; granted Jan 20, 2026.
17/616,946 (Patent 12505355): GENERAL FORM OF THE TREE ALTERNATING OPTIMIZATION (TAO) FOR LEARNING DECISION TREES. 4y 0m to grant; granted Dec 23, 2025.
17/454,551 (Patent 12468978): Reinforcement Learning In A Processing Element Method And System Thereof. 4y 0m to grant; granted Nov 11, 2025.
17/508,715 (Patent 12412069): COOKIE SPACE DOMAIN ADAPTATION FOR DEVICE ATTRIBUTE PREDICTION. 3y 10m to grant; granted Sep 09, 2025.
Study what changed to get past this examiner. Based on 4 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 50%
With Interview: 92% (+41.7%)
Median Time to Grant: 4y 1m (~0m remaining)
PTA Risk: High
Based on 10 resolved cases by this examiner. Grant probability derived from career allowance rate.