Prosecution Insights
Last updated: April 19, 2026
Application No. 17/099,634

MULTI OBJECT TRACKING USING MEMORY ATTENTION

Final Rejection §103
Filed: Nov 16, 2020
Examiner: WU, NICHOLAS S
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Waymo LLC
OA Round: 6 (Final)

Grant Probability: 47% (Moderate)
OA Rounds: 7-8
To Grant: 3y 9m
With Interview: 90%

Examiner Intelligence

Career allow rate: 47% — grants 47% of resolved cases (18 granted / 38 resolved; -7.6% vs TC avg)
Interview lift: +43.1% (strong) — allow rate among resolved cases with interview
Typical timeline: 3y 9m avg prosecution; 44 currently pending
Career history: 82 total applications across all art units

Statute-Specific Performance

§101: 26.7% (-13.3% vs TC avg)
§103: 52.6% (+12.6% vs TC avg)
§102: 3.1% (-36.9% vs TC avg)
§112: 17.4% (-22.6% vs TC avg)

Tech Center averages are estimates • Based on career data from 38 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments filed 11/18/2025 (“Remarks”) have been fully considered and are not persuasive. Regarding the §103 rejections, applicant's arguments with respect to the prior art have been fully considered but they are not persuasive.

Alleged no teaching of Xu reference in same field of endeavor as claimed invention

In Remarks/Arguments pg. 10, applicant contends:

“First, Applicant respectfully submits that Xu describes techniques in a non-analogous field of art. The present claimed invention pertains to tracking "a plurality of objects that have been detected in an environment" based on "measurements" that characterize the objects in the environment. In contrast, Xu is directed to e-commerce and "generating item recommendations" based on past user interactions, such as "item click-through" or "item purchase." As such, Xu is in a different field of endeavor (recommenders vs. tracking) and Xu is not reasonably pertinent to solving the multiple-object tracking problem of using measurement embeddings across current and earlier time steps to associate detections. A person of ordinary skill in the art would not be reasonably motivated to seek solutions in the field of e-commerce and product recommendation when attempting to solve the technical problem of associating sensor measurement data for multi-object tracking.”

The examiner respectfully disagrees that Xu is in a different field of endeavor than the claimed invention. Xu and the claimed invention are in the same field of endeavor of determining time dependencies between events at different time steps using a self-attention mechanism.
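As context for this dispute, the kind of mechanism both sides describe — attention scores between a current-step embedding and earlier-step embeddings, computed as a function of the relative time difference — can be sketched as follows. This is a hypothetical illustration only; the linear time-decay term and the `tau` constant are invented for the sketch and are not Xu's (or the applicant's) actual formulation.

```python
import math

def time_aware_scores(query, keys, t_now, t_past, tau=5.0):
    """Softmax attention scores between a current-step query embedding
    and earlier-step key embeddings, computed as an explicit function of
    the relative time difference (t_now - t_i)."""
    d_k = len(query)
    logits = []
    for key, t_i in zip(keys, t_past):
        # Content term: scaled dot product, as in standard attention.
        content = sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        # Time term: a larger relative time difference lowers the score.
        rel_dt = t_now - t_i
        logits.append(content - rel_dt / tau)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With identical key contents, the scores depend only on recency: a measurement one step old outranks one eight steps old, which is the behavior the claim language ties to the relative time difference.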
Similar to how the claimed invention uses self-attention for associating measurement embeddings across current and earlier time steps, Xu uses self-attention to associate interactions across multiple time steps (see Xu, ⁋63). Therefore, Xu and the claimed invention are in the same analogous field of art of determining time dependencies between events at different time steps using self-attention. One of ordinary skill in the art would be motivated to seek solutions in the field of self-attention mechanisms when attempting to associate measurements across current and earlier time steps. Therefore, applicant’s arguments are not persuasive.

Alleged no teaching of amended attention mechanism

In Remarks/Arguments pg. 10-11, applicant contends:

“Applicant respectfully submits Xu fails to teach or suggest "an attention mechanism that computes ... an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step." … Furthermore, Xu fails to describe or suggest the attention mechanism as claimed in amended claim 1. For example, the quoted portions of Xu disclose a "self-attention layer configured to weight user interactions given the time at which the interaction occurred," ⁋63, and that "when predicting a next time-dependent event (e_{q+1}, t_{q+1}) ... the attention weights and prediction are a function of the next occurrence time." ⁋55. That is, Xu describes generating attention weights and prediction as a function of "occurrence time", which does not result in computing "an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step," as required by amended claim 1.” (applicant emphasis added)

The relevant claim limitations appear to be:

wherein the self-attention neural network applies an attention mechanism that computes, for a particular new measurement received at the current time step and a particular earlier measurement received at a particular earlier time step, an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step;

in amended claim 1. As noted in the previous Office Action, Xu teaches:

(Xu, ⁋63, “At step 210, the trained prediction model 160a generates time-aware, personalized item recommendations 260. The personalized item recommendations 260 includes a set of ranked items 262a-262e corresponding to user interests based on various user interactions weighted over time. For example, in some embodiments, the trained prediction model 160a includes a neural network structure having at least one self-attention layer configured to weight user interactions given the time at which the interaction occurred.”).

(Xu, ⁋55, “In some embodiments, when predicting a next time-dependent event (e_{q+1}, t_{q+1}) the prediction model generation system 28 may be configured to take account of a time lag between each event in an input sequence and a target event such that t̃_i = t_{q+1} − t_i, i = 1, . . . , q and Φ(t̃_i) is a time representation. The relative time difference between inputs is t̂_i − t̂_j = t_i − t_j for i, j = 1, . . . , q and the attention weights and prediction are a function of the next occurrence time.”).

In other words, Xu teaches wherein the self-attention neural network applies an attention mechanism that computes, for a particular new measurement received at the current time step and a particular earlier measurement received at a particular earlier time step, an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step. Xu shows determining the time lag between each event in an input sequence and a target event, which is interpreted as a relative time difference between the current time step and the particular earlier time step. The attention weights being based on the occurrence time is interpreted as an attention score that is a function of a relative time difference between the current time step and the particular earlier time step because the next occurrence time is associated with the next time-dependent event. Therefore, applicant’s arguments are not persuasive.

Alleged no teaching of claim 21

In Remarks/Arguments pg. 12, applicant contends:

“Furthermore, Applicant respectfully submits that the applied references fail to teach or suggest "wherein the feature representation for the occlusion state is determined during a training process that jointly trains the embedding neural network and the self-attention neural network" as recited by claim 21. The Office cited Rastagar as allegedly teaching a "feature representation for the occlusion state," and cited Fuchs as allegedly teaching "training process that jointly trains the embedding neural network and the self-attention neural network."
However, Applicant respectfully submits that the combination of Rastagar and Fuchs does not reach the above quoted features of claim 21. First, rather than determining a feature representation for the occlusion state "during a training process", Rastgar describes determining an "occlusion state" when an "appearance similarity score is less than the first specified threshold value" and a "template matching score is less than the second specified threshold value." ⁋14. Second, while Fuchs describes an "end-to-end" or "joint" training process for its CNN and its self-attention block, Fuchs does not teach or suggest the concept of a "feature representation for an occlusion state" at all. Thus, even combining Rastgar and Fuchs, the combination would still fail to teach the claimed limitation of claim 21. Fuchs provides no "feature representation for an occlusion state" that could be jointly trained, and Rastgar's "occlusion state" is a post-processing determination based on thresholds, not a feature representation that is determined during the training process itself. The combination therefore lacks the claimed feature of "wherein the feature representation for the occlusion state is determined during a training process that jointly trains the embedding neural network and the self-attention neural network." The other applied references also fail to teach the above-quoted features in claim 21, and thus do not cure the deficiencies of Rastgar and Fuchs.”

The relevant claim limitations appear to be:

wherein the feature representation for the occlusion state is determined during a training process that jointly trains the embedding neural network and the self-attention neural network.

in claim 21. As noted in the previous Office Action, Fuchs and Rastgar teach:

(Rastgar, ⁋14, “In other words, the one or more regions of the object in the image may not be hidden by other objects present in the image providing a clear view of the object to be tracked in the image. The detected current state may correspond to the occlusion state or a second reacquisition state, in an event that a corresponding appearance similarity score is less than the first specified threshold value, and a corresponding template matching score is less than the second specified threshold value.”).

(Fuchs, pg. 3 ⁋5, “HART uses a CNN to convert the glimpse g_t into features f_t, which then update the hidden state h_t of a LSTM core.”; Fuchs, pg. 4 ⁋1-2, “Multi-object support in HART requires the following modifications. Firstly, in order to handle a dynamically changing number of objects, we apply HART to multiple objects in parallel, where all parameters between HART instances are shared…Therefore, we use the multi-head self-attention block (SAB, Lee et al. [11]), which is able to account for higher-order interactions between set elements when computing their representations”; and Fuchs, pg. 4 ⁋3, “MOHART is trained fully end-to-end, contrary to other tracking approaches [1, 2, 3, 4]. It maintains a hidden state, which can contain information about the object’s motion.”).

In other words, the combination of Fuchs and Rastgar teaches wherein the feature representation for the occlusion state is determined during a training process that jointly trains the embedding neural network and the self-attention neural network. As shown in claim 4, Rastgar teaches a feature representation for an occlusion state by leveraging similarity thresholds. Fuchs shows an end-to-end training approach for the HART system, which includes a CNN, which is interpreted as an embedding neural network, and a self-attention block. Training a model end-to-end is interpreted as jointly training the embedding and self-attention neural networks, as it is known in the art that end-to-end learning simultaneously trains the model parameters and MOHART has both CNN and self-attention layers.
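The "end-to-end" dispute turns on one loss updating both modules at once. A toy sketch of that idea — scalar features and a single attention parameter are invented for illustration, and finite differences stand in for the backpropagation a real system such as Fuchs's CNN-plus-self-attention MOHART would use:

```python
import math

def forward(params, xs):
    """Tiny 'embedding network' (per-measurement affine map) followed by
    a tiny 'self-attention' pooling with one learned scale."""
    embedded = [params["w_embed"] * x + params["b_embed"] for x in xs]
    out = []
    for e_i in embedded:
        logits = [params["w_attn"] * e_i * e_j for e_j in embedded]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append(sum(w / z * e_j for w, e_j in zip(weights, embedded)))
    return out

def train_jointly(data, steps=150, lr=0.02, eps=1e-5):
    """One loss, one update: gradients flow into the embedding and the
    attention parameters simultaneously -- the sense of 'jointly trains'
    at issue for claim 21."""
    params = {"w_embed": 0.5, "b_embed": 0.0, "w_attn": 0.1}

    def loss(p):
        return sum((o - y) ** 2
                   for xs, ys in data
                   for o, y in zip(forward(p, xs), ys)) / len(data)

    for _ in range(steps):
        grads = {}
        for name in params:
            bumped = dict(params)
            bumped[name] += eps
            grads[name] = (loss(bumped) - loss(params)) / eps
        for name in params:
            params[name] -= lr * grads[name]
    return params, loss(params)
```

The design point is that no module is frozen or trained in a separate stage: every parameter, embedding and attention alike, moves on every step under the same objective.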
Therefore, Fuchs teaches during a training process that jointly trains the embedding neural network and the self-attention neural network. Applicant also argues that the combination of Rastgar and Fuchs is improper as Fuchs does not have a feature representation of an occlusion state during training. Fuchs shows that occlusion states are considered in the MOHART system and thus is interpreted as trained jointly with occlusion states. See further support:

(Fuchs, pg. 1 ⁋3, “In this work, we present multi-object hierarchical attentive recurrent tracking (MOHART), a class-agnostic tracker with complex relational reasoning capabilities provided by a multi-headed self-attention module (Vaswani et al. [10], Lee et al. [11]). MOHART infers the latent state of every tracked object in parallel, and uses self-attention to inform per-object states about other tracked objects. This helps to avoid performance loss under self-occlusions of tracked objects or strong ego-motion.”).

However, Fuchs does not explicitly state a feature representation for occlusion states, and the rejection therefore relied upon the teachings of Rastgar for that element. Therefore, it would be obvious to one of ordinary skill in the art to combine Rastgar’s teachings of a feature representation for an occlusion state with Fuchs’ teachings of a jointly trained model which considers occlusions for better occlusion detection. Therefore, applicant’s arguments are not persuasive.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claims 1-2, 7-8, 10-13, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chu, et al., Non-Patent Literature “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism” (“Chu”) in view of Fuchs, et al., Non-Patent Literature “End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning” (“Fuchs”) and further in view of Xu, US Pre-Grant Publication 2021/0133846 A1 (“Xu”).
Regarding claim 1 and analogous claims 12 and 20, Chu teaches:

the method comprising: receiving, at a current time step, a plurality of new measurements characterizing a plurality of objects that have been detected in an environment at the current time step, each respective new measurement characterizing a respective one of the plurality of objects that have been detected;

(Chu, Section 2, “This problem has more influence for online tracking methods, since they are more sensitive to noisy detections. Our work focuses on applying online single object tracking methods to MOT. The target is tracked by searching for the best matched location using online learned appearance model [receiving, at a current time step, a plurality of new measurements… each respective new measurement characterizing a respective one of the plurality of objects that have been detected;]. This helps to alleviate the limitations from imperfect detections, especially for missing detections. It is complementary to data association methods, since the tracking results of single object trackers at current frame can be considered as association candidates for data association.”; current frame is interpreted as a current time step, and single object trackers is interpreted as a plurality of objects as there is a tracker for each object (i.e. characterizing a plurality of objects that have been detected in an environment at the current time step)).

for each of the plurality of new measurements, generating a respective embedded representation of the respective new measurement by processing the respective new measurement using an embedding neural network

(See Chu, Figure 3A below):

[Chu, Figure 3A as reproduced in the Office Action]

The input frame is interpreted as one or more new measurements (i.e. for each of the plurality of new measurements). The CNN used for creating the feature map is interpreted as the embedding neural network generating the embedded representations of the measurements (i.e. generating a respective embedded representation of the respective new measurement by processing the respective new measurement using an embedding neural network).

maintaining data that identifies one or more object tracks, wherein each object track is associated with respective measurements received at one or more of the earlier time steps that have been classified as characterizing the same object, and wherein the data identifying the one or more object tracks includes a respective feature representation for each of the one or more object tracks;

(Chu, Figure 3a, “The framework of the proposed CNN model. It contains shared CNN layers and multiple target-specific CNN branches. The shared layers are shared by all targets to be tracked. Each target has its own corresponding target-specific CNN branch [maintaining data that identifies one or more object tracks, wherein each object track is associated with respective measurements] which is learned online. The target-specific CNN branch acts as a single object tracker and can be added to or removed from the whole model according to the entrance of new target or exit of existing target.”; being able to tell whether a target is new or existing is interpreted as knowing measurements from earlier time steps to make that determination (i.e. received at one or more of the earlier time steps that have been classified as characterizing the same object, and wherein the data identifying the one or more object tracks includes a respective feature representation for each of the one or more object tracks)).

and determining, for each of the one or more object tracks, whether to associate any of the plurality of new measurements with the object track based on the attended feature representations of the new measurements and the respective feature representation for the object track

(Chu, Section 3.5, “In our work, a new target T_new is initialized when a newly detected object with high detection score is not covered by any tracked targets” [and determining, for each of the one or more object tracks, whether to associate any of the plurality of new measurements with the object track]).

(Chu, Section 3.4.2, “Since α_t^i indicates the occlusion status of target T_i. If α_t^i is large, it means that target T_i is undergoing severe occlusion at current frame t. Consequently, the weight for positive samples at current frame is small according to Eq. 9. There, the temporal attention mechanism provides a good balance between current and historical visual cues of the target. Besides, if α_t^i is smaller than a threshold α_0, the corresponding target state x_t^i will be added to the historical samples set of target T_i” [based on the attended feature representations of the new measurements and the respective feature representation for the object track]).
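The association step recited in this limitation — comparing each new measurement's attended feature to a track's feature representation — can be sketched as follows. This is a hypothetical illustration: dot-product similarity and an explicit occlusion feature vector are invented for the sketch, whereas Chu, as quoted, instead thresholds an occlusion status α against α_0.

```python
def associate(track_feat, measurement_feats, occlusion_feat):
    """For one object track, score each new measurement's (attended)
    feature against the track's feature representation and pick the
    best match -- or return None ('no measurement associated, track
    occluded') when the occlusion feature scores highest."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = list(measurement_feats) + [occlusion_feat]
    scores = [dot(track_feat, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    # The last candidate is the occlusion state: no measurement wins.
    return None if best == len(measurement_feats) else best
```

Treating "no association" as just another scored candidate is one way to make the occluded outcome competitive with the real measurements rather than a separate post-processing rule.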
While Chu teaches an online multi-object detection tracking system using a CNN and attention, Chu does not explicitly teach:

A method performed by one or more computers

generating, using a self-attention neural network, a respective attended feature representation for each respective new measurement in the plurality of new measurements, the generating comprising applying self-attention over an attention input that comprises (i) for each of the plurality of objects that have been detected, the embedded representation of the new measurement and (ii) for each of a plurality of earlier time steps, respective embedded representations of measurements received at the respective earlier time step characterizing the plurality of objects to generate an attention output that updates the respective embedded representation of each of the plurality of new measurements;

wherein the self-attention neural network applies an attention mechanism that computes, for a particular new measurement received at the current time step and a particular earlier measurement received at a particular earlier time step, an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step;

Fuchs teaches:

A method performed by one or more computers (Fuchs, pg. 9 Acknowledgements, “We acknowledge use of Hartree Centre resources in this work. The STFC Hartree Centre is a research collaboratory in association with IBM providing High Performance Computing platforms [A method performed by one or more computers] funded by the UK’s investment in e-Infrastructure.”).

generating, using a self-attention neural network, a respective attended feature representation for each respective new measurement in the plurality of new measurements, (Fuchs, pg. 1 ⁋3, “In this work, we present multi-object hierarchical attentive recurrent tracking (MOHART), a class-agnostic tracker with complex relational reasoning capabilities provided by a multi-headed self-attention module [generating, using a self-attention neural network,] (Vaswani et al. [10], Lee et al. [11]). MOHART infers the latent state of every tracked object in parallel, and uses self-attention to inform per-object states about other tracked objects [a respective attended feature representation for each respective new measurement in the plurality of new measurements,]. This helps to avoid performance loss under self-occlusions of tracked objects or strong ego-motion.”).

the generating comprising applying self-attention over an attention input that comprises (i) for each of the plurality of objects that have been detected, the embedded representation of the new measurement (Fuchs, pg. 4 ⁋2, “Since different objects can interact with each other, it is necessary to use a method that can inform each object about the effects of their interactions with other objects [(i) for each of the plurality of objects that have been detected,]. Moreover, since features extracted from different objects comprise a set, this method should be permutation-equivariant, i. e., the results should not depend on the order in which object features are processed. Therefore, we use the multi-head self-attention block [the generating comprising applying self-attention over an attention input that comprises] (SAB, Lee et al. [11]), which is able to account for higher-order interactions between set elements when computing their representations [the embedded representation of the new measurement], thereby allowing rich information exchange, and it can do so in a permutation-equivariant manner.”).
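The permutation-equivariance property Fuchs invokes — reordering the set of object features reorders the outputs identically — can be checked on a minimal sketch. This is a hypothetical illustration with queries, keys, and values all equal to the features and no learned projections; a real SAB adds multi-head projections but keeps the same property.

```python
import math

def self_attention(feats):
    """Single-head dot-product self-attention over a set of per-object
    feature vectors (queries = keys = values = the features)."""
    d = len(feats[0])
    out = []
    for q in feats:
        # Scaled dot-product logits of this query against every key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in feats]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        # Weighted sum of the value vectors, per output dimension.
        out.append([sum(w / z * v[j] for w, v in zip(weights, feats))
                    for j in range(d)])
    return out
```

Because each output depends on its own query and an order-independent sum over all keys and values, permuting the input set permutes the output set the same way.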
and (ii) for each of a plurality of earlier time steps, respective embedded representations of measurements received at the respective earlier time step characterizing the plurality of objects (Fuchs, pg. 3 ⁋5, “HART uses a CNN to convert the glimpse g_t into features f_t, which then update the hidden state h_t of a LSTM core. The hidden state is used to estimate the current bounding-box b_t, spatial attention parameters for the next time-step a_{t+1}, as well as object appearance. Importantly, the recurrent core can learn to predict complicated motion conditioned on the past history of the tracked object [for each of a plurality of earlier time steps,], which leads to relatively small attention glimpses [and (ii)…respective embedded representations of measurements received at the respective earlier time step characterizing the plurality of objects]”).

to generate an attention output that updates the respective embedded representation of each of the plurality of new measurements, (Fuchs, pg. 4 ⁋2, “Finally, outputs of different attention heads [to generate an attention output] are concatenated in Equation (3). SAB produces M output vectors, one for each input, which are then concatenated with corresponding inputs and fed into separate LSTMs for further processing, as in HART—see Figure 1”; feeding the attention outputs into multiple LSTMs for further processing is interpreted as updating the representations of the new measurements, as the LSTMs are used to update the individual object trackers in HART/MOHART (i.e. that updates the respective embedded representation of each of the plurality of new measurements)).

Chu and Fuchs are both in the same field of endeavor (i.e. multi-object detection). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Chu and Fuchs to teach the above limitation(s). The motivation for doing so is that using self-attention in multi-object tracking improves model accuracy (cf. Fuchs, pg. 1 ⁋3, “uses self-attention to inform per-object states about other tracked objects. This helps to avoid performance loss under self-occlusions of tracked objects or strong ego-motion.”).

While Chu in view of Fuchs teaches an online multi-object detection tracking system using a CNN and self-attention with LSTMs, the combination does not explicitly teach:

wherein the self-attention neural network applies an attention mechanism that computes, for a particular new measurement received at the current time step and a particular earlier measurement received at a particular earlier time step, an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step;

Xu teaches wherein the self-attention neural network applies an attention mechanism that computes, for a particular new measurement received at the current time step and a particular earlier measurement received at a particular earlier time step, an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step;

(Xu, ⁋63, “At step 210, the trained prediction model 160a generates time-aware, personalized item recommendations 260. The personalized item recommendations 260 includes a set of ranked items 262a-262e corresponding to user interests based on various user interactions weighted over time. For example, in some embodiments, the trained prediction model 160a includes a neural network structure having at least one self-attention layer configured to weight user interactions given the time at which the interaction occurred [wherein the self-attention neural network applies an attention mechanism that].”; Xu, ⁋55, “In some embodiments, when predicting a next time-dependent event (e_{q+1}, t_{q+1}) the prediction model generation system 28 may be configured to take account of a time lag between each event in an input sequence [and a particular earlier measurement received at a particular earlier time step,] and a target event [that computes, for a particular new measurement received at the current time step] such that t̃_i = t_{q+1} − t_i, i = 1, . . . , q and Φ(t̃_i) is a time representation. The relative time difference between inputs is t̂_i − t̂_j = t_i − t_j for i, j = 1, . . . , q and the attention weights and prediction are a function of the next occurrence time [an attention score between the embedded representation of the particular new measurement and the embedded representation of the particular earlier measurement as a function of a relative time difference between the current time step and the particular earlier time step;].”).

Chu in view of Fuchs and Xu are in the same field of endeavor (i.e. self-attention). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Chu, in view of Fuchs, with Xu to teach the above limitation(s). The motivation for doing so is that considering multiple differences between each prior time step improves long-term and short-term dependencies (cf. Xu, ⁋63, “the trained prediction model 160a includes a neural network structure having at least one self-attention layer configured to weight user interactions given the time at which the interaction occurred.
The trained prediction model 160a may be configured to account for long-term user interest (including long-term interests and/or preferences) and short-term interest.”).

Regarding claim 2 and analogous claim 13, Chu in view of Fuchs and Xu teaches the method of claim 1. Chu further teaches wherein the respective feature representation for each of the one or more object tracks is an attended feature representation generated for the measurement that was most recently associated with the object track (Chu, Section 3.4.2, “Since α_t^i indicates the occlusion status of target T_i. If α_t^i is large, it means that target T_i is undergoing severe occlusion at current frame t. Consequently, the weight for positive samples at current frame is small according to Eq. 9. There, the temporal attention mechanism provides a good balance between current and historical visual cues of the target. Besides, if α_t^i is smaller than a threshold α_0, the corresponding target state x_t^i will be added to the historical samples set of target T_i”; historical sample is interpreted as the most recently associated measurement of the object (i.e. wherein the respective feature representation for each of the one or more object tracks is an attended feature representation generated for the measurement that was most recently associated with the object track)).

Regarding claim 7 and analogous claim 18, Chu in view of Fuchs and Xu teaches the method of claim 1. Chu further teaches wherein the embedding neural network is a feedforward neural network (Chu, Figure 3A, “The framework of the proposed CNN model. It contains shared CNN layers and multiple target-specific CNN branches.”; a person of ordinary skill in the art would know that a CNN, or convolutional neural network, is a feedforward network (i.e. wherein the embedding neural network is a feedforward neural network)).

Regarding claim 8, Chu in view of Fuchs and Xu teaches the method of claim 1. Fuchs further teaches wherein the plurality of earlier time steps are each time step that is less than a fixed number of time steps earlier than the current time step (Fuchs, pg. 3 ⁋5, “HART uses a CNN to convert the glimpse g_t into features f_t, which then update the hidden state h_t of a LSTM core. The hidden state is used to estimate the current bounding-box b_t, spatial attention parameters for the next time-step a_{t+1}, as well as object appearance. Importantly, the recurrent core can learn to predict complicated motion conditioned on the past history of the tracked object, which leads to relatively small attention glimpses”; a glimpse is interpreted as a fixed number of time steps earlier (i.e. wherein the plurality of earlier time steps are each time step that is less than a fixed number of time steps earlier than the current time step)). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Fuchs with the teachings of Chu and Xu for the same reasons disclosed in claim 1.

Regarding claim 10, Chu in view of Fuchs and Xu teaches the method of claim 1. Chu further teaches further comprising: in response to determining that a particular new measurement is not to be associated with any of the object tracks, generating a new object track that identifies only the new measurement (Chu, Section 3.5, “In our work, a new target T_new is initialized when a newly detected object with high detection score is not covered by any tracked targets.”; the new target being initialized is interpreted as having the new measurement because the new target would be on its own target branch and independent from other target branches, or object tracks (i.e. further comprising: in response to determining that a particular new measurement is not to be associated with any of the object tracks, generating a new object track that identifies only the new measurement)).
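The track-initialization rule just quoted from Chu Section 3.5 — together with the termination rule from the same section that the Office Action reaches for claim 11 — can be sketched as simple per-step bookkeeping. This is a hypothetical illustration: the dict of track id to consecutive untracked steps, and the names t_term and score_thresh, are invented for the sketch.

```python
def update_track_lifecycle(tracks, associated_ids, new_detection_scores,
                           t_term=3, score_thresh=0.5):
    """Per-step track bookkeeping in the style the Office Action
    attributes to Chu, Section 3.5: start a new track for each confident
    detection not covered by an existing track, and terminate tracks
    'untracked' for over t_term consecutive steps."""
    # Age every track; reset the miss count of tracks matched this step.
    for tid in list(tracks):
        tracks[tid] = 0 if tid in associated_ids else tracks[tid] + 1
    # Terminate tracks untracked for more than t_term consecutive steps.
    for tid in [t for t, misses in tracks.items() if misses > t_term]:
        del tracks[tid]
    # Initialize a new track for each confident, unassociated detection.
    next_id = max(tracks, default=-1) + 1
    for score in new_detection_scores:
        if score >= score_thresh:
            tracks[next_id] = 0
            next_id += 1
    return tracks
```

For example, a track already missed for 3 steps that goes unmatched again is removed, while a detection scoring 0.9 that no track claims spawns a fresh track with a zeroed miss count.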
Regarding claim 11, Chu in view of Fuchs and Xu teaches the method of claim 1. Chu further teaches further comprising: determining that one of the object tracks has not been associated with a new measurement for more than a threshold number of consecutive time steps, and in response, removing the data identifying the object track that has not been associated with a new measurement for more than a threshold number of consecutive time steps (Chu, Section 3.5, “In our work, a new target Tnew is initialized when a newly detected object with high detection score is not covered by any tracked targets. To alleviate the influence of false positive detections, the newly initialized target Tnew will be discarded if it is considered as “untracked” (Sec. 3.3.3) or not detected in any of the first Tinit frames. For target termination, we simply terminate the target if it is “untracked” for over Tterm successive frames [determining that one of the object tracks has not been associated with a new measurement for more than a threshold number of consecutive time steps, and in response, removing the data identifying the object track that has not been associated with a new measurement for more than a threshold number of consecutive time steps].”).

Claims 4, 15, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Chu, et al., Non-Patent Literature “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism” (“Chu”) in view of Fuchs, et al., Non-Patent Literature “End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning” (“Fuchs”) and further in view of Xu, US Pre-Grant Publication 2021/0133846A1 (“Xu”) and Rastgar, et al., US Pre-Grant Publication 2018/0068448 (“Rastgar”).
Regarding claim 4 and analogous claim 15, while Chu in view of Fuchs and Xu teaches the method of claim 1, the combination does not explicitly teach wherein determining, for each of the one or more object tracks, whether to associate any of the new measurements with the object track based on the attended feature representations of the new measurements and the respective feature representation for the object track comprises: for each new measurement, determining a respective similarity score between the respective feature representation for the object track and the attended feature representation for the new measurement; determining a similarity score between the respective feature representation for the object track and a feature representation for an occlusion state that represents none of the new measurements being associated with the object track; and determining whether to associate any of the new measurements with the object track based on the similarity scores for the new measurements and the similarity score for the occlusion state.

Rastgar teaches: wherein determining, for each of the one or more object tracks, whether to associate any of the new measurements with the object track based on the attended feature representations of the new measurements and the respective feature representation for the object track comprises: for each new measurement, determining a respective similarity score between the respective feature representation for the object track and the attended feature representation for the new measurement (Rastgar, ⁋0013, “[0013] In accordance with an embodiment, the plurality of first parameters may include an appearance similarity score and a template matching score.
The template matching score is a value that may be derived based on a similarity score between one or more object features of the object in the current image frame and one or more object features of the object in the previous image frame [for each new measurement, determining a respective similarity score between the respective feature representation for the object track and the attended feature representation for the new measurement] and a specified constant factor. The appearance similarity score is a value that may be derived based on a difference between a first color distribution of the object in the current image frame and a second color distribution of one or more object features of the object in the previous image frame and a specified coefficient value.”).

determining a similarity score between the respective feature representation for the object track and a feature representation for an occlusion state that represents none of the new measurements being associated with the object track (Rastgar, ⁋0014, “[0014] In other words, the one or more regions of the object in the image may not be hidden by other objects present in the image providing a clear view of the object to be tracked in the image. The detected current state may correspond to the occlusion state or a second reacquisition state, in an event that a corresponding appearance similarity score is less than the first specified threshold value [determining a similarity score between the respective feature representation for the object track and a feature representation for an occlusion state that represents none of the new measurements being associated with the object track], and a corresponding template matching score is less than the second specified threshold value.”).
and determining whether to associate any of the new measurements with the object track based on the similarity scores for the new measurements and the similarity score for the occlusion state (Rastgar, ⁋0032, “[0032] In accordance with an embodiment, the determined state transition may be stored in a local memory unit (not shown) of the electronic device 102. The electronic device 102 may additionally be configured to utilize the stored state transition for detection of the current state of the object in the next subsequent image frames from the sequence of image frames of the video content.”; the storing of the state transition is interpreted as associating the new measurements with the object track because the state transition of the object is based on the similarity score, and occlusion is a state thus determined by a similarity score (i.e. and determining whether to associate any of the new measurements with the object track based on the similarity scores for the new measurements and the similarity score for the occlusion state)).

Chu, in view of Fuchs and Xu, and Rastgar are related to the same field of endeavor (i.e. object occlusion tracking). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to apply Rastgar’s teachings of using a similarity score as a tool to determine the occlusion state with the combination’s teachings of tracking objects based on self-attention in order to improve the system’s robustness during occlusion (cf. Rastgar, ⁋0097, “[0097] the disclosed video processing system and method detects the current state of the four discrete states of the object in the current image frame (by use of the template matching score and the appearance similarity score) and adjusts the tracking parameters accordingly. Consequently, the disclosed video processing system and method handles the four discrete states of objects separately in order to exhibit robustness”).
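The claim 4 limitation reads like a standard scored data-association step: compare the track's feature representation against each new measurement's attended feature and against a learned "occlusion" feature that stands for "associate nothing". A minimal Python sketch, assuming cosine similarity as the score (the claim does not specify a particular similarity function) and hypothetical names throughout:

```python
import numpy as np

def associate(track_feat, measurement_feats, occlusion_feat):
    """Illustrative sketch of the claim 4/5 association step.

    Scores each measurement's attended feature, plus an occlusion-state
    feature representing 'no measurement', against the track's feature;
    the highest score decides. Cosine similarity is an assumption, not
    a limitation taken from the application.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(measurement_feats) + [occlusion_feat]
    scores = [cos(track_feat, c) for c in candidates]
    best = int(np.argmax(scores))
    if best == len(measurement_feats):
        return None   # occlusion state won: associate no measurement
    return best       # index of the measurement to associate
```

Returning `None` when the occlusion feature scores highest mirrors the claim 5 branch, in which the occlusion state being most similar means no new measurement is associated with the track.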
Regarding claim 21, Chu in view of Fuchs, Xu, and Rastgar teaches the method of claim 4. Rastgar teaches the feature representation for the occlusion state as seen in claim 4. Fuchs further teaches during a training process that jointly trains the embedding neural network and the self-attention neural network (Fuchs, pg. 3 ⁋5, “HART uses a CNN to convert the glimpse gt into features ft [the embedding neural network], which then update the hidden state ht of a LSTM core.”, and Fuchs, pg. 4 ⁋1-2, “Multi-object support in HART requires the following modifications. Firstly, in order to handle a dynamically changing number of objects, we apply HART to multiple objects in parallel, where all parameters between HART instances are shared…Therefore, we use the multi-head self-attention block [and the self-attention neural network] (SAB, Lee et al. [11]), which is able to account for higher-order interactions between set elements when computing their representations”, and Fuchs, pg. 4 ⁋3, “MOHART is trained fully end-to-end, contrary to other tracking approaches [1, 2, 3, 4]. It maintains a hidden state, which can contain information about the object’s motion”; training end-to-end is interpreted as jointly training the embedding and self-attention neural networks, as it is known in the art that end-to-end learning simultaneously trains all model parameters and MOHART has both CNN and self-attention layers (i.e. during a training process that jointly trains the embedding neural network and the self-attention neural network)).

Claims 5 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Chu, et al., Non-Patent Literature “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism” (“Chu”) in view of Fuchs, et al., Non-Patent Literature “End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning” (“Fuchs”) and further in view of Xu, US Pre-Grant Publication 2021/0133846A1 (“Xu”), Rastgar, et al., US Pre-Grant Publication 2018/0068448 (“Rastgar”), and Yang, et al., Non-Patent Literature “Real-time Multiple Objects Tracking with Occlusion Handling in Dynamic Scenes” (“Yang”).

Regarding claim 5 and analogous claim 16, while Chu in view of Fuchs, Xu, and Rastgar teaches the method of claim 4, the combination does not explicitly teach wherein determining whether to associate any of the new measurements with the object track based on the similarity scores for the new measurements and the similarity score for the occlusion state comprises: when the occlusion state is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, determining not to associate any of the new measurements with the object track; and when a particular new measurement is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, associating the particular new measurement with the object track.
Yang teaches: wherein determining whether to associate any of the new measurements with the object track based on the similarity scores for the new measurements and the similarity score for the occlusion state comprises: when the occlusion state is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, determining not to associate any of the new measurements with the object track (Yang, Section 4, “For those non-matched tracks, a merging detection algorithm is used to decide whether the track is merged by another measure or is missed. If a merging happens, a new group is generated. If the track is missed, the confidence of the track will be decreased, once it drops below a specific threshold, the track will be deleted. For those non-matched measures, a splitting detection module is developed to decide whether the measure is split from an active track or it is a new target.”; the new group is interpreted as a track distinct from the object track being compared, so that when merging occurs a newer track is made and the new measurements are not associated with the object track (i.e. when the occlusion state is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, determining not to associate any of the new measurements with the object track));

and when a particular new measurement is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, associating the particular new measurement with the object track (Yang, Section 3, “This process will keep on looping until none of the elements value of matrix CkE equals to two.
Finally, the foreground measures and existing tracks are classified into three parts: non-matched track, non-matched measure, matched track and measure. The above association method assigns one measure to one track and cannot handle merging and splitting event, in which one measure may assign to multiple tracks and one track may assign to multiple measures.”; merging is interpreted as the occlusion state because merging occurs when objects overlap, and absent merging the track is not occluded (i.e. and when a particular new measurement is most similar to the feature representation for the object track from among the occlusion state and the new measurements according to the similarity scores, associating the particular new measurement with the object track)).

Chu, in view of Fuchs, Xu, and Rastgar, and Yang are related to the same field of endeavor (i.e. object occlusion tracking). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to apply Yang’s teachings of using a merging detection algorithm as a tool to determine whether to add a currently occluded object track to a prior track with the combination’s teachings of tracking objects based on similarity scores and self-attention in order to improve the system’s efficiency during occlusion (cf. Yang, Section 6, “the system can deal with difficult situations such as ghosts and background changes. Moreover, it can track multiple objects with long-duration and complete occlusion”).

Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chu, et al., Non-Patent Literature “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism” (“Chu”) in view of Fuchs, et al., Non-Patent Literature “End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning” (“Fuchs”) and further in view of Xu, US Pre-Grant Publication 2021/0133846A1 (“Xu”) and Zeng, et al., US Pre-Grant Publication 2012/0140061 (“Zeng”).

Regarding claim 6 and analogous claim 17, while Chu in view of Fuchs and Xu teaches the method of claim 1, the combination does not explicitly teach wherein each new measurement characterizes a position and an appearance of the respective object that has been detected in the environment at the current time step.

Zeng teaches wherein each new measurement characterizes a position and an appearance of the respective object that has been detected in the environment at the current time step (Zeng, ⁋0019, “[0019] For more robust vision tracking, the present disclosure includes a fusion method integrating motion information sensed by the camera with comparatively linear range data from the range sensor, and an incremental learning method updating attribute, such as target object appearance, of tracks in a database”; motion information is interpreted as the appearance of the object and data from the range sensor is interpreted as the position of the object (i.e. wherein each new measurement characterizes a position and an appearance of the respective object that has been detected in the environment at the current time step)).

Chu, in view of Fuchs and Xu, and Zeng are related to the same field of endeavor (i.e. object tracking).
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to apply Zeng’s teachings of collecting position and appearance data of an object with the combination’s teachings of tracking objects using self-attention in order to improve the system’s accuracy by providing more detailed inputs (cf. Zeng, ⁋0022, “[0022] the general concepts of the present disclosure can be used to fuse output from various types of sensors for achieving improved representations of the environment surrounding the vehicle”).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Chu, et al., Non-Patent Literature “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism” (“Chu”) in view of Fuchs, et al., Non-Patent Literature “End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning” (“Fuchs”) and further in view of Xu, US Pre-Grant Publication 2021/0133846A1 (“Xu”) and Shazeer, et al., US Pre-Grant Publication 2018/0341860 (“Shazeer”).

Regarding claim 9, while Chu in view of Fuchs and Xu teaches the method of claim 1, the combination does not explicitly teach wherein the self-attention neural network comprises a plurality of self-attention layers that are stacked one after the other. Shazeer teaches wherein the self-attention neural network comprises a plurality of self-attention layers that are stacked one after the other (Shazeer, claim 9, “wherein each encoder self-attention sub-layer comprises a plurality of encoder self-attention layers [wherein the self-attention neural network comprises a plurality of self-attention layers that are stacked one after the other]”). Chu, in view of Fuchs and Xu, and Shazeer are in the same field of endeavor (i.e. attention mechanisms).
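Claim 9's arrangement, multiple self-attention layers stacked one after the other with each layer consuming the previous layer's output, is the standard Transformer-encoder construction Shazeer describes. A minimal single-head NumPy sketch, for illustration only (residual connections, layer normalization, and the feed-forward sublayers of a full Transformer block are omitted):

```python
import numpy as np

def self_attention_layer(X, Wq, Wk, Wv):
    """Single-head self-attention over a set of N feature vectors (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attended representations

def stacked_self_attention(X, layers):
    """Claim 9's structure: self-attention layers applied one after the other,
    each layer consuming the previous layer's output."""
    for Wq, Wk, Wv in layers:
        X = self_attention_layer(X, Wq, Wk, Wv)
    return X
```

The weight matrices and the single-head simplification are assumptions; the point is only the stacking of attention layers, which Shazeer's claim 9 recites.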
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Chu, in view of Fuchs and Xu, and Shazeer to teach the above limitation(s). The motivation for doing so is to provide long-range relationships between inputs (cf. Shazeer, ⁋0008, “[0008] The use of attention mechanisms allows the sequence transduction neural network to effectively learn dependencies between distant positions during training”).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Kim, et al., “Residual LSTM Attention Network for Object Tracking” discloses an object tracking attention neural network that uses an LSTM to incorporate temporal correlations between object tracks.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS S WU, whose telephone number is (571) 270-0939. The examiner can normally be reached Monday - Friday, 8:00 am - 4:00 pm EST.
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at 571-431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/N.S.W./
Examiner, Art Unit 2148

/MICHELLE T BECHTOLD/
Supervisory Patent Examiner, Art Unit 2148

Prosecution Timeline

Nov 16, 2020
Application Filed
Sep 20, 2023
Non-Final Rejection — §103
Mar 04, 2024
Response Filed
May 21, 2024
Final Rejection — §103
Aug 15, 2024
Examiner Interview Summary
Aug 15, 2024
Applicant Interview (Telephonic)
Aug 27, 2024
Request for Continued Examination
Aug 31, 2024
Response after Non-Final Action
Sep 25, 2024
Non-Final Rejection — §103
Jan 27, 2025
Interview Requested
Jan 27, 2025
Response Filed
Feb 05, 2025
Examiner Interview Summary
Feb 05, 2025
Applicant Interview (Telephonic)
May 06, 2025
Final Rejection — §103
Jul 14, 2025
Response after Non-Final Action
Aug 12, 2025
Non-Final Rejection — §103
Nov 18, 2025
Response Filed
Feb 23, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12488244
APPARATUS AND METHOD FOR DATA GENERATION FOR USER ENGAGEMENT
2y 5m to grant Granted Dec 02, 2025
Patent 12423576
METHOD AND APPARATUS FOR UPDATING PARAMETER OF MULTI-TASK MODEL, AND STORAGE MEDIUM
2y 5m to grant Granted Sep 23, 2025
Patent 12361280
METHOD AND DEVICE FOR TRAINING A MACHINE LEARNING ROUTINE FOR CONTROLLING A TECHNICAL SYSTEM
2y 5m to grant Granted Jul 15, 2025
Patent 12354017
ALIGNING KNOWLEDGE GRAPHS USING SUBGRAPH TYPING
2y 5m to grant Granted Jul 08, 2025
Patent 12333425
HYBRID GRAPH NEURAL NETWORK
2y 5m to grant Granted Jun 17, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

7-8
Expected OA Rounds
47%
Grant Probability
90%
With Interview (+43.1%)
3y 9m
Median Time to Grant
High
PTA Risk
Based on 38 resolved cases by this examiner. Grant probability derived from career allow rate.
