Prosecution Insights
Last updated: April 19, 2026
Application No. 18/440,476

SCALABLE CROSS-MODAL MULTI-CAMERA OBJECT TRACKING USING TRANSFORMERS AND CROSS-VIEW MEMORY FUSION

Status: Non-Final Office Action (§103)
Filed: Feb 13, 2024
Examiner: NAH, JONGBONG
Art Unit: 2674
Tech Center: 2600 (Communications)
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 75% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 90%

Examiner Intelligence

Career allow rate: 75% (78 granted / 104 resolved), above average (+13.0% vs TC avg)
Interview lift: strong, +15.2% across resolved cases with interview
Typical timeline: 3y 0m average prosecution; 24 applications currently pending
Career history: 128 total applications across all art units

Statute-Specific Performance

§101: 10.1% (-29.9% vs TC avg)
§103: 58.8% (+18.8% vs TC avg)
§102: 24.7% (-15.3% vs TC avg)
§112:  2.8% (-37.2% vs TC avg)

Tech Center averages are estimates; figures are based on career data from 104 resolved cases.
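The deltas above follow from simple subtraction against the Tech Center baseline implied by the figures (a flat 40.0% for every statute, recoverable from any row, e.g. 58.8 - 18.8 = 40.0). A minimal sketch reproducing the displayed numbers; the baseline value and variable names are assumptions, not from any real API:

```python
# Reproduce the statute-specific deltas shown above.
# TC_AVG_ESTIMATE is the flat baseline implied by the displayed deltas.
TC_AVG_ESTIMATE = 40.0

examiner_shares = {"§101": 10.1, "§103": 58.8, "§102": 24.7, "§112": 2.8}

for statute, share in examiner_shares.items():
    delta = share - TC_AVG_ESTIMATE
    print(f"{statute}: {share:.1f}% ({delta:+.1f}% vs TC avg)")
```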

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 05/14/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Office Action Summary

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ning et al (US 2022/0020158 A1) in view of Hung et al (US 2021/0150349 A1), further in view of Kitaev et al (US 10,909,461 B1).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ning et al (US 2022/0020158 A1) in view of Hung et al (US 2021/0150349 A1), further in view of Kitaev et al (US 10,909,461 B1).
Regarding claim(s) 1 and 13, Ning teaches an apparatus for object tracking comprising: a memory (Figure 3: Memory 352; and Paragraph [0068]); and one or more processors implemented in circuitry and in communication with the memory, the one or more processors (Figure 3: Processor 351; and Paragraph [0067]) configured to:

detect a dynamic object in a scene captured in a plurality of camera output images by a plurality of cameras over time (Paragraph [0072]: “The local camera application 354 includes a calibration module 356, a 3D object detection module 358, a 3D object tracking module 362, a bird-view transformer 367, and optionally a user interface 368”; Paragraph [0011]: “detect an object from the video frames, where the detected object in each of the video frames is represented by a detection vector”; and Paragraph [0012]: “track the object in the video frames based on the detection vectors of the object from the video frames to obtain a trajectory of the object”);

sample key points of the dynamic object (Paragraph [0011]: “[…] the detection vector comprises first dimensions representing two dimensional (2D) parameters of the object and second dimensions representing three dimensional (3D) parameters of the object”; and Paragraph [0016]: “the 2D parameters of the object comprises location and size of a 2D bounding box in the corresponding one of the video frames that encloses the object, and the 3D parameters of the object comprises vertices, center point, and orientation of a 3D box in the corresponding one of the video frames that encloses the object”); and

predict an updated 3D box for the dynamic object from the updated combined key point features, the updated 3D box including a tracklet (read as “trajectory”) representing motion of the dynamic object in the scene over time (Paragraphs [0011], [0012], and [0016], quoted above).

Ning fails to teach to: extract short term features of the key points; combine long-term key point features read from a key point features database stored in a memory and the short term key point features into combined key point features; apply attention processing to hash the combined key point features into hash buckets representing interactions of the key points; and apply a plurality of transformer layers on top of attention processing to update the combined key point features using the interactions of the key points to form the long-term key point features and store the long-term key point features in the key point features database.
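As a concrete reference point for the mapping above, here is a minimal sketch of the per-frame detection vector Ning describes in the quoted Paragraphs [0011] and [0016] (2D bounding-box location and size plus 3D vertices, center point, and orientation). The field layout and names are illustrative assumptions, not Ning's disclosed data structure:

```python
# Toy sketch of Ning's detection vector: 2D parameters (bounding-box
# location and size) plus 3D parameters (vertices, center, orientation).
from dataclasses import dataclass
import numpy as np

@dataclass
class DetectionVector:
    bbox_xy: np.ndarray       # (2,) 2D box location in the frame
    bbox_wh: np.ndarray       # (2,) 2D box width and height
    vertices: np.ndarray      # (8, 3) the eight 3D box corners
    center: np.ndarray        # (3,) 3D box center point
    orientation: float        # 3D box yaw, in radians

    def as_flat(self) -> np.ndarray:
        """Concatenate the 2D and 3D parameters into one vector."""
        return np.concatenate([
            self.bbox_xy, self.bbox_wh,
            self.vertices.ravel(), self.center, [self.orientation],
        ])
```

The rejection turns to Hung for the elements Ning is said to lack.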
However, Hung teaches to extract short term features of the key points (Paragraph [0046]: “the object tracking system generates an embedded representation of each new measurement 202”; and Paragraph [0047]: “The object tracking system then generates a respective attended feature representation 212 for each of the new measurements 202 by processing (i) the embedded representations of the new measurements 202 and (ii) embedded representations of the measurements received at one or more earlier time steps”);

combine long-term key point features read from a key point features database stored in a memory and the short term key point features into combined key point features (Paragraph [0030]: “The object tracking system 140, also on-board the vehicle 102, receives the measurements 132 generated by the sensor system 130 and uses the measurements 132 to update object track data 142 maintained by the object tracking system 140 [...] the object track data 142 identifies multiple “tracks” of measurements, with each track including measurements that the object tracking system 140 has classified as being measurements of the same object and, therefore, with each of the tracks corresponding to different objects in the environment”; Paragraph [0047], quoted above; and Paragraph [0048]: “aggregates information from object detections received both at the time step t and two earlier time steps t−2 and t−1 to generate the attended feature representations for the new measurements”); and

apply a plurality of transformer (read as “self-attention neural network”) layers on top of attention processing to update the combined key point features using the interactions of the key points to form the long-term key point features and store the long-term key point features in the key point features database (Paragraph [0047]: “[…] that precede the current time step t using a self-attention neural network 210 that generates the respective attended feature representations by updating each of the embedded representations by attending over (i) the embedded representations of the new measurements 202 and (ii) the embedded representations of the measurements received at the one or more earlier time steps”; and Paragraph [0030], quoted above).

Ning teaches detecting and tracking objects across video frames using detection vectors that include 2D and 3D parameters and predicting object trajectories over time. However, Ning relies on conventional tracking mechanisms for associating object detections across frames. Hung teaches improving multi-object tracking robustness by maintaining object track data and updating embedded feature representations using a self-attention neural network that aggregates information from earlier time steps and current measurements. Hung further explains that such attention-based temporal aggregation improves tracking performance in scenarios involving occlusion and complex object interactions.
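To make the cited mechanism concrete, a minimal sketch of the temporal attention pattern Hung's Paragraphs [0047] and [0048] describe: attended representations are produced by attending over both the current time step's embeddings and those of earlier time steps. The use of PyTorch's nn.MultiheadAttention and the dimensions are illustrative assumptions, not Hung's implementation:

```python
# Attend over current (time t) and earlier (t-2, t-1) embeddings to
# produce attended feature representations for the new measurements.
import torch
import torch.nn as nn

embed_dim = 64
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

new_meas = torch.randn(1, 5, embed_dim)    # embeddings of measurements at t
past_meas = torch.randn(1, 12, embed_dim)  # embeddings from t-2 and t-1

# Keys/values span both the new and the earlier measurements (Hung [0047]).
memory = torch.cat([new_meas, past_meas], dim=1)
attended, _ = attn(query=new_meas, key=memory, value=memory)
print(attended.shape)  # torch.Size([1, 5, 64])
```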
Therefore, it would have been obvious to a person having ordinary skill in the art at the time of the invention to modify the 3D object detection and tracking system of Ning to incorporate the memory-attention-based feature aggregation framework of Hung. A person of ordinary skill in the art would have been motivated to apply Hung’s memory-attention framework to Ning’s 3D tracking system in order to improve temporal feature consistency, enhance robustness under occlusion, and provide more reliable trajectory prediction, as both references address the same technical field of object tracking across time and aim to improve tracking accuracy using neural network-based representations (Hung, Paragraph [0012]). This motivation for the combination of Ning and Hung is supported by KSR exemplary rationale (G): some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention. MPEP 2141(III).

Ning and Hung fail to teach to apply attention processing to hash the combined key point features into hash buckets representing interactions of the key points. However, Kitaev teaches to apply attention processing to hash the combined key point features into hash buckets representing interactions of the key points (Figure 2; and Col. 3, lines 27-33: “to generate a plurality of LSH groupings by determining one or more respective hash values for each key and assigning the respective keys having similar hash values into a same LSH grouping. The attention sub-layer then applies an attention mechanism over respective keys within each LSH grouping to generate the attended input sequence”).

Ning teaches a system for detecting and tracking dynamic objects across video frames using detection vectors that include 2D and 3D parameters and predicting object trajectories over time. Ning discloses generating 3D bounding box representations and tracking objects across multiple frames to obtain trajectories. Hung teaches a multi-object tracking framework that maintains object track data and updates embedded feature representations using a self-attention neural network. Hung explicitly describes aggregating feature representations from earlier time steps with current measurements to model spatiotemporal dependencies and improve tracking robustness, particularly in occlusion scenarios. Furthermore, Kitaev teaches an attention mechanism that employs locality sensitive hashing (LSH) to hash query and key vectors into buckets and compute attention within each bucket, thereby improving computational efficiency and scalability of transformer-based architectures.

Therefore, it would have been obvious to implement the attention mechanism of Hung using the LSH-based attention architecture taught by Kitaev in order to reduce computational complexity and improve scalability when processing large numbers of object features across time. Kitaev explicitly teaches that LSH attention provides efficiency improvements over conventional full attention mechanisms in transformer networks. Applying Kitaev’s LSH-based hashing technique to Hung’s self-attention framework represents a predictable use of known attention optimization techniques to improve computational efficiency in a multi-object tracking context (Kitaev, Col. 5, lines 12-43).
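For readers less familiar with the cited technique, a toy NumPy sketch of the LSH attention idea the rejection attributes to Kitaev: vectors are hashed into buckets, and softmax attention is computed only within each bucket rather than over all pairs. The shared query/key matrix and the random-hyperplane hash are assumptions for illustration, not the claimed or patented implementation:

```python
# Toy LSH attention: bucket vectors by random-hyperplane hashing, then
# attend only among vectors that land in the same bucket.
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(x: np.ndarray, n_hyperplanes: int = 4) -> np.ndarray:
    """Assign each row of x to a bucket via its sign pattern against
    random hyperplanes (similar vectors tend to share a pattern)."""
    planes = rng.normal(size=(x.shape[1], n_hyperplanes))
    bits = (x @ planes) > 0
    return bits @ (1 << np.arange(n_hyperplanes))  # pack bits to bucket id

def lsh_attention(qk: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Shared query/key vectors qk; softmax attention within each bucket."""
    out = np.zeros_like(v)
    buckets = lsh_buckets(qk)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = qk[idx] @ qk[idx].T / np.sqrt(qk.shape[1])
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

qk = rng.normal(size=(32, 16))  # 32 tokens, 16-dim shared query/key
v = rng.normal(size=(32, 16))
print(lsh_attention(qk, v).shape)  # (32, 16)
```

The efficiency gain comes from restricting the quadratic score computation to small buckets instead of the full token set, which is the scalability argument the rejection draws from Kitaev.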
The combination of Ning, Hung, and Kitaev therefore merely applies known feature aggregation and attention optimization techniques to a known 3D object tracking system to yield predictable improvements in tracking robustness, temporal modeling, and computational efficiency. This motivation for the combination of Ning, Hung, and Kitaev is supported by KSR exemplary rationale (G): some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention. MPEP 2141(III).

Regarding claim(s) 2 and 14, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Kitaev teaches wherein the attention processing comprises locality sensitive hashing (LSH) attention processing (Col. 3, lines 15-20: “To perform the machine learning task, the system includes an attention neural network that includes multiple layers. One or more of the multiple layers are locality-sensitive hashing (LSH) attention layers that operate on a respective input sequence that includes a respective input vector at each of one or more positions”).

Regarding claim(s) 3 and 15, Ning as modified by Hung and Kitaev teaches the apparatus of claim 2, where Kitaev teaches wherein the LSH attention processing approximates a query-key attention matrix by hashing a query of a key point and key vectors to the hash buckets and computes attention only between queries hashing to a same hash bucket (Figure 2; and Col. 3, lines 27-33, quoted above).

Regarding claim(s) 4 and 16, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Hung teaches wherein to combine short term key point features and long-term key point features, the one or more processors are configured to concatenate the short term key point features and the long-term key point features (Paragraphs [0030], [0047], and [0048], quoted above).

Regarding claim(s) 5 and 17, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Hung teaches wherein the one or more processors (Paragraph [0088]) are further configured to: augment the updated combined key point features with identifiers (Paragraphs [0030] and [0047], quoted above).

Regarding claim(s) 6 and 18, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Hung teaches wherein the one or more processors (Paragraph [0088]) are further configured to: train a machine learning model for predicting object motion using a regression loss from a loss function (Paragraph [0009]: “Training the neural network thus involves continually performing a forward pass on the input, computing gradient values, and updating the current values for the set of parameters for each layer using the computed gradient values, e.g., using gradient descent”; and Paragraph [0030], quoted above), and where Ning teaches applying the loss to the updated 3D box including the tracklet (Paragraphs [0011], [0012], and [0016], quoted above).

Regarding claim(s) 7 and 19, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Ning teaches wherein the one or more processors (Figure 3: Processor 351; and Paragraph [0067]) are further configured to: detect the dynamic object using a bird’s eye view (BEV) fusion model (Figure 5; Paragraph [0011], quoted above; and Paragraph [0013]: “transform the video frames into bird-view, wherein the bird view of the video frames comprises the trajectory of the object”).

Regarding claim(s) 8, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Ning teaches wherein the one or more processors (Figure 3: Processor 351; and Paragraph [0067]) are further configured to: extract the key point features by a residual neural network (Paragraph [0077]: “In certain embodiments, the 3D object regressor of the 3DOD model 360 is implemented by a pretrained ResNet34 backbone (fully connected layers removed), followed by 3 fully connected layers. The channels may be reduced from 512 to 256, 128, and finally 20. The 20 channels denote the eight points (2×8=16 channels), the coarse vehicle dimension (3 channels) and the coarse depth (1 channel)”).

Regarding claim(s) 9, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Hung teaches wherein the one or more processors (Paragraph [0088]) are further configured to: read long-term key point features from the key point features database based at least in part on a key point index (Paragraphs [0030] and [0047], quoted above).

Regarding claim(s) 10, Ning as modified by Hung and Kitaev teaches the apparatus of claim 9, where Hung teaches wherein the one or more processors (Paragraph [0088]) are further configured to: store the long-term key point features in the key point features database based at least in part on the key point index (Paragraph [0030], quoted above).

Regarding claim(s) 11 and 20, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Ning teaches wherein the one or more processors (Figure 3: Processor 351; and Paragraph [0067]) are further configured to: predict the updated 3D box for the dynamic object, the updated 3D box including the tracklet, using a multilayer perceptron neural network (Paragraphs [0011], [0012], [0016], and [0077], quoted above).

Regarding claim(s) 12, Ning as modified by Hung and Kitaev teaches the apparatus of claim 1, where Ning teaches wherein the apparatus comprises a vehicle, and wherein the plurality of cameras is disposed on the vehicle (Paragraph [0014]: “In certain embodiments, the object is a vehicle”; and Paragraph [0072], quoted above).

Relevant Prior Art Directed to State of Art

Wilf et al (US 2025/0131735 A1) is relevant prior art not applied in the rejection(s) above. Wilf discloses an apparatus for distance estimation of an object from a vehicle, the apparatus comprising: an interface hardware configured to obtain an image frame; a memory unit for storing the image frame; and a processing unit coupled to the interface hardware and the memory unit, the processing unit being configured to perform distance estimation of an object from a vehicle during driving, wherein the method comprises: obtaining an image frame including the object; identifying a first set of key points of the object from the image frame; obtaining a second set of key points that correlates to the first set of key points from a non-volatile memory; obtaining spatial information about the second set of key points from the non-volatile memory; and evaluating a distance of the object from the vehicle based on the first set of key points and the spatial information about the second set of key points.

White et al (US 2016/0180189 A1) is relevant prior art not applied in the rejection(s) above. White discloses a method of creating an image hash, comprising: identifying features at different locations within a sample image; identifying descriptor vectors for at least a subset of the features, the subset defining key points of the image, the descriptor vectors describing local image information around the key points, where each descriptor vector is an n-dimensional array and is represented as a unary feature of dimension 128×128; and generating key points based on hashes of data vectors that include at least one of the descriptors, where each feature is a 36×20 hash value.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONGBONG NAH whose telephone number is (571) 272-1361. The examiner can normally be reached M - F: 9:00 AM - 5:30 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ONEAL MISTRY, can be reached on 313-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JONGBONG NAH/
Examiner, Art Unit 2674

/ONEAL R MISTRY/
Supervisory Patent Examiner, Art Unit 2674

Prosecution Timeline

Feb 13, 2024
Application Filed
Feb 20, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591952: Image rotation
2y 5m to grant; granted Mar 31, 2026

Patent 12579737: ROTATING 3D SCANNER TO ENABLE PRONE CONTACTLESS REGISTRATION
2y 5m to grant; granted Mar 17, 2026

Patent 12579775: IMAGE PROCESSING USING DATUM IDENTIFICATION AND MACHINE LEARNING ALGORITHMS
2y 5m to grant; granted Mar 17, 2026

Patent 12580050: SPATIALLY CO-REGISTERED GENOMIC AND IMAGING (SCORGI) DATA ELEMENTS FOR FINGERPRINTING MICRODOMAINS
2y 5m to grant; granted Mar 17, 2026

Patent 12567141: MEDICAL IMAGE SYNTHESIS DEVICE AND METHOD
2y 5m to grant; granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview: 90% (+15.2%)
Median Time to Grant: 3y 0m
PTA Risk: Low

Based on 104 resolved cases by this examiner. Grant probability derived from career allow rate.
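The projections are consistent with simple arithmetic on the examiner's career data: the with-interview figure matches the career allow rate plus the interview lift (75.0 + 15.2 = 90.2, displayed as 90%). A minimal sketch; whether the tool computes it exactly this way is an assumption:

```python
# Reproduce the displayed grant-probability projections from the
# career figures shown above (78 granted of 104 resolved, +15.2 lift).
career_allow_rate = 78 / 104   # = 0.75 exactly
interview_lift = 0.152         # +15.2 points in interviewed cases

with_interview = min(career_allow_rate + interview_lift, 1.0)
print(f"Base grant probability: {career_allow_rate:.0%}")  # 75%
print(f"With interview:        {with_interview:.0%}")      # 90%
```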
