Last updated: May 29, 2026
Application No. 18/433,958
TIME SYNCHRONIZATION OF MULTIPLE CAMERA INPUTS FOR VISUAL PERCEPTION TASKS

Non-Final OA §103
Filed
Feb 06, 2024
Priority
May 04, 2023 — provisional 63/500,157
Examiner
GILLIARD, DELOMIA L
Art Unit
2661
Tech Center
2600 — Communications
Assignee
Qualcomm Incorporated
OA Round
1 (Non-Final)
Interview Optional

Interview lift: +10.2%, below the 15.0% threshold. A written response is recommended.
Based on 1092 resolved cases, 2023–2026
Examiner Intelligence

GILLIARD, DELOMIA L
Grants 90% — above average
Career Allowance Rate
979 granted / 1092 resolved
+27.7% vs TC avg
+10.2% (moderate)
Interview Lift
resolved cases with interview
Fast prosecutor
1y 12m
Avg Prosecution
13 currently pending
Career history
1104
Total Applications
across all art units
Statute-Specific Performance

§101: 3.7% (-36.3% vs TC avg)
§103: 67.8% (+27.8% vs TC avg)
§102: 6.5% (-33.5% vs TC avg)
§112: 3.9% (-36.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 1092 resolved cases.
Office Action

§103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-7, 10-14, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over “SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation” by Wei et al. (hereinafter “Wei”) in view of US 2020/0401793 A1 to Leung et al. (hereinafter “Leung”).
Claim 1. Wei teaches An apparatus for processing image data, the apparatus comprising: [Abstract] Depth estimation from images

at least one memory; [4.2 The Benchmark for Multi-camera Depth Estimation]  on a single RTX 3090 (has a memory)

and at least one processor coupled to the at least one memory and configured to: Wei [4.2 The Benchmark for Multi-camera Depth Estimation] on a single RTX 3090 

obtain a plurality of input images associated with a plurality of spatial views of a scene; Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

[3.2 Overview] Given a set of surrounding images

generate, using a machine learning-based encoder of a machine learning system, a plurality of features from the plurality of input images; [3.2 Overview] the encoder networks first extract their multi-scale representations in parallel

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

and combine timing information associated with capture of the plurality of input images with at least one input of the machine learning system to synchronize the plurality of features in time. [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

[3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Wei fails to explicitly teach timing information associated with capture of the plurality of input images. Leung, in the field of training a neural network to perform multi-view association of subject(s) in images, teaches [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330.

Wei teaches multi-view depth estimation using a neural network. Thus, before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to substitute the feature incorporating the information of Wei with the timing information of Leung [0002] for creation of intelligent interactive environments and to provide for development of specific subject-based identifiers that can be used to identify and track multiple subjects in image data.
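For illustration only, the time-stamp based synchronization that the Examiner cites from Leung [0100] can be sketched as follows. This is a minimal sketch under stated assumptions, not Leung's implementation: the function names `synchronize` and `nearest`, the `tolerance` parameter, the choice of the first camera as the reference clock, and the sample data are all hypothetical.

```python
from bisect import bisect_left


def nearest(frames, ts):
    """Return the (timestamp, frame) pair closest in time to ts, or None if empty.

    frames must be sorted by timestamp.
    """
    if not frames:
        return None
    stamps = [t for t, _ in frames]
    i = bisect_left(stamps, ts)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = frames[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda f: abs(f[0] - ts))


def synchronize(streams, tolerance):
    """Group frames whose time-stamps agree within `tolerance` seconds.

    streams maps a camera id to a sorted list of (timestamp, frame) pairs.
    Assumption: the first camera in the mapping acts as the reference clock.
    A group is kept only if every other camera has a frame close enough.
    """
    cam_ids = list(streams)
    ref_id, others = cam_ids[0], cam_ids[1:]
    synced = []
    for ref_ts, ref_frame in streams[ref_id]:
        group = {ref_id: (ref_ts, ref_frame)}
        for cam in others:
            match = nearest(streams[cam], ref_ts)
            if match is None or abs(match[0] - ref_ts) > tolerance:
                break  # this camera has no frame at "substantially the same time"
            group[cam] = match
        else:
            synced.append(group)
    return synced


# Hypothetical sample data: two cameras at roughly 10 frames per second.
streams = {
    "front": [(0.000, "f0"), (0.100, "f1"), (0.200, "f2")],
    "left": [(0.004, "l0"), (0.103, "l1"), (0.260, "l2")],
}
synced = synchronize(streams, tolerance=0.01)
```

With this sample data, the third reference frame is dropped because the closest frame from the second camera is about 60 ms away, which exceeds the assumed tolerance.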

Claim 2. Leung teaches wherein each image sensor configured to capture the plurality of input images is triggered at a first time. [0100] the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time.

Claim 3. Leung teaches wherein a first input image of the plurality of input images is output from a first image sensor at a different time than a second input image of the plurality of input images is output from a second image sensor. [0100] the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. Examiner interprets “substantially the same time” to be a different time.

Claim 4. Wei and Leung teach wherein the at least one input comprises at least one input image of the plurality of input images, Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

the at least one processor is configured to combine the timing information with the at least one input image. [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

and wherein to combine the timing information associated with capture of the plurality of input images with the at least one input of the machine learning system, Wei [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

Wei [3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

Wei [6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 

Claim 5. Wei and Leung teach wherein the at least one input comprises multi-scale feature information generated from at least one input image of the plurality of input images using the machine learning-based encoder, Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

[3.2 Overview] Figure 2, the network F can be separated into three parts (i.e., a shared encoder E, a shared decoder D, and several cross-view transformers (CVT))

the at least one processor is configured to add the timing information to the multi-scale feature information. [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to include timing information.

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

and wherein to combine the timing information associated with capture of the plurality of input images with the at least one input of the machine learning system, Wei [3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

Wei [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

Wei [6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 
 
Claim 6. Wei and Leung teach wherein the at least one input comprises downscaled multi-scale feature information generated from at least one input image of the plurality of input images using the machine learning-based encoder, Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

[3.2 Overview] Figure 2, the network F can be separated into three parts (i.e., a shared encoder E, a shared decoder D, and several cross-view transformers (CVT))

[4.1 Experimental Setup] In each scale, we adopted Z = 8 transformer layers and all features were downsampled

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

the at least one processor is configured to add the timing information to the downscaled multi-scale feature information. [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to include timing information. 

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

and wherein to combine the timing information associated with capture of the plurality of input images, Wei [3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

Wei [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

Wei [6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 

Claim 7. Wei and Leung teach wherein the at least one input comprises flattening features associated with multi-scale feature information generated from at least one input image of the plurality of input images using the machine learning-based encoder, 
Figure 2: target frames (input images), Figure 3

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

[3.2 Overview] Given a set of surrounding images, the encoder networks first extract their multi-scale representations in parallel…

[3.3 Cross-View Transformer] to create the pathway across the features of surrounding views, we flatten the feature maps into an unified sequence, which includes the elements from all views, i.e., N × h_k × w_k elements in total.

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

the at least one processor is configured to add the timing information to the flattening features. [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to include timing information.  

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

and wherein to combine the timing information associated with capture of the plurality of input images, Wei [3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

Wei [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

Wei [6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 
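To make the claim 7 mapping concrete, the flattening step quoted from Wei [3.3] (N views of h_k × w_k feature maps becoming one sequence of N × h_k × w_k elements) can be sketched as below. The second function shows one possible reading of "add the timing information to the flattening features", namely appending a per-view capture-time offset as an extra channel. That reading is an assumption made for illustration; it is not taken from Wei, Leung, or the application, and the names `flatten_views`, `append_time_offsets`, `capture_times`, and `reference_time` are hypothetical.

```python
def flatten_views(feature_maps):
    """Flatten N feature maps of shape h x w x C into one sequence of N*h*w tokens.

    feature_maps[n][y][x] is a list of C channel values for view n.
    Tokens are ordered view by view, then row by row.
    """
    return [list(pixel) for view in feature_maps for row in view for pixel in row]


def append_time_offsets(feature_maps, capture_times, reference_time):
    """Hypothetical: flatten as above, appending each view's time offset as a channel."""
    tokens = []
    for view, capture_time in zip(feature_maps, capture_times):
        offset = capture_time - reference_time
        for row in view:
            for pixel in row:
                tokens.append(list(pixel) + [offset])
    return tokens


# Hypothetical sample: N = 2 views, h = 2, w = 3, C = 2 channels.
views = [
    [[[n + 0.1 * y, float(x)] for x in range(3)] for y in range(2)]
    for n in range(2)
]
sequence = flatten_views(views)
timed_sequence = append_time_offsets(views, capture_times=[1.0, 1.5], reference_time=1.0)
```

Here `sequence` holds 2 × 2 × 3 = 12 tokens of 2 channels each, and `timed_sequence` holds the same 12 tokens with a third channel carrying the offset of the view each token came from.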

Claim 10. Wei teaches wherein the at least one processor is configured to: determine a cross-view attention between the plurality of spatial views based on the plurality of features generated from the plurality of input images associated with the plurality of spatial views; Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

Figure 3: The proposed cross-view transformer.

[3.3 Cross-View Transformer, 2nd paragraph] Then we build Z cross-view self-attention layers to perform the cross-view information exchanging…we place a depthwise separable convolution (DS Conv) [44, 45] before attention layers to first summarize the large feature maps into lower-resolution ones with same channel numbers…To create the pathway across the features of surrounding views, we flatten the feature maps into an unified sequence, which includes the elements from all views

and determine depth associated with the plurality of input images based on the cross-view attention. Figure 2: We utilize encoder-decoder networks to predict depths. To entangle surrounding views, we propose the cross-view transformer (CVT) to fuse multi camera features in a multi-scale fashion.

Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}

Figure 3: The proposed cross-view transformer. We first use a depthwise separable convolution (DS-Conv) layer to summarize the multi-view features into compact representations…

[D Visualization, 2nd paragraph] …demonstrating that our cross-attentions are able to entangle multi-view features to predict depths jointly.
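The cross-view self-attention relied on for claim 10 can be illustrated with a small scaled dot-product attention over a token sequence that mixes elements from several views. This is a simplified sketch, not Wei's network: it assumes identity query, key, and value projections, a single head, and no DS-Conv downsampling, and the names `softmax` and `self_attention` and the sample tokens are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens is a flattened sequence holding elements from all views, so every
    token can attend to tokens that originated in other views.
    Returns (outputs, weights), where weights[i][j] is how much token i attends to token j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        )
    return outputs, weights


# Hypothetical sample: two tokens from view 0 followed by two tokens from view 1.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this sample the first token (view 0) places a nonzero weight on the third token (view 1), which is the cross-view information exchange the quoted passages describe.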

Claim 11. Reviewed and analyzed in the same way as claim 1. See the above analysis and rationale. 

Claim 12. Reviewed and analyzed in the same way as claim 2. See the above analysis and rationale. 

Claim 13. Reviewed and analyzed in the same way as claim 3. See the above analysis and rationale. 

Claim 14. Reviewed and analyzed in the same way as claim 4. See the above analysis and rationale. 

Claim 16. Reviewed and analyzed in the same way as claim 10. See the above analysis and rationale. 
 
Claim 17. Wei teaches An apparatus for processing image data, the apparatus comprising: [Abstract] Depth estimation from images

at least one memory; [4.2 The Benchmark for Multi-camera Depth Estimation]  on a single RTX 3090 (has a memory)

and at least one processor coupled to the at least one memory and configured to: [4.2 The Benchmark for Multi-camera Depth Estimation]  on a single RTX 3090 

obtain a plurality of input images associated with a plurality of spatial views of a scene; Figure 2: target frames (input images)

[3.1 Problem Formulation] The predicted depth maps and poses of a set of input surrounding samples I = {I_1, I_2, ..., I_N}; [3.2 Overview] Given a set of surrounding images

generate, using a machine learning system, Figure 2: To entangle surrounding views, we propose the cross-view transformer (CVT) to fuse multi camera features in a multi-scale fashion.

plurality of predicted images associated with the plurality of spatial views; [Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. 

Figure 2: depth prediction

and generate a plurality of features from each of the plurality of spatial views at a first time. [3.2 Overview] Figure 2: We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

Wei fails to explicitly teach a first time. Leung, in the field of training a neural network to perform multi-view association of subject(s) in images, teaches [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps.

Wei teaches multi-view depth estimation using a neural network. Thus, before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to substitute the feature incorporating the information of Wei with the timing information of Leung [0002] for creation of intelligent interactive environments and to provide for development of specific subject-based identifiers that can be used to identify and track multiple subjects in image data.

Claim 18. Wei teaches wherein the at least one processor is configured to: warp each image from the plurality of predicted images based on extrinsic information of a corresponding image sensor. [3.4 Scale-aware Structure-from-Motion Pretraining] A direct way to leverage extrinsic matrices is to use spatial photometric loss between two neighboring views, i.e., warping I_t^i to I_t^j

[3.5 Joint Pose Estimation] Obtaining the universal pose P, we can further transform it to each camera pose with known camera extrinsic matrices
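The extrinsics-based warping cited for claim 18 can be illustrated for a single pixel: back-project it with its depth, move the 3D point from camera i to camera j through the known extrinsics, and project it again. This is a sketch under stated assumptions rather than Wei's code: both cameras are assumed to share pinhole intrinsics, extrinsics are assumed to be camera-to-world rotation and translation pairs, and the names `reproject`, `mat_vec`, and `transpose` are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def transpose(matrix):
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[matrix[c][r] for c in range(3)] for r in range(3)]


def reproject(u, v, depth, intrinsics, cam_i_to_world, cam_j_to_world):
    """Map pixel (u, v) with known depth in camera i to pixel coordinates in camera j.

    intrinsics is (fx, fy, cx, cy), assumed identical for both cameras.
    Each extrinsic is (R, t) taking camera coordinates to world coordinates.
    Returns None when the point falls behind camera j.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in camera i coordinates.
    point_i = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Camera i -> world.
    rot_i, trans_i = cam_i_to_world
    point_w = [a + b for a, b in zip(mat_vec(rot_i, point_i), trans_i)]
    # World -> camera j, using the inverse rigid transform R^T (p - t).
    rot_j, trans_j = cam_j_to_world
    point_j = mat_vec(transpose(rot_j), [a - b for a, b in zip(point_w, trans_j)])
    if point_j[2] <= 0:
        return None
    # Project into camera j.
    return (fx * point_j[0] / point_j[2] + cx, fy * point_j[1] / point_j[2] + cy)


# Hypothetical sample: camera j sits 0.5 units to the right of camera i.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel_j = reproject(
    50.0, 50.0, 10.0,
    intrinsics=(100.0, 100.0, 50.0, 50.0),
    cam_i_to_world=(identity, [0.0, 0.0, 0.0]),
    cam_j_to_world=(identity, [0.5, 0.0, 0.0]),
)
```

Applying this mapping to every pixel, and sampling image colors at the resulting coordinates, is what allows a photometric comparison between two neighboring views.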
 
Claim 19. Wei teaches wherein each of the plurality of predicted images is generated to correspond to an estimate of the plurality of features at the first time. [3.2 Overview] Figure 2: We first employ a shared encoder to extract high-level feature maps for each view…we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps.

Claim 20. Wei teaches wherein the at least one processor is configured to: determine a cross-view attention between the plurality of spatial views based on the plurality of features generated from the plurality of predicted images associated with the plurality of spatial views; Figure 2: … To entangle surrounding views, we propose the cross-view transformer (CVT) to fuse multi camera features in a multi-scale fashion

[3.2 Overview] utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales. 

and determine depth associated with the plurality of predicted images based on the cross-view attention. Figure 2: An overview of our SurroundDepth. We utilize encoder-decoder networks to predict depths. To entangle surrounding views, we propose the cross-view transformer (CVT)

Allowable Subject Matter
Claims 8, 9, and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 8. Wei teaches wherein the at least one input comprises keys and values generated from multi-scale feature information using the machine learning-based encoder, [3.3 Cross-View Transformer] We develop three linear layers to obtain the query, key, and value vectors…K_i, Q_i, V_i denote the i-th feature group of the key, query, and value features, respectively.

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

and wherein to combine the timing information associated with capture of the plurality of input images, Wei [3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

[Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 
 
the at least one processor is configured to add the timing information to the keys and the values  [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to include timing information.  

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Wei [3.3 Cross-View Transformer] We develop three linear layers to obtain the query, key, and value vectors. However, Wei and the other prior art fail to explicitly teach adding the timing information to the keys and the values.

Claim 9. Wei teaches wherein the at least one input comprises queries generated from multi-scale feature information using the machine learning-based encoder, Figure 7: The visualization of cross-view attention maps. Given a set of query points, our cross view attention maps will highlight those features at the corresponding locations in other views.

[Introduction] we propose SurroundDepth to process all the surrounding views jointly to produce high-quality depth maps across cameras. We first employ a shared encoder to extract high-level feature maps for each view and then propose a cross-view transformer to effectively fuse them.

[3.2 Overview] the encoder networks first extract their multi-scale representations in parallel

and wherein to combine the timing information associated with capture of the plurality of input images, [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to be combining timing information.

[3.2 Overview] we entangle the features from all views into an integrated feature at each scale, and further utilize multiple scale-specific CVT to perform cross-view self-attentions over all scales,  Figure 3

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Leung [0100] The multi-view associator 330 time-synchronizes images generated by the different image capture devices 122, 124, 370 based on, for example, time-stamps. Thus, the multi-view associator 330 generates synchronized sets of images including different views generated by the respective image capture devices 122, 124, 370 at the same or substantially the same time. A synchronized set of images includes the same subject identifier for each subject identified in the respective views as a result of the execution of the view association model 391 by the multi-view associator 330. 

the at least one processor is configured to add the timing information to the queries  [Abstract] we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.  Examiner interprets “incorporate the information” and “fuse the information” to include timing information.  

[6 Conclusion] The cross-view transformer is performed at multiple scales to incorporate multi-view features.

Wei [3.3 Cross-View Transformer] We develop three linear layers to obtain the query, key, and value vectors. However, Wei and the other prior art fail to explicitly teach adding the timing information to the queries.
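To show where the distinction drawn for claims 8 and 9 sits in an attention pipeline, the sketch below produces queries, keys, and values with three linear layers (as quoted from Wei [3.3]) and then adds timing information to the keys and values. The timing step is purely illustrative: the Office Action states that Wei and the other cited art do not teach it, and nothing here is taken from the application's specification. The element-wise form, the `time_embedding` vector, and the names `linear`, `project_qkv`, and `add_timing` are all assumptions.

```python
def linear(weights, x):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def project_qkv(tokens, w_q, w_k, w_v):
    """Three linear layers producing query, key, and value vectors per token."""
    queries = [linear(w_q, t) for t in tokens]
    keys = [linear(w_k, t) for t in tokens]
    values = [linear(w_v, t) for t in tokens]
    return queries, keys, values


def add_timing(vectors, time_offsets, time_embedding):
    """Hypothetical: shift each vector by its capture-time offset times time_embedding."""
    return [
        [v + dt * e for v, e in zip(vec, time_embedding)]
        for vec, dt in zip(vectors, time_offsets)
    ]


# Hypothetical sample: two tokens, identity projections, second token captured 0.5 s later.
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
time_offsets = [0.0, 0.5]
time_embedding = [2.0, -2.0]

queries, keys, values = project_qkv(tokens, identity, identity, identity)
keys_timed = add_timing(keys, time_offsets, time_embedding)
values_timed = add_timing(values, time_offsets, time_embedding)
```

The same `add_timing` call applied to `queries` would correspond to the claim 9 variant; which vectors receive timing information is the point the claims distinguish.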

Claim 15. Reviewed and analyzed in the same way as claims 5-9. See the above analysis and rationale. 

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Villecco can be reached at (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DELOMIA L GILLIARD/Primary Examiner, Art Unit 2661
Prosecution Timeline

Feb 06, 2024
Application Filed
May 06, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/252,881
Patent 12639843
METHOD AND APPARATUS FOR DETERMINING POSE OF TRACKED OBJECT IN IMAGE TRACKING PROCESS
3y 0m to grant Granted May 26, 2026
18/381,540
Patent 12639964
VIDEO MANUAL GENERATION DEVICE, VIDEO MANUAL GENERATION METHOD, AND STORAGE MEDIUM STORING VIDEO MANUAL GENERATION PROGRAM
2y 7m to grant Granted May 26, 2026
18/470,367
Patent 12639980
User Eye Model Match Detection
2y 8m to grant Granted May 26, 2026
18/750,644
Patent 12633112
METHODS AND APPARATUS FOR IDENTIFYING VIDEO-DERIVED DATA
1y 11m to grant Granted May 19, 2026
18/282,949
Patent 12626555
ENTRY/EXIT MANAGEMENT SYSTEM, ENTRY/EXIT MANAGEMENT METHOD, AND RECORDING MEDIUM
2y 7m to grant Granted May 12, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

1-2
Expected OA Rounds
90%
Grant Probability
99%
With Interview (+10.2%)
1y 12m (~0m remaining)
Median Time to Grant
Low
PTA Risk
Based on 1092 resolved cases by this examiner. Grant probability derived from career allowance rate.