Prosecution Insights
Last updated: April 19, 2026
Application No. 18/607,077

MULTI-STAGE MULTI-VIEW OBJECT DETECTION

Status: Non-Final OA (§103)

Filed: Mar 15, 2024
Examiner: SCHWARTZ, RAPHAEL M
Art Unit: 2671
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 67% (Favorable)
OA Rounds: 1-2
To Grant: 2y 11m
With Interview: 98%

Examiner Intelligence

Career Allow Rate: 67% (227 granted / 338 resolved), +5.2% vs TC avg (above average)
Interview Lift: +31.3% among resolved cases with an interview (strong)
Typical Timeline: 2y 11m average prosecution; 24 applications currently pending
Career History: 362 total applications across all art units
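For readers checking the headline figures, the displayed statistics relate by simple arithmetic. The sketch below is illustrative only: the rounding and the exact derivation of the with-interview figure from allow rate plus lift are assumptions about how the dashboard computes its numbers, not formulas stated in the report.

```python
# Illustrative arithmetic behind the examiner statistics shown above.
# Deriving the "with interview" figure as allow rate + lift is an assumption
# about how the dashboard computes it, not a documented formula.

granted, resolved = 227, 338
career_allow_rate = granted / resolved               # ~0.672 -> displayed as 67%

interview_lift = 0.313                               # +31.3% lift with an interview
with_interview = career_allow_rate + interview_lift  # ~0.985; report shows 98%,
                                                     # consistent with 67% + 31.3%

print(f"Career allow rate: {career_allow_rate:.1%}")
print(f"With interview:    {with_interview:.1%}")
```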

Statute-Specific Performance

§101: 7.8% (-32.2% vs TC avg)
§103: 48.9% (+8.9% vs TC avg)
§102: 7.5% (-32.5% vs TC avg)
§112: 19.3% (-20.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 338 resolved cases.
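As a quick consistency check, each "vs TC avg" delta appears to be the examiner's per-statute rate minus the Tech Center average estimate; a minimal sketch follows. The roughly 40% baseline it implies in every category is an inference from the displayed numbers, not a figure stated in the report.

```python
# Check that each "vs TC avg" delta equals (examiner rate - implied TC baseline).
# The ~40% implied baseline is inferred from the displayed figures (assumption).

stats = {              # statute: (examiner rate %, delta vs TC avg %)
    "101": (7.8, -32.2),
    "103": (48.9, +8.9),
    "102": (7.5, -32.5),
    "112": (19.3, -20.7),
}

for statute, (rate, delta) in stats.items():
    implied_tc_avg = rate - delta    # 40.0 in every category
    print(f"§{statute}: implied TC average ≈ {implied_tc_avg:.1f}%")
```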

Office Action

§103
DETAILED ACTION

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li (“BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers”).

Regarding claim 1, Li discloses apparatus for object detection, the apparatus comprising: (Pg. 2, second paragraph from bottom, “transformer-based bird’s-eye-view (BEV) encoder, termed BEVFormer, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features. The BEV features generated from the BEVFormer can simultaneously support multiple 3D perception tasks such as 3D object detection and map segmentation, which is valuable for the autonomous driving system” Tables 5 and 6 teach using a GPU apparatus.)

at least one memory; and at least one processor coupled to the at least one memory and configured to: (Tables 5 and 6 teach using a GPU processor and memory.)

extract, using an encoder, a plurality of features from one or more images of an environment of the apparatus; (See Fig. 2 and encoder layers processing multi-view input. Pg. 4, last paragraph, “We feed multi-camera images to the backbone network (e.g., ResNet-101[15]), and obtain the features . . . of different camera views”)

determine, based on the plurality of features, a first detection of one or more objects and three-dimensional (3D) coordinates for the one or more objects; (3D BEV representation is generated and updated on the basis of the input image features, see Figs. 1 and 2 and see pg. 5, “3.3 Spatial Cross-Attention.” The BEV provides a unified environment representation which aggregates the multi-camera views to detect objects surrounding the vehicle.)

back-project the 3D coordinates of the one or more objects onto the one or more images; (Pg. 5, Section 3.3 Spatial Cross-Attention teaches taking the generated 3D representation and back projecting reference points to the different 2D image views via a projection matrix for each camera.)

determine one or more regions of at least one first image of the one or more images based on the back-projection of the 3D coordinates of the one or more objects; and (Pg. 5, Section 3.3 ¶ 1, “we develop the spatial cross-attention based on deformable attention, which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views.” Also see Fig. 2 and Pg. 5, Section 3.3, ¶ 2 which shows that each BEV query only interacts with image features in the region of interest at the reference points in the hit views.)

determine, based on the one or more regions of the at least one first image, a second detection of the one or more objects. (After the spatial and temporal cross attention updates the BEV as described above the process continues to a 3D detection head network as seen at the top of Fig 2, and at Section 3.5 as well as on Pg. 16, Section “Detection Head”.)

Li does not expressly disclose that all of its above-cited teachings on multi-view 3D object detection are expressly disclosed as occurring in the same embodiment. That is, despite the reference being clear that these functions are disclosed, there is no express disclosure that the details are all found in the same embodiment. For example, the reference teaches using computer hardware in Tables 5 and 6 for model training and model performance but this disclosure relates to Section 4.2 Experimental Settings and its implementation of ‘BEVFormer-S’ which has slight variations on the BEVFormer architecture described in Section 3. There is no express disclosure that the Section 3 system is used with a GPU and memory.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the various teachings to provide a single system capable of performing the computational tasks on a processor and memory. In view of these teachings, this cannot be considered a non-obvious improvement over the prior art. Using known engineering design, no “fundamental” operating principle of the teachings are changed; they continue to perform the same functions as originally taught prior to being combined.

Regarding claim 2, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to downsample the one or more images of the environment to produce one or more downsampled images, wherein the plurality of features are extracted from the one or more downsampled images. (Backbone network via ResNet-101 shown in Fig. 2 and pg. 4, last paragraph downsample the images for features extraction. The deformable attention step at section 3.3 also downsamples the images.)

Regarding claim 3, the above combination discloses the apparatus of claim 2, wherein the one or more images have a higher resolution than the one or more downsampled images. (As above, Backbone network via ResNet-101 shown in Fig. 2 and pg. 4, last paragraph downsample the images for features extraction. The deformable attention step at section 3.3 also downsamples the images. Downsampling is the process of reducing image resolution.)

Regarding claim 4, the above combination discloses the apparatus of claim 2, wherein the one or more images include a larger number of images than the one or more downsampled images. (The deformable attention step at section 3.3 downsamples the images and only operates on the features in the ‘hit’ views.)

Regarding claim 5, the above combination discloses the apparatus of claim 1, wherein the one or more images are two-dimensional images. (See Fig 2 regarding the 2D camera views.)

Regarding claim 6, the above combination discloses the apparatus of claim 1, further comprising one or more camera sensors, wherein the one or more camera sensors are configured to obtain the one or more images of the environment of the apparatus. (See Fig 2 regarding the multiple 2D cameras.)
Regarding claim 7, the above combination discloses the apparatus of claim 6, wherein the at least one processor is configured to determine a subset of camera sensors of the one or more camera sensors for the one or more regions of the at least one first image based on at least one of: the subset of camera sensors having views within which the one or more objects are more centrally located than within one or more views of one or more other camera sensors, the subset of camera sensors having views where the one or more objects are least occluded as compared to views of other camera sensors of the one or more camera sensors, or machine learning training for selecting the subset of camera sensors. (As above, Fig. 2 and Pg. 5, Section 3.3, ¶ 2 show that each BEV query only interacts with image features in the region of interest at the reference points in the hit view without occlusion.)

Regarding claim 8, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on the one or more regions being processed individually. (As above, Fig. 2 and Pg. 5, Section 3.3, ¶ 2 show that each BEV query only interacts with and processes individual images that have hit views.)

Regarding claim 9, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on at least portions of the one or more regions being processed as a single composite region comprising the at least portions of the one or more regions. (The multi-view regions are processed as a composite region in the spatial cross-attention step to aggregate the spatial features from multi-camera images, see pg. 2, second paragraph from bottom and pg. 5, 3.3 Spatial Cross-Attention.)

Regarding claim 10, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on the one or more regions being processed with one or more cross-attention layers of a transformer neural network applied to the one or more regions. (As above, the multi-view regions are processed as a composite region in the spatial cross-attention transformer neural network step to aggregate the spatial features from multi-camera images, see pg. 2, second paragraph from bottom and pg. 5, 3.3 Spatial Cross-Attention.)

Regarding claim 11, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to project the plurality of features to a bird’s eye view (BEV). (As above, see Fig. 2)

Regarding claim 12, the above combination discloses the apparatus of claim 1, wherein the 3D coordinates are world coordinates. (See pg. 5, section 3.3, ¶ 3.)

Regarding claim 13, the above combination discloses the apparatus of claim 1, wherein the apparatus is a vehicle or a computing device of the vehicle. (See Fig. 1 and rejection of claim 1.)

Regarding claim 14, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to, for each region of the one or more regions, extract one or more patches of sensor data or one or more patches of features of the plurality of features. (As above, Pg. 5, Section 3.3 ¶ 1, “we develop the spatial cross-attention based on deformable attention, which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views.” Also see Fig. 2 and Pg. 5, Section 3.3, ¶ 2 which shows that each BEV query only interacts with image features in the region of interest at the reference points in the hit views.)

Claims 15-20 are the method claims corresponding to the apparatus of claims 1, 2, 7 and 9-11. The apparatus necessitates method steps. Remaining limitations are rejected similarly. See detailed analysis above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Raphael Schwartz whose telephone number is (571)270-3822. The examiner can normally be reached Monday to Friday 9am-5pm CT.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RAPHAEL SCHWARTZ/
Examiner, Art Unit 2671
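For context on the technology at issue: the step the rejection repeatedly maps to Li's Section 3.3 is back-projecting 3D reference points into each camera view through a per-camera projection matrix and keeping only the "hit" views. The sketch below illustrates that general technique; it is a minimal reconstruction under assumed names and shapes (project_points, lidar2img, a fixed square region size) and is not code from the application, the office action, or the BEVFormer reference.

```python
# Illustrative sketch of back-projecting 3D reference points into multi-camera
# views, the step the rejection maps to spatial cross-attention (Li, Sec. 3.3).
# All names, shapes, and the region size are assumptions made for exposition.
import numpy as np

def project_points(points_3d, lidar2img):
    """Project Nx3 ego-frame points with an assumed 4x4 projection matrix.

    Returns Nx2 pixel coordinates and a boolean "hit" mask marking points
    that land in front of the camera (positive depth).
    """
    n = points_3d.shape[0]
    homog = np.hstack([points_3d, np.ones((n, 1))])          # N x 4 homogeneous
    cam = homog @ lidar2img.T                                # N x 4 image-space
    depth = cam[:, 2]
    hit = depth > 1e-5                                       # in front of camera
    pix = cam[:, :2] / np.clip(depth[:, None], 1e-5, None)   # perspective divide
    return pix, hit

def regions_of_interest(points_3d, lidar2img_per_cam, image_hw, half_size=32):
    """For each camera, keep projections that fall inside the image and return
    a square region around each hit (a stand-in for the attention sampling
    neighbourhood around projected reference points)."""
    h, w = image_hw
    rois = {}
    for cam_id, mat in enumerate(lidar2img_per_cam):
        pix, hit = project_points(points_3d, mat)
        inside = (hit & (pix[:, 0] >= 0) & (pix[:, 0] < w)
                      & (pix[:, 1] >= 0) & (pix[:, 1] < h))
        rois[cam_id] = [
            (float(x - half_size), float(y - half_size),
             float(x + half_size), float(y + half_size))
            for x, y in pix[inside]
        ]
    return rois
```

In BEVFormer itself the analogue of these explicit crops is deformable-attention sampling around the projected reference points in the hit views, whereas claim 1 as characterized in the rejection recites determining a second detection from the regions themselves.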

Prosecution Timeline

Mar 15, 2024: Application Filed
Mar 30, 2026: Examiner Interview (Telephonic)
Mar 30, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications with similar technology granted by this examiner

Patent 12597128
ASSESSMENT OF SKIN TOXICITY IN AN IN VITRO TISSUE SAMPLES USING DEEP LEARNING
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12592063
MACHINE LEARNING OF SPATIO-TEMPORAL MANIFOLDS FOR SOURCE-FREE VIDEO DOMAIN ADAPTATION
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12579642
Methods, Systems, and Apparatuses for Quantitative Analysis of Heterogeneous Biomarker Distribution
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12548289
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM
Granted Feb 10, 2026 (2y 5m to grant)
Patent 12548179
FUNCTIONAL EVALUATION SYSTEM OF HIPPOCAMPUS AND DATA CREATION METHOD
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 67%
With Interview: 98% (+31.3%)
Median Time to Grant: 2y 11m
PTA Risk: Low

Based on 338 resolved cases by this examiner. Grant probability derived from career allow rate.
