DETAILED ACTION
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li (“BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers”).
Regarding claim 1, Li discloses an apparatus for object detection, the apparatus comprising: (Pg. 2, second paragraph from bottom, “transformer-based bird’s-eye-view (BEV) encoder, termed BEVFormer, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features. The BEV features generated from the BEVFormer can simultaneously support multiple 3D perception tasks such as 3D object detection and map segmentation, which is valuable for the autonomous driving system” Tables 5 and 6 teach using a GPU apparatus.)
at least one memory; and at least one processor coupled to the at least one memory and configured to: (Tables 5 and 6 teach using a GPU processor and memory.)
extract, using an encoder, a plurality of features from one or more images of an environment of the apparatus; (See Fig. 2 and encoder layers processing multi-view input. Pg. 4, last paragraph, “We feed multi-camera images to the backbone network (e.g., ResNet-101[15]), and obtain the features . . . of different camera views”)
determine, based on the plurality of features, a first detection of one or more objects and three-dimensional (3D) coordinates for the one or more objects; (3D BEV representation is generated and updated on the basis of the input image features, see Figs. 1 and 2 and see pg. 5, “3.3 Spatial Cross-Attention.” The BEV provides a unified environment representation which aggregates the multi-camera views to detect objects surrounding the vehicle.)
back-project the 3D coordinates of the one or more objects onto the one or more images; (Pg. 5, Section 3.3 Spatial Cross-Attention teaches taking the generated 3D representation and back projecting reference points to the different 2D image views via a projection matrix for each camera.)
determine one or more regions of at least one first image of the one or more images based on the back-projection of the 3D coordinates of the one or more objects; and (Pg. 5, Section 3.3 ¶ 1, “we develop the spatial cross-attention based on deformable attention, which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views.” Also see Fig. 2 and Pg. 5, Section 3.3, ¶ 2 which shows that each BEV query only interacts with image features in the region of interest at the reference points in the hit views.)
determine, based on the one or more regions of the at least one first image, a second detection of the one or more objects. (After the spatial and temporal cross attention updates the BEV as described above, the process continues to a 3D detection head network as seen at the top of Fig. 2, and at Section 3.5 as well as on Pg. 16, Section “Detection Head”.)
Li does not expressly disclose that all of its above-cited teachings on multi-view 3D object detection occur in the same embodiment. That is, although the reference clearly discloses each of these functions, it does not expressly state that all of the details are found in a single embodiment. For example, the reference teaches using computer hardware in Tables 5 and 6 for model training and model performance, but this disclosure relates to Section 4.2 Experimental Settings and its implementation of ‘BEVFormer-S’, which has slight variations on the BEVFormer architecture described in Section 3. There is no express disclosure that the Section 3 system is used with a GPU and memory. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the various teachings to provide a single system capable of performing the computational tasks on a processor and memory. In view of these teachings, the claimed invention cannot be considered a non-obvious improvement over the prior art. The combination uses known engineering design; no “fundamental” operating principle of the teachings is changed, as they continue to perform the same functions as originally taught prior to being combined.
Regarding claim 2, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to downsample the one or more images of the environment to produce one or more downsampled images, wherein the plurality of features are extracted from the one or more downsampled images. (The backbone network via ResNet-101 shown in Fig. 2 and pg. 4, last paragraph downsamples the images for feature extraction. The deformable attention step at section 3.3 also downsamples the images.)
Regarding claim 3, the above combination discloses the apparatus of claim 2, wherein the one or more images have a higher resolution than the one or more downsampled images. (As above, the backbone network via ResNet-101 shown in Fig. 2 and pg. 4, last paragraph downsamples the images for feature extraction. The deformable attention step at section 3.3 also downsamples the images. Downsampling is the process of reducing image resolution.)
Regarding claim 4, the above combination discloses the apparatus of claim 2, wherein the one or more images include a larger number of images than the one or more downsampled images. (The deformable attention step at section 3.3 downsamples the images and only operates on the features in the ‘hit’ views.)
Regarding claim 5, the above combination discloses the apparatus of claim 1, wherein the one or more images are two-dimensional images. (See Fig. 2 regarding the 2D camera views.)
Regarding claim 6, the above combination discloses the apparatus of claim 1, further comprising one or more camera sensors, wherein the one or more camera sensors are configured to obtain the one or more images of the environment of the apparatus. (See Fig. 2 regarding the multiple 2D cameras.)
Regarding claim 7, the above combination discloses the apparatus of claim 6, wherein the at least one processor is configured to determine a subset of camera sensors of the one or more camera sensors for the one or more regions of the at least one first image based on at least one of: the subset of camera sensors having views within which the one or more objects are more centrally located than within one or more views of one or more other camera sensors, the subset of camera sensors having views where the one or more objects are least occluded as compared to views of other camera sensors of the one or more camera sensors, or machine learning training for selecting the subset of camera sensors. (As above, Fig. 2 and Pg. 5, Section 3.3, ¶ 2 show that each BEV query only interacts with image features in the region of interest at the reference points in the hit views, without occlusion.)
Regarding claim 8, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on the one or more regions being processed individually. (As above, Fig. 2 and Pg. 5, Section 3.3, ¶ 2 show that each BEV query only interacts with and processes individual images that have hit views.)
Regarding claim 9, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on at least portions of the one or more regions being processed as a single composite region comprising the at least portions of the one or more regions. (The multi-view regions are processed as a composite region in the spatial cross-attention step to aggregate the spatial features from multi-camera images, see pg. 2, second paragraph from bottom and pg. 5, 3.3 Spatial Cross-Attention.)
Regarding claim 10, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to determine the second detection of the one or more objects further based on the one or more regions being processed with one or more cross-attention layers of a transformer neural network applied to the one or more regions. (As above, the multi-view regions are processed as a composite region in the spatial cross-attention transformer neural network step to aggregate the spatial features from multi-camera images, see pg. 2, second paragraph from bottom and pg. 5, 3.3 Spatial Cross-Attention.)
Regarding claim 11, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to project the plurality of features to a bird’s eye view (BEV). (As above, see Fig. 2.)
Regarding claim 12, the above combination discloses the apparatus of claim 1, wherein the 3D coordinates are world coordinates. (See pg. 5, section 3.3, ¶ 3.)
Regarding claim 13, the above combination discloses the apparatus of claim 1, wherein the apparatus is a vehicle or a computing device of the vehicle. (See Fig. 1 and rejection of claim 1.)
Regarding claim 14, the above combination discloses the apparatus of claim 1, wherein the at least one processor is configured to, for each region of the one or more regions, extract one or more patches of sensor data or one or more patches of features of the plurality of features. (As above, Pg. 5, Section 3.3 ¶ 1, “we develop the spatial cross-attention based on deformable attention, which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views.” Also see Fig. 2 and Pg. 5, Section 3.3, ¶ 2 which shows that each BEV query only interacts with image features in the region of interest at the reference points in the hit views.)
Claims 15-20 are the method claims corresponding to the apparatus of claims 1, 2, 7, and 9-11. Operation of the apparatus necessarily performs the corresponding method steps, and the remaining limitations are rejected for the same reasons. See the detailed analysis above.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Raphael Schwartz whose telephone number is (571)270-3822. The examiner can normally be reached Monday to Friday 9am-5pm CT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RAPHAEL SCHWARTZ/ Examiner, Art Unit 2671