DETAILED ACTION
This action is in response to the application filed on November 9, 2023. Claims 1-29 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
Claim 10 reads “wherein the probability of skipping the layer is greater when the layer corresponds to one or more objects and the probability of skipping the layer is smaller when the layer corresponds to a background region.” This is consistent with paragraph [0066] of the specification (“The probability of skipping the layer may be greater when the layer corresponds to one or more objects and the probability of skipping the layer may be smaller when the layer corresponds to a background region”), but is inconsistent with paragraphs [0005] (“The probability of skipping a given layer may depend on the location of one or more objects corresponding to the layer relative to the camera. The probability of skipping a layer corresponding to an important object may be lower than the probability of skipping a layer corresponding to background objects.”), [0086] (“The probability of skipping a given layer may be based on the size and location of an objects in a scene captured by camera images 202, with smaller probabilities assigned to layers that correspond to regions of an image containing objects and higher probabilities assigned to layers corresponding to background regions.”), and [0106] (“The probability of skipping a given channel would be based on a location and size of the objects in the scene, with smaller probabilities assigned to channels that correspond to regions of the BEV feature map containing objects and higher probabilities assigned to channels corresponding to background regions.”). For the purposes of examination, the claim will be interpreted as written.
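For illustration, the relationship described in the specification (paragraphs [0005], [0086], and [0106]) can be sketched as follows; the function name and probability values are hypothetical and chosen only to show the stated ordering (smaller skip probability for layers corresponding to object regions, larger for background regions):

```python
# Hypothetical illustration only: skip probabilities per the specification
# (paragraphs [0005], [0086], [0106]), where layers corresponding to object
# regions receive a SMALLER probability of being skipped than layers
# corresponding to background regions. The values 0.1 and 0.8 are invented.

def skip_probability(region_type):
    """Return an illustrative probability of skipping a layer."""
    if region_type == "object":
        return 0.1   # object regions: the layer is rarely skipped
    if region_type == "background":
        return 0.8   # background regions: the layer is often skipped
    raise ValueError("unknown region type: %r" % region_type)
```

Under claim 10 as written, this ordering would be reversed (a greater skip probability for object layers), which is the inconsistency noted above.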
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 29 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because the broadest reasonable interpretation of “a computer readable medium” includes signals and/or carrier waves. This is confirmed in paragraph [0159]: “Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.” This rejection can be overcome by amending the claim to recite a “non-transitory” computer-readable medium.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 8, 11-14, 16, 23, and 26-29 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US20200160559 (hereinafter referred to by its primary author, Urtasun).
In regards to claim 1, Urtasun teaches an apparatus for processing image data, the apparatus comprising: a memory for storing the image data (Urtasun Paragraph [0123] “The computing system 810 includes one or more processors 812 and a memory 814.”), wherein the image data comprises a first set of image data collected by a first camera comprising a first field of view (FOV) and a second set of image data collected by a second camera comprising a second FOV (Urtasun Paragraph [0053] “The one or more autonomy system sensors 114 can be configured to generate and/or store data including the autonomy sensor data 116 associated with one or more objects that are proximate to the vehicle 102 (e.g., within range or a field of view of one or more of the one or more sensors 114). The one or more autonomy system sensors 114 can include a Light Detection and Ranging (LIDAR) system, a Radio Detection and Ranging (RADAR) system, one or more cameras (e.g., visible spectrum cameras and/or infrared cameras), motion sensors, and/or other types of imaging capture devices and/or sensors.”; Paragraph [0105] “At 702, method 700 can include obtaining/receiving LIDAR point cloud data associated with an environment surrounding an autonomous vehicle. In some implementations, the LIDAR point cloud data obtained at 702 can correspond to sensor data 116 of FIG. 1.”; Paragraph [0106] “At 704, method 700 can include obtaining/receiving image data associated with the environment surrounding an autonomous vehicle. In some implementations, the image data obtained at 704 can correspond to sensor data 116 of FIG. 1.”); and processing circuitry in communication with the memory (Urtasun Paragraph [0123] “The computing system 810 includes one or more processors 812 and a memory 814.”), wherein the processing circuitry is configured to: apply an encoder to extract, from the first set of image data based on a location of a first one or more objects within the first FOV, a first set of perspective view features (Urtasun Paragraph [0109] “At 712, method 700 can include fusing image features from image data (e.g., image data obtained at 704) with LIDAR features from the BEV representation of the LIDAR data.”); apply the encoder to extract, from the second set of image data based on a location of a second one or more objects within the second FOV, a second set of perspective view features (Urtasun Paragraph [0109] “At 712, method 700 can include fusing image features from image data (e.g., image data obtained at 704) with LIDAR features from the BEV representation of the LIDAR data.”); and project the first set of perspective view features and the second set of perspective view features onto a grid to generate a set of bird’s eye view (BEV) features that provides information corresponding to the first one or more objects and the second one or more objects (Urtasun Paragraph [0109] “At 712, method 700 can include fusing image features from image data (e.g., image data obtained at 704) with LIDAR features from the BEV representation of the LIDAR data. In some implementations, fusing at 712 can include executing one or more continuous convolutions to fuse image features from a first data stream with LIDAR features from a second data stream. In some implementations, executing one or more continuous convolutions as part of fusing at 712 can be implemented as part of a machine-learned sensor fusion model or sensor fusion layers.” Examiner note: The features from the image data captured in step 704 are fused with the LIDAR features to form a single BEV representation. This is done continuously.).
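The projection of multi-camera perspective-view features onto a shared BEV grid, as recited in claim 1, can be sketched generically as follows. This is an illustrative sketch only, not Urtasun's implementation; the function name, grid size, and the assumption of known ground-plane coordinates are hypothetical:

```python
import numpy as np

# Generic sketch (not Urtasun's implementation): accumulate per-camera
# perspective-view features into a shared bird's-eye-view (BEV) grid.
# Each feature vector is assumed to have a known (x, y) ground-plane
# location in meters; features from all cameras share one grid.

def project_to_bev(features, xy, grid_size=8, cell=1.0):
    """features: (N, C) array; xy: (N, 2) ground-plane coordinates."""
    bev = np.zeros((grid_size, grid_size, features.shape[1]))
    ix = np.clip((xy[:, 0] / cell).astype(int), 0, grid_size - 1)
    iy = np.clip((xy[:, 1] / cell).astype(int), 0, grid_size - 1)
    np.add.at(bev, (iy, ix), features)  # sum features falling in each cell
    return bev

# Features from a first and a second camera projected onto the same grid:
cam1_feats = np.ones((3, 4))
cam1_xy = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5]])
cam2_feats = 2.0 * np.ones((2, 4))
cam2_xy = np.array([[2.5, 2.5], [0.5, 0.5]])
bev = project_to_bev(np.vstack([cam1_feats, cam2_feats]),
                     np.vstack([cam1_xy, cam2_xy]))
```

Cells covered by both cameras (here cell (0, 0)) end up holding information from the first and second sets of perspective view features together.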
In regards to claim 8, Urtasun teaches the apparatus of claim 1, wherein the memory is further configured to store a set of training data comprising a plurality of sets of training image data, and wherein the processing circuitry is further configured to train, based on the plurality of sets of training image data, the encoder (Urtasun Paragraph [0038] “By contrast, the disclosed technology, through use of machine-learned models, can train a model on training data, which can be done at a scale proportional to the available resources of the training system (e.g., a massive scale of training data can be used to train the machine-learned model). Further, the machine-learned models can easily be revised as new training data is made available. As such, use of a machine-learned model trained on labeled sensor data can provide a scalable and customizable solution.”), wherein the encoder represents a residual network comprising a set of layers (Urtasun Paragraph [0072] “Various different network architectures can be used. As one example, the backbone models 202 and 204 can follow a two-stream architecture to process multi-sensor data. Specifically, for the image backbone model 204, in some implementations, a pre-trained ResNet-18 can be used until the fourth convolutional block. Each block can contain 2 residual layers with number of feature maps increasing from 64 to 512 linearly.”).
In regards to claim 11, Urtasun teaches the apparatus of claim 1, wherein the processing circuitry is further configured to apply a decoder to generate an output based on the set of BEV features (Urtasun Paragraph [0111] “At 716, method 700 can include detecting three-dimensional objects of interest based on the fused ROI crops generated at 714. In some implementations, detecting objects of interest at 716 can include providing the feature map generated at 714 as input to a machine-learned refinement model.” Examiner note: The output in this scenario is the one or more three-dimensional objects of interest).
In regards to claim 12, Urtasun teaches the apparatus of claim 11, wherein the processing circuitry is further configured to use the output generated by the decoder to control a device based on the first one or more objects and the second one or more objects (Urtasun Paragraph [0112] “At 718, method 700 can include determining a motion plan based on the object detections determined by the machine-learned detector model at 716.”; Paragraph [0113] “At 720, method 700 can include controlling motion of an autonomous vehicle (e.g., vehicle 102 of FIG. 1) based at least in part on the motion plan determined at 718.”).
In regards to claim 13, Urtasun teaches the apparatus of claim 12, wherein the device is a vehicle, wherein to apply the decoder to generate the output based on the set of BEV features, the processing circuitry is configured to cause the decoder to generate the output to include information identifying one or more characteristics corresponding to each object of the first one or more objects and the second one or more objects (Urtasun Paragraph [0111] “At 716, method 700 can include detecting three-dimensional objects of interest based on the fused ROI crops generated at 714. In some implementations, detecting objects of interest at 716 can include providing the feature map generated at 714 as input to a machine-learned refinement model.”), and wherein to use the output generated by the decoder to control the vehicle based on the first one or more objects and the second one or more objects, the processing circuitry is configured to use the output generated by the decoder to control the vehicle based on the one or more characteristics corresponding to each object of the first one or more objects and the second one or more objects (Urtasun Paragraph [0112] “At 718, method 700 can include determining a motion plan based on the object detections determined by the machine-learned detector model at 716.”; Paragraph [0113] “At 720, method 700 can include controlling motion of an autonomous vehicle (e.g., vehicle 102 of FIG. 1) based at least in part on the motion plan determined at 718.”).
In regards to claim 14, Urtasun teaches the apparatus of claim 13, wherein the one or more characteristics corresponding to each object of the first one or more objects and the second one or more objects may include an identity of the object, a location of the object relative to the vehicle, one or more characteristics of a movement of the object, one or more actions performed by the object, or any combination thereof (Urtasun Paragraph [0111] “At 716, method 700 can include detecting three-dimensional objects of interest based on the fused ROI crops generated at 714. In some implementations, detecting objects of interest at 716 can include providing the feature map generated at 714 as input to a machine-learned refinement model. In response to receiving the feature map, the machine-learned refinement model can be trained to generate as output a plurality of detections corresponding to identified objects of interest within the feature map. In some implementations, detecting objects of interest at 716 can include determining a plurality of object classifications and/or bounding shapes corresponding to the detected objects of interest.”).
In regards to claim 16, Urtasun anticipates the claim language as in the consideration of claim 1.
In regards to claim 23, Urtasun anticipates the claim language as in the consideration of claim 8.
In regards to claim 26, Urtasun anticipates the claim language as in the consideration of claim 11.
In regards to claim 27, Urtasun anticipates the claim language as in the consideration of claim 12.
In regards to claim 28, Urtasun anticipates the claim language as in the consideration of claim 13.
In regards to claim 29, Urtasun teaches a computer-readable medium storing instructions that, when applied by processing circuitry, cause the processing circuitry to apply the method of claim 16 (Urtasun Paragraph [0051] “For instance, the one or more computing devices of the vehicle computing system 112 can include one or more processors and one or more tangible, non-transitory, computer readable media (e.g., memory devices). The one or more tangible, non-transitory, computer readable media can store instructions that when executed by the one or more processors cause the vehicle 102 (e.g., its computing system, one or more processors, and other devices in the vehicle 102) to perform operations and functions, including those described herein.”) and anticipates the remaining claim language as in the consideration of claims 1 and 16.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2, 9, 17, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Urtasun in view of “Deep Networks with Stochastic Depth” (hereinafter referred to by its primary author, Huang).
In regards to claim 2, Urtasun teaches the apparatus of claim 1, but fails to teach wherein to apply the encoder to extract the first set of perspective view features, the processing circuitry is configured to use a first stochastic depth scaling function to extract the first set of perspective view features based on a distance of each object of the first one or more objects from the first camera, and wherein to apply the encoder to extract the second set of perspective view features, the processing circuitry is configured to use a second stochastic depth scaling function to extract the second set of perspective view features based on a distance of each object of the second one or more objects from the second camera.
However, Huang teaches wherein to apply the encoder to extract the first set of perspective view features, the processing circuitry is configured to use a first stochastic depth scaling function to extract the first set of perspective view features based on a distance of each object of the first one or more objects from the first camera (Huang Figure 2; Section 3 “See Fig. 2 for a schematic illustration. The linearly decaying survival probability originates from our intuition that the earlier layers extract low-level features that will be used by later layers and should therefore be more reliably present” Examiner note: As best understood by the examiner, a stochastic depth scaling function is created during training within the neural network and applied to the corresponding set of image data to extract features. The features used in step 712 could be extracted from the architecture shown in figure 2, which includes a stochastic depth scaling function. These features would be based on a distance of the object from the camera, since the size and location of the object in the image depend on the distance of the object to the camera, and the size and location of the object are used when extracting features.), and wherein to apply the encoder to extract the second set of perspective view features, the processing circuitry is configured to use a second stochastic depth scaling function to extract the second set of perspective view features based on a distance of each object of the second one or more objects from the second camera (Huang Figure 2; Section 3 “See Fig. 2 for a schematic illustration. The linearly decaying survival probability originates from our intuition that the earlier layers extract low-level features that will be used by later layers and should therefore be more reliably present” Examiner note: Since the probabilities of skipping a layer are independent for any given input, the stochastic function used for the first image and the stochastic function used for the second image could be different, given that their random outcomes are different.).
Huang is considered to be analogous to the claimed invention because they are both in the same field of neural networks with stochastic depth during training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system of Urtasun to include the teachings of Huang, to provide the advantage of faster training time (Huang Abstract “To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation.”).
In regards to claim 9, Urtasun in view of Huang teaches the apparatus of claim 8, wherein to train the encoder, the processing circuitry is configured to: perform a plurality of training iterations using the plurality of sets of training image data (Huang Section 3 Training time savings “When a ResBlock is bypassed for a specific iteration, there is no need to perform forward-backward computation or gradient updates.”), wherein during each training iteration of the plurality of training iterations, the processing circuitry is configured to cause the residual network to skip one or more layers of the set of layers (Huang Section 3 “Learning with stochastic depth is based on a simple intuition. To reduce the effective length of a neural network during training, we randomly skip layers entirely.”); and insert the skipped layers of the set of layers into the residual network to complete training of the encoder (Huang Section 3 Expected network depth “In other words, with stochastic depth, we train ResNets with an average number of 40 ResBlocks, but recover a ResNet with 54 blocks at test time”).
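Huang's stochastic depth scheme quoted above (a linearly decaying survival probability, random layer skipping during training, and the full-depth network recovered at test time) can be sketched as follows. The helper names are illustrative, but the survival rule and the expected depth of roughly 40 active blocks out of 54 follow directly from the quoted passages:

```python
import random

# Sketch of Huang's stochastic depth: residual block l of L survives with a
# linearly decaying probability p_l = 1 - (l / L) * (1 - p_L). During
# training, a non-surviving block is bypassed by the identity shortcut; at
# test time every block is kept. Helper names are illustrative.

def survival_prob(l, num_blocks=54, p_last=0.5):
    return 1.0 - (l / num_blocks) * (1.0 - p_last)

def forward(x, blocks, training=True):
    """blocks: callables implementing the residual functions f_l."""
    for l, f in enumerate(blocks, start=1):
        if training and random.random() > survival_prob(l, len(blocks)):
            continue      # block skipped: identity shortcut only
        x = x + f(x)      # residual update x <- x + f_l(x)
    return x

# The expected network depth is the sum of survival probabilities; with
# L = 54 and p_L = 0.5 this is 40.25, matching the "average number of 40
# ResBlocks" recovered as 54 blocks at test time in the passage quoted above.
expected_depth = sum(survival_prob(l) for l in range(1, 55))
```

At test time (`training=False`) no block is skipped, which corresponds to inserting the skipped layers back into the residual network to complete training, as recited in claim 9.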
In regards to claim 17, Urtasun in view of Huang renders obvious the claim language as in the consideration of claim 2.
In regards to claim 24, Urtasun in view of Huang renders obvious the claim language as in the consideration of claim 9.
Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Urtasun in view of Huang as applied to the claims above, and further in view of “Towards Surround Stereo Vision: Analysis of a New Surround View Camera Configuration for Driving Assistance Applications” (hereinafter referred to by its primary author, Esparza).
In regards to claim 3, Urtasun in view of Huang teaches generate a first set of input perspective view features based on the first set of image data; generate a second set of input perspective view features based on the second set of image data (Urtasun Paragraph [0109] “At 712, method 700 can include fusing image features from image data (e.g., image data obtained at 704) with LIDAR features from the BEV representation of the LIDAR data.”; Paragraph [0053] “The one or more autonomy system sensors 114 can be configured to generate and/or store data including the autonomy sensor data 116 associated with one or more objects that are proximate to the vehicle 102 (e.g., within range or a field of view of one or more of the one or more sensors 114). The one or more autonomy system sensors 114 can include a Light Detection and Ranging (LIDAR) system, a Radio Detection and Ranging (RADAR) system, one or more cameras (e.g., visible spectrum cameras and/or infrared cameras), motion sensors, and/or other types of imaging capture devices and/or sensors.”) but fails to teach determine a first camera pose corresponding to the first camera; calculate, based on the first camera pose and the first FOV, a first FOV mask corresponding to the first camera; determine a second camera pose corresponding to the second camera; calculate, based on the second camera pose and the second FOV, a second FOV mask corresponding to the second camera; extract the first set of perspective view features based on the first set of input perspective view features, the first stochastic depth scaling function, and the first FOV mask; and extract the second set of perspective view features based on the second set of input perspective view features, the second stochastic depth scaling function, and the second FOV mask.
However, Esparza teaches determine a first camera pose corresponding to the first camera (Esparza Figure 1 C0); calculate, based on the first camera pose and the first FOV, a first FOV mask corresponding to the first camera (Esparza Figure 1 C0 field of view. Examiner note: This figure shows the mask applied to the surrounding area of the car. Each of the cameras C0 through C8 has its own FOV mask; areas where masks overlap are shown in green, while areas covered by one mask or none are shown in red); determine a second camera pose corresponding to the second camera (Esparza Figure 1 C1); calculate, based on the second camera pose and the second FOV, a second FOV mask corresponding to the second camera (Esparza Figure 1 C1 field of view); extract the first set of perspective view features based on the first set of input perspective view features, the first stochastic depth scaling function, and the first FOV mask; and extract the second set of perspective view features based on the second set of input perspective view features, the second stochastic depth scaling function, and the second FOV mask (Esparza Figure 2. Examiner note: This figure shows the matching feature points between each of the cameras. Each panel shows the matching of two sets of features from two cameras. The first set of input perspective view features is analogous to features found in, for example, the top left picture of figure 2, while the second set of input perspective view features is analogous to features found in, for example, the top right picture of figure 2).
Esparza is considered to be analogous to the claimed invention because they are both in the same field of combining data from multiple vehicle cameras into a single feature map to assist drivers. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system of Urtasun in view of Huang to include the teachings of Esparza, to provide the advantage of increasing the amount of overlap between cameras, allowing for 3D measurements (Esparza section VII “An extension has been proposed to existing automotive surround view systems, where 8 instead of 4 fisheye cameras have been considered. The criteria to mount the cameras was not purely the visualization of the surrounding of the vehicle, but the overlapping on the fields of view. It has been shown that 3D measurements are possible in such a configuration without time accumulation and they compare well to those provided by a reference lidar sensor.”).
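The notion of an FOV mask derived from a camera pose can be sketched generically as follows. This is a hypothetical illustration, not Esparza's formulation: a BEV grid cell is taken to belong to a camera's mask when the bearing from the camera to the cell center lies within half the horizontal FOV of the camera's viewing direction.

```python
import math

# Hypothetical sketch (not Esparza's formulation): a boolean FOV mask over
# a BEV grid, computed from a camera pose. A cell is inside the mask when
# the bearing from the camera to the cell center lies within half of the
# camera's horizontal FOV of its yaw (viewing) direction.

def fov_mask(cam_xy, cam_yaw, fov_deg, grid_size=8, cell=1.0):
    half = math.radians(fov_deg) / 2.0
    mask = [[False] * grid_size for _ in range(grid_size)]
    for iy in range(grid_size):
        for ix in range(grid_size):
            cx, cy = (ix + 0.5) * cell, (iy + 0.5) * cell
            bearing = math.atan2(cy - cam_xy[1], cx - cam_xy[0])
            # smallest signed angle between the bearing and the yaw
            diff = (bearing - cam_yaw + math.pi) % (2 * math.pi) - math.pi
            mask[iy][ix] = abs(diff) <= half
    return mask
```

Cells where the masks of two cameras are both true would correspond to the overlap regions shown in green in Esparza Figure 1.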
In regards to claim 18, Urtasun in view of Huang and Esparza renders obvious the claim language as in the consideration of claim 3.
Claims 5-6, 15, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Urtasun in view of Esparza.
In regards to claim 5, Urtasun teaches the apparatus of claim 1, wherein the image data includes data from one or more cameras that is projected onto the grid to generate the set of BEV features (Urtasun Paragraph [0053] “The one or more autonomy system sensors 114 can be configured to generate and/or store data including the autonomy sensor data 116 associated with one or more objects that are proximate to the vehicle 102 (e.g., within range or a field of view of one or more of the one or more sensors 114). The one or more autonomy system sensors 114 can include a Light Detection and Ranging (LIDAR) system, a Radio Detection and Ranging (RADAR) system, one or more cameras (e.g., visible spectrum cameras and/or infrared cameras), motion sensors, and/or other types of imaging capture devices and/or sensors.”; Paragraph [0109] “At 712, method 700 can include fusing image features from image data (e.g., image data obtained at 704) with LIDAR features from the BEV representation of the LIDAR data. In some implementations, fusing at 712 can include executing one or more continuous convolutions to fuse image features from a first data stream with LIDAR features from a second data stream. In some implementations, executing one or more continuous convolutions as part of fusing at 712 can be implemented as part of a machine-learned sensor fusion model or sensor fusion layers.”).
Urtasun fails to teach wherein the image data further comprises a third set of image data collected by a third camera comprising a third FOV and a fourth set of image data collected by a fourth camera comprising a fourth FOV, and wherein the processing circuitry is further configured to: apply the encoder to extract, from the third set of image data based on a location of a third one or more objects within the third FOV, a third set of perspective view features; apply the encoder to extract, from the fourth set of image data based on a location of a fourth one or more objects within the fourth FOV, a fourth set of perspective view features; and project the third set of perspective view features and the fourth set of perspective view features onto the grid to generate the set of BEV features.
However, Esparza teaches wherein the image data further comprises a third set of image data collected by a third camera comprising a third FOV and a fourth set of image data collected by a fourth camera comprising a fourth FOV, and wherein the processing circuitry is further configured to: apply the encoder to extract, from the third set of image data based on a location of a third one or more objects within the third FOV, a third set of perspective view features; apply the encoder to extract, from the fourth set of image data based on a location of a fourth one or more objects within the fourth FOV, a fourth set of perspective view features; and project the third set of perspective view features and the fourth set of perspective view features onto the grid to generate the set of BEV features (Esparza Figure 1 Examiner note: This reference uses 8 cameras to extract surround image data).
Esparza is considered to be analogous to the claimed invention because they are both in the same field of combining data from multiple vehicle cameras into a single feature map to assist drivers. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system of Urtasun to include the teachings of Esparza, to provide the advantage of increasing the amount of overlap between cameras, allowing for 3D measurements (Esparza section VII “An extension has been proposed to existing automotive surround view systems, where 8 instead of 4 fisheye cameras have been considered. The criteria to mount the cameras was not purely the visualization of the surrounding of the vehicle, but the overlapping on the fields of view. It has been shown that 3D measurements are possible in such a configuration without time accumulation and they compare well to those provided by a reference lidar sensor.”).
In regards to claim 6, Urtasun in view of Esparza teaches the apparatus of claim 1, wherein the first FOV overlaps with the second FOV, and wherein to apply the encoder to extract the first set of perspective view features and extract the second set of perspective view features, the processing circuitry is configured to capture information in the first set of perspective view features and the second set of perspective view features corresponding to one or more objects of the first one or more objects and the second one or more objects located within both of the first FOV and the second FOV (Esparza Figures 1 and 2 Examiner note: Figure 1 shows the overlap between cameras, and figure 2 shows the matching of features found in camera pairs).
In regards to claim 15, Urtasun in view of Esparza teaches the apparatus of claim 1, wherein the processing circuitry is part of an advanced driver assistance system (ADAS) (Esparza Section I “Recently, automotive surround view systems have become increasingly popular. These are designed to display the very near range surrounding to the driver in order to prevent collisions during parking and maneuvering [2].”).
In regards to claim 20, Urtasun in view of Esparza renders obvious the claim language as in the consideration of claim 5.
In regards to claim 21, Urtasun in view of Esparza renders obvious the claim language as in the consideration of claim 6.
Claims 7 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Urtasun in view of Huang as applied to the claims above, and further in view of US20190028688 (hereinafter referred to by its primary author, Kumar).
In regards to claim 7, Urtasun in view of Huang teaches the apparatus of claim 1, wherein to project the first set of perspective view features and the second set of perspective view features onto the grid to generate the set of BEV features, the processing circuitry is configured to: perform one or more stochastic drop actions to drop one or more features of the first set of perspective view features and the second set of perspective view features (Huang Figure 2. Examiner note: In this figure, the drop action is analogous to an “inactive” ResBlock, as can be seen with respect to f3.).
Urtasun in view of Huang fails to teach transform, using a least squares operation, the first set of perspective view features and the second set of perspective view features into the set of BEV features.
However, Kumar teaches transform, using a least squares operation, the first set of perspective view features and the second set of perspective view features into the set of BEV features (Kumar Figure 4 405-410; Paragraph [0046] “Methods 401 continue at operation 410 where 3D triangulation is performed based on the feature matches. Given the tracks computed from feature matching, any suitable least squares solver may be used to arrive at an initial estimate of a 3D scene point corresponding to each track.”).
Kumar is considered to be analogous to the claimed invention because both references are in the same field of combining data from multiple cameras into a single feature map. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system of Urtasun in view of Huang to include the teachings of Kumar, to provide the advantage of increased feature matching accuracy (Kumar Paragraph [0048] “Given a current estimate of relative pose between two cameras, a match for every feature in a first image can be sought along the corresponding epipolar line on the second image with a benefit being that features previously rejected (e.g., due to a ratio test) may be selected because the epipolar line is highly likely to pass through only the best match.”).
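For illustration only, the least-squares triangulation described by Kumar at Paragraph [0046] can be sketched as a standard direct linear transformation (DLT) solver. The function name, camera matrices, and helper below are hypothetical and are not taken from Kumar; this is a minimal sketch of solving for a 3D scene point from a matched feature pair via linear least squares:

```python
import numpy as np

def triangulate_least_squares(P1, P2, x1, x2):
    """Estimate a 3D point from one matched feature pair via linear
    least squares (DLT). P1, P2 are 3x4 camera projection matrices;
    x1, x2 are the matched 2D image points."""
    # Each observation contributes two homogeneous linear equations
    # in the unknown point X = (X, Y, Z, W).
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous least-squares solution: the right singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With two hypothetical cameras (an identity camera and one translated along the x-axis), projecting a known 3D point into both views and triangulating recovers the original point, consistent with the "initial estimate of a 3D scene point corresponding to each track" of Kumar.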
In regards to claim 22, Urtasun in view of Huang and Kumar renders obvious the claim language as in the consideration of claim 7.
Allowable Subject Matter
Claims 4, 10, 19, and 25 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
In regards to claim 4, the current prior art of record, alone or in combination, fails to teach “multiply, for each spatial location of a set of spatial locations, the first set of input perspective view features, the first stochastic depth scaling function, and the first FOV mask”. This claim requires multiplying, at each spatial location, the input perspective view features by the first stochastic depth scaling function, which probabilistically determines whether an object is to be considered. That result is then further multiplied by an FOV mask, which determines whether an object is within the field of view of the particular camera. The prior art of record teaches using stochastic scaling only to train a machine learning model; it does not teach using the stochastic portion, together with the FOV mask, to determine the first set of perspective view features. This will be shown below with regard to the examples. Claim 19 is objected to for the same reasons as claim 4.
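For illustration only, the per-spatial-location product recited in claim 4 can be sketched as follows. The function name, array shapes, and Bernoulli gating below are the examiner's assumptions for illustration and are not the applicant's disclosed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_features(features, survival_prob, fov_mask, training=True):
    """Per-spatial-location product of (1) input perspective view
    features, (2) a stochastic depth scaling factor, and (3) an FOV mask.

    features: (H, W, C) perspective view feature map
    survival_prob: probability that the layer's output is kept
    fov_mask: (H, W) binary mask, 1 where the location lies in the camera FOV
    """
    if training:
        # Stochastic scaling: a Bernoulli gate keeps the features
        # with probability survival_prob, otherwise zeros them.
        gate = float(rng.random() < survival_prob)
    else:
        # At inference, scale by the expected value instead of sampling.
        gate = survival_prob
    # Locations outside the camera's FOV are zeroed by the mask.
    return features * gate * fov_mask[..., None]
```

Note that the prior art of record applies such a stochastic gate only during training; the claim's combination of the stochastic scaling with an FOV mask to determine the perspective view features is what the art fails to teach.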
In regards to claim 10, the current prior art of record, alone or in combination, fails to teach “the probability of skipping the layer is greater when the layer corresponds to one or more objects and the probability of skipping the layer is smaller when the layer corresponds to a background region”. The prior art of record only teaches that the probability of skipping a layer is based on a linear decay of a “survival probability” based on the current network depth (See Huang Section 3 “The survival probabilities”). Claim 25 is objected to for the same reasons as claim 10.
As a non-limiting example, a close prior art, “Deep Networks with Stochastic Depth for Acoustic Modelling,” teaches a method of using stochastic depth. In this reference, the probability that a layer is dropped is based on a linearly increasing dropout probability.
As another non-limiting example, a close prior art, “Leveraging Stochastic Depth Training for Adaptive Inference,” teaches a method of using stochastic depth for training. This reference likewise teaches that the probability of a layer being dropped increases linearly with the current depth of the network.
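Both examples above follow the linear decay rule of Huang Section 3, under which the survival probability decreases (equivalently, the drop probability increases) linearly with depth. As a one-line sketch, Huang's rule for layer l of a network with L layers, with a final survival probability p_L (Huang suggests 0.5), can be written as:

```python
def survival_prob(l, L, p_L=0.5):
    """Linearly decaying survival probability per Huang Section 3:
    p_l = 1 - (l / L) * (1 - p_L),
    so the first layer survives with probability 1 and the last with p_L."""
    return 1.0 - (l / L) * (1.0 - p_L)
```

This decay depends only on network depth, not on whether a layer corresponds to objects or to a background region, which is why the art fails to teach the limitation of claim 10.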
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CALEB LOGAN ESQUINO whose telephone number is (703)756-1462. The examiner can normally be reached M-Fr 8:00AM-4:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee, can be reached at (571) 270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CALEB L ESQUINO/ Examiner, Art Unit 2677
/ANDREW W BEE/ Supervisory Patent Examiner, Art Unit 2677