Last updated: May 29, 2026
Application No. 18/492,662
SYSTEMS AND ASSOCIATED METHODS FOR REAL-TIME FEATURE DETECTION OF AN ENVIRONMENT

Non-Final OA §103
Filed: Oct 23, 2023
Examiner: HAUSMANN, MICHELLE M
Art Unit: 2671
Tech Center: 2600 (Communications)
Assignee: Brightai Corporation
OA Round: 1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (76%) and a +21.3% interview lift. A written response may suffice.
Based on 870 resolved cases, 2023–2026
Examiner Intelligence

HAUSMANN, MICHELLE M
Career Allowance Rate: 76% (663 granted / 870 resolved), above average; +14.2% vs TC avg
Strong +21% interview lift
Interview Lift: +21.3% (resolved cases with interview vs. without)
Typical timeline: 3y 0m avg prosecution; 22 currently pending
Career history: 895 total applications across all art units
Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 94.8% (+54.8% vs TC avg)
§102: 0.6% (-39.4% vs TC avg)
§112: 0.9% (-39.1% vs TC avg)
Based on career data from 870 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions
Claims 15-19 were withdrawn from further consideration pursuant to 37 CFR 1.142(b) as being drawn to a nonelected species, there being no allowable generic or linking claim. Election was made without traverse in the reply filed on 13 March, 2026. Amendments received 13 March, 2026 include new claims 20-25. Examiner agrees claims 20-25 are the same species as claim 1 therefore these claims are also examined.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claim 1 (and by similarity claim 20 and by dependency claims 2-14 and 21-25) are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 2 of copending Application No. 18/492676 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the current application is a slightly broader version of the copending application (“aggregating” and “combination” are functionally the same, the copending application specifies the examined environment is a pipe, and “outputting an identification of the feature and a location of the feature within in the pipeline” is a slightly more specific version of the current application which claims “an output predicting the first feature present in the environment”). It is noted the claims in the copending case are much more similar than the restricted claim 15 which describes “recognizing, by the processor, a feature associated the sensor type within the interior of the pipe using deep learning algorithms operating on the collection of sensor data from the sensor type, wherein the sensor type comprises LIDAR three-dimensional point cloud data fed into a point cloud recognizer” and “creating a three-dimensional representation for the feature” and claim 15 did not have a description of aggregating the recognized first feature using a weighting algorithm as the copending case does. Thereby the restriction and double patenting rejections are not in conflict. 
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Copending Application 18/492676
A method for detecting features in an environment, comprising: receiving, by a processor, a collection of sensor data from a plurality of sensors deployed on a robot traversing a pipeline, each of the plurality of sensors associated with a sensor type, tracking, by the processor and based on data received from a position sensor of the robot, a position of the robot within the pipeline; generating, by the processor and using a plurality of deep learning algorithms, a plurality of recognizer outputs, wherein each recognizer output corresponds a respective sensor of the plurality of sensors; recognizing, by the processor and based on a weighted combination of the plurality of recognizer outputs, a feature  within the environment, mapping the feature relative to the position of the robot; and outputting an identification of the feature and a location of the feature within in the pipeline.
Current Application
A method comprising: receiving, by one or more processors, a collection of sensor data from a plurality of sensors deployed on a robot traversing an environment, each of the plurality of sensors associated with a sensor type, tracking, by the one or more processors, a position of the robot within the environment, wherein the tracking is performed using data received from a position sensor associated with the robot; recognizing, by the one or more processors, a first feature associated the sensor type within the environment, using deep learning algorithms operating on the collection of sensor data from the sensor type; mapping, by the one or more processors, the first feature to the position of the robot; aggregating, by the one or more processors, the recognized first feature using a weighting algorithm; and creating, by the one or more processors, an output predicting the first feature present in the environment.
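For orientation only, the steps the two claims share (receive, track, recognize, map, weight, output) can be sketched in a few lines of Python. Everything named here (`Detection`, `detect_feature`, the per-sensor `recognizers` and `weights`) is hypothetical and appears in neither application. The per-sensor weighted combination follows the wording of the copending claim and is only one possible reading of the current claim's "weighting algorithm".

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical type for illustration: maps a raw reading to (label, confidence).
Recognizer = Callable[[Any], Tuple[str, float]]


@dataclass
class Detection:
    sensor_type: str
    label: str
    confidence: float
    position: float  # robot position when the reading was taken


def detect_feature(
    readings: Dict[str, Any],
    position: float,
    recognizers: Dict[str, Recognizer],
    weights: Dict[str, float],
) -> Tuple[str, float]:
    """Return (predicted feature label, robot position) for one time step."""
    # Recognize: one recognizer output per sensor type, mapped to the position.
    detections: List[Detection] = []
    for sensor_type, reading in readings.items():
        label, confidence = recognizers[sensor_type](reading)
        detections.append(Detection(sensor_type, label, confidence, position))

    # Aggregate: weighted sum of confidences per candidate label (assumed scheme).
    totals: Dict[str, float] = {}
    for d in detections:
        totals[d.label] = totals.get(d.label, 0.0) + weights[d.sensor_type] * d.confidence

    # Output: the label with the highest weighted score, with its location.
    best_label = max(totals, key=lambda label: totals[label])
    return best_label, position
```

With stand-in recognizers such as `lambda r: ("crack", 0.6)` for an RGB feed and `lambda r: ("joint", 0.5)` for LIDAR, and weights of 1.0 and 2.0, the sketch reports "joint" because its weighted score (1.0) exceeds that of "crack" (0.6).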

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ebrahimi Afrouzi et al. (US 20220066456 A1) [relies on content published 3 March, 2022 and does not rely on the priority earlier than this date].

Regarding claims 1 and 20, Ebrahimi Afrouzi et al. disclose a method comprising, and system comprising: a robot traversing an environment; a plurality of sensors deployed on the robot (The data collected by the camera may be bundled with data collected by one or more of an OTS, an encoder, an IMU, a gyroscope, etc. The robot may also include a 3D or 2D LIDAR for measuring distances to objects as the robot moves within the environment, [0306]); one or more processors associated with the robot; and one or more storage devices that store instructions, that, when executed by the one or more processors, cause the one or more processors to (processors, storage, instructions, [0269], [1488]): receiving, by one or more processors, a collection of sensor data from a plurality of sensors deployed on a robot traversing an environment, each of the plurality of sensors associated with a sensor type (“In embodiments, information is received from sensors and is used in real time by AI algorithms. Decisions actuate the robot without buffer delays based on the real time information.
Examples of sensors include, but are not limited to, inertial measurement unit (IMU), gyroscope, optical tracking sensor (OTS), depth camera, obstacle sensor, floor sensor, edge detection sensor, debris sensor, acoustic sensor, speech recognition, camera, image sensor, time of flight (TOF) sensor, TSOP sensor, laser sensor, light sensor, electric current sensor, optical encoder, accelerometer, compass, speedometer, proximity sensor, range finder, LIDAR, LADAR, radar sensor, ultrasonic sensor, piezoresistive strain gauge, capacitive force sensor, electric force sensor, piezoelectric force sensor, optical force sensor, capacitive touch-sensitive surface or other intensity sensors, global positioning system (GPS), etc.”, [0242], The pose estimator may include an Extended Kalman Filter (EKF) that uses odometry, IMU, and LIDAR data, [0244]), tracking by the one or more processors, a position of the robot within the environment, wherein the tracking is performed using data received from a position sensor associated with the robot (The data collected by the camera may be bundled with data collected by one or more of an OTS, an encoder, an IMU, a gyroscope, etc. The robot may also include a 3D or 2D LIDAR for measuring distances to objects as the robot moves within the environment, [0306], In some embodiments, the processor may obtain a first stream of spatial data from a first sensor indicative of the position of the robot within the environment. In some embodiments, the processor may obtain a second stream of spatial data from a second sensor indicative of the position of the robot within the environment, [0423], While an IMU may detect an inertial acceleration after the robot has accelerated a desired cruise speed, the accelerometer may not be helpful in detecting motion with a constant speed. 
Therefore, in such cases, odometry information from the wheel encoder may be more useful, [0448], The pose estimator may include an Extended Kalman Filter (EKF) that uses odometry, IMU, and LIDAR data. SLAM may build a map based on scan matching. The pose estimator and SLAM may pass information to one another in a feedback loop. The SLAM updated may estimate the pose of the robot, [0244], In case of the LIDAR being covered (i.e., not available), the processor of the robot may use gyroscope data to continue mapping and covering hard surfaces since a gyroscope performs better on hard surfaces. The processor may switch to OTS (optical track sensor) for carpeted areas since OTS performance and accuracy is better in those areas. For example, a mapped area may be generated using LIDAR data, coverage on hard surface by the robot may be executed using only gyroscope sensor, and coverage on carpet by the robot may be executed using an OTS sensor, [0587], the processor may couple LIDAR or camera measurements with IMU, OTS, etc. data, [0589]); recognizing, by the one or more processors, a first feature associated the sensor type within the environment (In some embodiments, a video that is in red, green, blue (RGB) format may be converted to a video in a different format, such as YCoCg color space format, [0266], In embodiments, a kernel may consist of multiple layers of feature maps, each designed to detect a different feature. All neurons in a single feature map share the same parameters and allow the network to recognize a feature pattern regardless of where the feature pattern is within the input. This is important for object detection. 
For example, once the network learns that an object positioned in a dwelling is a chair, the network will be able to recognize the chair regardless of where the chair is located in the future, [0274], In some embodiments, a camera of the robot (the camera used for SLAM or another camera) captures images or video while the robot navigates around the environment. Using object recognition, the processor may identify the TV within the images captured and may associate a location within the floor map with the TV, [0400], The robot also includes a camera. The processor of the robot may use data collected by the camera to track a location of features, such as a light fixture, a corner, and an edge. In some embodiments, the camera may be slightly recessed and angled rearward. In some embodiments, the processor uses the location of features to localize the robot, [0589]), using deep learning algorithms operating on the collection of sensor data from the sensor type (In embodiments, there may be a high number of layers in the network (i.e., deep network) or there may be a low number of layers (i.e., shallow network), [0285], “In some embodiments, the AP signal strength data collected by sensors of the robot are fed into the deep neural network model along with accurate LIDAR measurements. 
In some embodiments, the LIDAR data and AP signal strength data are combined into a data structure then provided to the neural network such that a pattern may be learned and the processor may infer probabilities of a location of the robot based on the AP signal strength data collected”, [0309], reward system of trajectory measurement and observation algorithm are transmitted to the database for input into the Deep Q-Network for reinforcement learning, [0343], Some embodiments provide an image sensor and image processor coupled to the robot and use deep learning to analyze images captured by the image sensor and identify objects in the images, either locally or via the cloud, [1193]); mapping, by the one or more processors, the first feature to the position of the robot (SLAM updated may estimate the pose of the robot, [0244], during relocalization a camera of the robot may capture local images and the processor may attempt to locate the robot within the state-space by searching the known map to find a pattern similar to its current observation, [0260], “In some embodiments, the processor may not know the correspondence between data points a priori when merging images and may start by matching nearby points. The processor may then update the most likely correspondence and iterate on. In some embodiments, the processor of the robot may localize the robot against the environment based on feature detection and matching. 
This may be synonymous to pose estimation or determining the position of cameras and other sensors of the robot relative to a known three dimensional object in the scene, [0339], a camera of the robot may capture an image comprising a television that the processor may use in identifying the room the robot is within, [0588], the processor uses the location of features to localize the robot, [0589]); aggregating, by the one or more processors, the recognized first feature using a weighting algorithm (Since the robot is moving, the most recent measurements captured by the robot may be given more weight as they are more relevant. For instance, data at a current timestamp t is given more weight than older measurements captured at t−1, t−2, t−i. In some embodiments, the position of the robot may be a multidimensional array or tensor and the kernel may be a set of parameters organized in a multidimensional array. The two multidimensional arrays may be convolved to produce a feature map, [0324], processor adjusts the weight given to classification based on the collection of past experiences of robots and classification based on the experiences of the respective robot itself, [0385], “weighted sums computed by hidden layers of the network are propagated to the output layer which may present probabilities to describe a classification, an object detection (to be tracked), a feature detection (to be tracked), etc.”, [0524], “In some embodiments, the weight assigned to readings may be proportional to the size of the overlap area identified. 
For example, data points corresponding to a moving object captured in one or two frames overlapping with several other frames captured without the moving object may be assigned a low weight as they likely do not fall within the adjustment range and are not consistent with data points collected in other overlapping frames and would likely be rejected for having low assigned weight”, [0926]); and creating, by the one or more processors, an output predicting the first feature present in the environment (In embodiments, x is a first function and is the input to the network, w is a second function called a kernel, and the output of the network is a feature map, [0297], “In some embodiments, low level features are processed in real time. In some embodiments, different outputs may each require a different speed of response from the robot. For instance, an output indicating probabilities of a distance of the robot from an object. This requires fast response from the robot to avoid a collision”, [0318], the processor stitches images and creates a spatial representation of the scene after correcting images with preprocessing, [0339], In some embodiments, computer vision may be used to help with the labeling. For instance, the processor of the robot may recognize cabinetry, an oven, and a dishwasher in a same room and may therefore assume and label the room as the kitchen. Bedrooms, bathrooms, etc. may similarly be identified and labelled. In some embodiments, the processor may use history cubes to determine elements with direction. For example, directions that doors open may be determined using images of a same door at various time stamps. In some embodiments, an architectural plan may be generated by combination of a SLAM generated map and computer vision. 
In embodiments, additional data may be added to the map by a user or the processor, including labels for each room, specific measurement, notes, etc., [0361], In some embodiments, the processor may use object recognition to identify different objects in the stream of images and may label objects and associate locations in the map with the labelled objects. In some embodiments, the processor may label dynamic obstacles, such as humans and pets, in the map. In some embodiments, the dynamic obstacles have a half life that is determine based on a probability of their presence, [0381], “In some embodiments, the processor classifies the type, size, texture, and nature of objects. In some embodiments, such object classifications are provided as input to the Q-SLAM navigational stack, which then returns as output a decision on how to handle the object with the particular classifications. For example, a decision of the Q-SLAM navigational stack of an autonomous car may be very conservative when an object has even the slightest chance of being a living being, and may therefore decide to avoid the object. In the context of a robotic vacuum cleaner, the Q-SLAM navigational stack may be extra conservative in its decision of handling an object when the object has the slightest chance of being pet bodily waste.” [0514], “The output may be in the form of probabilities of possible outcomes, the outcomes being high-level features such as object type, scene, distance measurement, or displacement of a camera”, [0527]). 

As the language of the claim states “aggregating, by the one or more processors, the recognized first feature using a weighting algorithm” this is interpreted as aggregating over iterations of the neural network, as is disclosed by Ebrahimi Afrouzi et al. above. If instead the claim means something such as aggregating detections from different sensors, this is not what is currently claimed.
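The reading adopted here, aggregation across successive observations rather than across sensors, resembles the recency weighting quoted from [0324], where data at timestamp t outweighs data at t−1, t−2, and so on. The sketch below is a minimal illustration under that assumption. The function name `aggregate_over_time` and the geometric `decay` factor are hypothetical choices, not taken from the claims or from Ebrahimi Afrouzi et al.

```python
from typing import List


def aggregate_over_time(scores: List[float], decay: float = 0.5) -> float:
    """Recency-weighted mean of one recognizer's confidence scores.

    scores[-1] is the newest observation (time t) and gets weight 1.0;
    each older observation is down-weighted by a further factor of `decay`.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    n = len(scores)
    # Oldest score gets decay**(n-1); newest gets decay**0 == 1.0.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return weighted_sum / sum(weights)
```

For scores `[0.2, 0.4, 0.8]` and `decay=0.5`, the weights are 0.25, 0.5 and 1.0, so the aggregate is 1.05 / 1.75 = 0.6, which is pulled toward the most recent score. Setting `decay=1.0` reduces this to a plain average.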

Ebrahimi Afrouzi et al. describe multiple embodiments. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the embodiments above as the combination would have predictable results, and Ebrahimi Afrouzi et al. indicate “In embodiments, the processor executes deep learning to improve perception, improve trajectory such that it follows the planned path, improve coverage, improve obstacle detection and prevention, make decisions that are more human-like, and to improve operation of the robot in situations where data becomes unavailable (e.g., due to a malfunctioning sensor)” ([0271]),
“In embodiments, DNN and CNN are advantageous as there are several different tools that may be used to a necessary degree. For example, proper weight initialization may break symmetries or advantageously choosing ELU or ReLu where negative values or those close to a value of zero are important or using leaky ReLu to advantageously increase performance for a more real-time experience or use of sparsification technique by selecting FTRL over Adam optimization” ([0279]) “In some embodiments, the processor may use the SLAM data to add accurate measurement to the generated architectural plan” ([0361]) “In embodiments, the SLAM algorithm is superior to SLAM methods described in prior art as it is less likely to lose localization of the robot. For example, using traditional SLAM methods, localization of the robot may be lost if the robot is randomly picked up and moved to a different room during a work session. However, using the SLAM algorithm described herein, localization is not lost” ([0485]) providing several computational performance benefits and accuracy improvements when embodiments are combined. 

Claims 2-5 and 21-24 are rejected under 35 U.S.C. 103 as being unpatentable over Ebrahimi Afrouzi et al. (US 20220066456 A1) as applied to claims 1 and 20 above, further in view of Philbin et al. (US 20210101624 A1).

Regarding claims 2 and 21, Ebrahimi Afrouzi et al. disclose the method and system of claims 1 and 20. Ebrahimi Afrouzi et al. further indicate sensor data comprises two-dimensional RGB camera feed data and wherein the recognizing step comprises feeding the two-dimensional camera feed data into an RGB recognizer algorithm (In some embodiments, a video that is in red, green, blue (RGB) format may be converted to a video in a different format, such as YCoCg color space format, [0266], In embodiments, a kernel may consist of multiple layers of feature maps, each designed to detect a different feature. All neurons in a single feature map share the same parameters and allow the network to recognize a feature pattern regardless of where the feature pattern is within the input. This is important for object detection. For example, once the network learns that an object positioned in a dwelling is a chair, the network will be able to recognize the chair regardless of where the chair is located in the future, [0274], In some embodiments, a camera of the robot (the camera used for SLAM or another camera) captures images or video while the robot navigates around the environment. Using object recognition, the processor may identify the TV within the images captured and may associate a location within the floor map with the TV, [0400], RGB, SLAM, [0464], The robot also includes a camera. The processor of the robot may use data collected by the camera to track a location of features, such as a light fixture, a corner, and an edge. In some embodiments, the camera may be slightly recessed and angled rearward. In some embodiments, the processor uses the location of features to localize the robot, [0589]).

Ebrahimi Afrouzi et al. do not disclose processing RGB output into an ensemble predictor to predict the feature.  

Philbin et al. teach sensor data comprises two-dimensional RGB camera feed data (image sensors (e.g., red-green-blue (RGB), [0036]) and wherein the recognizing step comprises feeding the two-dimensional camera feed data into an RGB recognizer algorithm and processing RGB output into an ensemble predictor to predict the feature (In some examples, operation 408 may comprise an ensemble voting technique such as, for example, majority voting, plurality voting, weighted voting (e.g., where certain pipelines are attributed more votes, functionally) and/or an averaging technique such as simple averaging, weighted averaging, and/or the like. In other words, a first occupancy map may indicate that there is a 0.9 likelihood that a portion of the environment is occupied and a second occupancy map may indicate that there is a 0.8 likelihood that the portion is occupied. The techniques may comprise using the likelihoods in a voting technique to determine whether to indicate that the portion is occupied or unoccupied and/or averaging the likelihoods to associate an averaged likelihood therewith, [0076]).
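The voting and averaging techniques quoted from [0076] can be made concrete with the paragraph's own numbers (likelihoods of 0.9 and 0.8 that a portion of the environment is occupied). The Python sketch below is illustrative only. The function names, the 0.5 vote threshold, and the example weights are assumptions and do not come from Philbin et al.

```python
from typing import Optional, Sequence


def average_likelihood(
    likelihoods: Sequence[float], weights: Optional[Sequence[float]] = None
) -> float:
    """Simple or weighted average of per-pipeline occupancy likelihoods."""
    if not likelihoods:
        raise ValueError("likelihoods must be non-empty")
    if weights is None:
        weights = [1.0] * len(likelihoods)  # simple averaging
    if len(weights) != len(likelihoods):
        raise ValueError("weights and likelihoods must have equal length")
    return sum(w * p for w, p in zip(weights, likelihoods)) / sum(weights)


def majority_vote(likelihoods: Sequence[float], threshold: float = 0.5) -> bool:
    """Each pipeline votes 'occupied' if its likelihood meets the threshold;
    the portion is marked occupied when more than half of the pipelines agree."""
    votes = sum(1 for p in likelihoods if p >= threshold)
    return votes > len(likelihoods) / 2
```

With the quoted likelihoods, simple averaging gives (0.9 + 0.8) / 2 = 0.85, and both pipelines vote "occupied" under the assumed 0.5 threshold. Giving the first pipeline twice the weight of the second yields (2 × 0.9 + 0.8) / 3, roughly 0.867.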

Ebrahimi Afrouzi et al. and Philbin et al. are in the same art of collision avoidance (Ebrahimi Afrouzi et al., [0307]; Philbin et al., abstract). The combination of Philbin et al. with Ebrahimi Afrouzi et al. will enable processing RGB output into an ensemble predictor to predict the feature. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the ensemble predictor of Philbin et al. with the invention of Ebrahimi Afrouzi et al. as this was known at the time of invention, the combination would have predictable results, and as Philbin et al. state “To safely operate, an autonomous vehicle may include multiple sensors and various systems for detecting and tracking events surrounding the autonomous vehicle and may take these events into account when controlling the autonomous vehicle. For example, the autonomous vehicle may detect and track every object within a 360-degree view of a set of cameras, LIDAR sensors, radar, and/or the like to control the autonomous vehicle safely” ([0002]) and “The techniques discussed herein may improve the safety of a vehicle by preventing invalid or risky trajectories from being implemented by the vehicle. In at least some examples, such techniques may further prevent collisions due to providing redundancy in such a way as to mitigate errors in any system or subsystem associated with the trajectory generation components (perception, prediction, planning, etc.). Moreover, the techniques may reduce the amount of computational bandwidth, memory, and/or power consumed for collision avoidance in comparison to former techniques. The accuracy of the collision avoidance system may also be higher than an accuracy of the primary perception system, thereby reducing an overall error rate of trajectories implemented by the autonomous vehicle by filtering out invalid trajectories” ([0022]) suggesting a safety benefit that would result from combining the inventions.
Regarding claims 3 and 22, Ebrahimi Afrouzi et al. and Philbin et al. disclose the method and system of claims 2 and 21.  Ebrahimi Afrouzi et al. further indicate the recognizer step comprises image segmentation based on color or contrast (processor of the robot may perform segmentation wherein an object captured in an image is separated from other objects and the background of the image, the processor identifies the object based on the characteristics and features of the object. Characteristics of the object, for example, may include shape, color, size, presence of a leaf, and positioning of the leaf., [0382], In some embodiments, classification of an area may be based on commonalities and differences. Commonalities may include, for example, objects, floor types, patterns on walls, corners, ceiling, painting on the walls, windows, doors, power outlets, light fixtures, furniture, appliances, brightness, curtains, and other commonalities and how each of these commonalities relate to one another. Examples of different commonalities observed for an area include a bed, the color of the walls and the tile flooring. Based on these observed commonalities, the processor may classify the area, [0430], For instance, a first image may be segmented using fixed segmentation, whereas other images may be segmented based on entropy and contrast, [0550]).

Regarding claims 4 and 23, Ebrahimi Afrouzi et al. and Philbin et al. disclose the method and system of claims 3 and 22. Ebrahimi Afrouzi et al. and Philbin et al. further indicate the deep learning algorithm is a convolutional neural network (CNN) (Ebrahimi Afrouzi et al., [0273]-[0279], [0292], [0313], [0321], [0393]; Philbin et al., [0051]).

Regarding claims 5 and 24, Ebrahimi Afrouzi et al. and Philbin et al. disclose the method and system of claims 4 and 23. Philbin et al. further indicate executing the CNN to extract features using an encoding algorithm and decoding the features corresponding to the two-dimensional RGB camera feed data (camera feed can be RGB, see [0036], ML can be CNN, see [0051], In some examples, the ML model(s) 302(1)-(n) may comprise an encoder-decoder network, although other architectures are contemplated. In an example that uses an encoder-decoder network with convolutional layers, the encoder layer(s) may use average pooling with a pooling size of (2,2) and the decoder may comprise bilinear up-sampling. Following the decoder, the architecture may comprise a single linear convolution layer that generates logits and a final layer may apply a softmax to produce the final output probabilities associated with the different object classifications (e.g., pedestrian, cyclist, motorcyclist, vehicle, other may be labeled ground), [0089], For example, FIG. 6A depicts an ML model 600 that is trained to determine occupancy maps based at least in part on lidar data and/or radar data may include an encoder comprising a set of five blocks consisting of a pair of convolutional layers with batch normalization followed by an average pooling layer. In some examples, the convolutional layers may comprise ReLU activations, although other activations are contemplated (e.g., sigmoid, hyperbolic tangent, leaky ReLU, parameteric ReLU, softmax, Swish) The decoder may include five blocks consisting of three convolutional layers with batch normalization. The network may additionally or alternatively comprise a skip connection 602 from the fourth block of the encoder to the second block of the decoder, [0090], Continuing an additional or alternate example of variations in the architectures of the ML model(s) 302(1)-(n), FIG. 6B depicts an ML model 604 trained to determine occupancy maps based at least in part on image data may comprise an encoder-decoder network built on top of a ResNet (or other vision) backbone. For example, the ResNet block may comprise three layers, although an additional or alternate ResNet or other vision backbone component may be used. In some examples, the encoder and decoder may four blocks on top of ResNet blocks and, in at least some examples, the architecture 604 may comprise an orthographic feature transform layer between the encoder and decoder. The images are in perspective view even though the output is in a top-down view. The orthographic feature transform layer may convert from pixel space to top-down space. In some examples, the orthographic feature transform layer may comprise a series of unbiased fully connected layers with ReLU activations, although other activations are contemplated (e.g., sigmoid, hyperbolic tangent, leaky ReLU, parameteric ReLU, softmax, Swish). In an example where the image data 304 comprises images from different cameras, the architecture 604 may be configured to receive images from each camera view through a shared encoder and the architecture may be trained to learn a separate orthographic transformation for each view, add together the projected features, and pass the result through a single decoder, [0091]).
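For readers mapping the cited passage to the claim language, the encoder and decoder operations quoted from Philbin et al. at [0089] (average pooling with a (2,2) pooling size, bilinear up-sampling, and a final softmax) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the implementation of any cited reference or of the application: the function names are hypothetical, the convolution layers are omitted, and the interpolation uses an align-corners convention chosen here for simplicity.

```python
import math

def avg_pool_2x2(img):
    """Average-pool a 2-D grid (list of rows) with a 2x2 window and stride 2."""
    h, w = len(img), len(img[0])
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

def bilinear_upsample_2x(img):
    """Upsample a 2-D grid by 2x with bilinear interpolation.

    Assumption: align-corners convention (output corners equal input corners).
    """
    h, w = len(img), len(img[0])
    out_h, out_w = 2 * h, 2 * w
    out = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row.
        y = i * (h - 1) / (out_h - 1)
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1)
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4x4 single-channel input.
image = [[1, 1, 3, 3],
         [1, 1, 3, 3],
         [5, 5, 7, 7],
         [5, 5, 7, 7]]
encoded = avg_pool_2x2(image)             # 2x2 grid: the "encoder" step
decoded = bilinear_upsample_2x(encoded)   # 4x4 grid: the "decoder" step
probs = softmax([2.0, 1.0, 0.1])          # per-class probabilities from logits
```

The sketch only shows that the encoder reduces spatial resolution and the decoder restores it; a trained network would interleave learned convolution layers with these steps.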

Claim(s) 6 and 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ebrahimi Afrouzi et al. (US 20220066456 A1) and Philbin et al. (US 20210101624 A1) as applied to claims 5 and 24 above, further in view of Guo et al. (US 20230177637 A1).

Regarding claims 6 and 25, Ebrahimi Afrouzi et al. and Philbin et al. disclose the method and system of claims 5 and 24.  Ebrahimi Afrouzi et al. and Philbin et al. do not disclose the CNN comprises convolution layers to produce feature vectors.  

Guo et al. teach a CNN comprising convolution layers to produce feature vectors (The teacher network 530 comprises an encoder 532 having a plurality of convolution layers and pooling layers that reduce the dimensionality of the input images 520 to provide encoded intermediate perception outputs (e.g., a feature vector) at an encoded layer 534 (i.e., a bottleneck layer). In at least one embodiment, the encoder 532 includes densely connected convolutional layers (e.g., DenseNet169). The teacher network 530 further comprises a decoder 536 having a plurality of convolution or deconvolution layers and unpooling or upsampling layers that increase the dimensionality of the encoded intermediate perception outputs from the encoded layer 534 to generate an output logit (e.g., having dimensions h1×w1×N, where h1 and w1 are the height and width, respectively, of the perspective projection images 520). The output logit is normalized by the output layer 540 (e.g., softmax) to provide the final perception outputs of the teacher network 530 (e.g., having dimensions h1×w1), [0044] The student network 550 comprises an encoder 552 having a plurality of convolution layers and pooling layers that reduce the dimensionality of the input image 510 to provide encoded intermediate perception outputs (e.g., a feature vector) at an encoded layer 554 (i.e., a bottleneck layer). In at least one embodiment, the encoder 552 include densely connected convolutional layers (e.g., DenseNet121). The student network 550 further comprises a decoder 556 having a plurality of convolution or deconvolution layers and unpooling or upsampling layers that increase the dimensionality of the encoded intermediate perception outputs from the encoded layer 554 to generate an output logit (e.g., having dimensions h2×w2×N, where h2 and w2 are the height and width, respectively, of the omnidirectional image 510). 
The output logit is normalized by the output layer 560 (e.g., softmax) to provide the final perception outputs of the student network 550 (e.g., having dimensions h2×w2), [0049]).
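The claim 6 and 25 limitation at issue, convolution layers that produce feature vectors, can be pictured with a short Python sketch. It is a hedged illustration only: the function names and the toy kernels are hypothetical, the kernels are fixed rather than learned, and the sketch is far simpler than the DenseNet-based encoders that Guo et al. describe. It uses cross-correlation, which is what most deep learning libraries call convolution.

```python
def conv2d_valid(img, kernel):
    """Slide a kernel over a 2-D grid without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [
        [sum(img[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(w - kw + 1)]
        for r in range(h - kh + 1)
    ]

def feature_vector(img, kernels):
    """Return one feature per kernel: convolution, ReLU, global average pooling."""
    vec = []
    for k in kernels:
        fmap = conv2d_valid(img, k)
        activated = [max(0.0, v) for row in fmap for v in row]  # ReLU
        vec.append(sum(activated) / len(activated))             # global average
    return vec

# Hypothetical 3x3 input with a vertical edge in the last column.
image = [[0, 0, 1],
         [0, 0, 1],
         [0, 0, 1]]
kernels = [
    [[-1, 1]],  # responds to left-to-right intensity increases
    [[1]],      # identity kernel: passes intensity through
]
features = feature_vector(image, kernels)
```

Each kernel contributes one entry to the vector, so the vector length equals the number of kernels, which is the sense in which an encoder's bottleneck layer yields a feature vector.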

Ebrahimi Afrouzi et al. and Guo et al. are in the same art of autonomous devices (Ebrahimi Afrouzi et al., [0003]; Guo et al., [0003], [0056]). The combination of Guo et al. with Ebrahimi Afrouzi et al. and Philbin et al. will enable using a CNN comprising convolution layers to produce feature vectors. It would have been obvious at the time of filing for one of ordinary skill in the art to combine the convolution layers of Guo et al. with the invention of Ebrahimi Afrouzi et al. and Philbin et al. as this was known at the time of invention, the combination would have predictable results, and as Guo et al. state, “By way of this training, the student model learns to perform the same machine perception task, except in the omnidirectional image domain, using limited or no suitably labeled training data in the omnidirectional image domain” (abstract), demonstrating an improvement to training efficiency and a reduced need for human labor because less labeled data is required.


Claim(s) 7-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ebrahimi Afrouzi et al. (US 20220066456 A1) and Philbin et al. (US 20210101624 A1) as applied to claim 2 above, further in view of Papi et al. (US 12416730 B1).

Regarding claim 7, Ebrahimi Afrouzi et al. and Philbin et al. disclose the method of claim 2.  Ebrahimi Afrouzi et al. and Philbin et al. do not disclose the sensor data further comprises two-dimensional infrared sensor data and wherein the recognizing step further comprises feeding the two-dimensional infrared sensor data an infrared recognizer algorithm; and combining the RGB output and infrared output into an ensemble predictor to predict the feature.  

Papi et al. teach the sensor data further comprises two-dimensional infrared sensor data and wherein the recognizing step further comprises feeding the two-dimensional infrared sensor data an infrared recognizer algorithm; and combining the RGB output and infrared output into an ensemble predictor to predict the feature (Object detection and tracking systems may use machine-learned transformer models with self-attention for detecting, classifying, and/or tracking objects in an environment. Techniques described herein may include receiving sensor data generated by different sensor modalities of a vehicle, determining different bounding shapes based on the different sensor modalities, and using a machine-learned transformer model to determine associated and/or combined bounding shapes, abstract, Various examples herein relate to receiving sensor data (and/or object detections or bounding shapes determined based on the sensor data) from with different sensor modalities. As used herein, a sensor modality may refer to a type of sensor data and/or to a type of sensor configured to capture or process sensor data. Examples of sensor modalities may include, but are not limited to, lidar, radar, vision (e.g., image and/or video), sonar, depth, time-of-flight, audio, cameras (e.g., RGB, IR, intensity, depth, etc.), and the like, col. 6, lines 40-50, Each combined object detection may include an updated/refined set of attributes (e.g., location, size dimensions, yaw, classification, intent, etc.) based on the attributes of the associated object detections from the different sensor modalities. The ML transformer model 106 may be trained to determine an optimal set of attributes for each combined object detection, so that the combined object detection represents the corresponding object in the environment 114 more accurately than any of the individual object detections from the different sensor modalities, col. 9, lines 35-60, Although discussed in the context of neural networks, any type of machine-learning can be used consistent with this disclosure. For example, machine-learning algorithms can include ensemble, col. 25, lines 5-45).

Ebrahimi Afrouzi et al. and Papi et al. are in the same art of autonomous devices (Ebrahimi Afrouzi et al., [0003]; Papi et al., col. 5, lines 50-67). The combination of Papi et al. with Ebrahimi Afrouzi et al. and Philbin et al. will enable combining the RGB output and infrared output into an ensemble predictor to predict the feature. It would have been obvious at the time of filing for one of ordinary skill in the art to combine the data combination of Papi et al. with the invention of Ebrahimi Afrouzi et al. and Philbin et al. as this was known at the time of invention, the combination would have predictable results, and as Papi et al. state, “For at least these reasons, the techniques described herein also can improve the safe operation of autonomous vehicles. For instance, the disclosed techniques, among other things, improve an autonomous vehicle's ability to detect, classify, and track certain objects in an environment. Being able to detect, classify, and track objects may be critical for the overall safety and quality of autonomous driving. The technologies disclosed herein can classify objects based on a combination of sensor modalities, such as vision (e.g., images), lidar, and/or radar data. For instance, using the technologies described herein, an object can be tracked with high certainty as to the object's location, size, velocity, yaw, and classification, etc. This is due to the ability to process and analyze object detections and/or bounding shapes from different sensor modalities in an ML transformer model, and determine associated (e.g., combined) object detections with improved accuracy over the object detections generated by the individual sensor modalities” (col. 5, lines 50-67), thereby providing an accuracy benefit and therefore a likely safety improvement when the inventions are combined.

Regarding claim 8, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 7.  Ebrahimi Afrouzi et al. further indicate the recognizing step further comprises image segmentation based on color or contrast and further based on heat gradients (processor of the robot may perform segmentation wherein an object captured in an image is separated from other objects and the background of the image, the processor identifies the object based on the characteristics and features of the object. Characteristics of the object, for example, may include shape, color, size, presence of a leaf, and positioning of the leaf, [0382], In some embodiments, classification of an area may be based on commonalities and differences. Commonalities may include, for example, objects, floor types, patterns on walls, corners, ceiling, painting on the walls, windows, doors, power outlets, light fixtures, furniture, appliances, brightness, curtains, and other commonalities and how each of these commonalities relate to one another. Examples of different commonalities observed for an area include a bed, the color of the walls and the tile flooring. Based on these observed commonalities, the processor may classify the area, [0430], For instance, a first image may be segmented using fixed segmentation, whereas other images may be segmented based on entropy and contrast, [0550], “In some embodiments, the user interface may display information about a current state of the robot or previous states of the robot or its environment. 
Examples may include a heat map of dirt or debris sensed over an area, visual indications of classifications of floor surfaces in different areas of the map, visual indications of a path that the robot has taken during a current cleaning session or other type of work session, visual indications of a path that the robot is currently following and has computed to plan further movement in the future, and visual indications of a path that the robot has taken between two points in the environment, like between a point A and a point B on different sides of a room or a house in a point-to-point traversal mode”, [1417] ) [heat map interpreted as heat gradients].

Regarding claim 9, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 8. Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. further indicate the deep learning algorithm is a convolutional neural network (CNN) (Ebrahimi Afrouzi et al., [0273]-[0279], [0292], [0313], [0321], [0393]; Philbin et al., [0051]; Papi et al., col. 3, line 34 - col. 4, line 3, col. 25, lines 5-45).

Regarding claim 10, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 9. Philbin et al. and Papi et al. further indicate executing the CNN to extract features using an encoding algorithm and decoding the features corresponding to the two-dimensional RGB camera feed data and the two-dimensional infrared sensor data (Philbin et al., camera feed can be RGB, infrared, see [0036], ML can be CNN, see [0051], In some examples, the ML model(s) 302(1)-(n) may comprise an encoder-decoder network, although other architectures are contemplated. In an example that uses an encoder-decoder network with convolutional layers, the encoder layer(s) may use average pooling with a pooling size of (2,2) and the decoder may comprise bilinear up-sampling. Following the decoder, the architecture may comprise a single linear convolution layer that generates logits and a final layer may apply a softmax to produce the final output probabilities associated with the different object classifications (e.g., pedestrian, cyclist, motorcyclist, vehicle, other may be labeled ground), [0089], For example, FIG. 6A depicts an ML model 600 that is trained to determine occupancy maps based at least in part on lidar data and/or radar data may include an encoder comprising a set of five blocks consisting of a pair of convolutional layers with batch normalization followed by an average pooling layer. In some examples, the convolutional layers may comprise ReLU activations, although other activations are contemplated (e.g., sigmoid, hyperbolic tangent, leaky ReLU, parameteric ReLU, softmax, Swish) The decoder may include five blocks consisting of three convolutional layers with batch normalization. The network may additionally or alternatively comprise a skip connection 602 from the fourth block of the encoder to the second block of the decoder, [0090], Continuing an additional or alternate example of variations in the architectures of the ML model(s) 302(1)-(n), FIG. 6B depicts an ML model 604 trained to determine occupancy maps based at least in part on image data may comprise an encoder-decoder network built on top of a ResNet (or other vision) backbone. For example, the ResNet block may comprise three layers, although an additional or alternate ResNet or other vision backbone component may be used. In some examples, the encoder and decoder may four blocks on top of ResNet blocks and, in at least some examples, the architecture 604 may comprise an orthographic feature transform layer between the encoder and decoder. The images are in perspective view even though the output is in a top-down view. The orthographic feature transform layer may convert from pixel space to top-down space. In some examples, the orthographic feature transform layer may comprise a series of unbiased fully connected layers with ReLU activations, although other activations are contemplated (e.g., sigmoid, hyperbolic tangent, leaky ReLU, parameteric ReLU, softmax, Swish). In an example where the image data 304 comprises images from different cameras, the architecture 604 may be configured to receive images from each camera view through a shared encoder and the architecture may be trained to learn a separate orthographic transformation for each view, add together the projected features, and pass the result through a single decoder, [0091]; Papi et al., Object detection and tracking systems may use machine-learned transformer models with self-attention for detecting, classifying, and/or tracking objects in an environment.
Techniques described herein may include receiving sensor data generated by different sensor modalities of a vehicle, determining different bounding shapes based on the different sensor modalities, and using a machine-learned transformer model to determine associated and/or combined bounding shapes, abstract, Various examples herein relate to receiving sensor data (and/or object detections or bounding shapes determined based on the sensor data) from with different sensor modalities. As used herein, a sensor modality may refer to a type of sensor data and/or to a type of sensor configured to capture or process sensor data. Examples of sensor modalities may include, but are not limited to, lidar, radar, vision (e.g., image and/or video), sonar, depth, time-of-flight, audio, cameras (e.g., RGB, IR, intensity, depth, etc.), and the like, col. 6, lines 40-50, Each combined object detection may include an updated/refined set of attributes (e.g., location, size dimensions, yaw, classification, intent, etc.) based on the attributes of the associated object detections from the different sensor modalities. The ML transformer model 106 may be trained to determine an optimal set of attributes for each combined object detection, so that the combined object detection represents the corresponding object in the environment 114 more accurately than any of the individual object detections from the different sensor modalities, col. 9, lines 35-60) [Philbin et al. teaches concepts of CNN encoding, decoding, RGB data, infrared data, Papi et al. teach concept of combining RGB and infrared data].

Regarding claim 11, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 7.  Papi et al. further indicate the sensor data further comprises LIDAR three-dimensional point cloud data fed into a point cloud recognizer, and wherein the recognizing step further comprises combining an output of the point cloud recognizer with the RGB output and the infrared output in the ensemble predictor to predict the feature (Object detection and tracking systems may use machine-learned transformer models with self-attention for detecting, classifying, and/or tracking objects in an environment. Techniques described herein may include receiving sensor data generated by different sensor modalities of a vehicle, determining different bounding shapes based on the different sensor modalities, and using a machine-learned transformer model to determine associated and/or combined bounding shapes, abstract, Various examples herein relate to receiving sensor data (and/or object detections or bounding shapes determined based on the sensor data) from with different sensor modalities. As used herein, a sensor modality may refer to a type of sensor data and/or to a type of sensor configured to capture or process sensor data. Examples of sensor modalities may include, but are not limited to, lidar, radar, vision (e.g., image and/or video), sonar, depth, time-of-flight, audio, cameras (e.g., RGB, IR, intensity, depth, etc.), and the like, col. 6, lines 40-50, Each combined object detection may include an updated/refined set of attributes (e.g., location, size dimensions, yaw, classification, intent, etc.) based on the attributes of the associated object detections from the different sensor modalities. 
The ML transformer model 106 may be trained to determine an optimal set of attributes for each combined object detection, so that the combined object detection represents the corresponding object in the environment 114 more accurately than any of the individual object detections from the different sensor modalities, col. 9, lines 35-60, Although discussed in the context of neural networks, any type of machine-learning can be used consistent with this disclosure. For example, machine-learning algorithms can include ensemble, col. 25, lines 5-45) [LIDAR, RGB listed as optional modalities]

Regarding claim 12, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 11.  Papi et al. further indicate the ensemble predictor comprises a linear rule-based model which combines the output of the point cloud recognizer, the RGB output and the projected infrared output (perception component may use a machine-learned transformer model with self-attention to determine associated and/or combined object detections (e.g., bounding shapes) representing the objects in the environment, col. 2, lines 18 - 56, transformer model to receive input bounding shapes from the various sensor modalities, and to output a corresponding set (or stream) of object detections with a one-to-one constraint to solve over segmentation, col. 4, lines 4 - 25, Each of the ML lidar pipeline 122, the ML image pipeline 124, and/or the ML radar pipeline 126 may include one or more machine-learned model(s) incorporating any combination of machine-learning components (e.g., multilayer perceptrons, feedforward neural networks, attention components, etc.), col. 8, line 42 - col. 9, line 3, include linear regression, logistic regression, Linear Discriminant Analysis (LDA), association rule learning algorithms e.g., perceptron, col. 25, lines 5-45).  

Regarding claim 13, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 12.  Philbin et al. and Papi et al. further indicate the ensemble predictor weights the output of the point cloud recognizer, the RGB output and the infrared output (Philbin et al., In some examples, operation 408 may comprise an ensemble voting technique such as, for example, majority voting, plurality voting, weighted voting (e.g., where certain pipelines are attributed more votes, functionally) and/or an averaging technique such as simple averaging, weighted averaging, and/or the like; Papi et al., Various examples herein relate to receiving sensor data (and/or object detections or bounding shapes determined based on the sensor data) from with different sensor modalities. As used herein, a sensor modality may refer to a type of sensor data and/or to a type of sensor configured to capture or process sensor data. Examples of sensor modalities may include, but are not limited to, lidar, radar, vision (e.g., image and/or video), sonar, depth, time-of-flight, audio, cameras (e.g., RGB, IR, intensity, depth, etc.), and the like, col. 6, lines 40-50, Each combined object detection may include an updated/refined set of attributes (e.g., location, size dimensions, yaw, classification, intent, etc.) based on the attributes of the associated object detections from the different sensor modalities. The ML transformer model 106 may be trained to determine an optimal set of attributes for each combined object detection, so that the combined object detection represents the corresponding object in the environment 114 more accurately than any of the individual object detections from the different sensor modalities, col. 9, lines 35-60, Although discussed in the context of neural networks, any type of machine-learning can be used consistent with this disclosure. For example, machine-learning algorithms can include ensemble, col. 25, lines 5-45).
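The weighted voting and weighted averaging that the Office Action quotes from Philbin et al. can be sketched in Python as follows. This is a hedged illustration of an ensemble predictor that weights the outputs of three recognizers, not code from any cited reference or from the application: the recognizer names, class labels, scores, and weights are all hypothetical values chosen for the example.

```python
def weighted_vote(outputs, weights):
    """Combine per-recognizer class scores by weighted averaging.

    outputs: dict mapping recognizer name -> dict of class label -> score
    weights: dict mapping recognizer name -> non-negative weight
    Returns the winning label and the combined score per label.
    """
    total_weight = sum(weights[name] for name in outputs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = {}
    for name, scores in outputs.items():
        share = weights[name] / total_weight  # normalized weight of this recognizer
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + share * score
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical outputs from three recognizers for one candidate region.
outputs = {
    "rgb":         {"feature": 0.6, "background": 0.4},
    "infrared":    {"feature": 0.3, "background": 0.7},
    "point_cloud": {"feature": 0.8, "background": 0.2},
}
# Hypothetical weights; a real system might tune or learn these.
weights = {"rgb": 0.5, "infrared": 0.2, "point_cloud": 0.3}

label, scores = weighted_vote(outputs, weights)
```

With these assumed numbers the infrared recognizer disagrees with the other two, but its lower weight means the combined prediction still favors "feature".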
 
Regarding claim 14, Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. disclose the method of claim 13. Ebrahimi Afrouzi et al. and Philbin et al. and Papi et al. further indicate an inertial measurement unit (IMU) associated with the robot, wherein the IMU is configured to determine a position of the robot in the environment, and wherein the tracking step is performed using the IMU (Ebrahimi Afrouzi et al., [0238], [0244], [0245], [0420]; Philbin et al., [0045], [vehicle has robotic control, [0023]]; Papi et al., col. 7, lines 18-60, col. 22, lines 40-65 [robotic, col. 6, lines 1-15]).
  
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M ENTEZARI HAUSMANN whose telephone number is (571)270-5084. The examiner can normally be reached 10-7 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent M Rudolph can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHELLE M ENTEZARI HAUSMANN/Primary Examiner, Art Unit 2671
Prosecution Timeline

Oct 23, 2023: Application Filed
Jan 30, 2025: Response after Non-Final Action
May 12, 2026: Non-Final Rejection mailed, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/267,598, Patent 12638400: Method for monitoring and/or controlling phase separation in chemical processes and samples (2y 11m to grant; granted May 26, 2026)
18/348,495, Patent 12639803: SYSTEMS AND METHODS FOR MATERIAL ACCRETION DETECTION AND REMOVAL (2y 10m to grant; granted May 26, 2026)
18/136,006, Patent 12629121: METHOD OF DETERMINING VESSEL FLUID FLOW VELOCITY (3y 1m to grant; granted May 19, 2026)
18/034,833, Patent 12626375: HOMOGRAPHY MATRIX GENERATION APPARATUS, CONTROL METHOD, AND COMPUTER-READABLE MEDIUM (3y 0m to grant; granted May 12, 2026)
18/179,635, Patent 12620252: INFORMATION SOURCE DETECTION USING UNIQUE WATERMARKS (3y 2m to grant; granted May 05, 2026)
Based on the 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview (+21.3%): 98%
Median Time to Grant: 3y 0m (~4m remaining)
PTA Risk: Low
Based on 870 resolved cases by this examiner. Grant probability derived from career allowance rate.