Prosecution Insights
Last updated: April 19, 2026
Application No. 18/272,950

EXTRACTING FEATURES FROM SENSOR DATA

Status: Non-Final Office Action (§103)
Filed: Jul 18, 2023
Examiner: MCLEAN, NEIL R
Art Unit: 2681
Tech Center: 2600 — Communications
Assignee: Five AI Limited
OA Round: 1 (Non-Final)

Grant Probability: 79% (Favorable)
OA Rounds: 1-2
Time to Grant: 2y 6m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 79% (545 granted / 686 resolved), +17.4% vs TC avg — above average
Interview Lift: +10.5% across resolved cases with interview (moderate, roughly +10% lift)
Typical Timeline: 2y 6m average prosecution; 21 applications currently pending
Career History: 707 total applications across all art units

Statute-Specific Performance

§101: 14.8% (-25.2% vs TC avg)
§103: 50.8% (+10.8% vs TC avg)
§102: 21.5% (-18.5% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)
Tech Center averages are estimates • Based on career data from 686 resolved cases

Office Action

§103
DETAILED ACTION Notice of Pre-AIA or AIA Status 1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Priority 2. Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55. Oath/Declaration 3. The receipt of Oath/Declaration is acknowledged. Preliminary Amendment 4. The Preliminary Amendment submitted on 07/18/2023 containing amendments to the specification and amendments to the claims are acknowledged. Information Disclosure Statement 5. The information disclosure statements (IDS) submitted on 07/18/2023, 10/04/2023, 11/05/2024, 10/15/2025, and 02/10/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner. Drawings 6. The drawing(s) filed on 07/18/2023 are accepted by the Examiner. Claim Objections 7. Claim 17 which depends from Claim 16, recites the limitation "the at least one time sequence" in line 2, however, there is insufficient antecedent basis for this limitation in the claim. Please amend this claim by inserting “receiving at least one time-sequence of real sensor data” prior to that limitation. It is noted that claim 17 is substantially similar to claim 3 but missing this limitation. For purposes of examination the examiner will examine Claim 17 as if this limitation were present. Appropriate correction is required. Status of Claims 8. Claims 1-12, and 14-21 are pending in this application. Claims 10-12, and 14-15 were amended, Claim 13 was canceled, and Claims 16-21 were newly added in the 07/18/2023 Preliminary Amendment. Claim Rejections - 35 USC § 103 9. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 10. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 11. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action. 12. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. 13. Claims 1, 2, 6, 7, 10, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kaufhold et al. (US 9,990,687 B1) in view of Chen et al. (US 2021/0327029), hereinafter ‘Kaufhold’ and ‘Chen’. 
Regarding Claim 1: Kaufhold discloses a computer implemented method of training an encoder to extract features from sensor data (Kaufhold: Fig. 3; ‘Method and System of Deep Embedding’ Col. 26, lines 38-63), the method comprising: training a machine learning (ML) system based on a Kaufhold discloses training a machine learning system, specifically a variational autoencoder architecture (Fig. 8: 810 and 830) comprising an encoder, based on a loss function applied to a training set comprising real and synthetic sensor images. “a variational autoencoder can be trained to minimize a loss function that measures the difference between real sensor images and synthetic images. The loss function is commonly cross entropy or mean square error.” (Col. 25, line 65 – Col. 6 line 10). The ML system comprises the encoder as a component of the autoencoder architecture. (Col. 26, lines 17-36). wherein the training set comprises sets of real sensor data and corresponding sets of synthetic sensor data, Kaufhold discloses “if pairs of synthetic and corresponding real sensor images 311/410/510 are available (this can be accomplished by rendering a plurality of synthetic images to match real sensor images from recorded acquisition geometries), a function can be learned that can convert synthetic images into images that fall closer to the real sensor image manifold.” (Col. 25, lines 59-64). This constitutes a training set comprising sets of real sensor data (real SAR images acquired from a real operational imaging sensor) and corresponding sets of synthetic sensor data (synthetic SAR images generated from a CAD model). The synthetic images are generated specifically to match the recorded acquisition geometries of the real sensor images. (Col. 25 lines 59-64). wherein the encoder extracts features from each set of real and synthetic sensor data, Kaufhold discloses that the encoder component of the variational autoencoder architecture (modules 810/830) processes both real sensor images and synthetic sensor images and extracts low-dimensional embedded feature representations. The deep embedding module 333 receives as input a high dimensional object and outputs a low dimensional embedded representation (a356) constituting the extracted features (Col. 26, lines 35-65). and the Kaufhold discloses training a loss function on corresponding pairs of real and synthetic sensor images (311/410/510), which establishes a training objective that associates each real sensor image with its corresponding synthetic sensor image (Col. 25, lines 55-65; Col. 26, lines 1-10). However, Kaufhold operates at the pixel/reconstruction level rather than an encoder extracted feature representations, and does not explicitly characterize the association of real/synthetic pairs as a feature level operation. Kaufhold does not expressly disclose a self-supervised loss function applied to a training set; and associate each set of real sensor data with its corresponding set of synthetic sensor data based on their respective features. Chen discloses a self-supervised loss function applied to a training set; and Chen explicitly discloses a self-supervised loss function, the NT-Xent (normalized temperature scaled cross entropy) loss, applied to a training set to train a base encoder neural network in a fully self-supervised manner, without requiring human provided labels. The NT-Xent loss is applied to the training set without any ground-truth label supervision, rendering it a self-supervised loss function as claimed (Chen:¶¶[0002; 0027; 0042-0043]). 
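For readers who want to see the mechanics behind the NT-Xent citation, below is a minimal NumPy sketch of an NT-Xent-style contrastive loss. It is an illustration only, not code from Chen or from the application: the batch construction, which treats each real feature vector and its corresponding synthetic feature vector as the positive pair and all other batch members as negatives, reflects the claimed arrangement rather than Chen's augmented-view pairing, and all names and parameter values are hypothetical.

```python
import numpy as np

def nt_xent_loss(z_real, z_syn, temperature=0.1):
    """NT-Xent-style contrastive loss over paired feature vectors.

    z_real, z_syn: (N, D) arrays of (projected) encoder features, where row i of
    z_real and row i of z_syn form a positive pair and every other row in the
    combined batch acts as a negative.
    """
    z = np.concatenate([z_real, z_syn], axis=0)           # (2N, D) combined batch
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit-normalise for cosine similarity
    sim = z @ z.T / temperature                           # (2N, 2N) scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)                        # exclude self-similarity

    n = z_real.shape[0]
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # each row's positive partner

    # Cross-entropy of the positive entry against all other entries in the row.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos_idx] - logsumexp)
    return loss.mean()

# Toy usage: 4 real/synthetic feature pairs in a 16-dimensional feature space.
rng = np.random.default_rng(0)
f_real = rng.normal(size=(4, 16))
f_syn = f_real + 0.05 * rng.normal(size=(4, 16))          # synthetic features near their real counterparts
print(nt_xent_loss(f_real, f_syn))
```

Minimising a loss of this form pulls corresponding real/synthetic features together in the shared space and pushes non-corresponding ones apart, which is the feature-level association the rejection attributes to the combination.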
associate each set of real sensor data with its corresponding set of synthetic sensor data based on their respective features. Chen discloses a feature level association. Chen’s NT-Xent loss explicitly encourages the ML system to associate positive pairs (corresponding data items) with each other based on their encoder extracted feature representations projected in to a shared latent space. (¶[0042]). Kaufhold in view of Chen are combinable because they are from the same field of endeavor of image processing; e.g., both references employ the same fundamental encoder architecture processing of paired data. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the self-supervised loss function, and the use of encoder extracted features as taught by Chen into the pixel-level reconstruction loss and pixel-level features taught by Kaufhold. The suggestion/motivation for doing so is to obtain an improved quality of encoder feature representations as disclosed by Chen at ¶¶[0029; 0070-0072]. Therefore, it would have been obvious to combine Kaufhold with Chen to obtain the invention as specified in claim 1. Regarding Claim 2: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 1, wherein each set of real sensor data comprises sensor data of at least one sensor modality, the method comprising: generating the corresponding sets of synthetic sensor data using one or more sensor models for the at least one sensor modality. Kaufhold discloses that SAR (synthetic aperture radar) techniques can be used to produce synthetic SAR images, using a rendering algorithm applied to a CAD model, which constitutes one or more sensor models for the sensor modality (Col. 24, lines 24-35). Kaufhold states “In some modalities, such as synthetic aperture radar (“SAR”), techniques can be used to produce synthetic SAR images, for instance (akin to virtual reality rendering engines producing visually realistic renderings of scenes in the spectrum of visual imaging)” (Col. 25, lines 24-28). Regarding Claim 6: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 2, wherein for each real set of sensor data the corresponding set of synthetic sensor data is generated via processing of the real set of sensor data. As addressed in Claim 2 above, Kaufhold discloses generating synthetic sensor data using sensor models corresponding to a given sensor modality (Col. 24, lines 24-35; Col. 25, lines 24-28). Chen further expressly discloses that synthetic/augmented data is generated by directly processing the real input data. Specifically, Chen discloses a stochastic data augmentation module 203 that “transforms any given data example, e.g., an input image x…randomly resulting in two correlated views of the same example” [0038]. The specific processing operations applied to the real input image include random cropping, random color distortions, and random Gaussian blur, each applied directly to the real training image to produce the augmented (synthetic) output [0039]. Kaufhold in view of Chen are combinable because they are from the same field of endeavor of image processing; e.g., both references leverage synthetic data to improve model training. 
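As context for the augmentation operations cited from Chen (random crop, color distortion, Gaussian blur), the following NumPy sketch derives two correlated views from one real image by direct processing of that image. The crop size, distortion range, and the crude neighbour-averaging blur are placeholder choices for illustration, not Chen's actual pipeline.

```python
import numpy as np

def augment(img, rng, crop=24):
    """Return one randomly augmented view of an (H, W) grayscale image."""
    h, w = img.shape
    top, left = rng.integers(0, h - crop + 1), rng.integers(0, w - crop + 1)
    view = img[top:top + crop, left:left + crop].astype(float)   # random crop
    view = view * rng.uniform(0.8, 1.2)                          # simple brightness distortion
    # Crude blur: average each pixel with its neighbours above and to the left.
    view = (view + np.roll(view, 1, axis=0) + np.roll(view, 1, axis=1)) / 3.0
    return view

rng = np.random.default_rng(0)
real_image = rng.integers(0, 255, size=(32, 32))
view_a, view_b = augment(real_image, rng), augment(real_image, rng)  # two correlated views of the same example
```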
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Chen's technique of deriving synthetic sensor data through direct processing of the corresponding real sensor data into the framework of Kaufhold. The suggestion/motivation for doing so is to generate synthetic sensor data that closely corresponds to the real sensor data, improving training. Therefore, it would have been obvious to combine Kaufhold with Chen to obtain the invention as specified in claim 6. Regarding Claim 7: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 1, wherein at least one of the sets of real sensor data comprises a real image, and the corresponding set of synthetic sensor data comprises a corresponding synthetic image derived via image rendering. Kaufhold discloses that real SAR images are acquired from "a real operational imaging sensor imaging a real object" (Col. 26, lines 28-29). Kaufhold further discloses that the corresponding synthetic SAR images are explicitly disclosed as being "synthetically rendered SAR from CAD models" (Col. 26, lines 25-26), constituting synthetic images derived via image rendering. Kaufhold further describes the use of "virtual reality rendering engines producing visually realistic renderings of scenes". Regarding Claim 10: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 1, wherein the ML system comprises a trainable projection component which projects the features from a feature space into a projection space, Chen expressly discloses "A projection head neural network 206 …that maps the intermediate representations to final representations within the space where contrastive loss is applied." ([0041]). The projection head neural network is explicitly described as performing a learnable nonlinear transformation; "the projection head neural network is configured to perform at least one non-linear transformation" [0005], implemented as "the projection head neural network 206 can be a multi-layer perceptron with one hidden layer to obtain z_i = g(h_i) = W^(2)σ(W^(1)h) where σ is a ReLU non-linearity." [0041]. This maps the encoder's intermediate representation (feature space) into the projected representation z (projection space), directly corresponding to the claimed trainable projection component. the self-supervised loss defined on the projected features, Chen teaches that "it is beneficial to define the contrastive loss on final representations z_i's rather than intermediate representations h_i's." [0041], expressly placing the self-supervised contrastive loss in the projection space rather than the feature space. wherein the trainable projection component is trained simultaneously with the encoder. Chen's Algorithm 1 ([0044]) explicitly discloses that both networks are updated together in a single training loop, demonstrating that the projection component and encoder are trained simultaneously. Kaufhold in view of Chen are combinable because they are from the same field of endeavor of image processing; e.g., both references leverage synthetic data to improve model training.
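The projection head quoted from Chen, z_i = g(h_i) = W^(2) σ(W^(1) h) with a ReLU non-linearity, is small enough to sketch directly. The dimensions and random initialisation below are arbitrary illustration values, not Chen's:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden_dim, proj_dim = 128, 128, 64

W1 = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))   # first (hidden) layer weights
W2 = rng.normal(scale=0.1, size=(proj_dim, hidden_dim))   # output layer weights

def project(h):
    """One-hidden-layer MLP projection head: z = W2 · ReLU(W1 · h)."""
    return W2 @ np.maximum(W1 @ h, 0.0)

h = rng.normal(size=feat_dim)       # encoder feature (the "intermediate representation")
z = project(h)                      # projected representation on which the contrastive loss is defined
```

In a full training loop the gradients of the contrastive loss would update W1 and W2 together with the encoder weights in the same step, which is the simultaneous-training point the rejection draws from Chen's Algorithm 1.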
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Chen's teaching wherein the ML system comprises a trainable projection component which projects the features from a feature space into a projection space, the self-supervised loss defined on the projected features, wherein the trainable projection component is trained simultaneously with the encoder. The suggestion/motivation for doing so is that "Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations" as disclosed by Chen at [0029]. Therefore, it would have been obvious to combine Kaufhold with Chen to obtain the invention as specified in claim 10. Regarding Claim 15: (drawn to a computer-readable medium) The proposed combination of Kaufhold in view of Chen, explained in the rejection of method claim 1, renders obvious the computer-readable medium of claim 15 because these steps occur in the operation of the proposed combination as discussed above. Thus, arguments similar to those presented above for claim 1 are equally applicable to claim 15. It is noted that Kaufhold discloses a computer-readable storage medium at least at Fig. 9 'memory 920'; 'processor 930'; Col. 28, lines 11-24, and Col. 30, lines 24-35. 14. Claims 3, 4, 11, 12, 14, 16, 17, 18, 20 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Kaufhold in view of Chen as applied to claims 1 and 2 above, and further in view of Taralova (US 10,832,093). Regarding Claim 3: The proposed combination of Kaufhold in view of Chen discloses the method of claim 2, but does not expressly disclose comprising: receiving at least one time-sequence of real sensor data; processing the at least one time-sequence to extract a description of a scenario; and simulating the scenario in a simulator, wherein each set of real sensor data comprises a portion of real sensor data of the at least one time-sequence, and the corresponding set of synthetic sensor data is derived from a corresponding part of the simulated scenario using the one or more sensor models. Taralova discloses receiving at least one time-sequence of real sensor data; Taralova's autonomous vehicle 102 traverses the real environment 100 and generates sequential sensor data 104 over time (Col. 3, lines 12-20; Fig. 1A). Taralova explicitly discloses "a first acquired image in a video sequence and the second image may be an image acquired within some period of time (e.g., within 1-5 frames)." (Col. 13, lines 41-44). processing the at least one time-sequence to extract a description of a scenario; and Taralova discloses "techniques described herein are directed to generating simulated environments and using simulated environments in various scenarios" (Col. 5, lines 57-59). simulating the scenario in a simulator, wherein each set of real sensor data comprises a portion of real sensor data of the at least one time-sequence, and the corresponding set of synthetic sensor data is derived from a corresponding part of the simulated scenario using the one or more sensor models. Taralova further discloses Fig. 5 block 502 'generate a simulated environment'. "the simulation system 244 can generate simulated environments via procedural generation (e.g., creating data algorithmically)…The simulation system 244 described herein can simulate one or more sensor modalities (e.g., LIDAR, RADAR, ToF, SONAR, images (e.g., RGB, IR, intensity, depth, etc.), etc.)." (Col. 11, line 66 – Col. 12, line 4). Fig. 5, Block 504 explicitly teaches receiving "a first intermediate output…associated with a real environment and a second intermediate output…associated with a corresponding simulated environment" (Col. 21, lines 64-66). Taralova further states that "The images can correspond such that they represent the same portion of their respective environments." (Col. 2, lines 39-40). Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to derive a scenario description by processing real-world sensor data logs for log-based simulation. The suggestion/motivation for doing so is to produce synthetic training data that more closely mirrors the real-world domain being simulated. Taralova discloses in the background of the invention that simulated environments can be useful when testing would be unsafe in a real-world environment, such as for a rare or infrequent scenario, expressly recognizing scenario descriptions as inputs to the simulation system. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 3. Regarding Claim 4: The proposed combination of Kaufhold, Chen & Taralova further discloses the method of claim 3, wherein each set of real sensor data captures a real static scene at a time instant in the real sensor data sequence, Taralova's autonomous vehicle 102 traverses the real environment and generates sequential sensor data frames over time. Taralova discloses "a first acquired image in a video sequence and the second image may be an image acquired within some period of time (e.g., within 1-5 frames)." (Col. 13, lines 41-44). Each individual image in the sequence represents a static capture of the scene at a discrete time instant. and the corresponding set of synthetic sensor data captures a synthetic static scene at a corresponding time instant in the simulation. Taralova: Fig. 5, block 504 teaches receiving "a first intermediate output…associated with a real environment and a second intermediate output…associated with a corresponding simulated environment" (Col. 21, lines 64-66), and that "The images can correspond such that they represent the same portion of their respective environments." (Col. 2, lines 39-40). The pairing is of a static snapshot from a real environment and a static snapshot from the simulation at the corresponding moment. Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the frame-level temporal correspondence taught by Taralova within the training framework taught by Kaufhold and Chen. The suggestion/motivation for doing so is to get a static capture of a simulated scene. As discussed in Claim 3, Taralova discloses that simulated environments can be useful when testing would be unsafe in a real-world environment, such as for a rare or infrequent scenario, expressly recognizing scenario descriptions as inputs to the simulation system. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 4.
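To make the data flow asserted for claims 3 and 4 easier to follow, here is a hypothetical sketch of log-based pairing: each frame of a real time-sequence is matched to a render of the simulated scenario at the corresponding time instant. The helper names (extract_scenario, simulate, render_sensor) and the overall decomposition are assumptions for illustration; they do not come from Taralova, Kaufhold, Chen, or the application.

```python
from typing import Callable, List, Tuple
import numpy as np

Frame = np.ndarray  # one set of sensor data (e.g. an image or a BEV grid)

def build_training_pairs(
    real_sequence: List[Frame],
    extract_scenario: Callable[[List[Frame]], object],
    simulate: Callable[[object, int], object],
    render_sensor: Callable[[object], Frame],
) -> List[Tuple[Frame, Frame]]:
    """Pair each real frame with a synthetic frame rendered from the simulated scenario.

    extract_scenario: turns the real time-sequence into a scenario description.
    simulate:         advances the scenario to the state at frame index t.
    render_sensor:    applies a sensor model to a simulator state to produce synthetic data.
    """
    scenario = extract_scenario(real_sequence)
    pairs = []
    for t, real_frame in enumerate(real_sequence):
        sim_state = simulate(scenario, t)               # corresponding part of the simulated scenario
        synthetic_frame = render_sensor(sim_state)      # synthetic sensor data from the sensor model
        pairs.append((real_frame, synthetic_frame))     # positive pair for the contrastive objective
    return pairs

# Toy usage with stand-in components: the "scenario" is just the frame list, the
# simulator state is the frame at index t, and the sensor model adds a small bias.
frames = [np.full((4, 4), float(t)) for t in range(3)]
pairs = build_training_pairs(
    frames,
    extract_scenario=lambda seq: seq,
    simulate=lambda scenario, t: scenario[t],
    render_sensor=lambda state: state + 0.1,
)
```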
Regarding Claim 11: The proposed combination of Kaufhold in view of Chen discloses the method of claim 1, but does not expressly disclose wherein the sets of real sensor data capture real static or dynamic driving scenes, and the corresponding sets of synthetic sensor data capture corresponding synthetic static or dynamic driving scenes. Taralova discloses wherein the sets of real sensor data capture real static or dynamic driving scenes, and Taralova discloses an autonomous vehicle (102) traversing a real driving environment (100) and generating real sensor data (104) representative of that environment (Fig. 1A). “In an example, the autonomous vehicle 102 can traverse the real environment 100. In some examples, while traversing the real environment 100, the autonomous vehicle 102 can generate sensor data 104 that can be used to inform maps of the real environment 100 and/or inform movement of the autonomous vehicle 102 within the real environment 100.” (Col. 3, lines 7-13). the corresponding sets of synthetic sensor data capture corresponding synthetic static or dynamic driving scenes. Taralova further discloses a corresponding simulated driving environment (114), in which a vehicle (116) generates simulated sensor data (118) representing the same type of autonomous driving environment (Fig. 1B). The real and simulated sensor data correspond directly. “The images can correspond such that they represent the same portion of their respective environments.” (Col. 2, lines 39-40). Taralova further discloses “simulated environments can be used for generating training data for rare or infrequently occurring scenarios and/or objects.” (Col. 5, lines 37-39). Taralova further discloses “LIDAR data recorded in association with a simulated environment (e.g., where the pose of the vehicle 102 is known) can be compared to LIDAR data recorded in association with a corresponding position in a real environment and the localization algorithm can be updated as appropriate.” (Col. 17, line 64 – Col. 18, line 2). Taralov discloses “a first acquired image in a video sequence and the second image may be an image acquired within some period of time (e.g., within 1-5 frames).” (Col. 13, lines 41-44). Each individual image in the sequence represents a static capture of the scene at a discrete time instant. Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the domain adaptation method of claim 1 to autonomous driving environments, specifically capturing both static scenes and dynamic scenes. The suggestion/motivation for doing so is to get a static capture of a simulated scene. As discussed in Claim 3, Taralov discloses that simulated environments can be useful when testing would be unsafe in a real-world environment such as for a rare or infrequent scenario, expressly recognizing scenario descriptions as inputs to the simulation system. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralov to obtain the invention as specified in claim 11. 
Regarding Claim 12: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 1, wherein the self-supervised loss function is a contrastive loss function that encourages similarity of features between positive pairs, Chen discloses a contrastive learning framework wherein the loss function is explicitly a contrastive loss (NT-Xent loss) that encourages feature similarity between positive pairs, defined as two differently augmented views of the same data example, while discouraging similarity between negative pairs, defined as augmented views from different data examples in the training batch (Chen: ¶¶[0037-0042]). Chen discloses "learn representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space." ([0037]). Kaufhold in view of Chen do not explicitly define the positive pairs as consisting of a set of real sensor data and its corresponding set of synthetic sensor data, nor the negative pairs as non-corresponding real/synthetic sensor data sets. Taralova explicitly defines the positive pairs as consisting of a set of real sensor data and its corresponding set of synthetic sensor data, and the negative pairs as non-corresponding real/synthetic sensor data sets. Taralova teaches pairing real sensor data (Fig. 1A real environment; 'sensor system(s) 108' e.g., LIDAR, Radar, camera…) with its corresponding synthetic sensor data for training (Fig. 1B is the simulated environment). See Col. 4, lines 20-55 'real world sensor observations and their corresponding simulated data.' Taralova: ('POSITIVE') explicitly discloses in Fig. 5 block 504 "illustrates receiving a pair of intermediate outputs, a first intermediate output of the pair of intermediate outputs being associated with a real environment and a second intermediate output of the pair of intermediate outputs being associated with a corresponding simulated environment." wherein Taralova's 'simulated' reads on the claimed 'synthetic'. Taralova: ('NEGATIVE') Fig. 5 block 510 'If the simulated environment and the real environment do not activate the same way, e.g., the variation between neural network activations…meets or exceeds a threshold…modifying parameters.' (non-corresponding pairs contrasted; Abstract; Col. 1, lines 55-65). 'The training data can comprise a plurality of differences…from pairs of activations' (dissimilar real/simulated pairs; Col. 2, lines 25-30). Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Kaufhold and Chen's contrastive loss framework, with its positive/negative pair structure, to Taralova's real/simulated data pairing. The suggestion/motivation for doing so is to improve cross-domain feature alignment between real and simulated/synthetic sensor representations. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 12. Regarding Claim 14: Kaufhold discloses a computer system (Kaufhold: Fig. 9 'computing device 900') comprising: at least one memory configured to store computer-readable instructions (Kaufhold: discloses a computer-readable storage medium at least at Fig. 9 'memory 920'); at least one hardware processor coupled to the at least one memory and configured to execute the computer-readable instructions (Kaufhold: Fig.
9 ‘processor 930’), which upon execution cause the at least one hardware processor to train a machine learning (ML) system based on a (Kaufhold: Fig. 3; ‘Method and System of Deep Embedding’ Col. 26, lines 38-63), the ML system comprising an encoder; Kaufhold discloses training a machine learning system, specifically a variational autoencoder architecture (Fig. 8: 810 and 830) comprising an encoder, based on a loss function applied to a training set comprising real and synthetic sensor images. “a variational autoencoder can be trained to minimize a loss function that measures the difference between real sensor images and synthetic images. The loss function is commonly cross entropy or mean square error.” (Col. 25, line 65 – Col. 6 line 10). The ML system comprises the encoder as a component of the autoencoder architecture. (Col. 26, lines 17-36). wherein the training set comprises sets of real sensor data and corresponding sets of synthetic sensor data, Kaufhold discloses “if pairs of synthetic and corresponding real sensor images 311/410/510 are available (this can be accomplished by rendering a plurality of synthetic images to match real sensor images from recorded acquisition geometries), a function can be learned that can convert synthetic images into images that fall closer to the real sensor image manifold.” (Col. 25, lines 59-64). This constitutes a training set comprising sets of real sensor data (real SAR images acquired from a real operational imaging sensor) and corresponding sets of synthetic sensor data (synthetic SAR images generated from a CAD model). The synthetic images are generated specifically to match the recorded acquisition geometries of the real sensor images. (Col. 25 lines 59-64). wherein the encoder extracts features from each set of real and synthetic sensor data, Kaufhold discloses that the encoder component of the variational autoencoder architecture (modules 810/830) processes both real sensor images and synthetic sensor images and extracts low-dimensional embedded feature representations. The deep embedding module 333 receives as input a high dimensional object and outputs a low dimensional embedded representation (a356) constituting the extracted features (Col. 26, lines 35-65). and the Kaufhold discloses training a loss function on corresponding pairs of real and synthetic sensor images (311/410/510), which establishes a training objective that associates each real sensor image with its corresponding synthetic sensor image (Col. 25, lines 55-65; Col. 26, lines 1-10). However, Kaufhold operates at the pixel/reconstruction level rather than an encoder extracted feature representations, and does not explicitly characterize the association of real/synthetic pairs as a feature level operation. wherein the encoder is configured to receive an input sensor data representation and extract features therefrom, Kaufhold discloses that the encoder component of the variational autoencoder architecture (modules 810/830) processes both real sensor images and synthetic sensor images and extracts low-dimensional embedded feature representations. The deep embedding module 333 receives as input a high dimensional object and outputs a low dimensional embedded representation (a356) constituting the extracted features (Col. 26, lines 35-65). Kaufhold does not expressly disclose a self-supervised loss function applied to a training set; and associate each set of real sensor data with its corresponding set of synthetic sensor data based on their respective features. 
Chen discloses a self-supervised loss function applied to a training set; and Chen explicitly discloses a self-supervised loss function, the NT-Xent (normalized temperature scaled cross entropy) loss, applied to a training set to train a base encoder neural network in a fully self-supervised manner, without requiring human provided labels. The NT-Xent loss is applied to the training set without any ground-truth label supervision, rendering it a self-supervised loss function as claimed (Chen:¶¶[0002; 0027; 0042-0043]). associate each set of real sensor data with its corresponding set of synthetic sensor data based on their respective features. Chen discloses a feature level association. Chen’s NT-Xent loss explicitly encourages the ML system to associate positive pairs (corresponding data items) with each other based on their encoder extracted feature representations projected in to a shared latent space. (¶[0042]). Kaufhold in view of Chen are combinable because they are from the same field of endeavor of image processing; e.g., both references employ the same fundamental encoder architecture processing of paired data. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the self-supervised loss function, and the use of encoder extracted features as taught by Chen into the pixel-level reconstruction loss and pixel-level features taught by Kaufhold. The suggestion/motivation for doing so is to obtain an improved quality of encoder feature representations as disclosed by Chen at ¶¶[0029; 0070-0072]. Therefore, it would have been obvious to combine Kaufhold with Chen to obtain the invention as specified. The proposed combination of Kaufhold in view of Chen do not expressly disclose a perception component; and the perception component is configured to use the extracted features to interpret the input sensor data representation. Taralova discloses a perception component; and the perception component is configured to use the extracted features to interpret the input sensor data representation. Taralova discloses ‘perception system 222 can perform object detection, segmentation, and/or classification based at least in part on sensor data received from the sensor system(s) 206. In at least one example, the perception system 222 can receive raw sensor data (e.g., from the sensor system(s) 206). In other examples, the perception system 222 can receive processed sensor data (e.g., from the sensor system(s) 206). For instance, in at least one example, the perception system 222 can receive data from a vision system that receives and processes camera data (e.g., images). In at least one example, the vision system can utilize one or more image processing algorithms to perform object detection, segmentation, and/or classification with respect to object(s) identified in an image. In some examples, the vision system can associate a bounding box (or otherwise an instance segmentation) with an identified object and can associate a confidence score associated with a classification of the identified object. In some examples, objects, when rendered via a display, can be colored based on their perceived class. In at least other examples, similar processes (detection, classification, segmentation, etc.) may be performed by the perception system 222 for one or more other modalities (e.g., LIDAR, RADAR, ToF systems, etc.). Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose a perception component; and the perception component is configured to use the extracted features to interpret the input sensor data representation. The suggestion/motivation for doing so is to build a single coherent view of the surroundings and use that to make decisions. Taralova further discloses an example wherein a perception system can use a classifier that was trained on simulated data so that the autonomous vehicle can respond to events such as a traffic light that is burned out. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 14. Regarding Claim 16: The proposed combination of Kaufhold, Chen & Taralova further disclose the computer system of claim 14, wherein each set of real sensor data comprises sensor data of at least one sensor modality, and the corresponding sets of synthetic sensor data are generated using one or more sensor models for the at least one sensor modality Kaufhold discloses that SAR (synthetic aperture radar) techniques can be used to produce synthetic SAR images, using a rendering algorithm applied to a CAD model, which constitutes one or more sensor models for the sensor modality (Col. 24, lines 24-35). This constitutes a sensor model for the SAR sensor modality generating synthetic sensor data corresponding to real SAR data. Regarding Claim 17: The proposed combination of Kaufhold, Chen & Taralova further disclose the computer system of claim 16, wherein the system is configured to: receiving at least one time-sequence of real sensor data; process the at least one time-sequence to extract a description of a scenario; and simulate the scenario in a simulator, wherein each set of real sensor data comprises a portion of real sensor data of the at least one time-sequence, and the corresponding set of synthetic sensor data is derived from a corresponding part of the simulated scenario using the one or more sensor models. Taralova discloses receiving at least one time-sequence of real sensor data; Taralova’s autonomous vehicle 102 traverses the real environment 100 and generates sequential sensor data 104 over time (Col. 3, lines 12-20; Fig. 1A). Taralova explicitly discloses “a first acquired image in a video sequence and the second image may be an image acquired within some period of time (e.g., within 1-5 frames).” (Col. 13, lines 41-44). processing the at least one time-sequence to extract a description of a scenario; and Taralova discloses “techniques described herein are directed to generating simulated environments and using simulated environments in various scenarios” (Col. 5, lines 57-59). simulating the scenario in a simulator, wherein each set of real sensor data comprises a portion of real sensor data of the at least one time-sequence, and the corresponding set of synthetic sensor data is derived from a corresponding part of the simulated scenario using the one or more sensor models. Taralova further discloses Fig. 5 block 502 ‘generate a simulated environment’. “the simulation system 244 can generate simulated environments via procedural generation (e.g., creating data algorithmically)…The simulation system 244 described herein can simulate one or more sensor modalities (e.g., LIDAR, RADAR, ToF, SONAR, images (e.g., RGB, IR, intensity, depth, etc.), etc.).” (Col. 11, line 66 – Col. 12, line 4). Fig. 
5, Block 504 explicitly teaches receiving “a first intermediate output…associated with a real environment and a second intermediate output…associated with a corresponding simulated environment” (Col. 21, lines 64-66). Taralova further states that “The images can correspond such that they represent the same portion of their respective environments.” (Col. 2, lines 39-40). Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to derive a scenario description by processing real world sensor data logs for log-based simulation. The suggestion/motivation for doing so is to produce synthetic training data that more closes mirrors the real-world domain being simulated. Taralova discloses in the background of invention that simulated environments can be useful when testing would be unsafe in a real-world environment such as for a rare or infrequent scenario, expressly recognizing scenario descriptions as inputs to the simulation system. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 17. Regarding Claim 18: The proposed combination of Kaufhold, Chen & Taralova further disclose the computer system of claim 17, wherein each set of real sensor data captures a real static scene at a time instant in the real sensor data sequence, Taralova’s autonomous vehicle 102 traverses the real environment and generates sequential sensor data frames over time. Taralov discloses “a first acquired image in a video sequence and the second image may be an image acquired within some period of time (e.g., within 1-5 frames).” (Col. 13, lines 41-44). Each individual image in the sequence represents a static capture of the scene at a discrete time instant. and the corresponding set of synthetic sensor data captures a synthetic static scene at a corresponding time instant in the simulation. Taralova: Fig. 5, block 504 teaches receiving “a first intermediate output…associated with a real environment and a second intermediate output…associated with a corresponding simulated environment” (Col. 21, lines 64-66), and that “The images can correspond such that they represent the same portion of their respective environments.” (Col. 2, lines 39-40). The pairing is of a static snapshot from a real environment and a static snapshot from the simulation at the corresponding moment. Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the frame level temporal correspondence taught by Taralova within the training framework taught by Kaufhold and Chen. The suggestion/motivation for doing so is to get a static capture of a simulated scene. As discussed in Claim 3, Taralov discloses that simulated environments can be useful when testing would be unsafe in a real-world environment such as for a rare or infrequent scenario, expressly recognizing scenario descriptions as inputs to the simulation system. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 18. 
Regarding Claim 20: The proposed combination of Kaufhold, Chen & Taralova further disclose the computer system of claim 16, wherein for each real set of sensor data the corresponding set of synthetic sensor data is generated via processing of the real set of sensor data. As addressed in Claim 2 above, Kaufhold discloses generating synthetic sensor data using sensor models corresponding to a given sensor modality (Col. 24, lines 24-35; Col. 25, lines 24-28). Chen further expressly discloses that synthetic/augmented data is generated by directly processing the real input data. Specifically, Chen discloses a stochastic data augmentation module 203 that “transforms any given data example, e.g., an input image x…randomly resulting in two correlated views of the same example” [0038]. The specific processing operations applied to the real input image include random cropping, random color distortions, and random Gaussian blur, each applied directly to the real training image to produce the augmented (synthetic) output [0039]. Kaufhold, Chen & Taralova are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Chen’s technique of deriving synthetic sensor data through direct processing of the corresponding real sensor data into the framework of Kaufhold and Taralova. The suggestion/motivation for doing so is to generate synthetic sensor data that closely corresponds to the really sensor data, improving training. Therefore, it would have been obvious to combine Kaufhold, Chen & Taralova to obtain the invention as specified in claim 20. Regarding Claim 21: The proposed combination of Kaufhold, Chen & Taralova discloses the computer system of claim 14, wherein at least one of the sets of real sensor data comprises a real image, and the corresponding set of synthetic sensor data comprises a corresponding synthetic image derived via image rendering. Kaufhold discloses that real SAR images acquired from “a real operational imaging sensor imaging a real object” (Col. 26, lines 28-29). Kaufhold further discloses that corresponding synthetic SAR images are explicitly discloses as being “synthetically rendered SAR from CAD models” (Col. 26, lines 25-26) constituting synthetic images derived via image rendering. Kaufhold further describes the use of “virtual reality rendering engines producing visually realistic renderings of scenes”. 15. Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kaufhold, Chen & Taralova as applied to claims 4 and 18 above, and further in view of Lee et al. (US 2021/0358296) hereinafter ‘Lee’. Regarding Claim 5: The proposed combination of Kaufhold, Chen & Taralova discloses the method of claim 4, but do not expressly disclose wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. Lee discloses wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. Lee teaches that the disclosure relates to “converting point cloud information into a two dimensional form.” (¶[0001] and Abstract). Lee further teaches receiving “two sets of 3D point cloud data of the scene from two consecutive point cloud sweeps” and encoding the data using a pillar feature network to extract “two-dimensional (2D) bird's-eye-view embeddings for each of the point cloud data sets in the form of pseudo images” (¶¶[0010-0012; Fig. 3). 
Lee teaches that encoding the point cloud data includes “voxelizing the point cloud data sets to render surfaces in the data sets onto a grid of discretized volume elements in a 3D space to create a set of pillars.” ([0015]). The set of pillars comprises a (D,P,N) shape tensor in which P is the number of pillars and N denotes the number of points per pillar (¶[0016]). Lee further teaches encoding the voxel information to extract features and “scattering the encoded features back to their original pillar locations to create the bird’s eye view.” (¶[0017]; Fig. 4, steps 424-428). Lee expressly teaches that two consecutive point clouds are encoded by the pillar feature network to build “two BeV pseudo images where each cell has a learned embedding based on points that had fallen inside of it” and that these pseudo images are then fed to a feature pyramid network and an optical flow network for dense flow estimation. (¶[0039]; Fig. 3, elements 332, 334; Fig. 4, step 428). Lee also explains that the system deliberately implements “a 2D BeV representation over a 3D or projective representation (depth image” for its computational and structural advantages. (¶[0036]). Lee teaches that its BeV flow framework “not only estimates 2D BeV flow accurately but also improves tracking performance of both dynamic and static objects” (¶[0009]). Lee discloses that self-supervised learning is performed on the bird’s eye view embeddings using dynamic and static masks, including forward and backward flow estimation and self-supervised learning based on flow estimates. (¶[0021]; Fig. 5; Fig. 6, steps 624-630). Kaufhold, Chen, Taralova & Lee are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. The suggestion/motivation for doing so to provide computational efficiency compared to 3D approaches, and allows the encoded feature representation to be shared as disclosed by Lee. (¶¶[0035-0036; 0039]). Therefore, it would have been obvious to combine Kaufhold, Chen, Taralova & Lee to obtain the invention as specified in claim 5. Regarding Claim 19: The proposed combination of Kaufhold, Chen & Taralova discloses the computer system of claim 18, but do not expressly disclose wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. Lee discloses wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. Lee teaches that the disclosure relates to “converting point cloud information into a two dimensional form.” (¶[0001] and Abstract). Lee further teaches receiving “two sets of 3D point cloud data of the scene from two consecutive point cloud sweeps” and encoding the data using a pillar feature network to extract “two-dimensional (2D) bird's-eye-view embeddings for each of the point cloud data sets in the form of pseudo images” (¶¶[0010-0012; Fig. 3). Lee teaches that encoding the point cloud data includes “voxelizing the point cloud data sets to render surfaces in the data sets onto a grid of discretized volume elements in a 3D space to create a set of pillars.” ([0015]). The set of pillars comprises a (D,P,N) shape tensor in which P is the number of pillars and N denotes the number of points per pillar (¶[0016]). 
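For orientation on the claim 5/19 limitation, the following NumPy sketch shows one simple way a 3D point cloud can be discretised into a 2D bird's-eye-view pseudo-image: points are binned onto a grid and each cell stores a point count and mean height. This is a plain hand-crafted stand-in for illustration, not Lee's pillar feature network, whose cells hold learned embeddings; the grid extents and cell size are arbitrary.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Discretise an (N, 3) point cloud into a 2D BEV pseudo-image.

    Returns an (H, W, 2) array whose channels are point count and mean z per cell.
    """
    h = int((x_range[1] - x_range[0]) / cell)
    w = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((h, w, 2))

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < h) & (iy >= 0) & (iy < w)

    for x, y, z in zip(ix[keep], iy[keep], points[keep, 2]):
        n = bev[x, y, 0]
        bev[x, y, 1] = (bev[x, y, 1] * n + z) / (n + 1)   # running mean height in the cell
        bev[x, y, 0] = n + 1                              # point count in the cell
    return bev

# Toy usage on a random cloud; the result is an (80, 80, 2) discretised 2D representation.
rng = np.random.default_rng(0)
cloud = np.column_stack([rng.uniform(0, 40, 5000), rng.uniform(-20, 20, 5000), rng.uniform(-2, 3, 5000)])
pseudo_image = point_cloud_to_bev(cloud)
```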
Lee further teaches encoding the voxel information to extract features and “scattering the encoded features back to their original pillar locations to create the bird’s eye view.” (¶[0017]; Fig. 4, steps 424-428). Lee expressly teaches that two consecutive point clouds are encoded by the pillar feature network to build “two BeV pseudo images where each cell has a learned embedding based on points that had fallen inside of it” and that these pseudo images are then fed to a feature pyramid network and an optical flow network for dense flow estimation. (¶[0039]; Fig. 3, elements 332, 334; Fig. 4, step 428). Lee also explains that the system deliberately implements “a 2D BeV representation over a 3D or projective representation (depth image” for its computational and structural advantages. (¶[0036]). Lee teaches that its BeV flow framework “not only estimates 2D BeV flow accurately but also improves tracking performance of both dynamic and static objects” (¶[0009]). Lee discloses that self-supervised learning is performed on the bird’s eye view embeddings using dynamic and static masks, including forward and backward flow estimation and self-supervised learning based on flow estimates. (¶[0021]; Fig. 5; Fig. 6, steps 624-630). Kaufhold, Chen, Taralova & Lee are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose wherein each real and static scene is a discretised 2D image representation of a 3D point cloud. The suggestion/motivation for doing so to provide computational efficiency compared to 3D approaches, and allows the encoded feature representation to be shared as disclosed by Lee. (¶¶[0035-0036; 0039]). Therefore, it would have been obvious to combine Kaufhold, Chen, Taralova & Lee to obtain the invention as specified in claim 19. 16. Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Kaufhold in view of Chen as applied to claim 1 above, and further in view of Lee et al. (US 2021/0358296). Regarding Claim 8: The proposed combination of Kaufhold in view of Chen further discloses the method of claim 1, wherein at least one of the sets of real sensor data comprises a real lidar or radar point cloud, and the corresponding set of synthetic sensor data comprises a corresponding synthetic point cloud derived via lidar or radar modelling. Kaufhold uses radar as a sensor modality within the synthetic data training that comprises real sensor images disclosing “in some modalities, such as synthetic aperture radar (‘SAR’), can be used to produce synthetic SAR images…akin to virtual reality rendering engines producing visually realistic renderings of scenes” (Col. 24, lines 24-35), and further that “large quantities of synthetic SAR model data can be generated quickly with a CAD model of a vehicle…and then translated into the real observed SAR data space” (Co. 24, lines 24-28). This directly corresponds to a synthetic sensor dataset derived via radar modeling. Chen discloses an augmentation module 203 that transforms an input image into augmented images 212 and 222, wherein types of images can include LiDAR point clouds ([0038]). Lee expressly discloses wherein at least one of the sets of real sensor data comprises a real lidar or radar point cloud, and the corresponding set of synthetic sensor data comprises a corresponding synthetic point cloud derived via lidar or radar modelling. 
Lee expressly teaches the use of real LiDAR and radar point clouds as the sensor input in a deep learning framework. Specifically, Lee discloses “receiving two sets of 3D point cloud data of the scene from two consecutive point cloud sweeps” ([0011]), directly mapping to the claimed “real lidar point cloud”. Lee further discloses that “embodiments may further leverage radar data as an additional input channel to the feature pyramid network, which may include range, range-rate (velocity) and occupancy information from the radar return signal.” ([0039]), confirming that radar point cloud data is a taught sensor modality. Lee further states “Embodiments for deep learning for image perception utilize synthetic data, such as data generated programmatically. Synthetic data may include computer-generated data created to mimic real data.” ([0060]), which maps directly to the claimed “corresponding synthetic point cloud derived via lidar or radar modelling” limitation. Kaufhold, Chen & Lee are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the lidar and radar point cloud teachings of Lee into the combined framework of Kaufhold and Chen. The suggestion/motivation for doing so is to extend the synthetic sensor data training, which already includes radar to the lidar point cloud domain as explicitly taught by Lee. Therefore, it would have been obvious to combine Kaufhold, Chen & Lee to obtain the invention as specified in claim 8. Regarding Claim 9: The proposed combination of Kaufhold, Chen & Lee further discloses the method of claim 8, wherein each point cloud is represented in the form of a discretised 2D image. Lee explicitly teaches encoding LiDAR point cloud data into a discretised 2D image representation. Specifically, Lee discloses “encoding data of the point cloud data sets using a pillar feature network to extract two-dimensional (2D) bird's-eye-view embeddings for each of the point cloud data sets in the form of pseudo images” ([0106] and claim 1). Kaufhold, Chen & Lee are combinable because they are from the same field of endeavor of image processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to represent LiDAR and radar point clouds in the form of a discretised 2D image, as it was a well-known and computationally advantageous technique. Lee discloses that grouping similar features (in the form of pillars) together and representing them as a single feature makes for more efficient processing ([0107]). Therefore, it would have been obvious to combine Kaufhold, Chen & Lee to obtain the invention as specified in claim 9. Conclusion 17. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Firner (US 2021/0042575) discloses a neural network is trained to focus on a domain of interest. For example, in a pre-training phase, the neural network in trained using synthetic training data, which is configured to omit or limit content less relevant to the domain of interest, by updating parameters of the neural network to improve the accuracy of predictions. 
In a subsequent training phase, the pre-trained neural network is trained using real-world training data by updating only a first subset of the parameters associated with feature extraction, while a second subset of the parameters more associated with policies remains fixed. 18. Any inquiry concerning this communication or earlier communications from the examiner should be directed to NEIL R MCLEAN whose telephone number is (571)270-1679. The examiner can normally be reached Monday-Thursday, 6AM - 4PM, PST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi M Sarpong can be reached at 571.270.3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /NEIL R MCLEAN/ Primary Examiner, Art Unit 2681

Prosecution Timeline

Jul 18, 2023 — Application Filed
Mar 13, 2026 — Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586172 — STRUCTURE DIAGNOSTIC CASE PRESENTATION DEVICE, METHOD, AND PROGRAM
Granted Mar 24, 2026 • 2y 5m to grant
Patent 12587845 — ULTRASONIC DIAGNOSTIC APPARATUS
Granted Mar 24, 2026 • 2y 5m to grant
Patent 12580071 — SYSTEMS AND METHODS TO PROCESS ELECTRONIC IMAGES WITH AUTOMATIC PROTOCOL REVISIONS
Granted Mar 17, 2026 • 2y 5m to grant
Patent 12566270 — APPARATUS FOR ASSISTING DRIVING OF VEHICLE AND METHOD THEREOF
Granted Mar 03, 2026 • 2y 5m to grant
Patent 12568181 — METHOD AND DEVICE OF VIDEO VIRTUAL BACKGROUND IMAGE PROCESSING AND COMPUTER APPARATUS
Granted Mar 03, 2026 • 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 79%
With Interview: 90% (+10.5%)
Median Time to Grant: 2y 6m
PTA Risk: Low
Based on 686 resolved cases by this examiner. Grant probability derived from career allow rate.
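The with-interview figure appears to follow from adding the interview lift to the baseline allow rate: 79% + 10.5 points ≈ 89.5%, displayed rounded as 90%.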
