DETAILED ACTION
This communication is responsive to Application No. 18/165,862, filed on February 7, 2023, in which Claims 1-20 are presented for examination.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1:
Step 1: Claim 1 is a method-type claim. Therefore, Claim 1 and its dependent Claims 2-13 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance
of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable
interpretation, covers performance of the limitation by mathematical calculation but for the recitation
of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract
ideas.
obtaining, […], a first representation and a second representation of an environment, wherein the environment is sensed by […], and wherein the first representation and the second representation are generated based on sensor data from […] (mental process – obtaining a first representation and a second representation may be performed manually by a user observing/analyzing the sensor data and accordingly using judgment/evaluation to obtain a first representation and a second representation based on said analysis)
determining, […] , one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (mental process – determining one or more discrepancies between the first representation and the second representation may be performed mentally by a user observing/analyzing the first representation and the second representation and accordingly using judgement/evaluation to determine discrepancies between the two representations)
generating, […], a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (mental process – generating a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists may be performed mentally by a user observing/analyzing the discrepancies and accordingly using judgment/evaluation to generate a training data set based on said analysis)
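Examiner's note (illustration only, not claim language or evidence of record): the following minimal Python sketch shows one way the limitations above could be implemented computationally, assuming the two representations are grid-aligned integer class maps of equal shape and that the sensor data is indexable per grid cell; all identifiers are hypothetical.

    import numpy as np

    def build_training_set(first_rep, second_rep, sensor_data):
        first_rep = np.asarray(first_rep)
        second_rep = np.asarray(second_rep)
        # A discrepancy exists wherever the two representations classify
        # the same portion of the environment differently.
        discrepancy_cells = np.argwhere(first_rep != second_rep)
        training_set = []
        for row, col in discrepancy_cells:
            training_set.append({
                "portion": (int(row), int(col)),
                "first_class": int(first_rep[row, col]),
                "second_class": int(second_rep[row, col]),
                # Subset of the sensor data reflecting this portion.
                "sensor_subset": sensor_data[row][col],
            })
        return training_set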
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
[…] at least one processor of a computing device […] (recited at a high-level of generality (i.e., a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
Step 2B: The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
[…] at least one processor of a computing device […] (recited at a high-level of generality (i.e., a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
For the reasons above, Claim 1 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent Claims 2-13. The additional limitations of the dependent claims are addressed below.
Regarding Claim 2:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 2 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first representation is generated by application of a machine learning model to the sensor data, and wherein the method further comprises retraining the machine learning model based at least partly on the training data set (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of training a machine learning model according to training data set without significantly more)
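Examiner's note (illustration only): a minimal sketch of the recited retraining, assuming a generic PyTorch classifier and a training set of (input, label) tensor pairs; the loop is a conventional parameter-update procedure of the kind Wang describes at Par. [0141], and all identifiers are hypothetical.

    import torch

    def retrain(model, training_set, epochs=1, lr=1e-4):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for sensor_subset, label in training_set:
                optimizer.zero_grad()
                loss = loss_fn(model(sensor_subset), label)
                loss.backward()   # backpropagation, per Wang Par. [0141]
                optimizer.step()  # update model parameters
        return model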
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 1. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 3:
Step 2A Prong 1: See the rejection of Claim 2 above, which Claim 3 depends on.
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
transmitting […] the retrained machine learning model to at least one autonomous vehicle […] (Adding insignificant extra-solution activity to the judicial exception - see MPEP 2106.05(g))
[…] wherein the at least one autonomous vehicle uses the retrained machine learning model to infer object classifications of objects based on additional sensor data (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using the retrained machine learning model to infer object classifications without significantly more)
Step 2B:
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
transmitting […] the retrained machine learning model to at least one autonomous vehicle […] (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
[…] wherein the at least one autonomous vehicle uses the retrained machine learning model to infer object classifications of objects based on additional sensor data (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using the retrained machine learning model to infer object classifications without significantly more)
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 2. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 4:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 4 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first representation is a semantic segmentation map of the environment, and the second representation is an occupancy grid of the environment (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the first representation is a semantic segmentation map of the environment, and the second representation is an occupancy grid of the environment does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
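Examiner's note (illustration only): the two claimed representation types can be pictured as small arrays over the same grid; the class indices and grid values below are hypothetical.

    import numpy as np

    # First representation: a semantic segmentation map, one class label per
    # cell (e.g., 0 = road, 1 = vehicle, 2 = pedestrian, 3 = vegetation).
    semantic_map = np.array([[0, 0, 1],
                             [0, 2, 1],
                             [3, 0, 0]])

    # Second representation: an occupancy grid over the same portion of the
    # environment (1 = occupied, 0 = free).
    occupancy_grid = np.array([[0, 0, 1],
                               [0, 1, 0],
                               [1, 0, 0]])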
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 1. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 5:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 5 depends on.
wherein generating the training data set comprises clustering the one or more discrepancies into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set (mental process – clustering the one or more discrepancies into one or more discrepancy groups may be performed mentally by a user observing/analyzing the discrepancies and accordingly using judgement/evaluation to cluster the one or more discrepancies into one or more discrepancy groups based on said analysis)
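Examiner's note (illustration only): a minimal sketch of the recited clustering, assuming discrepancies are given as a boolean grid mask and that adjacent discrepancy cells are grouped by connected-component labeling; all identifiers are hypothetical.

    import numpy as np
    from scipy import ndimage

    def cluster_discrepancies(discrepancy_mask, sensor_data):
        # Group adjacent discrepancy cells into discrepancy groups.
        groups, n_groups = ndimage.label(discrepancy_mask)
        training_set = []
        for group_id in range(1, n_groups + 1):
            cells = np.argwhere(groups == group_id)
            # Each discrepancy group corresponds to one training-set entry.
            training_set.append({
                "group": group_id,
                "cells": [tuple(map(int, c)) for c in cells],
                "sensor_subset": [sensor_data[r][c] for r, c in cells],
            })
        return training_set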
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 6:
Step 2A Prong 1: See the rejection of Claim 5 above, which Claim 6 depends on.
wherein generating the training data set further comprises programmatically labeling each of the one or more discrepancy groups based on a classification, within at least one of the first or second representations, of the portion of the environment at which the discrepancy exists (mental process – programmatically labeling each of the one or more discrepancy groups based on a classification may be performed mentally by a user observing/analyzing the classification of the discrepancies and accordingly using judgment/evaluation to label each of the one or more discrepancy groups based on said analysis)
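Examiner's note (illustration only): a minimal sketch of the recited programmatic labeling, assuming each group's label is taken from the classification the first representation assigns within that group (here by majority vote, an assumed rule); all identifiers are hypothetical.

    from collections import Counter

    def label_groups(groups_cells, first_rep):
        # groups_cells: {group_id: [(row, col), ...]} for each group.
        labels = {}
        for group_id, cells in groups_cells.items():
            classes = [int(first_rep[r, c]) for r, c in cells]
            # Label the group with its most common class in the first
            # representation.
            labels[group_id] = Counter(classes).most_common(1)[0][0]
        return labels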
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 7:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 7 depends on.
Step 2A Prong 2 & Step 2B:
wherein the difference in classification corresponds to a difference in confidence for classification of the portion of the environment in the first representation and confidence for classification of the portion of the environment in the second representation (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the difference in classification corresponds to a difference in confidence for classification of the portion of the environment in the first representation and confidence for classification of the portion of the environment in the second representation does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
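Examiner's note (illustration only): a minimal sketch of a confidence-based discrepancy test, assuming each representation carries a per-cell classification confidence in [0, 1]; the threshold value is hypothetical.

    import numpy as np

    def confidence_discrepancies(first_conf, second_conf, threshold=0.3):
        # Flag cells where the classification confidences of the two
        # representations diverge by more than the threshold.
        diff = np.abs(np.asarray(first_conf) - np.asarray(second_conf))
        return diff > threshold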
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 1. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 8:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 8 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first representation is generated by passing the sensor data through a model generated via application of machine learning to additional sensor data and wherein the second representation is generated by application of a non-machine-learned algorithm to the sensor data (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using a machine learning model to generate a representation according to the sensor data without significantly more)
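Examiner's note (illustration only): a minimal sketch of a non-machine-learned algorithm producing the second representation, here a height-threshold occupancy test over lidar returns binned into grid cells; the binning, threshold, and the assumption that points lie in a non-negative grid frame are hypothetical.

    import numpy as np

    def occupancy_from_lidar(points, grid_shape=(100, 100), cell=0.5,
                             z_min=0.2):
        # points: (N, 3) array of x, y, z lidar returns.
        grid = np.zeros(grid_shape, dtype=np.uint8)
        for x, y, z in points:
            row, col = int(x / cell), int(y / cell)
            if (0 <= row < grid_shape[0] and 0 <= col < grid_shape[1]
                    and z > z_min):
                grid[row, col] = 1  # a sufficiently tall return = occupied
        return grid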
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 1. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 9:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 9 depends on.
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (mental process – comparing corresponding elements of the first representation and the second representation may be performed mentally by a user observing/analyzing the corresponding elements and accordingly using judgement/evaluation to compare corresponding elements of the first representation and the second representation based on said analysis)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (mental process – identifying a difference in classification between the first particular element and the second particular element may be performed mentally by a user observing/analyzing the classification between the first particular element and the second particular element and using judgement/evaluation to identify a difference in classification between the first particular element and the second particular element based on said analysis)
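Examiner's note (illustration only): a minimal sketch of the element-wise comparison recited above, assuming boolean occupancy grids whose element (i, j) corresponds to the same portion of the environment in both representations; all identifiers are hypothetical.

    import numpy as np

    def occupancy_differences(first_rep, second_rep):
        first_rep = np.asarray(first_rep, dtype=bool)    # True = occupied
        second_rep = np.asarray(second_rep, dtype=bool)
        # Flag a classification difference where the first representation
        # indicates "not occupied" but the second indicates "occupied".
        return np.argwhere(~first_rep & second_rep)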
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 10:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 10 depends on.
aligning […] the first representation with the second representation (mental process – aligning the first representation with the second representation may be performed mentally by a user observing/analyzing the first representation and the second representation and using judgment/evaluation to align the first representation with the second representation based on said analysis)
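Examiner's note (illustration only): a minimal sketch of the recited aligning, assuming the misalignment reduces to a known integer grid translation; real systems may apply full pose transforms, so the offset-based approach is an assumption.

    import numpy as np

    def align(second_rep, row_offset, col_offset):
        # Shift the second representation's grid so its cells line up with
        # the corresponding cells of the first representation.
        return np.roll(np.asarray(second_rep),
                       shift=(row_offset, col_offset), axis=(0, 1))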
Step 2A Prong 2 & Step 2B: This judicial exception is not integrated into a practical application.
[…] by the at least one processor […] (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 1. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 11:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 11 depends on.
wherein the first representation labels portions of the environment as corresponding to a class of a plurality of classes, and wherein determining the one or more discrepancies comprises generating a binary representation of the first representation by representing portions of the environment labeled as a first subset of the plurality of classes with a first value and representing portions of the environment labeled as a second subset of the plurality of classes with a second value (mental process - generating a binary representation of the first representation may be performed mentally by a user observing/analyzing the first representation and using judgement/evaluation to generate a binary representation of the first representation based on said analysis)
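Examiner's note (illustration only): a minimal sketch of the recited binarization, assuming the first representation is an integer class map and the two class subsets partition the label set (e.g., occupying-type versus non-occupying-type classes, as in Claim 12); the class indices are hypothetical.

    import numpy as np

    def binarize(first_rep, first_subset, first_value=1, second_value=0):
        # Map cells labeled with the first subset of classes to the first
        # value and all remaining cells to the second value.
        mask = np.isin(np.asarray(first_rep), list(first_subset))
        return np.where(mask, first_value, second_value)

    # E.g., classes {1: vehicle, 2: pedestrian} as the first (occupying-type)
    # subset; everything else (road, vegetation, ...) as the second subset.
    binary_rep = binarize(np.array([[0, 1], [2, 3]]), first_subset={1, 2})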
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 12:
Step 2A Prong 1: See the rejection of Claim 11 above, which Claim 12 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first subset of the plurality of classes comprises occupying-type classes, and wherein the second subset of the plurality of classes comprises non-occupying-type classes (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the first subset of the plurality of classes comprises occupying-type classes, and wherein the second subset of the plurality of classes comprises non-occupying-type classes does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 11. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 13:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 13 depends on.
filtering the one or more discrepancies to remove at least one discrepancy from the one or more discrepancies, the at least one discrepancy satisfying removal criteria comprising one or more of a minimum size, a minimum dimensionality, or a minimum visibility from a point of view of the at least one sensor (mental process – filtering the one or more discrepancies to remove at least one discrepancy from the one or more discrepancies may be performed mentally by a user observing/analyzing the one or more discrepancies and accordingly using judgment/evaluation to filter the one or more discrepancies to remove at least one discrepancy from the one or more discrepancies based on said analysis)
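Examiner's note (illustration only): a minimal sketch of the recited filtering, assuming discrepancy groups carry their member cells and that only the minimum-size criterion is checked; dimensionality and sensor-visibility criteria would follow the same pattern, and all identifiers are hypothetical.

    def filter_discrepancies(discrepancy_groups, min_size=3):
        # Remove discrepancy groups that fall below a minimum size (cell
        # count); keep the rest.
        return [group for group in discrepancy_groups
                if len(group["cells"]) >= min_size]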
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 14:
Step 1: Claim 14 is a system-type claim. Therefore, Claim 14 and its dependent Claims 15-17 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance
of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable
interpretation, covers performance of the limitation by mathematical calculation but for the recitation
of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract
ideas.
obtain a first representation and a second representation of an environment, wherein the environment is sensed by […], and wherein the first representation and the second representation are generated based on sensor data from […] (mental process – obtaining a first representation and a second representation may be performed manually by a user observing/analyzing the sensor data and accordingly using judgment/evaluation to obtain a first representation and a second representation based on said analysis)
determining one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (mental process – determining one or more discrepancies between the first representation and the second representation may be performed mentally by a user observing/analyzing the first representation and the second representation and accordingly using judgement/evaluation to determine discrepancies between the two representations)
generating a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (mental process – generating a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists may be performed mentally by a user observing/analyzing the discrepancies and accordingly using judgment/evaluation to generate a training data set based on said analysis)
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
a data store […]; a processor […] (recited at a high-level of generality (i.e., a data store, a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
Step 2B: The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
a data store […]; a processor […] (recited at a high-level of generality (i.e., a data store, a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
For the reasons above, Claim 14 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent Claims 15-17. The additional limitations of the dependent claims are addressed below.
Regarding Claim 15:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 15 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first representation is generated by application of a machine learning model to the sensor data, and wherein the execution of the computer-executable instructions causes the system to retrain the machine learning model based at least partly on the training data set (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of training a machine learning model according to training data set without significantly more)
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 14. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 16:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 16 depends on.
wherein generating the training data set comprises clustering the one or more discrepancies into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set (mental process – clustering the one or more discrepancies into one or more discrepancy groups may be performed mentally by a user observing/analyzing the discrepancies and accordingly using judgement/evaluation to cluster the one or more discrepancies into one or more discrepancy groups based on said analysis)
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 17:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 17 depends on.
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (mental process – comparing corresponding elements of the first representation and the second representation may be performed mentally by a user observing/analyzing the corresponding elements and accordingly using judgement/evaluation to compare corresponding elements of the first representation and the second representation based on said analysis)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (mental process – identifying a difference in classification between the first particular element and the second particular element may be performed mentally by a user observing/analyzing the classification between the first particular element and the second particular element and using judgement/evaluation to identify a difference in classification between the first particular element and the second particular element based on said analysis)
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 18:
Step 1: Claim 18 is a non-transitory computer-readable storage media type claim. Therefore, Claim 18 and its dependent Claims 19-20 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance
of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable
interpretation, covers performance of the limitation by mathematical calculation but for the recitation
of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract
ideas.
obtain a first representation and a second representation of an environment, wherein the environment is sensed by […], and wherein the first representation and the second representation are generated based on sensor data from […] (mental process – obtaining a first representation and a second representation may be performed manually by a user observing/analyzing the sensor data and accordingly using judgment/evaluation to obtain a first representation and a second representation based on said analysis)
determining one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (mental process – determining one or more discrepancies between the first representation and the second representation may be performed mentally by a user observing/analyzing the first representation and the second representation and accordingly using judgement/evaluation to determine discrepancies between the two representations)
generating a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (mental process – generating a training data set comprising, for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists may be performed mentally by a user observing/analyzing the discrepancies and accordingly using judgment/evaluation to generate a training data set based on said analysis)
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
[…] a processor […] (recited at a high-level of generality (i.e., a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
Step 2B: The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
[…] a processor […] (recited at a high-level of generality (i.e., a generic processor of a computing device, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
[…] at least one sensor of an autonomous vehicle […] the at least one sensor (recited at a high-level of generality (i.e., a generic processor, sensor of an autonomous vehicle, and memory) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
For the reasons above, Claim 18 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent Claims 19-20. The additional limitations of the dependent claims are addressed below.
Regarding Claim 19:
Step 2A Prong 1: See the rejection of Claim 18 above, which Claim 19 depends on.
Step 2A Prong 2 & Step 2B:
wherein the first representation is generated by application of a machine learning model to the sensor data, and wherein the execution of the computer-executable instructions causes the system to retrain the machine learning model based at least partly on the training data set (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of training a machine learning model according to training data set without significantly more)
Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of Claim 18. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Regarding Claim 20:
Step 2A Prong 1: See the rejection of Claim 18 above, which Claim 20 depends on.
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (mental process – comparing corresponding elements of the first representation and the second representation may be performed mentally by a user observing/analyzing the corresponding elements and accordingly using judgement/evaluation to compare corresponding elements of the first representation and the second representation based on said analysis)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (mental process – identifying a difference in classification between the first particular element and the second particular element may be performed mentally by a user observing/analyzing the classification between the first particular element and the second particular element and using judgement/evaluation to identify a difference in classification between the first particular element and the second particular element based on said analysis)
Step 2A Prong 2 & Step 2B:
Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements that, considered individually and in combination, are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 7-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Subhasis et al. (hereinafter Subhasis) (JP 2022554184), in view of Wang et al. (hereinafter Wang) (US 2023/0139772).
Regarding Claim 1, Subhasis teaches a method (Subhasis, Par. [0098], “A method”, thus a method is disclosed) implemented by at least one processor of a computing device (Subhasis, Par. [0042], “Computing device 214 may also include processor 222 and/or memory 224”, thus a processor of a computing device is disclosed), the method comprising:
obtaining, by the at least one processor, a first representation and a second representation of an environment, wherein the environment is sensed by at least one sensor of an autonomous vehicle, and wherein the first representation and the second representation are generated based on sensor data from the at least one sensor (Subhasis, Par. [0007], “In some examples, one or more sensors of a sensor type are associated with a pipeline (e.g., sequence of operations; steps; networks or layers thereof; machine learning models; analog-to-digital converters; Determine information about objects associated with hardware such as amplifiers, field programmable gate arrays (FPGAs), application specific integrated circuits (ASIC(s), and/or the like) and contained in the associated sensor data can be used to Sensor data may be received from one or more sensors of that type, and a pipeline (sometimes referred to herein as a perceptual pipeline) generates an environmental sensor based at least in part on the sensor data. can generate a representation of For simplicity, the collective output of the pipeline is referred to herein as the environment representation. The environment representation may include one or more object detections and may include one or more output types. For example, video pipeline 302 may output environment representation 308 based at least in part on video data 310 (eg, sensor data including one or more RGB images, thermal images)”, & Par. [0011], “In some examples, aggregated data may additionally or alternatively be data from remote computing devices and/or map data (e.g., road data, drivable surface locations, destinations), for example. , weather data, traffic notifications (e.g. congestion, collisions, lane changes, construction, speed changes), safety notifications (e.g. environmentally hazardous locations, disaster locations, road conditions, visibility conditions), etc. , and/or the like. In some examples, the remote computing device may be another autonomous vehicle, third party service, distributed computing device, remote sensor, and/or the like.”, & Par. [0012], “In some examples, data aggregated from different pipelines may also include at least a portion of the environment representation for one or more previous times. For example, the perceptual pipelines associated with different sensor types can be synchronized to generate environment representations with the same frequency (eg, every 100 ms, 500 ms, 1 second).”, & Par. [0059], “For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object class and so on.”, thus obtaining, by at least one processor, a first representation and a second representation of an environment sensed by at least one sensor of an autonomous vehicle and generated based on sensor data is disclosed, because Subhasis teaches that sensor data from sensors of an autonomous vehicle are received and processed by multiple perceptual pipelines to generate environment representations. 
Subhasis describes a visual pipeline that processes camera sensor data to generate an environment representation indicating occupancy and object information, and a lidar pipeline that processes lidar sensor data to generate a separate environment representation indicating occupancy and object information for the same environment. The environment representation output by the visual pipeline corresponds to the first representation, the environment representation output by the lidar pipeline corresponds to the second representation, and both representations are generated by the processor from sensor data of the autonomous vehicle and represent the same sensed environment)
determining, by the at least one processor, one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (Subhasis, Par. [0003], “Small discrepancies between the detections determined in relation to the two different sensor types can cause jitter (i.e. "flying") and/or flickering (i.e. appearing and disappearing) in the representation of objects created by the vehicle. Also, some sensor types, such as depth cameras, are prone to large errors in depth measurements, which can further complicate object tracking. This can hamper safe navigation of vehicles and training of machine learning (ML) models. Additionally, techniques for reducing discrepancies and/or techniques for smoothing object representations or data associated therewith may consume computing bandwidth and/or memory.”, & Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation is determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). It may include one probable object detection. For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with future time, a velocity associated with the ROI. , the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or Azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center obtained), and/or the height associated with the ROI (eg, the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique”, thus determining, by the at least one processor, one or more discrepancies between the first representation and the second representation, each discrepancy corresponding to a difference in classification of a portion of the environment, is disclosed, because Subhasis teaches that detections generated from different sensor types and perceptual pipelines may differ, resulting in discrepancies between object representations. Subhasis discloses that each perceptual pipeline produces object detections that include object classification information (e.g., vehicle, pedestrian, bicycle), and that object detections may differ in dimension, location, or even existence between different pipelines. Since object classification is included in each environment representation, a discrepancy between the first representation and the second representation corresponds to a difference in classification of a portion of the environment. 
Accordingly, the processor’s identification of discrepancies between pipeline outputs reads on determining discrepancies between the first and second representations, and differences in object classification between those representations read on the difference in classification of a portion of the environment.)
[…] by the at least one processor […] for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (Subhasis, Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation is determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). It may include one probable object detection. For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with future time, a velocity associated with the ROI. , the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or Azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center obtained), and/or the height associated with the ROI (eg, the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique.”, thus this limitation is disclosed, because Subhasis teaches that discrepancies between environment representations produced by different perceptual pipelines are localized to specific regions of interest corresponding to portions of the environment, and that those regions of interest are derived from and defined by sensor data processed by the pipelines. Identifying regions of interest where object detections differ necessarily involves isolating the sensor data associated with those portions of the environment, which reads on generating, by the processor, a subset of sensor data reflecting the portion of the environment at which each discrepancy exists.)
Subhasis does not explicitly teach generating a training data set.
However, Wang teaches generating a training data set (Wang, Par. [0128], “Generating Training Data from Real-World Sensor Data. In some embodiments, training data may be generated by collecting and annotating real-world sensor data. For example, one or more vehicles may collect frames of sensor data (e.g., image data and LiDAR data) from one or more sensors (e.g., camera(s) and LiDAR sensor(s)) of the vehicle(s) in real-world (e.g., physical) environments”, & Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning models) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn.”, thus generating a training data set comprising sensor data used for machine-learning training is disclosed, because Wang teaches collecting sensor data from vehicle-mounted sensors, annotating that sensor data to form training data sets, and using those data sets to update the parameters of one or more machine-learning models. The collected and annotated sensor data in Wang reads on the training data set, and the sensor frames corresponding to regions used for model training correspond to subsets of sensor data reflecting portions of the environment.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system, which identifies and localizes discrepancies between environment representations produced by different perceptual pipelines and reads on detecting discrepancies associated with specific portions of an environment, with Wang’s techniques for generating training datasets from real-world autonomous-vehicle sensor data to train and update machine-learning models, which read on generating training data sets from subsets of sensor data, because Wang provides a framework for collecting and annotating sensor data for machine-learning training that can be applied to the discrepancy-localized regions identified by Subhasis, thereby enabling the generation of training data sets comprising sensor data subsets corresponding to portions of the environment where discrepancies occur. (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose generating, by at least one processor, a training data set comprising, for each discrepancy, a subset of sensor data reflecting the portion of the environment at which the discrepancy exists)
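Examiner's note (illustration only, not evidence of record): a minimal sketch of the multi-channel aggregation Subhasis describes at Par. [0010] and Par. [0059], with one channel per perceptual pipeline; the channel contents are hypothetical.

    import numpy as np

    vision_occupancy = np.array([[0, 1], [0, 0]])  # video-pipeline channel
    lidar_occupancy  = np.array([[0, 1], [1, 0]])  # lidar-pipeline channel
    object_class     = np.array([[0, 2], [2, 0]])  # per-cell class channel

    # Stack into a multi-channel image of shape (channels, rows, cols), the
    # aggregated input form described for the ML model.
    aggregated = np.stack([vision_occupancy, lidar_occupancy, object_class])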
Regarding Claim 2, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
wherein the first representation is generated by application of a machine learning model to the sensor data (Subhasis, Par. [0010], “In some examples, the techniques discussed herein involve aggregating at least some of the environmental representations associated with different sensor types and applying them to an ML model trained to output inferred object detection. and providing aggregated data as input. In some examples, the aggregated data may be represented in a multi-channel image, and different channels may be associated with different sensor types from which sensory data was generated and/or different types of sensory data. For example, the aggregated data may be a lidar, video, and/or radar occupancy grid {e.g., pixels indicating whether or not the corresponding location in the environment is occupied according to the perceptual pipeline associated with each sensor data type. etc.}, top-down display of ROIs generated in association with lidar, video and/or radar, object classification associated with a portion of the environment, which portion of the environment is occupied It may include probability, yaw of the detected object, and/or the like. See US patent application Ser. No. 16/591,518 relating to occupancy maps, which is hereby incorporated by reference in its entirety. In some examples, the occupancy grid may extend to a maximum height that may correspond to the height of the autonomous vehicle plus a buffer. In other words, the occupancy grid may indicate the occupancy of a portion of the environment below the maximum height. For example, traffic lights and billboards placed on the road may exceed the maximum height, so the occupancy grid does not indicate that they occupy part of the environment”, thus wherein the first representation is generated by application of a machine learning model to the sensor data is disclosed, because Subhasis teaches aggregating sensor data from multiple sensor types and providing that aggregated sensor data as input to a machine-learning model trained to output inferred object detections and environment representations. The ML model generates outputs including object classifications, regions of interest, and associated probabilities, which correspond to the first representation, while the aggregated lidar, video, and radar data read on the sensor data, such that the first representation is generated by application of a machine-learning model to sensor data sensed by the autonomous vehicle)
Subhasis does not explicitly teach the method further comprising retraining the machine learning model based at least partly on the training data set.
However, Wang teaches the method further comprising retraining the machine learning model based at least partly on the training data set (Wang, Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning models) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn”, thus retraining the machine-learning model based at least partly on a training data set is disclosed, because Wang describes updating model parameters through training iterations using collected sensor derived training data)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system with Wang’s techniques for training and retraining machine-learning models using real-world autonomous-vehicle sensor data, because both references are directed to improving the accuracy and reliability of environment representations used for autonomous driving (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose retraining a machine-learning model used to generate environment representations based at least partly on training data derived from autonomous-vehicle sensor data, such that the machine-learning model applied to sensor data in Subhasis can be updated using Wang’s training techniques. The combination improves the accuracy and robustness of the perception system by using discrepancy-localized sensor data to retrain or refine the machine-learning model, thereby reducing inconsistencies between perceptual pipelines and enhancing the reliability of environment representations used for autonomous vehicle planning and control)
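Solely to illustrate the type of parameter-update loop that Wang's cited paragraph describes, the following minimal sketch performs gradient updates against an L2 loss until the error becomes small. The data, learning rate, and linear model are hypothetical; this is not Wang's implementation.

```python
import numpy as np

# Hypothetical sketch of the retraining step described by Wang: compare model
# output to ground truth with an L2 loss and update parameters iteratively.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))          # features from the training data set
true_w = rng.normal(size=8)
y = X @ true_w                         # ground-truth targets

w = np.zeros(8)                        # model parameters to be (re)trained
lr = 0.01
for _ in range(500):                   # training iterations
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)   # gradient of the mean L2 loss
    w -= lr * grad                     # backward-pass style parameter update

print(float(np.mean((X @ w - y) ** 2)))  # loss approaches zero
```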
Regarding Claim 3, Subhasis combined with Wang teaches all of the limitations of claim 2 as cited above and Subhasis further teaches:
transmitting, by the at least one processor, the retrained machine learning model to at least one autonomous vehicle, wherein the at least one autonomous vehicle uses the retrained machine learning model to infer object classifications of objects based on additional sensor data (Subhasis, Par. [0010], “In some examples, the techniques discussed herein involve aggregating at least some of the environmental representations associated with different sensor types and applying them to an ML model trained to output inferred object detection. and providing aggregated data as input. In some examples, the aggregated data may be represented in a multi-channel image, and different channels may be associated with different sensor types from which sensory data was generated and/or different types of sensory data. For example, the aggregated data may be a lidar, video, and/or radar occupancy grid {e.g., pixels indicating whether or not the corresponding location in the environment is occupied according to the perceptual pipeline associated with each sensor data type. etc.}, top-down display of ROIs generated in association with lidar, video and/or radar, object classification associated with a portion of the environment, which portion of the environment is occupied It may include probability, yaw of the detected object, and/or the like. See US patent application Ser. No. 16/591,518 relating to occupancy maps, which is hereby incorporated by reference in its entirety. In some examples, the occupancy grid may extend to a maximum height that may correspond to the height of the autonomous vehicle plus a buffer. In other words, the occupancy grid may indicate the occupancy of a portion of the environment below the maximum height. For example, traffic lights and billboards placed on the road may exceed the maximum height, so the occupancy grid does not indicate that they occupy part of the environment”, & Par. [0011], “In some examples, aggregated data may additionally or alternatively be data from remote computing devices and/or map data (e.g., road data, drivable surface locations, destinations), for example. , weather data, traffic notifications (e.g. congestion, collisions, lane changes, construction, speed changes), safety notifications (e.g. environmentally hazardous locations, disaster locations, road conditions, visibility conditions), etc. , and/or the like. In some examples, the remote computing device may be another autonomous vehicle, third party service, distributed computing device, remote sensor, and/or the like”, thus transmitting, by the at least one processor, the retrained machine-learning model to at least one autonomous vehicle and using the retrained machine-learning model to infer object classifications based on additional sensor data is disclosed, because Subhasis teaches deploying a machine-learning model within an autonomous-vehicle perception system where aggregated sensor data from lidar, video, and radar sensors is provided as input to the model to generate object detections, classifications, occupancy grids, and regions of interest. The ML model described by Subhasis is applied within the autonomous vehicle to process newly sensed sensor data and infer object classifications in the vehicle’s environment, which reads on transmitting a trained or retrained machine-learning model for use by an autonomous vehicle to infer object classifications based on additional sensor data.)
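As a purely hypothetical sketch of the transmit-and-infer step (neither reference discloses this code; the payload format and the toy classifier are assumptions for illustration), a retrained model's parameters could be serialized, delivered to a vehicle, and used to classify additional sensor data:

```python
import io
import pickle
import numpy as np

# Hypothetical sketch: serialize retrained parameters for transmission to a
# vehicle, which then uses them to classify objects from new sensor data.
retrained_weights = np.random.rand(3, 4)      # stand-in model parameters

payload = io.BytesIO()
pickle.dump({"version": 2, "weights": retrained_weights}, payload)
# payload.getvalue() would travel over the vehicle's update channel (assumed).

received = pickle.loads(payload.getvalue())   # on-vehicle side

def classify(sensor_features: np.ndarray, weights: np.ndarray) -> int:
    """Infer an object class index from additional sensor data."""
    return int(np.argmax(weights @ sensor_features))

print(classify(np.random.rand(4), received["weights"]))
```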
Regarding Claim 4, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Wang further teaches:
wherein the first representation is a semantic segmentation map of the environment and the second representation is an occupancy grid of the environment (Wang, Par. [0024], “In some examples, the perception component 110 receives sensor data from the sensors 104 and data related to objects in the vicinity of the vehicle 102 (e.g., object classification, instance segmentation, semantic segmentation, 2D and/or 3D bounding boxes, tracking), route data specifying the vehicle's destination, global map data specifying road features (e.g. detectable by different sensor modalities useful for autonomous vehicle orientation). features), local map data identifying features detected in proximity to the vehicle (e.g., location of buildings, trees, fences, fire hydrants, stop signs, and any other features detectable by various sensor modalities and/or or dimensions), tracking data (eg, environment representation, object detection and/or tracking as discussed herein), and the like”, & Par. [0029], “For example, FIG. 1 shows a top-down representation 120 of the environment, which may be part of the final environment representation determined by the ML model of tracking component 114 . A top down representation 120 shows object detection, illustrated in this case as an estimated ROI 122 . Top-down representation 120 and/or estimated ROI 122 may be determined by the ML model of tracking component 114 based at least in part on object detections received from one or more perception pipelines. For example, object detections provided as input to the ML model are associated with three-dimensional ROIs, one of which is denoted as ROI 126, associated with image 124, and lidar data 128 (e.g., two-dimensional and/or three-dimensional). obtain), of which ROI 130 was shown”, & Par. [0056], “Radar pipeline 304 may determine an environment representation (not shown to save drawing space) that includes:- an occupancy map, which may include portions indicated as being occupied by objects - discussed in more detail in US patent application Ser. No. 16/407,139, which is incorporated herein by reference in its entirety , an occlusion grid (including, for example, the probability that parts of the environment are hidden from the line of sight to one or more of the radar sensors)”, thus wherein the first representation is a semantic segmentation map of the environment and the second representation is an occupancy grid of the environment is disclosed, because Wang teaches generating semantic segmentation outputs and occupancy-based environment representations from autonomous-vehicle sensor data. Wang discloses that the perception component receives sensor data and produces semantic segmentation and object classification outputs describing portions of the environment, which corresponds to the first representation as a semantic segmentation map. Wang also teaches generating top-down environment representations and occupancy maps indicating whether portions of the environment are occupied by objects, including radar- and lidar-based occupancy grids, which read on the second representation as an occupancy grid. Accordingly, Wang discloses using semantic segmentation maps and occupancy grids as distinct environment representations generated from sensor data)
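For illustration only (class ids and grid sizes are assumed, not taken from Wang), the following sketch contrasts the two representation types at issue: a semantic segmentation map labeling cells by class, and an occupancy grid labeling the same cells as occupied or not.

```python
import numpy as np

# Hypothetical sketch: a semantic segmentation map (first representation) and
# an occupancy grid (second representation) over the same 2D environment.
CLASSES = {0: "free", 1: "vehicle", 2: "pedestrian", 3: "road"}
seg_map = np.random.default_rng(1).integers(0, 4, size=(32, 32))  # class ids

occ_grid = np.random.default_rng(2).random((32, 32)) > 0.8        # occupied?

# Classes assumed to imply occupancy; "free" and "road" do not.
occupying = np.isin(seg_map, [1, 2])
print(f"cells where the two representations agree: {(occupying == occ_grid).mean():.2%}")
```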
Regarding Claim 7, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
wherein the difference in classification corresponds to a difference in confidence for classification of the portion of the environment in the first representation and confidence for classification of the portion of the environment in the second representation (Subhasis, Par. [0072], “Each and any of the object detection components described above may be associated with a regressed confidence score. For example, object classification may be associated with confidence scores, ROI may be determined based at least in part on the confidence scores associated with different pixels via non-maximum suppression techniques, and occupancy may be determined by each It may be determined based at least in part on the likelihood associated with the pixel as determined by the ML model of the respective pipeline, and so on”, & Par. [0073], “The ML model may additionally or alternatively determine a confidence score associated with any of these outputs. In some examples, ROIs may be generated based at least in part on anchor boxes or any other canonical object shape associated with the object classification for which the ML model was trained”, thus the difference in classification corresponding to a difference in confidence is disclosed, because Subhasis teaches that object classifications, occupancy determinations, and regions of interest produced by each perceptual pipeline are each associated with confidence scores output by the respective models. When classifications differ between the first representation and the second representation, those differing classifications are accompanied by differing confidence values generated by the respective pipelines, such that a discrepancy in classification corresponds to a difference in confidence for the same portion of the environment across the two representations.)
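A minimal sketch, under assumed confidence arrays and an assumed threshold, of how a classification discrepancy could be characterized as a difference in per-cell confidence between the two representations:

```python
import numpy as np

# Hypothetical sketch: per-cell classification confidences from two pipelines;
# a discrepancy may be characterized by the confidence gap at the same cell.
conf_first = np.random.default_rng(3).random((32, 32))
conf_second = np.random.default_rng(4).random((32, 32))

confidence_gap = np.abs(conf_first - conf_second)
discrepant = confidence_gap > 0.5        # assumed threshold
print(int(discrepant.sum()), "cells flagged by confidence difference")
```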
Regarding Claim 8, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
wherein the first representation is generated by passing the sensor data through a model generated via application of machine learning to additional sensor data, and wherein the second representation is generated by application of a non-machine-learned algorithm to the sensor data (Subhasis, Par. [0072], “For example, the environment representations may be in a common frame of reference or transformed into a common frame of reference during aggregation”, & Par. [0072], “ROI may be determined based at least in part on the confidence scores associated with different pixels via non-maximum suppression techniques”, & Par. [0073], “At operation 424, example process 400 may include receiving estimated object detection 426 as an output from the ML model in accordance with any of the techniques discussed herein. In some examples, the ML model may be trained to output a final environment representation 428 and/or an inferred object detection 426.”, thus wherein the first representation is generated by passing the sensor data through a model generated via application of machine learning to additional sensor data, and wherein the second representation is generated by application of a non-machine-learned algorithm to the sensor data is disclosed, because Subhasis teaches generating a first environment representation by applying a trained machine-learning model to sensor data to produce outputs such as object detections, classifications, ROIs, and confidence scores, which reads on generating the first representation by passing sensor data through a machine-learned model. Subhasis also teaches determining regions of interest and occupancy information using non-machine-learned techniques such as non-maximum suppression and geometric aggregation applied directly to sensor-derived pixel data, which reads on generating the second representation by application of a non-machine-learned algorithm to the sensor data)
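To illustrate the non-machine-learned alternative only (the height band, point counts, and grid size are assumptions, not Subhasis's algorithm), a second representation can be produced by purely geometric thresholding of lidar points, with no trained model involved:

```python
import numpy as np

# Hypothetical sketch: a second representation from a non-machine-learned
# algorithm, namely simple geometric height thresholding of lidar points.
rng = np.random.default_rng(5)
points = rng.uniform(low=[0, 0, 0], high=[32, 32, 3], size=(5000, 3))  # x,y,z

occ = np.zeros((32, 32), dtype=bool)
for x, y, z in points:
    if 0.3 < z < 2.5:                    # assumed ground-clearance/max-height band
        occ[int(x), int(y)] = True       # mark the grid cell occupied

print(occ.sum(), "cells marked occupied without any learned model")
```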
Regarding Claim 9, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (Subhasis, Par. [0072], “In some examples, environment representations may be aggregated and provided as inputs to the ML model. In some examples, object detection may be separated from the rest of the environment representation and provided as input. For example, the environment representations may be in a common frame of reference or transformed into a common frame of reference during aggregation”, thus comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation is disclosed, because Subhasis teaches aligning and aggregating multiple environment representations into a common frame of reference, such that corresponding portions of the environment from different representations spatially correspond to one another. Once the representations are in a common reference frame, individual elements representing the same portion of the environment can be directly compared across representations, which reads on comparing corresponding elements of the first representation and the second representation where a particular element of one representation corresponds to a particular element of the other representation.)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (Subhasis, Par. [0059], “In some examples, the multi-channel data structure 324 may include multi-channel images, where each channel of the image may be processed by different pipelines and/or different types of output (e.g., occupancy maps, occlusion grids, ROIs, object classification, etc.). ). For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object classification and so on”, & Par. [0096], “For example, a first set of layers in classification layer 610 may determine whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. Another set of layers in layers 610 may determine whether an environment is associated with an estimated height bin and so. In some examples, the discrete portion of the set of object classification layers may additionally or alternatively include latitudes associated with each of the object classifications for which exemplary ML architecture 600 was trained. In other words, the classification output head may output a binary indication that a part of the environment is or is not relevant for classification (e.g. height bins, object classification, occupancy), Or the classification output head may output regressed values to which the NMS algorithm can be applied to determine the classification”, thus based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element is disclosed, because Subhasis teaches multi-channel environment representations in which corresponding elements represent the same portion of the environment across different perceptual pipelines and explicitly indicate occupancy status. Subhasis also teaches that a first channel may indicate whether a portion of the environment is occupied or unoccupied as determined by a visual pipeline, while a second channel indicates occupancy for the same portion as determined by a lidar pipeline. Subhasis discloses classification layers that output binary occupancy indications for discrete portions of the environment. Accordingly, when a corresponding element in one representation indicates unoccupied and the corresponding element in another representation indicates occupied, this reads on identifying a difference in classification between the first particular element and the second particular element)
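The element-wise comparison described above can be illustrated with the following hypothetical sketch (the two channels and their resolution are assumed), in which a cell marked unoccupied in one representation and occupied in the other is identified as a classification discrepancy:

```python
import numpy as np

# Hypothetical sketch: corresponding elements of two aligned occupancy
# channels are compared; a cell unoccupied in one and occupied in the other
# is recorded as a classification discrepancy.
rng = np.random.default_rng(6)
first = rng.random((32, 32)) > 0.85      # visual-pipeline channel (assumed)
second = rng.random((32, 32)) > 0.85     # lidar-pipeline channel (assumed)

discrepancies = np.argwhere(first != second)   # indices of differing cells
for i, j in discrepancies[:3]:
    print(f"cell ({i},{j}): first={first[i, j]}, second={second[i, j]}")
```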
Regarding Claim 10, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
aligning, by the at least one processor, the first representation with the second representation (Subhasis, Par. [0059], “Various environment representations produced by different pipelines can be aggregated into a multi channel data structure 324 . For example, this aggregation may involve projecting the data into a common reference frame and/or a common representation of the environment, such as a voxel space, mesh representation, etc., having the same dimensions. Aggregation may additionally or alternatively include projecting a 3D ROI onto a 2D ROI from a top-down perspective, and/or at least partially onto a 2D ROI associated with sensor perspective, depth, and/or object classification”, & Par. [0072], “Note that object detection may be provided to the ML model as part of the environment representation. For example, environment representation 406 includes various object detections and data not shown such as object velocity, estimated height, etc., as described above. In some examples, environment representations may be aggregated and provided as inputs to the ML model. In some examples, object detection may be separated from the rest of the environment representation and provided as input”, thus aligning, by the at least one processor, the first representation with the second representation is disclosed, because Subhasis teaches aggregating environment representations produced by different perceptual pipelines by projecting them into a common reference frame or common representation, such as a voxel space or top-down view, so that the representations have corresponding dimensions and spatial alignment. Subhasis also teaches providing these aligned environment representations together as inputs to a machine learning model. These teachings correspond to aligning the first representation with the second representation by the processor)
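A brief illustrative sketch of projecting one representation into the other's frame of reference prior to comparison; the relative pose below is an assumed 2D rigid transform, not a value from the reference:

```python
import numpy as np

# Hypothetical sketch: align one representation with the other by applying
# an assumed 2D rigid transform (rotation plus translation) to cell centers.
theta, tx, ty = np.deg2rad(5.0), 0.4, -0.2   # assumed relative pose
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

pts_second = np.random.default_rng(7).random((100, 2)) * 32  # cell centers
pts_aligned = pts_second @ R.T + np.array([tx, ty])           # common frame

print(pts_aligned[:2])  # coordinates now comparable with the first grid
```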
Regarding Claim 11, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
wherein the first representation labels portions of the environment as corresponding to a class of a plurality of classes, and wherein determining the one or more discrepancies comprises […] of the first representation by representing portions of the environment labeled as a first subset of the plurality of classes with a first value and representing portions of the environment labeled as a second subset of the plurality of classes with a second value (Subhasis, Par. [0026], “Object classifications determined by sensory component 110 may distinguish between different object types such as, for example, cars, pedestrians, bicycles, delivery trucks, semi-trucks, traffic signs, and/or the like”, & Par. [0059], “For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline”, & Par. [0072], “occupancy may be determined by each It may be determined based at least in part on the likelihood associated with the pixel as determined by the ML model of the respective pipeline, and so on”, thus this limitation is disclosed because Subhasis teaches that discrete portions of the environment are assigned object classifications drawn from multiple classes, such as vehicles, pedestrians, bicycles, and other object types, which corresponds to labeling portions of the environment as belonging to a class of a plurality of classes. Subhasis also teaches that portions of the environment may be represented as occupied or unoccupied based on pipeline outputs and associated likelihood values, which corresponds to converting the multi-class representation into a binary representation by assigning one value to a first subset of classes, such as occupying-type classes, and another value to a second subset of classes, such as non-occupying-type classes)
Subhasis does not explicitly teach generating a binary representation.
However, Wang teaches generating a binary representation (Wang, Par. [0046], “To generate a corresponding sparse projection image for the input training data, a known pattern may be applied to the ground truth projection image to cancel out a subset of pixel values (e.g., setting those pixel values to zero) to simulate unobserved values. For example, frames of real-world data may be collected, a 3D surface structure (e.g., of a 3D road surface) may be estimated from each frame (as described herein), the estimated 3D structure (e.g., a 3D point cloud) may be projected to form a projection image (e.g., a sparse 2D height map), and a corresponding binary map that represents which pixels of projection image are present or observed may be generated. A plurality of binary maps may be generated from real-world data, and one of the binary maps may be randomly chosen and multiplied by a ground truth projection image to generate a corresponding synthetic sparse projection image. As such, a sparse projection image may be generated for each ground truth projection image, and the pairs of synthetic sparse and ground truth projection images may be included in a training dataset”, thus generating a binary representation is disclosed because Wang teaches creating a binary map that assigns one value to a first subset of pixels (e.g., observed or retained pixels) and another value to a second subset of pixels (e.g., canceled or unobserved pixels), which corresponds to assigning one value to a first subset and another value to a second subset in a representation. Wang also explains that these binary maps are used to improve the accuracy and robustness of environmental surface representations for autonomous driving applications. Applying this binary mapping technique to the class-labeled environment regions identified by Subhasis improves clarity and reliability of discrepancy detection and downstream perception processing)
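For illustration of the binary-representation concept only (the class ids and the occupying subset are assumptions), a multi-class label map can be collapsed to a binary map by assigning one value to a first subset of classes and another value to the remainder:

```python
import numpy as np

# Hypothetical sketch: collapse a multi-class label map to a binary map by
# assigning 1 to an assumed occupying-type subset of classes and 0 otherwise.
seg_map = np.random.default_rng(8).integers(0, 6, size=(32, 32))

OCCUPYING = {1, 2, 3}     # e.g., vehicle, pedestrian, bicycle (assumed ids)
binary = np.isin(seg_map, list(OCCUPYING)).astype(np.uint8)  # 1 vs 0

print(np.unique(binary))  # [0 1]
```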
Regarding Claim 12, Subhasis combined with Wang teaches all of the limitations of claim 11 as cited above and Subhasis further teaches:
wherein the first subset of the plurality of classes comprises occupying-type classes, and wherein the second subset of the plurality of classes comprises non-occupying-type classes (Subhasis, Par. [0096], “Classification layer 610 may include one or more sets of convolutional layers or other components for the classification tasks discussed herein. In some examples, the output layer of a classification task may output a tensor (or other data structure) of latitudes, the discrete part of the field being the relevant part of the environment to be classified (e.g., occupied space, object classification, velocity bins, direction bins, height bins). For example, a first set of layers in classification layer 610 may determine whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. Another set of layers in layers 610 may determine whether an environment is associated with an estimated height bin and so”, thus the first subset of the plurality of classes comprising occupying-type classes, and wherein the second subset of the plurality of classes comprising non-occupying-type classes is disclosed because Subhasis teaches that the classification layer determines whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. By distinguishing portions of the environment as occupied space versus unoccupied space, Subhasis separates environment classifications into occupying-type classes and non-occupying-type classes. This corresponds to the first subset comprising occupying-type classes and the second subset comprising non-occupying-type classes)
Regarding Claim 13, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above and Subhasis further teaches:
filtering the one or more discrepancies to remove at least one discrepancy from the one or more discrepancies, the at least one discrepancy satisfying removal criteria comprising one or more of a minimum size, a minimum dimensionality, or a minimum visibility from a point of view of the at least one sensor (Subhasis, Par. [0010], “In some examples, the occupancy grid may extend to a maximum height that may correspond to the height of the autonomous vehicle plus a buffer. In other words, the occupancy grid may indicate the occupancy of a portion of the environment below the maximum height. For example, traffic lights and billboards placed on the road may exceed the maximum height, so the occupancy grid does not indicate that they occupy part of the environment”, & Par. [0072], “Note that object detection may be provided to the ML model as part of the environment representation. For example, environment representation 406 includes various object detections and data not shown such as object velocity, estimated height, etc., as described above. In some examples, environment representations may be aggregated and provided as inputs to the ML model. In some examples, object detection may be separated from the rest of the environment representation and provided as input. For example, the environment representations may be in a common frame of reference, or transformed into a common frame of reference during aggregation. The pipeline may be configured to output positive object detections along with their coordinates in a common frame of reference. For example, these positive object detections may be part of the environmental representation associated with the likelihood of meeting or exceeding a threshold confidence. Each and any of the object detection components described above may be associated with a regressed confidence score. For example, object classification may be associated with confidence scores, ROI may be determined based at least in part on the confidence scores associated with different pixels via non-maximum suppression techniques, and occupancy may be determined by each It may be determined based at least in part on the likelihood associated with the pixel as determined by the ML model of the respective pipeline, and so on”, thus filtering the one or more discrepancies to remove at least one discrepancy satisfying removal criteria comprising one or more of a minimum size, a minimum dimensionality, or a minimum visibility from a point of view of the at least one sensor is disclosed because Subhasis teaches limiting environment representations based on dimensional constraints and confidence thresholds. Subhasis explains that the occupancy grid extends only up to a maximum height corresponding to the vehicle height plus a buffer, such that objects exceeding that dimensional limit are not represented, which corresponds to filtering based on minimum dimensionality. Subhasis also teaches that object detections and occupancy determinations are associated with likelihoods or confidence scores and that positive detections meet or exceed a threshold confidence, which corresponds to filtering based on minimum visibility from the sensor’s point of view. These teachings together correspond to removing discrepancies that fail to satisfy dimensional or visibility-based removal criteria.)
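The filtering step can be illustrated with the hypothetical sketch below (the region attributes and the size threshold are assumed, and the code reflects one reading of the removal criteria): discrepancies that fall below a minimum size or lack visibility from the sensor are removed before the training data set is assembled.

```python
# Hypothetical sketch: discard discrepancies failing removal criteria; here,
# a minimum region size in cells and visibility from the sensor's point of
# view (a minimum-dimensionality check would filter analogously).
discrepancy_regions = [
    {"cells": 1,  "height_m": 0.1, "visible": True},
    {"cells": 40, "height_m": 1.8, "visible": True},
    {"cells": 12, "height_m": 0.6, "visible": False},
]

MIN_CELLS = 4   # assumed minimum-size threshold

kept = [r for r in discrepancy_regions
        if r["cells"] >= MIN_CELLS and r["visible"]]
print(len(kept), "discrepancies retained for the training data set")
```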
Regarding Claim 14, Subhasis teaches a system (Subhasis, Par. [0104], “A system”, thus a system is disclosed) comprising: a data store (Subhasis, Par. [0119], “a memory”, thus a data store is disclosed) storing computer-executable instructions; and a processor (Subhasis, Par. [0104], “one or more processors”, thus a processor is disclosed) configured to execute the computer-executable instructions, wherein execution of the computer-executable instructions causes the system to:
obtain a first representation and a second representation of an environment, wherein the environment is sensed by at least one sensor of an autonomous vehicle, and wherein the first representation and the second representation are generated based on sensor data from the at least one sensor (Subhasis, Par. [0007], “In some examples, one or more sensors of a sensor type are associated with a pipeline (e.g., sequence of operations; steps; networks or layers thereof; machine learning models; analog-to-digital converters; Determine information about objects associated with hardware such as amplifiers, field programmable gate arrays (FPGAs), application specific integrated circuits (ASIC(s), and/or the like) and contained in the associated sensor data can be used to Sensor data may be received from one or more sensors of that type, and a pipeline (sometimes referred to herein as a perceptual pipeline) generates an environmental sensor based at least in part on the sensor data. can generate a representation of For simplicity, the collective output of the pipeline is referred to herein as the environment representation. The environment representation may include one or more object detections and may include one or more output types. For example, video pipeline 302 may output environment representation 308 based at least in part on video data 310 (eg, sensor data including one or more RGB images, thermal images)”, & Par. [0011], “In some examples, aggregated data may additionally or alternatively be data from remote computing devices and/or map data (e.g., road data, drivable surface locations, destinations), for example. , weather data, traffic notifications (e.g. congestion, collisions, lane changes, construction, speed changes), safety notifications (e.g. environmentally hazardous locations, disaster locations, road conditions, visibility conditions), etc. , and/or the like. In some examples, the remote computing device may be another autonomous vehicle, third party service, distributed computing device, remote sensor, and/or the like.”, & Par. [0012], “In some examples, data aggregated from different pipelines may also include at least a portion of the environment representation for one or more previous times. For example, the perceptual pipelines associated with different sensor types can be synchronized to generate environment representations with the same frequency (eg, every 100 ms, 500 ms, 1 second).”, & Par. [0059], “For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object class and so on.”, thus obtaining a first representation and a second representation of an environment sensed by at least one sensor of an autonomous vehicle and generated based on sensor data is disclosed, because Subhasis teaches that sensor data from sensors of an autonomous vehicle are received and processed by multiple perceptual pipelines to generate environment representations. 
Subhasis describes a visual pipeline that processes camera sensor data to generate an environment representation indicating occupancy and object information, and a lidar pipeline that processes lidar sensor data to generate a separate environment representation indicating occupancy and object information for the same environment. The environment representation output by the visual pipeline corresponds to the first representation, the environment representation output by the lidar pipeline corresponds to the second representation, and both representations are generated by the processor from sensor data of the autonomous vehicle and represent the same sensed environment)
determining one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (Subhasis, Par. [0003], “Small discrepancies between the detections determined in relation to the two different sensor types can cause jitter (i.e. "flying") and/or flickering (i.e. appearing and disappearing) in the representation of objects created by the vehicle. Also, some sensor types, such as depth cameras, are prone to large errors in depth measurements, which can further complicate object tracking. This can hamper safe navigation of vehicles and training of machine learning (ML) models. Additionally, techniques for reducing discrepancies and/or techniques for smoothing object representations or data associated therewith may consume computing bandwidth and/or memory.”, & Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation is determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). It may include one probable object detection. For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with future time, a velocity associated with the ROI. , the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or Azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center obtained), and/or the height associated with the ROI (eg, the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique”, thus determining one or more discrepancies between the first representation and the second representation, each discrepancy corresponding to a difference in classification of a portion of the environment, is disclosed, because Subhasis teaches that detections generated from different sensor types and perceptual pipelines may differ, resulting in discrepancies between object representations. Subhasis discloses that each perceptual pipeline produces object detections that include object classification information (e.g., vehicle, pedestrian, bicycle), and that object detections may differ in dimension, location, or even existence between different pipelines. Since object classification is included in each environment representation, a discrepancy between the first representation and the second representation corresponds to a difference in classification of a portion of the environment. 
Accordingly, the processor’s identification of discrepancies between pipeline outputs reads on determining discrepancies between the first and second representations, and differences in object classification between those representations read on the difference in classification of a portion of the environment.)
generate […] for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (Subhasis, Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation is determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). It may include one probable object detection. For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with future time, a velocity associated with the ROI. , the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or Azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center obtained), and/or the height associated with the ROI (eg, the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique.”, thus this limitation is disclosed, because Subhasis teaches that discrepancies between environment representations produced by different perceptual pipelines are localized to specific regions of interest corresponding to portions of the environment, and that those regions of interest are derived from and defined by sensor data processed by the pipelines. Identifying regions of interest where object detections differ necessarily involves isolating the sensor data associated with those portions of the environment, which reads on generating a subset of sensor data reflecting the portion of the environment at which each discrepancy exists.)
Subhasis does not explicitly teach generating a training dataset.
However, Wang teaches generating a training dataset (Wang, Par. [0128], “Generating Training Data from Real-World Sensor Data. In some embodiments, training data may be generated by collecting and annotating real-world sensor data. For example, one or more vehicles may collect frames of sensor data (e.g., image data and LiDAR data) from one or more sensors (e.g., camera(s) and LiDAR sensor(s)) of the vehicle(s) in real-world (e.g., physical) environments”, & Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning model(s) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn.”, thus generating a training data set comprising sensor data used for machine-learning training is disclosed, because Wang teaches collecting sensor data from vehicle-mounted sensors, annotating that sensor data to form training datasets, and using those datasets to update the parameters of one or more machine-learning models. The collected and annotated sensor data in Wang reads on the training data set, and the sensor frames corresponding to regions used for model training correspond to subsets of sensor data reflecting portions of the environment.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system, which identifies and localizes discrepancies between environment representations produced by different perceptual pipelines and reads on detecting discrepancies associated with specific portions of an environment, with Wang’s techniques for generating training datasets from real-world autonomous-vehicle sensor data to train and update machine-learning models, which read on generating training data sets from subsets of sensor data, because Wang provides a framework for collecting and annotating sensor data for machine-learning training that can be applied to the discrepancy-localized regions identified by Subhasis, thereby enabling the generation of training data sets comprising sensor data subsets corresponding to portions of the environment where discrepancies occur. (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose generating, by at least one processor, a training data set comprising, for each discrepancy, a subset of sensor data reflecting the portion of the environment at which the discrepancy exists)
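As a final illustrative sketch (the crop size, discrepancy coordinates, and array shapes are hypothetical), sensor data can be subset at each discrepancy location to assemble a training data set of the kind the combination contemplates:

```python
import numpy as np

# Hypothetical sketch: for each discrepancy cell, crop the subset of sensor
# data covering that portion of the environment and collect the crops into a
# training data set.
rng = np.random.default_rng(9)
sensor_image = rng.random((64, 64, 3))                 # raw sensor data
discrepancies = [(10, 12), (33, 40), (50, 5)]          # assumed cell coords

PATCH = 8   # assumed half-width of the crop around each discrepancy

training_set = []
for i, j in discrepancies:
    i0, j0 = max(i - PATCH, 0), max(j - PATCH, 0)
    patch = sensor_image[i0:i + PATCH, j0:j + PATCH]   # data at discrepancy
    training_set.append(patch)

print(len(training_set), "training examples extracted")
```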
Regarding Claim 15, Subhasis combined with Wang teaches all of the limitations of claim 14 as cited above and Subhasis further teaches:
wherein the first representation is generated by application of a machine learning model to the sensor data (Subhasis, Par. [0010], “In some examples, the techniques discussed herein involve aggregating at least some of the environmental representations associated with different sensor types and applying them to an ML model trained to output inferred object detection. and providing aggregated data as input. In some examples, the aggregated data may be represented in a multi-channel image, and different channels may be associated with different sensor types from which sensory data was generated and/or different types of sensory data. For example, the aggregated data may be a lidar, video, and/or radar occupancy grid {e.g., pixels indicating whether or not the corresponding location in the environment is occupied according to the perceptual pipeline associated with each sensor data type. etc.}, top-down display of ROIs generated in association with lidar, video and/or radar, object classification associated with a portion of the environment, which portion of the environment is occupied It may include probability, yaw of the detected object, and/or the like. See US patent application Ser. No. 16/591,518 relating to occupancy maps, which is hereby incorporated by reference in its entirety. In some examples, the occupancy grid may extend to a maximum height that may correspond to the height of the autonomous vehicle plus a buffer. In other words, the occupancy grid may indicate the occupancy of a portion of the environment below the maximum height. For example, traffic lights and billboards placed on the road may exceed the maximum height, so the occupancy grid does not indicate that they occupy part of the environment”, thus wherein the first representation is generated by application of a machine learning model to the sensor data is disclosed, because Subhasis teaches aggregating sensor data from multiple sensor types and providing that aggregated sensor data as input to a machine-learning model trained to output inferred object detections and environment representations. The ML model generates outputs including object classifications, regions of interest, and associated probabilities, which corresponds to the first representation, while the aggregated lidar, video, and radar data read on the sensor data, such that the first representation is generated by application of a machine-learning model to sensor data sensed by the autonomous vehicle)
Subhasis does not explicitly teach the method further comprising retraining the machine learning model based at least partly on the training data set.
However, Wang teaches the method further comprising retraining the machine learning model based at least partly on the training data set (Wang, Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning model(s) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn”, thus retraining the machine-learning model based at least partly on a training data set is disclosed, because Wang describes updating model parameters through training iterations using collected sensor-derived training data)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system with Wang’s techniques for training and retraining machine-learning models using real-world autonomous-vehicle sensor data, because both references are directed to improving the accuracy and reliability of environment representations used for autonomous driving (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose retraining a machine-learning model used to generate environment representations based at least partly on training data derived from autonomous-vehicle sensor data, such that the machine-learning model applied to sensor data in Subhasis can be updated using Wang’s training techniques. The combination improves the accuracy and robustness of the perception system by using discrepancy-localized sensor data to retrain or refine the machine-learning model, thereby reducing inconsistencies between perceptual pipelines and enhancing the reliability of environment representations used for autonomous vehicle planning and control)
Regarding Claim 17, Subhasis combined with Wang teaches all of the limitations of claim 14 as cited above and Subhasis further teaches:
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (Subhasis, Par. [0072], “In some examples, environment representations may be aggregated and provided as inputs to the ML model. In some examples, object detection may be separated from the rest of the environment representation and provided as input. For example, the environment representations may be in a common frame of reference or transformed into a common frame of reference during aggregation”, thus comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation is disclosed, because Subhasis teaches aligning and aggregating multiple environment representations into a common frame of reference, such that corresponding portions of the environment from different representations spatially correspond to one another. Once the representations are in a common reference frame, individual elements representing the same portion of the environment can be directly compared across representations, which reads on comparing corresponding elements of the first representation and the second representation where a particular element of one representation corresponds to a particular element of the other representation.)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (Subhasis, Par. [0059], “In some examples, the multi-channel data structure 324 may include multi-channel images, where each channel of the image may be processed by different pipelines and/or different types of output (e.g., occupancy maps, occlusion grids, ROIs, object classification, etc.). ). For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object classification and so on”, & Par. [0096], “For example, a first set of layers in classification layer 610 may determine whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. Another set of layers in layers 610 may determine whether an environment is associated with an estimated height bin and so. In some examples, the discrete portion of the set of object classification layers may additionally or alternatively include latitudes associated with each of the object classifications for which exemplary ML architecture 600 was trained. In other words, the classification output head may output a binary indication that a part of the environment is or is not relevant for classification (e.g. height bins, object classification, occupancy), Or the classification output head may output regressed values to which the NMS algorithm can be applied to determine the classification”, thus based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element is disclosed, because Subhasis teaches multi-channel environment representations in which corresponding elements represent the same portion of the environment across different perceptual pipelines and explicitly indicate occupancy status. Subhasis also teaches that a first channel may indicate whether a portion of the environment is occupied or unoccupied as determined by a visual pipeline, while a second channel indicates occupancy for the same portion as determined by a lidar pipeline. Subhasis discloses classification layers that output binary occupancy indications for discrete portions of the environment. Accordingly, when a corresponding element in one representation indicates unoccupied and the corresponding element in another representation indicates occupied, this reads on identifying a difference in classification between the first particular element and the second particular element)
Regarding Claim 18, Subhasis teaches one or more non-transitory computer-readable storage media (Subhasis, Par. [0113], “A non-transitory computer-readable medium”, thus one or more non-transitory computer-readable storage media is disclosed) storing computer- executable instructions that, when executed by a computing system comprising a processor (Subhasis, Par. [0113], “one or more processors”, thus a processor is disclosed), cause the computing system to:
obtain a first representation and a second representation of an environment, wherein the environment is sensed by at least one sensor of an autonomous vehicle, and wherein the first representation and the second representation are generated based on sensor data from the at least one sensor (Subhasis, Par. [0007], “In some examples, one or more sensors of a sensor type are associated with a pipeline (e.g., sequence of operations; steps; networks or layers thereof; machine learning models; analog-to-digital converters; Determine information about objects associated with hardware such as amplifiers, field programmable gate arrays (FPGAs), application specific integrated circuits (ASIC(s), and/or the like) and contained in the associated sensor data can be used to Sensor data may be received from one or more sensors of that type, and a pipeline (sometimes referred to herein as a perceptual pipeline) generates an environmental sensor based at least in part on the sensor data. can generate a representation of For simplicity, the collective output of the pipeline is referred to herein as the environment representation. The environment representation may include one or more object detections and may include one or more output types. For example, video pipeline 302 may output environment representation 308 based at least in part on video data 310 (eg, sensor data including one or more RGB images, thermal images)”, & Par. [0011], “In some examples, aggregated data may additionally or alternatively be data from remote computing devices and/or map data (e.g., road data, drivable surface locations, destinations), for example. , weather data, traffic notifications (e.g. congestion, collisions, lane changes, construction, speed changes), safety notifications (e.g. environmentally hazardous locations, disaster locations, road conditions, visibility conditions), etc. , and/or the like. In some examples, the remote computing device may be another autonomous vehicle, third party service, distributed computing device, remote sensor, and/or the like.”, & Par. [0012], “In some examples, data aggregated from different pipelines may also include at least a portion of the environment representation for one or more previous times. For example, the perceptual pipelines associated with different sensor types can be synchronized to generate environment representations with the same frequency (eg, every 100 ms, 500 ms, 1 second).”, & Par. [0059], “For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels of the environment. may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object class and so on.”, thus obtaining a first representation and a second representation of an environment sensed by at least one sensor of an autonomous vehicle and generated based on sensor data is disclosed, because Subhasis teaches that sensor data from sensors of an autonomous vehicle are received and processed by multiple perceptual pipelines to generate environment representations. 
Subhasis describes a visual pipeline that processes camera sensor data to generate an environment representation indicating occupancy and object information, and a lidar pipeline that processes lidar sensor data to generate a separate environment representation indicating occupancy and object information for the same environment. The environment representation output by the visual pipeline corresponds to the first representation, the environment representation output by the lidar pipeline corresponds to the second representation, and both representations are generated by the processor from sensor data of the autonomous vehicle and represent the same sensed environment)
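A non-limiting illustrative sketch of this two-pipeline arrangement follows (hypothetical Python; all identifiers and mock data are assumptions of the illustration, not disclosures of Subhasis):

import numpy as np

rng = np.random.default_rng(0)

def camera_pipeline(image_data):
    # Stand-in visual pipeline: returns an HxW grid of per-cell class IDs
    # (0 = unoccupied, 1 = occupied/object) derived from camera sensor data.
    return (image_data.mean(axis=-1) > 0.5).astype(int)

def lidar_pipeline(point_grid):
    # Stand-in lidar pipeline over the same spatial grid of the environment.
    return (point_grid > 0.5).astype(int)

image_data = rng.random((64, 64, 3))    # mock camera sensor data
point_grid = rng.random((64, 64))       # mock rasterized lidar returns

first_representation = camera_pipeline(image_data)     # "first representation"
second_representation = lidar_pipeline(point_grid)     # "second representation"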
determine one or more discrepancies between the first representation and the second representation, each discrepancy of the one or more discrepancies corresponding to difference in classification of a portion of the environment as indicated within the respective first representation and second representation (Subhasis, Par. [0003], “Small discrepancies between the detections determined in relation to the two different sensor types can cause jitter (i.e., "flying") and/or flickering (i.e., appearing and disappearing) in the representation of objects created by the vehicle. Also, some sensor types, such as depth cameras, are prone to large errors in depth measurements, which can further complicate object tracking. This can hamper safe navigation of vehicles and training of machine learning (ML) models. Additionally, techniques for reducing discrepancies and/or techniques for smoothing object representations or data associated therewith may consume computing bandwidth and/or memory.”, & Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation may include one probable object detection determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with a future time, a velocity associated with the ROI, the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center), and/or the height associated with the ROI (e.g., the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique”, thus determining one or more discrepancies between the first representation and the second representation, each discrepancy corresponding to a difference in classification of a portion of the environment, is disclosed, because Subhasis teaches that detections generated from different sensor types and perceptual pipelines may differ, resulting in discrepancies between object representations. Subhasis discloses that each perceptual pipeline produces object detections that include object classification information (e.g., vehicle, pedestrian, bicycle), and that object detections may differ in dimension, location, or even existence between different pipelines. Since object classification is included in each environment representation, a discrepancy between the first representation and the second representation corresponds to a difference in classification of a portion of the environment.
Accordingly, the processor’s identification of discrepancies between pipeline outputs reads on determining discrepancies between the first and second representations, and differences in object classification between those representations read on the difference in classification of a portion of the environment.)
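A non-limiting sketch of such an element-wise discrepancy determination (hypothetical Python; the grids and class IDs are assumed for illustration):

import numpy as np

# Hypothetical per-cell class grids for the same environment from two pipelines.
first_representation = np.array([[0, 1], [2, 0]])    # e.g., 1 = vehicle, 2 = pedestrian
second_representation = np.array([[0, 1], [0, 0]])   # lidar pipeline disagrees at (1, 0)

mismatch = first_representation != second_representation
discrepancies = np.argwhere(mismatch)   # [[1 0]]: cell classified differently
print(discrepancies)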
[…] for each discrepancy of the one or more discrepancies, a subset of the sensor data reflecting the portion of the environment at which the discrepancy exists (Subhasis, Par. [0014], “In some examples, the ML model may be trained to output a final environment representation that may include one or more inferred object detections. As noted above, object detection associated with objects may differ in dimension, location, or even existence between different pipelines. The final environment representation may include one probable object detection determined based at least in part on object detections received from different pipelines (e.g., received as input to the ML model as part of the aggregated data). For example, the inferred object detection generated by the ML model may include a ROI that identifies a part of the environment as occupied (e.g., the region associated with the object), a predicted ROI associated with a future time, a velocity associated with the ROI, the object classification associated with the ROI (e.g., vehicle, pedestrian, heavy vehicle, bicycle), the velocity classification of the ROI (e.g., static or dynamic), the orientation associated with the ROI (e.g., yaw), and/or azimuth bins (e.g., 2 bins centered at 0 and 180 degrees; 4 bins centered at 0, 90, 180, and 270 degrees; this output also includes the distance from the bin center), and/or the height associated with the ROI (e.g., the height of the detected object). In some examples, any region of interest may be generated based at least in part on the output of the trust layer, such as following a non-maximum suppression technique.”, thus this limitation is disclosed, because Subhasis teaches that discrepancies between environment representations produced by different perceptual pipelines are localized to specific regions of interest corresponding to portions of the environment, and that those regions of interest are derived from and defined by sensor data processed by the pipelines. Identifying regions of interest where object detections differ necessarily involves isolating the sensor data associated with those portions of the environment, which reads on generating a subset of sensor data reflecting the portion of the environment at which each discrepancy exists.)
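A non-limiting sketch of isolating the sensor-data subset at each discrepant portion (hypothetical Python; the patch size and cell coordinates are assumed):

import numpy as np

sensor_data = np.zeros((64, 64, 3))     # mock camera frame covering the grid
discrepancies = [(10, 12), (40, 33)]    # (row, col) cells where pipelines disagree
HALF = 4                                # half-width of the extracted patch

subsets = []
for r, c in discrepancies:
    r0, r1 = max(r - HALF, 0), min(r + HALF + 1, sensor_data.shape[0])
    c0, c1 = max(c - HALF, 0), min(c + HALF + 1, sensor_data.shape[1])
    subsets.append(sensor_data[r0:r1, c0:c1])   # sensor data reflecting that portion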
Subhasis does not explicitly teach generating a training data set.
However, Wang teaches generating a training data set (Wang, Par. [0128], “Generating Training Data from Real-World Sensor Data. In some embodiments, training data may be generated by collecting and annotating real-world sensor data. For example, one or more vehicles may collect frames of sensor data (e.g., image data and LiDAR data) from one or more sensors (e.g., camera(s) and LiDAR sensor(s)) of the vehicle(s) in real-world (e.g., physical) environments”, & Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning model(s) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn.”, thus generating a training data set comprising sensor data used for machine-learning training is disclosed, because Wang teaches collecting sensor data from vehicle mounted sensors, annotating that sensor data to form training datasets, and using those datasets to update the parameters of one or more machine-learning models. The collected and annotated sensor data in Wang reads on the training data set, and the sensor frames corresponding to regions used for model training correspond to subsets of sensor data reflecting portions of the environment.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system, which identifies and localizes discrepancies between environment representations produced by different perceptual pipelines and reads on detecting discrepancies associated with specific portions of an environment, with Wang’s techniques for generating training datasets from real-world autonomous-vehicle sensor data to train and update machine-learning models, which read on generating training data sets from subsets of sensor data, because Wang provides a framework for collecting and annotating sensor data for machine-learning training that can be applied to the discrepancy-localized regions identified by Subhasis, thereby enabling the generation of training data sets comprising sensor data subsets corresponding to portions of the environment where discrepancies occur. (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose generating, by at least one processor, a training data set comprising, for each discrepancy, a subset of sensor data reflecting the portion of the environment at which the discrepancy exists)
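A non-limiting sketch of the combined teaching, assembling one training-data entry per discrepancy (hypothetical Python; the entry fields are assumed):

import numpy as np

discrepancies = [(10, 12), (40, 33)]                    # hypothetical discrepancy cells
subsets = [np.zeros((9, 9, 3)), np.zeros((9, 9, 3))]    # mock sensor-data patches

training_data_set = [
    {"cell": cell, "sensor_data": patch, "label": None}  # label supplied by annotation
    for cell, patch in zip(discrepancies, subsets)
]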
Regarding Claim 19, Subhasis combined with Wang teaches all of the limitations of claim 18 as cited above, and Subhasis further teaches:
wherein the first representation is generated by application of a machine learning model to the sensor data (Subhasis, Par. [0010], “In some examples, the techniques discussed herein involve aggregating at least some of the environment representations associated with different sensor types and providing the aggregated data as input to an ML model trained to output inferred object detection. In some examples, the aggregated data may be represented in a multi-channel image, and different channels may be associated with different sensor types from which sensory data was generated and/or different types of sensory data. For example, the aggregated data may be a lidar, video, and/or radar occupancy grid (e.g., pixels indicating whether or not the corresponding location in the environment is occupied according to the perceptual pipeline associated with each sensor data type, etc.), a top-down display of ROIs generated in association with lidar, video and/or radar, an object classification associated with a portion of the environment, a probability that a portion of the environment is occupied, a yaw of the detected object, and/or the like. See US patent application Ser. No. 16/591,518 relating to occupancy maps, which is hereby incorporated by reference in its entirety. In some examples, the occupancy grid may extend to a maximum height that may correspond to the height of the autonomous vehicle plus a buffer. In other words, the occupancy grid may indicate the occupancy of a portion of the environment below the maximum height. For example, traffic lights and billboards placed on the road may exceed the maximum height, so the occupancy grid does not indicate that they occupy part of the environment”, thus wherein the first representation is generated by application of a machine learning model to the sensor data is disclosed, because Subhasis teaches aggregating sensor data from multiple sensor types and providing that aggregated sensor data as input to a machine-learning model trained to output inferred object detections and environment representations. The ML model generates outputs including object classifications, regions of interest, and associated probabilities, which correspond to the first representation, while the aggregated lidar, video, and radar data read on the sensor data, such that the first representation is generated by application of a machine-learning model to sensor data sensed by the autonomous vehicle)
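A non-limiting sketch of aggregating pipeline outputs into a multi-channel image for ML-model input (hypothetical Python; grid sizes and channel assignments are assumed):

import numpy as np

vision_occupancy = np.zeros((64, 64))   # channel 0: vision pipeline occupancy
lidar_occupancy = np.zeros((64, 64))    # channel 1: lidar pipeline occupancy
object_class = np.zeros((64, 64))       # channel 2: per-cell object class IDs

# Channels-first multi-channel "image" suitable as input to an ML model.
aggregated = np.stack([vision_occupancy, lidar_occupancy, object_class], axis=0)
assert aggregated.shape == (3, 64, 64)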
Subhasis does not explicitly teach the method further comprising retraining the machine learning model based at least partly on the training data set.
However, Wang teaches the method further comprising retraining the machine learning model based at least partly on the training data set (Wang, Par. [0141], “Generally, any suitable loss function may be used to update the deep learning model(s) during training. For example, one or more loss functions may be used (e.g., a regression loss function such as L1 or L2 loss may be used for regression tasks) to compare the accuracy of the output(s) of the deep learning model(s) to ground truth, and the parameters of the deep learning model(s) may be updated (e.g., using backward passes, backpropagation, forward passes, etc.) until the accuracy reaches an optimal or acceptable level. In some embodiments in which the deep learning model(s) includes multiple heads, the multiple heads may be co-trained together on the same dataset, with a common trunk. In this manner, the different heads (tasks) may help each other to learn”, thus retraining the machine-learning model based at least partly on a training data set is disclosed, because Wang describes updating model parameters through training iterations using collected, sensor-derived training data)
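A non-limiting sketch of such a retraining loop, assuming PyTorch as the training framework (the model, data, and hyperparameters are hypothetical):

import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in perception model
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
loss_fn = nn.MSELoss()                       # L2-style regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(4, 3, 64, 64)           # mock sensor-data subsets
targets = torch.randn(4, 1, 64, 64)          # mock annotated ground truth

for _ in range(10):                          # iterate until accuracy is acceptable
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)   # compare model output to ground truth
    loss.backward()                          # backward pass
    optimizer.step()                         # update model parameters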
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s multi-sensor perception system with Wang’s techniques for training and retraining machine-learning models using real-world autonomous-vehicle sensor data, because both references are directed to improving the accuracy and reliability of environment representations used for autonomous driving (Wang, Par. [0051], “As such, the techniques described herein may be used to observe and reconstruct a 3D surface such as a 3D road surface, and a representation of the 3D surface structure (and/or corresponding confidence values) may be provided to an autonomous vehicle drive stack to enable safe and comfortable planning and control of the autonomous vehicle. Generally, the techniques described herein may generate a more accurate representation of road surfaces than prior reconstruction techniques. Furthermore, the present techniques may be used to generate a representation of road surfaces with sufficient accuracy and range for certain autonomous driving applications, unlike prior based reconstruction techniques. As such, the representation of road surfaces generated using the present techniques may enable improved navigation, safety, and comfort in autonomous driving. For example, an autonomous vehicle may be better equipped to adapt the vehicle's suspension system to match the current road surface (e.g., by compensating for bumps in the road), to navigate the vehicle to avoid protuberances (e.g., dips, holes) in the road, and/or to apply an early acceleration or deceleration based on an approaching surface slope in the road. Any of these functions may serve to enhance safety, improve the longevity of the vehicle, improve energy-efficiency, and/or provide a smooth driving experience.”, thus the combined teachings of Subhasis and Wang disclose retraining a machine-learning model used to generate environment representations based at least partly on training data derived from autonomous-vehicle sensor data, such that the machine-learning model applied to sensor data in Subhasis can be updated using Wang’s training techniques. The combination improves the accuracy and robustness of the perception system by using discrepancy-localized sensor data to retrain or refine the machine-learning model, thereby reducing inconsistencies between perceptual pipelines and enhancing the reliability of environment representations used for autonomous vehicle planning and control)
Regarding Claim 20, Subhasis combined with Wang teaches all of the limitations of claim 18 as cited above and Subhasis further teaches:
comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation (Subhasis, Par. [0072], “In some examples, environment representations may be aggregated and provided as inputs to the ML model. In some examples, object detection may be separated from the rest of the environment representation and provided as input. For example, the environment representations may be in a common frame of reference or transformed into a common frame of reference during aggregation”, thus comparing corresponding elements of the first representation and the second representation, wherein a first particular element of the first representation corresponds to a second particular element of the second representation is disclosed, because Subhasis teaches aligning and aggregating multiple environment representations into a common frame of reference, such that corresponding portions of the environment from different representations spatially correspond to one another. Once the representations are in a common reference frame, individual elements representing the same portion of the environment can be directly compared across representations, which reads on comparing corresponding elements of the first representation and the second representation where a particular element of one representation corresponds to a particular element of the other representation.)
based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element (Subhasis, Par. [0059], “In some examples, the multi-channel data structure 324 may include multi-channel images, where each channel of the image may be associated with different pipelines and/or different types of output (e.g., occupancy maps, occlusion grids, ROIs, object classification, etc.). For example, a first channel of the image may contain pixels indicating whether the respective portion of the environment is occupied/unoccupied as determined by the visual pipeline, and a second channel of the image may contain pixels indicating whether each portion of the environment is occupied/unoccupied as determined by the lidar pipeline, and a third channel indicates whether each portion of the environment is associated with a certain object classification and so on”, & Par. [0096], “For example, a first set of layers in classification layer 610 may determine whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. Another set of layers in layers 610 may determine whether an environment is associated with an estimated height bin and so on. In some examples, the discrete portion of the set of object classification layers may additionally or alternatively include latitudes associated with each of the object classifications for which exemplary ML architecture 600 was trained. In other words, the classification output head may output a binary indication that a part of the environment is or is not relevant for classification (e.g., height bins, object classification, occupancy), or the classification output head may output regressed values to which the NMS algorithm can be applied to determine the classification”, thus based on a determination that the first particular element indicates the first particular element is not occupied and the second particular element indicates that the second element is occupied, identifying a difference in classification between the first particular element and the second particular element is disclosed, because Subhasis teaches multi-channel environment representations in which corresponding elements represent the same portion of the environment across different perceptual pipelines and explicitly indicate occupancy status. Subhasis also teaches that a first channel may indicate whether a portion of the environment is occupied or unoccupied as determined by a visual pipeline, while a second channel indicates occupancy for the same portion as determined by a lidar pipeline. Subhasis discloses classification layers that output binary occupancy indications for discrete portions of the environment. Accordingly, when a corresponding element in one representation indicates unoccupied and the corresponding element in another representation indicates occupied, this reads on identifying a difference in classification between the first particular element and the second particular element)
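A non-limiting sketch of the mapped occupied/unoccupied comparison (hypothetical Python; the channel contents are assumed):

import numpy as np

vision_channel = np.array([[0, 1], [0, 0]])   # occupancy per the visual pipeline
lidar_channel = np.array([[0, 1], [1, 0]])    # occupancy per the lidar pipeline

# Cells the first representation marks unoccupied but the second marks occupied.
diff = (vision_channel == 0) & (lidar_channel == 1)
print(np.argwhere(diff))                      # [[1 0]]: difference in classification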
Claims 5, 6, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Subhasis et al. (hereinafter Subhasis) (JP 2022554184), in view of Wang et al. (hereinafter Wang) (US 20230139772), and further in view of Perez et al. (hereinafter Perez), a non-patent literature reference titled “Cluster-Based Active Learning”.
Regarding Claim 5, Subhasis combined with Wang teaches all of the limitations of claim 1 as cited above.
Subhasis combined with Wang does not explicitly teach clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set.
However, Perez teaches clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set (Perez, Page 1 – Section 1, “Instead of annotating single images, experts may annotate clusters that have class consistency (i.e., the vast majority of samples belong to the same class), greatly reducing the required number of human interactions to train a model”, & Page 2 – Section 3, “The cluster-based active learning framework consists of adding the clustering and cluster annotation steps into the common pool-based active learning framework. During the annotation step, it can ask the expert to annotate clusters, single samples based on some acquisition criteria (e.g., the most uncertain samples), or both”, thus clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set is disclosed, because Perez teaches grouping data samples into clusters based on class consistency and using each cluster as a unit for annotation and learning. Perez teaches annotating clusters rather than individual samples, where each cluster represents a group of samples corresponding to a common class and serves as a discrete entry used during training. These teachings read on clustering discrepancies into discrepancy groups and treating each group as a corresponding entry in a training data set)
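A non-limiting sketch of clustering discrepancies into groups, assuming scikit-learn's KMeans as a stand-in clustering method (the feature vectors are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# One feature vector per discrepancy (e.g., location plus class scores).
features = np.random.default_rng(0).random((30, 4))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Each group of discrepancy indices would correspond to one data-set entry.
discrepancy_groups = [np.flatnonzero(labels == k) for k in range(3)]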
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s identification of discrepancies between environment representations with Perez’s cluster-based active learning techniques, because Perez teaches that clustering data samples improves training efficiency and robustness while reducing annotation burden (Perez, Page 4 – Section 6, “We introduced the cluster-based active learning framework and demonstrated that it can reduce the number of human interactions needed to train a CNN for image classification. Furthermore, the framework can still be improved to achieve better results: training techniques that are more robust to label noise; better feature extraction and clustering methods; better training conditions”, thus applying Perez’s clustering techniques to the discrepancy regions identified by Subhasis enables organizing discrepancies into groups that serve as entries in a training data set, thereby improving the efficiency and effectiveness of training machine-learning models for autonomous vehicle perception)
Regarding Claim 6, Subhasis combined with Wang and Perez teaches all of the limitations of claim 5 as cited above and Perez further teaches:
wherein generating the training data set further comprises programmatically labeling (Perez, Page 1 – Section 1, “Instead of annotating single images, experts may annotate clusters that have class consistency (i.e., the vast majority of samples belong to the same class), greatly reducing the required number of human interactions to train a model”, & Page 3 – Section 4, “To simulate experts annotating clusters, we automatically assigned a label to a cluster only if the modal class in the cluster corresponds to at least 80% of the samples in it. Otherwise, the cluster is not annotated”, thus programmatically labeling is disclosed, because Perez teaches automatically assigning labels to clusters based on class consistency of the samples within each cluster, without requiring manual labeling of individual data samples)
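A non-limiting sketch of Perez's 80% modal-class rule for programmatic labeling (hypothetical Python; the class names are assumed):

from collections import Counter

def label_cluster(sample_classes, threshold=0.8):
    # Auto-assign the modal class only if it covers >= threshold of the cluster.
    modal_class, count = Counter(sample_classes).most_common(1)[0]
    return modal_class if count / len(sample_classes) >= threshold else None

print(label_cluster(["vehicle"] * 9 + ["pedestrian"]))        # "vehicle" (90%)
print(label_cluster(["vehicle"] * 6 + ["pedestrian"] * 4))    # None (only 60%)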
Perez does not explicitly teach each of the one or more discrepancy groups based on a classification, within at least one of the first or second representations, of the portion of the environment at which the discrepancy exists.
However, Subhasis teaches each of the one or more discrepancy groups based on a classification, within at least one of the first or second representations, of the portion of the environment at which the discrepancy exists (Subhasis, Par. [0096], “Classification layer 610 may include one or more sets of convolutional layers or other components for the classification tasks discussed herein. In some examples, the output layer of a classification task may output a tensor (or other data structure) of latitudes, the discrete part of the field being the relevant part of the environment to be classified (e.g., occupied space, object classification, velocity bins, direction bins, height bins). For example, a first set of layers in classification layer 610 may determine whether each portion of the environment is occupied or unoccupied and/or associated with a respective object classification. Another set of layers in layers 610 may determine whether an environment is associated with an estimated height bin and so on. In some examples, the discrete portion of the set of object classification layers may additionally or alternatively include latitudes associated with each of the object classifications for which exemplary ML architecture 600 was trained. In other words, the classification output head may output a binary indication that a part of the environment is or is not relevant for classification (e.g., height bins, object classification, occupancy), or the classification output head may output regressed values to which the NMS algorithm can be applied to determine the classification”, thus grouping and labeling discrepancies based on classifications within environment representations is disclosed, because Subhasis teaches generating classification outputs that assign occupancy, object class, and related attributes to discrete portions of the environment, which provides the classification basis used to form discrepancy groups corresponding to portions of the environment where discrepancies exist)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s identification of discrepancies between environment representations with Perez’s cluster-based active learning techniques, because Subhasis identifies and localizes portions of the environment where classification outputs differ across perceptual pipelines, while Perez teaches organizing data samples into clusters and programmatically labeling those clusters to improve training efficiency and reduce annotation effort (Perez, Page 4 – Section 6, “We introduced the cluster-based active learning framework and demonstrated that it can reduce the number of human interactions needed to train a CNN for image classification. Furthermore, the framework can still be improved to achieve better results: training techniques that are more robust to label noise; better feature extraction and clustering methods; better training conditions”, thus applying Perez’s clustering and labeling techniques to the discrepancy localized regions produced by Subhasis allows discrepancies associated with similar classification characteristics to be grouped into discrepancy groups and treated as discrete entries in a training data set, thereby improving the efficiency and robustness of training machine-learning models used for autonomous vehicle perception)
Regarding Claim 16, Subhasis combined with Wang teaches all of the limitations of claim 14 as cited above.
Subhasis combined with Wang does not explicitly teach clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set.
However, Perez teaches clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set (Perez, Page 1 – Section 1, “Instead of annotating single images, experts may annotate clusters that have class consistency (i.e., the vast majority of samples belong to the same class), greatly reducing the required number of human interactions to train a model”, & Page 2 – Section 3, “The cluster-based active learning framework consists of adding the clustering and cluster annotation steps into the common pool-based active learning framework. During the annotation step, it can ask the expert to annotate clusters, single samples based on some acquisition criteria (e.g., the most uncertain samples), or both”, thus clustering […] into one or more discrepancy groups, each discrepancy group corresponding to an entry in the training data set is disclosed, because Perez teaches grouping data samples into clusters based on class consistency and using each cluster as a unit for annotation and learning. Perez teaches annotating clusters rather than individual samples, where each cluster represents a group of samples corresponding to a common class and serves as a discrete entry used during training. These teachings read on clustering discrepancies into discrepancy groups and treating each group as a corresponding entry in a training data set)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Subhasis’s identification of discrepancies between environment representations with Perez’s cluster-based active learning techniques, because Perez teaches that clustering data samples improves training efficiency and robustness while reducing annotation burden (Perez, Page 4 – Section 6, “We introduced the cluster-based active learning framework and demonstrated that it can reduce the number of human interactions needed to train a CNN for image classification. Furthermore, the framework can still be improved to achieve better results: training techniques that are more robust to label noise; better feature extraction and clustering methods; better training conditions”, thus applying Perez’s clustering techniques to the discrepancy regions identified by Subhasis enables organizing discrepancies into groups that serve as entries in a training data set, thereby improving the efficiency and effectiveness of training machine-learning models for autonomous vehicle perception)
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. US2023059924A1 is pertinent because it teaches automatically selecting and curating training data for neural networks based on metadata distributions and target criteria, including sampling scenes from large sets of unlabeled sensor data to approximate a desired distribution for training. The reference further teaches organizing scenes using metadata such as operational design domain (ODD) values, grouping data into buckets, and selecting subsets of data under a labeling budget to improve training efficiency and model performance. Because the applicant’s disclosure likewise concerns generating, selecting, or refining training data derived from sensor-based environment representations to improve machine-learning models in autonomous vehicle contexts, this reference is relevant to the claimed invention as it addresses structured selection, curation, and optimization of neural network training datasets.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAHLIET ADMASU whose telephone number is (571)272-0034. The examiner can normally be reached Mon-Fri, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached at (571)270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.T.A./
Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123