DETAILED ACTION
Notice of Pre-AIA or AIA Status
Claims 1-20 are pending in this application.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
Claim 20 is drawn to functional descriptive material recorded on a “computer readable storage medium having a computer program stored thereon”. Normally, the claim would be statutory. However, the broadest reasonable interpretation of a claim drawn to a “computer readable storage medium” in light of the written disclosure in paragraphs [0115]-[0116] typically covers forms of non-transitory tangible media as well as transitory propagating signals per se, making the recited claim language directed towards non-statutory subject matter such as a “signal”.
“A transitory, propagating signal … is not a ‘process, machine, manufacture, or composition of matter.’ Those four categories define the explicit scope and reach of subject matter patentable under 35 U.S.C. § 101; thus, such a signal cannot be patentable subject matter.” (In re Nuijten, 84 USPQ2d 1495 (Fed. Cir. 2007)).
Because the full scope of the claim as properly read in light of the disclosure appears to encompass non-statutory subject matter (i.e., because the specification is silent as to the exact embodiment of a computer readable medium, the term is given its ordinary and customary meaning, which covers both non-transitory media and transitory propagating signals, etc.), the claim as a whole is non-statutory. In view of the USPTO's Interim Examination Instructions for Evaluating Subject Matter Eligibility under 35 U.S.C. 101 (the "Guidelines"), and the Official Gazette Notice (1351 OG 212, made available February 23, 2010), the examiner suggests amending the claim to include the limitation "non-transitory" in order to exclude any non-statutory subject matter. Any amendment to the claim should be commensurate with its corresponding disclosure.
35 U.S.C. § 112 Sixth Paragraph - Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “processing unit” and “model” in claims 16-19.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mounsaveng et al. (US PGPub US20210241041A1), hereinafter referred to as “Mounsaveng”, in view of Graber et al. (US PGPub US20220319016A1), hereinafter referred to as “Graber”.
Consider Claims 1, 16 and 20.
Mounsaveng teaches:
1. A method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising: / 16. An electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit which, when executed by the at least one processing unit, cause the electronic device to perform a method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising:/ 20. A computer readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to perform a method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising: (Mounsaveng: abstract, A method and a system for joint data augmentation and classification learning, where an augmentation network learns to perform transformations and a classification network is trained. A set of labelled images is received. During an inner loop iteration, an augmentation network applies a transformation on a given labelled image of the set to obtain a transformed image. The classification network classifies the transformed image to obtain a predicted class, and a training loss is determined based on the predicted class and the respective label. The parameters of the classification network are updated based on the classification loss. During an outer loop iteration, the classification network classifies another labelled image of the set to obtain another predicted class, and a validation loss is determined based on the other predicted class and the respective label. The parameters of the augmentation network are updated based on the validation loss. [0085] Training Server [0086] The training server 210 is configured to: (i) access a set of machine learning algorithms (MLAs) 240; and (ii) train the set of MLAs 240. [0087] How the training server 210 is configured to do so will be explained in more detail herein below. [0088]-[0089], [0090] Machine Learning Algorithms (MLAs) [0091] The training server 210 has access to the set of MLAs 240. [0092] The set of MLAs 240 includes inter alia a classification network 250, and an augmentation network 270. [0093] The classification network 250 is configured to classify digital documents based on features thereof.)
1. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; / 16. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; / 20. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; (Mounsaveng: [0093] The classification network 250 is configured to classify digital documents based on features thereof. [0094] In one or more embodiments, the classification network 250 is configured to classify digital documents in the form of digital images. It is contemplated that the classification network 250 may modified and used to classify documents including text, images, and sound or a combination thereof. [0128] Data Augmentation Learning Procedure [0129] With reference to FIG. 3 there is shown a schematic diagram of a data augmentation learning procedure 300 in accordance with one or more non-limiting embodiments of the present technology. [0130] The data augmentation learning procedure 300 is executed by the training server 210. It will be appreciated that the data augmentation learning procedure 300 may be executed by another electronic device comprising a processor such as the processor 110 or the GPU 111. In one or more other embodiments, the data augmentation learning procedure 300 is executed in a distributed manner. [0174] In one or more embodiments, the classification network 250 extracts a set of image features from the transformed training image 322 to perform classification and output a class prediction 332. The classification network 250 performs the classification of the according to current values of the set of classification parameters 255, i.e. values at the current training loop iteration 305 (which have been determined at the previous loop iteration). [0175] The classification network 250 outputs a class prediction 332 for the transformed training image 322.)
1. and a classification learning model for describing an association between the feature of the at least one candidate object and a classification of the at least one candidate object, the method comprising: / 16. and a classification learning model for describing an association between the feature of the at least one candidate object, the method comprising: / 20. and a classification learning model for describing an association between the feature of the at least one candidate object and a classification of the at least one candidate object, the method comprising: (Mounsaveng: [0128] Data Augmentation Learning Procedure [0129] With reference to FIG. 3 there is shown a schematic diagram of a data augmentation learning procedure 300 in accordance with one or more non-limiting embodiments of the present technology. [0130] The data augmentation learning procedure 300 is executed by the training server 210. It will be appreciated that the data augmentation learning procedure 300 may be executed by another electronic device comprising a processor such as the processor 110 or the GPU 111. In one or more other embodiments, the data augmentation learning procedure 300 is executed in a distributed manner. [0131] The data augmentation learning procedure 300 is an online learning procedure, where parameters of the models, i.e. of the classification network 250 and the augmentation network 270 are updated jointly by performing training iterations and validation iterations. [0132] Online Bilevel Optimization [0133] The purpose of the data augmentation learning procedure 300 is to automatically learn data augmentation transformations (i.e. find optimal values for the set of transformation parameters 275 of the augmentation network 270) that generalize well on unseen data and which also maximize performance of a network (i.e. find optimal values for the set of classification parameters 255 of the classification network 250). [0134] The data augmentation learning procedure 300 aims to learn to solve a bilevel optimization problem in which data augmentation transformations parametrized by optimal values of the set of transformation parameters 275 represented by θ* that minimize the loss on the validation data Xval given optimal values of the set of classification parameters 265 represented by W* learned on the training data Xtr, which is expressed by equations (1) and (2):
θ* = argmin_θ ℒ(X_val, W*)    (1)
subject to W* = argmin_W ℒ(θ(X_tr), W)    (2)
)
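For illustration of the quoted bilevel procedure, the following is a minimal sketch of equations (1) and (2) with a one-step (truncated) inner optimization. All tensor shapes, learning rates, the additive "augmentation" parameter θ, and the linear stand-in classifier are assumptions for exposition only, not the implementation of Mounsaveng or of the instant application:

```python
import torch

# Hypothetical toy data: 32 training and 32 validation samples, 10 features, 3 classes.
torch.manual_seed(0)
x_tr, y_tr = torch.randn(32, 10), torch.randint(0, 3, (32,))
x_val, y_val = torch.randn(32, 10), torch.randint(0, 3, (32,))

W = torch.randn(10, 3, requires_grad=True)     # classification parameters (W)
theta = torch.zeros(10, requires_grad=True)    # transformation parameters (theta)
loss_fn = torch.nn.functional.cross_entropy
lr_w, lr_theta = 0.1, 0.01

for _ in range(100):
    # Inner problem (eq. 2): one descent step on W over augmented training data.
    train_loss = loss_fn((x_tr + theta) @ W, y_tr)     # L(theta(X_tr), W)
    g_w, = torch.autograd.grad(train_loss, W, create_graph=True)
    W_star = W - lr_w * g_w                            # truncated estimate of W*

    # Outer problem (eq. 1): validation loss through W_star, gradient w.r.t. theta.
    val_loss = loss_fn(x_val @ W_star, y_val)          # L(X_val, W*)
    g_theta, = torch.autograd.grad(val_loss, theta)
    with torch.no_grad():
        theta -= lr_theta * g_theta
        W.copy_(W_star.detach())
```

The create_graph=True flag is what lets the validation gradient flow through the inner update, mirroring the sharing of W* between the training and validation losses described in [0141].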
1. determining an update parameter associated with the classification learning model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; / 16. determining an update parameter associated with the classification learning model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; / 20. determining an update parameter associated with the classification learning model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; (Mounsaveng: [0137] Generally speaking, gradient descent is used to optimize parameters of a network. However, in this case the transformations that need to be optimized on a validation data are applied only on the training data, where first order approximation would not work. The purpose of data augmentation is to introduce transformations during the training phase that can make the model invariant or partially invariant to any transformations that can occur during the validation phase. If the transformations are applied on the validation data, the parameters learned by the model will select the transformation parameters that make the data easier to model independently of the data distribution. [0138] To obtain the right validation loss, the classification network 250 should be trained until convergence, and then the training must be unrolled back to back propagate the gradient of the set of transformation parameters 275 of the augmentation network 270. However, as this process is time consuming, due to the amount of epochs needed until converge, and memory consuming, as all intermediate steps of the training need to be stored, truncated back propagation is performed. [0139] Truncated back propagation enables obtaining an estimation of the state of the classification network 250 at convergence and the right validation loss by applying one step of gradient descent on the classification network 250, instead of a plurality of iterations (e.g. hundreds) as with gradient descent. [0140] Thus, the data augmentation learning procedure 300 enables approximating the bilevel optimization problem in the case of a differentiable augmentation network 270 parametrized by the set of transformation parameters 275 by performing truncated back propagation. [0141]-[0144])
1. updating the classification learning model based on the update parameter associated with the classification learning model; / 16. updating the classification learning model based on the update parameter associated with the classification learning model; / 20. updating the classification learning model based on the update parameter associated with the classification learning model; (Mounsaveng: [0141] The bilevel optimization problem of equations (1) and (2) may be solved by iteratively solving equation (2) and finding the optimal set of transformation parameters 275 represented by θ. The set of classification parameters 265 represented by W are shared between the training data and validation data, i.e. the classification network 250, and the chain rule can be used to differentiate the validation loss (Xval, W*) with respect to the set of transformation parameters 275 represented by θ. Gradient information is exploited due to optimal values of the set of classification parameters 265 represented by W* being shared between the validation loss and the training loss. [0142] The gradient of the validation loss with respect to the set of transformation parameters 275 represented by θ is expressed as equations (3-5):
∂ℒ(X_val, W*)/∂θ = (∂ℒ(X_val, W*)/∂W*) · (∂W*(θ(X_tr))/∂θ)    (3)-(5)
)
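The chain rule of equations (3)-(5) can likewise be sketched; autograd evaluates the product (∂ℒ_val/∂W*)·(∂W*/∂θ) directly as a vector-Jacobian product rather than forming ∂W*/∂θ explicitly. The scalar toy losses below are assumptions standing in for the training and validation losses:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)   # stands in for transformation parameters
W = torch.randn(4, requires_grad=True)       # stands in for classification parameters

train_loss = ((W * theta).sum() - 1.0) ** 2          # stands in for L(theta(X_tr), W)
g_w, = torch.autograd.grad(train_loss, W, create_graph=True)
W_star = W - 0.1 * g_w                               # one-step truncated estimate of W*

val_loss = (W_star.sum() - 2.0) ** 2                 # stands in for L(X_val, W*)
g_theta, = torch.autograd.grad(val_loss, theta)      # chain rule through W_star
print(g_theta.shape)                                 # torch.Size([4])
```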
1. and preventing the feature extraction model from being updated with the update parameter associated with the classification learning model. / 16. and preventing the feature extraction model from being updated with the update parameter associated with the classification learning model. / 20. and preventing the feature extraction model from being updated with the update parameter associated with the classification learning model. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
∂W*/∂θ,
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270.)
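The "preventing the feature extraction model from being updated" limitation corresponds to a standard stop-gradient arrangement; the following is a minimal sketch only (module names and shapes are assumptions, not the architecture of either reference):

```python
import torch
import torch.nn as nn

# Hypothetical shared extractor and classification head.
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
clf_head = nn.Linear(8 * 32 * 32, 2)    # e.g., a foreground/background classifier

x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 2, (4,))

feats = extractor(x).detach()            # gradient flow stops here
loss = nn.functional.cross_entropy(clf_head(feats), y)
loss.backward()

assert all(p.grad is None for p in extractor.parameters())     # extractor untouched
assert all(p.grad is not None for p in clf_head.parameters())  # head receives updates
```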
However, Mounsaveng does not explicitly teach:
a classification scoring model, and a classification score, the classification score representing a probability that the at least one candidate object is classified as foreground in the image
Graber teaches:
1. A method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising: / 16. An electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit which, when executed by the at least one processing unit, cause the electronic device to perform a method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising:/ 20. A computer readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to perform a method for processing an image using a machine learning model, the machine learning model being used for identifying at least one candidate object from an image, and the machine learning model comprising: (Graber: abstract, Panoptic segmentation forecasting predicts future positions of foreground objects and background objects separately. An egomotion model may be implemented to estimate egomotion of the camera. Pixels in frames of captured video are classified between foreground and background. The foreground pixels are grouped into foreground objects. A foreground motion model forecasts motion of the foreground objects to a future timestamp. A background motion model backprojects the background pixels into point clouds in a three-dimensional space. The background motion model predicts future positions of the point clouds based on egomotion. The background motion model may further generate novel point clouds to fill in occluded space. With the predicted future positions, the foreground objects and the background pixels are combined into a single panoptic segmentation forecast. An augmented reality mobile game may utilize the panoptic segmentation forecast to accurately portray movement of virtual elements in relation to the real-world environment. [0024]-[0034], Figures 1-3, [0024] Referring back FIG. 1, the networked computing environment 100 uses a client-server architecture, where a game server 120 communicates with a client device 110 over a network 105 to provide a parallel reality game to players at the client device 110. The networked computing environment 100 also may include other external systems such as sponsor/advertiser systems or business systems. Although only one client device 110 is illustrated in FIG. 1, any number of clients 110 or other external systems may be connected to the game server 120 over the network 105. Furthermore, the networked computing environment 100 may contain different or additional elements and functionality may be distributed between the client device 110 and the server 120 in a different manner than described below.)
1. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; / 16. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; / 20. a feature extraction model for describing an association between the image and a feature of the at least one candidate object; (Graber: [0037] The foreground motion model 420 forecasts motion of foreground pixels in the input frames. In accordance with one or more embodiments, the foreground motion model 420 includes an object tracking model 422, an object motion encoder 424, and an object motion decoder 426. The object tracking model 422 tracks a position of each foreground in each frame captured. The object motion encoder 424 inputs the captured frames and outputs abstract features relating to predicted motion of each foreground object. The object motion decoder 426 inputs the abstract features and outputs predicted a future position for each foreground object, e.g., at a subsequent time from the input frames. [0038] The object tracking model 422 tracks movement of foreground objects over time. The object tracking model 422 may implement machine learning algorithms, e.g., DeepSort. As the foreground motion model 420 (and its various components) predict positions and/or motion of the foreground objects, the object tracking model 422 may track the foreground objects in different input frames. As additional image data is captured, the object tracking model 422 may further track the position of the foreground objects based on the additional image data. In some embodiments, the object tracking model 422 may score a predicted position of a foreground object predicted by the foreground motion model 420 against the actual position of the foreground object in subsequently captured image data. The score may be utilized by the foreground motion model 420 to further refined the foreground motion model 420. [0039] The object motion encoder 424 inputs frames including a foreground object identified by the object tracking model 422 and outputs abstract features relating to predicted motion for that foreground object. The object motion encoder 424 may also input egomotion determined by the egomotion model 450. In one or more embodiments, the object motion encoder 424 comprises two sub-encoders.)
1. and a classification scoring model for describing an association between the feature of the at least one candidate object and a classification score of the at least one candidate object, the classification score representing a probability that the at least one candidate object is classified as foreground in the image, the method comprising: / 16. and a classification scoring model for describing an association between the feature of the at least one candidate object and a classification score of the at least one candidate object, the classification score representing a probability that the at least one candidate object is classified as foreground in the image, the method comprising: / 20. and a classification scoring model for describing an association between the feature of the at least one candidate object and a classification score of the at least one candidate object, the classification score representing a probability that the at least one candidate object is classified as foreground in the image, the method comprising: (Graber: [0038] The object tracking model 422 tracks movement of foreground objects over time. The object tracking model 422 may implement machine learning algorithms, e.g., DeepSort. As the foreground motion model 420 (and its various components) predict positions and/or motion of the foreground objects, the object tracking model 422 may track the foreground objects in different input frames. As additional image data is captured, the object tracking model 422 may further track the position of the foreground objects based on the additional image data. In some embodiments, the object tracking model 422 may score a predicted position of a foreground object predicted by the foreground motion model 420 against the actual position of the foreground object in subsequently captured image data. The score may be utilized by the foreground motion model 420 to further refined the foreground motion model 420. [0039] The object motion encoder 424 inputs frames including a foreground object identified by the object tracking model 422 and outputs abstract features relating to predicted motion for that foreground object. The object motion encoder 424 may also input egomotion determined by the egomotion model 450. In one or more embodiments, the object motion encoder 424 comprises two sub-encoders. For a foreground object, the objection motion encoder 424 inputs bounding box features, mask features, and odometry as determined by the pixel classification model 410 from the input frames. A bounding box feature may be the smallest rectangle that fully encompasses a foreground object. A mask feature may be a bitmap retaining pixels of a foreground object while excluding other pixels. The odometry of a foreground object can be measured by tracking movement of the foreground over the input frames. A first sub-encoder determines a box state representation from the bounding box features, the odometry, and a transformation of mask features. A second sub-encoder determines a mask state representation from mask features and the box state representation. [0040] The object motion decoder 426 inputs the abstract features and outputs a predicted future position of each foreground object. In some embodiments, the foreground motion model 420 inputs a single foreground object (e.g., at a time) to predict a future position of that foreground object. In one or more embodiments, the object motion decoder 426 comprises two sub-decoders. 
A first sub-decoder predicts future bounding boxes, and a second sub-decoder predicts future mask features. The sub-decoders can predict a future position of that foreground object for each of a plurality of future timestamps. For example, input frames for t1, t2, . . . tT (where tT is the most recent timestamp of the input frames, and preceding timestamps) and can output a future position for tT+1, tT+2, . . . tT+F (wherein tT+F is the furthest future timestamp). The predicted future position may also change perspective and/or scale of the foreground objects. [0041]-[0045])
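A classification scoring head of the kind recited, producing a probability that a candidate object is foreground, can be sketched as follows (layer sizes and labels are assumptions for illustration only):

```python
import torch
import torch.nn as nn

score_head = nn.Linear(256, 1)                  # feature -> raw foreground score
feat = torch.randn(5, 256)                      # features of 5 candidate objects
gt = torch.tensor([1., 0., 1., 1., 0.])         # ground-truth foreground labels

logits = score_head(feat).squeeze(1)
fg_prob = torch.sigmoid(logits)                 # classification score in [0, 1]
loss = nn.functional.binary_cross_entropy_with_logits(logits, gt)
loss.backward()                                 # yields the head's update parameter
```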
1. determining an update parameter associated with the classification scoring model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; / 16. determining an update parameter associated with the classification scoring model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; / 20. determining an update parameter associated with the classification scoring model based on the classification score of the at least one candidate object and a ground truth classification score of at least one ground truth object in the image; (Graber: [0041]-[0049], [0041] The foreground motion model 420 may further consider a category of each foreground object. For example, the foreground motion model 420 may comprise a plurality of sub-models, each sub-model trained for each category of foreground object. This allows for more precise modeling of the motion for different categories of foreground objects. For example, vehicles can move very fast compared to pedestrians. [0042] The background motion model 430 forecasts motion of background pixels in the input frames, i.e., predicts a future position of the background pixels. In accordance with one or more embodiments, the background motion model 430 includes a backprojection model 432, a semantic motion model 434, and optionally a refinement model 436. [0043] The backprojection model 432 backprojects the background pixels into a 3D point cloud space as 3D point clouds based on depth of the background pixels. Depth may be determined by a stereo depth estimation model and/or a monodepth estimation model, e.g., described in U.S. application Ser. No. 16/332,343 entitled “Predicting Depth From Image Data Using a Statistical Model,” filed on Sep. 12, 2017; U.S. application Ser. No. 16/413,907 entitled “Self-Supervised Training of a Depth Estimation System,” filed on May 16, 2019; and U.S. application Ser. No. 16/864,743 entitled “Self-Supervised Training of a Depth Estimation Model Using Depth Hints,” filed on May 1, 2020. The backprojection model 432 generates a 3D point cloud space from the perspective of the input frames. The backprojection model 432 may further consider camera intrinsic parameters in the backprojection. For example, the backprojection model 432 utilizes a camera focal length and sensor size to establish a viewing frustum from the perspective of the camera. The backprojection model 432 may also utilize the camera focal length to estimate depth of the pixels. With the estimated depth for each pixel, the backprojection model 432 projects the pixel into a 3D point cloud based on the estimated depth.)
1. updating the classification scoring model based on the update parameter associated with the classification scoring model; / 16. updating the classification scoring model based on the update parameter associated with the classification scoring model; / 20. updating the classification scoring model based on the update parameter associated with the classification scoring model; (Graber: [0051] The game server 120 can be configured to receive requests for game data from a client device 110 (for instance via remote procedure calls (RPCs)) and to respond to those requests via the network 105. For instance, the game server 120 can encode game data in one or more data files and provide the data files to the client device 110. In addition, the game server 120 can be configured to receive game data (e.g. player positions, player actions, player input, etc.) from a client device 110 via the network 105. For instance, the client device 110 can be configured to periodically send player input and other updates to the game server 120, which the game server 120 uses to update game data in the game database 115 to reflect any and all changed conditions for the game. [0052]-[0056], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. 
The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
1. and preventing the feature extraction model from being updated with the update parameter associated with the classification scoring model. / 16. and preventing the feature extraction model from being updated with the update parameter associated with the classification scoring model. / 20. and preventing the feature extraction model from being updated with the update parameter associated with the classification scoring model. (Graber: [0057]-[0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059] Once the panoptic segmentation module 142 is trained, the panoptic segmentation module 142 receives image data and outputs a panoptic segmentation predicting future positions of pixels in the input image data. The panoptic segmentation training system 170 provides the trained panoptic segmentation module 142 to the client device 110. The client device 110 uses the trained panoptic segmentation module 142 to predict a future panoptic segmentation based on input images (e.g., captured by a camera on the device). [0060] Various embodiments of panoptic segmentation forecasting and approaches to training the various models of the panoptic segmentation module 142 are described in greater detail in Appendix A, which is a part of this disclosure and specification. Note that Appendix A describes exemplary embodiments, and any features that may be described as or implied to be important, critical, essential, or otherwise required in Appendix A should be understood to only be required in the specific embodiment described and not required in all embodiments.)
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify the data augmentation and classification model of Mounsaveng with Graber's improved object detection and motion estimation models for object tracking. The determination of obviousness is predicated upon the following findings: One skilled in the art would have been motivated to modify Mounsaveng in order to improve the data augmentation and classification model by leveraging Graber's machine learning models for object tracking, which incorporate a score-based operation for predictive analysis and enhanced accuracy. Furthermore, the prior art collectively includes each element claimed (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a "fundamental" operating principle of Mounsaveng, while the teachings of Graber continue to perform the same function as originally taught prior to the combination, thereby producing the repeatable and predictable result of a refined object detection and estimation model with enhanced accuracy in tracking and classification. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
Consider Claims 2 and 17.
The combination of Mounsaveng and Graber teaches:
2. The method of claim 1, wherein the machine learning model further comprises a position scoring model that describes an association between the feature of the at least one candidate object and a position score of the at least one candidate object, / 17. The device of claim 16, wherein the machine learning model further comprises a position scoring model that describes an association between the feature of the at least one candidate object and a position score of the at least one candidate object, (Mounsaveng: [0182] The data augmentation learning procedure 300 updates the current values of the set of classification parameters 255 of the classification network 250 based on the current training loss to obtain updated values of the set of classification parameters 255. It will be appreciated that one or more values of the set of classification parameters 255 are updated depending on the current training loss. [0183] Graber: [0038] The object tracking model 422 tracks movement of foreground objects over time. The object tracking model 422 may implement machine learning algorithms, e.g., DeepSort. As the foreground motion model 420 (and its various components) predict positions and/or motion of the foreground objects, the object tracking model 422 may track the foreground objects in different input frames. As additional image data is captured, the object tracking model 422 may further track the position of the foreground objects based on the additional image data. In some embodiments, the object tracking model 422 may score a predicted position of a foreground object predicted by the foreground motion model 420 against the actual position of the foreground object in subsequently captured image data. The score may be utilized by the foreground motion model 420 to further refined the foreground motion model 420.)
2. the position score representing a difference between a position of the at least one candidate object and a ground truth position of the at least one ground truth object, and the method further comprises: / 17. the position score representing a difference between a position of the at least one candidate object and a ground truth position of the at least one ground truth object, and the method further comprises: (Mounsaveng: [0183] The updated values of the set of classification parameters 255 are used by the classification network 250 to perform classification on another transformed training image (not depicted) generated by the augmentation network 270 based on another given training image (not depicted) of the subset of training images 285 during a subsequent training loop iteration 305. Graber: [0045] The refinement model 436 fills in such gaps using the forecasted 3D point clouds. There may be sparsity of point clouds and lack of information in regions of previously occluded pixels. To train the background refinement model, a cross-entropy loss is applied at pixels which do not correspond to foreground objects in the target frame. This encourages the output of the refinement model 436 to match the ground truth semantic segmentation at each pixel. To fill the gaps, the refinement model 436 may generate novel point clouds interpolating from the existing point clouds. [0046] The aggregation model 440 layers the future positions of the foreground pixels onto the future positions of the background pixels. The layering is ordered such that objects at closer depths are layered atop objects at farther depths. The result is a future panoptic segmentation that includes future positions of foreground objects and future positions of background objects. [0057]-[0059])
2. determining an update parameter associated with the position scoring model based on the position of the at least one candidate object and the ground truth position of the at least one ground truth object; and updating the feature extraction model based on the update parameter associated with the position scoring model./ 17. determining an update parameter associated with the position scoring model based on the position of the at least one candidate object and the ground truth position of the at least one ground truth object; and updating the feature extraction model based on the update parameter associated with the position scoring model. (Mounsaveng: [0184] At each training loop iteration 305, the updated values of the set of classification parameters 255 are shared with the classification network 250 used during the validation loop iteration 360. The updated values of the set of classification parameter 255 obtained during the training loop iteration 305 are used for performing predictions on validation images of the subset of validation images 290 during the validation loop iteration 360. Graber: [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. 
The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059])
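A position scoring model of the kind recited in claims 2 and 17, whose score measures the difference between a candidate's position and the ground-truth position and whose loss also updates the feature extractor, can be sketched as follows (shapes are assumptions):

```python
import torch
import torch.nn as nn

extractor = nn.Linear(128, 64)                 # stand-in feature extraction model
pos_head = nn.Linear(64, 2)                    # feature -> predicted (cx, cy)

feat_in = torch.randn(3, 128)
gt_pos = torch.rand(3, 2)                      # ground-truth object positions

pred_pos = pos_head(extractor(feat_in))
pos_score = (pred_pos - gt_pos).norm(dim=1)    # difference from ground truth
pos_score.mean().backward()                    # no detach: gradients reach the extractor
assert all(p.grad is not None for p in extractor.parameters())
```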
Consider Claims 3 and 18.
The combination of Mounsaveng and Graber teaches:
3. The method of claim 2, wherein the machine learning model further comprises a mask model that describes an association between the feature of the at least one candidate object and a region of the at least one candidate object, and the method further comprises: / 18. The device of claim 17, wherein the machine learning model further comprises a mask model that describes an association between the feature of the at least one candidate object and a region of the at least one candidate object, and the method further comprises: (Graber: [0038] The object tracking model 422 tracks movement of foreground objects over time. The object tracking model 422 may implement machine learning algorithms, e.g., DeepSort. As the foreground motion model 420 (and its various components) predict positions and/or motion of the foreground objects, the object tracking model 422 may track the foreground objects in different input frames. As additional image data is captured, the object tracking model 422 may further track the position of the foreground objects based on the additional image data. In some embodiments, the object tracking model 422 may score a predicted position of a foreground object predicted by the foreground motion model 420 against the actual position of the foreground object in subsequently captured image data. The score may be utilized by the foreground motion model 420 to further refined the foreground motion model 420.)
3. determining an update parameter associated with the mask model based on the region of the at least one candidate object and a ground truth region of the at least one ground truth object; and updating the feature extraction model with the update parameter associated with the mask model. / 18. determining an update parameter associated with the mask model based on the region of the at least one candidate object and a ground truth region of the at least one ground truth object; and updating the feature extraction model with the update parameter associated with the mask model. (Graber: [0036]-[0039] [0039] The object motion encoder 424 inputs frames including a foreground object identified by the object tracking model 422 and outputs abstract features relating to predicted motion for that foreground object. The object motion encoder 424 may also input egomotion determined by the egomotion model 450. In one or more embodiments, the object motion encoder 424 comprises two sub-encoders. For a foreground object, the objection motion encoder 424 inputs bounding box features, mask features, and odometry as determined by the pixel classification model 410 from the input frames. A bounding box feature may be the smallest rectangle that fully encompasses a foreground object. A mask feature may be a bitmap retaining pixels of a foreground object while excluding other pixels. The odometry of a foreground object can be measured by tracking movement of the foreground over the input frames. A first sub-encoder determines a box state representation from the bounding box features, the odometry, and a transformation of mask features. A second sub-encoder determines a mask state representation from mask features and the box state representation.)
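A mask model of the kind recited in claims 3 and 18, predicting a per-pixel region for each candidate and trained against the ground-truth region, can be sketched as follows (shapes are assumptions):

```python
import torch
import torch.nn as nn

mask_head = nn.Conv2d(64, 1, kernel_size=1)          # feature map -> mask logits
feat_map = torch.randn(2, 64, 28, 28, requires_grad=True)
gt_mask = (torch.rand(2, 1, 28, 28) > 0.5).float()   # ground-truth region bitmap

mask_loss = nn.functional.binary_cross_entropy_with_logits(
    mask_head(feat_map), gt_mask)
mask_loss.backward()                                 # gradients also reach feat_map
assert feat_map.grad is not None
```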
Consider Claims 4 and 19.
The combination of Mounsaveng and Graber teaches:
4. The method of claim 3, wherein the machine learning model further comprises a bounding box model that describes an association between the feature of the at least one candidate object and a bounding box of the at least one candidate object in the image, and the method further comprises: determining an update parameter associated with the bounding box model based on the bounding box of the at least one candidate object and a ground truth bounding box of the at least one ground truth object; and updating the feature extraction model based on the update parameter associated with the bounding box model./ 19. The device of claim 18, wherein the machine learning model further comprises a bounding box model that describes an association between the feature of the at least one candidate object and a bounding box of the at least one candidate object in the image, and the method further comprises: determining an update parameter associated with the bounding box model based on the bounding box of the at least one candidate object and a ground truth bounding box of the at least one ground truth object; and updating the feature extraction model based on the update parameter associated with the bounding box model. (Graber: [0038] The object tracking model 422 tracks movement of foreground objects over time. The object tracking model 422 may implement machine learning algorithms, e.g., DeepSort. As the foreground motion model 420 (and its various components) predict positions and/or motion of the foreground objects, the object tracking model 422 may track the foreground objects in different input frames. As additional image data is captured, the object tracking model 422 may further track the position of the foreground objects based on the additional image data. In some embodiments, the object tracking model 422 may score a predicted position of a foreground object predicted by the foreground motion model 420 against the actual position of the foreground object in subsequently captured image data. The score may be utilized by the foreground motion model 420 to further refined the foreground motion model 420. [0039] The object motion encoder 424 inputs frames including a foreground object identified by the object tracking model 422 and outputs abstract features relating to predicted motion for that foreground object. The object motion encoder 424 may also input egomotion determined by the egomotion model 450. In one or more embodiments, the object motion encoder 424 comprises two sub-encoders. For a foreground object, the objection motion encoder 424 inputs bounding box features, mask features, and odometry as determined by the pixel classification model 410 from the input frames. A bounding box feature may be the smallest rectangle that fully encompasses a foreground object. A mask feature may be a bitmap retaining pixels of a foreground object while excluding other pixels. The odometry of a foreground object can be measured by tracking movement of the foreground over the input frames. A first sub-encoder determines a box state representation from the bounding box features, the odometry, and a transformation of mask features. A second sub-encoder determines a mask state representation from mask features and the box state representation. [0057]-[0059] [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. 
Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp.)
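Examiner's note (illustrative only; no such code appears in the application or in the cited references): the update step recited in claims 4 and 19 can be sketched as follows, assuming a PyTorch-style pipeline. All module names (feature_extractor, box_head) and shapes are hypothetical. The bounding box model maps a candidate feature to a box, the box loss against the ground-truth box yields the update parameter, and backpropagation carries that update into the feature extraction model.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the claimed models.
feature_extractor = nn.Conv2d(3, 16, 3, padding=1)   # feature extraction model
box_head = nn.Linear(16, 4)                          # bounding box model: feature -> (x, y, w, h)

optimizer = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(box_head.parameters()), lr=1e-3)

def bbox_update_step(image, gt_box):
    fmap = feature_extractor(image)                  # (N, 16, H, W) feature map
    feats = fmap.mean(dim=(2, 3))                    # (N, 16), one pooled feature per candidate
    pred_box = box_head(feats)                       # association: feature -> bounding box
    loss = nn.functional.smooth_l1_loss(pred_box, gt_box)  # predicted vs. ground-truth box
    optimizer.zero_grad()
    loss.backward()                                  # update parameter flows back through both models
    optimizer.step()                                 # feature extraction model updated via the box loss
    return loss.item()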
Consider Claim 5.
The combination of Mounsaveng and Graber teaches:
5. The method of claim 4, wherein the machine learning model further comprises a contrastive learning model, and the method further comprises: selecting, from the at least one candidate object, a positive sample and a negative sample for contrastive learning; determining, using the positive sample and the negative sample, an update parameter associated with the contrastive learning model; and updating the feature extraction model with the update parameter associated with the contrastive learning model.(Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
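Examiner's note (illustrative only): one common realization of the contrastive learning step recited in claim 5 is an InfoNCE-style loss over the selected positive and negative samples. The sketch below is an assumption for clarity, not the applicant's or the references' implementation; the temperature value and all names are hypothetical.

import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positives, negatives, temperature=0.07):
    # anchor: (D,); positives: (P, D); negatives: (N, D) candidate-object features.
    anchor = F.normalize(anchor, dim=0)
    pos_logits = F.normalize(positives, dim=1) @ anchor / temperature   # (P,)
    neg_logits = F.normalize(negatives, dim=1) @ anchor / temperature   # (N,)
    all_logits = torch.cat([pos_logits, neg_logits])
    # Pull positives toward the anchor, push negatives away from it.
    loss = -(pos_logits - torch.logsumexp(all_logits, dim=0)).mean()
    return loss   # loss.backward() yields the update parameter for the feature extraction model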
Consider Claim 6.
The combination of Mounsaveng and Graber teaches:
6. The method of claim 5, wherein selecting the positive sample and the negative sample comprises: determining a sequence of the at least one candidate object based on a comparison between the at least one candidate object and the at least one ground truth object; and selecting, from the sequence, the positive sample and the negative sample. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
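Examiner's note (illustrative only): claim 6's "sequence" determination can be read as ordering candidates by how well they match the ground truth. The self-contained sketch below assumes boxes given as (x1, y1, x2, y2) and uses IoU as the comparison; the comparison metric and sample data are hypothetical.

def iou(a, b):
    # Axis-aligned intersection-over-union for (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def candidate_sequence(cand_boxes, gt_boxes):
    # Order candidates by their best overlap with any ground-truth box.
    return sorted(cand_boxes, key=lambda c: max(iou(c, g) for g in gt_boxes))

seq = candidate_sequence([(0, 0, 10, 10), (4, 4, 14, 14)], [(5, 5, 15, 15)])
positive, negative = seq[-1], seq[0]   # best match at the end, worst at the start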
Consider Claim 7.
The combination of Mounsaveng and Graber teaches:
7. The method of claim 6, wherein determining the sequence of the at least one candidate object comprises: selecting, from the at least one candidate object, a similar candidate object that is similar to the at least one ground truth object based on a comparison between the feature of the at least one candidate object and a ground truth feature of the at least one ground truth object; determining a feature center using the feature of the similar candidate object; and determining the sequence of the at least one candidate object based on a distance between the feature of the at least one candidate object and the feature center. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059]-[0060])
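Examiner's note (illustrative only): a minimal numpy sketch of claim 7's feature-center step, under the assumption that "similar" means cosine similarity above a hypothetical threshold and that at least one candidate clears it. All names, shapes, and the threshold value are assumptions.

import numpy as np

def order_by_center_distance(cand_feats, gt_feats, sim_threshold=0.5):
    # cand_feats: (N, D) candidate features; gt_feats: (M, D) ground-truth features.
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    g = gt_feats / np.linalg.norm(gt_feats, axis=1, keepdims=True)
    best_sim = (c @ g.T).max(axis=1)                 # best similarity to any ground truth
    similar = cand_feats[best_sim >= sim_threshold]  # "similar candidate objects"
    center = similar.mean(axis=0)                    # feature center (assumes similar is non-empty)
    dist = np.linalg.norm(cand_feats - center, axis=1)
    order = np.argsort(-dist)                        # farthest first, closest at the end
    return order, center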
Consider Claim 8.
The combination of Mounsaveng and Graber teaches:
8. The method of claim 7, wherein determining the distance between the feature of the at least one candidate object and the feature center comprises: determining the distance based on an optimal transport strategy. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059]-[0060])
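Examiner's note (illustrative only): the cited passages do not elaborate claim 8's "optimal transport strategy"; one common choice is an entropy-regularized (Sinkhorn) distance, sketched below with uniform marginals. The regularization weight and iteration count are hypothetical.

import numpy as np

def sinkhorn_distance(cost, eps=0.1, n_iters=100):
    # cost[i, j]: e.g., distance between candidate feature i and feature center j.
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):                         # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]               # optimal transport plan
    return float((plan * cost).sum())                # transport cost = the claimed distance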
Consider Claim 9.
The combination of Mounsaveng and Graber teaches:
9. The method of claim 6, wherein selecting the positive sample from the sequence comprises: selecting a first number of candidate objects from an end of the sequence as the positive samples. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059]-[0060])
Consider Claim 10.
The combination of Mounsaveng and Graber teaches:
10. The method of claim 9, wherein selecting the negative sample from the sequence comprises: selecting a second number of candidate objects, from further candidate objects after the first number of candidate objects in the sequence, as the negative sample. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059]-[0060])
Consider Claim 11.
The combination of Mounsaveng and Graber teaches:
11. The method of claim 10, wherein selecting the second number of candidate objects as the negative sample comprises: selecting the second number of candidate objects, from candidate objects adjacent to the first number of candidate objects among the further candidate objects, as the negative sample. (Mounsaveng: [0144] It will be appreciated that following the chain rule, the gradient of the validation loss given validation data Xval and optimal values of the set of classification parameters 255 with respect to the set of transformation parameters 275 can be expressed as: the gradient of the validation loss given Xval and the optimal values of the set of classification parameters 255 with respect to the optimal values of the set of classification parameters 255 multiplied by the gradient of the set of classification parameters 255 given augmented training data θ(Xtr) with respect to the set of transformation parameters 275. [0145] As W* represents optimal values of the set of classification parameters 255 at training convergence, the values depend on θ for each iteration of gradient descent. Thus, to compute
[equation image media_image3.png: the gradient of the validation loss with respect to the set of transformation parameters, per [0144]-[0145]]
back-propagation through the entire T iteration of the training cycle is required. However, it will be appreciated that this approach may be performed only for small problems due to the large requirements in terms of computation and memory. [0151] The data augmentation learning procedure 300 learns values for a set of transformation parameters 275 that define a distribution of transformation performed by the augmentation network 270 which may be applied on the training data to improve generalization of the classification network 250. [0152] Thus, the data augmentation learning procedure 300 jointly learns parameters of a classification network, and parameters of an augmentation network. [0153] The data augmentation learning procedure 300 adapts the data augmentation transformations by updating dynamically the set of transformation parameters 275 with the evolution of the training of the classification network 250. [0154] Backpropagation is used for adjusting each weight in a network in proportion to how much it contributes to overall error, i.e. by iteratively reducing each weight's error, weight values producing good predictions may be obtained. [0155]-[0169], [0159] The data augmentation learning procedure 300 receives a set of noise vectors (only one noise vector 318 depicted in FIG. 3). The set of noise vectors may be generated using techniques known in the art. A given noise vector of the set of noise vectors may be used during training to expand the size of the training dataset. A given noise vector 318 is a vector including random numerical values sampled from a distribution. Each element of the vector may correspond to a random value from the distribution. In one or more alternative embodiments, elements of a given noise vector may be sampled from different distributions. [0160] It will be appreciated that a noise vector used for learning a distribution of transformations by the augmentation network 270. Graber: [0045], [0057] The panoptic segmentation training system 170 trains the models used by the panoptic segmentation module 142. The panoptic segmentation training system 170 receives image data for use in training the models of the panoptic segmentation module 142. Generally, the panoptic segmentation training system 170 may perform supervised training of the models of the panoptic segmentation module 142. The training of the models may be simultaneous or separate. With supervised training, a data set used to train a particular model or models has a ground truth that a prediction is evaluated against to calculate a loss. The training system 170 iteratively adjusts weights of the models to optimize the loss. As a future panoptic segmentation predicts future positions of foreground objects and future positions of background objects in a scene, a video captured by a camera on a moving agent can be used for supervised training. The training system 170 inputs a subset of frames and attempts to generate a future panoptic segmentation at a subsequent timestamp in the video. The training system 170 may compare the future panoptic segmentation to the frame at that subsequent timestamp. [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). 
A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430. [0059]-[0060])
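Examiner's note (illustrative only): claims 9-11 read naturally as simple slicing over the ordered sequence, reusing seq from the claim 6 sketch above. The "first number" k and "second number" m below are hypothetical values.

# seq: candidates ordered with the best matches at the end (claims 6-8 above).
k, m = 3, 6                                          # hypothetical first and second numbers
positives = seq[-k:]                                 # claim 9: first number taken from an end
remaining = seq[:-k]                                 # further candidates after the positives
negatives = remaining[-m:]                           # claims 10-11: candidates adjacent to the positives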
Consider Claim 12.
The combination of Mounsaveng and Graber teaches:
12. The method of claim 1, wherein the image comprises at least one labeled ground truth object, and the image further comprises at least one unlabeled object.(Mounsaveng: [0042] MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning. Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors, where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs. Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective for the machine learning algorithm is to find a structure or hidden patterns in the data. Reinforcement learning involves having an algorithm evolving in a dynamic environment guided only by positive or negative reinforcement. [0043], Graber: [0057]- [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
Consider Claim 13.
The combination of Mounsaveng and Graber teaches:
13. The method of claim 12, further comprising: inputting an image to be processed to the machine learning model; determining, using the position scoring model, a position score of at least one candidate object in the image to be processed; determining, using the classification scoring model, a classification score of the at least one candidate object in the image to be processed; and determining, based on geometric mean of the position score and the classification score, a final score of the at least one candidate object in the image to be processed. (Mounsaveng: [0042] MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning. Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors, where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs. Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective for the machine learning algorithm is to find a structure or hidden patterns in the data. Reinforcement learning involves having an algorithm evolving in a dynamic environment guided only by positive or negative reinforcement. [0043], Graber: [0057]- [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
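Examiner's note (illustrative only): the final-score computation in claim 13 is a direct formula, shown below for concreteness; the example score values are arbitrary.

import math

def final_score(position_score, classification_score):
    # Geometric mean of the two per-candidate scores.
    return math.sqrt(position_score * classification_score)

# e.g., final_score(0.81, 0.64) -> approximately 0.72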
Consider Claim 14.
The combination of Mounsaveng and Graber teaches:
14. The method of claim 13, wherein the image to be processed comprises at least one unlabeled object, and the at least one unlabeled object is not comprised in labeled ground truth objects in an image used to train the machine learning model. (Mounsaveng: [0042] MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning. Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors, where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs. Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective for the machine learning algorithm is to find a structure or hidden patterns in the data. Reinforcement learning involves having an algorithm evolving in a dynamic environment guided only by positive or negative reinforcement. [0043], Graber: [0057]- [0058] This principle applies to each of the components of the panoptic segmentation module 142. For example, taking the foreground motion model 420, the training system 170 subdivides the video into input frames and ground truth future positions. For example, the training system 170 uses a sliding window to capture subsets of some number of adjacent timestamped frames (e.g., grouping into six frames). A supposed current timestamp is used to split each subset of adjacent timestamped frames into training input frames and training ground truth frames (e.g., three out of six frames are training input frames and three out of six frames are training ground truth frames). The training system 170 inputs the training input frames into the foreground motion model 420 to predict future positions of the foreground objects which is compared against the training ground truth frames to calculate a loss for the foreground motion model 420. And with the background motion model 430, the training system 170 may use a similar subdivision of video data. The training system 170 inputs the training input frames into the background motion model 430 to determine future position of the background pixels which is compared against the training ground truth frames to calculate a loss for the background motion model 430.)
Consider Claim 15.
The combination of Mounsaveng and Graber teaches:
15. The method of claim 13, further comprising at least any of: determining, using the mask model, a region of at least one candidate object in the image to be processed; and determining, using the bounding box model, a bounding box of at least one candidate object in the image to be processed.(Graber: [0039] The object motion encoder 424 inputs frames including a foreground object identified by the object tracking model 422 and outputs abstract features relating to predicted motion for that foreground object. The object motion encoder 424 may also input egomotion determined by the egomotion model 450. In one or more embodiments, the object motion encoder 424 comprises two sub-encoders. For a foreground object, the objection motion encoder 424 inputs bounding box features, mask features, and odometry as determined by the pixel classification model 410 from the input frames. A bounding box feature may be the smallest rectangle that fully encompasses a foreground object. A mask feature may be a bitmap retaining pixels of a foreground object while excluding other pixels. The odometry of a foreground object can be measured by tracking movement of the foreground over the input frames. A first sub-encoder determines a box state representation from the bounding box features, the odometry, and a transformation of mask features. A second sub-encoder determines a mask state representation from mask features and the box state representation.)
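Examiner's note (illustrative only): for claim 15's inference step, a minimal sketch of a shared per-candidate feature feeding a hypothetical box head and mask head. The class, dimensions, and head designs are assumptions for clarity, not Graber's encoder architecture.

import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    # Hypothetical heads over a shared per-candidate feature vector.
    def __init__(self, dim=256, mask_size=28):
        super().__init__()
        self.box_head = nn.Linear(dim, 4)                        # bounding box model
        self.mask_head = nn.Linear(dim, mask_size * mask_size)   # mask model
        self.mask_size = mask_size

    def forward(self, feat):                          # feat: (N, dim) candidate features
        boxes = self.box_head(feat)                   # (N, 4) bounding boxes
        masks = torch.sigmoid(self.mask_head(feat))   # per-pixel region probabilities
        return boxes, masks.view(-1, self.mask_size, self.mask_size)

heads = DetectionHeads()
boxes, masks = heads(torch.randn(5, 256))             # five candidate objects in the image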
Conclusion
The prior art made of record in form PTO-892 and not relied upon is considered pertinent to applicant's disclosure.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA ANSARI whose telephone number is 571-270-3379. The examiner can normally be reached on IFP Flex - Monday through Friday 9 to 5.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, O'NEAL MISTRY, can be reached on 313-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300, for both regular and After Final communications. TC 2600's customer service number is 571-272-2600.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist whose telephone number is 571-272-2600.
/Tahmina Ansari/
March 7, 2026
/TAHMINA N ANSARI/Primary Examiner, Art Unit 2674