Last updated: May 29, 2026
Application No. 18/308,452
SEGMENTATION OF A SEQUENCE OF VIDEO IMAGES WITH A TRANSFORMER NETWORK

Non-Final OA §101§102§103
Filed
Apr 27, 2023
Priority
May 06, 2022 — DE 10 2022 204 493.2
Examiner
NASHER, AHMED ABDULLALIM-M
Art Unit
2675
Tech Center
2600 — Communications
Assignee
Robert Bosch GmbH
OA Round
1 (Non-Final)
Interview Optional

— +32.8% interview lift. Examiner has a relatively high allowance rate (81%); +32.8% interview lift. A written response may suffice.
Based on 101 resolved cases, 2023–2026
Examiner Intelligence

NASHER, AHMED ABDULLALIM-M View full profile →
Grants 81% — above average
Career Allowance Rate
82 granted / 101 resolved
+19.2% vs TC avg
Strong +33% interview lift
Without
With
+32.8%
Interview Lift
resolved cases with interview
Typical timeline
2y 8m
Avg Prosecution
12 currently pending
Career history
120
Total Applications
across all art units
Statute-Specific Performance

§101
0.4%
-39.6% vs TC avg
§103
87.7%
+47.7% vs TC avg
§102
5.7%
-34.3% vs TC avg
§112
1.3%
-38.7% vs TC avg
Black line = Tech Center average estimate • Based on career data from 101 resolved cases
Office Action

§101 §102 §103
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. DE10 2022 204 493.2, filed on 05/06/2022.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/27/2023, 05/15/2023 are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 15 rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter because a compute instance in claim 15 is a software per se. Examiner suggests removing the language for compute instance to overcome the 101 rejection. 
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1-4, 6-7, 14-15 is/are rejected under 35 U.S.C. 102 (a)(1) as being anticipated by Wang (OadTR:  Online Action Detection with Transformers).
Regarding claims 1, 14 and 15, Wang discloses A method for transforming a frame sequence of video frames into a scene sequence of scenes that belong to different classes of a predetermined classification and that each extend over a region on a time axis, the method comprising the following steps ("page 3, col 1: 3.1.Problem description
Given a video stream that may contain multiple actions, the goal of the task is to identify the actions currently taking place in real-time. We denote V={ft}0 t=−T as the input streaming video, which needs to classify the current frame chunk f0. We use y0 to represent the action category of the current frame chunk f0, and y0∈{0,1,...,C}, where C is the total number of the action categories and index 0 denotes the background category."):
extracting features from each video frame of the frame sequence (page 3: Figure2: Illustration of the proposed Online Action Detection TRansformer (OadTR). Given an input streaming video V={ft}0 t=−T, a task token is  attached to the visual features output by the feature extraction network. Then the token feature sequence is input into the standard Transformer’s encoder to model long-range historical temporal dependencies.);
transforming the features belonging to each video frame into a feature representation in a first working space ("page 4, col 1: Intuitively, if there is no token class here, the final feature representation obtained by other tokens will inevitably be biased towards this specified token as a whole, and thus cannot be used to represent this learning task (i.e., w/o task token in Figure 3). In contrast, the semantic embedding of token class can be obtained by adaptively interacting with other tokens in the encoder, which is more suitable for feature representations (i.e., w/ task token in Figure 3). We will further confirm the necessity of tokenclass in Sec 4.3.");
ascertaining, with a trainable encoder of a transformer network, a feature interaction of each feature representation with respectively other feature representations, the feature interactions characterizing a frame prediction (page 4, col 1: Multi-head self-attention (MSA) is the core component of the Transformer. Intuitively, the idea behind self attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.);
transforming a class belonging to each already-ascertained scene into a scene representation in a second working space, into which a position of the respective scene in the scene sequence is encoded ("page 4, col 1: Since there is no frame order information in the encoder, we need to embed position encoding additionally. Position encoding can take two forms: sinusoidal inputs and trainable embeddings. We add position encoding Epos ∈ R(T+2)×D to the token sequence (i.e., element-wise addition) to retain positional information:  X0 = ˜F +Epos  (1)
In this way, positional information can be kept despite the orderless self-attention.");
ascertaining, with a trainable decoder of the transformer network, a scene interaction of each scene representation with each of all other scene representations ("page 3: Afterward, the decoder of OadTR anticipates the future context information in parallel. Finally, the predicted future context are involved in classifying the current action.
Page 4, col 2:  Consequently, the decoder of OadTR makes use of the observation of past information to predict the actions that will occur in the near future, so as to better learn more discriminative features.");
ascertaining, with the decoder, a scene-feature interaction of each scene interaction with each feature interaction ("page 4, col 2: The difference with the original Transformer [47] is that our decoder decodes the d prediction queries in parallel at each decoding layer. The decoder is allowed to utilize semantic information from the encoder through the encoder-decoder cross-attention mechanism.
For the classification task of the current frame chunk, we first concatenate the task-related features in the encoder with the pooled predicted features in the decoder. Then the resulting features go through a full connection layer and a softmax operation for action classification."); and
ascertaining from the scene-feature interactions, with the decoder at least the class of a next scene in the scene sequence that is most plausible in view of the frame sequence and the already-ascertained scenes (page 4, col 2: The difference with the original Transformer [47] is that our decoder decodes the d prediction queries in parallel at each decoding layer. The decoder is allowed to utilize semantic information from the encoder through the encoder-decoder cross-attention mechanism.).
Regarding claim 2, Wang discloses ascertaining similarity measures between each respective feature representation and each of all the other feature representations (page 4, col 1: Multi-head self-attention (MSA) is the core component of the Transformer. Intuitively, the idea behind self attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.), and aggregating contributions of each of the other feature representations in weighted fashion with the similarity measures (page 4, col 1: Multi-head self-attention (MSA) is the core component of the Transformer. Intuitively, the idea behind self attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.).
Regarding claim 3, ascertaining similarity measures between each respective scene representation and each of all the other scene representations ("page 3, fig. 2: Afterward, the decoder of OadTR anticipates the future context information in parallel. Finally, the predicted future context are involved in classifying the current action.
Page 4, col 2:  Consequently, the decoder of OadTR makes use of the observation of past information to predict the actions that will occur in the near future, so as to better learn more discriminative features."), and
aggregating contributions from each of the other scene representations in weighted fashion with the similarity measures (Figure 3: Comparison of similarity distribution between the classification features (i.e., features before sending to classifier)and the input features sequence ˜F. Note that w/o task token: the output classification features correspond to the token of the f0 input; w/ task token: the output classification features correspond to the task token.).

    PNG
    media_image1.png
    344
    618
    media_image1.png
    Greyscale

Regarding claim 4, Wang discloses ascertaining similarity measures between each respective scene interaction and the feature interactions (Figure 3: Comparison of similarity distribution between the classification features (i.e., features before sending to classifier) and the input features sequence ˜F. Note that w/o task token: the output classification features correspond to the token of the f0 input; w/ task token: the output classification features correspond to the task token.), and
aggregating contributions of the feature interactions in weighted fashion with these similarity measures (Figure 5: Attention visualization maps. They indicate how much attention is paid to parts of the input streaming video.).

    PNG
    media_image2.png
    662
    730
    media_image2.png
    Greyscale

Regarding claim 6, Wang discloses wherein both the class of the next scene and the region on the time axis over which the next scene extends are ascertained with the decoder of the transformer network ("page 4, col 2, 3.2.2: The difference with the original Transformer [47] is that our decoder decodes the d prediction queries in parallel at each decoding layer. The decoder is allowed to utilize semantic information from the encoder through the encoder-decoder cross-attention mechanism.
3.2.3: In OadTR, we mainly use the encoder to identify the current frame chunk f0, and the decoder to predict the coming future. ..... In addition to the estimated current action, OadTR also outputs predicted features for the next d time steps.").
Regarding claim 7, Wang discloses wherein the region on the time axis over which the next scene extends is ascertained using a trained auxiliary decoder network that receives as inputs both the classes provided by the decoder of the transformer network and the feature interactions (3.2.3: In OadTR, we mainly use the encoder to identify the current frame chunk f0, and the decoder to predict the coming future. At the same time, the prediction results are used as auxiliary information to better recognize the action.). 
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 5, 8-11, 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang (OadTR:  Online Action Detection with Transformers) and further in view of Cherian (US 11582485 B1).
Regarding claim 5, Wang discloses the feature representations, the feature interactions, the scene representations, the scene interactions , and the scene-feature interactions are each divided into a query portion, a key portion, and a value portion (page 4, col 1: Multi-head self-attention (MSA) is the core component of the Transformer. Intuitively, the idea behind self attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.
Note that queries, keys, and values are all vectors, and N head is the number of heads.);
value portions being capable of being aggregated in weighted fashion with similarity measures (page 4, col 1: Intuitively, the idea behind self-attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.).
Wang does not explicitly teach query portions being capable of being compared to key portions for the calculation of similarity measures. Wang implicitly teaches query portions being capable of being compared to key portions for the calculation of similarity measures (page 4, col 1: Intuitively, the idea behind self-attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies.).
However, Cherian explicitly teaches query portions being capable of being compared to key portions for the calculation of similarity measures (col 20, lines 65-68: The values 1250 are features of the keys produced by a neural network model 1290. The kernelized self-attention 1260 will compute kernel similarities between the queries and the keys, and produce scores that are used to weight the values 1250.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of efficiently identifying and prioritizing the most relevant information by calculating the similarity or "alignment" between what the model is looking for (query) and what is available in the video (key).
Regarding claim 8, Wang discloses wherein the transformer network, which is used for the ascertaining of the feature interaction and scene interaction, is one trained by a training process comprising the following steps (page 5, col 2: THUMOS14. This data set has 1010 validation videos and 1574 testing videos with 20 classes. For the online action detection task, there are 200 validation videos and 213 testing videos labeled with temporal annotations. As in the previous works [15,51], we train our model on the validation set and evaluate on the test set.): 
providing one or more training frame sequences of video frames that are labeled with target classes of scenes to which the video frames respectively belong (page 5, col 2: THUMOS14. This data set has 1010 validation videos and 1574 testing videos with 20 classes. For the online action detection task, there are 200 validation videos and 213 testing videos labeled with temporal annotations. As in the previous works [15,51], we train our model on the validation set and evaluate on the test set.); 
transforming each of the one or more training frame sequences into a scene sequence of scenes by ("page 3: Figure2: Illustration of the proposed Online Action Detection TRansformer (OadTR). Given an input streaming video V={ft}0 t=−T, a task token is  attached to the visual features output by the feature extraction network. Then the token feature sequence is input into the standard Transformer’s encoder to model long-range historical temporal dependencies. 
page 4, col 1: Intuitively, if there is no token class here, the final feature representation obtained by other tokens will inevitably be biased towards this specified token as a whole, and thus cannot be used to represent this learning task (i.e., w/o task token in Figure 3). In contrast, the semantic embedding of token class can be obtained by adaptively interacting with other tokens in the encoder, which is more suitable for feature representations (i.e., w/ task token in Figure 3). We will further confirm the necessity of token class in Sec 4.3."):
extracting training features from each video frame of the frame sequence ("page 5, col 2: TV Series contains many unconstrained perspectives and a wide variety of backgrounds.
THUMOS14. This data set has 1010 validation videos and 1574 testing videos with 20 classes. For the online action detection task, there are 200 validation videos and 213 testing videos labeled with temporal annotations. As in the previous works [15,51], we train our model on the validation set and evaluate on the test set. Implementation details. For feature extractor, following
previous works [15,16,51] ,we adopt the two-stream network [48] (3072 dimensions) pre-trained on ActivityNet v1.3 [4] (TSN-Anet), where spatial and temporal subnetworks adopt ResNet-200 [18] and BN-Inception [21] separately."), 
transforming the extracted training features from into the feature representations in the first working space ("page 4, col 1: Intuitively, if there is no token class here, the final feature representation obtained by other tokens will inevitably be biased towards this specified token as a whole, and thus cannot be used to represent this learning task (i.e., w/o task token in Figure 3). In contrast, the semantic embedding of token class can be obtained by adaptively interacting with other tokens in the encoder, which is more suitable for feature representations (i.e., w/ task token in Figure 3). We will further confirm the necessity of tokenclass in Sec 4.3."), 
ascertaining, with the  trainable encoder of a transformer network, the feature interactions characterizing frame predictions (page 4, col 1: Multi-head self-attention (MSA) is the core component of the Transformer. Intuitively, the idea behind self attention is that each token can interact with other tokens and can learn to gather useful semantic information more effectively, which is very suitable for capturing long-range dependencies. We compute the dot products of the query with all keys and apply a softmax function to obtain the weights on the values.),
transforming classes belonging to already-ascertained scenes into the scene representations in the second working space with encoded scene positions ("page 4, col 1: Since there is no frame order information in the encoder, we need to embed position encoding additionally. Position encoding can take two forms: sinusoidal inputs and trainable embeddings. We add position encoding 
Epos ∈ R(T+2)×D to the token sequence (i.e., element-wise addition) to retain positional information:
X0 = ˜F +Epos  (1)
In this way, positional information can be kept despite the orderless self-attention.");
ascertaining, with a trainable decoder of the transformer network, the scene interactions and the scene-feature interactions ("page 4, col 2: The difference with the original Transformer [47] is that our decoder decodes the d prediction queries in parallel at each decoding layer. The decoder is allowed to utilize semantic information from the encoder through the encoder-decoder cross-attention mechanism.
For the classification task of the current frame chunk, we first concatenate the task-related features in the encoder with the pooled predicted features in the decoder. Then the resulting features go through a full connection layer and a softmax operation for action classification"), and
and ascertaining, from the scene-feature interactions, interactions, andascertaining, from the scene-feature interactions,  at least the class of a next scene in a respective sequence (page 4, col 2: The difference with the original Transformer [47] is that our decoder decodes the d prediction queries in parallel at each decoding layer. The decoder is allowed to utilize semantic information from the encoder through the encoder-decoder cross-attention mechanism.). Wang does not explicitly disclose evaluating, with a predetermined cost function, to what extent at least the ascertained scene sequence is in accord with the target classes of scenes with which the video frames in the training frame sequences are labeled; 
and optimizing parameters that characterize the behavior of the transformer network, based on the evaluation, with a goal that upon further processing of training frame sequences, the evaluation by the cost function is expected to improve. 
Wang implicitly discloses evaluating, with a predetermined cost function, to what extent at least the ascertained scene sequence is in accord with the target classes of scenes with which the video frames in the training frame sequences are labeled ("page 5, col 1: therefore, the final joint training loss is:
Loss=CE(p0,y0)+λ d i=1 CE(pi,yi) (13)
where CE is the cross entropy loss, yi is the actual action category for the next step I and λ is a balance coefficient, set to 0.5 in the experiment."); 
and optimizing parameters that characterize the behavior of the transformer network, based on the evaluation, with a goal that upon further processing of training frame sequences, the evaluation by the cost function is expected to improve (page 5, col 2: In terms of training, we implement our proposed OadTR in PyTorch and conduct all experiments with Nvidia V100 graphics cards. Without those bells and whistles, we use Adam[25] for optimization, the batch size is set to 128, the learning rate is set to 0.0001, and weight decay is 0.0005. Unless otherwise specified, we set T to 31 for HDD dataset and 63 for both TVSeries and THUMOS1 dataset.).
In a similar field of endeavor of a scene aware video encoder system, Cherian teaches evaluating, with a predetermined cost function, to what extent at least the ascertained scene sequence is in accord with the target classes of scenes with which the video frames in the training frame sequences are labeled (col 13, lines 59-67: In some embodiments, during training process of the standard transformer 606 for the VQA task 512, a cross-entropy loss between the predicted answer A.sub.pred and the ground truth answer A.sub.gt may be computed. In particular, the cross-entropy loss is computed against b×custom character answers produced via concatenating all the answers in a batch. The computation of the cross-entropy loss for b×custom character answers may output accurate gradients and may improve the training process.);
and optimizing parameters that characterize the behavior of the transformer network, based on the evaluation, with a goal that upon further processing of training frame sequences, the evaluation by the cost function is expected to improve (col 13, lines 62-67: The computation of the cross-entropy loss for b×custom character answers may output accurate gradients and may improve the training process.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of accurate action labels by training a scene action labeling systems within a predetermined threshold based on ground-truth data.
Regarding claim 9, Wang discloses wherein the video frames in the training frame sequences, as well as the ascertained scenes, are sorted according to class (fig. 4).

    PNG
    media_image3.png
    448
    570
    media_image3.png
    Greyscale

Wang does not explicitly disclose but Cherian teaches and the cost function measures an agreement of the class prediction, respectively averaged over all members of the classes, with the respective target class ("col 4, lines 1-5: The predicted answer may be a representative of embeddings of a set of candidate answers (that includes a ground truth answer). In some embodiments, the decoder may be trained based on a cross-entropy loss between the predicted answer and the ground truth answer. 
col 13, lines 59-67: In some embodiments, during training process of the standard transformer 606 for the VQA task 512, a cross-entropy loss between the predicted answer A.sub.pred and the ground truth answer A.sub.gt may be computed. In particular, the cross-entropy loss is computed against b×custom character answers produced via concatenating all the answers in a batch. The computation of the cross-entropy loss for b×custom character answers may output accurate gradients and may improve the training process.").
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of accurate action labels by training a scene action labeling systems within a predetermined threshold based on ground-truth data.
Regarding claim 10, Wang does not explicitly disclose but Cherian teaches wherein the cost function measures an extent to which the decoder assigns each video frame to a correct scene ("col 4, lines 1-5: The predicted answer may be a representative of embeddings of a set of candidate answers (that includes a ground truth answer). In some embodiments, the decoder may be trained based on a cross-entropy loss between the predicted answer and the ground truth answer.
col 13, lines 59-67: In some embodiments, during training process of the standard transformer 606 for the VQA task 512, a cross-entropy loss between the predicted answer A.sub.pred and the ground truth answer A.sub.gt may be computed. In particular, the cross-entropy loss is computed against b×custom character answers produced via concatenating all the answers in a batch. The computation of the cross-entropy loss for b×custom character answers may output accurate gradients and may improve the training process.").
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of accurate action labels by training a scene action labeling systems within a predetermined threshold based on ground-truth data.
Regarding claim 11, Wang discloses wherein parameters that characterize a behavior of the auxiliary decoder network are optimized ("abstract:  The decoder extracts auxiliary information by aggregating anticipated future clip representations.
page 5, col 2: In terms of training, we implement our proposed OadTR in PyTorch and conduct all experiments with Nvidia V100 graphics cards. Without those bells and whistles, we use Adam[25] for optimization, the batch size is set to 128, the learning rate is set to 0.0001, and weight decay is 0.0005. Unless otherwise specified, we set T to 31 for HDD dataset and 63 for both TVSeries and THUMOS1dataset."). Wang does not explicitly disclose and the cost function measures an extent to which the  (col 13, lines 59-67: In some embodiments, during training process of the standard transformer 606 for the VQA task 512, a cross-entropy loss between the predicted answer A.sub.pred and the ground truth answer A.sub.gt may be computed. In particular, the cross-entropy loss is computed against b×custom character answers produced via concatenating all the answers in a batch. The computation of the cross-entropy loss for b×custom character answers may output accurate gradients and may improve the training process.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering using an auxiliary decoder in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of accurate action labels by training a scene action labeling systems within a predetermined threshold based on ground-truth data.
Regarding claim 13, Wang discloses wherein the labeled video frames are clustered with respect to their target classes (fig. 2: two classes with respect to two videos  and fig. 4).

    PNG
    media_image1.png
    344
    618
    media_image1.png
    Greyscale

Fig. 2: : two classes with respect to two videos.


    PNG
    media_image3.png
    448
    570
    media_image3.png
    Greyscale

Wang does not explicitly disclose but Cherian teaches and missing target classes for unlabeled video frames are ascertained corresponding to the clusters to which the unlabeled video frames belong (col 2, lines 15-26: The key frames may be extracted using key frame extraction methods, such as cluster-based key frame extraction, visual-based key frame extraction, motion analysis based key frame extraction or the like. In some other example embodiments, the key frames may be extracted based on features of models trained on datasets, e.g., VisualGenome dataset. For example, key frames of a soccer sports video may be extracted based on features extracted from datasets that include players in soccer field, soccer ball with the players, or the like. In some embodiments, the key frames may be extracted by discarding redundant video frames of the video.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the known system of scene action labeling and clustering in Wang’s disclosure, and augmenting it with Cherian’s technique of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, which yields predictable results of accurate action labels by training a scene action labeling systems within a predetermined threshold based on ground-truth data.
Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang (OadTR:  Online Action Detection with Transformers), in view of Cherian (US 11582485 B1) and further in view of Liu (US 11375194 B2). 
Regarding claim 12, Wang and Cherian do not explicitly disclose, but in a similar field of endeavor of conditional entropy coding for efficient video compression, Liu teaches wherein parameters that characterize a behavior of the transformer network is held constant during the training of the auxiliary decoder network (claim 6: wherein during said modifying, all hyperprior decoder model and decoder model parameters are fixed.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of this invention to combine the Wang and Cherian’s disclosures of thresholding a predicted action class by comparing a ground-truth video to an unlabeled video, with Liu’s teaching of a fixed decoder parameter, in order to for the encoding runtime to be traded off to further optimize the latent codes along the rate-distortion curve, while not affecting decoding runtime (col 4, lines 43-45 of Liu).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 20220301310 A1 to limitation 1 of claim 5: [0032] Temporal feature fusion networks 206 (for teacher) and 216 (for student) are generally configured to mix the feature representations of different clips. Note that, unlike image classification, a video action is generally recognized by a sequence of clips. Thus, aggregating clip information over time is useful for accurate video action recognition. To this end, the temporal feature fusion networks 206 and 216 are configured to perform a self-attention technique using three linear projection layers to generate queries, keys, and values. In this example, the query and key dimensions are set to d.sub.k, and the dimension of value is the same as the input feature.
US 11854206 B2: To second limitation of claim 8: claim 3. The method of claim 1, wherein the plurality of sub-neural networks comprises a first sub-neural network and a second sub-neural network, the first sub-neural network trained to extract a first group of features from a first video frame in the contiguous sequence of video frames, the second sub-neural network trained to extract a second group of features from a second video frame in the contiguous sequence of video frames, wherein the first video frame is different from the second video frame and the first group of features is different from the second group of features.
US 20220051025 A1 to first limitation of claim 9: [0105] For example, the computer device may calculate a loss function value corresponding to the video classification model according to the predicted classification result and the standard classification result, and trains the video classification model according to the loss function value. In a possible implementation, when the loss function value is less than a preset threshold, the training of the video classification model is stopped. The loss function value is used for representing an inconsistency degree between the predicted classification result and the standard classification result. If the loss function value is relatively small, it indicates that the predicted classification result is very close to the standard classification result, and performance of the video classification model is good; and if the loss function value is relatively large, it indicates that the predicted classification result is very different from the standard classification result, and performance of the video classification model is poor.
US 10911775 B1 to claim 10: claim 2. The method of claim 1 further comprising: determining a current action label for each motion feature contained in the state information for a first frame of the series of video frames, the series of video frames being in red green blue (RGB) format; predicting, by a decoder, future action labels for each motion feature in a second frame of the series of video frames subsequent to the first frame, based on the current pose, action label and the state information; predicting, by a decoder, future poses for each motion feature in the second frame based on the current poses and the state information; and refining the current action label, the future action labels, and the future poses based on a loss function.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHMED A NASHER whose telephone number is (571)272-1885. The examiner can normally be reached Mon - Fri 0800 - 1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Moyer can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AHMED A NASHER/               Examiner, Art Unit 2675                                                                                                                                                                                         
/ANDREW M MOYER/               Supervisory Patent Examiner, Art Unit 2675
Read full office action
Prosecution Timeline

Apr 27, 2023
Application Filed
Apr 09, 2026
Non-Final Rejection mailed — §101, §102, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/284,225
Patent 12626524
Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption
2y 7m to grant Granted May 12, 2026
18/598,176
Patent 12614390
VEHICLE LOCATION RECOGNITION SYSTEM AND VEHICLE LOCATION RECOGNITION METHOD
2y 1m to grant Granted Apr 28, 2026
17/396,379
Patent 12601840
TUNING PARAMETER DETERMINATION METHOD FOR TRACKING AN OBJECT, A GROUP DENSITY-BASED CLUSTERING METHOD, AN OBJECT TRACKING METHOD, AND AN OBJECT TRACKING APPARATUS USING A LIDAR SENSOR
4y 8m to grant Granted Apr 14, 2026
17/631,480
Patent 12586329
MODELING METHOD, DEVICE, AND SYSTEM FOR THREE-DIMENSIONAL HEAD MODEL, AND STORAGE MEDIUM
4y 1m to grant Granted Mar 24, 2026
18/618,872
Patent 12582373
GENERATING SYNTHETIC ELECTRON DENSITY IMAGES FROM MAGNETIC RESONANCE IMAGES
1y 12m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
81%
Grant Probability
99%
With Interview (+32.8%)
2y 8m (~0m remaining)
Median Time to Grant
Low
PTA Risk
Based on 101 resolved cases by this examiner. Grant probability derived from career allowance rate.