Prosecution Insights
Last updated: April 19, 2026
Application No. 17/970,907

METHOD AND APPARATUS WITH OBJECT RECOGNITION

Final Rejection (§103)
Filed: Oct 21, 2022
Examiner: CHEN, JOSHUA NMN
Art Unit: 2665
Tech Center: 2600 — Communications
Assignee: Samsung Electronics Co., Ltd.
OA Round: 2 (Final)

Grant Probability: 85% (Favorable)
OA Rounds: 3-4
Time to Grant: 2y 11m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 85%, above average (34 granted / 40 resolved; +23.0% vs TC avg)
Interview Lift: +26.1% among resolved cases with interview (strong)
Avg Prosecution: 2y 11m typical timeline; 20 applications currently pending
Total Applications: 60 across all art units (career history)

Statute-Specific Performance

§101: 18.7% (-21.3% vs TC avg)
§103: 52.0% (+12.0% vs TC avg)
§102: 15.7% (-24.3% vs TC avg)
§112: 12.0% (-28.0% vs TC avg)
Tech Center average is an estimate • Based on career data from 40 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 08/27/2025 was filed and the submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Amendment

Applicant’s arguments and claim amendments (pp. 1-5, filed 11/20/2025) with respect to amended claims 1, 14, and 20 (incorporating claims 6-7) have been fully considered but are not found convincing. The 35 U.S.C. 103 rejection of 08/22/2025 has NOT been withdrawn. Regarding the amended independent claims 1, 14, and 20, the examiner acknowledges that the original 102 rejection has been overcome. However, the examiner does not find the amendment and argument convincing.

First, applicant stated that Guo fails to teach “extract feature maps comprising respective local feature representations from an input image” and “determine a global feature representation corresponding to the input image by fusing the local feature representations.” The examiner restates the following reasoning, which was part of the rejection. Regarding “extract feature maps comprising respective local feature representations from an input image”: under the broadest reasonable interpretation (BRI), the feature maps are any reasonable representations of the image. A vector can be a feature map, and any intermediate output layer of the model can be a feature map as well. Regarding “determine a global feature representation corresponding to the input image by fusing the local feature representations”: since pooling is part of AlexNet and the researchers did not remove all pooling layers, local features are inherently combined to form a global feature because the model architecture is based on AlexNet.
Second, applicant points out that Gong fails to describe or suggest a second recognition model that estimates a classification. The examiner points to Fig. 1 of Gong. Within Fig. 1 of Gong, there are at least three different models, and at least two of them perform different kinds of classification. One neural network, surrounded by the L-shaped dashed-line box, goes through global average pooling to obtain positions of the target, which is a kind of classification (target exists at location x or does not exist at location x). A second neural network takes the output from the feature pyramid network (middle portion of Fig. 1) and feeds it into a neural network similar in structure to the L-shaped dashed box. As such, the second model performs a similar classification, but mainly only on the selected K regions. Finally, the whole of Fig. 1 is a third neural network that performs prediction. For the above reasons, the examiner finds one model that goes through global average pooling, one model that performs prediction only on the K filtered areas, and a third model that performs prediction based on both the first and second models. The amendment and argument are not convincing, and the 103 rejection is maintained.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 7-11, 13-14, 16, 18-22, and 27-28 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (An Attention Model Based on Spatial Transformers for Scene Recognition, hereinafter Guo) in view of Gong et al. (CN 110619369 B, hereinafter Gong).

Regarding claims 1, 14, and 20, Guo discloses Claim 1: A processor-implemented method comprising: Claim 20: An electronic apparatus, comprising: one or more processors comprising processing circuitry; and a memory comprising one or more storage media storing instructions that, when executed individually or collectively by the one or more processors, cause the electronic device to: extract feature maps comprising respective local feature representations from an input image (Fig. 1 N-D vectors after Local CNN, P. 3759 Para.
7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P.3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2)”; Under BRI, the feature maps are any reasonable representations of the image. A vector can be feature maps, and any intermediate output layers of the model can be a feature maps as well.); determine a global feature representation corresponding to the input image by fusing the local feature representations (Fig. 1 N-D vector after Global CNN, P. 3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P.3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. 
In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2)”; since pooling is part of AlexNet and the researchers did not remove all pooling layers, local features are inherently combined to form a global feature because the model architecture is based on AlexNet.); perform a recognition task on the input image based on the first recognition result corresponding to the local feature representations and the second recognition result corresponding to the global feature representation (Fig. 1 the summation and Softmax, P. 3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2).”).
However, Guo does not explicitly disclose use a first recognition model to estimate a first recognition result corresponding to the local feature representations; use a second recognition model to estimate a second recognition result corresponding to the global feature representation, wherein the second recognition model comprises a classification model configured to estimate a classification result from the global feature representation.

Gong teaches use a first recognition model to estimate a first recognition result corresponding to the local feature representations (Fig. 1 the middle pyramid feature network, the K regions, and inputting the K regions into the neural network, Claim 1: “3.3, selecting candidate areas with different sizes on the fusion characteristic graph with N scales, filtering the bounding box generated in the step 2, predicting and sequencing according to the size of an activation value of the bounding box to obtain a local area, wherein the bounding box is generated by taking the maximum connected area in the saliency map and setting a threshold to obtain a specific position of a target”); use a second recognition model to estimate a second recognition result corresponding to the global feature representation, wherein the second recognition model comprises a classification model configured to estimate a classification result from the global feature representation (Fig. 1, Claim 1: “4, aggregating the local features of the K local regions and global feature prediction obtained by the input image through the convolutional neural network to output the final identification category.”).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Guo with the local feature recognition, the multiple models for global feature prediction, and other aspects of Gong to effectively reduce the influence of background noises.
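The examiner's combined reading above treats intermediate convolutional outputs as local "feature maps" and pooling as the fusing step. A minimal NumPy sketch of that interpretation, using the 13x13x256 dimensions quoted from Guo (the mean-pooling choice and array layout are illustrative assumptions, not the actual architecture of either reference):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a truncated CNN's output: 256 channels at 13x13
# spatial resolution, matching the dimensions quoted from Guo.
feature_maps = rng.standard_normal((256, 13, 13))

# Each spatial position carries a 256-D local feature representation.
local_features = feature_maps.reshape(256, -1).T    # shape (169, 256)

# Averaging over spatial positions fuses the local representations
# into one global feature vector -- the "combining" the examiner
# reads into the retained pooling layers.
global_feature = local_features.mean(axis=0)        # shape (256,)
```

Under this reading, both the per-position vectors and the pooled vector qualify as "feature representations," which is the BRI point the examiner relies on.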
Regarding claims 2 and 22, dependent upon claims 1 and 20, Guo in view of Gong teaches every element of claims 1 and 20. Guo further discloses determine the global feature representation by fusing pooling results corresponding to the local feature representations (P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”; inherently, this step is part of AlexNet.).

Regarding claim 3, dependent upon claim 2, Guo in view of Gong teaches every element of claim 2. Gong further teaches the pooling comprises global average pooling (Claim 1: “step 2, the multi-channel feature map passes through a global average pooling layer to obtain a saliency map of the input image, and position information of a target is extracted”).

Regarding claim 7, dependent upon claim 1, Guo in view of Gong teaches every element of claim 1. Gong further teaches the first recognition model comprises: an object detection model configured to estimate a detection result from the local feature representations (Claim 1: “3.3, selecting candidate areas with different sizes on the fusion characteristic graph with N scales, filtering the bounding box generated in the step 2, predicting and sequencing according to the size of an activation value of the bounding box to obtain a local area, wherein the bounding box is generated by taking the maximum connected area in the saliency map and setting a threshold to obtain a specific position of a target”).

Regarding claim 8, dependent upon claim 7, Guo in view of Gong teaches every element of claim 7.
Gong further teaches the detection result comprises: one or more of bounding box information, objectness information, or class information (Claim 1: “3.3, selecting candidate areas with different sizes on the fusion characteristic graph with N scales, filtering the bounding box generated in the step 2, predicting and sequencing according to the size of an activation value of the bounding box to obtain a local area, wherein the bounding box is generated by taking the maximum connected area in the saliency map and setting a threshold to obtain a specific position of a target”), and wherein the classification result comprises: one or more of multi-class classification information, context classification information, or object count information (Claim 1: “4, aggregating the local features of the K local regions and global feature prediction obtained by the input image through the convolutional neural network to output the final identification category.”).

Regarding claim 9, dependent upon claim 7, Guo in view of Gong teaches every element of claim 7. Guo further discloses that training the first recognition model affects the training of the second recognition model, and the training of the second recognition model affects the training of the first recognition model (P. 3759 Para. 9: “For each input image, its final representation R can be obtained: R = W_g F_g + Σ_{i=1}^{n} W_i F_{l_i}, with W_g + Σ_{i=1}^{n} W_i = 1, where F_g represents global features, while F_{l_i} is the local features of Region_i. The equation (5) above is formulated to calculate the weighted sum of global features and local features of all attention regions. The output of the fusion layer is the input for the last softmax classification layer”; there is a set of weights that totals to 1 in the equation. As such, if the weight of any local feature or the global feature is changed during training, then all the other weights are affected.).
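The weight-coupling argument above rests on Guo's equation (5): because W_g + Σ W_i = 1, raising any one weight necessarily lowers the others. A toy sketch of that constraint (the explicit renormalization step and the feature values are assumptions for illustration; Guo learns the weights end to end):

```python
import numpy as np

def fuse(global_feat, local_feats, raw_weights):
    """Weighted sum per Guo's equation (5), with the weights
    normalized so that W_g + sum(W_i) = 1."""
    w = np.asarray(raw_weights, dtype=float)
    w = w / w.sum()                          # enforce the sum-to-one constraint
    feats = np.vstack([global_feat] + list(local_feats))
    return w @ feats, w

g = np.ones(4)                               # global feature F_g
locs = [np.zeros(4), 2 * np.ones(4)]         # local features F_l1, F_l2

# Equal raw weights normalize to 1/3 each.
R, w = fuse(g, locs, [1.0, 1.0, 1.0])

# Because the weights are tied by the constraint, increasing one raw
# weight shifts all the normalized weights -- the interdependence the
# examiner relies on for the joint-training argument.
R2, w2 = fuse(g, locs, [2.0, 1.0, 1.0])
```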
Regarding claim 10, dependent upon claim 9, Guo in view of Gong teaches every element of claim 9. Guo further discloses using an in-training feature extraction model, or a trained feature extraction model, to extract training feature maps comprising in-training local feature representations from a training input image (Fig. 1 N-D vectors after Local CNN, P. 3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2)”; under BRI, the first and second recognition results are any reasonable output of the recognition model. It is not limited to classification; it could include probabilities, embeddings (vectors), detected bounding boxes, segmentation masks, etc. This also includes any intermediate outputs of the layers of the model being feature representations as well); using an in-training feature fusion model, or a trained fusion model, to determine a training global feature representation corresponding to the training input image by fusing the training local feature representations (Fig. 1 N-D vector after Global CNN, P.
3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2)”; since pooling is part of AlexNet and the researchers did not remove all pooling layers, local features are inherently combined to form a global feature because the model architecture is based on AlexNet.); using an in-training first recognition model to estimate a training first recognition result corresponding to the training local feature representations (Fig. 1 N-D vectors after Local CNN, P. 3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way.
In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2)”; under BRI, the first and second recognition results are any reasonable output of the recognition model. It is not limited to classification; it could include probabilities, embeddings (vectors), detected bounding boxes, segmentation masks, etc. A vector can be a recognition result.); using an in-training second recognition model to estimate a training second recognition result corresponding to the training global feature representation (Fig. 1 N-D vectors after Local CNN, P. 3759 Para. 7: “Given the original input images and corresponding sets of attention regions, we employ the finetuned PlacesCNNs with the output layer removed as feature extractors, in which we use batch normalization [17] in convolutional layers and fully-connected layers”, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels.
Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2).”); and generating the first model and the second model by training the in-training first recognition model and the in-training second recognition model together based on the training first recognition result and the training second recognition result (Fig. 1, P. 3760 Para. 3: “PlacesCNN is pretrained on the Places205 dataset in [2]. The architecture of PlacesCNN is the same as the one used in the Caffe reference network (AlexNet)”, P. 3760 Para. 8: “All spatial transformers share the same localization network that is derived from Places20CNN in the following way. In order to preserve spatial information, the last classification layer, pooling layer and 2 fully-connected layers are removed. The output of the truncated CNN has 13x13 spatial resolution with 256 feature channels. Sequentially, an 128D fully-connected layer is added, and N fully-connected layers with 4D output are used to produce transformer parameters, where N is the number of transformers (in our experiments, N = 1 or 2).”).

Regarding claim 11, dependent upon claim 7, Guo in view of Gong teaches every element of claim 7. Guo further discloses determining a task result recognized by the recognition task by fusing the first recognition result and the second recognition result (Fig. 1, P. 3759 Para. 9: “For each input image, its final representation R can be obtained: R = W_g F_g + Σ_{i=1}^{n} W_i F_{l_i}, with W_g + Σ_{i=1}^{n} W_i = 1, where F_g represents global features, while F_{l_i} is the local features of Region_i. The equation (5) above is formulated to calculate the weighted sum of global features and local features of all attention regions. The output of the fusion layer is the input for the last softmax classification layer.”).
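The fusion-then-softmax pipeline the examiner maps to claim 11 can be sketched as follows. The per-class score vectors and the fixed 50/50 averaging rule are hypothetical placeholders; Guo's fusion layer uses learned weights rather than a fixed average:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-class scores from the two recognition paths:
# one over the local representations, one over the global one.
first_result = np.array([2.0, 0.5, 0.1])     # local-path scores
second_result = np.array([1.5, 1.0, 0.2])    # global-path scores

# Fuse the two results (here a simple average) and classify with a
# final softmax, mirroring the fusion layer -> softmax stage.
fused = 0.5 * first_result + 0.5 * second_result
task_result = softmax(fused)
predicted_class = int(np.argmax(task_result))
```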
Regarding claims 13 and 21, dependent upon claims 1 and 20, Guo in view of Gong teaches every element of claims 1 and 20. Guo further discloses capturing the input image using a camera (P. 3759 Para. 10: “We use multiple spatial transformers in parallel to perform scene recognition and evaluate the model on a subset of the Places205 dataset (Places20) for illustration. In our experiments, we only use image-level labels to train models”; the dataset contains images captured by camera.).

Regarding claim 16, dependent upon claim 14, Guo in view of Gong teaches every element of claim 14. Guo further discloses the first recognition model and the second recognition model are trained as an integrated model such that each affects the training of the other (P. 3759 Para. 9: “For each input image, its final representation R can be obtained: R = W_g F_g + Σ_{i=1}^{n} W_i F_{l_i}, with W_g + Σ_{i=1}^{n} W_i = 1, where F_g represents global features, while F_{l_i} is the local features of Region_i. The equation (5) above is formulated to calculate the weighted sum of global features and local features of all attention regions. The output of the fusion layer is the input for the last softmax classification layer”; there is a set of weights that totals to 1 in the equation. As such, if the weight of any local feature or the global feature is changed during training, then all the other weights are affected.).

Regarding claim 18, dependent upon claim 14, Guo in view of Gong teaches every element of claim 14. Guo further discloses training the first recognition model affects training the second recognition model and training the second recognition model affects training the first recognition model (P. 3759 Para. 9: “For each input image, its final representation R can be obtained: R = W_g F_g + Σ_{i=1}^{n} W_i F_{l_i}, with W_g + Σ_{i=1}^{n} W_i = 1, where F_g represents global features, while F_{l_i} is the local features of Region_i.
The equation (5) above is formulated to calculate the weighted sum of global features and local features of all attention regions. The output of the fusion layer is the input for the last softmax classification layer”; there is a set of weights that totals to 1 in the equation. As such, if the weight of any local feature or the global feature is changed during training, then all the other weights are affected.).

Regarding claim 19, dependent upon claim 1, Guo in view of Gong teaches every element of claim 1. Gong further teaches a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method of claim 1 (P. 9 Para. 12: “in addition, the experimental hardware environment: ubuntu 16.04, Telsa-P100 video card, video memory 12G, core (TM) i7 processor, main frequency 3.4G, memory 16G”).

Regarding claim 27, dependent upon claim 20, Guo in view of Gong teaches every element of claim 20. Gong further teaches the first recognition model comprises: an object detection model configured to estimate a detection result corresponding to each of the local feature representations (Claim 1: “3.3, selecting candidate areas with different sizes on the fusion characteristic graph with N scales, filtering the bounding box generated in the step 2, predicting and sequencing according to the size of an activation value of the bounding box to obtain a local area, wherein the bounding box is generated by taking the maximum connected area in the saliency map and setting a threshold to obtain a specific position of a target”), and wherein the second recognition model comprises: a classification model configured to estimate a classification result corresponding to the global feature representation (Claim 1: “4, aggregating the local features of the K local regions and global feature prediction obtained by the input image through the convolutional neural network to output the final identification
category.”).

Regarding claim 28, dependent upon claim 27, Guo in view of Gong teaches every element of claim 27. Gong further teaches determine a task result of the recognition task by fusing the first recognition result and the second recognition result (Fig. 1, P. 3759 Para. 9: “For each input image, its final representation R can be obtained: R = W_g F_g + Σ_{i=1}^{n} W_i F_{l_i}, with W_g + Σ_{i=1}^{n} W_i = 1, where F_g represents global features, while F_{l_i} is the local features of Region_i. The equation (5) above is formulated to calculate the weighted sum of global features and local features of all attention regions. The output of the fusion layer is the input for the last softmax classification layer.”).

Claims 4-5 and 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (An Attention Model Based on Spatial Transformers for Scene Recognition, hereinafter Guo) in view of Gong et al. (CN 110619369 B, hereinafter Gong) and Dosovitski et al. (AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, hereinafter Dosovitski).

Regarding claim 4, dependent upon claim 1, Guo in view of Gong teaches every element of claim 1. However, Guo in view of Gong does not explicitly teach performing an attention mechanism using query data pre-trained in association with the recognition task. Dosovitski teaches performing an attention mechanism using query data pre-trained in association with the recognition task (Fig. 1 Embedded Patches, P. 14 Para. 4: “We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%).”).
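The quoted masked patch prediction objective follows a BERT-style 80/10/10 corruption rule over half the patch embeddings. A rough sketch of just the corruption step, assuming a zero-vector stand-in for the learnable [mask] embedding and arbitrary patch dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 16, 8
patches = rng.standard_normal((num_patches, dim))
mask_embedding = np.zeros(dim)   # stand-in for the learnable [mask] token

corrupted = patches.copy()
# Corrupt 50% of the patch embeddings, as in the quoted passage.
chosen = rng.choice(num_patches, size=num_patches // 2, replace=False)
for idx in chosen:
    r = rng.random()
    if r < 0.8:                                   # 80%: replace with [mask]
        corrupted[idx] = mask_embedding
    elif r < 0.9:                                 # 10%: random other patch
        corrupted[idx] = patches[rng.integers(num_patches)]
    # remaining 10%: keep the embedding unchanged

# At most half of the patches were altered.
changed = int((corrupted != patches).any(axis=1).sum())
```

The model is then trained to predict properties of the original patches at the corrupted positions; only the corruption is sketched here.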
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Guo in view of Gong with the learnable embedding patches that lead to a query, and other aspects of Dosovitski, to effectively increase the accuracy of computer vision tasks in general.

Regarding claim 5, dependent upon claim 4, Guo in view of Gong and Dosovitski teaches every element of claim 4. Dosovitski further teaches determining key data and value data corresponding to the local feature representations; determining a weighted sum of the value data based on similarity between the key data and the query data (P. 13 Para. 1: “Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence z ∈ R^{N×D}, we compute a weighted sum over all values v in the sequence. The attention weights A_ij are based on the pairwise similarity between two elements of the sequence and their respective query q_i and key k_j representations. [q, k, v] = z U_qkv, U_qkv ∈ R^{D×3D_h} (5); A = softmax(q kᵀ / √D_h) (6); SA(z) = A v (7). Multihead self-attention (MSA) is an extension of SA in which we run k self-attention operations, called “heads”, in parallel, and project their concatenated outputs. To keep compute and number of parameters constant when changing k, D_h (Eq. 5) is typically set to D/k. MSA(z) = [SA_1(z); SA_2(z); …; SA_k(z)] U_msa, U_msa ∈ R^{k·D_h×D} (8)”); and determining the global feature representation based on the weighted sum (P. 3 Figure 1: “Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.
The illustration of the Transformer encoder was inspired by Vaswani et al. (2017).”).

Regarding claim 23, dependent upon claim 20, Guo in view of Gong teaches every element of claim 20. However, Guo in view of Gong does not explicitly teach determine the global feature representation by performing an attention mechanism using query data pre-trained in response to the recognition task. Dosovitski teaches determine the global feature representation by performing an attention mechanism using query data pre-trained in response to the recognition task (Fig. 1 Embedded Patches, P. 14 Para. 4: “We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%).”).

Regarding claim 24, dependent upon claim 23, Guo in view of Gong and Dosovitski teaches every element of claim 23. Dosovitski further teaches the attention mechanism comprises a vision transformer model that performs the fusing based on similarity of keys and values of the query data (P. 13 Para. 1: “Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence z ∈ R^{N×D}, we compute a weighted sum over all values v in the sequence. The attention weights A_ij are based on the pairwise similarity between two elements of the sequence and their respective query q_i and key k_j representations. [q, k, v] = z U_qkv, U_qkv ∈ R^{D×3D_h} (5); A = softmax(q kᵀ / √D_h) (6); SA(z) = A v (7). Multihead self-attention (MSA) is an extension of SA in which we run k self-attention operations, called “heads”, in parallel, and project their concatenated outputs. To keep compute and number of parameters constant when changing k, D_h (Eq. 5) is typically set to D/k.
M S A z =   S A 1 z ;   S A 2 z ; …   ;   S A k z U m s a       U m s a ∈ R k ∙ D h   ×   D (8)”). Claims 12, 15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (An Attention Model Based on Spatial Transformers for Scene Recognition, hereinafter Guo) in view of Gong et al. (CN 110619369 B, herein after Gong) and Jetley et al. (LEARN TO PAY ATTENTION, hereinafter Jetley). Regarding claim 12, dependent upon claim 1, Guo in view of Gong teaches every element regarding claims 1. However, Guo in view of Gong does not explicitly teach one of plural task candidates, the task candidates having respectively associated pre-trained query data items, and wherein the determining of the global feature representation comprises: selecting, from among the pre-trained query data items, a query data item associated with the recognition task; and determining the global feature representation by performing an attention mechanism based on the selected query data item. Jetley teaches one of plural task candidates, the task candidates having respectively associated pre-trained query data items (Fig. 10, P. 13 Para. 2: “the global feature vector is obtained by processing the query image specific to the category being considered, shown in column 1.”), and wherein the determining of the global feature representation comprises: selecting, from among the pre-trained query data items, a query data item associated with the recognition task (Fig. 10 P. 13 Para. 2: “the global feature vector is obtained by processing the query image specific to the category being considered, shown in column 1.”); and determining the global feature representation by performing an attention mechanism based on the selected query data item (Fig. 10, P. 13 Para. 2: “The new attention patterns are displayed in columns 4 and 7 respectively. 
The changes in the attention values at different spatial locations as a proportion of the original attention pattern values are shown in columns 5 and 8 respectively.”). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Guo in view of Gong with the query-based determination of a global feature and other aspects of Jetley to effectively increase the efficiency of the model.

Regarding claim 15, dependent upon claim 14, Guo in view of Gong teaches every element of claim 14. However, Guo in view of Gong does not explicitly teach determining a training loss based on the first recognition result and the second recognition result, wherein the training is based on the training loss. Jetley teaches determining a training loss based on the first recognition result and the second recognition result, wherein the training is based on the training loss (P. 4 Para. 5: “All free network parameters are learned in end-to-end training under a cross-entropy loss function.”).

Regarding claim 17, dependent upon claim 14, Guo in view of Gong teaches every element of claim 14. However, Guo in view of Gong does not explicitly teach wherein the feature fusion model is configured to: determine the global feature representation by performing an attention mechanism based on query data corresponding to a current task candidate among the task candidates, and wherein the determining of the training loss comprises: determining the training loss by applying a classification result of a classification model corresponding to the current task candidate among the task candidates as the second recognition result. Jetley teaches wherein the feature fusion model is configured to: determine the global feature representation by performing an attention mechanism based on query data corresponding to a current task candidate among the task candidates (Fig. 10, P. 2 Para. 2: “We experiment with applying the proposed attention mechanism to the popular CNN architectures of VGGNet (Simonyan & Zisserman, 2014) and ResNet (He et al., 2015), and capturing coarse-to-fine attention maps at multiple levels. We observe that the proposed mechanism can bootstrap baseline CNN architectures for the task of image classification: for example, adding attention to the VGG model offers an accuracy gain of 7% on CIFAR-100. Our use of attention-weighted representations leads to improved fine-grained recognition and superior generalisation on 6 benchmark datasets for domain-shifted classification. As observed on models trained for fine-grained bird recognition, attention aware models offer limited resistance to adversarial fooling at low and moderate L∞-noise norms. … In [Section] 5, we present sample results which suggest that these improvements may owe to the method’s tendency to highlight the object of interest while suppressing background clutter.”, P. 13 Para. 2: “the global feature vector is obtained by processing the query image specific to the category being considered, shown in column 1.”), and wherein the determining of the training loss comprises: determining the training loss by applying a classification result of a classification model corresponding to the current task candidate among the task candidates as the second recognition result (P. 4 Para. 5: “All free network parameters are learned in end-to-end training under a cross-entropy loss function.”).

Claim 26 is rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (An Attention Model Based on Spatial Transformers for Scene Recognition, hereinafter Guo) in view of Gong et al. (CN 110619369 B, hereinafter Gong) and Liu et al. (US 10,891,514 B2, hereinafter Liu).

Regarding claim 26, dependent upon claim 20, Guo in view of Gong teaches every element of claim 20.
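The fusing mechanism that the rejection maps onto Jetley and Dosovitskiy — local feature vectors combined into one global representation by attention weights softmax(qk^T/√d), with the query data item selected according to the recognition task — can be sketched in a few lines. This is a minimal illustration only, not code from the application or the cited references; the dimensions, task names, and query bank below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(local_feats, query):
    """Fuse N local feature vectors (N x D) into one global vector (D,)
    using a single pre-trained query vector (D,); keys = values = local features."""
    d = query.shape[-1]
    weights = softmax(local_feats @ query / np.sqrt(d))  # (N,) attention weights
    return weights @ local_feats                         # (D,) weighted sum

# hypothetical bank of pre-trained queries, one per task candidate
query_bank = {"scene": np.random.randn(64), "object": np.random.randn(64)}

def global_feature(local_feats, task):
    # select the query data item associated with the recognition task, then fuse
    return attention_pool(local_feats, query_bank[task])

feats = np.random.randn(10, 64)   # 10 local feature vectors from feature maps
g = global_feature(feats, "scene")
assert g.shape == (64,)
```

Swapping the task string swaps the query, so the same local features yield a different, task-conditioned global representation — the behavior the amended claims recite.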
However, Guo in view of Gong does not explicitly teach that the processor is further configured to select between the first recognition model and the second recognition model based on the recognition task. Liu teaches that the processor is further configured to select between the first recognition model and the second recognition model based on the recognition task (Abstract: “A scheduler of the image recognition pipeline optimizes image recognition processing by selecting at least: a subset of the image recognition models for image recognition processing and a device configuration for execution of the subset of image recognition models, in order to return image recognition results within a threshold time period that satisfies application-specific execution.”). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Guo in view of Gong with a processor configured to select image recognition models, as taught by Liu, to effectively increase the robustness of the image analysis program.

Relevant Prior Art Directed to the State of the Art

Minaee et al. (Image Segmentation Using Deep Learning: A Survey, hereinafter Minaee) is prior art not applied in the rejection(s) above. Minaee surveys the recent literature in image segmentation, discussing more than a hundred deep-learning-based segmentation methods proposed through 2019, including a comprehensive review of and insights on different aspects of these methods (the training data, the choice of network architectures, loss functions, training strategies, and their key contributions), as well as a comparative summary of the performance of the reviewed methods and a discussion of several challenges and potential future directions for deep-learning-based image segmentation models.

Kilickaya et al. (US 12,249,138 B2, hereinafter Kilickaya) is prior art not applied in the rejection(s) above.
Kilickaya discloses a method for classifying a human-object interaction that includes identifying a human-object interaction in the input. Context features of the input are identified, and each identified context feature is compared with the identified human-object interaction. An importance of the identified context feature is determined for the identified human-object interaction, and the context feature is fused with the identified human-object interaction when the importance is greater than a threshold.

Chen et al. (Pyramid of Spatial Relatons for Scene-Level Land Use Classification, hereinafter Chen) is prior art not applied in the rejection(s) above. Chen discloses a pyramid-of-spatial-relatons (PSR) model to capture both absolute and relative spatial relationships of local features.

Wang et al. (CN 113011386 A, hereinafter Wang) is prior art not applied in the rejection(s) above. Wang discloses an expression recognition method and system based on an equant feature map.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSHUA CHEN whose telephone number is (703) 756-5394.
The examiner can normally be reached M-Th 9:30 am-4:30 pm ET and F 9:30 am-2:30 pm ET. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, STEPHEN R KOZIOL, can be reached at (408) 918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J. C./
Examiner, Art Unit 2665

/Stephen R Koziol/
Supervisory Patent Examiner, Art Unit 2665
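Liu's scheduler, cited in the claim 26 rejection above, selects a subset of recognition models so that results return within an application-specific time budget. A minimal sketch of that selection idea follows; the model registry, latency figures, and accuracy figures are invented for illustration and do not come from Liu or the application.

```python
# hypothetical registry: (model name, expected latency in ms, accuracy)
MODELS = [
    ("fast_cnn", 12, 0.86),
    ("large_transformer", 85, 0.93),
]

def select_model(task_deadline_ms):
    """Pick the most accurate model whose expected latency fits the
    application-specific deadline; fall back to the fastest model otherwise."""
    feasible = [m for m in MODELS if m[1] <= task_deadline_ms]
    if not feasible:
        return min(MODELS, key=lambda m: m[1])[0]  # nothing fits: fastest model
    return max(feasible, key=lambda m: m[2])[0]    # best accuracy within budget

assert select_model(100) == "large_transformer"  # generous deadline: accuracy wins
assert select_model(20) == "fast_cnn"            # tight deadline: only fast model fits
```

Mapping this onto the claim language, the "recognition task" supplies the deadline (or other constraint), and the scheduler's choice plays the role of selecting between the first and second recognition models.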

Prosecution Timeline

Oct 21, 2022: Application Filed
Aug 20, 2025: Non-Final Rejection (§103)
Nov 20, 2025: Response Filed
Dec 08, 2025: Applicant Interview (Telephonic)
Dec 08, 2025: Examiner Interview Summary
Feb 27, 2026: Final Rejection (§103) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602747: METHOD AND APPARATUS FOR DENOISING A LOW-LIGHT IMAGE (2y 5m to grant; granted Apr 14, 2026)
Patent 12592090: COMPENSATION OF INTENSITY VARIANCES IN IMAGES USED FOR COLONY ENUMERATION (2y 5m to grant; granted Mar 31, 2026)
Patent 12579614: IMAGING DEVICE (2y 5m to grant; granted Mar 17, 2026)
Patent 12579678: INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND COMPUTER PROGRAM PRODUCT (2y 5m to grant; granted Mar 17, 2026)
Patent 12573065: Vision Sensing Device and Method (2y 5m to grant; granted Mar 10, 2026)
Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 85% (99% with interview, +26.1% lift)
Median Time to Grant: 2y 11m
PTA Risk: Moderate
Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
