Prosecution Insights
Last updated: April 18, 2026
Application No. 18/635,931

ARTIFICIAL INTELLIGENCE DEVICE FOR ATTENTION OVER DETECTION BASED OBJECT SELECTION AND CONTROL METHOD THEREOF

Non-Final OA §103
Filed: Apr 15, 2024
Examiner: THOMAS, SOUMYA
Art Unit: 2664
Tech Center: 2600 — Communications
Assignee: LG Electronics Inc.
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (2 granted / 2 resolved); above average, +38.0% vs TC avg
Interview Lift: +0.0% (minimal lift; based on resolved cases with interview)
Typical Timeline: 2y 9m average prosecution; 17 applications currently pending
Career History: 19 total applications across all art units

Statute-Specific Performance

§101: 6.8% (-33.2% vs TC avg)
§103: 64.4% (+24.4% vs TC avg)
§102: 13.6% (-26.4% vs TC avg)
§112: 11.9% (-28.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 2 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) filed on April 15, 2024 has been considered by the examiner.

Specification

The disclosure is objected to because of the following informalities: In paragraph [00200], ‘significate’ should read ‘significant’. In paragraph [00231], ‘Fig. 14’ should read ‘Fig. 15’. Appropriate correction is required.

Claim Objections

Claims 18-19 are objected to because of the following informalities: Claims 18 and 19 are directed to an ‘AI device’ and are dependent on Claim 1. However, Claim 1 claims a method of controlling an AI device, not the device itself. The examiner suggests rewriting Claims 18 and 19 to depend on Claim 11, which is directed towards an ‘AI device’. Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-5, 8-15 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong et al. (US Pub No 20230106716), hereinafter Xiong, in view of Heisler (US Pub No 20220292685), hereinafter Heisler, and further in view of Yu et al. (Z. Yu, et al., "Deep Modular Co-Attention Networks for Visual Question Answering," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274-6283), hereinafter Yu.

As to Claim 1, Xiong teaches a method for controlling an artificial intelligence (AI) device (see Fig. 5, computer system, and see paragraph [0012], “Embodiments of this disclosure provide improved VQA performance by improving the alignment between image and natural-language input modalities”, where natural language processing is a subset of AI), the method comprising: obtaining an input query, an input image (see paragraph [0015], “At step 110, the method of FIG. 1 includes accessing an image and a natural-language question regarding the image”), and at least one topic label for one or more words in the input query (see paragraph [0028], “In particular embodiments, such as the example of FIG. 2, a granularity layer for a question may include a noun phrase level (e.g., layer 220) that is constructed by filtering the result from a constituency parser for the noun phrase level, for example by discarding the determiners (e.g., ‘a’, ‘the’) and filtering out the words expressing positional relations (e.g., ‘left’, ‘right’) to save computational resources”, where the ‘nouns’ are the topic label); generating, via the processor (see Fig. 5, processor 502), at least one word embedding for the at least one topic label from the input query (see paragraph [0028], “Then the phrases are split into word tokens and, in particular embodiments, their GloVe features are processed by the MLP to obtain the token features as Tnp”, where GloVe is a tool known in the art to produce word embeddings), the at least one word embedding being a multi-dimensional vector (see paragraph [0018], “For example, each text feature may be a text token feature vector, with each text token feature vector corresponding to one of the words in the natural-language question”); generating scaled dot product attention matrices based on the at least one word embedding for the at least one topic label from the input query (see paragraph [0033], “For each token, a query vector (Q), a key vector (K), and a value vector (V) are created, by multiplying the embeddings of the three matrices that are trained during the training process… Each of the sets of vectors is then input into the scaled dot-product attention, and pointwise multiplied with the lead graph (GGA) from the graph-merging module”); and executing, via the processor, a function based on the final output (see paragraph [0039], “At step 160, the method of FIG. 1 includes determining an answer to the question based on the first output and the second output”).

Xiong fails to teach receiving bounding boxes for objects detected in the input image and object labels corresponding to the bounding boxes. Xiong further fails to teach generating a plurality of word embeddings for the object labels corresponding to the bounding boxes. Xiong teaches extracting features from the input image, embedding these image features (see paragraph [0019]), and then using these embedded features to calculate a scaled dot product attention matrix (see paragraph [0033]).
However, in an analogous art, Heisler teaches receiving bounding boxes for objects detected in the input image (see paragraph [0057], “Additionally or alternatively, a bounding box representing the location of each instance of a detected object is defined”) and object labels corresponding to the bounding boxes (see paragraph [0057], “The result is that the segmented input image is associated with a set of one or more object class labels and also associated with a set of pixel groups and/or bounding boxes for each respective object class label”), and generating a plurality of word embeddings for the object labels corresponding to the bounding boxes (see paragraph [0089], “At 308, an embedding is generated for the object label associated with each object image. The embedding may be generated any suitable embedding technique (e.g., BERT, Word2Vec, GloVe, or fastText)”), the plurality of word embeddings being multi-dimensional vectors (see paragraph [0053], “For example, in FIG. 1, the embedding 106 is represented as a 1×6 vector (i.e., having a dimensionality of 6)”).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the image embeddings taught by Xiong with the word embeddings from bounding box labels taught by Heisler. The motivation for doing so would be to create a dictionary of embeddings for objects, which can be used to identify objects in later images. Heisler teaches in paragraph [0060], “The collection of embeddings may be stored as a dictionary of embeddings, in which each entry in the dictionary is a text string (e.g., a word or phrase) that is associated with a corresponding embedding generated by the pre-trained embedder. An embedding may then be generated for each object class label associated with the segmented input image by looking up each object class label in the dictionary of embeddings to identify and select the corresponding embedding that represents each respective object class label.”

Xiong fails to explicitly teach outputting attention maps corresponding to scaled dot product attention matrices. However, in an analogous art, Yu teaches outputting attention maps corresponding to scaled dot product attention matrices (see page 6280, Section 5.3, Figure 7, “Visualizations of the learned attention maps (softmax(qK/√d)) of the attention units from typical layers”, where ‘(softmax(qK/√d))’ is the scaled dot product), combining the output attention maps to generate a final attention map (see page 6280, Section 5.3, Figure 7, where there are multiple attention maps on the image) corresponding to the at least one topic label from the input query (see page 6280, Section 5.3, Figure 7, “SA(Y)-l, SA(X)-l and GA(X,Y)-l denote the question self-attention, image self-attention, and question guided-attention from the l-th layer, respectively. Q, A, P denote the question, answer and prediction respectively”, and see how sheep are highlighted in the attention map, which corresponds to the query ‘How many sheep can we see in this picture’); and executing a function based on the final attention map (see page 6280, Section 5.3, Figure 7, where an answer ‘3’ is output for the question, ‘How many sheep can we see in this picture’). Yu is combinable with Xiong and Heisler because all three are from the analogous field of image analysis.
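Heisler’s dictionary-of-embeddings lookup (paragraph [0060], quoted above) can be sketched as follows; the detector output, class labels, and vectors here are hypothetical stand-ins rather than anything from the cited references.

import numpy as np

# Hypothetical precomputed dictionary: object class label -> embedding.
embedding_dictionary = {
    "sheep": np.array([0.21, -0.40, 0.13, 0.88, 0.05, -0.12]),
    "grass": np.array([-0.33, 0.10, 0.72, -0.05, 0.40, 0.19]),
}

# Hypothetical detector output: one class label per bounding box.
detections = [
    {"label": "sheep", "box": (34, 50, 120, 160)},
    {"label": "sheep", "box": (140, 48, 210, 150)},
    {"label": "grass", "box": (0, 120, 320, 240)},
]

# One word embedding per bounding-box label, obtained by dictionary lookup.
object_embeddings = [embedding_dictionary[d["label"]] for d in detections]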
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the attention maps taught by Yu with the teachings of Xiong and Heisler. The motivation for doing so would be to improve the performance of the model. Yu teaches on page 6275, Section 1, “Furthermore, we find that modeling self attention for image regions can greatly improve the object counting performance, which is challenging for VQA.” Thus, it would have been obvious to combine the attention maps taught by Yu with the teachings of Xiong and Heisler in order to obtain the invention as claimed in Claim 1.

As to Claim 2, Xiong in view of Heisler fails to explicitly teach wherein the function includes at least one of identifying an object within the input image, moving an arm of the AI device to grip the object, moving the AI device to avoid a collision with the object, moving the AI device toward the object, or capturing a picture of the object. However, Yu teaches the function of identifying an object within the input image (see page 6280, Section 5.3, Figure 7, where bounding boxes are used to identify sheep in the input image). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the object identifying function taught by Yu with the teachings of Xiong and Heisler in order to obtain the invention as claimed in Claim 2. The motivation for doing so would be to identify key objects in response to user queries. Yu teaches on page 6274, “Therefore, designing an effective ‘co-attention’ model to associate key words in questions with key objects in images is central to VQA performance.” Thus, it would have been obvious to combine the object identification taught by Yu with the teachings of Xiong and Heisler in order to obtain the invention as claimed in Claim 2.

As to Claim 3, Xiong in view of Heisler and Yu teaches multiplying, via a first linear layer, the at least one word embedding and a first weight matrix to generate a query matrix; multiplying, via a second linear layer, at least one of the plurality of word embeddings and a second weight matrix to generate a key matrix (see Xiong, paragraph [0033], “After linear projection, learnable positional encoding is used to include both relative and absolute position information. For each token, a query vector (Q), a key vector (K), and a value vector (V) are created, by multiplying the embeddings of the three matrices that are trained during the training process… Instead of utilizing a single attention module, embodiments also linearly project Q, K, and V h times with different, learned linear projections”, where the learned linear projections are weights); transposing the key matrix to generate a transposed key matrix (see Xiong, paragraph [0033], scaled dot-product attention formula, reproduced below):

Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V

and multiplying the query matrix and the transposed key matrix to generate a resulting matrix (see Xiong, formula shown above, where the query matrix Q and the transposed key matrix Kᵀ are multiplied).

As to Claim 4, Xiong in view of Heisler and Yu teaches dividing the resulting matrix by a dimension based on the key matrix to generate a scaled matrix (see Xiong, formula shown in paragraph [0033], where the resulting matrix QKᵀ is divided by √dₖ, based on the key dimension dₖ).
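The operations recited in Claims 3 through 6 track the standard scaled dot-product attention computation reproduced above. A minimal NumPy sketch follows; the random matrices stand in for the learned linear projections and embeddings, and all shapes are illustrative rather than taken from any cited reference.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 6, 4
query_embeddings = rng.standard_normal((2, d_model))  # topic-label embeddings
key_embeddings = rng.standard_normal((3, d_model))    # object-label embeddings

W_q = rng.standard_normal((d_model, d_k))  # first linear layer weights (Claim 3)
W_k = rng.standard_normal((d_model, d_k))  # second linear layer weights (Claim 3)
W_v = rng.standard_normal((d_model, d_k))  # value projection weights

Q = query_embeddings @ W_q   # query matrix
K = key_embeddings @ W_k     # key matrix
V = key_embeddings @ W_v     # value matrix

scores = Q @ K.T                                # query times transposed key (Claim 3)
scaled = scores / np.sqrt(d_k)                  # divide by key dimension (Claim 4)
weights = np.exp(scaled)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax normalization (Claim 5)
attended = weights @ V                          # multiply by the value matrix (Claim 6)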
As to Claim 5, Xiong in view of Heisler and Yu teaches applying softmax to the scaled matrix to generate a normalized matrix (see Xiong, formula shown in paragraph [0033], where a softmax function is applied to the scaled matrix QKᵀ/√dₖ).

As to Claim 8, Xiong in view of Heisler and Yu teaches wherein the final attention map includes a uniform box overlapping with an object in the input image that corresponds to the at least one topic label from the input query (see Yu, page 6280, Section 5.3, Figure 7, “SA(Y)-l, SA(X)-l and GA(X,Y)-l denote the question self-attention, image self-attention, and question guided-attention from the l-th layer, respectively. Q, A, P denote the question, answer and prediction respectively”, and see how sheep are highlighted in the attention map, which corresponds to the word ‘sheep’ from the user query, ‘How many sheep can we see in this picture’).

As to Claim 9, Xiong in view of Heisler and Yu teaches drawing boxes around the objects detected in the input image to generate the bounding boxes (see Heisler, paragraph [0057], “The group of pixels belonging to each instance of a detected object are identified and associated with the respective object class label. Additionally, or alternatively, a bounding box representing the location of each instance of a detected object is defined (e.g., defined by the x and y coordinates of the corners of the bounding box, in the frame of reference of the segmented input image) and associated with the respective object class label.”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the bounding boxes taught by Heisler with the teachings of Xiong and Yu. The motivation for doing so would be to use bounding boxes and their labels to create a dictionary, which can be used to identify objects in later images, as taught by Heisler in paragraph [0060]. Thus, it would have been obvious to combine the teachings of Xiong, Heisler, and Yu in order to obtain the invention as claimed in Claim 9.

As to Claim 10, Xiong in view of Heisler and Yu teaches determining a characteristic of an object in the input image that corresponds to the at least one topic label from the input query based on the final attention map (see Yu, page 6280, Section 5.3, Figure 7, “SA(Y)-l, SA(X)-l and GA(X,Y)-l denote the question self-attention, image self-attention, and question guided-attention from the l-th layer, respectively. Q, A, P denote the question, answer and prediction respectively”, and see how sheep are highlighted in the attention map, which corresponds to the query ‘How many sheep can we see in this picture’, and see the answer ‘3’ output).

As to Claim 11, Xiong teaches an artificial intelligence (AI) device (see Fig. 5, computer system 500, and see paragraph [0012], “Embodiments of this disclosure provide improved VQA performance by improving the alignment between image and natural-language input modalities”, where natural language processing is a subset of AI), the AI device comprising: a memory (see Fig. 5, memory 504) and a controller (see Fig. 5, processor 502) configured to: obtain an input query, an input image (see paragraph [0015], “At step 110, the method of FIG. 1 includes accessing an image and a natural-language question regarding the image”), and at least one topic label for one or more words in the input query (see paragraph [0028], “In particular embodiments, such as the example of FIG. 2, a granularity layer for a question may include a noun phrase level (e.g., layer 220) that is constructed by filtering the result from a constituency parser for the noun phrase level, for example by discarding the determiners (e.g., ‘a’, ‘the’) and filtering out the words expressing positional relations (e.g., ‘left’, ‘right’) to save computational resources”, where the ‘nouns’ are the topic label); generate at least one word embedding for the at least one topic label from the input query (see paragraph [0028], “Then the phrases are split into word tokens and, in particular embodiments, their GloVe features are processed by the MLP to obtain the token features as Tnp”, where GloVe is a tool known in the art to produce word embeddings), the at least one word embedding being a multi-dimensional vector (see paragraph [0018], “For example, each text feature may be a text token feature vector, with each text token feature vector corresponding to one of the words in the natural-language question”); generate scaled dot product attention matrices based on the at least one word embedding for the at least one topic label from the input query (see paragraph [0033], “For each token, a query vector (Q), a key vector (K), and a value vector (V) are created, by multiplying the embeddings of the three matrices that are trained during the training process… Each of the sets of vectors is then input into the scaled dot-product attention, and pointwise multiplied with the lead graph (GGA) from the graph-merging module”); and execute a function based on the final attention map (see paragraph [0039], “At step 160, the method of FIG. 1 includes determining an answer to the question based on the first output and the second output”).

Xiong fails to teach receiving bounding boxes for objects detected in the input image and object labels corresponding to the bounding boxes. Xiong further fails to teach generating a plurality of word embeddings for the object labels corresponding to the bounding boxes. Xiong teaches extracting features from the input image, embedding these image features (see paragraph [0019]), and then using these embedded features to calculate a scaled dot product attention matrix (see paragraph [0033]).

However, in an analogous art, Heisler teaches receiving bounding boxes for objects detected in the input image (see paragraph [0057], “Additionally or alternatively, a bounding box representing the location of each instance of a detected object is defined”) and object labels corresponding to the bounding boxes (see paragraph [0057], “The result is that the segmented input image is associated with a set of one or more object class labels and also associated with a set of pixel groups and/or bounding boxes for each respective object class label”), and generating a plurality of word embeddings for the object labels corresponding to the bounding boxes (see paragraph [0089], “At 308, an embedding is generated for the object label associated with each object image. The embedding may be generated any suitable embedding technique (e.g., BERT, Word2Vec, GloVe, or fastText)”), the plurality of word embeddings being multi-dimensional vectors (see paragraph [0053], “For example, in FIG. 1, the embedding 106 is represented as a 1×6 vector (i.e., having a dimensionality of 6)”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the image embeddings taught by Xiong with the word embeddings from bounding box labels taught by Heisler. The motivation for doing so would be to create a dictionary of embeddings for objects, which can be used to identify objects in later images. Heisler teaches in paragraph [0060], “The collection of embeddings may be stored as a dictionary of embeddings, in which each entry in the dictionary is a text string (e.g., a word or phrase) that is associated with a corresponding embedding generated by the pre-trained embedder. An embedding may then be generated for each object class label associated with the segmented input image by looking up each object class label in the dictionary of embeddings to identify and select the corresponding embedding that represents each respective object class label.”

Xiong fails to explicitly teach outputting attention maps corresponding to scaled dot product attention matrices. However, in an analogous art, Yu teaches outputting attention maps corresponding to scaled dot product attention matrices (see page 6280, Section 5.3, Figure 7, “Visualizations of the learned attention maps (softmax(qK/√d)) of the attention units from typical layers”, where ‘(softmax(qK/√d))’ is the scaled dot product), combining the output attention maps to generate a final attention map (see page 6280, Section 5.3, Figure 7, where there are multiple attention maps on the image) corresponding to the at least one topic label from the input query (see page 6280, Section 5.3, Figure 7, “SA(Y)-l, SA(X)-l and GA(X,Y)-l denote the question self-attention, image self-attention, and question guided-attention from the l-th layer, respectively. Q, A, P denote the question, answer and prediction respectively”, and see how sheep are highlighted in the attention map, which corresponds to the query ‘How many sheep can we see in this picture’); and executing a function based on the final attention map (see page 6280, Section 5.3, Figure 7, where an answer ‘3’ is output for the question, ‘How many sheep can we see in this picture’). Yu is combinable with Xiong and Heisler because all three are from the analogous field of image analysis. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the attention maps taught by Yu with the teachings of Xiong and Heisler. The motivation for doing so would be to improve the performance of the model. Yu teaches on page 6275, Section 1, “Furthermore, we find that modeling self attention for image regions can greatly improve the object counting performance, which is challenging for VQA.” Thus, it would have been obvious to combine the attention maps taught by Yu with the teachings of Xiong and Heisler in order to obtain the invention as claimed in Claim 11.

As to Claim 12, Claim 12 claims the same limitation as Claim 2 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 2.

As to Claim 13, Claim 13 claims the same limitation as Claim 3 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 3.

As to Claim 14, Claim 14 claims the same limitation as Claim 4 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 4.

As to Claim 15, Claim 15 claims the same limitation as Claim 5 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 5.

As to Claim 18, Claim 18 claims the same limitation as Claim 8 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 8.

As to Claim 19, Claim 19 claims the same limitation as Claim 10 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 10.

As to Claim 20, Xiong teaches a method for controlling an artificial intelligence (AI) device (see Fig. 5, computer system, and see paragraph [0012], “Embodiments of this disclosure provide improved VQA performance by improving the alignment between image and natural-language input modalities”, where natural language processing is a subset of AI), the method comprising: obtaining, via a processor in the AI device, an input query, an input image (see paragraph [0015], “At step 110, the method of FIG. 1 includes accessing an image and a natural-language question regarding the image”), and at least one topic label for one or more words in the input query (see paragraph [0028], “In particular embodiments, such as the example of FIG. 2, a granularity layer for a question may include a noun phrase level (e.g., layer 220) that is constructed by filtering the result from a constituency parser for the noun phrase level, for example by discarding the determiners (e.g., ‘a’, ‘the’) and filtering out the words expressing positional relations (e.g., ‘left’, ‘right’) to save computational resources”, where the ‘nouns’ are the topic label); generating, via the processor (see Fig. 5, processor 502), at least one word embedding for the at least one topic label from the input query (see paragraph [0028], “Then the phrases are split into word tokens and, in particular embodiments, their GloVe features are processed by the MLP to obtain the token features as Tnp”, where GloVe is a tool known in the art to produce word embeddings); generating scaled dot product attention matrices based on the at least one word embedding for the at least one topic label from the input query (see paragraph [0033], “For each token, a query vector (Q), a key vector (K), and a value vector (V) are created, by multiplying the embeddings of the three matrices that are trained during the training process… Each of the sets of vectors is then input into the scaled dot-product attention, and pointwise multiplied with the lead graph (GGA) from the graph-merging module”); and executing, via the processor, a function based on the final attention map (see paragraph [0039], “At step 160, the method of FIG. 1 includes determining an answer to the question based on the first output and the second output”).

Xiong fails to teach receiving bounding boxes for objects detected in the input image and object labels corresponding to the bounding boxes. Xiong further fails to teach generating, via the processor, a plurality of word embeddings for the object labels corresponding to the bounding boxes. Xiong teaches extracting features from the input image, embedding these image features (see paragraph [0019]), and then using these embedded features to calculate a scaled dot product attention matrix (see paragraph [0033]).
However, in an analogous art, Heisler teaches receiving bounding boxes for objects detected in the input image (see paragraph [0057], “Additionally or alternatively, a bounding box representing the location of each instance of a detected object is defined”) and object labels corresponding to the bounding boxes (see paragraph [0057], “The result is that the segmented input image is associated with a set of one or more object class labels and also associated with a set of pixel groups and/or bounding boxes for each respective object class label”), and generating a plurality of word embeddings for the object labels corresponding to the bounding boxes (see paragraph [0089], “At 308, an embedding is generated for the object label associated with each object image. The embedding may be generated any suitable embedding technique (e.g., BERT, Word2Vec, GloVe, or fastText)”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the image embeddings taught by Xiong with the word embeddings from bounding box labels taught by Heisler. The motivation for doing so would be to create a dictionary of embeddings for objects, which can be used to identify objects in later images. Heisler teaches in paragraph [0060], “The collection of embeddings may be stored as a dictionary of embeddings, in which each entry in the dictionary is a text string (e.g., a word or phrase) that is associated with a corresponding embedding generated by the pre-trained embedder. An embedding may then be generated for each object class label associated with the segmented input image by looking up each object class label in the dictionary of embeddings to identify and select the corresponding embedding that represents each respective object class label.”

Xiong fails to explicitly teach outputting attention maps corresponding to scaled dot product attention matrices. However, in an analogous art, Yu teaches outputting attention maps corresponding to scaled dot product attention matrices (see page 6280, Section 5.3, Figure 7, “Visualizations of the learned attention maps (softmax(qK/√d)) of the attention units from typical layers”, where ‘(softmax(qK/√d))’ is the scaled dot product), combining the output attention maps to generate a final attention map (see page 6280, Section 5.3, Figure 7, where there are multiple attention maps on the image) corresponding to the at least one topic label from the input query (see page 6280, Section 5.3, Figure 7, “SA(Y)-l, SA(X)-l and GA(X,Y)-l denote the question self-attention, image self-attention, and question guided-attention from the l-th layer, respectively. Q, A, P denote the question, answer and prediction respectively”, and see how sheep are highlighted in the attention map, which corresponds to the query ‘How many sheep can we see in this picture’); and executing a function based on the final attention map (see page 6280, Section 5.3, Figure 7, where an answer ‘3’ is output for the question, ‘How many sheep can we see in this picture’). Yu is combinable with Xiong and Heisler because all three are from the analogous field of image analysis. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the attention maps taught by Yu with the teachings of Xiong and Heisler. The motivation for doing so would be to improve the performance of the model.
Yu teaches on page 6275, Section 1, “Furthermore, we find that modeling self attention for image regions can greatly improve the object counting performance, which is challenging for VQA.” Thus, it would have been obvious to combine the attention maps taught by Yu with the teachings of Xiong and Heisler in order to obtain the invention as claimed in Claim 20.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong et al. (US Pub No 20230106716), hereinafter Xiong, in view of Heisler (US Pub No 20220292685), hereinafter Heisler, in view of Yu et al. (Z. Yu, et al., "Deep Modular Co-Attention Networks for Visual Question Answering," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274-6283), hereinafter Yu, and further in view of Bera et al. (A. Bera, Z. Wharton, Y. Liu, N. Bessis and A. Behera, "SR-GNN: Spatial Relation-Aware Graph Neural Network for Fine-Grained Image Categorization," in IEEE Transactions on Image Processing, vol. 31, pp. 6017-6031, 2022), hereinafter Bera.

As to Claim 6, Xiong in view of Heisler teaches multiplying the normalized matrix and a value matrix (see Xiong, formula shown in paragraph [0033], where the normalized matrix softmax(QKᵀ/√dₖ) is multiplied by the value matrix V), but Xiong in view of Heisler fails to explicitly teach generating an output attention map by multiplying the normalized matrix with a value matrix. Yu teaches that attention maps can be output by multiplying a normalized matrix and a value matrix (see page 6276, Section 3.1, “Given a query q ∈ R1×d, n key-value pairs (packed into a key matrix K ∈ Rn×d and a value matrix V ∈ Rn×d), the attended feature f ∈ R1×d is obtained by weighted summation over all values V with respect to the attention learned from q and K”), the output attention map being one of the output attention maps (see page 6280, Section 5.3, Figure 7, showing multiple attention maps), but fails to explicitly teach that the value matrix is based on an attention map corresponding to object detection. However, in an analogous art, Bera teaches multiplying a value vector with the dot product of a query and key vector (see page 6021, Section III, Subsection D, “The dot product of Q and K results in the attention weight matrix, which is multiplied with V to produce the desired transformed feature representation”), wherein the value matrix is based on an attention map corresponding to object detection (see page 6021, Section III, Subsection D, “The aim is to generate an attention-focused context vector (i.e., value V) that enables our model to selectively focus on more relevant regions to generate holistic context information”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the value attention matrix taught by Bera with the teachings of Xiong, Heisler and Yu. The motivation for doing so would be to ensure the model focuses on more relevant regions, as taught by Bera on page 6021. Thus, it would have been obvious to combine the teachings of Xiong, Heisler, Yu and Bera in order to obtain the invention as claimed in Claim 6.

As to Claim 16, Claim 16 claims the same limitation as Claim 6 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 6.
(US Pub No 20230106716), hereinafter Xiong, in view of Heisler (US Pub No 20220292685), hereinafter Heisler, in view of Yu et al. (Z. Yu, et al., "Deep Modular Co-Attention Networks for Visual Question Answering," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274-6283), hereinafter Yu, and further in view of Bera et al. (A. Bera, Z. Wharton, Y. Liu, N. Bessis and A. Behera, "SR-GNN: Spatial Relation-Aware Graph Neural Network for Fine-Grained Image Categorization," in IEEE Transactions on Image Processing, vol. 31, pp. 6017-6031, 2022), hereinafter Bera, and further in view of Zhou et al. (US Pub No 20240169733), hereinafter Zhou. As to Claim 7, Xiong in view of Heisler, Yu and, Bera fails to teach wherein the attention map for the value matrix is a type of heat map. Bera teaches that the value matrix may be based on an attention map (see page 6021, Section III, Subsection D.), but fails to explicitly teach that the value matrix is a heat map. However, in an analogous art, Zhou teaches a heat map where important features are highlighted (see paragraph [0083], “When extracting a target object from a feature map, a target object may be highlighted in response to filtering an image through an non matrix (the value of n may be considered based on elements such as a receptive field and accuracy, for example, may be set to 3*3, 5*5, 7*7, etc.)”), and that the highlighted features can be used as a value for the value matrix (see paragraph [0115], “First, a linear task may be performed on each of the input Q (e.g., a clip query, an object representation by previous iteration processing, corresponding to the dimension L, C), K (using a video feature as a key feature, corresponding to the dimension THW, C), and V (using a video feature as a value feature”). Thus, it would have been obvious to combine the feature highlighting taught by Zhou with the teachings of Xiong, Heisler, Yu, and Bera. The motivation for doing so would be to improve the segmentation of key features from images. Yu teaches in paragraph [0055], “the accuracy and robustness of a segmentation may be improved to a certain level”). Thus, it would have been obvious to combine the feature highlighting taught by Zhou with the teachings of Xiong, Heisler, Yu, and Bera in order to obtain the invention as claimed in Claim 7. As to Claim 17, Claim 17 claims the same limitation claimed as Claim 7 and is dependent on a similarly rejected independent claim. Therefore, the rejection and rationale are similar to that of Claim 7. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hu et al. (Hu R, Andreas J, Darrell T, Saenko K. Explainable neural computation via stack neural module networks. Applied AI Letters. 2021) teaches a modular neural network which can receive an image and query from a user, and can output an answer or bounding box corresponding to an answer. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOUMYA THOMAS whose telephone number is (571)272-8639. The examiner can normally be reached M-F 8:30-5:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /S.T./Examiner, Art Unit 2664 /NANCY BITAR/Primary Examiner, Art Unit 2664
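As a companion to the ‘heat map’ limitation of Claims 7 and 17 and the ‘uniform box’ of Claim 8, the following is a minimal sketch, assuming matplotlib, of overlaying a final attention map on an input image; the image and attention map are synthetic stand-ins, not data from any cited reference.

import numpy as np
import matplotlib.pyplot as plt

image = np.random.rand(240, 320, 3)   # synthetic stand-in for the input image
attention_map = np.zeros((240, 320))
attention_map[50:160, 34:210] = 1.0   # uniform box of high attention over one object

plt.imshow(image)
plt.imshow(attention_map, cmap="hot", alpha=0.4)  # translucent heat-map overlay
plt.axis("off")
plt.savefig("final_attention_heatmap.png")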

Prosecution Timeline

Apr 15, 2024: Application Filed
Apr 02, 2026: Non-Final Rejection — §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+0.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
