Prosecution Insights
Last updated: April 19, 2026
Application No. 18/422,046

VIDEO FRAME FEATURE EXTRACTION METHOD, DEVICE AND COMPUTER-READABLE STORAGE MEDIUM

Status: Non-Final OA (§102, §103, §112)
Filed: Jan 25, 2024
Examiner: RUSH, ERIC
Art Unit: 2677
Tech Center: 2600 — Communications
Assignee: UBTECH ROBOTICS CORP LTD
OA Round: 1 (Non-Final)
Grant Probability: 61% (Moderate)
Estimated OA Rounds: 1-2
Estimated Time to Grant: 3y 5m
Grant Probability with Interview: 97%

Examiner Intelligence

Career Allow Rate: 61% of resolved cases (383 granted / 628 resolved; -1.0% vs TC avg)
Interview Lift: +36.2% (strong; allow rate for resolved cases with an interview vs without)
Typical Timeline: 3y 5m average prosecution; 32 applications currently pending
Career History: 660 total applications across all art units

Statute-Specific Performance

§101: 10.8% (-29.2% vs TC avg)
§102: 12.7% (-27.3% vs TC avg)
§103: 40.0% (+0.0% vs TC avg)
§112: 27.7% (-12.3% vs TC avg)

Tech Center averages are estimates; figures are based on career data from 628 resolved cases.

Office Action

Rejections: §102, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claim 1 is objected to because of the following informalities: Line 2 of claim 1 recites, in part, “each video frame in a video sequence;” which appears to contain a minor informality. The Examiner suggests amending the claim to --each video frame of video frames in a video sequence;-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 2 is objected to because of the following informalities: Line 1 of claim 2 recites, in part, “wherein calculating global channel attention information of the video” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein calculating the global channel attention information of the video-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 2 is objected to because of the following informalities: Line 4 of claim 2 recites, in part, “fusion processing on the initial features of each video frame” which appears to contain inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --fusion processing on the plurality of initial features of each video frame-- in order to maintain consistency with line 2 of claim 1 and to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 3 is objected to because of the following informalities: Lines 1 - 2 of claim 3 recite, in part, “wherein performing spatial dimension compression on the global initial features to obtain global initial channel attention” which appears to contain minor informalities.
The Examiner suggests amending the claim to --wherein performing the spatial dimension compression on the global initial features to obtain the global initial channel attention-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 4 is objected to because of the following informalities: Line 1 of claim 4 recites, in part, “wherein performing a purification operation” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing [[a]] the purification operation-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 5 is objected to because of the following informalities: Lines 1 - 2 of claim 5 recite, in part, “wherein calculating local channel attention information of the target video frame according to initial features” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein calculating the local channel attention information of the target video frame according to the initial features-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 6 is objected to because of the following informalities: Line 1 of claim 6 recites, in part, “wherein performing channel attention mechanism processing” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing the channel attention mechanism processing-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 6 is objected to because of the following informalities: Line 3 of claim 6 recites, in part, “the local channel attention information to obtain optimized features” which appears to contain a minor informality. The Examiner suggests amending the claim to --the local channel attention information to obtain the optimized features-- in order to improve the clarity and precision of the claim.
Appropriate correction is required.

Claim 7 is objected to because of the following informalities: Lines 1 - 2 of claim 7 recite, in part, “wherein performing channel attention mechanism processing on the initial features of the target video frame to obtain optimized features of the target video frame” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein performing the channel attention mechanism processing on the initial features of the target video frame to obtain the optimized features of the target video frame-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 8 is objected to because of the following informalities: Line 5 of claim 8 recites, in part, “each video frame in a video sequence;” which appears to contain a minor informality. The Examiner suggests amending the claim to --each video frame of video frames in a video sequence;-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 9 is objected to because of the following informalities: Line 1 of claim 9 recites, in part, “wherein calculating global channel attention information of the video” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein calculating the global channel attention information of the video-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 9 is objected to because of the following informalities: Line 4 of claim 9 recites, in part, “fusion processing on the initial features of each video frame” which appears to contain inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --fusion processing on the plurality of initial features of each video frame-- in order to maintain consistency with line 5 of claim 8 and to improve the clarity and precision of the claim.
Appropriate correction is required.

Claim 10 is objected to because of the following informalities: Lines 1 - 2 of claim 10 recite, in part, “wherein performing spatial dimension compression on the global initial features to obtain global initial channel attention” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein performing the spatial dimension compression on the global initial features to obtain the global initial channel attention-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 11 is objected to because of the following informalities: Line 1 of claim 11 recites, in part, “wherein performing a purification operation” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing [[a]] the purification operation-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 12 is objected to because of the following informalities: Lines 1 - 2 of claim 12 recite, in part, “wherein calculating local channel attention information of the target video frame according to initial features” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein calculating the local channel attention information of the target video frame according to the initial features-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 13 is objected to because of the following informalities: Line 1 of claim 13 recites, in part, “wherein performing channel attention mechanism processing” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing the channel attention mechanism processing-- in order to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 13 is objected to because of the following informalities: Line 3 of claim 13 recites, in part, “the local channel attention information to obtain optimized features” which appears to contain a minor informality. The Examiner suggests amending the claim to --the local channel attention information to obtain the optimized features-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 14 is objected to because of the following informalities: Lines 1 - 2 of claim 14 recite, in part, “wherein performing channel attention mechanism processing on the initial features of the target video frame to obtain optimized features of the target video frame” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein performing the channel attention mechanism processing on the initial features of the target video frame to obtain the optimized features of the target video frame-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 15 is objected to because of the following informalities: Line 4 of claim 15 recites, in part, “each video frame in a video sequence;” which appears to contain a minor informality. The Examiner suggests amending the claim to --each video frame of video frames in a video sequence;-- in order to improve the clarity and precision of the claims. Appropriate correction is required.

Claim 16 is objected to because of the following informalities: Lines 1 - 2 of claim 16 recite, in part, “wherein calculating global channel attention information” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein calculating the global channel attention information-- in order to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 16 is objected to because of the following informalities: Line 4 of claim 16 recites, in part, “fusion processing on the initial features of each video frame” which appears to contain inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --fusion processing on the plurality of initial features of each video frame-- in order to maintain consistency with line 4 of claim 15 and to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 17 is objected to because of the following informalities: Lines 1 - 2 of claim 17 recite, in part, “wherein performing spatial dimension compression on the global initial features to obtain global initial channel attention” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein performing the spatial dimension compression on the global initial features to obtain the global initial channel attention-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 18 is objected to because of the following informalities: Lines 1 - 2 of claim 18 recite, in part, “wherein performing a purification operation” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing [[a]] the purification operation-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 19 is objected to because of the following informalities: Lines 1 - 2 of claim 19 recite, in part, “wherein calculating local channel attention information of the target video frame according to initial features” which appears to contain minor informalities. The Examiner suggests amending the claim to --wherein calculating the local channel attention information of the target video frame according to the initial features-- in order to improve the clarity and precision of the claims. Appropriate correction is required.
Claim 20 is objected to because of the following informalities: Lines 1 - 2 of claim 20 recite, in part, “wherein performing channel attention mechanism processing” which appears to contain a minor informality. The Examiner suggests amending the claim to --wherein performing the channel attention mechanism processing-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim 20 is objected to because of the following informalities: Lines 3 - 4 of claim 20 recite, in part, “the local channel attention information to obtain optimized features” which appears to contain a minor informality. The Examiner suggests amending the claim to --the local channel attention information to obtain the optimized features-- in order to improve the clarity and precision of the claim. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1 - 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention because it is unclear as to which target video frame “the target video frame” recited on line 6, along with subsequent recitations of “the target video frame” throughout the claims, are referencing. Are they referring to the “target video frame” recited on line 5 of claim 1 or the “target video frame” recited on line 6 of claim 1? Additionally, it is unclear as to whether the “target video frame” recited on line 5 of claim 1 and the “target video frame” recited on line 6 of claim 1 are the same target video frame or different target video frames. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat the claims as requiring and referencing a single same target video frame.

Claim 2 recites the limitation "the global initial features of the video sequence;" in line 5. There is insufficient antecedent basis for this limitation in the claim.

Claim 6 recites the limitation "the fused channel attention information of the target video frame;" in lines 6 - 7. There is insufficient antecedent basis for this limitation in the claim.

Claim 7 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention because it is unclear as to which optimized features of the target video frame “the optimized features of the target video frame” recited on lines 7 - 8 are referencing.
Are they referring to the “optimized features of the target video frame” recited on line 10 of claim 1 or the “optimized features of the target video frame” recited on line 9 of claim 6? Additionally, it is unclear as to whether the “optimized features of the target video frame” recited on line 10 of claim 1 and the “optimized features of the target video frame” recited on line 9 of claim 6 are the same optimized features or different optimized features. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat the claims as requiring and referencing a single same set of optimized features of the target video frame.

Claim 8 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention because it is unclear as to which target video frame “the target video frame” recited on line 9, along with subsequent recitations of “the target video frame” throughout the claims, are referencing. Are they referring to the “target video frame” recited on line 8 of claim 8 or the “target video frame” recited on line 9 of claim 8? Additionally, it is unclear as to whether the “target video frame” recited on line 8 of claim 8 and the “target video frame” recited on line 9 of claim 8 are the same target video frame or different target video frames. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat the claims as requiring and referencing a single same target video frame.

Claim 9 recites the limitation "the global initial features of the video sequence;" in line 5. There is insufficient antecedent basis for this limitation in the claim.
Claim 13 recites the limitation "the fused channel attention information of the target video frame;" in lines 6 - 7. There is insufficient antecedent basis for this limitation in the claim.

Claim 14 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention because it is unclear as to which optimized features of the target video frame “the optimized features of the target video frame” recited on lines 7 - 8 are referencing. Are they referring to the “optimized features of the target video frame” recited on line 13 of claim 8 or the “optimized features of the target video frame” recited on line 9 of claim 13? Additionally, it is unclear as to whether the “optimized features of the target video frame” recited on line 13 of claim 8 and the “optimized features of the target video frame” recited on line 9 of claim 13 are the same optimized features or different optimized features. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat the claims as requiring and referencing a single same set of optimized features of the target video frame.

Claim 15 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention because it is unclear as to which target video frame “the target video frame” recited on line 8, along with subsequent recitations of “the target video frame” throughout the claims, are referencing.
Are they referring to the “target video frame” recited on line 7 of claim 15 or the “target video frame” recited on line 8 of claim 15? Additionally, it is unclear as to whether the “target video frame” recited on line 7 of claim 15 and the “target video frame” recited on line 8 of claim 15 are the same target video frame or different target video frames. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat the claims as requiring and referencing a single same target video frame.

Claim 16 recites the limitation "the global initial features of the video sequence;" in line 5. There is insufficient antecedent basis for this limitation in the claim.

Claim 20 recites the limitation "the fused channel attention information of the target video frame;" in lines 6 - 7. There is insufficient antecedent basis for this limitation in the claim.

Claims 3 - 5, 10 - 12 and 17 - 19 are also rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, due to being dependent upon a rejected base claim(s) but would be withdrawn from the rejection if their base claim(s) overcome the rejection.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 6 - 8, 13 - 15 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Cao et al., Chinese Publication No. CN 114842539 A. The Examiner notes that citations to Cao et al. correspond to both the provided machine translation of Cao et al. and the original Chinese Publication.

- With regards to claims 1, 8 and 15, Cao et al. disclose a computer-implemented method for extracting video frame features, (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0022, Pg. 4 ¶ n0034 - Pg. 5 ¶ n0035, Pg. 7 ¶ n0056, Pg. 12 ¶ n0114 and n0117, Pg. 13 ¶ n0119 - n0124) a device for extracting video frame features (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0022, Pg. 4 ¶ n0034 - Pg. 5 ¶ n0035, Pg. 7 ¶ n0056, Pg. 12 ¶ n0114 and n0117, Pg. 13 ¶ n0119 - n0124) comprising: one or more processors; (Cao et al., Pg. 4 ¶ n0034 - Pg. 5 ¶ n0035, Pg. 12 ¶ n0117, Pg. 13 ¶ n0119 - n0124) and a memory coupled to the one or more processors, the memory storing programs that, when executed by the one or more processors, cause performance of operations; (Cao et al., Pg. 4 ¶ n0034 - Pg. 5 ¶ n0035, Pg. 12 ¶ n0117, Pg. 13 ¶ n0119 - n0124) and a non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor of a device, cause the at least one processor to perform a method, (Cao et al., Pg. 4 ¶ n0034 - Pg. 5 ¶ n0035, Pg. 12 ¶ n0117, Pg. 13 ¶ n0119 - n0124) the method/operations comprising:

obtaining a plurality of initial features of each video frame in a video sequence; (Cao et al., Fig. 2, Pg. 2 ¶ n0010, Pg. 7 ¶ n0056, Pg. 9 ¶ n0077 [“the spatial feature module uses the first 35 layers of the VGG16 network architecture to extract a feature vector of length 4096 for each frame of the image sequence, and obtains the spatial feature Fsp ∈ R^(N×L), where N is the number of frames in the image sequence and L = 4096”])

calculating global channel attention information of the video sequence based on the plurality of initial features of each video frame in the video sequence; (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0022, Pg. 7 ¶ n0056, Pg. 9 ¶ n0077 - n0078 and n0081 - n0085 [“The global feature module uses Bi-LSTM to extract global features Fg ∈ R^(N×200) from the spatial features Fsp”])

calculating local channel attention information of a target video frame according to initial features of a target video frame; (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0017 - n0019 and n0022, Pg. 7 ¶ n0056 - n0057, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0092 [“the spatial feature module uses the first 35 layers of the VGG16 network architecture to extract a feature vector of length 4096 for each frame of the image sequence, and obtains the spatial feature Fsp ∈ R^(N×L), where N is the number of frames in the image sequence and L = 4096” and “The local feature module uses one-dimensional convolution to extract local features Fl ∈ R^(N×200) from the spatial features Fsp”])

wherein the target video frame is one of the video frames in the video sequence; (Cao et al., Figs. 2 & 3, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0094 [“where N is the number of frames in the image sequence and L = 4096” and “The local feature module uses one-dimensional convolution to extract local features Fl ∈ R^(N×200) from the spatial features Fsp”])

and performing channel attention mechanism processing on the initial features of the target video frame according to the global channel attention information and the local channel attention information to obtain optimized features of the target video frame.
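For orientation only, the sequence of steps recited in claims 1, 8 and 15 (per-frame initial features, global channel attention over the whole sequence, local channel attention for the target frame, then per-channel reweighting) can be sketched in NumPy. Everything here is a hypothetical illustration, not the applicant's or Cao et al.'s actual implementation: mean-pooling stands in for the unspecified fusion and spatial compression steps, and a sigmoid gate stands in for the "purification" operation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_features(frames):
    """Hypothetical sketch of the claimed flow.

    frames: array of shape (N, C, H, W) -- the initial features of
    each of N video frames, with C feature channels.
    Returns reweighted ("optimized") features of the same shape.
    """
    # Global channel attention: fuse the initial features of every
    # frame, compress away the spatial dimensions per channel, then
    # "purify" (sigmoid gate as a stand-in).
    global_init = frames.mean(axis=0)                    # (C, H, W)
    global_att = sigmoid(global_init.mean(axis=(1, 2)))  # (C,)
    # Local channel attention: the same spatial squeeze, per frame.
    local_att = sigmoid(frames.mean(axis=(2, 3)))        # (N, C)
    # Fuse global and local attention, then reweight each channel of
    # each frame's initial features.
    fused_att = 0.5 * (local_att + global_att[None, :])  # (N, C)
    return frames * fused_att[:, :, None, None]
```

Averaging is only one plausible fusion choice; any weighting that combines the two attention vectors before the per-channel multiply would fit the same claim language.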
(Cao et al., Fig. 3, Pg. 9 ¶ n0077 - n0079 and n0081 - n0085, Pg. 10 ¶ n0088 - n0095 [“local attention features of local features… are extracted based on global features Fg, and they will automatically assign greater weights to more relevant and information-rich local features”, “First, calculate the local attention weights. The local attention weights are calculated based on the correlation between the local features of a certain image and other images, as shown in formula (8)”, “Attl = softmax((Fg ∙ Fl^T) / d)   (8)”, “the local attention weights are multiplied by the local feature Fl to obtain the local attention feature F’l” and “F’l ∈ R^(N×200)”])

- With regards to claims 6, 13 and 20, Cao et al. disclose the method, device and non-transitory computer-readable storage medium of claims 1, 8 and 15, respectively, wherein performing channel attention mechanism processing on the initial features of the target video frame according to the global channel attention information and the local channel attention information to obtain optimized features of the target video frame comprises: performing fusion processing on the global channel attention information and the local channel attention information to obtain the fused channel attention information of the target video frame; (Cao et al., Fig. 3, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0092 [“local attention features of local features… are extracted based on global features Fg, and they will automatically assign greater weights to more relevant and information-rich local features”, “First, calculate the local attention weights.
The local attention weights are calculated based on the correlation between the local features of a certain image and other images, as shown in formula (8)” and “Attl = softmax((Fg ∙ Fl^T) / d)   (8)”])

and performing channel attention mechanism processing on the initial features of the target video frame to obtain optimized features of the target video frame according to the fused channel attention information. (Cao et al., Fig. 3, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0095 [“local attention features of local features… are extracted based on global features Fg, and they will automatically assign greater weights to more relevant and information-rich local features”, “First, calculate the local attention weights. The local attention weights are calculated based on the correlation between the local features of a certain image and other images, as shown in formula (8)”, “Attl = softmax((Fg ∙ Fl^T) / d)   (8)”, “the local attention weights are multiplied by the local feature Fl to obtain the local attention feature F’l” and “F’l ∈ R^(N×200)”])

- With regards to claims 7 and 14, Cao et al. disclose the method and device of claims 6 and 13, respectively, wherein performing channel attention mechanism processing on the initial features of the target video frame to obtain optimized features of the target video frame according to the fused channel attention information comprises: obtaining channel attention information corresponding to each feature channel in the fused channel attention information; (Cao et al., Fig. 3, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0092 [“local attention features of local features… are extracted based on global features Fg, and they will automatically assign greater weights to more relevant and information-rich local features”, “First, calculate the local attention weights.
The local attention weights are calculated based on the correlation between the local features of a certain image and other images, as shown in formula (8)” and “Attl = softmax( F g ∙ F l T d ) (8)”]) and performing channel attention mechanism processing on the initial features of the target video frame using the channel attention information corresponding to each feature channel to obtain the optimized features of the target video frame. (Cao et al., Figs. 2 & 3, Pg. 9 ¶ n0077 - n0079, Pg. 10 ¶ n0088 - n0095 [“local attention features of local features… are extracted based on global features Fg, and they will automatically assign greater weights to more relevant and information-rich local features”, “First, calculate the local attention weights. The local attention weights are calculated based on the correlation between the local features of a certain image and other images, as shown in formula (8)”, “Attl = softmax( F g ∙ F l T d ) (8)”, “the local attention weights are multiplied by the local feature Fl to obtain the local attention feature F’l” and “F’l ∈ RNx200”]) Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. 
Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claims 2 - 5, 9 - 12 and 16 - 19 are rejected under 35 U.S.C. 103 as being unpatentable over Cao et al. Chinese Publication No. CN 114842539 A as applied to claims 1, 8 and 15 above, and further in view of Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, “Squeeze-and-Excitation Networks”, IEEE, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 42, No. 8, Aug. 2020, pages 2011 - 2023, herein referred to as “Hu et al.”. The Examiner notes that citations to Cao et al. correspond to both of the provided machine translation of Cao et al. and the original Chinese Publication. - With regards to claims 2, 9 and 16, Cao et al. disclose the method, device and non-transitory computer-readable storage medium of claims 1, 8 and 15, respectively, wherein calculating global channel attention information of the video sequence based on the plurality of initial features of each video frame in the video sequence comprises: performing fusion processing on the initial features of each video frame in the video sequence to obtain the global initial features of the video sequence. (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0022, Pg. 7 ¶ n0056, Pg. 9 ¶ n0077 - n0078 and n0081 - n0085 [“The global feature module uses Bi-LSTM to extract global features Fg ∈ RNx200 from the spatial features Fsp”]) Cao et al. fail to disclose explicitly performing spatial dimension compression on the global initial features to obtain global initial channel attention information of the video sequence; and performing a purification operation on the global initial channel attention information to obtain the global channel attention information. Pertaining to analogous art, Hu et al. 
disclose performing spatial dimension compression on the global initial features to obtain global initial channel attention information of the video sequence; (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.1, Pg. 2014 Figs. 2 - 3) and performing a purification operation on the global initial channel attention information to obtain the global channel attention information. (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.2, Pg. 2014 Figs. 2 - 3) Cao et al. and Hu et al. are combinable because they are both directed towards computer vision systems that aim to extract very meaningful, relevant and information-rich features from images. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cao et al. with the teachings of Hu et al. This modification would have been prompted in order to enhance the base device of Cao et al. with the well-known and applicable technique Hu et al. applied to a comparable device. Performing spatial dimension compression followed by a purification operation on the global initial features to obtain global channel attention information, as taught by Hu et al., would enhance the base device of Cao et al. by improving the representational power of its global channel attention information/features before it is used to calculate the attention weights that are utilized to optimize features of a video frame so as to further enhance its ability to effectively obtain the most representative and information-rich features possible from the frames of the video stream. Furthermore, this modification would have been prompted by the teachings and suggestions of Hu et al. 
that their teachings improve the representational power of features, can be integrated into standard architectures such as VGGNet and may prove useful for tasks requiring strong discriminative features, see at least page 2011 section 1 paragraphs 2 - 3, page 2013 section 3 and section 3.3 paragraph 1 and page 2021 section 8 of Hu et al. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a spatial dimension compression operation followed by a purification operation would be performed on the global initial features to obtain the global channel attention information so as to further enhance the representative and discriminative power of the global channel attention information/features of the base device of Cao et al. Therefore, it would have been obvious to combine Cao et al. with Hu et al. to obtain the invention as specified in claims 2, 9 and 16.

With regard to claims 3, 10 and 17, Cao et al. in view of Hu et al. disclose the method, device and non-transitory computer-readable storage medium of claims 2, 9 and 16, respectively, wherein performing spatial dimension compression on the global initial features to obtain global initial channel attention information of the video sequence comprises: obtaining features corresponding to each feature channel in the global initial features. (Cao et al., Figs. 2 & 3, Pg. 2 ¶ n0010 - n0011, Pg. 3 ¶ n0022, Pg. 7 ¶ n0056, Pg. 9 ¶ n0077 - n0078 and n0081 - n0085 [“The global feature module uses Bi-LSTM to extract global features F_g ∈ R^(N×200) from the spatial features F_sp”]) Cao et al. fail to disclose explicitly obtaining two-dimensional features corresponding to each feature channel in the global initial features; and performing spatial dimension compression on the two-dimensional features corresponding to each feature channel to obtain the global initial channel attention information. Pertaining to analogous art, Hu et al.
disclose obtaining two-dimensional features corresponding to each feature channel in the global initial features; (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.1, Pg. 2014 Figs. 2 - 3) and performing spatial dimension compression on the two-dimensional features corresponding to each feature channel to obtain the global initial channel attention information. (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.1, Pg. 2014 Figs. 2 - 3) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Cao et al. in view of Hu et al. with additional teachings of Hu et al. This modification would have been prompted in order to substitute the two-dimensional features of Hu et al. for the plurality of initial features of Cao et al. The two-dimensional features of Hu et al. could be substituted in place of the plurality of initial features of Cao et al. utilizing well-known techniques in the art and would likely yield predictable results, in that, in the combination, the combined base device would extract, process and obtain two-dimensional features of each frame of the video stream. Furthermore, this modification would enhance the combined base device by further improving its ability to obtain the most representative and information-rich features from a video stream, since a wider variety of features would be able to be extracted from frames of video streams, thereby enhancing its ability to identify the most meaningful features of a video stream. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that the combined base device would extract, process and obtain two-dimensional features of each frame of the video stream. Therefore, it would have been obvious to combine Cao et al. in view of Hu et al. with additional teachings of Hu et al.
to obtain the invention as specified in claims 3, 10 and 17.

With regard to claims 4, 11 and 18, Cao et al. in view of Hu et al. disclose the method, device and non-transitory computer-readable storage medium of claims 2, 9 and 16, respectively. Cao et al. fail to disclose explicitly wherein performing a purification operation on the global initial channel attention information to obtain the global channel attention information comprises: performing dimensionality reduction processing on the global initial channel attention information using a preset first fully connected layer to obtain global dimensionality reduction channel attention information; and performing dimensionality enhancement processing on the global dimensionality reduction channel attention information using a preset second fully connected layer to obtain the global channel attention information. Pertaining to analogous art, Hu et al. disclose wherein performing a purification operation on the global initial channel attention information to obtain the global channel attention information comprises: performing dimensionality reduction processing on the global initial channel attention information using a preset first fully connected layer to obtain global dimensionality reduction channel attention information; (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.2, Pg. 2014 Figs. 2 - 3) and performing dimensionality enhancement processing on the global dimensionality reduction channel attention information using a preset second fully connected layer to obtain the global channel attention information. (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.2, Pg. 2014 Figs. 2 - 3)

With regard to claims 5, 12 and 19, Cao et al. disclose the method, device and non-transitory computer-readable storage medium of claims 1, 8 and 15, respectively. Cao et al.
fail to disclose explicitly wherein calculating local channel attention information of the target video frame according to initial features of the target video frame comprises: performing spatial dimension compression on the initial features of the target video frame to obtain local initial channel attention information of the target video frame; and performing a purification operation on the local initial channel attention information to obtain the local channel attention information. Pertaining to analogous art, Hu et al. disclose wherein calculating local channel attention information of the target video frame according to initial features of the target video frame comprises: performing spatial dimension compression on the initial features of the target video frame to obtain local initial channel attention information of the target video frame; (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.1, Pg. 2014 Figs. 2 - 3) and performing a purification operation on the local initial channel attention information to obtain the local channel attention information. (Hu et al., Pg. 2011 § 1 ¶ 2 - 3, Pg. 2012 Fig. 1, Pg. 2013 § 3 - § 3.2, Pg. 2014 Figs. 2 - 3) Cao et al. and Hu et al. are combinable because they are both directed towards computer vision systems that aim to extract very meaningful, relevant and information-rich features from images. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cao et al. with the teachings of Hu et al. This modification would have been prompted in order to enhance the base device of Cao et al. with the well-known and applicable technique Hu et al. applied to a comparable device. Performing spatial dimension compression followed by a purification operation on the initial features to obtain local channel attention information, as taught by Hu et al., would enhance the base device of Cao et al. 
by improving the representational power of its local channel attention information/features before it is used to calculate the attention weights that are utilized to optimize features of a video frame so as to further enhance its ability to effectively obtain the most representative and information-rich features possible from the frames of the video stream. Furthermore, this modification would have been prompted by the teachings and suggestions of Hu et al. that their teachings improve the representational power of features, can be integrated into standard architectures such as VGGNet and may prove useful for tasks requiring strong discriminative features, see at least page 2011 section 1 paragraphs 2 - 3, page 2013 section 3 and section 3.3 paragraph 1 and page 2021 section 8 of Hu et al. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a spatial dimension compression operation followed by a purification operation would be performed on the initial features to obtain the local channel attention information so as to further enhance the representative and discriminative power of the local channel attention information/features of the base device of Cao et al. Therefore, it would have been obvious to combine Cao et al. with Hu et al. to obtain the invention as specified in claims 5, 12 and 19.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hsiao et al., U.S. Publication No. 2023/0005264 A1, is directed towards a method for video recognition, wherein an attention mechanism determines an attention vector for an original set of clip descriptors for a video and an enhanced set of clip descriptors for the video is obtained based on the original set of clip descriptors and the attention vector. Shen, U.S. Publication No.
2015/0030242 A1, is directed towards a method and system for fusing multiple images, wherein local and global features of source images are processed to form local and global weight matrices, respectively, the local and global weight matrices are combined to form a final weight matrix and the source images are weighted by the final weight matrix to generate a fused image. Wu et al., U.S. Publication No. 2024/0320976 A1, is directed towards methods and systems for video processing, wherein frame-level features are determined for each frame of a video and a video-level feature characterizing feature information of the entire video is determined by aggregating the frame-level features determined from a plurality of frames of the video.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC RUSH whose telephone number is (571) 270-3017. The examiner can normally be reached 9am - 5pm Monday - Friday. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee, can be reached at (571) 270 - 5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ERIC RUSH/
Primary Examiner, Art Unit 2677
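The squeeze-and-excitation style operations the rejection maps onto claims 2 - 5 — spatial dimension compression (global average pooling), a two-fully-connected-layer "purification" (dimensionality reduction, then enhancement), and per-channel rescaling — together with the quoted local attention formula (8) can be sketched as follows. This is an illustrative reading of the cited techniques only, not the applicant's claimed implementation; the array shapes, reduction ratio, and random weights `W1`/`W2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_weights(F_g, F_l):
    # quoted formula (8): Att_l = softmax((F_g . F_l^T) / d),
    # with d the feature dimension (200 in the cited example)
    d = F_g.shape[-1]
    return softmax(F_g @ F_l.T / d, axis=-1)

def squeeze(features):
    # "spatial dimension compression": collapse each C x H x W
    # feature map to one scalar per channel
    return features.mean(axis=(-2, -1))            # (C, H, W) -> (C,)

def excite(z, W1, W2):
    # "purification": a first FC layer reduces dimensionality (with ReLU),
    # a second FC layer restores it, and a sigmoid yields per-channel
    # attention weights in (0, 1)
    r = np.maximum(W1 @ z, 0.0)                    # dimensionality reduction
    return 1.0 / (1.0 + np.exp(-(W2 @ r)))         # dimensionality enhancement

def apply_channel_attention(features, att):
    # channel attention mechanism: rescale each channel's feature map
    return features * att[:, None, None]

# toy example: C=8 channels, 4x4 spatial maps, reduction ratio 2
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 4))                 # initial features of one frame
W1 = rng.normal(size=(4, 8))                       # hypothetical learned weights
W2 = rng.normal(size=(8, 4))
att = excite(squeeze(feats), W1, W2)               # channel attention information
optimized = apply_channel_attention(feats, att)    # optimized frame features
```

Under this reading, the same squeeze/excite pipeline would produce the "global" attention when fed fused features of the whole sequence and the "local" attention when fed a single target frame's features.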

Prosecution Timeline

Jan 25, 2024
Application Filed
Mar 21, 2026
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586229
COMPUTER IMPLEMENTED METHODS AND DEVICES FOR DETERMINING DIMENSIONS AND DISTANCES OF HEAD FEATURES
2y 5m to grant · Granted Mar 24, 2026
Patent 12548292
METHOD AND SYSTEM FOR IDENTIFYING REFLECTIONS IN THERMAL IMAGES
2y 5m to grant · Granted Feb 10, 2026
Patent 12548395
SYSTEMS, METHODS AND DEVICES FOR MONITORING BETTING ACTIVITIES
2y 5m to grant · Granted Feb 10, 2026
Patent 12541856
MASKING OF OBJECTS IN AN IMAGE STREAM
2y 5m to grant · Granted Feb 03, 2026
Patent 12518504
METHOD FOR CALIBRATING AN OBJECT RE-IDENTIFICATION SOLUTION IMPLEMENTING AN ARRAY OF A PLURALITY OF CAMERAS
2y 5m to grant · Granted Jan 06, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
61%
Grant Probability
97%
With Interview (+36.2%)
3y 5m
Median Time to Grant
Low
PTA Risk
Based on 628 resolved cases by this examiner. Grant probability derived from career allow rate.
