Prosecution Insights
Last updated: April 19, 2026
Application No. 18/704,896

ROBOT CONTROL METHOD AND APPARATUS, AND STORAGE MEDIUM

Final Rejection — §103, §112
Filed
Jun 05, 2024
Examiner
MCCLEARY, CAITLIN RENEE
Art Unit
3669
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
Midea Robozone Technology Co. Ltd.
OA Round
2 (Final)
Grant Probability
57% (Moderate)
Expected OA Rounds
3-4
Time to Grant
2y 11m
With Interview
89%

Examiner Intelligence

Career Allow Rate
57% (grants 57% of resolved cases: 54 granted / 95 resolved; +4.8% vs TC avg)
Interview Lift
+32.0% (strong lift, measured over resolved cases with an interview)
Avg Prosecution
2y 11m (typical timeline)
Currently Pending
56
Total Applications
151 (career history, across all art units)
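
The headline figures above are consistent with simple ratios over the career counts shown. A minimal sketch of the apparent arithmetic; the dashboard's exact methodology is not disclosed, so treat this reconstruction as an assumption:

```python
# Back-of-the-envelope reconstruction of the dashboard figures above.
# Assumes the simplest possible model: allow rate is granted / resolved,
# and the interview figure is the career rate plus the reported lift.
granted, resolved = 54, 95
allow_rate = granted / resolved            # 0.568 -> displayed as 57%
interview_lift = 0.32                      # reported +32.0% lift
with_interview = allow_rate + interview_lift

print(f"Career allow rate: {allow_rate:.0%}")      # 57%
print(f"With interview:    {with_interview:.0%}")  # 89%
```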

Statute-Specific Performance

§101: 12.9% (-27.1% vs TC avg)
§103: 43.5% (+3.5% vs TC avg)
§102: 14.0% (-26.0% vs TC avg)
§112: 27.4% (-12.6% vs TC avg)
Tech Center averages are estimates. Based on career data from 95 resolved cases.

Office Action

§103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 were previously pending. Claims 1, 6-8, 10, 14-17, and 20 have been amended. Claims 4-5 and 18-19 have been cancelled. No claims have been newly added. Thus, claims 1-3, 6-17, and 20 are currently pending and have been examined in this application.

Examiner's Note

Examiner has cited particular paragraphs/columns and line numbers or figures in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested from the applicant, in preparing the responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. Applicant is reminded that the Examiner is entitled to give the broadest reasonable interpretation to the language of the claims. Furthermore, the Examiner is not limited to Applicant's definition which is not specifically set forth in the disclosure.

Claim Interpretation

Use of the word "means" (or "step for") in a claim with functional language creates a rebuttable presumption that the claim element is to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph). The presumption that 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph) is invoked is rebutted when the function is recited with sufficient structure, material, or acts within the claim itself to entirely perform the recited function. Absence of the word "means" (or "step for") in a claim creates a rebuttable presumption that the claim element is not to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph). The presumption that 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph) is not invoked is rebutted when the claim element recites function but fails to recite sufficiently definite structure, material or acts to perform that function.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C.
112, sixth paragraph: the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: "feature extraction module" in claims 1-3, 6-17, and 20, "temporal shift module" in claims 1-3, 6-17, and 20, "recognition module" in claims 1-3, 6-17, and 20, "convolutional module" in claims 8-9, "attention enhancement mechanism" in claims 8-9, "attention module" in claim 9, "channel attention module" in claim 9, and "spatial attention module" in claim 9. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

The above-referenced claim limitations have been interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because: "feature extraction module" in claims 1-3, 6-17, and 20, "temporal shift module" in claims 1-3, 6-17, and 20, "recognition module" in claims 1-3, 6-17, and 20, "convolutional module" in claims 8-9, "attention enhancement mechanism" in claims 8-9, "attention module" in claim 9, "channel attention module" in claim 9, and "spatial attention module" in claim 9 all use a generic placeholder "module" or "mechanism" coupled with functional language without reciting sufficient structure to achieve the function.
Furthermore, the generic placeholder is not preceded by a structural modifier. Since the claim limitation(s) invokes 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, the claims have been interpreted to cover the corresponding structure described in the specification that achieves the claimed function, and equivalents thereof.

A review of the specification shows that the following appears to be the corresponding structure described in the specification for the 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph limitation:

Feature extraction module: [0080-0081, 0102, 0105]
Temporal shift module: [0085, 0102, 0105]
Recognition module: [0090-0096, 0102, 0105]
Convolutional module: [0081-0082, 0102, 0105]
Attention enhancement mechanism: [0081-0082, 0102, 0105]
Attention module: [0082-0083, 0102, 0105]
Channel attention module: [0083, 0102, 0105]
Spatial attention module: [0083, 0102, 0105]

For all the units corresponding to a computer (hardware), the software (steps in an algorithm/flowchart) should be included to indicate proper support. If applicant wishes to provide further explanation or dispute the examiner's interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action. If applicant does not intend to have the claim limitation(s) treated under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may amend the claim(s) so that it/they will clearly not invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, or present a sufficient showing that the claim(s) recite(s) sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 14 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 14 recites "A robot… the second scene image" and there is insufficient antecedent basis for these limitations in the claim. It is unclear if this is referring to the same robot as introduced in claim 1 (upon which claim 14 depends) or a different robot. It is unclear if this is intended to refer to the second scene image or the second scene images, as multiple second scene images are introduced in claim 1 (upon which claim 14 depends). The metes and bounds of the claim language are vague and ill-defined, rendering the claim indefinite.
As best understood, the claim will be interpreted to be referring to the robot and the second scene images.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 6-7, 13-16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Han (CN 112379781 A, cited in the IDS dated 6/5/2024; a full machine translation was provided with the Office action dated 11/4/2025 and is being relied upon) in view of Wu (US 2022/0198836 A1) and Chen (CN 112347861 A; a full machine translation was provided with the Office action dated 11/4/2025 and is being relied upon).

Regarding claim 1, Han discloses a robot control method, comprising: obtaining a first scene image captured by a camera of a robot, and determining whether the first scene image comprises a foot (see at least [0042, 0046, 0053-0054, 0072] – robot… camera… recognizing foot state image to obtain wake-up action recognition information); obtaining a plurality of frames of second scene images captured by the camera consecutively in response to determining a predetermined quantity of consecutive frames of the first scene image comprises the foot (see at least [0057] - When the robot is awakened, it enters a standby state and collects new foot state images in real time; wherein the foot state images record two complete feet to be identified.); recognizing a foot posture based on the plurality of frames of second scene images, and controlling the robot based on a control manner corresponding to the recognized foot posture (see at least [0047-0049, 0058-0061] - Optionally, the interactive response signal corresponding to the foot state identification information is fed back to the robot, and the robot may perform a response action associated with the pre-stored interactive response signal according to the interactive response signal… Optionally, the response action includes one or more of: stop, start, pause, cancel, charge, test, action direction control, action speed control, action type control and action time control… Recognizing the foot state image to obtain foot state recognition information; obtaining, according to the foot state identification information, an interactive response signal corresponding to the foot state identification information… If the interactive signal is a non-cancellation response signal, it is fed back to the robot so that it can make a response action corresponding to the interactive response signal.).
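Stripped of claim language, the method mapped above is a detect-confirm-recognize control loop. The sketch below is purely illustrative; every identifier (camera.capture, detector.has_foot, recognizer.classify, the ACTIONS table) and both constants are hypothetical and appear nowhere in the application or the cited references:

```python
# Illustrative control loop for claim 1 (all names and constants hypothetical).
CONFIRM_FRAMES = 5   # "predetermined quantity of consecutive frames" with a foot
CLIP_LEN = 8         # size of the "plurality of frames of second scene images"

ACTIONS = {          # "control manner corresponding to the recognized foot posture"
    "posture_start": lambda robot: robot.start_cleaning(),
    "posture_stop":  lambda robot: robot.stop_cleaning(),
}

def control_loop(camera, robot, detector, recognizer):
    consecutive = 0
    while True:
        frame = camera.capture()                     # first scene image
        consecutive = consecutive + 1 if detector.has_foot(frame) else 0
        if consecutive >= CONFIRM_FRAMES:            # foot confirmed across frames
            clip = [camera.capture() for _ in range(CLIP_LEN)]  # second scene images
            action = ACTIONS.get(recognizer.classify(clip))     # second NN model
            if action:
                action(robot)
            consecutive = 0
```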
Han does not appear to explicitly disclose inputting the plurality of frames of second scene images that are obtained consecutively into a trained second neural network model, and recognizing the foot posture by the second neural network model based on the plurality of frames of second scene images.

Wu, in the same field of endeavor, teaches the following limitations: inputting the plurality of frames of second scene images that are obtained consecutively into a trained second neural network model, and recognizing the hand posture by the second neural network model based on the plurality of frames of second scene images (see at least [0100-0104] - The first neural network submodel is configured to detect an image stream to determine a hand bounding box in each frame of image in the image stream. The second neural network submodel is configured to perform gesture recognition on the image stream to determine a gesture action of a user. The third neural network submodel is configured to recognize a frame of image to determine a hand posture type of the user. The three neural network submodels may be separately obtained through separate training.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Wu into the invention of Han with a reasonable expectation of success so that a start state of the gesture action can be accurately recognized to avoid an erroneous response to gesture recognition as much as possible, thereby increasing an accuracy rate of gesture recognition, and enhancing gesture interaction experience of the user (Wu – [0017]).

Han does not appear to explicitly disclose wherein the recognizing the foot posture by the second neural network model based on the plurality of frames of second scene images comprises: obtaining a plurality of frames of feature maps by performing, using a feature extraction module in the second neural network model, a feature extraction sequentially on the plurality of frames of second scene images; obtaining, by a temporal shift module in the second neural network model, the plurality of frames of feature maps from the feature extraction module, and obtaining a plurality of frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, a temporal shift on each of the plurality of frames of feature maps; and obtaining, by a recognition module in the second neural network model, the plurality of frames of shifted feature maps from the temporal shift module and the plurality of frames of feature maps from the feature extraction module, and recognizing, by the recognition module in the second neural network model, the foot posture based on the plurality of frames of shifted feature maps and the plurality of frames of feature maps.
Chen, in the same field of endeavor, teaches the following limitations: wherein the recognizing the posture by the second neural network model based on the plurality of frames of second scene images comprises: obtaining a plurality of frames of feature maps by performing, using a feature extraction module in the second neural network model, a feature extraction sequentially on the plurality of frames of second scene images (see at least [0016, 0026, 0035] - obtains feature maps of three video images under the window through the image feature extraction module); obtaining, by a temporal shift module in the second neural network model, the plurality of frames of feature maps from the feature extraction module, and obtaining a plurality of frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, a temporal shift on each of the plurality of frames of feature maps (see at least [0016, 0026, 0035] - the motion feature extraction module extracts the corresponding motion context feature map based on the feature map; the pose correction module dynamically generates convolution kernel parameters based on the motion feature map, and performs a convolution operation with the feature map of the center frame of the window, i.e., the target frame image, to obtain the adjusted feature map); and obtaining, by a recognition module in the second neural network model, the plurality of frames of shifted feature maps from the temporal shift module and the plurality of frames of feature maps from the feature extraction module, and recognizing, by the recognition module in the second neural network model, the posture based on the plurality of frames of shifted feature maps and the plurality of frames of feature maps (see at least [0016, 0026-0027, 0035] - performs a convolution operation with the feature map of the center frame of the window, i.e., the target frame image, to obtain the adjusted feature map; the pose classification module takes the adjusted feature map as input, and finally obtains the predicted heatmap of human key points).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Chen into the invention of Han with a reasonable expectation of success for the purpose of improving the accuracy of human pose estimation in video scenes (Chen – [0009]).

Regarding claim 2, Han does not appear to explicitly disclose wherein the detecting whether the first scene image comprises the foot comprises: inputting the first scene image into a trained first neural network model, the first neural network model being configured to detect whether the first scene image comprises the foot, and output a detection result.

Wu, in the same field of endeavor, teaches the following limitations: wherein the detecting whether the first scene image comprises a hand comprises: inputting the first scene image into a trained first neural network model, the first neural network model being configured to detect whether the first scene image comprises the hand, and output a detection result (see at least [0100-0104] - The first neural network submodel is configured to detect an image stream to determine a hand bounding box in each frame of image in the image stream. The second neural network submodel is configured to perform gesture recognition on the image stream to determine a gesture action of a user.
The third neural network submodel is configured to recognize a frame of image to determine a hand posture type of the user. The three neural network submodels may be separately obtained through separate training.). The motivation to combine Han and Wu is the same as in the rejection of claim 1 above.

Regarding claim 6, Han does not appear to explicitly disclose wherein the obtaining the plurality of frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, the temporal shift on each of the plurality of frames of feature maps comprises: for each of frames of feature maps ranging from a first frame of feature map to a penultimate frame of feature map in the plurality of frames of feature maps, shifting features of part of channels in the feature map to corresponding channels of a successively subsequent frame of feature map to obtain the plurality of frames of shifted feature maps.

Chen, in the same field of endeavor, teaches the following limitations: wherein the obtaining the plurality of frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, the temporal shift on each of the plurality of frames of feature maps comprises: for each of frames of feature maps ranging from a first frame of feature map to a penultimate frame of feature map in the plurality of frames of feature maps, shifting features of part of channels in the feature map to corresponding channels of a successively subsequent frame of feature map to obtain the plurality of frames of shifted feature maps (see at least [0016, 0026, 0035, 0050] - Further, the specific implementation of establishing the human spatiotemporal window in step (1) is as follows: First, for each frame of the video, the ROI (Region of Interest, i.e., the human body location region) of all people in the image is detected by the Cascaded R-CNN detection algorithm. Then, the center point of the location region is fixed and expanded in all directions. The enlarged bounding box is used to crop in the single frame of the video and its neighboring frames respectively. The cropped area represents the approximate location region of a person in that time interval, which is called the human spatiotemporal window. This ensures that each person has a unique human spatiotemporal window in each frame… The Temporal Adaptive model takes a human spatiotemporal window as input, and obtains feature maps of three video images under the window through the image feature extraction module; the motion feature extraction module extracts the corresponding motion context feature map based on the feature map; the pose correction module dynamically generates convolution kernel parameters based on the motion feature map, and performs a convolution operation with the feature map of the center frame of the window, i.e., the target frame image, to obtain the adjusted feature map; the pose classification module takes the adjusted feature map as input, and finally obtains the predicted heatmap of human key points.).

The motivation to combine Han and Chen is the same as in the rejection of claim 1 above.
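The temporal shift recited in claims 6 and 20 (copying part of each frame's channels into the corresponding channels of the next frame's feature map) is compact enough to state directly. A minimal NumPy sketch under the claim's own wording; the 25% shift fraction and the zero-fill for the first frame are assumptions, not details from the record:

```python
import numpy as np

def temporal_shift(feats: np.ndarray, shift_frac: float = 0.25) -> np.ndarray:
    """Shift the first `shift_frac` of channels one frame forward in time.

    feats: (T, C, H, W) stack of per-frame feature maps. For each frame t
    from the first through the penultimate, the selected channels are copied
    into the same channels of frame t+1, tracking the claim-6 wording of
    shifting "part of channels ... to corresponding channels of a
    successively subsequent frame".
    """
    T, C, H, W = feats.shape
    k = int(C * shift_frac)            # how many channels to shift (assumed 25%)
    shifted = feats.copy()
    shifted[1:, :k] = feats[:-1, :k]   # frame t -> frame t+1
    shifted[0, :k] = 0.0               # nothing precedes the first frame (assumed)
    return shifted

# Claim 7 then merges each shifted map back with the original feature map (a
# residual-style merge) before a fully connected layer produces the posture
# result, e.g.: merged = conv(shifted) + feats; logits = fc(merged.mean(axis=(2, 3)))
```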
Regarding claim 7, Han does not appear to explicitly disclose wherein the recognizing, by the recognition module in the second neural network model, the foot posture based on the plurality of frames of shifted feature maps and the plurality of frames of feature maps comprises: performing, by a convolutional layer in the recognition module, a convolution operation on each of the plurality of frames of shifted feature maps; obtaining, by a merging layer in the recognition module, each of a plurality of frames of convolved feature maps from the convolutional layer, and obtaining a plurality of frames of merged feature maps by merging, by the merging layer in the recognition module, each of the plurality of frames of convolved feature maps with a corresponding one of the plurality of frames of feature maps; and obtaining, by a fully connected layer in the recognition module, the plurality of frames of merged feature maps from the merging layer, and obtaining, by the fully connected layer in the recognition module, a foot posture recognition result based on the plurality of frames of merged feature maps.

Chen, in the same field of endeavor, teaches the following limitations: wherein the recognizing, by the recognition module in the second neural network model, the posture based on the plurality of frames of shifted feature maps and the plurality of frames of feature maps comprises: performing, by a convolutional layer in the recognition module, a convolution operation on each of the plurality of frames of shifted feature maps (see at least [0026-0030, 0035] - The Temporal Adaptive model takes a human spatiotemporal window as input, and obtains feature maps of three video images under the window through the image feature extraction module; the motion feature extraction module extracts the corresponding motion context feature map based on the feature map; the pose correction module dynamically generates convolution kernel parameters based on the motion feature map, and performs a convolution operation with the feature map of the center frame of the window, i.e., the target frame image, to obtain the adjusted feature map; the pose classification module takes the adjusted feature map as input, and finally obtains the predicted heatmap of human key points.); obtaining, by a merging layer in the recognition module, each of a plurality of frames of convolved feature maps from the convolutional layer, and obtaining a plurality of frames of merged feature maps by merging, by the merging layer in the recognition module, each of the plurality of frames of convolved feature maps with a corresponding one of the plurality of frames of feature maps (see at least [0035, 0060] - The input is a human spatiotemporal window containing multiple frames of video images. Each frame of video image is independently processed by the image extraction module to obtain its own feature map. All feature maps are fused together and then fed into the motion feature extraction module to obtain the motion feature map of the spatiotemporal window. The pose correction module dynamically generates convolution kernel parameters based on the motion feature map and performs convolution operation on the feature map of the center frame of the window to output a refined image feature map.
This feature map is then sent to the pose classification module to obtain Gaussian heatmaps of various key points of the human body.); and obtaining, by a fully connected layer in the recognition module, the plurality of frames of merged feature maps from the merging layer, and obtaining, by the fully connected layer in the recognition module, a posture recognition result based on the plurality of frames of merged feature maps (see at least [0026, 0035, 0060] - The input is a human spatiotemporal window containing multiple frames of video images. Each frame of video image is independently processed by the image extraction module to obtain its own feature map. All feature maps are fused together and then fed into the motion feature extraction module to obtain the motion feature map of the spatiotemporal window. The pose correction module dynamically generates convolution kernel parameters based on the motion feature map and performs convolution operation on the feature map of the center frame of the window to output a refined image feature map. This feature map is then sent to the pose classification module to obtain Gaussian heatmaps of various key points of the human body.).

The motivation to combine Han and Chen is the same as in the rejection of claim 1 above.

Regarding claim 13, Han discloses a robot control apparatus, comprising: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein the processor is configured to implement, when executing the computer program, steps of the method according to claim 1 (see at least [0017, 0082-0087]).

Regarding claim 14, Han discloses a robot, comprising: the robot control apparatus according to claim 13 (see at least [0030] - robot); and the camera configured to capture the first scene image and the second scene image (see at least [0042, 0072] - camera).

With respect to claim 15, all the limitations have been analyzed in view of claim 1, and it has been determined that claim 15 does not teach or define any new limitations beyond those previously recited in claim 1; therefore, claim 15 is also rejected over the same rationale as claim 1. With respect to claim 16, all the limitations have been analyzed in view of claim 2, and it has been determined that claim 16 does not teach or define any new limitations beyond those previously recited in claim 2; therefore, claim 16 is also rejected over the same rationale as claim 2. With respect to claim 20, all the limitations have been analyzed in view of claim 6, and it has been determined that claim 20 does not teach or define any new limitations beyond those previously recited in claim 6; therefore, claim 20 is also rejected over the same rationale as claim 6.

Claims 3, 8-9, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Han in view of Wu, Chen, and Hu (CN 109934249 A; a full machine translation was provided with the Office action dated 11/4/2025 and is being relied upon).

Regarding claim 3, Han does not appear to explicitly disclose further comprising a training process of the first neural network model, the training process comprising: obtaining an image, captured by the camera, containing the foot as a positive sample; obtaining an image, captured by the camera, without the foot as a negative sample; and obtaining the first neural network model by training a pre-established classification model using the positive sample and the negative sample.
Hu, in the same field of endeavor, teaches the following limitations: a training process of the first neural network model, the training process comprising: obtaining an image, captured by the camera, containing the object as a positive sample (see at least [0010] - acquiring multiple sample images; adding labels to the acquired multiple sample images, wherein positive sample labels are added to sample images containing predetermined features); obtaining an image, captured by the camera, without the object as a negative sample (see at least [0010] - negative sample labels are added to sample images not containing the predetermined features); and obtaining the first neural network model by training a pre-established classification model using the positive sample and the negative sample (see at least [0010] - positive sample labels are added to sample images containing predetermined features, and negative sample labels are added to sample images not containing the predetermined features; establishing a neural network classification model based on an attention mechanism; and training the neural network classification model using the labeled sample images to obtain an optimal classification model).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Hu into the invention of Han with a reasonable expectation of success for the purpose of improving data processing for training a classification model for achieving accurate classification of images with blurred or irregular features (Hu – [0007-0008, 0028]).

Regarding claim 8, Han does not appear to explicitly disclose wherein the obtaining the plurality of frames of feature maps by performing, using the feature extraction module in the second neural network model, the feature extraction sequentially on the plurality of frames of second scene images comprises: obtaining the plurality of frames of feature maps by performing, using a convolutional module with an attention enhancement mechanism in the feature extraction module, the feature extraction sequentially on the plurality of frames of second scene images.

Hu, in the same field of endeavor, teaches the following limitations: wherein the obtaining the plurality of frames of feature maps by performing, using the feature extraction module in the second neural network model, the feature extraction sequentially on the plurality of frames of second scene images comprises: obtaining the plurality of frames of feature maps by performing, using a convolutional module with an attention enhancement mechanism in the feature extraction module, the feature extraction sequentially on the plurality of frames of second scene images (see at least [0080-0082] - An attention sub-model is introduced into the hidden layer of the above convolutional neural network classification model. This attention sub-model includes attention parameters. As an optional embodiment, the attention parameters of the attention sub-model are used to construct the feature weights of each channel or pixel of the feature map input to the attention sub-model. For example, an attention sub-model is introduced after the first convolutional layer of the convolutional neural network classification model shown in Figure 3A, and the three feature maps output by the first convolutional layer are input into this attention sub-model.
Assuming each feature map corresponds to 3 channels, for each feature map input into the attention sub-model, the attention parameters of the attention sub-model are used to construct the feature weights of each channel of the feature map, or, more finely, the attention parameters are used to construct the feature weights of each pixel of the feature map. In this way, the attention sub-model outputs 3 new feature maps with reconstructed feature weights, and these 3 new feature maps are used as the input to the next pooling layer.).

The motivation to combine Han and Hu is the same as in the rejection of claim 3 above.

Regarding claim 9, Han does not appear to explicitly disclose wherein the convolutional module with the attention enhancement mechanism comprises an attention module arranged between at least one pair of adjacent convolutional layers, the attention module comprising a channel attention module, a first fusion layer, a spatial attention module, and a second fusion layer; and wherein the method further comprises: obtaining, by the channel attention module, a channel weight based on a feature map output by a previous convolutional layer; obtaining a first fusion feature map by fusing, by the first fusion layer, the channel weight to the feature map outputted by the previous convolutional layer; obtaining, by the spatial attention module, a spatial position weight based on the first fusion feature map outputted by the first fusion layer; and obtaining a second fusion feature map by fusing, by the second fusion layer, the spatial position weight to the first fusion feature map, and inputting, by the second fusion layer, the second fusion feature map to a next convolutional layer.

Hu, in the same field of endeavor, teaches the following limitations: wherein the convolutional module with the attention enhancement mechanism comprises an attention module arranged between at least one pair of adjacent convolutional layers, the attention module comprising a channel attention module, a first fusion layer, a spatial attention module, and a second fusion layer (see at least [0078-0082] - attention sub-model is introduced into the hidden layer… the attention parameters of the attention sub-model are used to construct the feature weights of each channel of the feature map… aggregating spatial information… convolutional layers… pooling layers); and wherein the method further comprises: obtaining, by the channel attention module, a channel weight based on a feature map output by a previous convolutional layer (see at least [0078-0082] - Assuming each feature map corresponds to 3 channels, for each feature map input into the attention sub-model, the attention parameters of the attention sub-model are used to construct the feature weights of each channel of the feature map, or, more finely, the attention parameters are used to construct the feature weights of each pixel of the feature map. In this way, the attention sub-model outputs 3 new feature maps with reconstructed feature weights, and these 3 new feature maps are used as the input to the next pooling layer.); obtaining a first fusion feature map by fusing, by the first fusion layer, the channel weight to the feature map outputted by the previous convolutional layer (see at least [0078-0082] - Assuming each feature map corresponds to 3 channels, for each feature map input into the attention sub-model, the attention parameters of the attention sub-model are used to construct the feature weights of each channel of the feature map, or, more finely, the attention parameters are used to construct the feature weights of each pixel of the feature map. In this way, the attention sub-model outputs 3 new feature maps with reconstructed feature weights, and these 3 new feature maps are used as the input to the next pooling layer.); obtaining, by the spatial attention module, a spatial position weight based on the first fusion feature map outputted by the first fusion layer (see at least [0078-0082] - The output of the previous layer serves as the input of the next layer… Following the combination of convolutional and pooling layers is a fully connected layer, where each feature map in the fully connected layer has a mapping relationship with each feature map in the previous layer.); and obtaining a second fusion feature map by fusing, by the second fusion layer, the spatial position weight to the first fusion feature map, and inputting, by the second fusion layer, the second fusion feature map to a next convolutional layer (see at least [0078-0082] - The output of the previous layer serves as the input of the next layer… Following the combination of convolutional and pooling layers is a fully connected layer, where each feature map in the fully connected layer has a mapping relationship with each feature map in the previous layer.).

The motivation to combine Han and Hu is the same as in the rejection of claim 3 above.
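Claim 9's sequence (channel weight, first fusion, spatial weight, second fusion) tracks the familiar channel-then-spatial attention arrangement that the examiner reads onto Hu's more generic attention sub-model. A minimal NumPy sketch of that reading; the pooling choices and weight shapes are illustrative assumptions rather than limitations drawn from the claim or from Hu:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_block(fmap: np.ndarray, w_ch: np.ndarray, w_sp: np.ndarray) -> np.ndarray:
    """Channel attention -> first fusion -> spatial attention -> second fusion.

    fmap: (C, H, W) output of the previous convolutional layer.
    w_ch: (C, C) weights of a hypothetical channel-attention projection.
    w_sp: (2,)   weights of a hypothetical spatial-attention projection.
    Both weight shapes are illustrative stand-ins for learned layers.
    """
    # Channel attention: pool spatially, project, squash to per-channel weights.
    ch_desc = fmap.mean(axis=(1, 2))             # (C,) global average pool
    ch_weight = sigmoid(w_ch @ ch_desc)          # (C,) channel weight

    # First fusion layer: apply the channel weight to the input feature map.
    fused1 = fmap * ch_weight[:, None, None]     # (C, H, W)

    # Spatial attention: pool across channels, project, squash per position.
    sp_desc = np.stack([fused1.mean(axis=0), fused1.max(axis=0)])  # (2, H, W)
    sp_weight = sigmoid(np.tensordot(w_sp, sp_desc, axes=1))       # (H, W)

    # Second fusion layer: apply the spatial weight; the result would feed
    # the next convolutional layer.
    return fused1 * sp_weight[None, :, :]
```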
With respect to claim 17, all the limitations have been analyzed in view of claim 3, and it has been determined that claim 17 does not teach or define any new limitations beyond those previously recited in claim 3; therefore, claim 17 is also rejected over the same rationale as claim 3.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Han in view of Wu, Chen, and Chi (US 2022/0051061 A1).

Regarding claim 10, Han does not appear to explicitly disclose further comprising a training process of the second neural network model, the training process comprising: obtaining a plurality of video segments each containing the foot that are captured by the camera, and labeling a predetermined foot posture contained in each of the plurality of video segments; and obtaining the second neural network model by training a pre-established action recognition model using a plurality of labeled video segments.

Chi, in the same field of endeavor, teaches the following limitations: a training process of the second neural network model, the training process comprising: obtaining a plurality of video segments each containing the object that are captured by the camera, and labeling a predetermined posture contained in each of the plurality of video segments (see at least [0059-0063, 0125] - Determine, according to node sequence information corresponding to N consecutive video frames in the video data, action categories respectively corresponding to the N consecutive video frames.
Action categories corresponding to a video frame may be used to reflect which action category an action posture of the interactive object in this video frame belongs to. The action category is determined by the server in combination with information carried in the N consecutive video frames where the video frame is located.); and obtaining the second neural network model by training a pre-established action recognition model using a plurality of labeled video segments (see at least [0125-0126] - Adjust a proportion of positive and negative samples in training: According to the standard of generating the supervision information of the training samples in b), if there is no defined action in the supervision information of a training sample, the training sample is used as a negative sample; otherwise, the training sample is used as a positive sample. In each training cycle, all the positive samples and a part of the negative samples are put into training, so that a ratio of the quantity of the positive samples and negative samples participating in training in the cycle is about 2.5:1.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Chi into the invention of Han with a reasonable expectation of success for the purpose of training a neural network to accurately recognize the interactive actions made by interactive objects (Chi – [0030-0031, 0066]).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Han in view of Wu, Chen, Chi, and Gao (CN 109117742 A; a full machine translation was provided with the Office action dated 11/4/2025 and is being relied upon).

Regarding claim 11, Han does not appear to explicitly disclose subsequent to the obtaining the second neural network model, the method further comprising: performing an integer quantization on at least one model parameter of the second neural network model.

Gao, in the same field of endeavor, teaches the following limitations: subsequent to the obtaining the second neural network model, the method further comprising: performing an integer quantization on at least one model parameter of the second neural network model (see at least [0100, 0104] - In this embodiment, the weights of the gesture detection model are quantized layer by layer, and the gesture detection model after layer-by-layer quantization is stored to compress the size of the gesture detection model and reduce the storage space occupied by the gesture detection model… For example, the preset number of scales can be 256. The weights of each layer in the convolutional neural network model are floating-point data. Storing floating-point data requires 32 bits, while storing integer data only requires 8 bits. Indexing these 256 numbers only requires 8 bits, which can achieve the effect of compression.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Gao into the invention of Han with a reasonable expectation of success for the purpose of compressing the size of the gesture detection model to reduce storage space occupied by the gesture detection model while maintaining the gesture detection effect (Gao – [0100, 0104-0105]).
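Gao's scheme, as characterized in the quoted passages, stores each layer's 32-bit floating-point weights as 8-bit indices into 256 levels. A minimal sketch of one common way to realize that; the affine min/max quantizer below is an assumption, since the quoted passage does not spell out Gao's exact mapping:

```python
import numpy as np

def quantize_weights(w: np.ndarray, levels: int = 256):
    """Uniform 8-bit quantization of one layer's float weights.

    Returns uint8 codes plus the (scale, zero) pair needed to dequantize.
    The affine min/max scheme is a common choice, assumed for illustration.
    """
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (levels - 1) or 1.0   # guard a constant layer
    codes = np.round((w - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + zero

# Hypothetical layer: 32-bit floats stored as 8-bit codes, roughly 4x smaller.
w = np.random.randn(64, 32).astype(np.float32)
codes, scale, zero = quantize_weights(w)
err = np.abs(dequantize(codes, scale, zero) - w).max()
print(f"max reconstruction error: {err:.4f}")
```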
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Han in view of Wu, Chen, Williams (US 2018/0364045 A1) and Lee (US 2019/0090711 A1).

Regarding claim 12, Han discloses wherein the controlling the robot based on the control manner corresponding to the recognized foot posture comprises: controlling, based on the recognized foot posture being a first predetermined posture, the robot to initiate (see at least [0013, 0049, 0078] - start); controlling, based on the recognized foot posture being a second predetermined posture, the robot to stop (see at least [0013, 0049, 0078] - stop); controlling, based on the recognized foot posture being a third predetermined posture (see at least [0013, 0049, 0078] – action direction control); and controlling, based on the recognized foot posture being a fourth predetermined posture (see at least [0013, 0049, 0078] – action type control).

Han does not appear to explicitly disclose controlling, based on a first predetermined posture, the robot to initiate a cleaning mode to start cleaning; controlling, based on a second predetermined posture, the robot to stop cleaning; controlling, based on a third predetermined posture, the robot to access a target-tracking mode; and controlling, based on a fourth predetermined posture, the robot to perform cleaning in a predetermined range around a position of the foot.

Williams, in the same field of endeavor, teaches the following limitations: controlling, based on a first predetermined gesture, the robot to initiate a cleaning mode to start cleaning (see at least [0136] – second hand gesture to start a cleaning function of the robotic platform 100); controlling, based on a second predetermined gesture, the robot to stop cleaning (see at least [0136] – third hand gesture to stop the cleaning function); controlling, based on a third predetermined gesture, the robot to access a target-tracking mode (see at least [0136] – a user may initiate the “follow-me” mode with a first hand gesture).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Williams into the invention of Han with a reasonable expectation of success for the purpose of making it easy for a user to direct the robotic platform to control different cleaning functions (Williams – [0136-0137]).

Lee, in the same field of endeavor, teaches the following limitations: controlling, based on a fourth predetermined selection, the robot to perform cleaning in a predetermined range around a position of the user (see at least [0086, 0091] – cleaning the surrounding area within a certain distance around the location of the user).

It would have been obvious to one of ordinary skill in the art before the effective filing date to have incorporated the teachings of Lee into the invention of Han with a reasonable expectation of success for the purpose of conveniently and efficiently allowing the user and robot cleaner to cooperate with each other to perform cleaning (Lee – [0115]).

Response to Arguments

In light of the amendments to the claims, the previous 35 U.S.C. 112 rejections have been withdrawn, with the exception of claim 14 (see 35 U.S.C. 112 rejections above for further explanation). In light of the amendments to the claims, the previous 35 U.S.C. 101 rejections have been withdrawn.

Applicant's arguments, see pages 9-12, filed 2/2/2026, have been fully considered but they are not persuasive.
Applicant argues that Chen fails to disclose “obtaining, by a temporal shift module in the second neural network model, the plurality of frames of feature maps from the feature extraction module, and obtaining a plurality of frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, a temporal shift on each of the plurality of frames of feature maps” as recited in previous claim 5, and newly amended claim 1. The relied-upon apparatus in Chen simply cannot be understood as the above claimed features as recited in claim 1. Chen only teaches the pose correction module dynamically generates convolution kernel parameters based on the motion feature map, and performs a convolution operation with the feature map of the center frame of the window, i.e., the target frame image, to obtain the adjusted feature map. Chen does not contemplate temporal shift module and performing temporal shift, as recited in claim 1. Chen teaches away from this application. And combining other references to Chen will change operation principle of Chen, and render Chen inoperable for its intended purpose. Thus one skilled in the art will not consider combining other references to Chen when faced with the technical problems of this application. Thus, Chen cannot be understood as teaching the features identified above as recited in claim 1, and claim 1 is allowable.

The examiner respectfully disagrees. To summarize, Applicant asserts that Chen does not disclose the claim limitation, restates the cited portions of Chen from the previous Office action, asserts that these cited portions do not contemplate the temporal shift module as in the claim limitation, asserts that Chen teaches away from the temporal shift module, asserts that combining with other references will change Chen’s operation and render it inoperable, and finally asserts that claim 1 is allowable. None of these assertions provide any accompanying substantive arguments, reasoning, or further explanation. Applicant has not clearly pointed out how their claim is distinguishable over the cited prior art, specifically the Chen reference. Chen, at least paragraphs [0016, 0026-0027, 0035], reads on the limitations argued above. The cited portions of Chen teach at least a temporal adaptive model (this reads on the limitation of the temporal shift module) that obtains feature maps of video images through a feature extraction module (this reads on the limitation of obtaining a plurality of frames of feature maps using a feature extraction module) and uses motion feature maps to obtain an adjusted feature map (this reads on the limitation of obtaining a plurality of frames of shifted feature maps by performing a temporal shift).

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CAITLIN MCCLEARY whose telephone number is (703) 756-1674. The examiner can normally be reached Monday - Friday 10:00 am - 7:00 pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Navid Z Mehdizadeh, can be reached at (571) 272-7691. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.R.M./
Examiner, Art Unit 3669

/NAVID Z. MEHDIZADEH/
Supervisory Patent Examiner, Art Unit 3669

Prosecution Timeline

Jun 05, 2024
Application Filed
Oct 29, 2025
Non-Final Rejection — §103, §112
Feb 02, 2026
Response Filed
Mar 04, 2026
Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by the same examiner involving similar technology

Patent 12589771
VEHICLE CONTROL DEVICE, STORAGE MEDIUM FOR STORING COMPUTER PROGRAM FOR VEHICLE CONTROL, AND METHOD FOR CONTROLLING VEHICLE
2y 5m to grant Granted Mar 31, 2026
Patent 12583670
LIFT ARM ASSEMBLY FOR A FRONT END LOADING REFUSE VEHICLE
2y 5m to grant Granted Mar 24, 2026
Patent 12552379
STAGGERING DETERMINATION DEVICE, STAGGERING DETERMINATION METHOD, AND STORAGE MEDIUM
2y 5m to grant Granted Feb 17, 2026
Patent 12539840
SYSTEM AND METHOD FOR PROBING PROPERTIES OF A TRAILER TOWED BY A TOWING VEHICLE IN A HEAVY-DUTY VEHICLE COMBINATION
2y 5m to grant Granted Feb 03, 2026
Patent 12509934
SENSOR DEVICE
2y 5m to grant Granted Dec 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds
3-4
Grant Probability
57%
With Interview (+32.0%)
89%
Median Time to Grant
2y 11m
PTA Risk
Moderate
Based on 95 resolved cases by this examiner. Grant probability derived from career allow rate.
