Last updated: May 29, 2026
Application No. 17/121,763
CLASSIFICATION DEVICE AND CLASSIFICATION METHOD BASED ON NEURAL NETWORK

Final Rejection §103
Filed: Dec 15, 2020
Priority: Oct 13, 2020 (provisional 63/091,280 +1 more)
Examiner: TRAN, DAVID HOANG
Art Unit: 2147
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Industrial Technology Research Institute
OA Round: 4 (Final)
This examiner grants 14% of cases after interview (+23.2% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 14 resolved cases, 2023–2026
Examiner Intelligence

TRAN, DAVID HOANG
Career Allowance Rate: 14% (2 granted / 14 resolved), -40.7% vs TC avg
Interview Lift: +23.2% (resolved cases with interview)
Avg Prosecution: 4y 2m
Currently pending: 18
Statute-Specific Performance

§103: 98.6% (+58.6% vs TC avg)
§102: 1.4% (-38.6% vs TC avg)
Based on career data from 14 resolved cases
Office Action

§103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments filed 12/01/2025 on pages 9-14 of Remarks regarding the rejection under 35 U.S.C. 103 with respect to claims 1, 3-4, 6-7, 9-10 and 12 have been fully considered but are not persuasive.
Beginning on page 9 of Remarks, Applicant respectfully asserts that in Huang, the conclusive neural network receives and accumulates the first and second classification results from two different sources (i.e., the Multi-Modal Pattern Fusion Module and the previous conclusive neural network). However, Examiner respectfully disagrees. In Figure 1 of Huang, both inputs to the conclusive neural network originate from the same multimodal integration process at different time points, rather than from different independent sources.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Yang et al. (US20170220854A1), hereinafter Yang, in view of Wang et al. (Sensor Fusion for Myoelectric Control Based on Deep Learning With Recurrent Convolutional Neural Networks), hereinafter Wang, and in further view of Huang et al. (MiST: A Multiview and Multimodal Spatial-Temporal Learning Framework for Citywide Abnormal Event Forecasting), hereinafter Huang.
Claim 1 is rejected over Yang, Wang and Huang.
Regarding claim 1, Yang teaches a classification device based on a neural network, comprising: (“The system will fuse a group of the extracted video features and a group of the extracted other features to create a fused feature representation for a time period. It will then analyze the fused feature representation to identify a class, access a data store of classes and actions to identify an action that is associated with the class, and save the identified action to a memory device.”; Abstract; Note: See Figure 2 and 3 of Yang to see that image data of the video and the other feature data are heterogeneous data.)
a storage medium, storing a heterogeneous integration module; and
a processor, coupled to the storage medium, wherein the processor is configured to access and execute the heterogeneous integration module, wherein
(“In this document, the term “computing device” refers to an electronic device having a processor and a non-transitory, computer-readable medium (i.e., memory).”; [0020])
the heterogeneous integration module comprises:
a convolutional layer, generating a first feature map according to a first image data; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “This may be accomplished by any now or hereafter known methods, such as by using the features used by a system that performs image classification using a deep convolutional neural network that has been trained on a dataset of images so that the system recognizes objects and features within those images.”; [0043]; Note: 202 of Figure 2 can be a convolutional neural network (CNN), which has at least one convolutional layer and the output of the layer in the CNN is the feature map of the image data.)
a data normalization layer, normalizing a first numerical data to generate a first normalized numerical data, wherein the first numerical data corresponds to the first image data, wherein the first image data and the first numerical data correspond to a first time point; (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
a connected layer, coupled to the convolutional layer and the data normalization layer, concatenating the first feature map and the first normalized numerical data to generate a concatenation data, and generating the first feature vector according to the concatenation data; (“The system will then fuse 221 (concatenate) the extracted video features (first feature map) associated with a time period (which corresponds to at least one, but possibly many time spans, stamps or instants) and the extracted other data features (first normalized numerical data) that are associated with the same time period to create a fused feature representation for the time period. In one embodiment, the fusion may be performed with the aid of one or more multi-layer long-short-term memory (LSTM) networks operating on features corresponding to the time period.”; [0032]; and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The image from the video features (first feature map) are concatenated with the other data features (first normalized numerical data) in the temporal fusion (LSTM));
a classification layer, coupled to the connected layer, and generating a first classification result corresponding to the first image data and the first numerical data according to the first feature vector, wherein (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032]; Note: The system can extract the features at a second time point and generate a second classification result)
the heterogeneous integration module generates a second classification result according to a second image data and a second numerical data corresponding to a second time point, wherein the second numerical data corresponds to the second image data; and (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032]; Note: The system can extract the features at a second time point and generate a second classification result)
Yang does not teach a recurrent neural network, coupled to the heterogeneous integration module,
However, Wang teaches a recurrent neural network, coupled to the heterogeneous integration module, (See pages 6 and 7 of Wang to see that the RNN Module is connected to the CNN Module where the CNN Module consists of Fully Connected layers. The classification layer is interpreted as a fully connected (FC) layer cited in paragraph [0028] of the instant application.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the RNN Module for improved accuracy (Wang, page 6, Sensor fusion with RCNN model).  Yang and Wang are analogous art because they both concern fusion of temporal data.
Yang does not teach wherein the recurrent neural network receives the first classification result and the second classification result from the same heterogeneous integration module and generates a third classification result corresponding to the second image data and the second numerical data according to the first classification result and the second classification result. 
wherein the heterogeneous integration module generates a fourth classification result according to a third image data and a third numerical data corresponding to a third time point, and 
the recurrent neural network receives the second classification result corresponding to the second time point and the fourth classification result corresponding to the third time point from the same heterogeneous integration module, and generates a fifth classification result corresponding to the third image data and the third numerical data according to the second classification result and the fourth classification result.
However, Huang teaches wherein the recurrent neural network receives the first classification result and the second classification result from the same heterogeneous integration module and generates a third classification result corresponding to the second image data and the second numerical data according to the first classification result and the second classification result. (“The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction (third classification result) during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: Figure 1 of Huang shows that each time step t-k feeds into a conclusive neural network, which accumulates the first and second classification results to generate a third classification result.)
wherein the heterogeneous integration module generates a fourth classification result according to a third image data and a third numerical data corresponding to a third time point, and 
the recurrent neural network receives the second classification result corresponding to the second time point and the fourth classification result corresponding to the third time point from the same heterogeneous integration module, and generates a fifth classification result corresponding to the third image data and the third numerical data according to the second classification result and the fourth classification result. (“Relying on the hidden representations generated from spatial-temporal-categorical views, we then develop a conclusive recurrent networks to effectively capture the sequential patterns of cross-modal correlations between location, time, and event category. The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step tk that each goes into a conclusive neural network and at a third time step, a third classification result is generated. A classification result is generated at each time step tk. The contextual attention module is the same module applied across different time steps and can be applied to the fourth and fifth classification result. Examiner is interpreting outputs generated at each time step that reflect the classification processing of the model as classification results.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the conclusive recurrent neural networks of Huang to effectively capture the sequential patterns of cross-modal correlations. (Huang, page 719, Conclusive Recurrent Neural Networks). Yang and Huang are analogous art because they both concern multimodal fusion.
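The data flow recited in the final limitations of claim 1 can be sketched as below. This is a toy illustration only, not the implementation of the instant application or of Yang, Wang, or Huang. The names `integration_module` and `run_recurrent`, the weights, and the sample values are all hypothetical, and each per-time-point classification result is reduced to a single score. The sketch assumes a simple Elman-style recurrent update, in which the hidden state carries the earlier result forward so that the output at a given time point depends on the results at that time point and the preceding one.

```python
import math

def integration_module(image_summary, numeric_value):
    """Hypothetical stand-in for the heterogeneous integration module:
    maps one (image, numerical) pair at a time point to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(0.8 * image_summary + 0.5 * numeric_value)))

def run_recurrent(results, w_in=0.6, w_hidden=0.4):
    """Assumed Elman-style recurrence: h_t = tanh(w_in * x_t + w_hidden * h_{t-1}).
    Every input x_t comes from the same integration module."""
    hidden = 0.0
    outputs = []
    for result in results:
        hidden = math.tanh(w_in * result + w_hidden * hidden)
        outputs.append(hidden)
    return outputs

# (image summary, numerical data) at the first, second, and third time points.
samples = [(0.2, 1.0), (0.5, 0.0), (0.9, -1.0)]

# In claim 1's numbering: first, second, and fourth classification results.
per_time = [integration_module(img, num) for img, num in samples]

# outputs[1] plays the role of the third classification result (first + second),
# outputs[2] the fifth classification result (second + fourth, via the state).
outputs = run_recurrent(per_time)
```

In this sketch both inputs to each recurrent step originate from one module evaluated at different time points, which is the arrangement the claim language "from the same heterogeneous integration module" describes.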
Claim 7 is rejected over Yang, Wang and Huang.
Regarding claim 7, Yang teaches a classification method based on a neural network (“The system will fuse a group of the extracted video features and a group of the extracted other features to create a fused feature representation for a time period. It will then analyze the fused feature representation to identify a class, access a data store of classes and actions to identify an action that is associated with the class, and save the identified action to a memory device.”; Abstract; Note: The image data of the video and the other feature data are heterogeneous data.), comprising:
obtaining a first image data and (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031]) a first numerical data corresponding to a first time point, wherein the first numerical data corresponds to the first image data; (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
obtaining a heterogeneous integration module, wherein the heterogeneous integration module comprises a convolutional layer, a data normalization layer, a connected layer and a classification layer; (See Figure 2 of Yang to see that 202 can consist of a convolutional layer, 212 can consist of a data normalization layer, 221 is the connected layer and 222 is the classification layer.)
generating a first feature map according to the first image data by the convolutional layer; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “This may be accomplished by any now or hereafter known methods, such as by using the features used by a system that performs image classification using a deep convolutional neural network that has been trained on a dataset of images so that the system recognizes objects and features within those images.”; [0043]; Note: 202 of Figure 2 can be a convolutional neural network (CNN), which has at least one convolutional layer and the output of the layer in the CNN is the feature map of the image data.)
normalizing the first numerical data to generate a first normalized numerical data by the data normalization layer; (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
concatenating the first feature map and the first normalized numerical data to generate a concatenation data, and generating the first feature vector according to the concatenation data by the connected layer (“The system will then fuse 221 (concatenate) the extracted video features (first feature map) associated with a time period (which corresponds to at least one, but possibly many time spans, stamps or instants) and the extracted other data features (first normalized numerical data) that are associated with the same time period to create a fused feature representation for the time period. In one embodiment, the fusion may be performed with the aid of one or more multi-layer long-short-term memory (LSTM) networks operating on features corresponding to the time period.”; [0032]; and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The image from the video features (first feature map) are concatenated with the other data features (first normalized numerical data) in the temporal fusion (LSTM));
generating a first classification result corresponding to the first image data and the first numerical data according to the first feature vector by the classification layer; (“The processor will implement instructions to analyze the fused feature representation to perform a classification process 222 that includes identifying a class that applies to both the extracted video features and the extracted other data features, and also identifying an action that is associated with the class.”; [0036]; Note: The extracted video features contain the first image and the other data features contain the first numerical data)
obtaining a second image data and a second numerical data corresponding to a second time point, wherein the second numerical data corresponds to the second image data; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032]; Note: The system can extract the second image features at a second time point)
generating a second classification result according to the second image data and the second numerical data by the heterogeneous integration module; (“The processor will implement instructions to analyze the fused feature representation to perform a classification process 222 that includes identifying a class that applies to both the extracted video features and the extracted other data features, and also identifying an action that is associated with the class.”; [0036]; Note: The extracted video features contain the second image and the other data features contain the second numerical data to form the second classification result)
Yang does not teach obtaining a recurrent neural network and connecting the recurrent neural network to an output of the classification layer; and
However, Wang teaches obtaining a recurrent neural network and connecting the recurrent neural network to an output of the classification layer; and (See pages 6 and 7 of Wang to see that the RNN Module is connected to the CNN Module where the CNN Module consists of Fully Connected layers. The classification layer is interpreted as a fully connected (FC) layer cited in paragraph [0028] of the instant application.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the RNN Module for improved accuracy (Wang, page 6, Sensor fusion with RCNN model).  Yang and Wang are analogous art because they both concern fusion of temporal data.
Yang does not teach receiving the first classification result and the second classification result from the same heterogeneous integration module and generating a third classification result corresponding to the second image data and the second numerical data according to the first classification result and the second classification result by the recurrent neural network,
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector.
However, Huang teaches receiving the first classification result and the second classification result from the same heterogeneous integration module and generating a third classification result corresponding to the second image data and the second numerical data according to the first classification result and the second classification result by the recurrent neural network. (“The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction (third classification result) during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: Figure 1 of Huang shows that each time step t-k feeds into a conclusive neural network, which accumulates the first and second classification results to generate a third classification result.)
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector. (“Relying on the hidden representations generated from spatial-temporal-categorical views, we then develop a conclusive recurrent networks to effectively capture the sequential patterns of cross-modal correlations between location, time, and event category. The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step tk that each goes into a conclusive neural network and at a third time step, a third classification result is generated. A classification result is generated at each time step tk. The contextual attention module is the same module applied across different time steps.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the conclusive recurrent neural networks of Huang to effectively capture the sequential patterns of cross-modal correlations. (Huang, page 719, Conclusive Recurrent Neural Networks).
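The per-time-point method steps of claim 7 (feature map, normalization, concatenation, classification) can be sketched as below. This is a minimal illustration under stated assumptions, not the method of the instant application or of any cited reference. The image, kernel, numerical values, and weights are hypothetical, the normalization follows the mean/standard-deviation scaling quoted from Yang [0041], and the concatenation is treated directly as the feature vector for brevity.

```python
import math

def conv2d_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the operation CNN layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def zscore(values):
    """Subtract the mean and scale by the standard deviation (assumes std > 0)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical inputs for a single time point.
image = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 3.0],
         [2.0, 1.0, 1.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]
numerical = [36.5, 72.0, 120.0]

feature_map = conv2d_valid(image, kernel)   # convolutional layer
normalized = zscore(numerical)              # data normalization layer
# Connected layer: flatten the feature map and concatenate the numerical data.
concatenation = [v for row in feature_map for v in row] + normalized

# Classification layer: hypothetical two-class linear weights plus softmax.
weights = [[0.1] * len(concatenation), [-0.1] * len(concatenation)]
logits = [sum(w * x for w, x in zip(row, concatenation)) for row in weights]
classification = softmax(logits)
```

Running the same functions on the data for a second time point would yield the second classification result in the same way.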
Claims 3 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Yang, Wang and Huang in further view of Tong et al. (Pulmonary Nodule Classification Based on Heterogeneous Features Learning), hereinafter Tong.
Claim 3 is rejected over Yang, Wang, Huang and Tong with the incorporation of claim 1.
Regarding claim 3, Yang does not teach wherein the first normalized numerical data is normalized to a value from 0 to 1.
However, Tong teaches wherein the first normalized numerical data is normalized to a value from 0 to 1. (“the features of age are normalized to [0, 1]”; page 579; Note: The features of age are the normalized numerical data associated with the image in Figure 3).
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the normalization of numerical data related to the image data of Tong to improve the efficiency of classification (Tong, IV. Conclusion). Yang and Tong are analogous art because they both teach classification of heterogeneous data.
Dependent claim 9 is claim 3 in the form of a method and is rejected for the same reasons as claim 3 above. For the rejection of the limitations specifically pertaining to the device of claim 1, see the rejection of claim 1 above.
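The difference between the normalization quoted from Yang [0041] and the [0, 1] normalization quoted from Tong can be illustrated with a short sketch. The function names and the sample ages are hypothetical, and neither function is taken from the cited references.

```python
import math

def zscore(values):
    """Mean/standard-deviation scaling, as in the passage quoted from Yang.
    The output is centered on 0 and is not confined to [0, 1]. Assumes std > 0."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def minmax(values):
    """Min-max scaling to the closed interval [0, 1], the range recited in
    claim 3. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [30.0, 45.0, 60.0]   # hypothetical numerical data
z_scaled = zscore(ages)     # contains negative values
unit_scaled = minmax(ages)  # every value lies in [0, 1]
```

Z-score output can fall outside the 0 to 1 range, which is consistent with the rejection relying on Tong rather than Yang for the limitation of claim 3.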
Claims 4 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Huang.
Claim 4 is rejected over Yang and Huang.
Regarding claim 4, Yang teaches a classification device based on a neural network, comprising: (“The system will fuse a group of the extracted video features and a group of the extracted other features to create a fused feature representation for a time period. It will then analyze the fused feature representation to identify a class, access a data store of classes and actions to identify an action that is associated with the class, and save the identified action to a memory device.”; Abstract; Note: See Figure 2 and 3 of Yang to see that image data of the video and the other feature data are heterogeneous data.)
a storage medium, storing a heterogeneous integration module; and
a processor, coupled to the storage medium, wherein the processor is configured to access and execute the heterogeneous integration module, wherein
(“In this document, the term “computing device” refers to an electronic device having a processor and a non-transitory, computer-readable medium (i.e., memory).”; [0020])
a heterogeneous integration module, comprising:
a convolutional layer, generating a first feature map according to a first image data; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “This may be accomplished by any now or hereafter known methods, such as by using the features used by a system that performs image classification using a deep convolutional neural network that has been trained on a dataset of images so that the system recognizes objects and features within those images.”; [0043]; Note: 202 of Figure 2 can be a convolutional neural network (CNN), which has at least one convolutional layer and the output of the layer in the CNN is the feature map of the image data.)
a data normalization layer, normalizing a first numerical data to generate a first normalized numerical data, wherein the first numerical data corresponds to the first image data, wherein the first image data and the first numerical data correspond to a first time point; and (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
a connected layer, coupled to the convolutional layer and the data normalization layer, concatenating the first feature map and the first normalized numerical data to generate a concatenation data, and generating the first feature vector according to the concatenation data. (“The system will then fuse 221 (concatenate) the extracted video features (first feature map) associated with a time period (which corresponds to at least one, but possibly many time spans, stamps or instants) and the extracted other data features (first normalized numerical data) that are associated with the same time period to create a fused feature representation for the time period. In one embodiment, the fusion may be performed with the aid of one or more multi-layer long-short-term memory (LSTM) networks operating on features corresponding to the time period.”; [0032]; and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The image from the video features (first feature map) are concatenated with the other data features (first normalized numerical data) in the temporal fusion (LSTM));
wherein the heterogeneous integration module generates a second feature vector according to a second image data and a second numerical data corresponding to a second time point, wherein the second numerical data corresponds to the second image data (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032]; Note: The system can extract the features at a second time point)
Yang does not teach a recurrent neural network, coupled to the connected layer, wherein the recurrent neural network generates a first classification result corresponding to the first image data and the first numerical data according to the first feature vector,
wherein the recurrent neural network receives the first feature vector and the second feature vector from the same heterogeneous integration module and generates a second classification result corresponding to the second image data and the second numerical data according to the first feature vector and the second feature vector,
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector.
However, Huang teaches a recurrent neural network, coupled to the connected layer, wherein the recurrent neural network generates a first classification result corresponding to the first image data and the first numerical data according to the first feature vector, (“The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step t-k that each goes into a conclusive neural network and accumulates the first and second classification result to generate a third classification result.)
wherein the recurrent neural network receives the first feature vector and the second feature vector from the same heterogeneous integration module and generates a second classification result corresponding to the second image data and the second numerical data according to the first feature vector and the second feature vector, (“The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step tk that each goes into a conclusive neural network and at a second time step, a second classification result is generated. A classification result is generated at each time step tk.)
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector. (“Relying on the hidden representations generated from spatial-temporal-categorical views, we then develop a conclusive recurrent networks to effectively capture the sequential patterns of cross-modal correlations between location, time, and event category. The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step tk that each goes into a conclusive neural network and at a third time step, a third classification result is generated. A classification result is generated at each time step tk. The contextual attention module is the same module applied across different time steps. Examiner is interpreting outputs generated at each time step that reflect the classification processing of the model as classification results.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the conclusive recurrent neural networks of Huang to effectively capture the sequential patterns of cross-modal correlations. (Huang, page 719, Conclusive Recurrent Neural Networks). Yang and Huang are analogous art because they both concern multimodal fusion.
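For illustration only, the per-time-step behavior discussed in this rejection (a recurrent network that receives successive feature vectors and outputs a classification result at each time step) can be sketched as follows. This is a minimal sketch, not the architecture of Yang, Huang, or the claims: the function names `rnn_step` and `classify_sequence`, the scalar weights, the tanh update, and the sign-based decision rule are all assumptions chosen for brevity.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.5):
    """One recurrent update (hypothetical weights): mix the current
    feature vector x with the previous hidden state h_prev."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h_prev)]


def classify_sequence(feature_vectors):
    """Return one classification result (0 or 1) per time step.

    Each result depends on the current feature vector and, through the
    hidden state, on the feature vectors of earlier time steps.
    """
    if not feature_vectors:
        return []
    h = [0.0] * len(feature_vectors[0])  # initial hidden state (assumed zero)
    results = []
    for x in feature_vectors:
        h = rnn_step(x, h)
        # Assumed decision rule: class 1 if the summed state is positive.
        results.append(1 if sum(h) > 0 else 0)
    return results


if __name__ == "__main__":
    first = [4.0, 4.0]     # feature vector at the first time point
    second = [-0.5, -0.5]  # feature vector at the second time point
    print(classify_sequence([first, second]))
    print(classify_sequence([second]))
```

With these assumed weights, the same feature vector `[-0.5, -0.5]` is labeled 1 when it follows `[4.0, 4.0]` and 0 when it is the only input, which shows a later result depending on an earlier feature vector.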
Claim 10 is rejected over Yang and Huang.
Regarding claim 10, Yang teaches a classification method based on a neural network, (“The system will fuse a group of the extracted video features and a group of the extracted other features to create a fused feature representation for a time period. It will then analyze the fused feature representation to identify a class, access a data store of classes and actions to identify an action that is associated with the class, and save the identified action to a memory device.”; Abstract; Note: The image data of the video and the other feature data are heterogeneous data.), comprising:
obtaining a first image data and (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031]) a first numerical data corresponding to a first time point, wherein the first numerical data corresponds to the first image data; (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
obtaining a heterogeneous integration module and [a recurrent neural network,] wherein the heterogeneous integration module comprises a convolutional layer, a data normalization layer and a connected layer, [wherein the recurrent neural network is connected to an output of the connected layer;] (See Figure 2 of Yang to see that 202 can consist of a convolutional layer, 212 can consist of a data normalization layer, 221 is the connected layer and 222 is the classification layer.)
generating a first feature map according to the first image data by the convolutional layer; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “This may be accomplished by any now or hereafter known methods, such as by using the features used by a system that performs image classification using a deep convolutional neural network that has been trained on a dataset of images so that the system recognizes objects and features within those images.”; [0043]; Note: 202 of Figure 2 can be a convolutional neural network (CNN), which has at least one convolutional layer and the output of the layer in the CNN is the feature map of the image data.)
normalizing the first numerical data to generate a first normalized numerical data by the data normalization layer; (“The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032] and “Before any further processing was applied, each channel in the motion data was normalized independently: the mean was subtracted and the signal scaled by its standard deviation.”; [0041]; Note: The other sensor data such as motion data 212 is the first normalized numerical data)
concatenating the first feature map and the first normalized numerical data to generate a concatenation data and generating a first feature vector according to the concatenation data by the connected layer; (“The system will then fuse 221 the extracted video features associated with a time period (which corresponds to at least one, but possibly many time spans, stamps or instants) and the extracted other data features that are associated with the same time period to create a fused feature representation for the time period. In one embodiment, the fusion may be performed with the aid of one or more multi-layer long-short-term memory (LSTM) networks operating on features corresponding to the time period.”; [0032]; Note: The image from the video features are concatenated with the other data features in the temporal fusion (LSTM));
generating a first classification result corresponding to the first image data and the first numerical data according to the first feature vector [by the recurrent neural network;] (“The processor will implement instructions to analyze the fused feature representation to perform a classification process 222 that includes identifying a class that applies to both the extracted video features and the extracted other data features, and also identifying an action that is associated with the class.”; [0036]; Note: The extracted video features contain the first image and the other data features contain the first numerical data to form the first classification result)
obtaining a second image data and a second numerical data corresponding to a second time point, wherein the second numerical data corresponds to the second image data; (“The system will extract video features from short sequences of digital images (video clips) 202 so that each extracted video feature is associated with a time period (which may be a single point in time or a time span) corresponding to the time stamp(s) associated with the frame(s) from which the feature is extracted.”; [0031] and “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; [0032]; Note: The system can extract the second image features at a second time point)
generating a second feature vector according to the second image data and the second numerical data by the heterogeneous integration module; and (“The system will then fuse 221 the extracted video features associated with a time period (which corresponds to at least one, but possibly many time spans, stamps or instants) and the extracted other data features that are associated with the same time period to create a fused feature representation for the time period. In one embodiment, the fusion may be performed with the aid of one or more multi-layer long-short-term memory (LSTM) networks operating on features corresponding to the time period.”; [0032]; Note: The image from the video features are concatenated with the other data features in the temporal fusion (LSTM));
receiving the first feature vector and the second feature vector from the same heterogeneous integration module and generating a second classification result corresponding to the second image data and the second numerical data according to the first feature vector and the second feature vector [by the recurrent neural network.], (“The processor will implement instructions to analyze the fused feature representation to perform a classification process 222 that includes identifying a class that applies to both the extracted video features and the extracted other data features, and also identifying an action that is associated with the class.”; [0036]; Note: The extracted video features contain the second image and the other data features contain the second numerical data to form the second classification result)
Yang does not teach by the recurrent neural network,
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector.
However, Huang teaches by the recurrent neural network, (“The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step t-k that each goes into a conclusive neural network and accumulates the first and second classification result to generate a third classification result.)
wherein the heterogeneous integration module generates a third feature vector according to a third image data and a third numerical data corresponding to a third time point, wherein the third numerical data corresponds to the third image data,
the recurrent neural network receives the second feature vector and the third feature vector from the same heterogeneous integration module and generates a third classification result corresponding to the third image data and the third numerical data according to the second feature vector and the third feature vector. (“Relying on the hidden representations generated from spatial-temporal-categorical views, we then develop a conclusive recurrent networks to effectively capture the sequential patterns of cross-modal correlations between location, time, and event category. The final multi-view sequence representations from spatial-temporal-categorical dimensions are stored in the last states of the conclusive recurrent network cells, which provides a guidance for the prediction during the occurrence probability decoding phase.”; page 719, Conclusive Recurrent Neural Networks; Note: See Figure 1 of Huang to see that there is a time step tk that each goes into a conclusive neural network and at a third time step, a third classification result is generated. A classification result is generated at each time step tk. The contextual attention module is the same module applied across different time steps. Examiner is interpreting outputs generated at each time step that reflect the classification processing of the model as classification results.)
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the conclusive recurrent neural networks of Huang to effectively capture the sequential patterns of cross-modal correlations. (Huang, page 719, Conclusive Recurrent Neural Networks). Yang and Huang are analogous art because they both concern multimodal fusion.
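As a reading aid for the normalizing and concatenating steps mapped above, the following is a minimal sketch under stated assumptions. The per-channel normalization follows the operation quoted from Yang at [0041] (mean subtracted, signal scaled by its standard deviation); the use of the population standard deviation, the flattening of the feature map, and the names `zscore` and `fuse` are assumptions for illustration and are not taken from either reference or the claims.

```python
import statistics


def zscore(channel):
    """Normalize one data channel: subtract the mean and divide by the
    standard deviation (population form, an assumption)."""
    mean = statistics.fmean(channel)
    std = statistics.pstdev(channel)
    if std == 0:
        # Constant channel: return zeros rather than divide by zero.
        return [0.0] * len(channel)
    return [(x - mean) / std for x in channel]


def fuse(feature_map, normalized_numeric):
    """Flatten a 2-D feature map and append the normalized numerical
    data, giving one concatenated vector for the time point."""
    flat = [value for row in feature_map for value in row]
    return flat + list(normalized_numeric)


if __name__ == "__main__":
    feature_map = [[0.1, 0.7], [0.3, 0.9]]  # stand-in for a convolutional output
    motion = [1.0, 2.0, 3.0]                # stand-in for one sensor channel
    print(fuse(feature_map, zscore(motion)))
```

The concatenated vector has one entry per feature-map cell followed by one entry per numerical value, so the image-derived and sensor-derived data for the same time point travel together into any later layer.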
Claims 6 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Yang and Huang in further view of Tong.
Claim 6 is rejected over Yang, Huang and Tong with the incorporation of claim 4.
Regarding claim 6, Yang does not teach wherein the first normalized numerical data is normalized to a value from 0 to 1.
However, Tong teaches wherein the first normalized numerical data is normalized to a value from 0 to 1. (“the features of age are normalized to [0, 1]”; page 579; Note: The features of age are the normalized numerical data associated with the image in Figure 3).
It would have been obvious before the effective filing date to combine the multimodal data processing of Yang with the normalization of numerical data related to the image data of Tong to improve the efficiency of classification (Tong, IV. Conclusion). Yang and Tong are analogous art because they both teach classification of heterogeneous data.
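The limitation at issue in claim 6, numerical data normalized to a value from 0 to 1, can be illustrated with min-max scaling. This is a sketch under an assumption: the passage quoted from Tong states only that age features are normalized to [0, 1] and does not say which scaling is used, so the min-max formula, the function name `min_max_normalize`, and the sample ages are hypothetical.

```python
def min_max_normalize(values):
    """Linearly rescale numbers so the smallest maps to 0 and the
    largest maps to 1 (min-max scaling, an assumed choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: map everything to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    ages = [20, 35, 50, 80]  # hypothetical age features
    print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

Unlike the mean and standard deviation scaling quoted from Yang at [0041], which can produce negative values and values above 1, this scaling always stays within the closed interval [0, 1].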
Dependent claim 12 is claim 6 in the form of a method and is rejected for the same reasons as claim 6 above. For the rejection of the limitations specifically pertaining to the device of claim 4, see the rejection of claim 4 above.
Conclusion
	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
NPL: Gandhi, Ankit, et al. “GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion.” 2016
NPL: Wang, Hongsong, et al. “Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks.” 2017
NPL: Xue, Hongfei, et al. “DeepFusion: A Deep Learning Framework for the Fusion of Heterogeneous Sensory Data.” 2019
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID H TRAN whose telephone number is (703)756-1525. The examiner can normally be reached M-F 9:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/DAVID H TRAN/Examiner, Art Unit 2147
/VIKER A LAMARDO/Supervisory Patent Examiner, Art Unit 2147
Prosecution Timeline

Sep 03, 2024: Non-Final Rejection mailed (§103)
Nov 29, 2024: Response Filed
Mar 19, 2025: Final Rejection mailed (§103)
Jun 19, 2025: Request for Continued Examination
Jun 23, 2025: Response after Non-Final Action
Sep 02, 2025: Non-Final Rejection mailed (§103)
Dec 01, 2025: Response Filed
Mar 30, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/480,270, Patent 12632724: CANONICALIZATION OF DATA WITHIN OPEN KNOWLEDGE GRAPHS (4y 8m to grant; granted May 19, 2026)
17/571,542, Patent 12579404: PROCESSOR FOR NEURAL NETWORK, PROCESSING METHOD FOR NEURAL NETWORK, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM (4y 2m to grant; granted Mar 17, 2026)
Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 14%
With Interview (+23.2%): 38%
Median Time to Grant: 4y 2m (~0m remaining)
PTA Risk: High
Based on 14 resolved cases by this examiner. Grant probability derived from career allowance rate.