CTNF 18/228,415

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office action is responsive to the following communication(s): original application filed on 07/31/2023. Claims 1-20 are pending. Claims 1 and 19-20 are independent.

Specification

The use of the term "Bluetooth" in ¶ [0028] and "Wi-Fi" in ¶¶ [0029]-[0030], each of which is a trade name or a mark used in commerce, has been noted in this application. The term should be accompanied by the generic terminology; furthermore, the term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce, such as ™, SM, or ®, following the term. Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

Claim Objections

Claim 7 is objected to because of the following informalities: in Claim 7, lines 1-2, "… wherein the output from the encoder is used as key vector and a value vector for a cross-attention layer in the decoder and a query vector for the cross-attention layer…" apparently should read "… wherein the output from the encoder is used as a keys vector and a values vector for a cross-attention layer in the decoder and a queries vector for the cross-attention layer…" according to Claim 14 and ¶¶ [0046] and [0051]. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C.
112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 12 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claim 12 recites the limitation "the time-stamped textual data" in line 1. There is insufficient antecedent basis for this limitation in the claim. Since "time-stamped textual data" is recited in Claim 10 and not in Claim 9, for examination purposes, Claim 12 is considered to depend from Claim 10 instead of Claim 9.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over ZHANG (CN 109598387 A, pub. date: 04/19/2019), hereinafter ZHANG, in view of Chen et al. ("Transformer Encoder With Multi-Modal Multi-Head Attention for Continuous Affect Recognition", IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, Nov. 18, 2021, pp. 4171-4183), hereinafter Chen.

Independent Claims 1 and 19-20

ZHANG discloses a computer-implemented method comprising: inputting target series data into a transformer comprising an encoder and a decoder, the transformer having been trained on first and second datasets comprising different first and second modalities, respectively, the second dataset comprising time series data, the encoder comprising separate first and second modality streams for analyzing the first and the second datasets, respectively (ZHANG, ¶¶ [0007]-[0013] with FIG.
1: provide a method and system for stock price prediction using social text and stock price sequence data; provide a usable network framework, namely a method and prediction system for modeling and predicting stock trends using discrete stock price data and text information in social networks; the first step is to select a dataset, crawl stock closing sequence data and corresponding social text datasets such as Twitter, and preprocess the text data; the second step involves using word vectors to transform text sequences into vector feature representations for social texts, and performing three-class classification on continuous stock price sequences to transform them into discrete data representations for stock price sequences; the fourth step is to split the dataset, use the training samples to learn the parameters of the network model, and use the validation set to fine-tune the parameters; ¶¶ [0014]-[0020] with FIG. 2: the stock price information refers to the collected stock market closing price data, and the stock price closing sequence data refers to the stock price data after preprocessing the original stock price information; the preprocessed stock price closing sequence data is used as the input of the model; preferably, this refers to crawling the closing stock price information of the S&P 500 index on Yahoo Finance; social text information refers to information about stocks on various online social platforms, including Twitter, Weibo, WeChat, and others; preferably, it refers to social text information about stocks in the S&P 500; social text information refers to raw text information, while social text data refers to preprocessed text information; in the first step, text data preprocessing refers to the process of removing stop words, special symbols, and links from the crawled social text data (e.g., Twitter text data) because some words or characters contain no information; ¶¶ [0021]-[0027] with FIG.2: in the second step, the text sequence is 
transformed into a vector feature representation using word vectors, generated according to the following steps: (a) the preprocessed social text data is trained using the word vector model word2vec to learn the word vector representation of each word in the entire text database; the dimension of the word vector is denoted as De; (b) generate vector representations at the social text level; e.g., taking a social media post (Twitter) about a stock as an example, based on the obtained word vectors, average pooling is performed on each dimension of the word vectors of all words in the social media post; the word vector matrix of Nword words in the social text is then subjected to dimensional average pooling to obtain a De-dimensional social text representation; (c) generate a vector representation at the day level; e.g., consider social media text related to stocks on a particular day; after obtaining the social text-level vector representations (e.g., Twitter-level representations) of Ntweet social texts (e.g., Twitter) according to the method in step (b) above, for the day-level stock text matrix representation with Ntweet × De as the dimension, max, min, and average pooling operations are applied to each dimension of the word vectors to obtain a 3×De stock text representation for that day; realize the transformation of text sequences into vector feature representations using word vectors; in the second step, the operation of performing three-classification processing on the continuous stock price value sequence means that, for the crawled original closing stock price sequence features, if the closing price of the day is higher than the closing price of the previous day, then "+1" is used as the stock price feature of the day; otherwise, "-1" is used as the stock price feature of the day; if the closing price of the day is the same as the closing price of the previous day, then "0" is used as the stock price feature of the day; the continuous stock price
sequence features are transformed into a three-class sequence feature taken from {+1, 0, -1}; ¶ [0028] with FIG. 3: use an attention mechanism consisting of an encoder and a decoder: at the encoder end, the attention mechanism is used to select relevant external stocks, and at the decoder end, relevant sequence features are selected for the entire sequence; ¶¶ [0059] and [0066]-[0067] with FIG. 1: in the fourth step, splitting the dataset refers to dividing the entire stock price sequence data set into social text datasets (such as Twitter text datasets) according to time, using the split training set to train the model parameters, and using the validation set to fine-tune the parameters; during model training, the parameters are constrained by using a dropout network and L2 regularization of the parameters to prevent overfitting; in the second step, price data is classified into three categories, text information is pooled using word vectors, and the two parts of data are modeled separately; the key point is to use a long short-term memory network to obtain the hidden state of the memory unit and extract the relationship between stocks and sequences; ¶¶ [0069]-[0072] with FIG. 4: a stock price prediction system utilizing stock price information and social text information, which comprises an input representation unit to preprocess the original stock closing data and Twitter text data respectively, discretize the original stock closing data, and use word vectors to serialize the Twitter text data; ¶¶ [0080]-[0085] with FIG.
1: provide a method for prediction using social text and stock price sequence data; the first step is to select the corresponding Twitter social text dataset for the stock price closing sequence dataset required for the task, and perform preprocessing such as noise removal on the text data; the second step involves using word vectors to transform the preprocessed text sequence into vector feature representations, and then classifying the stock price sequence data into discrete data representations; the fourth step is to split the dataset, train the dataset, and fine-tune the parameters; ¶¶ [0086]-[0089] with FIG. 4: a stock price prediction system comprising: an input representation unit where (a) the original stock closing price data and Twitter text data are preprocessed separately; (b) the original stock closing price data is discretely classified into three categories; and (c) the vectorized representation of the Twitter text data is generated using word vectors; ¶¶ [0091]-[0100] with FIG.
2: scrape the stocks in the S&P 500 index from Yahoo Finance and extract the closing price of each stock each day; using the stock tag "$" as the crawling keyword, the Python framework tool tweepy was used to crawl Twitter text related to S&P 500 stocks; filter special characters from the crawled Twitter text, remove stop words that have no information content, and remove URL information that appears in a large amount of text; discretization of stock price series: (a) for the crawled stock price sequence data, if the closing price of the day is higher than the closing price of the previous day, "+1" is used as the stock price feature of the day; otherwise, "-1" is used; if the closing price of the day is the same as the closing price of the previous day, 0 is used as the stock price feature of the day; and (b) the continuous stock price sequence features are transformed into a three-class sequence feature derived from {+1, 0, -1}; vectorized representation of text: (a) first, for the denoised Twitter text, the word vector representation of each word in the text library is learned using the word vector model word2vec; (b) taking a stock tweet as an example, based on the obtained word vectors, perform average pooling on the word vectors of all words in each dimension; and (c) generate a vector representation at the day level; this vector representation serves as the text input representation of the model; ¶¶ [0122] and [0127] with FIGS. 
3-4: the entire dataset is split according to the timeline, with the training set, validation set, and test set in a ratio of 8:1:1; the training set is used to train and learn the parameters of the entire model, and the validation set is used to fine-tune the model parameters; during model training, to prevent overfitting, a dropout network and L2 regularization of the parameters are used to limit the training size of the parameters), each of the first and second modality streams respectively performing feature-level attention, intra-modal multi-head attention, and inter-modal multi-head attention, the encoder producing an output using the feature-level attention, the intra-modal multi-head attention, and the inter-modal multi-head attention and sending the output to the decoder (ZHANG, ¶¶ [0007]-[0013] with FIG. 1: jointly model stock prices and text, and bidirectionally calculate cross-modal attention weights; employs a cross-modal bidirectional attention mechanism to model stock price data and social text, effectively extracting important sequence information; the third step involves modeling the stock price sequence data and social text datasets such as Twitter using recurrent neural networks; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, allowing them to learn to extract stock price sequences and social text sequences relevant to the prediction target; ¶¶ [0028]-[0058] and [0068] with FIG.
1 and 3: in the third step, recurrent neural networks are used to model the stock price sequence data and the social text dataset, respectively; among them, the modeling of stock price sequence data is as follows: a recurrent neural network is modeled using external stock prices and target historical price sequence data; the core of the model is to use an attention mechanism consisting of an encoder and a decoder: at the encoder end, the attention mechanism is used to select relevant external stocks, and at the decoder end, relevant sequence features are selected for the entire sequence; use an attention mechanism at the encoding end to select relevant external stocks: (a) input a sequence of M external stocks of length T, [X1, …, XM], where each stock is a vector representation of length T; (b) calculate attention weights using the input; (c) attention weights are used to select external stock price features relevant to the prediction of stocks; this feature is used to update the state value of the memory cell/unit; at the decoding end, relevant sequence features are selected for the entire sequence: (a) the state features of the memory unit at each time step input from the encoding end and the input sequence features of the text module are used to select the sequence features related to the predicted value from the entire sequence using an attention mechanism, wherein the attention weight is calculated; (b) calculate the weighted sum representation of the state sequence using attention weights; the weighted sum of the state sequence is used to update the memory cell/unit state at the decoding end together with the historical time series of the target stock; in the third step, modeling the social text (Twitter text) dataset using a recurrent neural network refers to: modeling the vectorized social text sequence representation obtained from the preprocessing in the first step using a long short-term memory network; the input is [E1, …, ET], which
represents a sequence of T-length Twitter text vectors of the target stock, i.e., the vector representation of social text (e.g., Twitter text) after preprocessing according to the method in the second step; the weighted sequence sum representation Cd in the stock price sequence module is used to participate in the calculation of text attention weights; this text attention weight can be calculated as a weighted sum of features of the text sequence; feature Ctext is used to update the state of memory cells/units; in the third step, the use of a bidirectional cross-modal attention mechanism means that, at the decoding end of the stock price network module, the input features of the text [E1, …, ET] are used to help train the sequence attention weights; in the text network module, the weighted sum representation of the hidden states Cd calculated in the stock price module is used to update the text attention weights; therefore, at each moment in the sequence, both modules bidirectionally compute their respective attention weights using cross-modal data from each other; in the third step, the bidirectional cross-modal attention mechanism integrates stock price and text data; its core is the information interaction between the stock price module decoding end and the text module; the method adopted is to utilize the hidden state sequence of the decoding end and the input sequence of the text respectively; ¶¶ [0069]-[0072] with FIG. 4: a stock price prediction system utilizing stock price information and social text information, which comprises a text and price sequence modeling unit to perform sequence modeling on the stock price data and text data of the input representation, calculate the attention weight of the two data parts using mutual information, and select relevant input representations; ¶ [0083] with FIG.
1: the third step involves using recurrent neural networks to model the stock price sequence data and the Twitter text dataset, respectively; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, extracting sequence features relevant to the prediction target; ¶¶ [0086]-[0089] with FIG. 4: a stock price prediction system comprising: a text and price series modeling unit, where (a) a long short-term memory network is used to perform sequence modeling on the input representations of stock price data and text data; and (b) a bidirectional attention mechanism is used to select relevant input representations by calculating the attention weights of the two data parts using mutual information; ¶¶ [0101]-[0121] with FIGS. 1 and 3: use the LSTM module in TensorFlow as a basis to perform sequence modeling on the two parts of the sequence data; for stock price sequence data modeling, modeling a recurrent neural network using external stock price sequences and historical price sequences of the target stock; at the encoding end, (a) the input sequence consists of M external stocks of length T, [X1, …, XM], where each stock is a vector representation of length T; (b) calculate attention weights using the input; (c) attention weights are used to select external stock price features relevant to the prediction of stocks; at the decoding end, the state features of the memory unit/cell at each time step input from the encoder and the input sequence features of the text module are used to select the sequence features in the entire sequence that are relevant to the predicted value using an attention mechanism; attention weights are used to select the relevant memory cell states at the encoder end; for text sequence data modeling, after obtaining a vectorized text sequence representation through preprocessing, a Long Short-Term Memory (LSTM) network is used to model the text sequence; the initial input is [E1, …, ET], representing a sequence of
T-length Twitter text vectors of the target stock, which is the vector representation of the preprocessed Twitter text; the text input features and the weighted sequence representation Cd from the stock price sequence module are used to calculate the text attention weights) (NOTE: ZHANG only teaches (a) the second modality stream performing feature-level attention, intra-modal multi-head attention, and inter-modal multi-head attention; and (b) the first modality stream performing inter-modal multi-head attention); and in response to the inputting, receiving from the transformer an inferred variable related to the target series data (ZHANG, ¶ [0013] with FIG. 1: the fifth step is to use a network model based on bidirectional cross-modal attention to predict stock price trends in the target data; ¶¶ [0060]-[0065] with FIG. 1: in the fifth step, a network model based on bidirectional cross-modal attention is used to predict the target stock price trend; following the method in step three, obtain the stock price sequence and social text sequence related to the prediction target; i.e., obtain the hidden unit state of the stock price sequence module decoder and the text module at each moment, [h1^de, …, hT^de] and [h1^text, …, hT^text]; take the state representation of the last day from the two features in step (1) above and concatenate them to obtain [hT^de; hT^text]; prediction is performed using the spliced features, as follows: Y = σ(vo(Wo[hT^de; hT^text] + bo) + bv), where sigmoid is used as the activation function σ; vo, Wo, bo, and bv are the parameters that need to be trained in the network; ¶¶ [0069]-[0074] with FIG.
4: a stock price prediction system utilizing stock price information and social text information, which comprises a prediction generation unit to obtain the hidden state of the last day in the partial text and price sequence modeling from the text and price sequence modeling unit and splice it, connect it to the two-layer fully connected layer, and finally output the sigmoid activation; integrate the stock price module and the social text module to calculate their respective attention weights at each time step of the sequence, requiring the attention calculation sequence to be homogeneous; utilize social text sequence data, combined with the stock price information of the target stock and external stocks, to jointly predict the target stock trend, and based on the two-way attention interaction, it can select important stock price sequence and text sequence features respectively; ¶ [0085] with FIG. 1: the fifth step is to predict the stock price trend of the target data based on a network model with bidirectional cross-modal attention; ¶¶ [0086]-[0089] with FIG. 4: a stock price prediction system comprising: a prediction generation unit to obtain the hidden state of the long short-term memory network of the last day in the part of the text and price sequence modeling from the text and price series modeling unit, splice it, and then connect it to the output of the sigmoid activation; ¶¶ [0123]-[0126] with FIGS. 3-4: during the prediction process, the hidden unit states [h1^de, …, hT^de] and [h1^text, …, hT^text] at each time step of the stock price sequence module and the text module are used to extract the state features of the last time step and concatenate them into [hT^de; hT^text], which is then used for prediction: Y = σ(vo(Wo[hT^de; hT^text] + bo) + bv), where sigmoid is used as the activation function σ; vo, Wo, bo, and bv are the parameters that need to be trained in the network).
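By way of illustration only (this sketch is not part of ZHANG's disclosure or the claims), the cited operations can be summarized in code: the three-class discretization of a closing-price sequence into {+1, 0, -1}, and the prediction head under one plausible reading of ZHANG's equation Y = σ(vo(Wo[hT^de; hT^text] + bo) + bv). All function names and the toy parameter values below are hypothetical.

```python
import math

def discretize_prices(closes):
    """Three-class encoding of a closing-price sequence, per ZHANG's second
    step: +1 if the day's close exceeds the previous day's, -1 if it is
    lower, and 0 if equal."""
    return [1 if cur > prev else -1 if cur < prev else 0
            for prev, cur in zip(closes, closes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_trend(h_de, h_text, W_o, b_o, v_o, b_v):
    """Prediction head sketched from ZHANG's equation: concatenate the
    last-day hidden states of the price decoder and the text module,
    apply an affine layer (W_o, b_o), project with v_o, add b_v, and
    squash with a sigmoid."""
    h = h_de + h_text  # concatenation [h_T^de; h_T^text]
    z = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_o, b_o)]
    return sigmoid(sum(v * u for v, u in zip(v_o, z)) + b_v)
```

With toy hidden states and parameters, predict_trend returns a probability-like trend score in (0, 1), matching the role of the sigmoid output described in ZHANG's fifth step.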
ZHANG further discloses a computer program product comprising: a computer readable storage medium (excluding transitory signals per se; see ¶ [0020] of the specification) having program code embodied therewith, the program code executable by a processor to perform the method described above (ZHANG, ¶¶ [0007]-[0013], [0069]-[0072], and [0086]-[0089] with FIG. 4: inherited in the system used to train the stock price sequence data and social text datasets, and predict stock price trends in the target data). ZHANG further discloses a computer system comprising: a processor operatively coupled to memory, and an artificial intelligence (AI) platform operatively coupled to the processor, the AI platform comprising a transformer and one or more tools configured to interface with the transformer, including the method described above (ZHANG, ¶¶ [0007]-[0013], [0069]-[0072], and [0086]-[0089] with FIG. 4: inherited in the system used to train the stock price sequence data and social text datasets, and predict stock price trends in the target data). ZHANG fails to explicitly disclose each of the first and second modality streams respectively performing feature-level attention, intra-modal multi-head attention, and inter-modal multi-head attention.
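For context on the query/key/value attention terminology at issue in the claims and in the cited Chen reference (which builds on the scaled dot-product attention of Vaswani et al. [36]), the following is a minimal, illustrative sketch of single-head scaled dot-product attention. It is offered only as background; it is not asserted to be applicant's or either reference's implementation, and the function name is hypothetical.

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row, where Q, K, V are
    lists of equal-dimension vectors (queries, keys, and values)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compatibility of the query with each key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

In multi-head attention, Q, K, and V would first be linearly projected into h subspaces and this computation run in parallel per head, as Chen recounts in Section III.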
Chen teaches a system and method using an attention mechanism (Chen, Abstract on Page 4171), wherein each of the first and second modality streams respectively performing feature-level attention, intra-modal multi-head attention, and inter-modal multi-head attention (Chen, Abstract on Page 4171: first introduce the transformer-encoder with a self-attention mechanism and propose a Convolutional Neural Network-Transformer Encoder (CNN-TE) framework to model the temporal dependency for single modal affect recognition; further, to effectively consider the complementarity and redundancy between multiple streams, propose a Transformer Encoder with Multi-modal Multi-head Attention (TEMMA) for multi-modal affect recognition; TEMMA allows to progressively and simultaneously refine the inter-modality interactions and intra-modality temporal dependency; the learned multi-modal representations are fed to an Inference Sub-network with fully connected layers to estimate the affective state; Section I on Pages 4171-4172: the transformer, which relies entirely on a self-attention mechanism, demonstrated its effectiveness to draw global dependencies over a sequence; multi-modal transformer frameworks have been built upon the transformer-encoder with a pair-wise cross-modality attention module performing on the low-level descriptors or on the high-level representations; these frameworks considered temporal dynamics and multi-modal dynamic interactions separately; to explore the dynamic interactions between different modalities along with the temporal dependency, an intermediate level multi-modal fusion based on the transformer is proposed; the model constructs multiple feature streams, each consisting of factorized multi-modal features; then the multi-head attention is performed on each feature stream to learn multi-modal dynamic interactions along with temporal dependency; based on the transformer, address the dynamic temporal dependency and multi-modal fusion challenges of
continuous affect recognition, as follows: (a) to handle the long-range dependencies, propose to predict affective states by combining a one-dimensional CNN with the Transformer-Encoder (CNN-TE), wherein the CNN is used to locally aggregate context information, while the transformer-encoder models the long-range dependencies with the attention mechanism; (b) to deal with the dynamic importance of multi-modal feature streams, propose the Multi-modal Multi-head Attention (MMA), which can be easily inserted into the transformer-encoder to promote the dynamic interactions between different modalities and learn their correlations; (c) furthermore, suggest a Transformer-Encoder with Multi-modal Multi-head Attention (TEMMA) and propose a CNN-TEMMA framework, to progressively and simultaneously refine inter-modality interactions and intra-modality dependencies, wherein the CNN sub-network extracts and aggregates local temporal features via causal convolution, and the TEMMA sub-network models multi-modal interactions and temporal dynamics, with spatial (audio and visual features) and temporal attention, to learn high-level representations from each modality; and (d) finally, an inference sub-network, with fully connected layers, is used to predict the affective state from the learned multi-modal representations; Section II.B on Page 4173: [55] proposes the Tensor Fusion Network to learn intra-modality and inter-modality dynamics in an end-to-end fashion, wherein intra-modality dynamics is obtained via Modality Embedding Subnetworks, while inter-modality dynamics is obtained via the 3-fold Cartesian product between each modality's embeddings to capture bi-modal and tri-modal interactions; [57] proposes the Multimodal Transformer Networks (MTN) to generate conversational responses to queries of humans for a video-grounded dialogue system (VGDS), wherein MTN comprises three major components: (a) transformer-encoder layers to map each sequence of tokens (text, video) into a sequence of
continuous representations; (b) transformer-decoder layers to perform reasoning over multiple encoded features through a multi-head attention mechanism; and finally, (c) auto-encoder layers that are used to focus on query-related video features in an unsupervised manner; the authors of [38] extended the transformer with a pairwise cross-modal attention module to learn representations from different paired multi-modal features; the representations from each pair are further fed into different transformer-encoders to achieve temporal dependency; the authors of [39] proposed to first perform the transformer-encoder on each modality individually, then apply the pairwise multi-modal interaction on the high-level multi-modal representations; the authors of [40] propose an intermediate level multi-modal fusion based on the transformer-encoder; concretely, uni-modal, bi-modal and tri-modal feature sequences are formed as one feature sequence and reconstructed into multiple streams with factorized multi-modal features; then, a multi-head attention is performed on each factorized stream individually, and the output streams are added into one feature sequence in a point-wise manner; we propose an intermediate level multi-modal fusion scheme built upon the standard Transformer network to efficiently learn intra-modality dynamics and inter-modality dynamics; a novel Multi-modal Multi-head Attention (MMA) module, which learns relationships between several multi-modal sequences, is first presented; then the MMA module is followed by a temporal multi-head attention (TMA) module for each feature stream, composing a CNN and transformer-encoder with multi-modal multi-head attention (CNN-TEMMA) framework; this framework progressively learns multi-modal interactions and temporal dependency; the proposed CNN-TEMMA framework can be used not only for audiovisual continuous affect recognition, but also for other multi-modal feature learning tasks with dynamic sequences as inputs;
Section II.C in Pages 4173-4174 : for enforcing attention on important cues within a sequence, attention mechanisms have been integrated into sequence-to-sequence models; Vaswani et al. [36] proposed a self-attention based sequence to sequence model, namely the transformer which employs multi-head attention to calculate correlations over a temporal sequence; employ the transformer-encoder to model the temporal dynamics as it produces an attentive feature sequence with the same length as the input sequence, being vital for continuous affect recognition; Section III with FIGS. 1-2 in Pages 4174-4175 : introduce our proposed uni-modal affect recognition framework, which combines a CNN and a Transformer-Encoder (CNN-TE), as illustrated in Fig. 1; the extracted feature vectors are fed into the 1-dimensional temporal convolutional network for aggregating local temporal context information; the output of the 1-D CNN is input into the transformer-encoder to achieve long-range dependencies with dynamic attention weights; finally, an inference sub-network, with fully connected layers, is used to estimate the valence or arousal dimension of the affective state; a 1-D temporal convolution network is adopted to encode the temporal information from the input feature sequence; in the model, use causal convolutions to respect the temporal order during the learning process, wherein "causal" means that the activations computed for a particular time step do not depend on the activations from the future time steps; since the transformer-encoder contains neither recurrence nor convolution, to make use of the time step order of the feature vectors within the sequence, add the position information to the output of the 1-D temporal convolutional network; as shown in Fig. 
1, the transformer-encoder is composed of N identical blocks, each consisting of a multi-head attention module followed by a fully connected feed-forward module, which employs multi-head attention (MA) to calculate the temporal dependency (therefore named temporal MA (TMA)); as introduced in [36], the attention function can be described as performing a query on a set of key-value pairs to generate an output; the output is computed as a weighted sum of the values, where the weights are calculated by a compatibility function of the query with the corresponding key; Fig. 2 illustrates the multi-head attention module used in the transformer-encoder, which employs the scaled dot-product attention on the queries (Q), keys (K), and values (V) under each head; in practice, Q is a set of queries of the whole sequence, packed together in a matrix; the keys and values are also packed together into matrices K and V; Q, K, V are first projected h times into different subspaces head_i (i ∈ [1, h]) with different learned linear projections W_i^Q, W_i^K, W_i^V of dimensions d_k, d_k, d_v, respectively; then the scaled dot-product attention is performed in parallel on head_i of the queries, keys, and values; concretely, the scaled dot-product attention computes the dot products (MatMul) of the scaled query and keys; the result is multiplied by a pre-defined attention weights mask before applying the softmax function to obtain the weights A on the values; such an attention block can be mathematically described as eqn. (1), where ∗ represents the Hadamard product, and Mask refers to a bidirectional attention weights mask, which can be regarded as a hard attention mask making the encoder ignore the information from the far history and consider the near future; the multi-head attention module (TMA) is given as eqn. 
(2); apart from the multi-head attention module and the fully connected feed-forward module, the transformer-encoder contains a residual connection followed by a normalization layer (LN); as illustrated in Fig. 1, the transformer-encoder maps the input sequence of low-level descriptors to an output sequence of high-level representations; the learned high-level representations are fed to an Inference sub-network, composed of two fully connected layers with a non-linear activation layer in between, to estimate the arousal or valence dimension of the affective state; the number of nodes of the last fully connected layer is half of the dimension of the high-level representations; Section IV in Pages 4175-4176 with FIG. 4 in Page 4176 and FIG. 5 in Page 4171 : the proposed CNN and transformer-encoder with multi-modal multi-head attention (CNN-TEMMA) framework consists of three modules: a) the Input Embedding sub-network with 1D convolution, which outputs the embedded feature sequence, to which we inject the position information; b) the Multi-modal Encoder sub-network with N stacked identical encoder blocks, wherein for each block, insert the proposed multi-modal multi-head attention (MMA) module to model the inter-modality interactions, followed by the temporal multi-head attention (TMA) module to model the intra-modality dependencies; c) the Inference sub-network, in which the encoded high-level representations from the different modalities are concatenated into one feature vector linked to a fully connected deep neural network for affective state estimation; for each modality, the input embedding sub-network is the same as the one described in Section III-A for the uni-modal affect recognition; it consists of two parts: local context information aggregation via a 1D temporal convolutional network, and the position information embedding; the Multi-modal Encoder sub-network of Fig. 
4 is composed of N stacked encoder blocks, wherein the encoder block comprises a Multi-modal Multi-head Attention (MMA) module followed by the original Multi-head Attention (TMA) of Fig. 1; such an architecture not only dynamically enriches each modality with complementary information from the other modalities but also directs the TMA toward intra-modality dynamics; furthermore, by stacking the encoder blocks, the encoder could progressively refine the inter-modality interactions and intra-modality dynamics to learn high-level semantic representations with temporal contextual information as well as complementary multi-modal information; the Temporal Multi-Head Attention (TMA) module is performed individually on each modality to capture the temporal dependency by following the same principles as described in Section III-B; to obtain complementary information from different modalities, propose a Multi-Modal Multi-Head Attention (MMA) module to compute inter-modality interactions, as illustrated in Fig. 5, which is the same as that in the TMA module described in Section IV-B1, following the self-attention paradigm; the MMA will calculate the attention weights indicating the inter-modality correlations, which compose a matrix of weights in R^{m×m} at all time steps t ∈ [1, n]; the structure of the inference sub-network for multi-modal affect recognition is the same as that for the single-modal model; the only difference is the input feature vector, wherein the outputs of the multi-modal encoder sub-network, i.e., a set of high-level representations from multiple modalities, are concatenated to form an input feature vector to the inference sub-network to estimate the value of arousal or valence). ZHANG and Chen are analogous art because they are from the same field of endeavor, namely systems and methods using attention mechanisms. 
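For illustration only (not part of the cited disclosures), the TMA/MMA distinction quoted from Chen above reduces to the axis over which self-attention is computed: over time steps within one modality (TMA), or over modalities at a single time step (MMA), the latter yielding an m × m weight matrix at each time step. A minimal single-head numpy sketch follows; the shapes are illustrative assumptions and the learned projections W^Q, W^K, W^V are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention over the second-to-last axis of X."""
    d = X.shape[-1]
    W = softmax(X @ np.swapaxes(X, -1, -2) / np.sqrt(d))  # (..., L, L)
    return W @ X, W

# hypothetical sizes: m modalities, n time steps, feature dimension d
rng = np.random.default_rng(1)
m, n, d = 3, 12, 8
X = rng.standard_normal((m, n, d))

# TMA: temporal attention within each modality -> n x n weights per modality
tma_out, tma_w = self_attention(X)                      # weights: (m, n, n)

# MMA: inter-modality attention at each time step -> m x m weights per step;
# transposing puts the modality axis where attention is computed
mma_out, mma_w = self_attention(X.transpose(1, 0, 2))   # weights: (n, m, m)
print(tma_w.shape, mma_w.shape)  # (3, 12, 12) (12, 3, 3)
```

The same attention routine serves both roles; only the arrangement of the input tensor changes, which mirrors the quoted statement that the MMA follows the same self-attention paradigm as the TMA but with feature sequences organized in a multi-modal manner under each time step.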
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Chen to ZHANG. The motivation for doing so would be to extend the framework to other multi-modal feature learning tasks with dynamic sequences as inputs. Claim 2 ZHANG in view of Chen discloses all the elements as stated in Claim 1 and further discloses wherein one or more identified first and second temporal features identified by the intra-modal multi-head attentions of the first and second modality streams are input into the respective inter-modal multi-head attention to identify one or more cross modality relationships (ZHANG, ¶¶ [0007]-[0013] with FIGS. 1 and 3: jointly model stock prices and text, and to bidirectionally calculate cross-modal attention weights; employ a cross-modal bidirectional attention mechanism to model stock price data and social text, effectively extracting important sequence information; the stock price prediction method based on a bidirectional cross-modal attention network model; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, allowing them to learn to extract stock price sequences and social text sequences relevant to the prediction target; use a network model based on bidirectional cross-modal attention to predict stock price trends in the target data; ¶ [0039] with FIG. 
3: the state features of the memory unit at each time step input from the encoding end and the input sequence features of the text module are used to select the sequence features related to the predicted value from the entire sequence using an attention mechanism; ¶¶ [0058], [0060], and [0068] with FIGS. 1 and 3: the use of a bidirectional cross-modal attention mechanism means that, at the decoding end of the stock price network module, the input features of the text [E_1, …, E_T] are used to help train the sequence attention weights; in the text network module, the weighted sum representation of the hidden states C_d calculated in the stock price module is used to update the text attention weights; therefore, at each moment in the sequence, both modules bidirectionally compute their respective attention weights using cross-modal data from each other; a network model based on bidirectional cross-modal attention is used to predict the target stock price trend; the bidirectional cross-modal attention mechanism integrates stock price and text data; its core is the information interaction between the stock price module decoding end and the text module; the method adopted is to utilize the hidden state sequence of the decoding end and the input sequence of the text respectively; ¶¶ [0083] and [0085] with FIG. 3: involve using recurrent neural networks to model the stock price sequence data and the Twitter text dataset, respectively; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, extracting sequence features relevant to the prediction target; predict the stock price trend of the target data based on a network model with bidirectional cross-modal attention) (Chen, Section IV in Pages 4175-4176 with FIG. 4 in Page 4176 and FIG. 
5 in Page 4171 : for a given modality j, denote by Q_t^j the feature vector at time t, and Q^j = (Q_1^j, …, Q_t^j, …, Q_n^j) the full feature sequence; then K^j = V^j = Q^j are the input feature sequences to the TMA module of modality j (Fig. 5); for each modality j, the process can be described as eqn. (3), wherein i ∈ [1, h] denotes the i-th head of the total number of h heads, j ∈ [1, m] denotes the j-th modality, t ∈ [1, n] denotes the frame, and n the number of frames; the projections are parameter matrices W_{j,i}^{(T)Q} ∈ R^{d_model × d_k}, W_{j,i}^{(T)K} ∈ R^{d_model × d_k}, W_{j,i}^{(T)V} ∈ R^{d_model × d_v}, and W_{j,i}^{(T)O} ∈ R^{h·d_v × d_model}; the superscript (T) denotes matrices of the temporal multi-head attention module; propose an MMA module to compute inter-modality interactions, as illustrated in Fig. 5; let Q_t^j be the feature vector of modality j at time step t, and Q_t = (Q_t^1, …, Q_t^j, …, Q_t^m) the multi-modal feature sequence at time step t; then K_t = V_t = Q_t are the input multi-modal feature sequences to the MMA module at time step t (Fig. 5), which is the same as that in the TMA module (Section IV-B1) following the self-attention paradigm, except that here the input feature sequences are organized in a multi-modal manner under each time step; the MMA will calculate the attention weights indicating the inter-modality correlations, which compose a matrix of weights in R^{m×m} at all time steps t ∈ [1, n]; in practice, for each modality, the same as for TMA, the input is projected into multiple subspaces but with different parameters W^{(M)Q}, W^{(M)K}, W^{(M)V}, wherein the superscript (M) denotes matrices of the MMA module; then, at each time step t, construct the multi-modal queries, keys and values (Q_t^i, K_t^i, V_t^i) under subspace head_i followed by the scaled dot-product attention; finally, the output values under each subspace head_i are concatenated and linearly projected, resulting in the final values as shown in eqns. 
(4)-(6)). Claim 3 ZHANG in view of Chen discloses all the elements as stated in Claim 2 and further discloses wherein the identified one or more cross modality relationships and signals from the feature-level attentions are combined to produce the output of the encoder (Chen, Section IV.C in Page 4176 : the structure of the inference sub-network for multi-modal affect recognition is the same as that for the single modal model; the only difference is the input feature vector, wherein the outputs of the multi-modal encoder sub-network, i.e., a set of high-level representations from multiple modalities, are concatenated to form an input feature vector to the inference sub-network to estimate the value of arousal or valence) (ZHANG, ¶¶ [0007]-[0013] with FIG. 1: jointly model stock prices and text, and to bidirectionally calculate cross-modal attention weights; employs a cross-modal bidirectional attention mechanism to model stock price data and social text, effectively extracting important sequence information; the third step involves modeling the stock price sequence data and social text datasets such as Twitter using recurrent neural networks; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, allowing them to learn to extract stock price sequences and social text sequences relevant to the prediction target; ¶¶ [0028]-[0058] and [0068] with FIG. 
1 and 3: in the third step, recurrent neural networks are used to model the stock price sequence data and the social text dataset, respectively; among them, the modeling of stock price sequence data is as follows: a recurrent neural network is modeled using external stock prices and target historical price sequence data; the core of the model is to use an attention mechanism consisting of an encoder and a decoder: at the encoder end, the attention mechanism is used to select relevant external stocks, and at the decoder end, relevant sequence features are selected for the entire sequence; use an attention mechanism at the encoding end to select relevant external stocks: (a) input a sequence of M external stocks of length T, [X_1, …, X_M], where each stock is a vector representation of length T; (b) calculate attention weights using the input; (c) attention weights are used to select external stock price features relevant to the prediction of stocks; this feature is used to update the state value of the memory cell/unit; at the decoding end, relevant sequence features are selected for the entire sequence: (a) the state features of the memory unit at each time step input from the encoding end and the input sequence features of the text module are used to select the sequence features related to the predicted value from the entire sequence using an attention mechanism; wherein, the attention weight is calculated; (b) calculate the weighted sum representation of the state sequence using attention weights; the weighted sum of the state sequence is used to update the memory cell/unit state at the decoding end together with the historical time series of the target stock; in the third step, modeling the social text (Twitter text) dataset using a recurrent neural network refers to: modeling the vectorized social text sequence representation obtained from the preprocessing in the first step using a long short-term memory network; the input is [E_1, …, E_T], which 
represents a sequence of T-length Twitter text vectors of the target stock, i.e., the vector representation of social text (e.g., Twitter text) after preprocessing according to the method in the second step; the weighted sequence sum representation C_d in the stock price sequence module is used to participate in the calculation of text attention weights; this text attention weight can be calculated as a weighted sum of features of the text sequence; the feature C_text is used to update the state of memory cells/units; in the third step, the use of a bidirectional cross-modal attention mechanism means that, at the decoding end of the stock price network module, the input features of the text [E_1, …, E_T] are used to help train the sequence attention weights; in the text network module, the weighted sum representation of the hidden states C_d calculated in the stock price module is used to update the text attention weights; therefore, at each moment in the sequence, both modules bidirectionally compute their respective attention weights using cross-modal data from each other; in the third step, the bidirectional cross-modal attention mechanism integrates stock price and text data; its core is the information interaction between the stock price module decoding end and the text module; the method adopted is to utilize the hidden state sequence of the decoding end and the input sequence of the text respectively; ¶¶ [0069]-[0072] with FIG. 4: a stock price prediction system utilizing stock price information and social text information, which comprises a text and price sequence modeling unit to perform sequence modeling on the stock price data and text data of the input representation, calculate the attention weight of the two data parts using mutual information, and select relevant input representations; ¶ [0083] with FIG. 
1: the third step involves using recurrent neural networks to model the stock price sequence data and the Twitter text dataset, respectively; a bidirectional cross-modal attention mechanism is then used to fuse the two modules, extracting sequence features relevant to the prediction target; ¶¶ [0086]-[0089] with FIG. 4: a stock price prediction system comprising: text and price series modeling unit, where (a) a long short-term memory network is used to perform sequence modeling on the input representations of stock price data and text data; and (b) a bidirectional attention mechanism is used to select relevant input representations by calculating the attention weights of the two data parts using mutual information; ¶¶ [0101]-[0121] with FIGS. 1 and 3: use the LSTM module in TensorFlow as a basis to perform sequence modeling on the two parts of the sequence data; for stock price sequence data modeling, modeling a recurrent neural network using external