DETAILED ACTION
This action is responsive to the submission filed on 09/30/2025. Claims 1-20 are pending and have been examined.
This action is non-final.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 09/30/2025 has been entered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Argument 1: The applicant argues that claims 1 and 11 were amended to require (i) performing feature embedding on both a set of operations and the corresponding configuration settings, using a categorical mapping and a numerical mapping, to generate a feature embedded sequence that includes both categorical feature vectors and numerical feature vectors, and (ii) applying positional encoding and a series of attention functions on that feature embedded sequence including both the categorical and numerical feature vectors. The applicant contends that Kuo does not teach at least these amended elements because Kuo embeds only categorical variables and does not perform feature embedding on the numeric variables or apply the transformer and positional embedding to a sequence that includes the numeric feature vectors (as shown, for example, in Kuo’s figures). The applicant therefore requests withdrawal of the 35 U.S.C. 102 rejection of claims 1, 2, 11, and 12. The applicant further contends that the dependent 35 U.S.C. 103 rejections also fail because Gao and Vaswani do not supply the amended elements asserted to be missing from Kuo, such that the 35 U.S.C. 103 rejections of claims 3-7 and 13-17 over Kuo in view of Gao, and of claims 8-10 and 18-20 over Kuo in view of Vaswani, should likewise be withdrawn.
Examiner Response to Argument 1: The examiner has considered the argument set forth above but finds it unpersuasive, because Kuo expressly teaches forming an input representation that includes both embedded categorical features and numerical features, and further teaches applying transformer-style attention with positional encoding in that predictive modeling framework. In particular, Kuo states that “we represent the output of embedding layers concatenated together with any numerical inputs as X′,” and further explains that the embeddings “are then concatenated with the numeric predictors, which have been normalized in data pre-processing.” This teaches that categorical feature vectors produced by embedding are combined with numerical feature vectors (the normalized numeric predictors), thereby meeting the amended requirement of a feature embedded sequence including both categorical feature vectors and numerical feature vectors, and teaching a numerical mapping applied to the numerical variables (normalization). Kuo also states that “Model 6 uses a Transformer layer in place of self-attention and a positional encoding of a single dimension,” which teaches applying positional encoding and attention functions in the same predictive modeling context that uses the concatenated representation including numerical inputs. The amendments to independent claims 1 and 11 therefore do not overcome the rejection on the asserted basis; for the same reasons, dependent claims 2 and 12 remain unpatentable because they depend from claims 1 and 11 and incorporate the amended limitations.
Further, the applicant’s 35 U.S.C. 103 arguments for claims 3-7 and 13-17 (Kuo in view of Gao) and claims 8-10 and 18-20 (Kuo in view of Vaswani) rest on the premise that the amended elements are missing from Kuo and not supplied by the secondary references. That premise is not established, because Kuo already teaches the combined categorical and numerical feature representation and a transformer with positional encoding. In any event, Gao additionally teaches operation-level configuration settings and configuration representations, while Vaswani reinforces the well-known positional encoding and attention mechanisms. The pending rejections are therefore maintained.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections
under this section made in this Office action:
A person shall be entitled to a patent unless —
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise
available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent
published or deemed published under section 122(b), in which the patent or application, as the case may be, names
another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1, 2, 11, and 12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kuo et al., “Embeddings and Attention in Predictive Modeling” (referred to herein as Kuo).
Regarding claim 1, Kuo teaches:
A method of training a prediction engine, which is a transformer based neural network, for predicting performance of a neural network model executed on a hardware platform; ([Kuo, page 17, sec 5.1.2] “Model 6 uses a Transformer layer in place of self-attention and a positional encoding of a single dimension, i.e., dcol = 1”, wherein the examiner interprets “Transformer layer” to be the same as transformer based neural network because they are both directed to a neural-network architecture that employs stacked attention mechanisms, namely, several attention layers placed one on top of another, for sequence modelling).
comprising: receiving, by the prediction engine during training, a plurality of training neural networks compiled for the hardware platform, each training neural network including a plurality of layers and each layer defined by a set of operations and corresponding configuration settings of the operations; ([Kuo, page 4, sec 3.2] “Formally, a K-layer neural network is: z₁ = σ(a₁ X + b₁) … ŷ = σ(a_{K+1} z_K + b_{K+1}), where the regression parameters (weights) for each layer k ∈ [1; K + 1] are represented by the matrices a_k and the intercept terms are represented by b_k”, wherein the examiner interprets “matrices a_k and the intercept terms b_k” to be the same as a set of operations and corresponding configuration settings of the operations because they are both directed to the numerical parameters that configure the computation performed in each layer; in other words, every layer's weight matrix a_k and intercept b_k are embedded for each network layer).
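For illustration only, Kuo's K-layer formulation quoted above (zₖ = σ(aₖ z₍ₖ₋₁₎ + bₖ)) can be sketched in a few lines of numpy. This is an illustrative reading, not code from Kuo; the layer sizes and the choice of ReLU for σ are invented for the example.

```python
import numpy as np

def sigma(x):
    # Activation sigma(.); ReLU is assumed here purely for illustration.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    # z_k = sigma(a_k @ z_{k-1} + b_k) for each layer, ending with the
    # output layer y_hat = sigma(a_{K+1} z_K + b_{K+1}).
    z = x
    for a, b in zip(weights, biases):
        z = sigma(a @ z + b)
    return z

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]  # input width, two hidden layers, scalar output (arbitrary)
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
y_hat = forward(rng.normal(size=dims[0]), weights, biases)  # shape (1,)
```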
generating, by the prediction engine, a performance metric of executing each training neural network on the hardware platform; ([Kuo, page 15, sec 4.4], “we exhibit the cross-validated performance metrics of each of the models. We report both the root mean squared error (RMSE) and the mean absolute error (MAE); we note that these are just two of many potential metrics to use”, wherein the examiner interprets “root mean squared error” and “mean absolute error” to be the same as a performance metric of executing each training neural network on the hardware platform because they are both directed to quantitative measures that characterize the network’s execution accuracy.)
updating a categorical mapping and a numerical mapping of the prediction engine based on a difference between the performance metric and a simulated performance metric obtained from each training neural network, wherein the categorical mapping maps the operations of the training neural network to categorical feature vectors and the numerical mapping maps the corresponding configuration settings of the operations to numerical feature vectors; ([Kuo, page 4, sec 3.2 - page 5 sec 3.2.1] “First a loss function L(., .) is specified for the network that measures the difference between the observed data y and the predictions of the network yˆ, for example, the Mean Squared Error ((y - yˆ) 2 ). Then, the parameters of the network are changed such that the loss decreases (formally, this is done using the technique of backpropagation). Finally, training is stopped once the predictive performance of the network on unseen data is suitably good… An embedding layer is a neural network component which maps each level of the categorical data to a low dimensional vector of parameters that is learned together with the rest of the GLM or neural network that is used for the modeling problem.” AND [Kuo, page 2, sec 2] “The first of these was in a Property and Casualty (P&C) pricing context, it was shown that the out-of-sample accuracy of a neural network trained to predict claims frequencies on motor third party liability was enhanced by modeling the categorical variables within this dataset using embedding layers.”, wherein the examiner interprets “the parameters of the network are changed such that the loss decreases” to be the same as the updating process on [categorical] mapping, loss function L to be the same as the numerical mapping, and measuring of the difference between the observed data and the predictions to be the same as a difference between the performance metric and a simulated performance metric because they are all directed to adjusting model parameters in response to 
the discrepancy between actual and predicted performance. Furthermore, the examiner interprets “out-of-sample accuracy of a neural network trained to predict claims frequencies on motor third party liability was enhanced by modeling the categorical variables within this dataset” to be the same as performance metrics obtained from each training NN, because they are both directed to performance metrics for a NN).
performing, for each layer of the training neural network, feature embedding on the set of operations and the corresponding configuration settings using the categorical mapping and the numerical mapping, respectively, to generate a feature embedded sequence including both of the categorical feature vectors and the numerical feature vectors; ([Kuo, page 5, sec. 3.2.1] “The following equation shows how the first layer of a neural network incorporating embeddings might be written: z₁ = σ(a₁·X′ + b₁), where we represent the output of embedding layers concatenated together with any numerical inputs as X”, wherein the examiner interprets “output of embedding layers concatenated together with any numerical inputs” to be the same as a feature-embedded sequence of the categorical feature vectors and the numerical feature vectors).
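The mapping asserted above (embedding output concatenated with normalized numerical inputs to form X′) can be sketched as follows. This is illustrative only; the vocabulary, embedding width, and normalization constants are invented for the example and are not taken from Kuo.

```python
import numpy as np

EMBED_DIM = 3
op_vocab = {"conv": 0, "pool": 1, "relu": 2}   # toy categorical vocabulary
rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(len(op_vocab), EMBED_DIM))  # learned in practice

def embed_layer(op_name, config_values):
    # Categorical mapping: look up the operation's embedding vector.
    cat_vec = embedding_table[op_vocab[op_name]]
    # Numerical mapping: normalize the raw configuration settings, echoing
    # Kuo's "numeric predictors, which have been normalized in data pre-processing".
    num = np.asarray(config_values, dtype=float)
    num_vec = (num - num.mean()) / (num.std() + 1e-8)
    # X': embedding output concatenated together with the numerical inputs.
    return np.concatenate([cat_vec, num_vec])

x_prime = embed_layer("conv", [3, 3, 64])  # e.g. kernel height, width, channels
```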
applying positional encoding and a series of attention functions on the feature embedded sequence including both the categorical feature vectors and the numerical feature vectors to generate an encoded sequence; ([Kuo, page 17, sec 5.1.2] “Model 6 uses a Transformer layer in place of self-attention and a positional encoding of a single dimension, i.e., dcol = 1” AND [Kuo, page 3, sec 3] “which is the task of predicting an unknown outcome y on the basis of information about that outcome contained in predictor variables, or features, stored in a matrix X. For simplicity, we only consider the case of univariate outcomes, i.e., y ∈ R1. The outcomes and the rows of the predictor variable matrix are indexed by i ∈ {1...I}, where i represents a particular observation of (yi,xi), where bold indicates that we are now dealing with a vector.”, wherein the examiner interprets “positional encoding of a single dimension” to be the same as positional encoding and “Transformer layer” to be the same as a series of attention functions because they are both directed to injecting position information and repeatedly computing attention weights across the feature sequence. Furthermore, the examiner interprets “predictor variables, or features, stored in a matrix X. For simplicity, we only consider the case of univariate outcomes” to be the same as a sequence including categorical and numerical feature vectors, because they are directed to feature vectors and variables vectorized in a matrix).
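The positional-encoding step cited above can be illustrated with the standard sinusoidal form; Kuo's Model 6 uses a single-dimension positional encoding (dcol = 1), and the sketch below generalizes that idea. The sequence length and width are invented for the example, and this is not code from Kuo or Vaswani.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding (sin on even columns, cos on odd columns);
    # with d_model = 1 this reduces to a single-dimension encoding.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

seq = np.zeros((5, 4))  # 5 feature-embedded vectors of width 4 (toy sizes)
encoded_input = seq + positional_encoding(5, 4)  # added before the attention stack
```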
reducing dimensions of the encoded sequence to output the performance metric of executing the training neural network model on the hardware platform; ([Kuo, page 12, sec 4.3] “Specifically, we need to reduce the dimension of the data to 2D for embeddings with more than two dimensions”, wherein the examiner interprets “reduce the dimension of the data to 2D for embeddings with more than two dimensions” to be the same as reducing dimensions of the encoded sequence because they are both directed to compressing the high-dimensional encoded representations prior to producing the final performance metric).
Regarding claim 11, it is analogous to claim 1 with the exception of a method claim vs. a system claim. The amendment “for each training neural network” does not limit or change the scope of the claim. Therefore, the art above can be used to reject both claims.
Regarding claim 2, Kuo teaches:
The method of claim 1, wherein performing feature embedding further comprises: concatenating a first sequence of the categorical feature vectors for all layers of the neural network model and a second sequence of the numerical feature vectors to generate the feature embedded sequence; ([Kuo, page 9, sec. 4] “The categorical inputs go through one-dimensional embedding layers ... The embeddings are then concatenated with the numeric predictors, which have been normalized in data pre-processing, before being passed through a feedforward layer (with 8 hidden units and ReLU activation) to obtain a scalar value between 0 and 1, as constrained by a sigmoid output activation,” wherein the examiner interprets categorical “embeddings” to be the same as “categorical feature vectors” as they both involve mapping categorical data into vectors for use in downstream processing in the context of machine learning; the examiner also interprets “categorical inputs go through one-dimensional embedding layers...The embeddings are then concatenated with the numeric predictors” to be the same as “concatenating a first sequence of the categorical feature vectors for all layers of the neural network model and a second sequence of the numerical feature vectors to generate the feature embedded sequence”. This is further illustrated in the figure below from Kuo where “Concat” is interpreted to be the same as concatenation.).
[Figure reproduced from Kuo (media_image1.png, greyscale)]
Regarding claim 12, it is analogous to claim 2 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this
Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not
identically disclosed as set forth in section 102, if the differences between the claimed invention and the
prior art are such that the claimed invention as a whole would have been obvious before the effective filing
date of the claimed invention to a person having ordinary skill in the art to which the claimed invention
pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are
summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 3-7, 13-17 are rejected under 35 U.S.C. 103 as being unpatentable over Kuo et al., “Embeddings and Attention in Predictive Modeling” (referred to herein as Kuo) in view of Gao et al., “Resource-Guided Configuration Space Reduction for Deep Learning Models” (referred to herein as Gao).
Regarding claim 3, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach wherein each categorical feature vector corresponds to an operation group in the set of operations.
Gao teaches wherein each categorical feature vector corresponds to an operation group in the set of operations. ([Gao, page 176, sec. 2] “Each node represents the invocation of a mathematical operation called an operator (e.g., elementwise matrix addition). An edge delivers an output tensor and specifies the execution dependency. In this paper, we use the terms 'operator' and 'node' interchangeably since a node is completely determined by its invoked operator,” wherein the examiner interprets “Each node represents the invocation of a mathematical operation called an operator” to be the same as “each categorical feature vector corresponds to an operation group in the set of operations,” as “nodes” represent specific operations, which are interpreted to be the same as “categorical feature vectors corresponding to an operation group” because both classify and represent operations for downstream processing and abstract operations into identifiable entities for performance prediction.).
Kuo, Gao, and the instant application are analogous art because they are all directed to classifying and representing operations in neural networks for downstream processing and performance prediction.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the feature embedding disclosed by Kuo to include the process in which “Each node represents the invocation of a mathematical operation called an operator” taught by Gao. One would be motivated to do so to efficiently analyze and model execution dependencies between operations, enabling accurate performance prediction of neural network models, as suggested by Gao ([Gao, Page 176, sec. 2] “An edge delivers an output tensor and specifies the execution dependency”).
Regarding claim 13, it is analogous to claim 3 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Regarding claim 4, Kuo in view of Gao teaches The method of claim 3 (see rejection of claim 3).
Gao further teaches wherein the operation group includes one of: convolution, pooling, and an activation function. ([Gao, page 176, sec. 2] “Fig. 1a shows a simple TensorFlow training program using the Keras [24] API, which sets up a sequential model with the framework built-in Conv2D (2D convolution with a 3 x 3 kernel size), AvgPool2D (2D average pooling with the 'same' padding setting), Flatten (collapsing the input into one dimension without affecting the batch size), and Dense (fully connected layer with 64 units) operators (lines 4-7). The above filter size, padding, and number of units are hyperparameters, which are parameters to control the training process,” and [Gao, Fig. 1a] “activation = ReLU” (see Fig. 1a below), wherein the examiner interprets “Conv2D (2D convolution with a 3 x 3 kernel size)” to be the same as “convolution,” “AvgPool2D (2D average pooling with the 'same' padding setting)” to be the same as “pooling,” and “activation = ReLU” to be the same as “an activation function”).
[Gao, Fig. 1a reproduced (media_image2.png, greyscale)]
Kuo, Gao, and the instant application are analogous art because they are all directed to grouping operations in neural network models into specific types, such as convolution, pooling, and activation functions.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the feature embedding of operations disclosed by Kuo to include the “simple TensorFlow training program using the Keras [24] API, which sets up a sequential model with the framework built-in Conv2D, AvgPool2D ... and Dense (fully connected layer with 64 units) operators” taught by Gao. One would be motivated to do so to effectively standardize and categorize operations for efficient performance prediction, as suggested by Gao ([Gao, Page 176, sec. 2] “The above filter size, padding, and number of units are hyperparameters, which are parameters to control the training process”).
Regarding claim 14, it is analogous to claim 4 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Regarding claim 5, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach further comprising: training the feature embedding to map each operation to a categorical feature vector that has a trainable vector value and a predetermined embedding size.
Gao teaches further comprising: training the feature embedding to map each operation to a categorical feature vector that has a trainable vector value and a predetermined embedding size. ([Gao, page 1-2] “Another useful constraint is that the size of a model’s weights cannot exceed a certain upper bound. ... The inputs and outputs of such a computation graph and its nodes are tensors (multi-dimensional arrays of numerical values). The shape of a tensor is the element number in each dimension plus the element data type. Each node represents the invocation of a mathematical operation called an operator (e.g., elementwise matrix addition). An edge delivers an output tensor and specifies the execution dependency,” and [Gao, page 4] “execution of a DL model can be represented as iterative forward and backward propagation on its computation graph.” wherein the examiner interprets “Each node represents the invocation of a mathematical operation called an operator” to be the same as “map each operation to a categorical feature vector,” “iterative forward and backward propagation on its computation graph” to be the same as “training the feature embedding,” and “size of a model’s weights cannot exceed a certain upper bound” to be the same as “a predetermined embedding size”).
Kuo, Gao, and the instant application are analogous art because they are all directed to training feature embeddings to improve the representation of operations in a neural network model for performance prediction.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the feature embedding technique disclosed by Kuo to include the “execution of a DL model can be represented as iterative forward and backward propagation on its computation graph” taught by Gao. One would be motivated to do so to effectively improve the learning of trainable feature embeddings, as suggested by Gao ([Gao, page 4] “ Each node represents the invocation of a mathematical operation called an operator”).
Regarding claim 15, it is analogous to claim 5 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Regarding claim 6, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach wherein one or more of the numerical feature vectors indicate height, width, and number of channels in a corresponding convolution operation.
Gao teaches wherein one or more of the numerical feature vectors indicate height, width, and number of channels in a corresponding convolution operation; ([Gao, page 179, sec. 4] “The following symbols are used to denote the hyperparameters and tensor shapes. S_f is the size of input data type (e.g., 4 bytes for FLOAT32 data). N represents batch size. H_k and W_k are kernel (filter) height and width,” and [Gao, page 177, sec. 3] “Table I lists some commonly used hyperparameters with their domains,” wherein the examiner interprets “H_k and W_k are kernel (filter) height and width” to be the same as “numerical feature vectors indicate height and width,” and “Table I lists some commonly used hyperparameters with their domains” (see “output channels” in the table below) to be the same as “numerical feature vectors indicate...number of channels in a corresponding convolution operation”).
[Gao, Table I reproduced (media_image3.png, greyscale)]
Kuo, Gao, and the instant application are analogous art because they are all directed to representing convolutional operations using numerical feature vectors to describe their properties for downstream performance prediction.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the representation of convolution operations disclosed by Kuo to include the kernel dimensions “H_k and W_k are kernel (filter) height and width” taught by Gao. One would be motivated to do so to effectively enhance the representation of convolutional operations for improved model performance analysis, as suggested by Gao ([Gao, page 179, sec. 4] “The following symbols are used to denote the hyperparameters and tensor shapes.”).
Regarding claim 16, it is analogous to claim 6 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Regarding claim 7, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach wherein the performance metric includes one or more of: latency, execution cycles, and power consumption.
Gao teaches wherein the performance metric includes one or more of: latency, execution cycles, and power consumption; ([Gao, page 179, sec. 4] “In this paper, we consider four representative computational constraints with respect to the model, namely weight size, number of floating-point operations, inference time, and GPU memory consumption,” wherein the examiner interprets “inference time” to be the same as “latency” and “number of floating-point operations” to be the same as “execution cycles”).
Kuo, Gao, and the instant application are analogous art because they are all directed to predicting performance metrics of neural network models based on computational constraints and operational characteristics.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method for predicting performance metrics disclosed by Kuo to include the “inference time” taught by Gao. One would be motivated to do so to effectively predict latency as part of the performance metrics, as suggested by Gao ([Gao, page 179, sec. 4] “In this paper, we consider four representative computational constraints with respect to the model, namely weight size, number of floating-point operations, inference time [latency], and GPU memory consumption”).
Regarding claim 17, it is analogous to claim 7 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
Claims 8-10, 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kuo in view of NPL reference “Attention Is All You Need” by Vaswani et al. (referred to herein as Vaswani).
Regarding claim 8, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach wherein reducing the dimensions of the encoded sequence further comprises: reducing the dimensions of the encoded sequence using a series of fully-connected layers.
Vaswani teaches wherein reducing the dimensions of the encoded sequence further comprises: reducing the dimensions of the encoded sequence using a series of fully-connected layers; ([Vaswani, page 2, sec. 3] “The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively. ... In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality,” wherein the examiner interprets “point-wise, fully connected layers” to be the same as “using a series of fully-connected layers” and “due to the reduced dimension of each head” to be the same as “reducing the dimensions of the encoded sequence”).
Kuo, Vaswani, and the instant application are analogous art because they are all directed to reducing the dimensions of encoded sequences to optimize neural network operations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method for reducing dimensions of encoded sequences disclosed by Kuo to include the “point-wise, fully connected layers for both encoder and decoder” taught by Vaswani. One would be motivated to do so to efficiently reduce the computational complexity of encoding operations, as suggested by Vaswani ([Vaswani, page 2, sec. 3] “Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality”).
Regarding claim 18, it is analogous to claim 8 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
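The dimension-reduction reading applied above can be illustrated with a short sketch of a series of fully connected layers squeezing a flattened encoded sequence down to a scalar. This is illustrative only; the layer widths are invented, and this is not code from Kuo or Vaswani.

```python
import numpy as np

rng = np.random.default_rng(2)

def dense(x, w, b):
    # One fully connected layer with ReLU activation.
    return np.maximum(0.0, x @ w + b)

encoded = rng.normal(size=(5, 8))         # 5 positions, model width 8 (toy sizes)
x = encoded.reshape(-1)                   # flatten to 40 features
w1, b1 = rng.normal(size=(40, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
metric = dense(dense(x, w1, b1), w2, b2)  # scalar performance estimate, shape (1,)
```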
Regarding claim 9, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach wherein the series of attention functions include a series of multi-head attention functions that identify correlations among vectors in the sequence.
Vaswani teaches wherein the series of attention functions include a series of multi-head attention functions that identify correlations among vectors in the sequence. ([Vaswani, page 2-3, sec. 3.1] “The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position wise fully connected feed-forward network,” wherein the examiner interprets “multi-head self-attention mechanism” to be the same as “a series of multi-head attention functions” and “The encoder is composed of a stack of N = 6 identical layers” to be the same as “the series of attention functions”).
Kuo, Vaswani, and the instant application are analogous art because they are all directed to the use of attention mechanisms in neural networks to identify relationships among inputs in a sequence.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of employing attention functions disclosed by Kuo to include the “multi-head self-attention mechanism” taught by Vaswani. One would be motivated to do so to efficiently enhance the ability to identify correlations among vectors in a sequence, as suggested by Vaswani ([Vaswani, page 2-3, sec. 3.1] “The encoder is composed of a stack of N = 6 identical layers ... fully-connected feed-forward network”).
Regarding claim 19, it is analogous to claim 9 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
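The multi-head self-attention mechanism cited above can be sketched as follows. The learned Q/K/V projections are replaced by the identity to keep the sketch short, and the sizes are invented; this is not Vaswani's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads):
    # Split the model width into heads, attend within each head, re-concatenate.
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]
        scores = softmax(q @ k.T / np.sqrt(d_head))  # pairwise correlations
        heads.append(scores @ v)
    return np.concatenate(heads, axis=1)

x = np.random.default_rng(3).normal(size=(5, 8))
out = multi_head_self_attention(x, n_heads=2)  # same shape as the input, (5, 8)
```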
Regarding claim 10, Kuo teaches The method of claim 1 (see rejection of claim 1).
Kuo does not teach further comprising: adding input and output of each attention function to generate a sequence of sums; and normalizing the sequence of sums to output to a feed-forward network.
Vaswani teaches further comprising: adding input and output of each attention function to generate a sequence of sums; and normalizing the sequence of sums to output to a feed-forward network. ([Vaswani, page 7, sec. 5.4] “We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized,” and [Vaswani, page 2-3, sec. 3.1] “The first is a multi-head self-attention mechanism, and the second is a simple, position wise fully connected feed-forward network. We employ a residual connection [i.e. adding input and output] [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself,” wherein the examiner interprets “the output of each sub-layer, before it is added to the sub-layer input” and “residual connection” to be the same as “adding input and output of each attention function to generate a sequence of sums” where specifically, “residual connection” is interpreted to be the same as “adding input and output” (see “Figure 1: The Transformer – model architecture” below). Also, the examiner further interprets “normalized” and “position wise fully connected feed-forward network … LayerNorm(x + Sublayer(x))” to be the same as “normalizing the sequence of sums to output to a feed-forward network”).
[Vaswani, Figure 1 (“The Transformer – model architecture”) reproduced (media_image4.png, greyscale)]
Kuo, Vaswani, and the instant application are analogous art because they are all directed to improving the processing and transformation of input data in neural networks through attention mechanisms and normalization.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method for adding and normalizing inputs disclosed by Kuo to include the process of determining the “residual connection [i.e. adding input and output]” taught by Vaswani. One would be motivated to do so to efficiently combine input and output for enhanced model stability and performance, as suggested by Vaswani ([Vaswani, page 7, sec. 5.4] “we apply dropout ... before it is added to the sub-layer input and normalized”).
Regarding claim 20, it is analogous to claim 10 with the exception of a method claim vs. a system claim. Therefore, the art above can be used to reject both claims.
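The LayerNorm(x + Sublayer(x)) pattern quoted above can be sketched directly. The sub-layer here is a stand-in function rather than a real attention layer, and the sizes are invented for illustration; this is not Vaswani's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's features to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)): add the sub-layer's input and output
    # (the residual connection), then normalize the sequence of sums.
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(4).normal(size=(5, 8))
out = sublayer_block(x, lambda t: 0.5 * t)  # stand-in for an attention sub-layer
```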
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703)756-1434. The examiner can normally be reached Monday - Friday: 9:00AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126