DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
Regarding 112(b):
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 9, 14, 15, and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
In regard to Claim 9:
The term “large” in claim 9 is a relative term which renders the claim indefinite. The term “large” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph 32 of the specification gives some indication of what is meant by a large negative number, but the term “large” leaves ambiguity in how a “large negative number” should be interpreted. For example, a large negative number could be the number with the greater numeric value, in which case -1 would be greater than -1000. Alternatively, a large negative number could be the number with the greater absolute value, in which case -1000 would be greater than -1. To correct the indefiniteness, an indication of what is meant by “large” can be added to the claims. A possible interpretation, supported by paragraph 32, is that “a large negative value” is a value that is more negative than any value in the tensor (one of the broadest examples given in paragraph 32). Such an indication would both establish that “large” means “more negative” and set a minimum for how large the value must be.
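To illustrate the ambiguity, the two readings described above can be contrasted in a short Python sketch (illustrative only; the function names and values are hypothetical and not drawn from the application):

```python
# Two competing readings of "a large negative number", as discussed above.

def larger_by_value(a, b):
    # Reading 1: "larger" means greater numeric value, so -1 beats -1000.
    return a if a > b else b

def larger_by_magnitude(a, b):
    # Reading 2: "larger" means greater absolute value, so -1000 beats -1.
    return a if abs(a) > abs(b) else b

# The two readings pick opposite values for the same pair of numbers.
```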
In regard to claim 14:
Claim 14 recites the limitation "a first tensor X having dimensions […, P, …, Q, …] and a second tensor Y having dimensions […, Q, …, R, …]". There is insufficient antecedent basis for this limitation in the claim. No definition is given for the variables P, Q, or R, nor is there an indication of what is meant by “…”. Defining the variables, or rewriting the claim to express the intended limitation without undefined variables, is required.
In regard to claim 15:
Claim 15 is rejected under 112(b) for the same reasons set forth for claim 14.
In regard to claim 20:
Claim 20 recites the limitation "configured to cause the method as set forth in claim 1 to be performed when the code is run on at least one processor". There is insufficient antecedent basis for this limitation in the claim. Claim 1 indicates processing is performed by a “neural network accelerator comprising fixed function hardware”. However, in claim 20, the processing is being performed on “at least one processor”. One cannot determine whether the “neural network accelerator comprising fixed function hardware” is intended to be “at least one processor” or whether the two items are intended to be two separate processing devices. Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed towards an abstract idea without significantly more.
In regards to Claim 1:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a method, so a process.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 1 recites the following abstract ideas:
padding the first input sequence with padding values to produce a first padded input sequence of a first fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
The process of padding an input can be performed by a person with pen and paper. For example, a person may pad a value with zeros when writing a monetary amount (e.g., $04.00).
generating a first padding mask identifying the part of the first padded input sequence that contains the padding values
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a first attention mask from the first padding mask, wherein the generating comprises an outer product operation applied to the first padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
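The three limitations above can be sketched numerically; the lengths, fill values, and mask conventions below are assumptions for illustration, not drawn from the application:

```python
import numpy as np

FIXED_LEN = 6
seq = [3.0, 1.0, 4.0]                 # a first input sequence

# Pad the sequence to the fixed length with padding values (zeros here).
padded = np.zeros(FIXED_LEN)
padded[:len(seq)] = seq

# Padding mask: 1 where the padded sequence holds real data, 0 where padding.
pad_mask = np.zeros(FIXED_LEN)
pad_mask[:len(seq)] = 1.0

# Attention mask via an outer product of the padding mask with itself:
# position (i, j) is kept only if both i and j index real elements.
attn_mask = np.outer(pad_mask, pad_mask)
```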
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 1 recites the following additional elements:
A method of implementing, using a neural network accelerator comprising fixed-function hardware, inference using an attention-based neural network, the method comprising
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
receiving a first input sequence for the attention-based neural network
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
processing, by the fixed-function hardware, the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 1 recites the following additional elements:
A method of implementing, using a neural network accelerator comprising fixed-function hardware, inference using an attention-based neural network, the method comprising
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a neural network accelerator appears to be an implementation of the abstract idea on a computer, so merely using a computer as a tool to perform the abstract idea.
The limitation of a neural network accelerator does not limit the type of hardware applicable for implementing the claim. Even when interpreted in light of the specification using paragraph 14, a neural network accelerator appears to correspond to a generic processor used with a limited set of commands, or simply to the use of a processor's instruction set. Specifying what hardware the neural network accelerator is, such as a field-programmable gate array or some other form of non-generic processor, would help indicate that the claims are directed to a particular machine.
receiving a first input sequence for the attention-based neural network
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
processing, by the fixed-function hardware, the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “processing, by the fixed-function hardware…” to perform inference does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 2:
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 2 recites the following additional elements:
wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 2 recites the following additional elements:
wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). Noting the neural network comprises a decoder does not integrate the invention into a practical application.
In regards to Claim 3:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 3 recites the following abstract ideas:
wherein, at each iteration the decoder produces an output sequence, and, at each iteration other than an initial iteration, the input to the decoder comprises the output sequence from the preceding iteration
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 3 recites the following additional elements:
executing the decoder for a number of iterations equal to the first fixed length
This limitation is directed towards the insignificant extra solution activity of repetitive calculations (see MPEP § 2106.05(d)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 3 recites the following additional elements:
executing the decoder for a number of iterations equal to the first fixed length
This limitation is directed towards the insignificant extra solution activity of repetitive calculations (see MPEP § 2106.05(d)). Repetitive calculations are considered a well understood, routine, and conventional activity acknowledged by the courts (see MPEP § 2106.05(d) subsection 2 example 2 for a computer).
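The iterative execution recited in claim 3 can be sketched as follows; the decoder below is a hypothetical stub standing in for the claimed network, used only to show each iteration's output being fed back as the next iteration's input:

```python
# Iterative decoding sketch: at each iteration the decoder output becomes
# the input to the next iteration.

def stub_decoder(inputs):
    # Hypothetical stand-in for the claimed decoder: appends one token.
    return inputs + [len(inputs)]

def run_decoder(fixed_length):
    outputs = []                          # input to the initial iteration
    for _ in range(fixed_length):         # number of iterations = fixed length
        outputs = stub_decoder(outputs)   # feed back the previous output
    return outputs
```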
In regards to Claim 4:
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 4 recites the following additional elements:
the attention based neural network comprises an encoder, and wherein the first input sequence is an input for the encoder
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 4 recites the following additional elements:
the attention based neural network comprises an encoder, and wherein the first input sequence is an input for the encoder
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). Noting the neural network comprises an encoder does not integrate the invention into a practical application.
In regards to Claim 5:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 5 recites the following abstract ideas:
padding the second input sequence with padding values to produce a second padded input sequence of a second fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a second padding mask identifying the part of the second padded input sequence that contains the padding values
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a second attention mask from the second padding mask, wherein the generating comprises an outer product operation applied to the second padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 5 recites the following additional elements:
wherein the attention-based neural network further comprises a decoder, the method further comprising
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
receiving a second input sequence, wherein the second input sequence is an input for the decoder
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
wherein the method comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence and the second attention mask using the decoder, to perform the inference
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 5 recites the following additional elements:
wherein the attention-based neural network further comprises a decoder, the method further comprising
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). Noting the neural network comprises a decoder does not integrate the invention into a practical application.
receiving a second input sequence, wherein the second input sequence is an input for the decoder
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
wherein the method comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence and the second attention mask using the decoder, to perform the inference
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “processing, by the fixed-function hardware…” to perform inference does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 6:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 6 recites the following abstract ideas:
generating a cross-attention mask from the first padding mask and the second padding mask, comprising an outer product of the first padding mask with the second padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
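The cross-attention mask generation recited in claim 6 can be sketched numerically; the mask lengths and values below are illustrative assumptions, not drawn from the application:

```python
import numpy as np

# Cross-attention mask as the outer product of a decoder-side padding mask
# with an encoder-side padding mask. Entry (i, j) is kept only if decoder
# position i and encoder position j both hold real (non-padding) data.
dec_pad_mask = np.array([1.0, 1.0, 0.0])         # second (decoder) padding mask
enc_pad_mask = np.array([1.0, 1.0, 1.0, 0.0])    # first (encoder) padding mask

cross_mask = np.outer(dec_pad_mask, enc_pad_mask)  # shape (3, 4)
```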
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 6 recites the following additional elements:
wherein the method further comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence, the second attention mask, and the cross-attention mask using the decoder, to perform the inference
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 6 recites the following additional elements:
wherein the method further comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence, the second attention mask, and the cross-attention mask using the decoder, to perform the inference
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “processing, by the fixed-function hardware…” to perform inference does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 7:
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 7 recites the following additional elements:
modifying the generated first attention mask to ignore one or more elements of the first input sequence
At a high level of generality, this is an activity of using an element as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 7 recites the following additional elements:
modifying the generated first attention mask to ignore one or more elements of the first input sequence
At a high level of generality, this is an activity of using an element as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “apply” or an equivalent does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”. Reciting “modifying” without indicating what the modification entails makes the limitation a generic application of a modification.
In regards to Claim 8:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a method, so a process.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 8 recites the following abstract ideas:
the attention-based neural network comprises a scaled dot-product attention calculation
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
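Scaled dot-product attention, as recited in claim 8, is conventionally formulated as softmax(QKᵀ/√d_k)V in the transformer literature; the sketch below uses that conventional formulation, which is assumed rather than quoted from the application:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- the conventional transformer
    # formulation, assumed for illustration.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.eye(2)
K = np.eye(2)
V = np.array([[1.0, 0.0], [0.0, 1.0]])
out = scaled_dot_product_attention(Q, K, V)
```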
In regards to Claim 9:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 9 recites the following abstract ideas:
wherein the first attention mask comprises or consists of: a plurality of zeros, in locations corresponding to the elements of the first input sequence; and one or more large negative values, in locations corresponding to the padding values of the first padded input sequence
This limitation is directed towards the continuation of the abstract ideas of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3) from claim 1.
In regards to Claim 10:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 10 recites the following abstract ideas:
wherein the attention-based neural network comprises a Softmax function
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 10 recites the following additional elements:
wherein the processing comprises adding the first attention mask to an input to the Softmax function
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 10 recites the following additional elements:
wherein the processing comprises adding the first attention mask to an input to the Softmax function
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
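The combined effect of the mask values of claim 9 and the Softmax addition of claim 10 can be sketched as follows; the choice of -1e9 as the “large negative value” is an assumption for illustration, not taken from the application:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])   # pre-Softmax attention scores
# Zeros at real positions, a large negative value at padded positions.
mask = np.array([0.0, 0.0, -1e9, -1e9])

# Adding the mask to the Softmax input drives the attention weights of
# padded positions to (numerically) zero.
weights = softmax(scores + mask)
```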
In regards to Claim 11:
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 11 recites the following additional elements:
wherein the attention-based neural network comprises a transformer network
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 11 recites the following additional elements:
wherein the attention-based neural network comprises a transformer network
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). Noting the neural network comprises a transformer does not integrate the invention into a practical application.
In regards to Claim 12:
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 12 recites the following additional elements:
wherein the attention-based neural network comprises a layer normalisation
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 12 recites the following additional elements:
wherein the attention-based neural network comprises a layer normalisation
At a high level of generality, this is a continuation of an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). Noting the neural network comprises a layer normalisation does not integrate the invention into a practical application.
In regards to Claim 13:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 13 recites the following abstract ideas:
and evaluating said plurality of elementary neural network operations using the fixed-function hardware, wherein each of the plurality of elementary neural network operations is selected from the list consisting of: a convolution operation; an element-wise subtraction operation; an element-wise multiplication operation; a reciprocal operation; a square root operation; an element-wise division operation; a rectified linear activation function; a local response normalisation; and an element-wise addition
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
The evaluations of the different operations are seen as performable in the human mind (evaluation of the operations) or as mathematics (particular operations such as element-wise multiplication, square root, etc.). The neural network accelerator is treated as a generic processor, as indicated for claim 1.
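The mapping recited in claim 13 can be sketched by composing a layer normalisation from operations of the kinds listed; the particular decomposition below is an illustrative assumption, not the application's mapping:

```python
import numpy as np

def layernorm_from_elementary_ops(x, eps=1e-5):
    # Mean as a convolution-style weighted sum with constant weights 1/D.
    d = x.shape[-1]
    avg_filter = np.full((d, 1), 1.0 / d)
    mean = x @ avg_filter                         # convolution (1xD filter)
    centered = x - mean                           # element-wise subtraction
    var = (centered * centered) @ avg_filter      # multiplication + convolution
    inv_std = np.reciprocal(np.sqrt(var + eps))   # square root + reciprocal
    return centered * inv_std                     # element-wise multiplication

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layernorm_from_elementary_ops(x)
```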
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the application does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 13 recites the following additional elements:
wherein the fixed-function hardware is configured to perform a set of available elementary neural network operations, the method comprising
At a high level of generality, this is a continuation of an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)).
mapping the layer normalisation to a representation comprising a plurality of elementary neural network operations from the set of available elementary neural network operations
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 13 recites the following additional elements:
wherein the fixed-function hardware is configured to perform a set of available elementary neural network operations, the method comprising
At a high level of generality, this is a continuation of an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a computer or computer parts, such as the neural network accelerator, appears to be an implementation of the abstract idea on a computer, so merely using a computer as a tool to perform the abstract idea.
mapping the layer normalisation to a representation comprising a plurality of elementary neural network operations from the set of available elementary neural network operations
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “mapping” to network operations does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 14:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite an abstract idea.
Claim 14 recites the following abstract ideas:
wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions […, P, …, Q, …] and a second tensor Y having dimensions […, Q, …, R, …], the method further comprising
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
evaluating the graph of neural network operations to thereby evaluate the matrix multiplication operation, wherein the at least one convolution operation is evaluated in the fixed-function hardware
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 14 recites the following additional elements:
mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 14 recites the following additional elements:
mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “mapping” to network operations does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
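Purely as an illustrative aside (not part of the claim mapping), the recited matrix multiplication between tensors with dimensions [..., P, ..., Q, ...] and [..., Q, ..., R, ...] corresponds to a batched matrix product contracting over the shared Q dimension. A minimal NumPy sketch, with hypothetical batch size and shapes:

```python
import numpy as np

# Hypothetical shapes: X is [batch, P, Q] and Y is [batch, Q, R]
X = np.random.rand(2, 3, 4)
Y = np.random.rand(2, 4, 5)

# Batched matrix product contracting over the shared Q dimension
Z = np.einsum('bpq,bqr->bpr', X, Y)
assert Z.shape == (2, 3, 5)  # result is [batch, P, R]
```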
In regards to Claim 15:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a method, and therefore a process.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite a(n) abstract idea.
Claim 15 recites the following abstract ideas:
wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions [..., P,..., Q,...] and a second tensor Y having dimensions [..., Q,...,R,...], the method further comprising
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
and evaluating the graph of neural network operations to thereby evaluate the matrix multiplication operation, wherein the at least one element-wise operation is evaluated in the fixed-function hardware
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 15 recites the following additional elements:
mapping the matrix multiplication operation to a graph of neural network operations including at least one element-wise operation
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 15 recites the following additional elements:
mapping the matrix multiplication operation to a graph of neural network operations including at least one element-wise operation
At a high level of generality, this is an activity of mapping to network operations as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “mapping” to network operations does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 16:
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite a(n) abstract idea.
Claim 16 recites the following abstract ideas:
determining a length of the further input sequence
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation and observation.
identifying that said length is longer than the first fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as judgement.
padding the further input sequence with padding values to produce a further padded input sequence of the further fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a further padding mask identifying the part of the further padded input sequence that contains the padding values
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a further attention mask from the further padding mask, wherein the generating comprises an outer product operation applied to the further padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 16 recites the following additional elements:
receiving a further input sequence
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
responsive to said identifying, loading into the neural network accelerator a representation of a further attention-based neural network, wherein the further attention-based neural network is associated with a further fixed length, the further fixed length being longer than the length of the further input sequence, the method further comprising
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
processing, by the fixed-function hardware, the further padded input sequence and the further attention mask to perform the inference using the further attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 16 recites the following additional elements:
receiving a further input sequence
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
responsive to said identifying, loading into the neural network accelerator a representation of a further attention-based neural network, wherein the further attention-based neural network is associated with a further fixed length, the further fixed length being longer than the length of the further input sequence, the method further comprising
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
This limitation is seen as giving a neural network as input to the accelerator.
processing, by the fixed-function hardware, the further padded input sequence and the further attention mask to perform the inference using the further attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “processing, by the fixed-function hardware…” to perform inference does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 17:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a process.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite a(n) abstract idea.
Claim 17 recites the following abstract ideas:
padding each first training input sequence with padding values to produce a respective first padded input sequence of a first fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating, for each first padded input sequence, a respective first padding mask identifying the part of the first padded input sequence that contains the padding values
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generating a first attention mask from each first padding mask, wherein the generating comprises an outer product operation applied to the first padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 17 recites the following additional elements:
A computer-implemented method for training an attention-based neural network for hardware implementation, the method comprising
At a high level of generality, this is an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)).
obtaining a dataset of first training input sequences for the attention-based neural network, wherein the dataset includes first training input sequences of varying length
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
training the attention-based neural network using the first padded input sequences and the first attention masks
At a high level of generality, this is an activity of training as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 17 recites the following additional elements:
A computer-implemented method for training an attention-based neural network for hardware implementation, the method comprising
At a high level of generality, this is an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a computer or computer parts appear to be an implementation of the abstract idea on a computer, amounting to merely using a computer as a tool to perform the abstract idea.
obtaining a dataset of first training input sequences for the attention-based neural network, wherein the dataset includes first training input sequences of varying length
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
training the attention-based neural network using the first padded input sequences and the first attention masks
At a high level of generality, this is an activity of training as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “training” using input sequences and masks does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 18:
Claim 18 is analogous to claim 1 and is rejected for the same reasons as claim 1.
In regards to Claim 19:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a machine.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite a(n) abstract idea.
Claim 19 recites the following abstract ideas:
pad the first input sequence with padding values to produce a first padded input sequence of a first fixed length
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generate a first padding mask identifying the part of the first padded input sequence that contains the padding values
This limitation is directed towards the abstract idea of a mental process, or a concept performed in the human mind, including observation, evaluation, judgement or opinion (see MPEP 2106.04(a)(2) subsection 3). Here the limitation is seen as evaluation.
generate a first attention mask from the first padding mask, comprising an outer product operation applied to the first padding mask
This limitation is directed towards the abstract idea of a mathematical concept (see MPEP 2106.04(a)(2) subsection 1).
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 19 recites the following additional elements:
a mapping unit configured to: receive a first input sequence for the attention-based neural network
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)).
a neural network accelerator comprising fixed-function hardware configured to process the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 19 recites the following additional elements:
a mapping unit configured to: receive a first input sequence for the attention-based neural network
This limitation is directed towards the insignificant extra solution activity of mere data gathering (see MPEP § 2106.05(g)). This is a well understood, routine, conventional activity of transmitting data (see MPEP 2106.05(d) example i in computer functions).
a neural network accelerator comprising fixed-function hardware configured to process the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network
At a high level of generality, this is an activity of using a neural network accelerator as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a generic recitation of “processing, by the fixed-function hardware…” to perform inference does not integrate the abstract idea into a practical application and is seen as a variation of the phrase “apply it”.
In regards to Claim 20:
Step 1: Is the claim directed towards a process, machine, manufacture, or composition of matter?
Yes, the claim is directed towards a manufacture.
Step 2A Prong 1: Does the claim recite a law of nature, a natural phenomenon, or an abstract idea?
Yes, the claim does recite a(n) abstract idea.
Claim 20 recites the same abstract ideas as claim 1.
Step 2A Prong 2: Does the claim recite additional elements that integrate the exception into a practical application of the exception?
No, the claim does not recite any additional elements that would integrate the abstract idea into a practical application.
Claim 20 recites the following additional elements:
A non-transitory computer readable storage medium having stored thereon computer executable code configured to cause the method as set forth in claim 1 to be performed when the code is run on at least one processor
At a high level of generality, this is an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)).
Step 2B: Does the claim as a whole amount to significantly more than the judicial exception?
No, the claim as a whole does not amount to significantly more than the judicial exception. All elements of the claim, viewed individually or holistically, do not provide an inventive concept or otherwise significantly more than the abstract idea itself.
Claim 20 recites the following additional elements:
A non-transitory computer readable storage medium having stored thereon computer executable code configured to cause the method as set forth in claim 1 to be performed when the code is run on at least one processor
At a high level of generality, this is an activity of using a computer as an “apply it” use (see MPEP 2106.05(f)). At said high level of generality, a computer or computer parts appear to be an implementation of the abstract idea on a computer, amounting to merely using a computer as a tool to perform the abstract idea.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1, 4, 7, 9-12, 17-20 is/are rejected under 35 U.S.C. 102(a)(1) and (a)(2) as being anticipated by Xi (US 20230376663 A1), referred to as Xi in this document.
In regards to Claim 1:
A method of implementing, using a neural network accelerator comprising fixed-function hardware,
[Xi 0002]: “Compared to software implementations executed by a general purpose processor [processor], an FPGA brings the benefits of higher performance and lower power consumption of implementing computations at a low level (e.g., at a circuit level). This is similar to the benefits of using an application specific integrated circuit (ASIC) such as specialized co-processors such as a graphics processing unit (GPU) or neural accelerator [A method of implementing, using a neural network accelerator comprising fixed-function hardware,], which are used to accelerate operations specific to computer graphics and artificial neural networks, respectively. However, the design and fabrication of ASICs is a long, expensive process with high upfront fixed costs.”
[Xi 0095]: “The term computer readable media as used herein may include computer storage [computer readable medium] media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or configuration files (“bit files”) specifying the configuration of an FPGA to implement particular functionality. The system memory [memory] 704, the removable storage device 709, and the non-removable storage device 710 are all computer storage media examples (i.e., memory storage.)”
inference using an attention-based neural network, the method comprising
[Xi 0032]: “All layers in a transformer model operate along the hidden dimension (and can ignore this constraint) except for the self-attention [inference using an attention-based neural network, the method comprising] heads, which operate along the sequence dimension. To enforce the autoregressive constraint, a mask is used to mask out tokens at positions greater than or equal to [i+1] (e.g., after the current token at position i). FIG. 1 depicts an example of an attention score matrix supplied as the input data 10, where the rows are labeled in increasing index from top to bottom (e.g., index i from 0 to 7) and from left to right (e.g., index j from 0 to 7). The mask 40 is shown in FIG. 1 as an upper triangular mask to mask out the upper right triangle of the attention score matrix to produce the masked data 30, corresponding to the locations of where positions j are greater than or equal to [i+1]. The masked data 30 is then provided to a normalization circuit 150, which normalizes the masked data 30 (e.g., on a row-by-row basis) and outputs the masked and normalized data, which may be used by a next part of the machine learning process or model training process, such as to update the trainable parameters of the machine learning model (e.g., by applying the backpropagation algorithm to update the weights of connections between layers of a neural network).”
Support for inference is shown by [Xi 0084]: “The stored, trained machine learning model may then be deployed for use in performing inference tasks (e.g., making predictions or estimates) based on live data similar to the training data (e.g., natural language input data, images, etc.) by processing the live data with the trained machine learning model to generate an output (e.g., a classification of the input live data or a predicted next item in a sequence).”
receiving a first input sequence for the attention-based neural network
[Xi 0046]: “Accordingly, in some examples, in operation 391 of FIG. 3B, the accelerator 360 receives [receiving a first input sequence for the attention-based neural network] input data 310 including data values at a plurality of indices. In the example shown in FIG. 3A, the input data 310 is arranged in a two-dimensional array (or matrix) of data values, where each data value is located at a two-dimensional index (e.g., a row and a column coordinate pair) of the matrix. While FIG. 3A shows the input data 310 as being arranged in a square array (e.g., having the same number of rows and columns such as an N×N array), examples of the present technology are not limited thereto, and include circumstances where the input data values are arranged in a rectangular array as an N×M array) of data values.”
padding the first input sequence with padding values to produce a first padded input sequence of a first fixed length
generating a first padding mask identifying the part of the first padded input sequence that contains the padding values
[Xi 0080]: “In order to increase the length of the rows to match the expected shape [padding the first input sequence with padding values to produce a first padded input sequence of a first fixed length] (e.g., N×N) of the output data, at operation 553 the row splitting and padding circuit 569 pads the normalized data vector for output row m to N data values by inserting the 0s at the locations of the masked-out values [generating a first padding mask identifying the part of the first padded input sequence that contains the padding values] (e.g., as defined by the mask) to generate a first padded data vector. Likewise, at operation 555 the row splitting and padding circuit 569 pads the normalized data vector for output row N−m−2 to a length of N data values by inserting the 0s at the locations of the masked-out values to generate a second padded data vector. At operation 557 the row splitting and padding circuit 569 writes the first padded normalized data vector to the BRAM 563 as row m of the output data 570, and at operation 559 the row splitting and padding circuit 569 writes the second normalized padded data vector to the BRAM 563 as row N−m−2 of the output data 570. The computed output data 570 of masked and normalized data values (e.g., based on SoftMax) may then be written back to the system memory 520 for further use, such as for updating the parameters of a machine learning model based on the normalized output scores in the output data 570.”
Further support for padding with extra data to fit a given size (in cases where padding must be performed in an alternative way) is given by [Xi 0106]: “The method may further include: padding the input data with a dummy row, wherein all of the data values of the dummy row are masked and wherein the dummy row has a negative index.”
The above quotes show Xi padding an input to fit a particular fixed size (the “expected shape”), with the padded values included in a mask. The ordering of the masking may appear different, but the result is the same, since Xi is determining how to pad an input. That Xi, in some cases such as the one above, determines the mask before performing the padding does not distinguish the claimed limitations: deciding first whether the additional slots/indexes of an input should be masked or padded does not prevent those input areas from becoming both padded and masked. The idea of padding and then masking is supported in [Xi 0106].
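As an illustrative sketch only (the function name, values, and lengths below are hypothetical and are not taken from the application or from Xi), the claimed padding and padding-mask steps can be expressed as:

```python
import numpy as np

def pad_and_mask(seq, fixed_length, pad_value=0.0):
    """Pad seq to fixed_length; return the padded sequence and a
    padding mask that is 1.0 at padded positions and 0.0 elsewhere."""
    n = len(seq)
    padded = np.concatenate([seq, np.full(fixed_length - n, pad_value)])
    mask = np.concatenate([np.zeros(n), np.ones(fixed_length - n)])
    return padded, mask

padded, mask = pad_and_mask(np.array([0.5, -1.2, 3.0]), fixed_length=6)
# padded is [0.5, -1.2, 3.0, 0.0, 0.0, 0.0]
# mask   is [0.0,  0.0, 0.0, 1.0, 1.0, 1.0]
```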
generating a first attention mask from the first padding mask, wherein the generating comprises an outer product operation applied to the first padding mask
[Xi 0042]: “In addition, the SoftMax block 270 applies the SoftMax [generating a first attention mask from the first padding mask, wherein the generating comprises an outer product operation applied to the first padding mask] value to each data value in each row of the masked data x_masked 230, without regard to whether those data values are masked. However, in the example shown in FIG. 2, the application of the mask causes the masked data values to be large negative values, and, as discussed above, the large negative value is chosen such that the exponential of the masked values will round to zero. As such, it is already known that the SoftMax output of those masked values will be 0, and therefore computing the SoftMax function on these masked values represent wasted computations (e.g., wasted computational effort). Referring back to FIG. 1 and FIG. 2, in the case of an upper triangular mask 240, this may result in up to approximately 50% wasted computations because approximately half of the input data values are masked out.”
Interpreting this generation as relating to the softmax is supported by the elements of claim 9, which indicate that the mask contains large negative numbers.
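As an illustrative sketch of an outer-product mask construction of the kind recited in claim 1 (assuming the common convention in which a “keep” vector marks unpadded positions; the particular construction used by the application is not reproduced here):

```python
import numpy as np

pad_mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0])  # 1.0 marks padding positions
keep = 1.0 - pad_mask                           # 1.0 marks real elements

# Outer product: entry (i, j) is 1.0 only when both i and j are real
attn_keep = np.outer(keep, keep)

# Additive attention mask: 0 where attention is allowed, a large
# negative value where either position is padding
attn_mask = (1.0 - attn_keep) * -10000.0
```

Adding such a mask to attention scores before the softmax drives the masked entries toward zero, consistent with the behavior described in [Xi 0042].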
and processing, by the fixed-function hardware, the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network
[Xi 0091]: “FIG. 7 is a block diagram illustrating physical components (i.e., hardware) of a computing device 700 with which examples of the present disclosure may be practiced [and processing, by the fixed-function hardware, the first padded input sequence and the first attention mask to perform the inference using the attention-based neural network; thus indicating that the hardware noted by Xi can run the processing]. The computing device components described below may be suitable for running a training process for a machine learning model or for performing inference using a trained machine learning model, as described above. In a basic configuration, the computing device 700 may include at least one processing unit 702, a field programmable gate array (FPGA) 703, and a system memory 704. In some examples, the processing unit 702 includes an FPGA 703 (e.g., the processing unit 702 may include an array of logic blocks that are reconfigurable through setting the interconnections).”
In regards to Claim 4:
The method of claim 1 is taught by Xi.
Xi teaches:
Wherein the attention based neural network comprises an encoder, and wherein the first input sequence is an input for the encoder
[Xi 0085]: “In most transformer models (e.g., Bidirectional Encoder [Wherein the attention based neural network comprises an encoder, and wherein the first input sequence is an input for the encoder] Representations from Transformer or BERT, see Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXivpreprint arXiv:1810.04805 (2018).)−10000.0 is used as the masked output value.”
In regards to Claim 7:
The method of claim 1 is taught by Xi.
Xi teaches:
further comprising modifying the generated first attention mask to ignore one or more elements of the first input sequence
[Xi 0085]: “Aspects of the present technology also relate to a method for masking a portion of data values and performing computations on unmasked data values in a manner that reduces wasted computations on masked values. Generally, the purpose of an attention mask in a transformer model is to assign a large negative values to the masked-out locations so that the following SoftMax layer in the machine learning model attenuates the masked-out locations to zero [further comprising modifying the generated first attention mask to ignore one or more elements of the first input sequence] when supplied as input to a low-precision exponential function (exp(x)) of the SoftMax layer.”
The above limitation is interpreted as referring to elements being excluded from calculations as a result of their values in the softmax being attenuated to 0.
In regards to Claim 9:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the first attention mask comprises or consists of: a plurality of zeros, in locations corresponding to the elements of the first input sequence;
[Xi 0034]: “In more detail, in the system shown in FIG. 2, a demultiplexer (DEMUX) 265 is used to route training example input data 210 and attention mask 240 along different data paths, where a first floating point subtraction circuit 266, a floating point multiplier 267, and a second floating point subtraction circuit 268 are used to implement Equation (1). While the attention mask storage and logic 264 is shown in FIG. 2 as a set of discrete functional blocks, such comparative accelerators 260 may also be implemented using, for example, software code (e.g., programs or shaders) controlling the operations of a vector processor or a graphics processing unit. As a result, the original value of the training example input data 210 is preserved [wherein the first attention mask comprises or consists of: a plurality of zeros, in locations corresponding to the elements of the first input sequence; a zero in the attention mask means the value of the input data is preserved rather than being replaced with a large negative number to be zeroed out by the softmax] at locations (i,j) where the mask data at (i,j) was 1.0f, and the original data is replaced with the value x-10,000 at locations (i,j) where the mask data at (i,j) was 0.0f.”
and one or more large negative values, in locations corresponding to the padding values of the first padded input sequence
[Xi 0036]: “As seen above, the numerator of the SoftMax function is e.sup.zi. Therefore, the value of the SoftMax approaches 0 as the input value z.sub.i approaches −∞. In practice, supplying a sufficiently large negative number [and one or more large negative values, in locations corresponding to the padding values of the first padded input sequence] (a negative number having a large absolute value) as input to the SoftMax function a will produce a number that is small enough to be rounded to zero, or to be effectively zero for the purposes of the algorithm to be accelerated (e.g., the training of a machine learning model). In the above example shown in FIG. 2 and Equation (1), it is assumed that the values are represented in a low-precision floating point format such as BFloat16, IEEE half-precision 16-bit float FP16, or the like.”
In regards to Claim 10:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the attention-based neural network comprises a Softmax function, and wherein the processing comprises adding the first attention mask to an input to the Softmax function
[Xi 0035]: “In the comparative system shown in FIG. 2, the masked training example input data (x_masked) 230 is supplied to a softmax circuit [wherein the attention-based neural network comprises a Softmax function, and wherein the processing comprises adding the first attention mask to an input to the Softmax function] 270 implementing a SoftMax function.”
In regards to Claim 11:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the attention-based neural network comprises a transformer network
[Xi 0032]: “All layers in a transformer [wherein the attention-based neural network comprises a transformer network] model operate along the hidden dimension (and can ignore this constraint) except for the self-attention heads, which operate along the sequence dimension.”
In regards to Claim 12:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the attention-based neural network comprises a layer normalisation
Element 150 in Figure 1 of Xi shows the normalization/softmax circuit.
[Xi Figure 1]
In regards to Claim 17:
Xi teaches:
A computer-implemented method for training an attention-based neural network for hardware implementation, the method comprising: obtaining a dataset of first training input sequences for the attention-based neural network, wherein the dataset includes first training input sequences of varying length; padding each first training input sequence with padding values to produce a respective first padded input sequence of a first fixed length; generating, for each first padded input sequence, a respective first padding mask identifying the part of the first padded input sequence that contains the padding values; and generating a first attention mask from each first padding mask, wherein the generating comprises an outer product operation applied to the first padding mask
The above claim limitations are analogous to those taught in claim 1 by Xi. The relations to training are shown by the mapping below.
and training the attention-based neural network using the first padded input sequences and the first attention masks.
[Xi 0032]: “All layers in a transformer model operate along the hidden dimension (and can ignore this constraint) except for the self-attention heads, which operate along the sequence dimension. To enforce the autoregressive constraint, a mask is used to mask out tokens at positions greater than or equal to [i+1] (e.g., after the current token at position i). FIG. 1 depicts an example of an attention score matrix supplied as the input data 10, where the rows are labeled in increasing index from top to bottom (e.g., index i from 0 to 7) and from left to right (e.g., index j from 0 to 7). The mask 40 is shown in FIG. 1 as an upper triangular mask to mask out the upper right triangle of the attention score matrix to produce the masked data 30, corresponding to the locations of where positions j are greater than or equal to [i+1]. The masked data 30 is then provided to a normalization circuit 150, which normalizes the masked data 30 (e.g., on a row-by-row basis) and outputs the masked and normalized data, which may be used by a next part of the machine learning process or model training process [and training the attention-based neural network using the first padded input sequences and the first attention masks.], such as to update the trainable parameters of the machine learning model (e.g., by applying the backpropagation algorithm to update the weights of connections between layers of a neural network).”
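For clarity of the record, the claimed generation of an attention mask via an outer product applied to the padding mask can be sketched as follows (examiner-provided illustration; the 1.0/0.0 padding-mask convention and the −10,000 value are assumptions consistent with Xi paragraphs 0034 and 0036, not quotations from the references):

```python
NEG = -10000.0  # assumed "large negative value"

def attention_mask_from_padding(pad):
    """pad[i] == 1.0 for real tokens, 0.0 for padding values.
    The outer product pad[i] * pad[j] is 1 only where both positions are
    real; those locations receive 0.0 in the additive attention mask, and
    all other locations receive a large negative value."""
    n = len(pad)
    return [[0.0 if pad[i] * pad[j] == 1.0 else NEG for j in range(n)]
            for i in range(n)]

pad = [1.0, 1.0, 1.0, 0.0]   # sequence of length 3 padded to fixed length 4
mask = attention_mask_from_padding(pad)
```

The resulting matrix has zeros at locations corresponding to real elements and large negative values at locations involving padding, consistent with the mapping applied to claim 9.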
In regards to Claim 18:
A graphics processing system configured to perform the method as set forth in claim 1
[Xi 0002]: “Compared to software implementations executed by a general purpose processor, an FPGA brings the benefits of higher performance and lower power consumption of implementing computations at a low level (e.g., at a circuit level). This is similar to the benefits of using an application specific integrated circuit (ASIC) such as specialized co-processors such as a graphics processing unit (GPU) or neural accelerator, which are used to accelerate operations specific to computer graphics [A graphics processing system configured to perform the method as set forth in claim 1] and artificial neural networks, respectively. However, the design and fabrication of ASICs is a long, expensive process with high upfront fixed costs.”
The elements of claim 1 are taught in the rejection of claim 1.
In regards to Claim 19:
This claim is analogous to claim 1.
In regards to Claim 20:
This claim is analogous to claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2, 3, 5, 6, 8 are rejected under 35 U.S.C. 103 as being unpatentable over Xi (US 20230376663 A1), referred to as Xi in this document, and further in combination with Vaswani et al. (“Attention Is All You Need”), referred to as Vaswani in this document.
In regards to Claim 2:
The method of claim 1 is taught by Xi.
Xi does not explicitly teach:
wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder
Xi notes the use of a transformer [Xi 0006], but does not directly state that the transformer has a decoder element. For the sake of clarity, Vaswani is used to indicate that transformers can utilize a decoder element.
Vaswani teaches:
wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder
[Vaswani 3.2.3 page 5]: “In "encoder-decoder attention" layers, the queries come from the previous decoder [wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder] layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].”
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Vaswani. Xi and Vaswani are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Vaswani to incorporate aspects of a transformer such as the one presented in Vaswani, as both Xi and Vaswani note transformers, and as transformers allow for significantly more parallelization ([Vaswani Introduction page 2]: “The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.”).
In regards to Claim 3:
The method of claim 2 is taught by Xi and Vaswani.
Xi teaches:
further comprising executing the decoder for a number of iterations equal to the first fixed length, wherein, at each iteration the decoder produces an output sequence, and, at each iteration other than an initial iteration, the input to the decoder comprises the output sequence from the preceding iteration
[Xi 0032]: “To enforce the autoregressive constraint, a mask is used to mask out tokens at positions greater than or equal to [i+1] (e.g., after the current token [further comprising executing the decoder for a number of iterations equal to the first fixed length, wherein, at each iteration the decoder produces an output sequence, and, at each iteration other than an initial iteration, the input to the decoder comprises the output sequence from the preceding iteration] at position i)”
Further support of the idea is presented in [Xi 0006]: “Aspects of the present technology relates to the hardware acceleration of data masking and the computation of a softmax function on the masked data, which are commonly-performed operations in the field of machine learning. As one example, autoregressive transformer models, which are frequently applied in machine learning models for natural language processing, apply masks to input data in order to ensure that the transformer model learns to make predictions for a given token in a sequence of tokens based only on tokens appearing earlier in the sequence and not based on tokens appearing later in the sequence.”
By going over each element, shown by having the current token at position i, Xi teaches that each part of the input is processed in sequence, where the next step then uses the elements of the previous step plus the new current token. The first fixed length would be the number of tokens (as Xi is looping through the tokens). This appears to align with the teachings around iterations in paragraph 18 of the current application. More details are noted after Vaswani.
In case the claim is intended to invoke more recurrent elements or other aspects of transformers, Vaswani is also relied upon.
Vaswani teaches:
further comprising executing the decoder for a number of iterations equal to the first fixed length, wherein, at each iteration the decoder produces an output sequence, and, at each iteration other than an initial iteration, the input to the decoder comprises the output sequence from the preceding iteration
[Vaswani 3.2.3 page 5]: “In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence [further comprising executing the decoder for a number of iterations equal to the first fixed length, wherein, at each iteration the decoder produces an output sequence, and, at each iteration other than an initial iteration, the input to the decoder comprises the output sequence from the preceding iteration]. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].”
The motivation to use Vaswani would be the same as in claim 2.
Or in regards to recurrent aspects:
[Vaswani 2 Background page 2]: “End-to-end memory networks are based on a recurrent attention [where the recurrence is teaching the iterating over the input and passing in the previous output] mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28].”
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Vaswani. Xi and Vaswani are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Vaswani in order to utilize recurrence for its good performance on simple language tasks ([Vaswani 2 Background page 2]: “End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28].”).
Given that the invention relates to transformers, and noting [Current Specification 0018]: “When the decoder is executed multiple times, the first attention mask (e.g. self-attention mask) may be updated at each iteration. In particular, the first attention mask may be updated such that less of the input sequence is ignored in successive iterations. For instance, in the first iteration (i.e. the initial iteration), all elements other than the first element may be ignored; in the second iteration, all elements other than the first and second elements may be ignored (and so on). When the output sequence from the decoder is fed back to the input in each iteration, this means that less of the output sequence is ignored in successive iterations.”, the claims are interpreted to mean iterating over aspects of the input as taught by Xi earlier in this claim. The teachings from Vaswani are included as a precaution and for clarity of the record.
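For clarity of the record, the iteration scheme described in paragraph 0018 of the current specification can be sketched as follows (examiner-provided illustration; the fixed length of 4 and the −10,000 value are assumed for the example):

```python
NEG = -10000.0   # assumed "large negative value"
FIXED_LEN = 4    # assumed first fixed length

def self_attention_mask(iteration, length=FIXED_LEN):
    """Per paragraph 0018 of the current specification: at iteration t
    (0-indexed), positions 0..t are visible and all later elements are
    ignored via a large negative additive value."""
    return [0.0 if j <= iteration else NEG for j in range(length)]

# The decoder is executed for a number of iterations equal to the first
# fixed length; less of the sequence is ignored at each successive
# iteration, so the output fed back from the previous iteration becomes
# progressively more visible.
masks = [self_attention_mask(t) for t in range(FIXED_LEN)]
```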
In regards to Claim 5:
The method of claim 4 is taught by Xi.
Xi teaches:
padding the second input sequence with padding values to produce a second padded input sequence of a second fixed length; generating a second padding mask identifying the part of the second padded input sequence that contains the padding values; generating a second attention mask from the second padding mask, wherein the generating comprises an outer product operation applied to the second padding mask;
wherein the method comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence and the second attention mask using the decoder, to perform the inference.
The above elements are analogous to the limitations in claim 1, with the addition of performing the steps a second time as a result of the encoder-decoder structure (i.e., performing them at least once for the encoder and once for the decoder). The support for that structure is provided below in the combination with Vaswani. The processing by the hardware is taught by Xi in claim 1, where the indication of the encoder-decoder setup for the processing is, as noted, supported below by Xi in combination with Vaswani.
Xi does not explicitly teach:
wherein the attention-based neural network further comprises a decoder, the method further comprising
receiving a second input sequence, wherein the second input sequence is an input for the decoder;
As noted in claim 2, Xi discloses related elements, but does not appear to directly state a decoder, thus Vaswani is used to help indicate transformers containing decoders.
Vaswani teaches:
wherein the attention-based neural network further comprises a decoder, the method further comprising
[Vaswani 3.2.3 page 5]: “In "encoder-decoder attention" layers, the queries come from the previous decoder [wherein the attention-based neural network comprises a decoder, wherein the first input sequence is an input for the decoder] layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].”
receiving a second input sequence, wherein the second input sequence is an input for the decoder;
The basis for receiving an input is taught by Xi in claim 1, but the indication of a second input is taught by Vaswani in Figure 1 (shown below) indicating an input to both the encoder (left) and decoder (right).
[Vaswani Figure 1]
The motivation to combine with Vaswani is the same as in claim 2.
In regards to Claim 6:
The method of claim 5 is taught by Xi and Vaswani.
Xi teaches:
wherein the method further comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware,
The fixed function hardware and processing of the input sequence and attention mask is noted to be taught in claim 1 by Xi. The teachings of an encoder are taught by Xi in claim 4. The idea of an encoder in an encoder-decoder setup is properly taught in the combination with Vaswani below.
Xi does not explicitly teach:
further comprising generating a cross-attention mask from the first padding mask and the second padding mask, comprising an outer product of the first padding mask with the second padding mask,
wherein the method further comprises processing, by the fixed-function hardware, the first padded input sequence and the first attention mask using the encoder and processing, by the fixed-function hardware, the second padded input sequence, the second attention mask, and the cross-attention mask using the decoder, to perform the inference.
Xi does not explicitly teach the encoder-decoder setup, but the hardware for the processing is taught by Xi in claim 1.
Vaswani teaches:
further comprising generating a cross-attention mask from the first padding mask and the second padding mask, comprising an outer product of the first padding mask with the second padding mask,
Cross-attention is shown in Figure 1 of Vaswani by the outputs of the encoder (left) being passed to the multi-head attention of the decoder (right). The cross-attention mask is then created as a result of the multi-head attention process, which takes in the encoder elements (which include the mask) and the decoder elements before further processing.
the second padded input sequence, the second attention mask, and the cross-attention mask using the decoder, to perform the inference.
The existence of a second sequence, and thus of a second padding mask, second attention mask, etc., is also shown in Figure 1 of Vaswani by noting that these elements appear in both the encoder and the decoder. The cross-attention is noted again to come from the data from the encoder being passed to the decoder. All elements being used in processing is shown by the parts of the figure after the cross-attention. This mapping is supported by paragraph 123 of the current specification and Figure 1B of the current application.
[Vaswani Figure 1]
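For clarity of the record, the claimed cross-attention mask generated as an outer product of the first padding mask with the second padding mask can be sketched as follows (examiner-provided illustration; the mask conventions and the −10,000 value are assumptions carried over from the claim 9 discussion, not quotations from the references):

```python
NEG = -10000.0  # assumed "large negative value"

def cross_attention_mask(pad_a, pad_b):
    """Outer product of the first padding mask with the second padding
    mask: location (i, j) is kept (0.0) only when both pad_a[i] and
    pad_b[j] mark real tokens; otherwise it receives a large negative
    value so the softmax attenuates it to zero."""
    return [[0.0 if a * b == 1.0 else NEG for b in pad_b] for a in pad_a]

pad_first  = [1.0, 1.0, 0.0]        # first padded input sequence (length 2 + padding)
pad_second = [1.0, 1.0, 1.0, 0.0]   # second padded input sequence (length 3 + padding)
xmask = cross_attention_mask(pad_first, pad_second)
```

Because the two padding masks may have different fixed lengths, the resulting cross-attention mask is rectangular, unlike the square self-attention masks of claims 1 and 5.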
The motivation to combine is the same as the motivation to combine in claim 2 with Vaswani.
In regards to Claim 8:
The method of claim 1 is taught by Xi.
Xi does not explicitly teach:
wherein the attention-based neural network comprises a scaled-dot product attention calculation
Xi does not disclose scaled-dot product attention specifically, but the use in a transformer is indicated by the teachings of Vaswani below.
Vaswani teaches:
wherein the attention-based neural network comprises a scaled-dot product attention calculation
[Vaswani 3.2.1 pages 4-5]: “We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension dk, and values of dimension dv.” Figure 2 of Vaswani indicates that the multi-head attention utilizes the scaled dot-product attention.
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Vaswani. Xi and Vaswani are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Vaswani to utilize scaled dot-product attention, as the performance of scaled dot-product attention with a softmax is better ([Vaswani 3.2.1 page 5]: “While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk [3]. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients 4. To counteract this effect, we scale the dot product by…”).
Claims 13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Xi (US 20230376663 A1), referred to as Xi in this document, and further in combination with Guo et al. (“A Survey of FPGA-Based Neural Network Inference Accelerator”), referred to as Guo in this document.
In regards to Claim 13:
The method of claim 12 is taught by Xi.
Xi teaches:
and evaluating said plurality of elementary neural network operations using the fixed- function hardware, wherein each of the plurality of elementary neural network operations is selected from the list consisting of: a convolution operation; an element-wise subtraction operation; an element-wise multiplication operation; a reciprocal operation; a square root operation; an element-wise division operation; a rectified linear activation function; a local response normalisation; and an element-wise addition.
[Xi 0092]: “As stated above, a number of program modules and data files may be stored in the system memory 704. While executing on the processing unit 702, the program modules 706 may perform processes that offload computational tasks to the FPGA 703. The FPGA 703 may include data paths configured to accelerate the computation of various mathematical functions including, but not limited to, masking data as described above with respect to FIGS. 1, 3A, 3B, 4A, 4B, 5A, 5B, 5C, and 5D. The FPGA 703 may be configured to include other data paths for implementing other mathematical functions in accordance with examples of the present invention, such as computing a softmax function, an exponential function, a reciprocal square root function [and evaluating said plurality of elementary neural network operations using the fixed-function hardware, wherein each of the plurality of elementary neural network operations is selected from the list consisting of: a reciprocal operation; a square root operation; where Xi may also teach other listed operations, e.g., the softmax likely being a form of normalisation], and the like.”
Xi does not explicitly teach:
wherein the fixed-function hardware is configured to perform a set of available elementary neural network operations, the method comprising: mapping the layer normalisation to a representation comprising a plurality of elementary neural network operations from the set of available elementary neural network operations;
Xi notes the moving of operations to hardware specifically designed for operations ([Xi 0029]: “Moving computationally expensive operations from programs running on relatively slow, general purpose processors (e.g., CPUs) or shaders running on graphics processing units (GPUs) onto FPGAs specifically configured to perform those expensive mathematical operations can provide significant reductions in total compute time and reductions in power consumption.”). However Xi doesn’t appear to indicate how the mapping of the operations is performed, thus another reference is used to help clear up the teaching.
Guo teaches:
wherein the fixed-function hardware is configured to perform a set of available elementary neural network operations, the method comprising: mapping the layer normalisation to a representation comprising a plurality of elementary neural network operations from the set of available elementary neural network operations;
[Guo 8.1 page 21]: "Venireis, et al. [62] describes the network model as a DFG in their design tool. Then the network computation process is translated to hardware design with DFG mapping method. DnnWeaver [52] use a virtual instruction set [wherein the fixed-function hardware is configured to perform a set of available elementary neural network operations, the method comprising: mapping the layer normalisation to a representation comprising a plurality of elementary neural network operations from the set of available elementary neural network operations;] to describe a network. The network model is first translated into an instruction sequence. Then the sequence is mapped as hardware FSM states but not executed like traditional CPU instructions." Where DFG is noted to mean data flow graph.
Aspects related to normalization are taught by Xi in earlier claims. Here the mapping to more elementary operations is taught by Guo, as Guo notes that the operations are translated into instructions for the hardware.
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Guo. Xi and Guo are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Guo in order to take advantage of the acceleration of algorithms provided by hardware specially designed for target algorithms ([Guo 2.2 FPGA-based Accelerator page 4]:"In recent years, FPGA is becoming a promising solution for algorithm acceleration. Compared with CPU, GPU, and DSP platforms, for which the software and hardware are designed independently, FPGA enables the developers to implement only the necessary logic in hardware according to the target algorithm. By eliminating the redundancy in general hardware platforms, FPGAs can achieve higher efficiency. Application specific integrated circuits (ASICs) based solutions achieve even higher efficiency but requires much longer development cycle and higher cost.”).
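For clarity of the record, a layer normalisation can in principle be decomposed into elementary operations from the claimed list (element-wise subtraction, element-wise multiplication, element-wise addition, a reciprocal operation, and a square root operation). The following sketch is an examiner-provided illustration of such a decomposition; it is not a mapping disclosed by Xi or Guo, and the epsilon value is assumed:

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer normalisation expressed only via elementary operations:
    additions, element-wise subtraction/multiplication, a reciprocal,
    and a square root."""
    n = len(x)
    mean = sum(x) * (1.0 / n)                      # additions, reciprocal, multiply
    centred = [v - mean for v in x]                # element-wise subtraction
    var = sum(c * c for c in centred) * (1.0 / n)  # element-wise multiply, additions
    inv_std = 1.0 / math.sqrt(var + eps)           # square root, reciprocal
    return [c * inv_std for c in centred]          # element-wise multiplication

y = layer_norm([1.0, 2.0, 3.0, 4.0])
```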
In regards to Claim 15:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions [..., P,..., Q,...] and a second tensor Y having dimensions [..., Q,...,R,...], the method further comprising:
[Xi 0041]: “Regarding arithmetic overhead, in the system shown in FIG. 2, each attention mask operation performed in accordance with Equation (1) requires one floating point multiplication [wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions [..., P,..., Q,...] and a second tensor Y having dimensions [..., Q,...,R,...], the method further comprising where the indication of being for matrix multiplication is given later in this quote noting “floating point operations for an (N,N)-shape tensor”] and two floating point subtractions, which consume a significant portion of the floating-point computing resources of a vector processor and/or a significant number of logic blocks of an FPGA. Computing the mask also requires a non-neglectable amount of computational energy (on the order of 3N.sup.2 floating point operations for an (N, N)-shape tensor). This is particularly significant when when the input tensors feature a large sequence length (e.g., for large values of N).”
and evaluating the graph of neural network operations to thereby evaluate the matrix multiplication operation, wherein the at least one element-wise operation is evaluated in the fixed-function hardware
Xi teaches processing operations on fixed-function hardware in claim 1. The element-wise aspects are indicated in the teachings below.
Xi does not explicitly teach:
mapping the matrix multiplication operation to a graph of neural network operations including at least one element-wise operation
Guo teaches:
[Guo 2.1 page 4]: “Besides CONV and FC layers, NN layers also have pooling, ReLU [28], concat [58], elementwise [mapping the matrix multiplication operation to a graph of neural network operations including at least one element-wise operation] [22] and other types of layers.”
The motivation to combine with Guo is the same as in claim 13.
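For clarity of the record, a matrix multiplication between a P×Q tensor and a Q×R tensor can be evaluated as a graph of simpler operations built from element-wise multiplications followed by additive reductions. The following is an examiner-provided sketch of that idea, not an implementation disclosed by Xi or Guo:

```python
def matmul_via_elementwise(X, Y):
    """Evaluate a matrix multiplication as element-wise multiplications
    of a row of X against each column of Y, followed by an additive
    reduction over the shared dimension Q."""
    P, Q = len(X), len(X[0])
    R = len(Y[0])
    out = [[0.0] * R for _ in range(P)]
    for i in range(P):
        for j in range(R):
            # element-wise multiply row i of X with column j of Y,
            # then sum (element-wise additions) over q
            out[i][j] = sum(X[i][q] * Y[q][j] for q in range(Q))
    return out

Z = matmul_via_elementwise([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
```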
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Xi (US 20230376663 A1), referred to as Xi in this document, and further in combination with Chen et al. (“A Survey of Accelerator Architectures for Deep Neural Networks”), referred to as Chen in this document, and further in combination with Wikipedia (RowAndColumnMajorOrderWikipediaOnMay2022), referred to as Wikipedia in this document.
In regards to Claim 14:
The method of claim 1 is taught by Xi.
Xi teaches:
wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions [..., P,..., Q,...] and a second tensor Y having dimensions [..., Q,...,R,...], the method further comprising
[Xi 0041]: “Regarding arithmetic overhead, in the system shown in FIG. 2, each attention mask operation performed in accordance with Equation (1) requires one floating point multiplication [wherein the attention-based neural network comprises a matrix multiplication operation defined in two or more dimensions between a first tensor X having dimensions [..., P,..., Q,...] and a second tensor Y having dimensions [..., Q,...,R,...], the method further comprising where the indication of being for matrix multiplication is given later in this quote noting “floating point operations for an (N,N)-shape tensor”] and two floating point subtractions, which consume a significant portion of the floating-point computing resources of a vector processor and/or a significant number of logic blocks of an FPGA. Computing the mask also requires a non-neglectable amount of computational energy (on the order of 3N.sup.2 floating point operations for an (N, N)-shape tensor). This is particularly significant when when the input tensors feature a large sequence length (e.g., for large values of N).”
and evaluating the graph of neural network operations to thereby evaluate the matrix multiplication operation, wherein the at least one convolution operation is evaluated in the fixed-function hardware
Xi teaches processing operations on fixed-function hardware in claim 1. The convolution and transformation aspects are indicated in the teachings below.
Xi does not explicitly teach:
mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation
Xi notes that hardware can be designed with particular indications, such as being row based, in mind ([Xi 0067]: “By performing the normalization without explicitly calculating the zero values for the masked-out locations (e.g., by computing exponentiations of masked-out values to arrive at zero values) and by combining multiple rows of data into a single row, aspects of the present technology increase the performance (e.g., throughput and efficiency) of hardware accelerators configured to perform masked, row-based normalization (e.g., using masked SoftMax).”). This is considered relevant to the “at least one transformation” as a result of paragraph 192 of the specification indicating that the rotation is to help with the hardware being row-major.
Chen teaches:
mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation
[Chen 2.2 Computation patterns page 3]: “Although the computation patterns of matrix multiplications and convolutions are very different, they can actually be converted to each other. Thus, an accelerator designed for one type of computation can still support the other type, although doing so might not be very efficient. Convolutions can be transformed into matrix multiplication through the Toeplitz matrix, as illustrated in Fig. 3, at the cost of introducing redundant data. On the other hand, matrix multiplication is just a convolution [mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation] with Oh = Ow = Fw = Fh = 1. The feature map and the filter are reduced to a single element.”
The mapping above is supported by paragraph 195 of the specification, which indicates that convolutions used for matrix multiplication employ a convolution of size 1, just as Chen indicates above.
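For illustration of Chen's observation that a matrix multiplication is a convolution with Oh = Ow = Fh = Fw = 1, the following minimal Python sketch (names and the plain-list representation are illustrative, not drawn from Chen) computes a matrix product both directly and as a set of 1x1 convolutions over Q input channels:

```python
# Illustrative sketch: a P x Q by Q x R matrix multiplication viewed as a
# 1x1 convolution. Each row of `a` is a 1x1 feature map with Q input
# channels; each column of `b` is a 1x1xQ filter (R filters in total).

def matmul(a, b):
    # Direct definition: a is P x Q, b is Q x R.
    return [[sum(a[p][q] * b[q][r] for q in range(len(b)))
             for r in range(len(b[0]))] for p in range(len(a))]

def matmul_as_conv1x1(a, b):
    # With 1x1 spatial extent, each output "pixel" reduces to a dot
    # product over input channels, reproducing one element of the product.
    P, Q, R = len(a), len(b), len(b[0])
    out = [[0.0] * R for _ in range(P)]
    for p in range(P):                 # one 1x1 feature map per row of a
        for r in range(R):             # one 1x1xQ filter per column of b
            acc = 0.0
            for q in range(Q):         # reduce over the Q input channels
                acc += a[p][q] * b[q][r]
            out[p][r] = acc
    return out
```

Both routines produce identical results, consistent with Chen's statement that the feature map and the filter are reduced to a single element.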
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Chen. Xi and Chen are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Chen because matrix multiplications and convolutions are the primary targets of accelerator designs ([Chen 2.2 Computation patterns page 2]: “Although a DNN may include many types of layers, matrix multiplications and convolutions dominate over 90% of the operations, and are the main targets of DNN accelerator designs.”).
Wikipedia teaches:
mapping the matrix multiplication operation to a graph of neural network operations including at least one transformation and at least one convolution operation
[Wikipedia Transposition page 4]: “As exchanging the indices of an array is the essence of array transposition, an array stored as row-major [at least one transformation] but read as column-major (or vice versa) will appear transposed (as long as the matrix is square). As actually performing this rearrangement in memory is typically an expensive operation, some systems provide options to specify individual matrices as being stored transposed. The programmer must then decide whether or not to rearrange the elements in memory, based on the actual usage (including the number of times that the array is reused in a computation).”
Wikipedia teaches that hardware can be considered row or column major, such that elements are stored and accessed in row- or column-major order. Wikipedia also notes the idea of transposing a matrix to reorder its rows and columns. As the specification notes that the transformation permutes a matrix to reorganize the rows and columns to suit row-major hardware [], Wikipedia teaches both the reason for reorganizing a matrix (row- or column-major hardware) and the reordering itself (transposition). Wikipedia is therefore seen as teaching the transformation limitation.
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Wikipedia. Xi and Wikipedia are in the same field of endeavor of hardware for computer computation. One of ordinary skill in the art would have been motivated to combine Xi and Wikipedia in order to ensure good performance on hardware ([Wikipedia page 1]: “Data layout is critical for correctly passing arrays between programs written in different programming languages. It is also important for performance when traversing an array because modern CPUs process sequential data more efficiently than nonsequential data. This is primarily due to CPU caching which exploits spatial locality of reference.[1] In addition, contiguous access makes it possible to use SIMD instructions that operate on vectors of data. In some media such as tape or NAND flash memory, accessing sequentially is orders of magnitude faster than nonsequential access.”).
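For illustration of the Wikipedia point that an array stored row-major but read as column-major appears transposed, the following minimal Python sketch (the flat-buffer representation and function names are illustrative) shows that only the index arithmetic changes; no element is moved in memory:

```python
# Illustrative sketch: one flat buffer, two index interpretations.
# Reading a row-major buffer with column-major indexing yields the
# transpose of the stored square matrix.

def read_row_major(buf, n, i, j):
    return buf[i * n + j]      # row-major indexing of an n x n matrix

def read_col_major(buf, n, i, j):
    return buf[i + j * n]      # column-major indexing of the same buffer

# The 2x2 matrix [[1, 2], [3, 4]] stored row-major as a flat buffer:
# buf = [1, 2, 3, 4]
# Row-major read of element (0, 1) gives 2; column-major read of the
# same position gives 3, i.e., the transposed element.
```

This is the rearrangement-free transposition the Wikipedia passage describes: the system merely records that the matrix is "stored transposed" rather than physically permuting memory.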
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Xi (US 20230376663 A1), referred to as Xi in this document, and further in combination with Yin et al (US 20180373992 A1), referred to as Yin in this document.
In regard to Claim 16:
The method of claim 1 is taught by Xi.
Xi teaches:
further comprising: receiving a further input sequence; determining a length of the further input sequence; identifying that said length is longer than the first fixed length; and responsive to said identifying,
the method further comprising: padding the further input sequence with padding values to produce a further padded input sequence of the further fixed length; generating a further padding mask identifying the part of the further padded input sequence that contains the padding values; generating a further attention mask from the further padding mask, wherein the generating comprises an outer product operation applied to the further padding mask; and processing, by the fixed-function hardware, the further padded input sequence and the further attention mask to perform the inference using the further attention-based neural network.
The above limitations are noted to be taught in the rejection of claim 1 by Xi. This claim is interpreted as performing the limitations of claim 1, but with the ability to load another network into the accelerator (as taught below) for an input of a different size. Aspects such as determining the length of the input sequence are seen as taught as part of adding padding to ensure a specific size.
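For illustration of the padding-mask and attention-mask limitations recited above, the following minimal Python sketch assumes (per the claim language, not per a specific disclosure of Xi) that the attention mask is formed as the outer product of the padding mask with itself. Names and the padding value are hypothetical.

```python
# Illustrative sketch of the claimed padding-mask / attention-mask
# construction. PAD, the function names, and the list representation
# are assumptions for illustration only.
PAD = 0.0

def pad_sequence(seq, fixed_length, pad_value=PAD):
    # Pad the input sequence up to the network's fixed length.
    return seq + [pad_value] * (fixed_length - len(seq))

def padding_mask(seq_length, fixed_length):
    # 1.0 for real positions, 0.0 for padded positions.
    return [1.0] * seq_length + [0.0] * (fixed_length - seq_length)

def attention_mask(pmask):
    # Outer product of the padding mask with itself: position (i, j) is
    # kept only when both token i and token j are real (non-padding).
    return [[a * b for b in pmask] for a in pmask]
```

Under this reading, determining the input length is inherent in computing how many padding values to append, which is consistent with the interpretation stated above.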
Xi does not explicitly teach:
loading into the neural network accelerator a representation of a further attention-based neural network, wherein the further attention-based neural network is associated with a further fixed length, the further fixed length being longer than the length of the further input sequence
Yin teaches:
loading into the neural network accelerator a representation of a further attention-based neural network, wherein the further attention-based neural network is associated with a further fixed length, the further fixed length being longer than the length of the further input sequence
[Yin 0104]: “In operation 1120, the representation switching module 875, based on the sensor data, selects a second machine learning system for use [loading into the neural network accelerator a representation of a further attention-based neural network, where the teachings for attention-based neural network and accelerator are from Xi. Yin is teaching the swapping of networks based on size.] in the method 900 or the method 1000. For example, the autonomous system may include two machine learning systems for controlling the autonomous system. The first machine learning system may have been trained using a first fixed-size input (e.g., a fixed-size vector or fixed-size image). The second machine learning system may have been trained using a second, different, fixed-size input [wherein the further attention-based neural network is associated with a further fixed length, the further fixed length being longer than the length of the further input sequence]. Based on the sensor data (e.g., detection in a change of speed of the autonomous system, a change in the number of objects detected in a region of interest, or any suitable combination thereof), the representation switching module 875 may switch between the two machine learning systems.”
Yin's indication that the fixed length concerns the number of objects shows that the fixed length reflects the size of the input, as the input is the set of objects ([Yin 0089]: “In operation 950, the uniform representation module 865 selects a subset of the plurality of objects based on the priorities of the objects. For example, a fixed-length list of data structures representing the objects in the region of interest may be generated by the uniform representation module 865. If the number of objects within the region of interest exceeds the size of the fixed-length list, a predetermined number of objects may be selected for inclusion in this list based on their priorities. The predetermined number selected for inclusion may correspond to the fixed length of the list of data structures. For example, the k highest-priority objects may be selected, where k is the fixed length of the list of data structures.”).
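For illustration of Yin's fixed-length uniform representation, the following minimal Python sketch (function name and data layout are hypothetical) selects the k highest-priority objects when the number of detected objects exceeds the fixed list length k:

```python
# Illustrative sketch of Yin's top-k selection into a fixed-length list:
# when more objects are detected than the list can hold, keep only the
# k highest-priority objects.

def select_fixed_length(objects, priorities, k):
    # Rank object indices by descending priority, then keep the top k.
    ranked = sorted(range(len(objects)),
                    key=lambda i: priorities[i], reverse=True)
    return [objects[i] for i in ranked[:k]]
```

This reflects the mechanism by which Yin's fixed length corresponds to the size of the input, as discussed above.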
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Xi and Yin. Xi and Yin are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Xi and Yin in order to be able to swap models based on conditions, such as the size of inputs ([Yin 0107]: “After the process 1100 completes, iterations of the method 900 or 1000 will use the selected second machine learning system and the selected uniform representation. Thus, multiple machine learning systems may be trained for specific conditions (e.g., heavy traffic or bad weather) and used only when those conditions apply.”).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Xi et al (US 20230195833 A1) is relevant art that discusses neural network hardware acceleration involving related aspects such as normalization, softmax, self attention, transformers, accelerators, and inference.
Barkan et al (“Grad-SAM: Explaining Transformers via Gradient Self-Attention Maps”) is relevant art that discusses related aspects such as self attention, transformers, and large negative numbers.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTOPHER D DEVORE whose telephone number is (703)756-1234. The examiner can normally be reached Monday-Friday 7:30 am - 5 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.D.D./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129