DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 21-22 stand rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception without significantly more.
Step 1 analysis:
In the instant case, the claims are directed to a method. Thus, each of the claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2A analysis:
Because the claims fall within one of the four statutory categories (Step 1), it must be determined whether the claims are directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea). In this case, the claims fall within the judicial exception of an abstract idea, specifically the abstract idea of mathematical concepts (including mathematical relationships, formulas, and/or calculations).
Step 2A: Prong 1 analysis:
Independent claim 21 recites:
“performing a linear projection of the input in each of the plurality of encoder blocks using the attention dictionary, the index matrix associated with the respective encoder block, and the coefficient matrix associated with the respective encoder block” - this limitation corresponds to performing a linear projection of an input, which amounts to performing a mathematical calculation; it is therefore a mathematical concept and an abstract idea (see MPEP 2106.04(a)(2): “It is important to note that a mathematical concept need not be expressed in mathematical symbols, because ‘[w]ords used in a claim operating on data to solve a problem can serve the same purpose as a formula’”).
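For illustration only (the notation below is hypothetical and not drawn from the claims), a linear projection of an input vector $x \in \mathbb{R}^{d}$ by a parameter matrix $W \in \mathbb{R}^{m \times d}$ is the matrix-vector product

$$y = Wx, \qquad y_i = \sum_{j=1}^{d} W_{ij}\, x_j,$$

i.e., a series of multiplications and additions, which is why the limitation is treated as a mathematical calculation even though no formula appears in the claim text.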
Step 2A: Prong 2 analysis:
This judicial exception is not integrated into a practical application because the claim recites only the following additional elements:
Claim 21:
“receiving an input at a mobile device that stores a trained machine learning model, the trained machine learning model comprising a plurality of encoder blocks, an attention dictionary shared across the plurality of encoder blocks, an index matrix for each of the encoder blocks, and a coefficient matrix for each of the encoder blocks” - this limitation amounts to receiving an input, which is analogous to data gathering, a pre-solution activity that is an insignificant extra-solution activity (see MPEP 2106.05(g)). Furthermore, this limitation recites a trained machine learning model stored in a device; however, the model is recited at a high level of generality and is merely used to perform a linear projection, so it amounts to mere instructions to implement an abstract idea or other exception on a computer (see MPEP 2106.05(f)).
Step 2B analysis:
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements recited in Claim 21 amount to no more than insignificant extra-solution activities and mere instructions to implement an abstract idea or other exception on a computer.
Moreover, the additional elements (or combinations of elements) that were considered insignificant extra-solution activity in Claim 21 must be re-evaluated to determine whether they are well-understood, routine, and conventional limitations:
“receiving an input at a mobile device that stores a trained machine learning model, the trained machine learning model comprising a plurality of encoder blocks, an attention dictionary shared across the plurality of encoder blocks, an index matrix for each of the encoder blocks, and a coefficient matrix for each of the encoder blocks” - this limitation amounts to receiving an input, which is well-understood, routine, and conventional under MPEP 2106.05(d)(II)(i): “Receiving or transmitting data over a network, e.g., using the Internet to gather data”. Furthermore, this limitation recites a trained machine learning model stored in a device; however, the model is recited at a high level of generality and is merely used to perform a linear projection, so it amounts to mere instructions to implement an abstract idea or other exception on a computer (see MPEP 2106.05(f)).
Dependent claim 22, when analyzed as a whole, is held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitation(s) fail(s) to establish that the claim is not directed to an abstract idea. The claim merely recites a further embellishment of the judicial exception.
Claim 22: this claim recites a further embellishment of the mathematical calculation performed as a linear projection, which comprises multiplication of parameters (such as “a product of the input and the attention dictionary” and “a product of the intermediate output and coefficients in the coefficient matrix”); these are mathematical concepts and abstract ideas (see MPEP 2106.04(a)(2): “It is important to note that a mathematical concept need not be expressed in mathematical symbols, because ‘[w]ords used in a claim operating on data to solve a problem can serve the same purpose as a formula’”).
Viewed as a whole, these additional claim element(s) do not provide meaningful limitation(s) to transform the abstract idea into a patent eligible application of the abstract idea such that the claim(s) amounts to significantly more than the abstract idea itself. Therefore, the claim(s) are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (NPL: “Sharing Attention Weights for Fast Transformer”, hereinafter Xiao, as submitted in the IDS dated 04/20/2023).
Referring to Claim 1, Xiao teaches a method comprising:
receiving one or more training corpora for training a machine learning model comprising a plurality of encoder blocks, each encoder block including an attention layer and a feedforward network (see Xiao at p. 5297 left column: “In training, we observe that systems tend to learn similar attention weights. Figure 6 plots the JS divergence between layer 4 and layers 5-6 at different training steps” and “For example, for each training epoch (Figure 4), one can train the model for a shorter time, as the JS divergence among layers converges quickly”. Therefore, this training of a machine learning model over training epochs is interpreted as receiving one or more training corpora for training a machine learning model. Furthermore, see p. 5292, section 2, right column: “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer”); and
using the one or more training corpora to train an attention dictionary shared across the plurality of encoder blocks (see Xiao at p. 5292 right column: “This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory”. Further at p. 5295, left column, first paragraph: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. Therefore, this shared attention network (SAN) corresponds to the claimed attention dictionary being trained, as it is trained from attention weights shared across the blocks. This would have been obvious to a person having ordinary skill in the art in order to reduce redundant computation, re-use hidden states in the attention network, and reduce the memory footprint because some hidden states are kept in the same piece of memory).
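To make the “shared across the plurality of encoder blocks” mapping concrete, the following is a minimal sketch (hypothetical Python/PyTorch code; all names, shapes, and the toy loss are assumptions, and this is neither Xiao's SAN nor the applicant's model) of one parameter store that every encoder block reads during training, so that gradients from all blocks update the same shared tensor:

```python
import torch
import torch.nn as nn

class SharedDictionaryEncoder(nn.Module):
    """Toy encoder stack in which every block uses one shared dictionary."""
    def __init__(self, d=16, n_atoms=64, n_blocks=3):
        super().__init__()
        # Single dictionary tensor shared by all encoder blocks.
        self.dictionary = nn.Parameter(torch.randn(d, n_atoms) * 0.1)
        # Block-specific coefficient matrices (the non-shared part).
        self.coeffs = nn.ParameterList(
            [nn.Parameter(torch.randn(n_atoms, d) * 0.1) for _ in range(n_blocks)]
        )

    def forward(self, x):
        for c in self.coeffs:
            # Each block projects through the same shared dictionary.
            x = torch.tanh(x @ self.dictionary @ c)
        return x

model = SharedDictionaryEncoder()
batch = torch.randn(8, 16)            # stand-in for one batch of a training corpus
loss = model(batch).pow(2).mean()     # stand-in for a task loss
loss.backward()                       # the shared dictionary accumulates
print(model.dictionary.grad.shape)    # gradients from every block: (16, 64)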
Referring to Claim 2, Xiao teaches the method of Claim 1, wherein:
training the attention dictionary comprises training attention parameters of the attention layer in each of the plurality of encoder blocks (see Xiao at p. 5295 left column: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. The attention weights are interpreted as the claimed attention parameters being trained, as they are shared for training purposes); and
the attention parameters for a given encoder block among the plurality of encoder blocks are a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks (see Xiao at section 3.1 left column: “Let S[i] be column i of weight matrix S. For position i, we first compute S[i] to weight all positions (as in Eq. (1)), and then compute the weighted sum of values by S[i] (as in Eq. (2)). In column vector S[i], element Si,j indicates the contribution that we fuse the value at position j to position i”).
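As a hedged illustration of the “weighted combination of columns” reading above (all symbols are hypothetical, not taken from the claims or from Xiao): if $D = [d_1, \dots, d_n]$ denotes the shared attention dictionary with columns $d_k$, the attention parameters of a given encoder block $b$ may be written as

$$W^{(b)} = D\, C^{(b)}, \qquad W^{(b)}_{:,j} = \sum_{k=1}^{n} C^{(b)}_{kj}\, d_k,$$

so that every column of $W^{(b)}$ is a weighted combination of the shared dictionary columns, and only the weights $C^{(b)}$ vary from block to block.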
Referring to Claim 8, Xiao teaches an apparatus comprising:
at least one processing device (see p. 5295 right column: “machines with 8 Nvidia 1080Ti GPUs”) configured to:
receive one or more training corpora for training a machine learning model comprising a plurality of encoder blocks, each encoder block including an attention layer and a feedforward network (see Xiao at p. 5297 left column: “In training, we observe that systems tend to learn similar attention weights. Figure 6 plots the JS divergence between layer 4 and layers 5-6 at different training steps” and “For example, for each training epoch (Figure 4), one can train the model for a shorter time, as the JS divergence among layers converges quickly”. Therefore, this training of a machine learning model over training epochs is interpreted as receiving one or more training corpora for training a machine learning model. Furthermore, see p. 5292, section 2, right column: “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer”); and
use the one or more training corpora to train an attention dictionary shared across the plurality of encoder blocks (see Xiao at p. 5292 right column: “This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory”. Further at p. 5295, left column, first paragraph: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. Therefore, this shared attention network (SAN) corresponds to the claimed attention dictionary being trained, as it is trained from attention weights shared across the blocks. This would have been obvious to a person having ordinary skill in the art in order to reduce redundant computation, re-use hidden states in the attention network, and reduce the memory footprint because some hidden states are kept in the same piece of memory).
Referring to Claim 9, Xiao teaches the apparatus of Claim 8, wherein:
to train the shared attention dictionary, the at least one processing device is configured to train attention parameters of the attention layer in each of the plurality of encoder blocks (see Xiao at p. 5295 left column: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. The attention weights are interpreted as the claimed attention parameters); and
the attention parameters for a given encoder block among the plurality of encoder blocks are a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks (see Xiao at section 3.1 left column: “Let S[i] be column i of weight matrix S. For position i, we first compute S[i] to weight all positions (as in Eq. (1)), and then compute the weighted sum of values by S[i] (as in Eq. (2)). In column vector S[i], element Si,j indicates the contribution that we fuse the value at position j to position i”).
Referring to Claim 15, Xiao teaches a non-transitory computer readable medium containing instructions that, when executed by at least one processor (see p. 5295 right column: “machines with 8 Nvidia 1080Ti GPUs”), cause the at least one processor to:
receive one or more training corpora for training a machine learning model comprising a plurality of encoder blocks, each encoder block including an attention layer and a feedforward network (see Xiao at p. 5297 left column: “In training, we observe that systems tend to learn similar attention weights. Figure 6 plots the JS divergence between layer 4 and layers 5-6 at different training steps” and “For example, for each training epoch (Figure 4), one can train the model for a shorter time, as the JS divergence among layers converges quickly”. Therefore, this training of a machine learning model over training epochs is interpreted as receiving one or more training corpora for training a machine learning model. Furthermore, see p. 5292, section 2, right column: “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer”); and
use the one or more training corpora to train an attention dictionary shared across the plurality of encoder blocks (see Xiao at p. 5292 right column: “This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory”. Further at p. 5295, left column, first paragraph: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. Therefore, this shared attention network (SAN) corresponds to the claimed attention dictionary being trained, as it is trained from attention weights shared across the blocks. This would have been obvious to a person having ordinary skill in the art in order to reduce redundant computation, re-use hidden states in the attention network, and reduce the memory footprint because some hidden states are kept in the same piece of memory).
Referring to Claim 16, Xiao teaches the non-transitory computer readable medium of Claim 15, wherein: the instructions that when executed cause the at least one processor to train the shared attention dictionary comprise instructions that when executed cause the at least one processor to train attention parameters of the attention layer in each of the plurality of encoder blocks (see Xiao at p. 5295 left column: “For example, we can try to share weights on layer blocks consisting of two layers, or three layers, or all layers (π = 2, or 3, ...), and use the tuned π on test data”. The attention weights are interpreted as the claimed attention parameters); and
the attention parameters for a given encoder block among the plurality of encoder blocks are a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks (see Xiao at section 3.1 left column: “Let S[i] be column i of weight matrix S. For position i, we first compute S[i] to weight all positions (as in Eq. (1)), and then compute the weighted sum of values by S[i] (as in Eq. (2)). In column vector S[i], element Si,j indicates the contribution that we fuse the value at position j to position i”).
Claims 3-4, 10-11, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao in view of Zhang et al. (NPL: “Joint dynamic sparse representation for multi-view face recognition”, hereinafter Zhang).
Referring to Claim 3, Xiao teaches the method of Claim 2; however, Xiao fails to teach the method further comprising:
identifying the columns for the weighted combination using an index matrix associated with the given encoder block; and
identifying weights for the weighted combination using a coefficient matrix associated with the given encoder block.
Zhang teaches, in an analogous system,
identifying the columns for the weighted combination using an index matrix associated with the given encoder block (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c). Therefore in our algorithm, we allow the sparse representation for each view to be different, but are forced to share the same class-level (group) structure”. Further, at p. 1294 left column: “[Eq. (12), shown as an image in the reference] which gives the index matrix [symbol shown as an image] containing the top-L dynamic active sets for all the M views, as detailed in Algorithm 2”. Therefore, this index matrix, which contains one index for each column, corresponds to the claimed columns for the weighted combination using an index matrix);
identifying weights for the weighted combination using a coefficient matrix associated with the given encoder block (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the coefficient vector containing the appropriate weights for each atom (a training sample, as explained at p. 1292) is interpreted as the claimed weights for the weighted combination using a coefficient matrix).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
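As a minimal sketch of how an index matrix and a coefficient matrix could together realize the claimed weighted combination (hypothetical Python/NumPy code; the names, shapes, and per-row sparsity pattern are assumptions, not Zhang's algorithm or the applicant's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms, d_out, k = 16, 64, 16, 4

D = rng.standard_normal((d, n_atoms))          # shared attention dictionary
I = rng.integers(0, n_atoms, size=(d_out, k))  # per-block index matrix
C = rng.standard_normal((d_out, k))            # per-block coefficient matrix

# Reconstruct the block's projection weights: each row of W is a weighted
# combination of the dictionary columns named by the index matrix.
# D[:, I] has shape (d, d_out, k); contracting over k applies the weights.
W = np.einsum('dok,ok->od', D[:, I], C)        # shape (d_out, d)

x = rng.standard_normal(d)
y = W @ x                                      # linear projection of the input
print(y.shape)  # (16,)
```

Because only the small index and coefficient matrices differ per block while the dictionary is stored once, a sketch like this also shows why such a scheme could reduce the per-block parameter footprint.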
Referring to Claim 4, the combination of Xiao and Zhang teaches the method of Claim 3, wherein the index matrix associated with the given encoder block and the coefficient matrix associated with the given encoder block are not shared across the plurality of encoder blocks (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c)”. Therefore, since there is only one index for each column, this is interpreted as not being shared across encoder blocks. Further, at p. 1292 left column: “the coefficient vector containing the appropriate weights for each atom in class i”. Therefore, each atom (training sample) has its own weights, and this is interpreted as not being shared across encoder blocks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Referring to Claim 10, Xiao teaches the apparatus of Claim 9; however, Xiao fails to teach that the at least one processing device is further configured to:
identify the columns for the weighted combination using an index matrix associated with the given encoder block; and
identify weights for the weighted combination using a coefficient matrix associated with the given encoder block.
Zhang teaches, in an analogous system,
identify the columns for the weighted combination using an index matrix associated with the given encoder block (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c). Therefore in our algorithm, we allow the sparse representation for each view to be different, but are forced to share the same class-level (group) structure”. Further, at p. 1294 left column: “[Eq. (12), shown as an image in the reference] which gives the index matrix [symbol shown as an image] containing the top-L dynamic active sets for all the M views, as detailed in Algorithm 2”. Therefore, this index matrix, which contains one index for each column, corresponds to the claimed columns for the weighted combination using an index matrix);
identify weights for the weighted combination using a coefficient matrix associated with the given encoder block (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the coefficient vector containing the appropriate weights for each atom (a training sample, as explained at p. 1292) is interpreted as the claimed weights for the weighted combination using a coefficient matrix).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Referring to Claim 11, the combination of Xiao and Zhang teaches the apparatus of Claim 10, wherein the index matrix associated with the given encoder block and the coefficient matrix associated with the given encoder block are not shared across the plurality of encoder blocks (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c)”. Therefore, since there is only one index for each column, this is interpreted as not being shared across encoder blocks. Further, at p. 1292 left column: “the coefficient vector containing the appropriate weights for each atom in class i”. Therefore, each atom (training sample) has its own weights, and this is interpreted as not being shared across encoder blocks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Referring to Claim 17, Xiao teaches the non-transitory computer readable medium of Claim 16; however, Xiao fails to teach the medium further containing instructions that when executed cause the at least one processor to:
identify the columns for the weighted combination using an index matrix associated with the given encoder block; and
identify weights for the weighted combination using a coefficient matrix associated with the given encoder block.
Zhang teaches, in an analogous system,
identify the columns for the weighted combination using an index matrix associated with the given encoder block (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c). Therefore in our algorithm, we allow the sparse representation for each view to be different, but are forced to share the same class-level (group) structure”. Further, at p. 1294 left column: “[Eq. (12), shown as an image in the reference] which gives the index matrix [symbol shown as an image] containing the top-L dynamic active sets for all the M views, as detailed in Algorithm 2”. Therefore, this index matrix, which contains one index for each column, corresponds to the claimed columns for the weighted combination using an index matrix);
identify weights for the weighted combination using a coefficient matrix associated with the given encoder block (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the coefficient vector containing the appropriate weights for each atom (a training sample, as explained at p. 1292) is interpreted as the claimed weights for the weighted combination using a coefficient matrix).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Referring to Claim 18, the combination of Xiao and Zhang teaches the non-transitory computer readable medium of Claim 17, wherein the index matrix associated with the given encoder block and the coefficient matrix associated with the given encoder block are not shared across the plurality of encoder blocks (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c)”. Therefore, since there is only one index for each column, this is interpreted as not being shared across encoder blocks. Further, at p. 1292 left column: “the coefficient vector containing the appropriate weights for each atom in class i”. Therefore, each atom (training sample) has its own weights, and this is interpreted as not being shared across encoder blocks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Claims 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao in view of Zhang and further in view of Boulis (US Pub. No. 2009/0019002, hereinafter Boulis).
Referring to Claim 21, Xiao teaches a method comprising:
receiving an input at a mobile device that stores a trained machine learning model, the trained machine learning model comprising a plurality of encoder blocks, an attention dictionary shared across the plurality of encoder blocks (see Xiao at section 2, first paragraph: “The input of the attention sub-layer is a tuple of (Q, K, V)”. Further at p. 5297 left column: “In training, we observe that systems tend to learn similar attention weights. Figure 6 plots the JS divergence between layer 4 and layers 5-6 at different training steps” and “For example, for each training epoch (Figure 4), one can train the model for a shorter time, as the JS divergence among layers converges quickly”. Therefore, this model trained over training epochs is interpreted as the claimed stored trained machine learning model. Also at p. 5292 right column: “This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory”. Therefore, this shared attention network (SAN) corresponds to the claimed attention dictionary shared across the encoder blocks, as it is trained from shared attention weights).
However, Xiao fails to teach:
receiving an input at a mobile device that stores a trained machine learning model,
the trained machine learning model comprising an index matrix for each of the encoder blocks, and a coefficient matrix for each of the encoder blocks; and
performing a linear projection of the input in each of the plurality of encoder blocks using the attention dictionary, the index matrix associated with the respective encoder block, and the coefficient matrix associated with the respective encoder block.
Zhang teaches, in an analogous system,
the trained machine learning model comprising an index matrix for each of the encoder blocks (see Zhang at p. 1293 left column: “Each dynamic active set gs contains only one index for each column of X, e.g., gs(m) is the row-index of the selected atom for the m-th column of X in the s-th dynamic active set, as shown in Fig. 2(c). Therefore in our algorithm, we allow the sparse representation for each view to be different, but are forced to share the same class-level (group) structure”. Further, at p. 1294 left column: “[Eq. (12), shown as an image in the reference] which gives the index matrix [symbol shown as an image] containing the top-L dynamic active sets for all the M views, as detailed in Algorithm 2”. Therefore, this index matrix, which contains one index for each column, corresponds to the claimed index matrix), and a coefficient matrix for each of the encoder blocks (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the coefficient vector containing the appropriate weights for each atom (a training sample, as explained at p. 1292) is interpreted as the claimed coefficient matrix); and
performing a linear projection of the input in each of the plurality of encoder blocks using the attention dictionary, the index matrix associated with the respective encoder block, and the coefficient matrix associated with the respective encoder block (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the linear combination of the samples is interpreted as the claimed linear projection of the input using the dictionary, index, and coefficient matrices).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and having a coefficient matrix with the weighted combination, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
Boulis teaches, in an analogous system,
receiving an input at a mobile device that stores a trained machine learning model (see Boulis at [0009]: “Embodiments described herein support search query processing that includes receiving a search query input string from a user of a mobile device and comparing the search query input to a personalized dictionary and determining a suggested completion for each match in the comparison, and then providing the suggested completion to the user for selection”. Further at [0091]: “the personalized dictionary can be stored in memory of a mobile device so that it can be readily accessed by a search application without the necessity of communicating with a general-purpose dictionary that is stored at a remote network location”. Therefore, the input query received at the user’s mobile device is interpreted as the received input, and the personalized dictionary stored in the mobile device is interpreted as the machine learning model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Xiao and Zhang with the above teachings of Boulis by having a trained machine learning model, the trained machine learning model comprising a plurality of encoder blocks, an attention dictionary shared across the plurality of encoder blocks, an index matrix for each of the encoder blocks, and a coefficient matrix for each of the encoder blocks, as taught by Xiao and Zhang, and wherein the model is stored in a mobile device that receives inputs, as taught by Boulis. The modification would have been obvious because one of ordinary skill in the art would have been motivated to store the model in a mobile device so that it can be readily accessed by a search application without the necessity of communicating with a general-purpose dictionary that is stored at a remote network location, where speed of input and efficiency of operation are highly prized (as suggested by Boulis at [0091]: “Thus, the personalized dictionary can be stored in memory of a mobile device so that it can be readily accessed by a search application without the necessity of communicating with a general-purpose dictionary that is stored at a remote network location” and at [0092]: “This is especially useful in the mobile device context, where speed of input and efficiency of operation are highly prized”).
Referring to Claim 22, the combination of Xiao, Zhang and Boulis teaches the method of Claim 21, wherein performing the linear projection comprises:
determining an intermediate output based on a product of the input and the attention dictionary (see Xiao at section 3.1, first paragraph: “Self-attention is essentially a procedure that fuses the input values to form a new value at each position. Let S[i] be column i of weight matrix S. For position i, we first compute S[i] to weight all positions (as in Eq. (1)), and then compute the weighted sum of values by S[i] (as in Eq. (2))”. Therefore, these new values at each position, formed before computing the weighted sum, correspond to the claimed intermediate output); and
for each of the encoder blocks, determining a product of the intermediate output and coefficients in the coefficient matrix associated with the respective encoder block for columns identified by the index matrix associated with the respective encoder block (see p. 1292 left column: “[equation block, shown as an image in the reference]”. Therefore, the product between the quantities shown in that equation [inline symbols shown as images in the reference], where one factor is the coefficient matrix associated with the respective encoder block for the columns identified by the index matrix and the other factor is the intermediate output, corresponds to the claimed product).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Xiao with the above teachings of Zhang by having a weighted combination of columns from the attention dictionary shared across the plurality of encoder blocks, as taught by Xiao, and determining a product of an intermediate output and coefficients in the coefficient matrix, as taught by Zhang. The modification would have been obvious because one of ordinary skill in the art would have been motivated to obtain a sparse representation over the whole training dataset (as suggested by Zhang at p. 1292 left column: “where xi = [xi1, xi2, ...] is the coefficient vector containing the appropriate weights for each atom in class i. This naturally leads to a sparse representation over the whole training dataset of all the C classes”).
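For concreteness, a minimal sketch of the two-step reading applied to Claim 22 above (hypothetical Python/NumPy code; all names and shapes are assumptions, not the applicant's method): the intermediate output is formed as the product of the input and the shared attention dictionary, and each block then multiplies that intermediate output by its own coefficients at the columns identified by its index matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d, n_atoms, n_blocks, k = 8, 16, 64, 3, 4

X = rng.standard_normal((seq_len, d))      # input
D = rng.standard_normal((d, n_atoms))      # shared attention dictionary

# Step 1: intermediate output = product of the input and the dictionary.
Z = X @ D                                  # shape (seq_len, n_atoms)

# Step 2: per-block product of the intermediate output and the coefficients
# at the columns identified by that block's index matrix.
outputs = []
for b in range(n_blocks):
    idx = rng.integers(0, n_atoms, size=k)   # index-matrix row for block b
    coef = rng.standard_normal(k)            # matching coefficients
    outputs.append(Z[:, idx] @ coef)         # shape (seq_len,)
print(outputs[0].shape)  # (8,)
```

Note that under this reading the expensive product X @ D is computed once and reused by every block, which is consistent with the shared-dictionary rationale discussed for Claim 1.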
Allowable Subject Matter
Claims 5-7, 12-14, 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LUIS A SITIRICHE whose telephone number is (571)270-1316. The examiner can normally be reached M-F 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126