Last updated: May 29, 2026
Application No. 17/945,978
METHOD AND DEVICE FOR COMPRESSING GENERATIVE PRE-TRAINED LANGUAGE MODELS VIA QUANTIZATION

Non-Final OA §101 §102 §103
Filed: Sep 15, 2022
Examiner: ALABI, OLUWATOSIN O
Art Unit: 2129
Tech Center: 2100 - Computer Architecture & Software
Assignee: Huawei Technologies Co., Ltd.
OA Round: 1 (Non-Final)
This examiner grants 60% of cases after interview (+22.6% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 209 resolved cases, 2023–2026
Examiner Intelligence

ALABI, OLUWATOSIN O
Career Allowance Rate: grants 60% of resolved cases (125 granted / 209 resolved), +4.8% vs TC avg
Interview Lift: +22.6% (resolved cases with interview vs. without)
Typical timeline: 3y 11m avg prosecution; 27 currently pending
Career history: 247 total applications across all art units
Statute-Specific Performance

§101: 2.6% (-37.4% vs TC avg)
§103: 86.8% (+46.8% vs TC avg)
§102: 7.1% (-32.9% vs TC avg)
§112: 2.3% (-37.7% vs TC avg)
Based on career data from 209 resolved cases
Office Action

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings were received on 09/15/2022.  These drawings are acceptable.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08/07/2024 has been considered by the examiner. 
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more.

Claim 1: Does claim fall within a  statutory category? Yes. 
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
a) determining a scaling factor based on a distribution of weights … network model. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III)
d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model. (Considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
… determining, based on the quantized weights during training of the neural network model, a training loss of the neural network model; and d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; thus, the claim limitations amount to mere instructions to apply the judicial exception using a computer/computing environment as a tool, as discussed in MPEP § 2106.05(f).)
… a distribution of weights associated with the neural network model; …, an updated scaling factor for the neural network model. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to generally linking the use of a judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).)

The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above. 
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use and elements invoking computers or other machinery merely as a tool to perform the claimed process/judicial exception. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 2: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
further comprising: determining, based on the distribution of weights, an average weight magnitude; and determining a clipping factor as a product of the scaling factor and the average weight magnitude, wherein determining, based on the weights in the distribution and the scaling factor, quantized weights is based on the clipping factor. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 3: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
further comprising determining the average weight magnitude by an L1 norm function to the weights associated with the distribution. (Considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.
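
For illustration only, the computations recited in claims 2 and 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the `bits` parameter, and the symmetric uniform rounding grid are hypothetical and do not appear in the claims, which recite only the average weight magnitude, the clipping factor as a product, and quantization based on that clipping factor.

```python
def average_magnitude(weights):
    """Mean absolute value: the L1 norm divided by the number of weights."""
    return sum(abs(w) for w in weights) / len(weights)


def clipping_factor(weights, scaling_factor):
    """Clipping factor = scaling factor * average weight magnitude."""
    return scaling_factor * average_magnitude(weights)


def quantize(weights, scaling_factor, bits=2):
    """Clip each weight to [-alpha, alpha], then round to a uniform grid.

    ASSUMPTION: the symmetric uniform grid and the `bits` parameter are
    illustrative choices; bits must be at least 2 for this grid.
    """
    alpha = clipping_factor(weights, scaling_factor)
    if alpha == 0:
        return [0.0 for _ in weights]  # all-zero weights stay zero
    levels = 2 ** (bits - 1) - 1       # grid steps on each side of zero
    quantized = []
    for w in weights:
        u = max(-1.0, min(1.0, w / alpha))  # clip the ratio to [-1, 1]
        quantized.append(alpha * round(u * levels) / levels)
    return quantized


weights = [0.5, -1.5, 0.25, -0.75]
alpha = clipping_factor(weights, scaling_factor=1.0)
q = quantize(weights, scaling_factor=1.0, bits=2)
```

With these four example weights the average magnitude is 0.75, so a scaling factor of one gives a clipping factor of 0.75, and the 2-bit grid maps every weight to one of -0.75, 0, or 0.75.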

Claim 4: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
wherein the weights associated with the distribution are divided into a plurality of value ranges based on the scaling factor, the method further comprising: computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in; and computing the gradient of the training loss by aggregating the gradient contributions from the weights associated with the distribution. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.
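
For illustration only, the range-dependent gradient aggregation recited in claim 4 might look like the sketch below. It is one possible reading, not the applicant's method: it assumes a clipping factor equal to the scaling factor times the mean absolute weight, a symmetric uniform grid, and a straight-through estimator for rounding, none of which is stated in the claim. The names `scaling_factor_gradient`, `upstream_grads`, and `bits` are hypothetical.

```python
def scaling_factor_gradient(weights, upstream_grads, scaling_factor, bits=2):
    """Aggregate per-weight contributions to d(loss)/d(scaling_factor).

    upstream_grads[i] is d(loss)/d(quantized_weight_i), assumed to be
    supplied by backpropagation. ASSUMPTIONS: at least one nonzero weight,
    bits >= 2, and rounding handled by a straight-through estimator.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    alpha = scaling_factor * mean_abs      # clipping factor
    levels = 2 ** (bits - 1) - 1
    total = 0.0
    for w, g in zip(weights, upstream_grads):
        u = w / alpha
        if u >= 1.0:       # value range 1: clipped at +alpha
            d_alpha = 1.0
        elif u <= -1.0:    # value range 2: clipped at -alpha
            d_alpha = -1.0
        else:              # value range 3: inside the clipping range
            d_alpha = round(u * levels) / levels - u
        # Chain rule: d(alpha)/d(scaling_factor) equals mean_abs.
        total += g * d_alpha * mean_abs
    return total


grad = scaling_factor_gradient(
    [0.5, -1.5, 0.25, -0.75], [1.0, 1.0, 1.0, 1.0], scaling_factor=1.0
)
```

Each weight contributes according to the value range it falls in, and the contributions are summed into a single gradient for the scaling factor, which a training step could then use to produce the updated scaling factor.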

Claim 5: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
further comprising: setting an initial value of the scaling factor to one.
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
Alternatively: further comprising: setting an initial value of the scaling factor to one. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to generally linking the use of a judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).)
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 6: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).

The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.

Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use and elements invoking computers or other machinery merely as a tool to perform the claimed process/judicial exception. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 7: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
Abstract idea noted in claim 1.
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
further comprising determining an optimized scaling factor for a task by carrying out multiple iterations of a) through d). (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to insignificant extra-solution activity (e.g., performing repetitive calculations), see MPEP § 2106.05(g).)
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use.
Second, the limitations are insufficient to transform the judicial exception to a patentable invention because the recitation is directed to insignificant extra-solution activity, as noted above. The courts have deemed these types of activity well-known, routine, and conventional; see the evidence noted below:
Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values); Bancorp Services v. Sun Life, 687 F.3d 1266, 1278, 103 USPQ2d 1425, 1433 (Fed. Cir. 2012) ("The computer required by some of Bancorp’s claims is employed only for its most basic function, the performance of repetitive calculations, and as such does not impose meaningful limits on the scope of those claims.")
These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 8: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above. 
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 9: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
applying the updated scaling factors (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
applying the updated scaling factors to the neural network model; … and updating the neural network model by updating the learnable weights. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; thus, the claim limitations amount to mere instructions to apply the judicial exception using a computer/computing environment as a tool, as discussed in MPEP § 2106.05(f).)
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 10: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
… determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss ... (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
wherein the neural network model with the quantized weights associated with the updated scaling factors is included in a student network and the student network is trained with a teacher network, … a set of second token representations by the teacher network; (Claimed limitations are generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP § 2106.05(h))
… and the student network is trained with a teacher network, the method further comprising: … determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; thus, the claim limitations amount to mere instructions to apply the judicial exception using a computer/computing environment as a tool, as discussed in MPEP § 2106.05(f).)
the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to insignificant extra-solution activity (e.g., receiving or transmitting data over a network), see MPEP § 2106.05(g).)
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.

Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use.
Second, the limitations are insufficient to transform the judicial exception to a patentable invention because the recitation is directed to insignificant extra-solution activity, as noted above. The courts have deemed these types of activity well-known, routine, and conventional; see the evidence noted below:
Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); but see DDR Holdings, LLC v. Hotels.com, L.P., 773 F.3d 1245, 1258, 113 USPQ2d 1097, 1106 (Fed. Cir. 2014) ("Unlike the claims in Ultramercial, the claims at issue here specify how interactions with the Internet are manipulated to yield a desired result‐‐a result that overrides the routine and conventional sequence of events ordinarily triggered by the click of a hyperlink." (emphasis added));
These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 11: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
Abstract idea in claim 10.
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
wherein the first loss comprises a student-to-teacher loss and a teacher-to-student loss. (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to generally linking the use of a judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).)

The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above. 
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.
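
For illustration only, a first loss built from pair-wise comparisons between student and teacher token representations, with a student-to-teacher term and a teacher-to-student term as in claims 10 and 11, could be sketched as below. The claims do not specify the comparison function, so the dot-product similarity, the softmax cross-entropy form, and the `temperature` parameter are assumptions, and all function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def directional_loss(queries, keys, temperature=1.0):
    """Mean cross-entropy of matching query i to key i among all keys.

    ASSUMPTION: every query is compared pair-wise with every key, and the
    key at the same position is treated as the correct match.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, k) / temperature for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


def token_level_loss(student_tokens, teacher_tokens, temperature=1.0):
    """Sum of a student-to-teacher term and a teacher-to-student term."""
    s2t = directional_loss(student_tokens, teacher_tokens, temperature)
    t2s = directional_loss(teacher_tokens, student_tokens, temperature)
    return s2t + t2s


student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = token_level_loss(student, teacher)
misaligned = token_level_loss(student, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy example the loss is lower when each student token lines up with the teacher token at the same position than when the teacher tokens are swapped.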

Claim 12: Does claim fall within a  statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
further comprising: … determining a second loss based on pair-wise comparison between each first logit in the set of first logits and respective second logit in the set of second logits; and determining the third loss based on the first loss and the second loss. (Considered directed to a Mental Process: making observations, evaluations, and judgements as claimed; see MPEP § 2106.04(a)(2), subsection III; and considered directed to Mathematical Concepts: mathematical relationships, mathematical calculations; see MPEP § 2106.04(a)(2), subsection I)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
obtaining, based on the input sequence, a set of first logits from the student network and a set of second logits from the teacher network; (Deemed insufficient to transform the judicial exception to a patentable invention because the recitation is directed to insignificant extra-solution activity (e.g., receiving or transmitting data over a network), see MPEP § 2106.05(g).)

The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above. 
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use.
Second, the limitations are insufficient to transform the judicial exception into a patentable invention because the recitation is directed to insignificant extra-solution activity, as noted above. The courts have deemed these types of activity well-known, routine, and conventional; see the evidence noted below:
Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); but see DDR Holdings, LLC v. Hotels.com, L.P., 773 F.3d 1245, 1258, 113 USPQ2d 1097, 1106 (Fed. Cir. 2014) ("Unlike the claims in Ultramercial, the claims at issue here specify how interactions with the Internet are manipulated to yield a desired result‐‐a result that overrides the routine and conventional sequence of events ordinarily triggered by the click of a hyperlink." (emphasis added));
These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.

Claim 13: Does the claim fall within a statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
wherein the determining of the third loss based on the first loss and the second loss further comprises determining the third loss by aggregating the first loss and the second loss with a tunable factor. (Considered directed to a mental process: making observations, evaluations, and judgments as claimed, see MPEP § 2106.04(a)(2), subsection III; and considered directed to mathematical concepts: mathematical relationships and mathematical calculations, see MPEP § 2106.04(a)(2), subsection I.)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above. 
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
First, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.
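For illustration only, the loss aggregation recited in claims 12 and 13 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the use of mean squared error for the pair-wise comparison, the weighted-sum form of the aggregation, and the names `pairwise_mse`, `third_loss`, and `tunable_factor` are all assumptions introduced here.

```python
def pairwise_mse(student_values, teacher_values):
    """Mean squared difference between matching student/teacher entries.

    Assumption: the claims do not name a distance; squared error is used
    here only as one plausible pair-wise comparison.
    """
    if len(student_values) != len(teacher_values):
        raise ValueError("student and teacher sets must have the same length")
    total = sum((s - t) ** 2 for s, t in zip(student_values, teacher_values))
    return total / len(student_values)


def third_loss(first_loss, second_loss, tunable_factor):
    """Aggregate two losses as first_loss + tunable_factor * second_loss.

    Assumption: a weighted sum is one possible aggregation with a tunable
    factor; the claims do not fix the exact form.
    """
    return first_loss + tunable_factor * second_loss


# Hypothetical values chosen only to exercise the functions.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [1.0, 0.5, 1.0]
second = pairwise_mse(student_logits, teacher_logits)  # (1 + 0 + 4) / 3
total = third_loss(first_loss=0.25, second_loss=second, tunable_factor=0.5)
```

Setting `tunable_factor` to zero reduces the third loss to the first loss alone, which is one way to see that the factor only controls the relative weight of the logit comparison.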

Claim 14: Does the claim fall within a statutory category? Yes.
Step 2A Prong 1: Evaluate whether the claim recites a judicial exception.
a) determining a scaling factor based on a distribution of weights associated with the neural network model; b) determining quantized weights based on the scaling factor and the weights associated with the distribution; c) determining, …, a training loss of the neural network model; and d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model. (Considered directed to a mental process: making observations, evaluations, and judgments as claimed, see MPEP § 2106.04(a)(2), subsection III; and considered directed to mathematical concepts: mathematical relationships and mathematical calculations, see MPEP § 2106.04(a)(2), subsection I.)
Step 2A Prong 2: Evaluate whether the claim as a whole integrates the recited judicial exception into a practical application of the exception
The preamble is deemed insufficient to transform the judicial exception to a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
one or more processors; and a non-transitory computer-readable medium, having computer-executable instructions stored thereon, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate: … determining, based on the quantized weights during training of the neural network model … (Deemed insufficient to transform the judicial exception into a patentable invention because the recitation merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; thus the claim limitations amount to mere instructions to apply the judicial exception using a computer/computing environment as a tool, as discussed in MPEP § 2106.05(f).)
… weights associated with the neural network model. (Deemed insufficient to transform the judicial exception into a patentable invention because the recitation is directed to generally linking the use of a judicial exception to a particular technological environment or field of use, see MPEP § 2106.05(h).)
The additional elements do not appear to be sufficient to transform the judicial exception into a practical application at Step 2A as analyzed above.
Step 2B: Evaluate whether the claim as a whole amounts to significantly more than the recited judicial exception
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, and the additional elements fail to integrate the abstract idea into a practical application.
Specifically, the additional limitations are directed to elements that generally link the use of a judicial exception to a particular technological environment or field of use and elements invoking computers or other machinery merely as a tool to perform the claimed process/judicial exception. These types of claimed elements cannot transform the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B. 
Thus, considering the additional elements individually and in combination and the claims as a whole, the additional elements do not provide significantly more than the abstract idea. This claim is not patent eligible.
Regarding claim 15, the limitations are similar to claim 2, and rejected under the same rationale.
Regarding claim 16, the limitations are similar to claim 4, and rejected under the same rationale.
Regarding claim 17, the limitations are similar to claim 8, and rejected under the same rationale.
Regarding claim 18, the limitations are similar to claim 9, and rejected under the same rationale.
Regarding claim 19, the limitations are similar to claims 10 and 12, and rejected under the same rationale.
Regarding claim 20, the limitations are similar to claims 1 and 14, and rejected under the same rationale.

Therefore, claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception and does not recite elements that, when examined individually or as an ordered combination, amount to what the courts have identified as "significantly more" than the identified abstract idea, see MPEP 2106.05.




Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1-3, 7-9, 14-15, 17-18 and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Charlaix et al. (US 20230306255, hereinafter ‘Char’).

Regarding independent claim 1, Char teaches a computer-implemented method for quantizing a neural network model, performed by a processing system, comprising: (in [0093] The NN may be software-implemented by machine readable instructions that are executed using a processing unit, such as a tensor processing unit or a neural processing unit. Alternatively, the NN may be implemented using software that includes machine readable instructions executed by a dedicated hardware device, such as a compact, energy efficient AI chip (e.g. a microprocessor which is specifically designed to execute NN operations tasks faster, using less power than a conventional microprocessor) that includes a small number of logical gates. In example embodiments the NN is trained using a processing unit that is more powerful than the processing systems on which the trained NN is ultimately deployed for inference operations.)
a) determining a scaling factor based on a distribution of weights associated with the neural network model; b) determining quantized weights based on the scaling factor and the weights associated with the distribution; (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations. Each training iteration includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights [based on a distribution of weights associated with the neural network model] of the computational block to generate a respective set of quantized weights [b) determining quantized weights based on the scaling factor and the weights associated with the distribution] that are scaled based on a respective scaling factor [determining a scaling factor based on a distribution of weights associated with the neural network model] to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor;…)
c) determining, based on the quantized weights during training of the neural network model, a training loss of the neural network model; and d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model. (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations. Each training iteration [determining, based on the quantized weights during training of the neural network model, a training loss of the neural network model] includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor; and (b) computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights; and (ii) computing a cost [a training loss of the neural network model] for the training iteration based on the respective output activations of the computational blocks and relative alignments of the respective quantized weights of the computational blocks with the uniform quantization levels of the respective quantization ranges; and (iii) for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor [d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model] with an objective of reducing the computed cost in one or more following training iterations. 
When performing the plurality of training iterations, a smoothness of the respective quantization functions applied by the computational blocks is incrementally reduced for multiple training iterations of the plurality of training iterations.; And in [0052] Training NN 90 includes performing a series of training iterations that each include a forward pass and a backward pass... The cost can be calculated using a defined cost function (also referred to as a loss function). During each backward pass a backpropagation algorithm [[d) determining, based on a gradient of the training loss,…] is applied to update trainable parameters (e.g., NN weights and biases) of the NN 90 with an objective of minimizing the cost in future iterations. For example, a backpropagation algorithm can be applied to calculate the gradient of the cost function [d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model] at the NN output block 94 and then distribute this gradient back through the computational blocks 100i to adjust the learnable parameters, including weights W′, of each computational blocks 100i...)
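For illustration only, steps a) through d) as mapped above can be sketched on a toy single-neuron model. This is a minimal sketch under stated assumptions, not the method of the application or of Char: the initialization rule for the scaling factor (largest weight magnitude divided by N), the squared-error loss, the learning rate, and the names `quantize` and `training_step` are hypothetical. The gradient with respect to the scaling factor is taken with the integer quantization levels held fixed.

```python
def quantize(weights, alpha, bits):
    """Symmetric uniform quantization.

    Each weight maps to an integer level j in [-N, N], N = 2**(bits-1) - 1,
    and the quantized weight is j * alpha.
    """
    n = 2 ** (bits - 1) - 1
    levels = [max(-n, min(n, round(w / alpha))) for w in weights]
    return levels, [j * alpha for j in levels]


def training_step(weights, inputs, target, alpha, bits, lr):
    """One pass of b) quantize, c) compute loss, d) update alpha."""
    levels, q = quantize(weights, alpha, bits)               # step b
    prediction = sum(qi * xi for qi, xi in zip(q, inputs))
    loss = 0.5 * (prediction - target) ** 2                  # step c
    # Gradient of the loss w.r.t. alpha with levels fixed: q_i = j_i * alpha.
    grad_alpha = (prediction - target) * sum(
        j * x for j, x in zip(levels, inputs)
    )
    return loss, alpha - lr * grad_alpha                     # step d


weights = [0.9, -0.4, 0.1]
inputs = [1.0, 1.0, 1.0]
# Step a (assumed rule): derive alpha from the weight distribution.
alpha0 = max(abs(w) for w in weights) / (2 ** (4 - 1) - 1)
loss0, alpha1 = training_step(weights, inputs, 0.5, alpha0, 4, 0.01)
loss1, alpha2 = training_step(weights, inputs, 0.5, alpha1, 4, 0.01)
```

Repeating `training_step` corresponds to carrying out multiple iterations of the steps; in this toy setup the loss decreases from the first call to the second.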

Regarding claim 2, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, further comprising: determining, based on the distribution of weights, an average weight magnitude; (in [0062] By way of example, the cost function applied by evaluation block 96 can be represented as: Cost=Error (Y,Ŷ)+(L1 or L2 Regularization Function)+R(W|α) (EQ. 4) [determining, based on the distribution of weights] …[0063] Where: Error (Y,Ŷ) can be any suitable function for computing an error value that represents a difference between predictions Ŷ (output by NN 90 and target values Y (e.g., cross entropy loss or Mean Square Error [further comprising: determining, based on the distribution of weights, an average weight magnitude]); L1 or L2 refer to the L1 and L2 regularization functions respectively; and R(W|α) refers to a scaling factor regulation function.)
and determining a clipping factor as a product of the scaling factor and the average weight magnitude, wherein determining, based on the weights in the distribution and the scaling factor, quantized weights is based on the clipping factor. (in [0064] Different functions may be used in different embodiments to implement scaling factor regulation function R(W|α) to push the respective weights to quantized values that are a closest multiple of scaling factor α. In example embodiments, the scaling factor regulation function R(W|α) provides a set of constraining cavities [determining a clipping factor as a product of the scaling factor and the average weight magnitude] equal in number to the number of quantization levels M, with the regulation function R(W|α) having a value of zero for every quantized weight such as R(jα)=0 for {j∈ℤ | −N≤j≤N}, and a positive value for all other values (e.g., the cavities have a zero value at each of the M quantization levels). For example, a sinusoidal function such as… [0068] As shown in FIGS. 4A and 4B, the quantized weight range [−Nα, Nα] is constrained [determining a clipping factor as a product of the scaling factor and the average weight magnitude] to be symmetrically centered at 0, with N=2^(k−1)−1 (e.g., [−7α,7α] for 4-bit quantization). The quantized values are defined by jα with {j∈ℤ | −N≤j≤N}. Every value outside the range [−Nα, Nα] is clamped [and determining a clipping factor as a product of the scaling factor and the average weight magnitude, wherein determining, based on the weights in the distribution and the scaling factor, quantized weights is based on the clipping factor] to either −Nα (if the value is less than −Nα) or Nα (if the value is greater than Nα).)
	
Regarding claim 3, the rejection of claim 2 is incorporated and Char further teaches the method according to claim 2, further comprising determining the average weight magnitude by an L1 norm function to the weights associated with the distribution. (in [0062] By way of example, the cost function applied by evaluation block 96 can be represented as: Cost=Error (Y,Ŷ)+(L1 or L2 Regularization Function)+R(W|α) (EQ. 4) [determining the average weight magnitude by an L1 norm function to the weights associated with the distribution] …[0063] Where: Error (Y,Ŷ) can be any suitable function for computing an error value that represents a difference between predictions Ŷ (output by NN 90 and target values Y (e.g., cross entropy loss or Mean Square Error [determining the average weight magnitude by an L1 norm function to the weights associated with the distribution]); L1 or L2 refer to the L1 and L2 regularization functions respectively; and R(W|α) refers to a scaling factor regulation function.)
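For illustration only, the clipping factor recited in claims 2 and 3 can be sketched as follows. This is a minimal sketch assuming that the "average weight magnitude" is the L1 norm of the weights divided by their count; the function names and the sample weights are hypothetical and do not come from the application or from Char.

```python
def average_magnitude(weights):
    """L1 norm of the weights divided by the number of weights."""
    return sum(abs(w) for w in weights) / len(weights)


def clipping_factor(scaling_factor, weights):
    """Product of the scaling factor and the average weight magnitude."""
    return scaling_factor * average_magnitude(weights)


def clip(weights, limit):
    """Clamp every weight into the symmetric range [-limit, limit]."""
    return [max(-limit, min(limit, w)) for w in weights]


weights = [0.5, -1.5, 0.25, -0.75]
limit = clipping_factor(1.0, weights)   # average magnitude is 3.0 / 4
clipped = clip(weights, limit)
```

Because the clipping range depends on both the learnable scaling factor and a statistic of the weights, it follows the weight distribution rather than being a fixed multiple of a quantization step.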

Regarding claim 7, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, further comprising determining an optimized scaling factor for a task by carrying out multiple iterations of a) through d). (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations [further comprising determining an optimized scaling factor for a task by carrying out multiple iterations of a) through d)]. Each training iteration  includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor; and (b) computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights;…. (iii) for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor [d) determining, based on a gradient of the training loss, an updated scaling factor for the neural network model] with an objective of reducing the computed cost in one or more following training iterations. When performing the plurality of training iterations, a smoothness of the respective quantization functions applied by the computational blocks is incrementally reduced for multiple training iterations of the plurality of training iterations [further comprising determining an optimized scaling factor for a task by carrying out multiple iterations of a) through d)].)

Regarding claim 8, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix. (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations [wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix]. Each training iteration  includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor; and (b) computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights;…. 
(iii) for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor [wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix] with an objective of reducing the computed cost in one or more following training iterations. When performing the plurality of training iterations, a smoothness of the respective quantization functions applied by the computational blocks [wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix] is incrementally reduced for multiple training iterations of the plurality of training iterations.; And in [0054] Each computational block 100i is configured to perform a set of operations to process its input activations X′ and generate corresponding output activations X.sup.i+1. 
The operations performed by computational block 100i can include operations commonly found in an NN layer, namely, a matrix multiplication operation (MatMul)  [wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix] 206 (which can, for example, include multiply and accumulate operations), a batch normalization (BN) operation 208, and an activation function 210. Further, in the illustrated example, the training stage computational block 100i includes a quantize activations operation 202 and a quantize weights operation 204 for respectively quantizing activations X.sup.i and weights W.sup.i [wherein the updated scaling factor is associated with one weight matrix among a plurality of weight matrices in the neural network model, the method further comprising: determining an updated scaling factor for each of the other weight matrices in the plurality of weight matrices in the neural network model by carrying out a) through d) for the respective weight matrix].)

Regarding claim 9, the rejection of claim 8 is incorporated and Char further teaches the method according to claim 8, wherein the neural network model with the quantized weights associated with the updated scaling factors is included in a student network, and the student network is trained with a teacher network, the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network. (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations [further comprising: applying the updated scaling factors to the neural network model; determining quantized weights associated with the updated scaling factors as learnable weights in the neural network model; and updating the neural network model by updating the learnable weights]. 
Each training iteration  includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor; and (b) computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights;…. (iii) for each computational block, adjusting the set of respective real-valued weights [learnable weights in the neural network model; and updating the neural network model by updating the learnable weights] and the respective scaling factor [applying the updated scaling factors to the neural network model] with an objective of reducing the computed cost in one or more following training iterations. When performing the plurality of training iterations, a smoothness of the respective quantization functions [determining quantized weights associated with the updated scaling factors as learnable weights in the neural network model; and updating the neural network model by updating the learnable weights] applied by the computational blocks  is incrementally reduced for multiple training iterations of the plurality of training iterations.)
Regarding claims 14 and 20, the limitations are similar to the limitations of claim 1 and are rejected under the same rationale. Additionally Char teaches one or more processors; and a non-transitory computer-readable medium, having computer-executable instructions stored thereon, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate:…  And the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate…, in [0020] According to a further example aspect, a processing unit is disclosed. The processing unit includes one or more processing devices and one or more storages operatively connected to the one or more processing devices and storing executable instructions that when executed by the one or more processing devices configure the processing unit to perform one or more of the methods of the preceding aspects. [0021] According to a further example aspect, a computer readable medium is disclosed that stores executable instructions that when executed by one or more processing devices configures the processing device(s) to perform one or more of the methods of the preceding aspects.

Regarding claim 15, the limitations are similar to those in claim 2, and are thus rejected under the same rationale.
Regarding claims 17 and 18, the limitations are similar to those in claims 8 and 9 respectively, and are thus rejected under the same rationale.
	
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 4-6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Charlaix et al. (US 20230306255, hereinafter ‘Char’) in view of Da Costa et al. (US 20230186095, hereinafter ‘Costa’).

Regarding claim 4, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, wherein the weights associated with the distribution are divided into a plurality of value ranges based on the scaling factor, (in [0085] As indicated at Block 708, a cost for the training iteration is computed by evaluation block 96 based on the respective output activations of the computational blocks and relative alignments of the respective quantized weights of the computational blocks with the uniform quantization levels of the respective quantization ranges [wherein the weights associated with the distribution are divided into a plurality of value ranges based on the scaling factor]. And in [0064] Different functions may be used in different embodiments to implement scaling factor regulation function R(W|α) to push the respective weights to quantized values that are a closest multiple of scaling factor α. In example embodiments, the scaling factor regulation function R(W|α) provides a set of constraining cavities [wherein the weights associated with the distribution are divided into a plurality of value ranges based on the scaling factor] equal in number to the number of quantization levels M, with the regulation function R(W|α) having a value of zero for every quantized weight such as R(jα)=0 for {j∈ℤ | −N≤j≤N}, and a positive value for all other values (e.g., the cavities have a zero value at each of the M quantization levels). For example, a sinusoidal function such as…)
the method further comprising: computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in; (in [0068] As shown in FIGS. 4A and 4B, the quantized weight range [−Nα, Nα] [the method further comprising: computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in] is constrained to be symmetrically centered at 0, with N=2^(k−1)−1 (e.g., [−7α,7α] for 4-bit quantization)… [0069] In at least some examples, this constrained quantization can result in a trained NN that can exploit hardware efficiency due to the uniform symmetric quantization, while maintaining accuracy. Accuracy is maintained by guiding the real-valued weights towards their quantized counterparts by using a regularization function. Furthermore, for each computational block 100i the quantization range is adaptive [the method further comprising: computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in] (defined by the trainable scaling factor α for the computational block 100i), which can reduce errors induced by quantization [the method further comprising: computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in].)
and computing the gradient of the training loss by aggregating the gradient contributions from the weights associated with the distribution. (in [0052] …. During each backward pass a backpropagation algorithm is applied to update trainable parameters (e.g., NN weights and biases) of the NN 90 with an objective of minimizing the cost in future iterations. For example, a backpropagation algorithm [and computing the gradient of the training loss by aggregating the gradient contributions from the weights associated with the distribution] can be applied to calculate the gradient of the cost function at the NN output block 94 and then distribute this gradient back through the computational blocks 100i to adjust the learnable parameters, including weights W′, of each computational blocks 100i. Multiple batch-based training iterations can be required to process an entire training dataset as part of a single training epoch. Multiple training epochs can be required to ultimately train the NN 90.)
Examiner notes that processing the gradient of the cost is considered computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in, as noted above for each range level.
Additionally, Costa discloses that the gradient is generated for each weight, in [0063] FIG. 4 shows how a neural network may be trained while applying a loss scaling factor L to the loss function 100 based on gradient statistics. As described for FIG. 2, the network processes training data in a forward pass through a series of layers 402, at the end of which a loss function 100 is defined. In this case the loss function is multiplied by a loss scaling factor 420 to obtain a scaled loss function 416. The loss scaling factor may be initialised as any value. Gradients are then computed for scaled loss with respect to the weights and activations. This means that the gradients 406 computed with respect to weights [computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in] and activations at each layer are scaled up or down by the same loss scaling factor 420. The gradients are propagated back through the network in a backward pass... [0094] In the forward pass, histograms are collected for activations, gradients with respect to weights [computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in] and gradients with respect to outputs, where the goal is to determine an appropriate format for representing these values. As described above for gradients, histograms have at least two bins with the histogram providing an aggregation of all values falling within the ranges [computing the gradient of the training loss by aggregating the gradient contributions from the weights associated with the distribution] indicated by each bin…)
Costa and Char are analogous art because both involve developing information processing techniques using machine learning systems and algorithms.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for developing information processing techniques of neural networks using mixed-precision numerical formats as disclosed by Costa with the method for processing of neural network models using quantization techniques, as disclosed by Char.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Costa and Char as noted above; doing so allows for processing information formats with the smallest exponent field size, as this maximises precision by allowing more mantissa bits to represent the given number (Costa, 0099).
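
For illustration only, the claim 4 limitations discussed above can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of Char, Costa, or the application. The quantizer uses the symmetric range [−Nα, Nα] with N = 2^(k−1) − 1 quoted from Char [0068]; the per-range gradient rule for the scaling factor is a common straight-through convention that is assumed here, and the names quantize, alpha_grad_contribution and alpha_grad are hypothetical.

```python
def quantize(w, alpha, bits):
    """Symmetric uniform quantization onto integer multiples of alpha
    inside [-N*alpha, N*alpha], with N = 2**(bits - 1) - 1."""
    n = 2 ** (bits - 1) - 1
    level = max(-n, min(n, round(w / alpha)))
    return level * alpha


def alpha_grad_contribution(w, alpha, bits, upstream):
    """Contribution of one weight to d(loss)/d(alpha).

    ASSUMPTION: straight-through convention, not taken from the cited art.
    The local derivative depends on which value range w falls in:
      below -N*alpha  -> -N
      above  N*alpha  ->  N
      inside          ->  round(w/alpha) - w/alpha
    `upstream` is d(loss)/d(quantized weight) for this weight.
    """
    n = 2 ** (bits - 1) - 1
    ratio = w / alpha
    if ratio <= -n:
        local = -n
    elif ratio >= n:
        local = n
    else:
        local = round(ratio) - ratio
    return upstream * local


def alpha_grad(weights, alpha, bits, upstream_grads):
    """Aggregate the per-weight contributions into one gradient for alpha."""
    return sum(
        alpha_grad_contribution(w, alpha, bits, g)
        for w, g in zip(weights, upstream_grads)
    )
```

With 4-bit quantization and α = 0.1, a weight of 5.0 lies above the range and contributes through the clipped branch, while a weight of 0.26 lies inside the range and contributes through the rounding branch; alpha_grad sums both kinds of contribution.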

Regarding claim 5, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, further comprising: setting an initial value of the scaling factor to one. (in [0060] In example embodiments, when training NN 90, a respective scaling factor α [further comprising: setting an initial value of the scaling factor to one] (also referred to as step size) and a zero-centered quantization range [−Nα, Nα] are also learned for each computational block 100i. As illustrated in FIG. 3, quantize weights operation 204 is trained to constrain the quantized weights W.sub.q.sup.i in a range symmetrically centered at 0... The scaling factor α defines the step size between two adjacent quantized values…)
Char teaches setting the scaling factor to any value which includes the claimed value.
Additionally, Costa expressly teaches setting an initial value of the scaling factor to one, in [0073] FIG. 6 shows a flow chart of how a loss scaling factor L may be updated automatically based on gradient statistics computed periodically during training of a deep learning model... The scaling factor itself is also initialised. For example, the scaling factor may initially be set to 1 [setting an initial value of the scaling factor to one], such that the gradients are not scaled up or down for the first training iterations, and once gradient statistics are known, the scaling factor is adjusted, as will be described below.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Costa and Char for the same reasons disclosed above.

Regarding claim 6, the rejection of claim 1 is incorporated and Char further teaches the method according to claim 1, further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor. (in [0060] In example embodiments, when training NN 90, a respective scaling factor α [further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor] (also referred to as step size) and a zero-centered quantization range [−Nα, Nα] are also learned for each computational block 100i. As illustrated in FIG. 3, quantize weights operation 204 is trained to constrain the quantized weights W.sub.q.sup.i  [further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor] in a range symmetrically centered at 0... The scaling factor α defines the step size between two adjacent quantized values…; And in [0017] In some examples of one or more of the preceding aspects, the set of respective real-valued weights and the respective scaling factor [further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor] for each computational block is performed using a derivative of a corresponding one of the plurality of repeated, shifted functions for at least some of the plurality of training iterations.)
Additionally, Costa does expressly teach further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor in [0073] FIG. 6 shows a flow chart of how a loss scaling factor L may be updated automatically based on gradient statistics computed periodically during training of a deep learning model... The scaling factor itself is also initialised. For example, the scaling factor may initially be set to 1 [further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor], such that the gradients are not scaled up or down for the first training iterations, and once gradient statistics are known, the scaling factor is adjusted, as will be described below. And in [0025] A first aspect disclosed herein provides a computer-implemented method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising: processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising: computing gradients of a pre-determined loss function with respect to the network weights and/or computing gradients of the pre-determined loss function with respect to the computed activations of the network, …the gradients with respect to activations computed in the backward pass, and the gradients with respect to weights computed in the backward pass; updating the network weights in dependence on the computed gradients with respect to the weights [further comprising: determining, based on initial values of the weights in the neural network model, an initial value for the scaling factor]; computing a proportion of the subset of values falling above a predefined 
threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Costa and Char for the same reasons disclosed above.
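
The two initialization options recited in claims 5 and 6 can be contrasted with a short Python sketch. It is illustrative only: the constant initialization mirrors the value of 1 quoted from Costa [0073], whereas the weight-based rule (largest absolute weight divided by N) is a hypothetical heuristic chosen for the example and is not taken from Char, Costa, or the application. The function names are likewise hypothetical.

```python
def init_scale_constant():
    """Claim 5 style: start the scaling factor at one."""
    return 1.0


def init_scale_from_weights(weights, bits):
    """Claim 6 style: derive the starting scaling factor from the initial weights.

    ASSUMPTION: max-abs heuristic, so the largest initial weight maps to the
    outermost quantization level N = 2**(bits - 1) - 1.
    """
    n = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights)
    # Fall back to 1.0 when every weight is zero to avoid a zero step size.
    return largest / n if largest > 0 else 1.0
```

Under this heuristic, 4-bit quantization of weights whose largest magnitude is 1.4 would start from a step size of about 0.2.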

Regarding claim 16, the limitations are similar to those in claim 4, and are thus rejected under the same rationale.


Claims 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Charlaix et al. (US 20230306255, hereinafter ‘Char’) in view of Liu et al. (NPL: “BiT: Robustly Binarized Multi-distilled Transformer”, hereinafter ‘Liu’) in further view of Yang et al. (US 11487944, hereinafter ‘Yang’).

Regarding claim 10, the rejection of claim 9 is incorporated and Char further teaches the method according to claim 9, wherein the neural network model with the quantized weights associated with the updated scaling factors is included in a student network, and the student network is trained with a teacher network, the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network. (in [0010] According to a first aspect of the disclosure, a method of training a neural network that comprises a plurality of computational blocks is disclosed. The method includes performing a plurality of training iterations [further comprising: applying the updated scaling factors to the neural network model; determining quantized weights associated with the updated scaling factors as learnable weights in the neural network model; and updating the neural network model by updating the learnable weights]. 
Each training iteration  includes: (i) for each computational block: (a) applying a respective quantization function to a set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor; and (b) computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights;…. (iii) for each computational block, adjusting the set of respective real-valued weights [learnable weights in the neural network model; and updating the neural network model by updating the learnable weights] and the respective scaling factor [applying the updated scaling factors to the neural network model] with an objective of reducing the computed cost in one or more following training iterations. When performing the plurality of training iterations, a smoothness of the respective quantization functions [determining quantized weights associated with the updated scaling factors as learnable weights in the neural network model; and updating the neural network model by updating the learnable weights] applied by the computational blocks  is incrementally reduced for multiple training iterations of the plurality of training iterations.)
Char does not expressly teach the use of student-teacher networks as claimed in the limitations …the updated scaling factors is included in a student network, and the student network is trained with a teacher network, the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network.
Liu does expressly teach the student-teacher network as claimed in the limitations …the updated scaling factors is included in a student network, and the student network is trained with a teacher network, in Sec. 4: …The multi-step distillation follows a quantization schedule, Q = {(b_w^1, b_a^1), (b_w^2, b_a^2), …, (b_w^k, b_a^k)} with (b_w^1, b_a^1) > (b_w^2, b_a^2) > … > (b_w^k, b_a^k). (b_w^k, b_a^k) is the target quantization level, which is in our case binary for both weights and activations. In practice, we find that down to a quantization level of W1A2, we can distill models of reasonable accuracy in single shot, following the best practices outlined in Section 3.2 (See our 1-1-2 baseline results in Table 1). As a result, we follow a fixed quantization schedule, W32A32 → W1A2 → W1A1. This is not necessarily optimal, and how to efficiently find the best quantization schedule is an interesting open problem. We present our initial explorations towards this direction in Section 5.5. Combining the elastic binary activations with multi-distillation we obtain BiT, the robustly binarized multi-distilled transformer…; And in Sec. 3.3: …Learning the scaling [the updated scaling factors is included in a student network,] and threshold parameters, and how to approximate the gradients precisely in the process becomes crucial for the final accuracy. To handle this, we propose the elastic binarization function to rescale and shift the real-valued activations [the updated scaling factors is included in a student network,], where α ∈ R+, β ∈ R: X_B^i = α·X̂_B^i = α⌊Clip((X_R^i − β)/α, 0, 1)⌉ (9) In the function, we initialize α with α* in Sec. 3.1 and β to be 0, and train it with gradients from the final loss. To back-propagate the gradients to α through the discretized binarization function, we follow the practice in Choi et al. (2018); Zhou et al. (2016) to use straight-through estimator (STE) (Bengio et al., 2013) to bypass the incoming gradients to the round function to be the outgoing gradients:…

    [Image: media_image1.png]

And Sec. 4: …This suggests a multi-step approach, where instead of directly distilling from a full-precision teacher to the desired quantization level, we first distill into a model with sufficient precision in order to preserve quality. This model can then be used as a teacher to distill into a further quantized student [and the student network is trained with a teacher network]. This process can be repeated multiple times, while at each step ensuring that the teacher and student models are sufficiently similar, and the performance loss is limited. This multi-distillation approach is sketched in Algorithm 1. 
Liu and Char are analogous art because both involve developing information processing techniques using machine learning systems and algorithms.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for developing information processing techniques for binarized Multi-distilled Transformer as disclosed by Liu with the method for processing of neural network models using quantization techniques, as disclosed by Char.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Liu and Char as noted above; doing so allows for developing and implementing fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9% (Liu, Abstract).
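
As a reading aid for the elastic binarization function of Liu Eq. (9) and the straight-through estimator quoted above, the following Python sketch shows the forward computation and one set of gradients derived under the stated STE assumption (the round function is treated as the identity in the backward pass). The gradient expressions are derived here for illustration and are not copied from Liu; the function names are hypothetical, and Python's built-in round resolves exact ties to the even integer, which may differ from the rounding used in the reference.

```python
def elastic_binarize(x, alpha, beta):
    """Forward pass: alpha * round(clip((x - beta) / alpha, 0, 1))."""
    u = (x - beta) / alpha
    clipped = min(1.0, max(0.0, u))
    return alpha * round(clipped)


def elastic_binarize_grads(x, alpha, beta):
    """Return (dy/dx, dy/dalpha, dy/dbeta) under the straight-through estimator.

    ASSUMPTION: d round(c)/dc is taken to be 1, and the clip passes gradient
    only strictly inside (0, 1).
    """
    u = (x - beta) / alpha
    if u <= 0.0:
        # Output is pinned at 0, so nothing depends on x, alpha or beta.
        return 0.0, 0.0, 0.0
    if u >= 1.0:
        # Output is pinned at alpha, so only alpha receives gradient.
        return 0.0, 1.0, 0.0
    # Inside the clip range: the rounded value minus u for alpha, -1 for beta.
    return 1.0, round(u) - u, -1.0
```

For example, with α = 0.5 and β = 0 an activation of 3.0 saturates at the upper level, so the output equals α and only α receives a nonzero gradient.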

While Liu teaches the use of knowledge distillation algorithms for processing quantization techniques of model information, Liu does not expressly teach the claimed software architecture as claimed in the limitations the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network.
Yang teaches the claimed software architecture as claimed in the limitations the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network; determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations; determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network.
(As depicted in Fig.5  
    [Image: media_image2.png, Fig. 5 of Yang]
 
And 7:59-8:2: In one embodiment, the contrastive representation distillation loss is computed as follows: (37) Let the vector representations of an input data sequence x produced by the k-th teacher be f.sup.T.sup.k(x) and by student be f.sup.S(x) [the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network]. A data sequence from the set of input data sequences is treated as a positive example x, and M other randomly sampled data sequences {x′.sub.m}.sub.m=1.sup.M are treated as negative examples. Let the vector representations of the data sequence x′.sub.m be f.sup.S(x′.sub.m). A contrastive loss [determining a first loss based on pair-wise comparison between first tokens in the set of first token representations and second tokens in the set of second token representations] is then utilized to distinguish between the positive and negative examples:… And in 8:41-56:   FIG. 5 illustrates an example system for performing the methods described herein. The methods described herein may be implemented in other systems and are not limited to system 500. The system 500 includes a Tag Predictions Module 520, a Loss Calculation Module 535, and a Student Model Optimizer 560. The Tag Prediction Modules 520 applies the teacher models 525 and the student model 530 to input data sequences 510 to obtain the tag predictions. The Loss Calculation Module 535 [determining, based on the first loss, a third loss during training of the student network; updating, based on the third loss, the student network] includes a Distillation Loss Submodule 540 which calculates the distillation losses. In certain embodiments, the Loss Calculation Module 535 also includes a Student Loss Submodule 545 and CRD Loss Submodule 550 for calculating a student loss and a CRD loss, respectively, as described above. 
The Student Model Optimizer 560 adjusts the parameters of the student model with each iteration to reduce the overall loss [a third loss during training of the student network; updating, based on the third loss, the student network]… And in 4:36-45: …The system obtains a set of input data sequences for use in transferring knowledge from the teacher models to the student model (step 220). An example of input data sequences are text strings. Each input data sequence includes one or more tokens [the method further comprising: obtaining, based on an input sequence, a set of first token representations by the student network and a set of second token representations by the teacher network]. For text strings, the individual words in the string each may be treated as a token. Knowledge can be distilled from various teacher models using only the one set of input data sequences;...)
Yang, Liu and Char are analogous art because all involve developing information processing techniques using machine learning systems and algorithms.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for developing a distillation approach to implementing information processing techniques in machine learning software that performs natural language processing as disclosed by Yang with the method for processing of neural network models using quantization techniques, as collectively disclosed by Liu and Char.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Yang, Liu and Char as noted above; doing so allows for including a contrastive representation distillation (CRD) loss in the overall loss function, which enables the student to distill domain-invariant knowledge from the teacher models and enables the student model to produce vector representations of input data sequences that are domain insensitive or less domain sensitive than they would otherwise be (Yang, 3:8-21).

Regarding claim 11, the rejection of claim 10 is incorporated and Char in combinations with Yang and Liu further teaches the method according to claim 10, wherein the first loss comprises a student- to-teacher loss and a teacher-to-student loss. (As depicted in Fig. 5 and in 8:41-56:   FIG. 5 illustrates an example system for performing the methods described herein. The methods described herein may be implemented in other systems and are not limited to system 500. The system 500 includes a Tag Predictions Module 520, a Loss Calculation Module 535, and a Student Model Optimizer 560. The Tag Prediction Modules 520 applies the teacher models 525 and the student model 530 to input data sequences 510 to obtain the tag predictions. The Loss Calculation Module 535 [wherein the first loss comprises a student- to-teacher loss and a teacher-to-student loss] includes a Distillation Loss Submodule 540 which calculates the distillation losses [wherein the first loss comprises a student- to-teacher loss and a teacher-to-student loss]. In certain embodiments, the Loss Calculation Module 535 also includes a Student Loss Submodule 545 and CRD Loss Submodule 550 for calculating a student loss and a CRD loss, respectively, as described above. The Student Model Optimizer 560 adjusts the parameters of the student model with each iteration to reduce the overall loss …; And in 3:8-21: …In certain embodiments, the overall loss is a function of the aggregate distillation loss, the student loss, and a contrastive representation distillation (CRD) loss. The CRD loss [wherein the first loss comprises a student- to-teacher loss and a teacher-to-student loss] is based on a comparison of the vector representations generated by the teacher models for each of the input data sequences, the vector representations generated by the student model for each of the input data sequences, and the vector representations generated by the student model for negative example data sequences. 
Including the CRD loss in the overall loss function enables the student to distill domain-invariant knowledge from the teacher models and enables the student model to produce vector representations of input data sequences that are domain insensitive or less domain sensitive than they would otherwise be...)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Yang, Liu and Char for the same reasons disclosed above.
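
The claim language addressed above (a first loss based on pair-wise comparison of student and teacher token representations, comprising a student-to-teacher loss and a teacher-to-student loss) can be pictured with a small Python sketch. It is a hypothetical illustration only: it assumes a cross-entropy over temperature-scaled cosine similarities in which token i of one network should match token i of the other, which is not the sigmoid-based CRD formulation quoted from Yang and is not asserted to be the applicant's implementation. All names and the default temperature are assumptions.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def _directional_loss(queries, keys, tau):
    """Mean cross-entropy of matching query i to key i among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [_dot(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


def pairwise_token_loss(student_tokens, teacher_tokens, tau=0.1):
    """Return (student_to_teacher, teacher_to_student, first_loss)."""
    s = [_normalize(v) for v in student_tokens]
    t = [_normalize(v) for v in teacher_tokens]
    s2t = _directional_loss(s, t, tau)  # each student token scored against all teacher tokens
    t2s = _directional_loss(t, s, tau)  # each teacher token scored against all student tokens
    return s2t, t2s, s2t + t2s
```

In this sketch the loss is near zero when each student token is closest to the teacher token at the same position, and it grows when the positions are mismatched.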

Claims 12-13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Charlaix et al. (US 20230306255, hereinafter ‘Char’) in view of Liu et al. (NPL: “BiT: Robustly Binarized Multi-distilled Transformer”, hereinafter ‘Liu’) in further view of Yang et al. (US 11487944, hereinafter ‘Yang’) in further view of Haidar et al. (US 20220335303, hereinafter ‘Hai’).

Regarding claim 12, the rejection of claim 10 is incorporated and Yang further teaches the method according to claim 10, further comprising: obtaining, based on the input sequence, a set of first (in As depicted in Fig.5; And 7:59-8:2: In one embodiment, the contrastive representation distillation loss is computed as follows: (37) Let the vector representations of an input data sequence x produced by the k-th teacher be f.sup.T.sup.k(x) and by student be f.sup.S(x). A data sequence from the set of input data sequences is treated as a positive example x, and M other randomly sampled data sequences {x′.sub.m}.sub.m=1.sup.M are treated as negative examples. Let the vector representations of the data sequence x′.sub.m be f.sup.S(x′.sub.m). A contrastive loss [determining a second loss based on pair-wise comparison between each first ] is then utilized to distinguish between the positive and negative examples:… And in 8:41-56:   FIG. 5 illustrates an example system for performing the methods described herein. The methods described herein may be implemented in other systems and are not limited to system 500. The system 500 includes a Tag Predictions Module 520, a Loss Calculation Module 535, and a Student Model Optimizer 560. The Tag Prediction Modules 520 applies the teacher models 525 and the student model 530 to input data sequences 510 to obtain the tag predictions [further comprising: obtaining, based on the input sequence, a set of first ]. The Loss Calculation Module 535 includes a Distillation Loss Submodule 540 which calculates the distillation losses [determining a second loss based on pair-wise comparison between each first ]. In certain embodiments, the Loss Calculation Module 535 [determining the third loss based on the first loss and the second loss] also includes a Student Loss Submodule 545 and CRD Loss Submodule 550 for calculating a student loss and a CRD loss, respectively, as described above. 
The Student Model Optimizer 560 adjusts the parameters of the student model with each iteration to reduce the overall loss [determining the third loss based on the first loss and the second loss]…)
Yang does not expressly teach the data process associated with the distillation models as claimed a set of first logits from the student network and a set of second logits from the teacher network; …each first logit in the set of first logits and respective second logit in the set of second logits;
Hai does expressly teach the data process associated with the distillation models as claimed a set of first logits from the student network and a set of second logits from the teacher network; …each first logit in the set of first logits and respective second logit in the set of second logits;, in [0009] As described above, the [t]eacher and the student each typically generate (i.e. predict) an output in the form of logits [set of first logits from the student network and a set of second logits from the teacher network; …each first logit in the set of first logits and respective second logit in the set of second logits], i.e. a non-normalized probability distribution. The predicted logits of the teacher model and the student model are then typically normalized by a softmax function to generate a normalized probability distribution, which may be used as the final prediction of the model (e.g., a model trained to perform a classification task using images as input may generate a normalized probability distribution of (“dog”=0.9, “cat”=0.1))…

Hai, Yang, Liu and Char are analogous art because all involve developing information processing techniques using machine learning systems and algorithms.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for developing a distillation approach to implementing information processing techniques in machine learning as disclosed by Hai with the method for processing of neural network models using quantization techniques, as collectively disclosed by Yang, Liu and Char.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Hai, Yang, Liu and Char as noted above; doing so allows for improving knowledge distillation using intermediate representations (Hai, 0001).

Regarding claim 13, the rejection of claim 12 is incorporated and Yang further teaches the method according to claim 12, wherein the determining of the third loss based on the first loss and the second loss further comprises determining the third loss by aggregating the first loss and the second loss with a tunable factor. (in 5:42-60: … The system aggregates the distillation losses of each of the student-teacher model pairs to compute an aggregate distillation loss (step 250)[ wherein the determining of the third loss based on the first loss and the second loss further comprises determining the third loss by aggregating the first loss and the second loss with a tunable factor]. The system computes an overall loss as function of the aggregate distillation loss (step 260). In certain embodiments, the overall loss may be equal to the aggregate distillation loss. In other embodiments, it may also include other losses [wherein the determining of the third loss based on the first loss and the second loss further comprises determining the third loss by aggregating the first loss and the second loss with a tunable factor], such as a student loss or a contrastive representation distillation (CRD) loss, as described below with respect to FIGS. 3A-3B and 4A-4B. The system repeats steps 230-260 for a number of iterations, adjusting the parameters of the student model with each iteration to reduce the overall loss (step 270). The steps may be repeated for a fixed number of iterations or until convergence is achieved. The result is a unified named-entity recognition model (i.e., the student) with the collective predictive capabilities of the teacher models without the need for the annotated training data used to train the teacher models…; And in 7:61-8:30: Let the vector representations of an input data sequence x produced by the k-th teacher be f.sup.T.sup.k(x) and by student be f.sup.S(x). 
A data sequence from the set of input data sequences is treated as a positive example x, and M other randomly sampled data sequences {x′.sub.m}.sub.m=1.sup.M are treated as negative examples. Let the vector representations of the data sequence x′.sub.m be f.sup.S(x′.sub.m). A contrastive loss is then utilized to distinguish between the positive and negative examples: … Where h(v, v′)=sigmoid(v.sup.Tv′/τ) and τ is a temperature [the second loss with a tunable factor] that adjusts the concentration level. To learn domain-invariant representations on data drawn from D.sub.k, the system maximizes the mutual information [the second loss with a tunable factor] between the student representation and each of the teacher representations by calculating the final CRD loss that as follows:… In contrast to Equation 3 above, which distills knowledge from the k-th teacher with only in-domain data, the CRD loss encourages the model to distill domain invariant knowledge of a teacher using both in-domain and out-domain data. The system calculates the overall loss as a function of the distillation loss, the student loss, and the CRD loss as set forth below: …)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Yang, Liu and Char for the same reasons disclosed above.
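For illustration only, the two quantities mapped from Yang in the claim 13 rejection can be sketched in Python: the critic h(v, v′) = sigmoid(v^T v′/τ) with temperature τ, and a third loss formed by weighting a second loss against a first loss. The office action elides Yang's actual loss equations ("…"), so the weighted sum below and the names `critic`, `aggregate_loss`, and `factor` are assumptions chosen for this sketch, not a reconstruction of Yang or of the claimed method.

```python
import math


def critic(v, v_prime, tau=0.1):
    """h(v, v') = sigmoid(v^T v' / tau), where tau is the temperature.

    A numerically stable sigmoid is used so large |v^T v'| / tau does not overflow.
    """
    x = sum(a * b for a, b in zip(v, v_prime)) / tau
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def aggregate_loss(first_loss, second_loss, factor=0.5):
    """Hypothetical third loss: first loss plus a factor-weighted second loss.

    The weighted-sum form is an assumption; the cited passage only states that
    the overall loss is "a function of" its component losses.
    """
    return first_loss + factor * second_loss


# Usage: similar vectors score above 0.5, and the factor scales the second loss.
score = critic([1.0, 0.0], [1.0, 0.0], tau=0.5)
third_loss = aggregate_loss(first_loss=1.0, second_loss=2.0, factor=0.5)
```

A smaller τ sharpens the critic's response to the same dot product, while `factor` only rescales the second loss; the two are distinct knobs, which is relevant when arguing whether a temperature reads on a "tunable factor" for aggregation.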
Regarding claim 19, the limitations are similar to those in claims 10 and 12, and are thus rejected under the same rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Moshovos et al. (US 20220092382) teaches a model quantization technique that compresses the vast majority (e.g., 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT (Bidirectional Encoder Representations from Transformers) models.
Shi et al. (US 20170270408) teaches processing the gradient of a cost, which is considered to be computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in. See [0058]: The hardware costs from hardware complexity cost generator 52 are input to hardware complexity cost gradient generator 62. These hardware costs include the bit-depth costs. The hardware cost gradient for each weight [computing a gradient contribution by each weight of the weights associated with the distribution based on the value range that the respective weight falls in] is calculated by hardware complexity cost gradient generator 62, and these gradients collected by weights selector 70. Some gradients, such as a gradient of error, can be back propagated [computing the gradient of the training loss by aggregating the gradient contributions from the weights associated with the distribution] to find the gradient over each parameter using a chain rule. Regularization costs from other regularization generator 56 are input to other regularization gradient generator 66, which generates gradients for regularization costs.
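The bracketed claim language above (a per-weight gradient contribution that depends on the value range the weight falls in, followed by aggregation) can be illustrated with a short Python sketch. It assumes a forward pass of clip(w, -alpha, alpha) with a symmetric clipping threshold `alpha`, a common setup in quantization-aware training. That setup is an assumption made for illustration and is not taken from Shi or from the application; the function name `clip_threshold_gradient` is hypothetical.

```python
def clip_threshold_gradient(weights, upstream_grads, alpha):
    """Gradient of a loss with respect to alpha for y = clip(w, -alpha, alpha).

    Each weight contributes according to the range it falls in:
      w >  alpha        -> d y / d alpha = +1
      w < -alpha        -> d y / d alpha = -1
      -alpha <= w <= alpha -> d y / d alpha = 0
    The contributions are scaled by the upstream gradient (chain rule) and summed.
    """
    total = 0.0
    for w, g in zip(weights, upstream_grads):
        if w > alpha:
            contribution = g
        elif w < -alpha:
            contribution = -g
        else:
            contribution = 0.0
        total += contribution
    return total


# Usage: only the two weights outside [-1, 1] contribute to the gradient.
grad_alpha = clip_threshold_gradient(
    weights=[-2.0, 0.5, 3.0], upstream_grads=[1.0, 1.0, 2.0], alpha=1.0
)
```

In this sketch the range test is applied per weight and the sum is the aggregation step, which mirrors the structure of the two bracketed limitations without asserting that either reference computes it this way.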

Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/OLUWATOSIN ALABI/              Primary Examiner, Art Unit 2129
Prosecution Timeline

Sep 15, 2022: Application Filed
Oct 16, 2025: Non-Final Rejection mailed (§101, §102, §103)
Jan 15, 2026: Response Filed
Jan 15, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/656,024 (Patent 12639573): Method, System, and Computer Program Product for Embedding Compression and Regularization. Granted May 26, 2026 (2y 0m to grant).
17/798,578 (Patent 12632740): METHOD AND SYSTEM FOR MULTIMODAL CLASSIFICATION BASED ON BRAIN-INSPIRED UNSUPERVISED LEARNING. Granted May 19, 2026 (3y 9m to grant).
18/093,594 (Patent 12579409): IDENTIFYING SENSOR DRIFTS AND DIVERSE VARYING OPERATIONAL CONDITIONS USING VARIATIONAL AUTOENCODERS FOR CONTINUAL TRAINING. Granted Mar 17, 2026 (3y 2m to grant).
18/802,747 (Patent 12572814): ARTIFICIAL NEURAL NETWORK BASED SEARCH ENGINE CIRCUITRY. Granted Mar 10, 2026 (1y 6m to grant).
18/196,986 (Patent 12561570): METHODS AND ARRANGEMENTS TO IDENTIFY FEATURE CONTRIBUTIONS TO ERRONEOUS PREDICTIONS. Granted Feb 24, 2026 (2y 9m to grant).
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 60%
With Interview (+22.6%): 82%
Median Time to Grant: 3y 11m (~2m remaining)
PTA Risk: Low
Based on 209 resolved cases by this examiner. Grant probability derived from career allowance rate.