DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the claims filed 6/28/2023.
Claims 1-17 are presented for examination.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
The following title is suggested: METHOD AND APPARATUS FOR TRAINING NEURAL NETWORKS USING DIFFERENTIAL DATA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims will follow the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50 (“2019 PEG”).
Claim 1
Step 1: The claim recites “A processor-implemented method, the method comprising:”; therefore, it is directed to the statutory category of a process.
Step 2A Prong 1: The claim recites, inter alia:
generating respective first neural network differential data by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network that estimates output data from the input data: These limitations recite a mathematical relationship of organizing information (generating respective first neural network differential data) and manipulating information through mathematical correlations (by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network that estimates output data from the input data). See MPEP 2106.04(a)(2)(I).
generating, using a second neural network, an output differential value of the output data with respect to the input data using the respective first neural network differential data: These limitations recite a mathematical relationship of organizing information (generating an output differential value of the output data with respect to the input data) and manipulating information through mathematical correlations (using a second neural network and the respective first neural network differential data). See MPEP 2106.04(a)(2)(I).
Thus, the claim recites a judicial exception.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
by a forward propagation process of the first neural network: These additional elements are recited at a high level of generality and merely indicate a field of use or technological environment in which to apply the judicial exception. See MPEP 2106.05(h).
training the first neural network and the second neural network based on ground truth data of the output data and ground truth data of the output differential value: These additional elements recite only the idea of a solution or outcome but fail to recite details of how the solution is accomplished. The specification identifies the technical problem as the complex calculations (e.g., a Hessian matrix) required to solve for a second-order differential when training a neural network (SPEC [0003], [0038]). The particular technological solution described in the specification is sharing parameters between the first and second neural networks and updating those shared parameters to avoid directly calculating the second-order differential (SPEC [0042], [0061], [0065], [0067]). Claim 1 merely recites the outcome of "training" the networks without reciting the mechanism for accomplishing this result (e.g., sharing parameters). Because the claim attempts to cover any implementation of training the networks without the specific details of how the technological solution is achieved, it does not reflect the improvement described in the specification. Thus, this recitation is equivalent to the words “apply it” and does not integrate the judicial exception into a practical application. See MPEP 2106.05(f).
Step 2B: The additional elements from Step 2A Prong 2 include generally linking the use of the judicial exception to indicate a field of use or technological environment and adding words equivalent to “apply it” to the judicial exception. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 2
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites, inter alia:
further comprising generating, by a layer of the second neural network, second differential data obtained by differentiating, with respect to the input data, an output of a layer of the first neural network from first differential data, of the respective first neural network differential data, obtained by differentiating an output of another layer previous to the layer of the first neural network, with respect to the input data: These limitations recite mathematical calculations via the act of differentiating being a mathematical operation and an act of calculating using mathematical methods to determine a variable or number, e.g., the differential data. See MPEP 2106.04(a)(2)(I).
based on parameters of the first neural network and a first differential value of an activation function of the layer of the first neural network: These limitations recite a mathematical relationship defining that the generation of the second differential data is a mathematical correlation between variables or numbers, specifically detailing how the second differential data relates to and is derived from the first differential data, the network parameters, and the first differential value of the activation function. See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 3
Step 1: a process, as in claim 2.
Step 2A Prong 1: The claim recites, inter alia:
wherein a second differential value of the second differential data with respect to a parameter of the layer of the first neural network is calculated by multiplying the first differential data by the first differential value: These limitations recite mathematical calculations via the act of calculating using mathematical methods to determine a variable or number, specifically by multiplying the first differential data by the first differential value to calculate the second differential value. See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 4
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites, inter alia:
wherein an activation function of a layer of the second neural network comprises a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network: These limitations recite a mathematical relationship wherein an activation function of a layer of the second neural network relates a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network. See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 5
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites the same abstract ideas of the judicial exception as in claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
further comprising storing a respective differential value of a corresponding activation function for each layer of the first neural network in a forward propagation process of the first neural network: These additional elements are recited at a high level of generality and merely indicate a field of use or technological environment in which to apply the judicial exception. See MPEP 2106.05(h).
Step 2B: The additional elements from Step 2A Prong 2 include generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 6
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites the same abstract ideas of the judicial exception as in claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
wherein parameters of the first neural network for the estimation of the output data are the same as parameters of the second neural network: While these additional elements reflect a component described in the SPEC (shared parameters), when considered as a whole with the abstract idea, it merely represents an attempt to generally link the use of the underlying judicial exception to a field of use or a technological environment (neural network parameters). See MPEP 2106.05(h). Furthermore, per the ANC Desjardins Memo, examiners should be careful to avoid oversimplifying the claims by looking at them generally and failing to account for the specific requirements of the claims. Here, the claim merely recites that the parameters are the same, but misses critical steps described in the SPEC (e.g., calculating a gradient with respect to a second loss function through backpropagation to update the shared parameters, SPEC [0067]) that actually provide the technological improvement of avoiding the Hessian matrix. Thus, the claim does not integrate the abstract idea into a practical application.
Step 2B: The additional elements from Step 2A Prong 2 include generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 7
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites, inter alia:
wherein the generating of the respective first neural network differential data further comprises: determining select input data among plural input data, for which a calculation of differential value is determined to be needed: These limitations recite mentally performable processes of using judgment to determine select input data among observed plural input data for which a calculation of a differential value is determined to be needed, as part of the generation of the respective first neural network differential data. See MPEP 2106.04(a)(2)(III).
and for each of the select input data, storing corresponding respective first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data: These limitations recite mathematical relationships of organizing and manipulating information (and for each of the select input data, storing corresponding respective first neural network differential data) through mathematical correlations (obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data). See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 8
Step 1: The claim recites “A non-transitory computer-readable storage medium”; therefore, it is directed to the statutory category of a manufacture.
Step 2A Prong 1: The claim recites the same abstract ideas of the judicial exception as in claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of claim 1: These additional elements are recited at a high level of generality and merely amount to invoking computers or other machinery as a tool to apply the underlying judicial exception. See MPEP 2106.05(f).
Step 2B: The additional elements from Step 2A Prong 2 include mere instructions to implement an abstract idea on a computer. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 9
Step 1: The claim recites “An apparatus”; therefore, it is directed to the statutory category of a machine.
Step 2A Prong 1: The claim recites, inter alia:
estimate output data with respect to input data by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network: These limitations recite mathematical calculations of differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network to estimate output data with respect to input data. See MPEP 2106.04(a)(2)(I).
and generate an output differential value of the output data with respect to the input data using respective first neural network differential data: These limitations recite a mathematical relationship of organizing information (and generate an output differential value of the output data with respect to the input data) and manipulating information through mathematical correlations (using respective first neural network differential data). See MPEP 2106.04(a)(2)(I).
Thus, the claim recites a judicial exception.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
by a forward propagation process of the first neural network… through forward propagation of a second neural network: These additional elements are recited at a high level of generality and recite only the idea of a solution or outcome but fail to recite details of how the solution is accomplished, e.g. no particulars defining what constitutes a forward propagation process of the first neural network nor forward propagation of a second neural network. Thus, this recitation is equivalent to the words “apply it” and does not integrate the judicial exception into a practical application. See MPEP 2106.05(f).
and a memory configured to store the differential data: These additional elements are recited at a high level of generality and merely amount to invoking computers or other machinery as a tool to apply the underlying judicial exception. See MPEP 2106.05(f).
Step 2B: The additional elements from Step 2A Prong 2 include “apply it” or equivalent instructions and mere instructions to implement an abstract idea on a computer. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 10
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim recites the same abstract ideas of the judicial exception as in claim 9.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
wherein the memory is further configured to store a respective differential value of an activation function for each layer of the first neural network, wherein the respective differential value is obtained through the forward propagation of the first neural network: These additional elements are recited at a high level of generality and merely indicates a field of use or technological environment in which to apply the judicial exception. See MPEP 2106.05(h).
Step 2B: The additional elements from Step 2A Prong 2 include generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 11
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim recites, inter alia:
wherein an activation function of a layer of the second neural network comprises a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network: These limitations recite a mathematical relationship wherein an activation function of a layer of the second neural network relates a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network. See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 12
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim depends from claim 9 and thus recites the same judicial exception.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of the claim are as follows:
wherein parameters of the first neural network for the estimation of the output data are the same as parameters of the second neural network: While these additional elements reflect a component described in the SPEC (shared parameters), when considered as a whole with the abstract idea, it merely represents an attempt to generally link the use of the underlying judicial exception to a field of use or a technological environment (neural network parameters). See MPEP 2106.05(h). Furthermore, per the ANC Desjardins Memo, examiners should be careful to avoid oversimplifying the claims by looking at them generally and failing to account for the specific requirements of the claims. Here, the claim merely recites that the parameters are the same, but misses critical steps described in the SPEC (e.g., calculating a gradient with respect to a second loss function through backpropagation to update the shared parameters, SPEC [0067]) that actually provide the technological improvement of avoiding the Hessian matrix. Thus, the claim does not integrate the abstract idea into a practical application.
Step 2B: The additional elements from Step 2A Prong 2 include generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 13
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim recites, inter alia:
wherein the first neural network and the second neural network are trained based on a first loss function and a second loss function, wherein the first loss function is based on ground truth of the output data and an estimated value of the output data that is output from the forward propagation of the first neural network, and the second loss function is based on the output differential value and ground truth data of the output differential value with respect to the input data: These limitations recite mathematical relationships of organizing information and manipulating information (wherein the first loss function is based on ground truth of the output data and an estimated value of the output data that is output from the forward propagation of the first neural network, and the second loss function is based on the output differential value and ground truth data of the output differential value with respect to the input data) through mathematical correlations (wherein the first neural network and the second neural network are trained based on a first loss function and a second loss function). See MPEP 2106.04(a)(2)(I).
Thus, the claim recites a judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 14
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim recites, inter alia:
wherein the second neural network comprises a layer defined to output second differential data obtained by differentiating, with respect to the input data, an output of a layer of the first neural network from first differential data, of the respective first neural network differential data, obtained by differentiating an output of another layer, previous to the layer of the first neural network, with respect to the input data, based on parameters of the first neural network and a first differential value of an activation function of the layer of the first neural network: These limitations recite mathematical calculations via the act of differentiating being a mathematical operation and an act of calculating using mathematical methods to determine a variable or number, e.g., the second differential data and first differential data. See MPEP 2106.04(a)(2)(I).
Thus, the claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 15
Step 1: a machine, as in claim 14.
Step 2A Prong 1: The claim recites, inter alia:
wherein a second differential value of the second differential data with respect to a parameter of the layer of the first neural network is calculated by multiplying the first differential data by the first differential value: These limitations recite mathematical calculations via the act of calculating using mathematical methods to determine a variable or number, specifically by multiplying the first differential data by the first differential value to calculate the second differential value. See MPEP 2106.04(a)(2)(I).
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 16
Step 1: a machine, as in claim 9.
Step 2A Prong 1: The claim recites, inter alia:
wherein the processor is configured to calculate differential data obtained by differentiating the output of each layer of the first neural network with respect to the input data: These limitations recite mathematical calculations via the act of differentiating being a mathematical operation and an act of calculating using mathematical methods to determine a variable or number, e.g., the differential data. See MPEP 2106.04(a)(2)(I).
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 17
Step 1: a machine, as in claim 16.
Step 2A Prong 1: The claim recites, inter alia:
wherein, in the calculating of the differential data, the processor is configured to: and for each of the select input data, calculate the differential data corresponding respectively to first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data: These limitations recite mathematical calculations via the act of differentiating being a mathematical operation and an act of calculating using mathematical methods to determine a variable or number, specifically by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data to calculate the differential data. See MPEP 2106.04(a)(2)(I).
determine select input data among plural input data, for which a calculation of a differential value is determined to be needed: These limitations recite mentally performable processes of using judgement to determine select input data among observed plural input data for which a calculation of a differential value is determined to be needed. See MPEP 2106.04(a)(2)(III).
Step 2A Prong 2 & Step 2B: There are no additional elements recited so the claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6 and 8-16 are rejected under 35 U.S.C. 103 as being unpatentable over Czarnecki et al. (hereinafter Czarnecki), “Sobolev Training for Neural Networks” (2017), in view of Bishop, “Exact Calculation of the Hessian Matrix for the Multi-layer Perceptron” (1992).
Bishop was disclosed in an IDS dated 2/28/2024.
Regarding independent claim 1, Czarnecki teaches a processor-implemented method, the method comprising: (Page 5 Section 4.1 footnote 2 "All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1]", which necessarily requires a processor); a first neural network that estimates output data from the input data (Page 3 Section 2 Paragraph 2 "Considering a neural network model m parameterised with ϴ, one typically seeks to minimise the empirical error in relation to f", where m estimates the output f(xi) from input training points xi); generating, using a second neural network, an output differential value of the output data with respect to the input data (Pages 2-3 Figure 1 showing the derivative network Dx m, which acts as a second neural network to generate the derivative of the output m with respect to input x); and training the first neural network and the second neural network based on ground truth data of the output data and ground truth data of the output differential value (Page 3 Section 2 Equation 1 showing the Sobolev training loss function that minimizes the error between the network output m(xi | ϴ) (the first neural network) and ground truth f(xi), as well as the error between the network derivative Dx^j m(xi | ϴ) (the second neural network) and the ground truth derivative Dx^j f(xi) (the ground truth data of the output differential value)).
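For reference, the objective characterized in the mapping above can be written as follows; this is a reconstruction from the citations above, not a quotation, and the notation should be verified against Czarnecki's Equation 1:

```latex
% Sobolev training objective (reconstruction of Czarnecki Eq. 1 from the
% mapping above; verify notation against the reference).
\[
\min_{\theta}\; \sum_{i=1}^{N} \Big[\, \ell\big(m(x_i \mid \theta),\, f(x_i)\big)
  \;+\; \sum_{j=1}^{K} \ell_j\big(D_x^{j}\, m(x_i \mid \theta),\, D_x^{j} f(x_i)\big) \Big]
\]
```

The first term penalizes error against the ground truth output f(xi), and the second penalizes error of the network derivative against the ground truth derivative, consistent with the claim mapping above.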
Czarnecki does not explicitly teach generating respective first neural network differential data by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network that estimates output data from the input data, by a forward propagation process of the first neural network; and using the respective first neural network differential data.
However, Bishop teaches generating respective first neural network differential data by differentiating a respective output of each layer of a first neural network with respect to an internal pre-activation, by a forward propagation process of the first neural network (Page 2 Equation 7 defining gli≡∂al/∂ai and Equation 9 utilizes f'(al) gli which is the derivative of the output of a layer zl of a first neural network with respect to internal pre-activations ai provided to the first neural network, Page 3 Paragraph 1 demonstrates the generation of this differential data for each layer via forward propagation using Equation 11, stating "The remaining elements of gli can then be found by forward propagation using equation 11", teaching that the generation of differential data for each layer by a forward propagation process of the first neural network) and using the respective first neural network differential data (Bishop Page 2 Equation 9 showing the use of the intermediate differential data gli to compute the final derivatives, and Page 2 explaining that the forward propagation is used to sequentially compute the derivatives for subsequent layers up to the output, thereby using the respective first neural network differential data to generate the output differential value).
Because Czarnecki and Bishop both address the computation and use of derivatives of neural network outputs with respect to their inputs, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Bishop's teaching of exactly calculating derivatives layer-by-layer via forward propagation into Czarnecki's processor-implemented method, with a reasonable expectation of success, such that Czarnecki incorporates Bishop's forward-propagation derivative calculation framework applied to the network input data x, instead of internal pre-activations ai, to compute the input derivative required for Sobolev training, thereby teaching generating respective first neural network differential data by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network that estimates output data from the input data, by a forward propagation process of the first neural network, and generating, using a second neural network, an output differential value of the output data with respect to the input data using the respective first neural network differential data. This modification would have been motivated by the desire to accurately and efficiently compute the exact derivatives required for the loss function, which allows all elements of the Hessian matrix to be evaluated exactly for a feed-forward network of arbitrary topology and can readily be implemented in software (Bishop Page 1 Paragraph 3).
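For context only, the combined teaching relied upon above (propagating the derivative of each layer's output with respect to the network input forward through the network, alongside the ordinary forward pass) can be sketched as follows. The scalar-input MLP, the (weight, bias) layer representation, and the tanh activation are illustrative assumptions; this is not the references' or the applicant's actual code.

```python
import math

# Illustrative sketch only: forward-propagating, layer by layer, the
# derivative of each layer's output with respect to the network input,
# in a single forward pass (no backpropagation). The scalar MLP with
# tanh activations is an assumed toy example.
def forward_with_input_derivatives(layers, x):
    """layers: list of (weight, bias) pairs for a scalar-input MLP.

    Returns a list of (z_l, dz_l/dx) for each layer l: the layer output
    and its derivative with respect to the input x.
    """
    z, dz = x, 1.0  # the "layer 0" output is the input itself
    results = []
    for w, b in layers:
        a = w * z + b            # pre-activation
        da = w * dz              # d(a)/dx by the chain rule
        z = math.tanh(a)         # layer output
        dz = (1.0 - z * z) * da  # tanh'(a) = 1 - tanh(a)^2
        results.append((z, dz))
    return results
```

A finite-difference check of the final layer's propagated derivative against the numerical derivative of the full forward pass confirms the chain-rule bookkeeping.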
Regarding dependent claim 2, Czarnecki, in view of Bishop, teach the method of claim 1, further comprising generating, by a layer of the second neural network, second differential data obtained by differentiating, with respect to the input data, an output of a layer of the first neural network (see Czarnecki Pages 2-3, where Figure 1 teaches a secondary computational graph used to compute derivatives (the second neural network), and Bishop teaches the exact layer-by-layer mathematical formulation for computing these derivatives via forward propagation of the first neural network. When combined, a layer of Czarnecki's secondary computational graph (i.e., a layer of the second neural network) executes the calculation for a corresponding layer of the first neural network. Specifically, Bishop Page 2 Equation 9 utilizes the exact value f'(al) gli, which is the derivative of the activation output of a layer l (zl = f(al)) with respect to internal pre-activation ai, which, as modified in combination with Czarnecki, is applied to the input data x. Therefore, a layer of the second neural network generates this second differential data f'(al) gli obtained by differentiating the output of a layer of the first neural network zl with respect to the input data x); from first differential data, of the respective first neural network differential data, obtained by differentiating an output of another layer previous to the layer of the first neural network, with respect to the input data (see Bishop Page 2 Equation 11, where the term f'(ar) gri is the derivative of the activation output zr of the previous layer r with respect to the internal pre-activation ai, which, as modified in combination with Czarnecki, is applied to the input data x. This value represents the claimed "first differential data" obtained by differentiating the output of a previous layer of the first neural network with respect to the input data, and is part of the respective first neural network differential data); based on parameters of the first neural network and a first differential value of an activation function of the layer of the first neural network (see Bishop Page 2, where Equations 9 and 11 teach that the second differential data f'(al) gli is calculated based on the weights connecting the layers wlr, representing the parameters of the first neural network, and the derivative of the activation function of the current layer f'(al), representing the first differential value of an activation function of the layer of the first neural network).
Regarding dependent claim 3, Czarnecki, in view of Bishop, teach the method of claim 2, wherein a second differential value of the second differential data with respect to a parameter of the layer of the first neural network (see Bishop Page 2 Equations 9 and 11, which establish the mathematical relationship for the second differential data f'(al) gli. The partial derivative of this second differential data with respect to the weight parameter wlr represents the second differential value of the second differential data with respect to a parameter of the layer of the first neural network); is calculated by multiplying the first differential data by the first differential value (see Bishop; based on Bishop's equations, the partial derivative of the second differential data with respect to the parameter wlr is exactly f'(al) f'(ar) gri. This is calculated by multiplying the first differential data f'(ar) gri by the first differential value of the activation function f'(al)).
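For reference, the relationships mapped in the rejections of claims 2 and 3 above can be written out as follows (reconstructed from the discussion of Bishop's Equations 9 and 11; this notation is illustrative of the mapping, not a quotation of Bishop):

```latex
% Forward propagation of the differential data (cf. Bishop's Equation 11):
g_{li} = \sum_{r} w_{lr}\, f'(a_r)\, g_{ri}
% Second differential data for layer l (claim 2 mapping):
f'(a_l)\, g_{li}
% Its derivative with respect to the parameter w_{lr} (claim 3 mapping):
\frac{\partial}{\partial w_{lr}}\bigl(f'(a_l)\, g_{li}\bigr)
  = f'(a_l)\, f'(a_r)\, g_{ri}
```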
Regarding dependent claim 4, Czarnecki, in view of Bishop, teach the method of claim 1, wherein an activation function of a layer of the second neural network comprises a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network (see Bishop Page 2 Equations 9 and 11 teach that in the forward propagation of the derivatives, which corresponds to the second neural network's operation, the operation applied at layer l to produce the output derivative f'(al) gli includes multiplying the weighted sum by the derivative of the original activation function f'(al) of the corresponding layer l of the first neural network. This multiplication acts as the activation function of a layer of the second neural network).
Regarding dependent claim 5, Czarnecki, in view of Bishop, teach the method of claim 1, further comprising storing a respective differential value of a corresponding activation function for each layer of the first neural network (see Bishop Page 2 Equation 11 and Page 4 Equation 21 which require evaluating and utilizing the derivative of the activation function, such as f'(ar) or f'(al), for each layer) in a forward propagation process of the first neural network. (see Bishop Page 4 last paragraph to Page 5 "For each pattern p, the {zn} are calculated by forward propagation using equations 1 and 2, and the {gli} are obtained by forward propagation using equation 11." This teaches that the activations are calculated during the initial forward propagation, and their corresponding activation function derivatives must be evaluated and stored during this forward propagation process of the first neural network to subsequently compute the forward propagation of the differential data).
Regarding dependent claim 6, Czarnecki, in view of Bishop, teach the method of claim 1, wherein parameters of the first neural network for the estimation of the output data are the same as parameters of the second neural network (see Czarnecki Page 3 Figure 1 showing both the m node, which estimates the output data, and the Dx m node, which generates the output differential value, parameterized by the exact same shared parameters ϴ; see Bishop Page 2 Equation 11 also teaches that the same weights wlr used in the original network are used in the derivative calculations, meaning the parameters are the same).
Regarding claim 8, it recites a non-transitory computer-readable storage medium counterpart to the method of claim 1. Thus, claim 8 is rejected for the same reasons as claim 1. In addition, Czarnecki teaches a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method (Page 2 first paragraph "…incorporated into any training pipeline using modern machine learning libraries" suggests necessarily requiring a non-transitory computer-readable storage medium storing instructions executed by a processor to perform the training method).
Regarding independent claim 9, Czarnecki teaches an apparatus, comprising: a processor configured to: (Page 1 first paragraph "…incorporated into any training pipeline using modern machine learning libraries" and Page 5 Section 4.1 footnote 2 "All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1]"; suggests an apparatus with a processor configured to perform experiments using machine learning libraries); estimate output data with respect to input data (Pages 2-3 Figure 1 showing the network m, which estimates the output data (estimate output data) from the input x (with respect to input data)); and generate an output differential value of the output data with respect to the input data through forward propagation of a second neural network (Page 2 Figure 1 showing the derivative network Dx m computational graph or network (a second neural network) that shares parameters with the first neural network and is used to calculate the derivative of the output with respect to the input. The derivative network Dx m acts as this second neural network to generate the derivative of the output m with respect to input x, which is the output differential value, through forward propagation of this computational graph); and a memory configured to store the differential data (Page 2 suggests using machine learning libraries on a computer to compute and store gradients/derivatives, necessarily in memory, during the forward propagation of the computational graph).
Czarnecki does not explicitly teach by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network by a forward propagation process of the first neural network; and using respective first neural network differential data.
However, Bishop teaches by differentiating a respective output of each layer of a first neural network with respect to an internal pre-activation by a forward propagation process of the first neural network (Page 2 Equation 7 defining gli≡∂al/∂ai, which is the derivative of the activation of a layer with respect to a previous layer or internal pre-activations ai, and Page 3 Paragraph 1 "The remaining elements of gli can then be found by forward propagation using equation 11", demonstrating the generation of differential data for each layer via forward propagation) and using respective first neural network differential data (Page 2 Equation 9 showing the use of the intermediate differential data gli (using the respective first neural network differential data) to compute the final second-order derivatives/Hessian matrix elements and Page 2 explaining that the forward propagation is used to sequentially compute the derivatives for subsequent layers up to the output, thereby using the respective first neural network differential data to generate the output differential value).
Because Czarnecki and Bishop both address computing and utilizing derivatives of neural network outputs with respect to their inputs, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of exactly calculating derivatives layer by layer via forward propagation, as suggested by Bishop, into Czarnecki's apparatus, with a reasonable expectation of success, such that Czarnecki incorporates Bishop's forward-propagation derivative calculation framework applied to the network inputs x, instead of to the internal pre-activations ai, to compute the input derivative required for Sobolev training, thereby teaching by differentiating a respective output of each layer of a first neural network with respect to input data provided to the first neural network by a forward propagation process of the first neural network; and generate an output differential value of the output data with respect to the input data using respective first neural network differential data through forward propagation of a second neural network. This modification would have been motivated by the desire to accurately and efficiently compute the exact derivatives required for the loss function, allowing all elements of the Hessian matrix to be evaluated exactly for a feed-forward network of arbitrary topology in a manner that can readily be implemented in software (Bishop Page 1 Paragraph 3).
Regarding dependent claim 10, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein the memory is further configured to store a respective differential value of an activation function for each layer of the first neural network (see Bishop Page 2 Equation 11 and Page 4 Equation 21 which require evaluating and utilizing the derivative of the activation function, such as f'(ar) or f'(al), for each layer of the first neural network, which necessarily requires storing them in memory), wherein the respective differential value is obtained through the forward propagation of the first neural network (see Bishop Page 4 last paragraph to Page 5 "For each pattern p, the {zn} are calculated by forward propagation using equations 1 and 2, and the {gli} are obtained by forward propagation using equation 11." This teaches that the activations are calculated during the initial forward propagation, and their corresponding activation function derivatives must be evaluated and obtained during this forward propagation process of the first neural network to subsequently compute the forward propagation of the differential data).
Regarding dependent claim 11, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein an activation function of a layer of the second neural network comprises a function that multiplies a differential value of an activation function of a layer of the first neural network that corresponds to the layer of the second neural network (see Bishop Page 2 Equations 9 and 11 teach that in the forward propagation of the derivatives, which corresponds to the second neural network's operation, the operation applied at layer l to produce the output derivative f'(al) gli includes multiplying the weighted sum by the derivative of the original activation function f'(al) of the corresponding layer l of the first neural network. This multiplication acts as the activation function of a layer of the second neural network).
Regarding dependent claim 12, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein parameters of the first neural network for the estimation of the output data are the same as parameters of the second neural network (see Czarnecki Page 3 Figure 1 showing both the m node, which estimates the output data, and the Dx m node, which generates the output differential value, parameterized by the exact same shared parameters ϴ; see Bishop Page 2 Equation 11 also teaches that the same weights wlr used in the original network are used in the derivative calculations, meaning the parameters are the same).
Regarding dependent claim 13, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein the first neural network and the second neural network are trained based on a first loss function and a second loss function (see Czarnecki Page 3 Section 2 Equation 1 showing the loss functions with network output m(xi | ϴ) and network derivative Dx^j m(xi | ϴ) during training with Sobolev spaces), wherein the first loss function is based on ground truth of the output data and an estimated value of the output data that is output from the forward propagation of the first neural network (see Czarnecki Page 3 Section 2 Equation 1 showing the loss function that minimizes the error between the network output m(xi | ϴ), which is an estimated value of the output data, and ground truth f(xi), which is the first loss function based on ground truth of the output data), and the second loss function is based on the output differential value and ground truth data of the output differential value with respect to the input data (see Czarnecki Page 3 Section 2 Equation 1 showing the error between the network derivative Dx^j m(xi | ϴ), which is the output differential value, and ground truth derivative Dx^j f(xi), which is the ground truth data of the output differential value with respect to the input data, representing the second loss function).
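Schematically, the two loss functions mapped above (Czarnecki's Equation 1) combine a value term and a derivative term. The following sketch assumes a squared-error loss for both terms; that choice, and the function name, are illustrative only and not a quotation of the reference:

```python
import numpy as np

def sobolev_loss(m_out, f_out, m_grad, f_grad):
    """Illustrative Sobolev-style training objective: a first loss between the
    estimated output m(x) and the ground truth f(x), plus a second loss between
    the output differential value Dx m(x) and its ground truth Dx f(x)."""
    first_loss = np.mean((m_out - f_out) ** 2)     # value term (first loss function)
    second_loss = np.mean((m_grad - f_grad) ** 2)  # derivative term (second loss function)
    return first_loss + second_loss
```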
Regarding dependent claim 14, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein the second neural network comprises a layer defined to output second differential data obtained by differentiating, with respect to the input data, an output of a layer of the first neural network (see Czarnecki Pages 2-3 Figure 1 teaches a secondary computational graph used to compute derivatives (the second neural network) and Bishop teaches the exact layer-by-layer mathematical formulation for computing these derivatives via forward propagation of the first neural network. When combined, a layer of Czarnecki's secondary computational graph, i.e., a layer of the second neural network, executes the calculation for a corresponding layer of the first neural network. Specifically, Bishop Page 2 Equation 9 utilizes the exact value f'(al) gli, which is the derivative of the activation output of a layer l (zl = f(al)) with respect to internal pre-activation ai, which as modified when combined with Czarnecki is applied to the input data x. Therefore, a layer of the second neural network generates this second differential data f'(al) gli obtained by differentiating the output of a layer of the first neural network zl with respect to the input data x) from first differential data, of the respective first neural network differential data, obtained by differentiating an output of another layer, previous to the layer of the first neural network, with respect to the input data, (See Bishop Page 2 Equation 11 where the term f'(ar) gri is the derivative of the activation output zr of the previous layer r with respect to the internal pre-activation ai, which as modified when combined with Czarnecki is applied to the input data x. 
This value represents the claimed "first differential data" obtained by differentiating the output of a previous layer of the first neural network with respect to the input data, which is part of the respective first neural network differential data) based on parameters of the first neural network and a first differential value of an activation function of the layer of the first neural network (see Bishop Page 2 Equation 9 and Equation 11 teach that the second differential data f'(al) gli is calculated based on the weights connecting the layers wlr, representing the parameters of the first neural network, and the derivative of the activation function of the current layer f'(al), representing the first differential value of an activation function of the layer of the first neural network).
Regarding dependent claim 15, Czarnecki, in view of Bishop, teach the apparatus of claim 14, wherein a second differential value of the second differential data with respect to a parameter of the layer of the first neural network (see Bishop Page 2 Equations 9 and 11 establish the mathematical relationship for the second differential data f'(al) gli. The partial derivative of this second differential data with respect to the weight parameter wlr represents the second differential value of the second differential data with respect to a parameter of the layer of the first neural network) is calculated by multiplying the first differential data by the first differential value (see Bishop based on Bishop's equations, the partial derivative of the second differential data with respect to the parameter wlr is exactly f'(al) f'(ar) gri. This is calculated by multiplying the first differential data f'(ar) gri by the first differential value of the activation function f'(al)).
Regarding dependent claim 16, Czarnecki, in view of Bishop, teach the apparatus of claim 9, wherein the processor is configured to calculate differential data obtained by differentiating the output of each layer of the first neural network with respect to the input data (see Bishop Page 2 Equation 7 defining gli≡∂al/∂ai, which is the derivative of the activation of a layer with respect to a previous layer or internal pre-activations ai, and Page 3 Paragraph 1 "The remaining elements of gli can then be found by forward propagation using equation 11", demonstrating the generation of differential data for each layer via forward propagation, which as modified when combined with Czarnecki is applied to the input data x, thereby calculating differential data obtained by differentiating the output of each layer of the first neural network with respect to the input data).
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Czarnecki in view of Bishop, as applied in the rejections of claims 1 and 16 above, and further in view of Baydin et al. (hereinafter Baydin) “Automatic differentiation in machine learning: a survey” (2018).
Regarding dependent claim 7, Czarnecki, in view of Bishop, teach all the elements of claim 1.
Czarnecki and Bishop do not expressly teach wherein the generating of the respective first neural network differential data further comprises: determining select input data among plural input data, for which a calculation of differential value is determined to be needed; and for each of the select input data, storing corresponding respective first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data.
However, Baydin teaches wherein the generating of the respective first neural network differential data further comprises: determining select input data among plural input data, for which a calculation of differential value is determined to be needed (Page 25 Section 5.2 "For procedures coded in ANSI C, the ADIC tool (Bischof et al., 1997) implements AD as a source code transformation after the specification of dependent and independent variables", teaching the determination or specification of select independent variables or input data among all possible inputs for which a calculation of differential value is needed); and for each of the select input data, storing corresponding respective first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data (Page 9 Section 3.1 "each forward pass of AD is initialized by setting only one of the variables x_dot_i = 1 and setting the rest to zero... A run of the code with specific input values x = a then computes y_dot_j = partial yj / partial xi | x=a" and Page 10 "giving us one column of the Jacobian matrix... Thus, the full Jacobian can be computed in n evaluations"; teaches that for each specified independent variable or select input data, the forward pass respectively differentiates the outputs with respect to that specific input and stores the resulting differential data as a column of the Jacobian matrix).
Because Czarnecki, in view of Bishop, and Baydin address the issue of efficiently computing derivatives for neural networks and complex functions, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of specifying independent variables to selectively compute and store corresponding columns of the Jacobian matrix, as suggested by Baydin, into Czarnecki and Bishop's processor-implemented method, with a reasonable expectation of success, such that the combination specifies select input variables to compute and store only the needed differential data, thereby teaching wherein the generating of the respective first neural network differential data further comprises: determining select input data among plural input data, for which a calculation of differential value is determined to be needed; and for each of the select input data, storing corresponding respective first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data. This modification would have been motivated by the desire to reduce computational complexity and memory overhead by calculating and storing the differential data only for the specific input variables required for the task at hand, avoiding unnecessary evaluations of the full Jacobian when only a subset of derivatives is needed (Baydin Page 10).
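The forward-mode technique quoted from Baydin above, seeding exactly one selected input with a unit tangent so that a single forward pass yields one column of the Jacobian, can be sketched with a minimal dual-number type (the class and function names here are hypothetical, for illustration only):

```python
class Dual:
    """Minimal forward-mode AD value carrying a primal value and a tangent."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule
    __rmul__ = __mul__

def jacobian_column(f, x, i):
    """Seed only the selected input i (x_dot_i = 1, all others 0) and run one
    forward pass; the output tangents form column i of the Jacobian."""
    seeded = [Dual(v, 1.0 if k == i else 0.0) for k, v in enumerate(x)]
    return [y.dot for y in f(seeded)]
```

Only the columns for the selected inputs are computed, mirroring Baydin's observation that the full Jacobian requires n such evaluations; restricting attention to the needed inputs avoids the rest.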
Regarding dependent claim 17, Czarnecki, in view of Bishop, teach all the elements of claim 16.
Czarnecki and Bishop do not expressly teach wherein, in the calculating of the differential data, the processor is configured to: determine select input data among plural input data, for which a calculation of a differential value is determined to be needed; and for each of the select input data, calculate the differential data corresponding respectively to first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data.
However, Baydin teaches wherein, in the calculating of the differential data, the processor is configured to: determine select input data among plural input data, for which a calculation of a differential value is determined to be needed (Page 25 Section 5.2 "For procedures coded in ANSI C, the ADIC tool (Bischof et al., 1997) implements AD as a source code transformation after the specification of dependent and independent variables", teaching the determination or specification of select independent variables or input data among all possible inputs for which a calculation of differential value is needed); and for each of the select input data, calculate the differential data corresponding respectively to first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data (Page 9 Section 3.1 "each forward pass of AD is initialized by setting only one of the variables x_dot_i = 1 and setting the rest to zero... A run of the code with specific input values x = a then computes y_dot_j = partial yj / partial xi | x=a" and Page 10 "giving us one column of the Jacobian matrix... Thus, the full Jacobian can be computed in n evaluations"; teaches that for each specified independent variable or select input data, the forward pass respectively differentiates the outputs with respect to that specific input and calculates the resulting differential data as a column of the Jacobian matrix).
Because Czarnecki, in view of Bishop, and Baydin address the issue of efficiently computing derivatives for neural networks and complex functions, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of specifying independent variables to selectively compute corresponding columns of the Jacobian matrix, as suggested by Baydin, into Czarnecki and Bishop's apparatus, with a reasonable expectation of success, such that Czarnecki incorporates Baydin's technique of specifying select input variables to compute only the needed differential data, thereby teaching wherein, in the calculating of the differential data, the processor is configured to: determine select input data among plural input data, for which a calculation of a differential value is determined to be needed; and for each of the select input data, calculate the differential data corresponding respectively to first neural network differential data obtained by respectively differentiating the outputs of each layer of the first neural network with a corresponding select input data. This modification would have been motivated by the desire to reduce computational complexity and memory overhead by calculating the differential data only for the specific input variables required for the task at hand, avoiding unnecessary evaluations of the full Jacobian when only a subset of derivatives is needed (Baydin Page 10).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
KIM et al. (US 2025/0155389 A1) (May 15, 2025) (ABSTRACT A data processing method according to an embodiment of the present invention comprises the steps of: training a neural network; receiving input data from the outside; and converting the received input data by means of the trained neural network, wherein the training step comprises the steps of: generating one or more pieces of generative data from raw data; converting the generative data into output data by means of the neural network; evaluating the output data on the basis of the raw data; and optimizing the neural network on the basis of the evaluation result, wherein the raw data and the generative data conform to a statistical distribution, and the raw data and the output data have higher signal-to-noise ratios than the generative data).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KUANG FU CHEN whose telephone number is (571)272-1393. The examiner can normally be reached M-F 9:00-5:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch can be reached on (571) 272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KC CHEN/Primary Patent Examiner, Art Unit 2143