Prosecution Insights
Last updated: April 19, 2026
Application No. 17/855,955

SYSTEM AND METHOD FOR EVALUATING WEIGHT INITIALIZATION FOR NEURAL NETWORK MODELS

Non-Final OA: §101, §103, §112
Filed: Jul 01, 2022
Examiner: BOSTWICK, SIDNEY VINCENT
Art Unit: 2124
Tech Center: 2100 — Computer Architecture & Software
Assignee: Cognizant Technology Solutions US Corp.
OA Round: 3 (Non-Final)

Grant Probability: 52% (Moderate)
OA Rounds: 3-4
To Grant: 4y 7m
With Interview: 90%

Examiner Intelligence

Grants 52% of resolved cases.
Career Allow Rate: 52% (71 granted / 136 resolved; -2.8% vs TC avg)
Interview Lift: +38.2% for resolved cases with interview (strong)
Avg Prosecution: 4y 7m (typical timeline); 68 currently pending
Total Applications: 204 across all art units (career history)

Statute-Specific Performance

§101: 24.4% (-15.6% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)
Tech Center averages are estimates • Based on career data from 136 resolved cases

Office Action

§101, §103, §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/30/2025 has been entered.

Remarks

This Office Action is responsive to Applicant's Amendment filed on December 30, 2025, in which claims 1, 13, and 25 are currently amended. Claims 1-25 are currently pending.

Response to Arguments

Applicant's arguments with respect to the rejection of claims 1-25 under 35 U.S.C. 101 based on amendment have been considered. With respect to Applicant's arguments on p. 14 of the Remarks submitted 12/30/2025 that "the specific claimed features [...] are rooted in computer technology which cannot be practically performed in the human mind", Examiner respectfully disagrees. Examiner asserts that the claims are rooted in observation, evaluation, and judgement of mathematical calculations and relationships (mean and variance). Claim 1 explicitly recites building, evaluation, and determination steps which can be readily performed in the mind, with the exception of the recitation of "by the processor", which is seen as mere instructions to apply the judicial exception using generic computer components. With respect to Applicant's arguments on pp. 11-12 of the Remarks submitted 12/30/2025 that the claims are not directed towards an abstract idea but rather "a special purpose computer limited to the use of the particularly claimed combination of elements performing the particularly claimed combination of functions", Examiner respectfully disagrees.
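For context on the mean-and-variance mathematics the rejection characterizes as mental, the standard propagation identity from the initialization literature (the Glorot/He analysis relied on later in this action) can be written out. This derivation is illustrative only and is not quoted from the claims:

```latex
% For a fully connected layer y_j = \sum_{i=1}^{n_{\mathrm{in}}} w_{ji} x_i,
% with zero-mean, i.i.d. weights independent of the zero-mean inputs:
\mathbb{E}[y_j] = n_{\mathrm{in}}\,\mathbb{E}[w]\,\mathbb{E}[x] = 0,
\qquad
\operatorname{Var}(y_j) = n_{\mathrm{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x).
% Choosing Var(w) = 1/n_in therefore maps unit input variance to unit
% output variance (2/n_in under a ReLU nonlinearity, the He rule).
```

Under this identity, selecting the weight variance is a closed-form calculation once the input statistics are known, which is the sense in which the rejection treats the relationship as purely mathematical.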
The instant specification explicitly states ([¶0028] "the client-computing device may be a general purpose computer, such as a desktop, a laptop, a smartphone and a tablet; a super computer; a microcomputer or any device capable of executing instructions, connecting to a network and sending/receiving data") such that the processor in the claims is explicitly recited in the instant specification as being a generic processor belonging to a general purpose computer. Examiner asserts that there is nothing in the claims that would limit the processor and memory to be interpreted as anything other than a generic computer for performing the claimed abstract idea. With respect to Applicant's arguments on p. 12 of the Remarks submitted 12/30/2025 that "the claimed processor is integrated with physical tools for receiving a neural network model [...]", Examiner respectfully disagrees. The claims do not recite "physical tools", nor would one of ordinary skill in the art interpret an API or GUI as physical tools, and similarly input/output devices could reasonably be interpreted as software, hardware, or something else altogether such that the claim is not limited to "physical tools". Examiner also notes that the act of receiving data has not been interpreted as an abstract idea but rather as insignificant extra-solution activity of gathering and outputting data, which is well-understood, routine, and conventional in the art and is not seen as integrating the judicial exception into a practical application (See MPEP 2106.05(g), MPEP 2106.05(d)(II)(i), and MPEP 2106.05(d)(II)(iv)). With respect to Applicant's arguments on pp.
12-13 of the Remarks submitted 12/30/2025 that the processor has been "programmed to execute a technical process" which "aids in adapting to a plurality of different and unique neural network architectures" and "provides a technical solution to a technical problem", Examiner notes again that the generic processor is seen as mere instructions to apply the judicial exception of evaluating weight initialization techniques to aid in adapting to a plurality of different and unique neural network architectures such that the judicial exception alone is seen as providing the implied technical improvement (See MPEP 2106.05(a) "It is important to note, the judicial exception alone cannot provide the improvement." MPEP 2106.05(a) also recites "An important consideration in determining whether a claim improves technology is the extent to which the claim covers a particular solution to a problem or a particular way to achieve a desired outcome, as opposed to merely claiming the idea of a solution or outcome." and finally, MPEP 2106.07(a)(II) "employing well-known computer functions to execute an abstract idea, even when limiting the use of the idea to one particular environment, does not integrate the exception into a practical application"). In other words Examiner wants to distinguish that the processor in the claims is not seen as being improved by the judicial exception but rather as merely a means to implement the judicial exception. For at least these reasons and those further detailed below Examiner asserts that it is reasonable and appropriate to maintain the rejection under 35 USC 101. Applicant’s arguments with respect to rejection of claims 1-25 under 35 U.S.C. 103 based on amendment have been considered. With respect to Applicant's arguments on pp. 18-19 of the Remarks submitted 12/30/2025 that Schilling fails to disclose ""building" a mean variance mapping table in the manner recited in amended claim 1", Examiner respectfully disagrees. 
Examiner notes that Schilling explicitly maps mean and variance to layers in the forward pass (See Eqn. 3.2) and tracks a running average (table) of said mean and variances. This is explicitly reinforced by newly introduced secondary reference Zhang who performs the same method and explicitly tracks the running mean and variance in a data structure table using processor executed instructions. Similarly, with respect to Applicant's arguments on pp. 18-19 of the Remarks submitted 12/30/2025 that Schilling does not disclose "in case of non-predefined layers that the mean and variance remain unchanged after propagation through said layer" and "in case a layer is not predefined then assuming that the mean and variance remains unchanged after propagation through said layer", Examiner notes that these are conditional, and nested conditional limitations, respectively, that do not apply to Schilling whose mean-variance mapping function and layer are both predefined. With respect to Applicant's arguments on p. 20 of the Remarks submitted 12/30/2025 that Schilling "fail to disclose receiving neural network models and input datasets via an integration interface configured with APIs or a GUI accessible via a user module", Examiner respectfully disagrees. Schilling is very explicit about receiving the neural network model (for example Algorithm 1 receives the model at every epoch and outputs the trained model) and training dataset ([p. 49] "The datasets used for the following experiments are arguably the most popular ones used for image classification tasks in the literature. Ordered by increasing difficulty to generalize, those are MNIST [39], SVHN [43], CIFAR10, and CIFAR100 [32]") and very explicit about the hardware and API used which clearly read on the claim limitation ([p. 49] "The datasets used for the following experiments are arguably the most popular ones used for image classification tasks in the literature. 
Ordered by increasing difficulty to generalize, those are MNIST [39], SVHN [43], CIFAR10, and CIFAR100 [32]" [p. 89] "The software framework used for conducting experiments is Torch [6], a scientific computing platform with wide support for machine learning algorithms that is used and maintained by various technology companies such as Google DeepMind and Facebook AI Research. At its core, Torch supplies a flexible n-dimensional array with many useful routines such as indexing, slicing and transposing. Moreover, Torch features an extremely fast scripting language and is based on LuaJIT, a powerful just-in-time compiler. LuaJIT is implemented in the C programming language and thus provides a clean interface to the GPU using NVIDIA’s CUDA libraries. CUDA is a parallel computing platform and application programming interface that enables general purpose GPU processing. In this research, we make heavy use of both the CUDA Toolkit 7.5 and cuDNN v4, a deep neural network library that extends the toolkit with useful operations such as highly optimized convolutions. An exciting piece of trivia is that NVIDIA advertises cuDNN v4 with the following slogan: “Train neural networks up to 14x faster with batch normalization” [...] Torch’s API bears significant resemblance to the pseudocode used in the back propagation algorithm (algorithm 1) since each module provides a forward and backward function that implements the mathematical notation introduced in sections 2.9.1, 2.9.2, 2.9.3, and 3" [p. 90] "The machine used for the experiments features a 3.4GHz Intel Core i7-2600K with 16GB of DDR3 RAM, and a NVIDIA Tesla K40c GPU with 2880 CUDA cores clocked at 745MHz and 12GB of RAM. The system is running Ubuntu 14.04.1 LTS with GNU/Linux kernel 3.19.0-51"). With respect to Applicant's arguments on pp. 21-22 of the Remarks submitted 12/30/2025 directed towards Glorot, these arguments are moot in view of a new ground of rejection set forth below. 
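Where the rejection reads Schilling's running averages of batch statistics (Eqn. 3.2) as a mean-variance "mapping table", that interpretation can be pictured as a per-layer table of exponentially weighted statistics. The following is a hypothetical sketch of that reading only; the function name, layer names, and momentum value are illustrative assumptions, not code from Schilling or the application:

```python
# Hypothetical sketch: per-layer running mean/variance kept in a
# table (dict), in the style of batch-normalization inference
# statistics.  Layer names and momentum are illustrative assumptions.

def update_stats_table(table, layer_name, batch, momentum=0.9):
    """Update the running mean/variance entry for one layer."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    if layer_name not in table:
        # First batch seen for this layer seeds the entry directly.
        table[layer_name] = {"mean": mean, "var": var}
    else:
        # Subsequent batches are blended in as an exponential average.
        entry = table[layer_name]
        entry["mean"] = momentum * entry["mean"] + (1 - momentum) * mean
        entry["var"] = momentum * entry["var"] + (1 - momentum) * var
    return table

table = {}
update_stats_table(table, "conv1", [1.0, 3.0])  # seeds mean 2.0, var 1.0
update_stats_table(table, "conv1", [2.0, 2.0])  # blends in mean 2.0, var 0.0
```

On this reading, each layer's entry plays the role of a "mean-variance mapping table" row: a stored statistic consulted at inference time rather than recomputed per batch.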
With respect to Applicant's arguments on pp. 21-22 of the Remarks submitted 12/30/2025 that "the process is not described as evaluating different weight initialization techniques for each layer based on a derived mapping function", Examiner notes that this argument is moot as the instant claims as amended are not limited to this interpretation. Rather, the instant claims submitted 12/30/2025 recite "evaluating, by the processor, a weight initialization technique for selecting an initial value of respective weight parameter (θ) associated with each layer" such that the claims only require a singular initialization technique to be evaluated, which is explicitly done in Schilling as acknowledged on pp. 21-22 of the Remarks submitted 12/30/2025. For at least these reasons and those further detailed below Examiner asserts that it is reasonable and appropriate to maintain the rejection under 35 USC 103.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claims 1, 13, and 25, "the mean-variance mapping function (g-layer) corresponding to a layer" lacks antecedent basis.
Claim 1 introduces "a plurality of mean-variance mapping functions (g-layer) corresponding to a plurality of layers" but does not limit how the plurality of mean-variance mapping functions corresponds (one-to-one, one-to-many, many-to-one, etc.), nor would it be clear to one of ordinary skill in the art which of the plurality of mean-variance mapping functions was "the mean-variance mapping function corresponding to a layer", where the relationship between "a layer" and "the plurality of layers" is also unclear. In the interest of further examination the claim limitation is interpreted as "a mean-variance mapping function (g-layer) corresponding to a layer". The remaining claims are rejected with respect to their dependence on the rejected claims.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-25 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1: Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1 Analysis: Claim 1 is directed to a method, which is a process, one of the statutory categories.

Step 2A Prong One Analysis: Claim 1, under its broadest reasonable interpretation, recites a series of mental processes.
For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following:

Building, […], a mean-variance mapping table, wherein a plurality of mean-variance mapping functions (g-layer) corresponding to a plurality of layers used in one or more neural network models are pre-defined, and wherein if the mean-variance mapping function (g-layer) corresponding to a layer is not pre-defined, then deriving the mean variance mapping function (g-layer) using data analytics based on user inputs and storing it in the mean-variance mapping table, and wherein if the layer is not predefined, then the mean and variance are assumed to remain unchanged after propagation through said layer (observation, evaluation, and judgement based on mathematical calculations and relationships);

Deriving, […], a mean-variance mapping function (g-layer) corresponding to respective layers of the neural network from the mean-variance mapping table (observation, evaluation, and judgement);

determining, […], association of a weight parameter (θ) with the respective layers of the neural network (observation, evaluation, and judgement); and

evaluating, […], a weight initialization technique for selecting an initial value of respective weight parameter (θ) associated with each layer determined to have associated weight parameter (θ) out of the plurality of layers based on the derived mean-variance mapping function (g-layer) corresponding to said each layer, such that a mean (μOut) and a variance (vout) of respective output signals of said each layer is zero and one, respectively, wherein the weight initialization technique is employed to improve deep learning of neural network models and adapt to a plurality of different and unique neural network architectures (observation, evaluation, and judgement based on mathematical calculations and relationships).

Therefore, claim 1 recites an abstract idea which is a
judicial exception.

Step 2A Prong Two Analysis: Claim 1 recites additional elements "by a processor executing program instructions stored in a memory". However, these additional features are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words "apply it" (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application. Claim 1 also recites additional elements "receiving, by the processor, a neural network comprising a plurality of layers and an input dataset from one or more client devices, one or more input/output devices and one or more external resources via an integration interface configured with one or more Application Programming Interfaces (APIs) or a Graphical User Interface (GUI) accessible via a user module", which amounts to gathering and outputting data, which is considered insignificant extra-solution activity (see MPEP 2106.05(g)). Therefore, claim 1 is directed to a judicial exception.

Step 2B Analysis: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component and insignificant extra-solution activity. The gathering and outputting of data is considered well-understood, routine, and conventional in the art (See MPEP 2106.05(d)(II)(i)). For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101.
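The evaluation step characterized above, selecting an initial weight value so that each layer's output mean and variance come out to zero and one, can be sketched analytically. This is a minimal illustration under common simplifying assumptions (a fully connected layer, zero-mean i.i.d. weights and inputs, a He-style variance rule of the kind cited from Schilling's Eqn. 2.12); the function names are hypothetical and the code is not taken from the application:

```python
import math

# Hypothetical sketch (not from the application or Schilling):
# for y_j = sum_i w_ji * x_i with zero-mean i.i.d. weights and
# zero-mean inputs, Var(y) = n_in * Var(w) * Var(x).  Choosing
# Var(w) = gain / n_in therefore maps unit input variance to unit
# output variance (gain = 2 under ReLU, the He rule).

def init_std(n_in, relu=False):
    """Weight standard deviation that preserves unit output variance."""
    gain = 2.0 if relu else 1.0
    return math.sqrt(gain / n_in)

def propagated_variance(n_in, w_std, x_var=1.0):
    """A g-layer-style mapping: input variance -> output variance."""
    return n_in * w_std ** 2 * x_var

# Evaluate the initialization for a 256-input layer: the analytically
# propagated output variance comes out to exactly one.
std = init_std(256)
out_var = propagated_variance(256, std)
```

The point of the sketch is that, once the mapping function is fixed, "evaluating" the initialization reduces to arithmetic on the propagated statistics, which is the characterization the Step 2A analysis above relies on.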
This rejection applies equally to independent claims 13 and 25, which recite a system and a computer program product, respectively, as well as to dependent claims 2-12 and 14-24. The additional limitations of the dependent claims are addressed briefly below: Dependent claims 2 and 15 recite additional observation, evaluation, and judgement “the derived mean-variance mapping functions (g-layer) mapped to corresponding layers of the neural network are stored in a mean-variance mapping table.” Dependent claims 3 and 16 recite additional observation, evaluation, and judgement “wherein the mean-variance mapping functions (g-layer) corresponding to the respective layers of the neural network are derived using data analytics based on any one of the following: a weight parameter associated with the respective layer, a type of said respective layer, an activation function associated with said respective layer or any combination thereof.” Dependent claims 4 and 17 recite additional observation, evaluation, and judgement “wherein the mean-variance mapping function (g-layer) corresponding to any layer of the neural network having no associated weight parameter (θ) is representative of a function derived to map a mean and a variance of input signal of the any layer with a mean and a variance of output signal after propagation through said any layer”. 
Dependent claims 5 and 18 recite additional observation, evaluation, and judgement “wherein the mean-variance mapping function (g-layer) corresponding to any layer of the neural network having an associated weight parameter (θ) is representative of a function derived to map a mean and a variance of an input signal of the any layer and the weight parameter (θ) associated with said any layer with a mean and a variance of an output signal after propagation through said any layer” Dependent claims 6 and 19 recite additional instructions to apply the judicial exception using generic computer components “the neural network is an untrained neural network” as well as additional observation, evaluation, and judgement “wherein each layer of the neural network is connected with its next layer or any subsequent layer of said neural network, such that an output of any layer (L) of said neural network is an input of its next layer (L+1) or a subsequent layer connected directly to said any layer (L), and a mean (μOut) and a variance (vout) of an output signal of said any layer (L) is a mean (μin) and a variance (vin) of an input signal of the next layer (L+1) or the subsequent layer” Dependent claims 7 and 20 recite additional observation, evaluation, and judgement “the step of determining association of the weight parameter (θ) with the respective layers of the neural network comprises: identifying a type of the layer based on analysis of the layer”, “determining association of weight parameter (θ) with the layer based on the identified type of the layer by accessing a predefined database, said predefined database comprising information associated with types of layers having weights and not having weights” Dependent claims 8 and 21 recite additional observation, evaluation, and judgement “wherein the evaluating of the weight initialization technique for selecting the initial value of the respective weight parameter (θ) associated with the each layer determined to have
associated weight parameter (θ) comprises: a. computing and incorporating a mean (μin) and a variance (vin) of an input signal of a layer (L) out of the each layer determined to have associated weight parameter (θ) in the derived mean-variance mapping function (g-layer) corresponding to the layer (L), wherein the derived mean-variance mapping function maps the mean (μin) and the variance (vin) and the weight parameter (θ) associated with said layer (L) with a mean (μOut) and a variance (vout) of an output signal after propagation through said layer (L) b. evaluating a weight distribution for the weight parameter (θ) associated with the layer (L) and ascertaining a sampling range for said weight parameter (θ); c. selecting the initial value of the weight parameter (θ) from the ascertained sampling range such that the mean (μOut) and variance (vout) of the output signal of said layer (L) is zero and one, respectively on incorporating the selected initial value in said derived mean-variance mapping function (g-layer); and d. 
repeating a-c for the each layer determined to have associated weight parameter (θ)” Dependent claim 9 recites additional observation, evaluation, and judgement “wherein the mean (μin) and the variance (vin) of the input signal of the layer (L) is same as a mean (μOut) and a variance (vout) of an output signal of any preceding layer of the neural network directly providing input to said layer (L)” Dependent claims 10 and 22 recite additional observation, evaluation, and judgement “wherein the mean (μOut) and the variance (vout) of the output signal of the any preceding layer is computed using a mean-variance mapping function (g-layer) corresponding to said any preceding layer if no weight parameter (θ) is associated with said any preceding layer; or the mean (μOut) and the variance (vout) of the output signal of the any preceding layer is zero and one respectively, if said any preceding layer has an associated weight parameter (θ)” Dependent claims 11 and 23 recite additional observation, evaluation, and judgement “wherein a mean and a variance of an output signal of any layer of the neural network having no associated weight parameter (θ) is computed using the derived mean-variance mapping function (g-layer) corresponding to said any layer by: computing a mean and a variance of the input signal of said any layer, wherein the mean and the variance of the input signal of said any layer is same as a mean and a variance of an output signal of any preceding layer of the neural network directly providing input to said any layer and incorporating the computed mean and the variance of the input signal of said any layer in the derived mean-variance mapping function (g-layer) to compute the mean and variance of the output signal after propagating through said any layer” Dependent claims 12 and 24 recite additional observation, evaluation, and judgement “wherein the mean and the variance of the input signal of the layer (L) is computed by aggregation of input data if the
layer (L) is an input layer” Dependent claim 14 recites additional instructions to apply the judicial exception using generic computer components “interface unit configured to facilitate user interaction” as well as additional insignificant extra-solution activity of gathering and outputting data (see MPEP 2106.05(g)) “receive the neural network model” which is well-understood, routine, and conventional in the art (See MPEP 2106.05(d)(II)(i)). Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1-25 are rejected under 35 U.S.C. § 101.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3.
Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3-6, 8-13, 16-19, and 21-25 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Schilling (“The Effect of Batch Normalization on Deep Convolutional Neural Networks”, 2016) and Glorot (“Understanding the difficulty of training deep feedforward neural networks”, 2010).

Regarding claim 1, Schilling teaches A method for evaluating weight initialization technique for individual layers of neural network models to analytically preserve mean and variance across layers, ([p. 22 §2.4.3] "This method is the de-facto standard of weight initialization when using rectified nonlinearities and can be formulated as [See Eqn. 2.12] where n_in denotes the number of inputs of the current layer" [p. 74 §7] "Comparison of weight initializations [...] It is immediate from this visualization that high variance weight initialization performs strictly worse than the baseline or low variance initialization. Again, this result is in accordance to the properties of batch normalization, namely that smaller weights lead to larger gradients and vice versa. Interestingly, the low variance initialization actually seems to improve batch normalized models by a very slight margin. This result calls for further investigation whether increasingly smaller weights lead to favorable convergence properties. The magnitude of the gradients in case of high variance weight initialization are simply not sufficient to reach favorable parameter states fast enough" See also FIG. 7.11) wherein the method is implemented by a processor executing program instructions stored in a memory, the method comprising: ([p. 66 §7.1.5] "We select the batch sizes in powers of two as in B = {2^5, ..., 2^10}.
These batch sizes are chosen because we see an exponential rise in training time for batch sizes B < 2^5 and a batch size of B > 2^10 reaches the memory limit of our GPU for the CNN model") building, by the processor, a mean-variance mapping table, wherein a plurality of mean-variance mapping functions (g-layer) corresponding to a plurality of layers used in one or more neural network models are pre-defined, and wherein if the mean-variance mapping function (g-layer) corresponding to a layer is not pre-defined, then deriving the mean variance mapping function (g-layer) using data analytics based on user inputs and storing it in the mean-variance mapping table, and wherein if the layer is not predefined, then the mean and variance are assumed to remain unchanged after propagation through said layer; ([p. 37] "At test time, the batch mean and variance are replaced by the respective population statistics since the input does not depend on other samples from a mini-batch. Another popular method is to keep running averages of the batch statistics during training and to use these to compute the network output at test time. At test time, the batch normalization transform can be expressed as [Eqn. 3.2]" [p. 47 §5.5] "the batch size is usually chosen with regards to how much data can fit in memory" the mean-variance mapping function corresponding to a layer is interpreted as explicitly pre-defined in Schilling. Schilling describes that during the forward pass the system computes the batch mean/variance and applies the transform. Separately, Schilling explicitly discusses the practical constraint that batch size is chosen by how much data can fit in memory and be processed in parallel, the respective batch sizes being stored in memory such that the memory could reasonably be interpreted as a mapping table. Alternatively, a running average is routinely represented as a table such that the running average computed by Eqn.
3.2 comprising each respective mean-variance mapping function is interpreted as a mean variance mapping table corresponding to a plurality of layers (l), each function being pre-defined) receiving, by the processor, a neural network comprising a plurality of layers and an input dataset from one or more client devices, one or more input/output devices and one or more external resources via an integration interface configured with one or more Application Programming Interfaces (APIs) or a Graphical User Interface (GUI) accessible via a user module([p. 49] "The datasets used for the following experiments are arguably the most popular ones used for image classification tasks in the literature. Ordered by increasing difficulty to generalize, those are MNIST [39], SVHN [43], CIFAR10, and CIFAR100 [32]" [p. 89] "The software framework used for conducting experiments is Torch [6], a scientific computing platform with wide support for machine learning algorithms that is used and maintained by various technology companies such as Google DeepMind and Facebook AI Research. At its core, Torch supplies a flexible n-dimensional array with many useful routines such as indexing, slicing and transposing. Moreover, Torch features an extremely fast scripting language and is based on LuaJIT, a powerful just-in-time compiler. LuaJIT is implemented in the C programming language and thus provides a clean interface to the GPU using NVIDIA’s CUDA libraries. CUDA is a parallel computing platform and application programming interface that enables general purpose GPU processing. In this research, we make heavy use of both the CUDA Toolkit 7.5 and cuDNN v4, a deep neural network library that extends the toolkit with useful operations such as highly optimized convolutions. An exciting piece of trivia is that NVIDIA advertises cuDNN v4 with the following slogan: “Train neural networks up to 14x faster with batch normalization” [...] 
Torch’s API bears significant resemblance to the pseudocode used in the back propagation algorithm (algorithm 1) since each module provides a forward and backward function that implements the mathematical notation introduced in sections 2.9.1, 2.9.2, 2.9.3, and 3" [p. 90] "The machine used for the experiments features a 3.4GHz Intel Core i7-2600K with 16GB of DDR3 RAM, and a NVIDIA Tesla K40c GPU with 2880 CUDA cores clocked at 745MHz and 12GB of RAM. The system is running Ubuntu 14.04.1 LTS with GNU/Linux kernel 3.19.0-51" The machine in the experiment is interpreted as a client device comprising multiple input/output devices (CPU/GPU/RAM) and obtaining neural network and training dataset using external resources (MNIST, SVHN, CIFAR10, CIFAR100) via integration interface (CUDA, Torch, cuDNN, etc.) with one or more APIs and/or a GUI) deriving, by the processor, a mean-variance mapping function (g-layer) corresponding to respective layers of a neural network([p. 35 §3] "Batch normalization can be seen as yet another layer that can be inserted into the model architecture […] we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" [p. 78 §7.2] "in the convolutional case, we normalize each feature map over the current mini-batch and learn the scale and shift parameters per feature map, rather than per activation" Batch normalization is interpreted as synonymous with a g-layer function corresponding to respective layers (feature maps are outputs of respective convolutional layers). Intermediate layer input signal is an output signal of a previous layer normalized by batch normalization. Variance is standard deviation squared (1^2=1).) from a mean-variance mapping table, ([p. 37] "At test time, the batch mean and variance are replaced by the respective population statistics since the input does not depend on other samples from a mini-batch. 
Another popular method is to keep running averages of the batch statistics during training and to use these to compute the network output at test time. At test time, the batch normalization transform can be expressed as [Eqn. 3.2]" [p. 47 §5.5] "the batch size is usually chosen with regards to how much data can fit in memory" the mean-variance mapping function corresponding to a layer is interpreted as explicitly pre-defined in Schilling. Schilling describes that during the forward pass the system computes the batch mean/variance and applies the transform. Separately, Schilling explicitly discusses the practical constraint that batch size is chosen by how much data can fit in memory and be processed in parallel, the respective batch sizes being stored in memory such that the memory could reasonably be interpreted as a mapping table. Alternatively, a running average is routinely represented as a table such that the running average computed by Eqn. 3.2 comprising each respective mean-variance mapping function is interpreted as a mean variance mapping table corresponding to a plurality of layers (l), each function being pre-defined) determining, by the processor, association of a weight parameter (θ) with the respective layers of the neural network; and evaluating, by the processor, a weight initialization technique for selecting an initial value of respective weight parameter (θ) associated with each layer determined to have associated weight parameter (θ) out of the plurality of layers ([p. 22 §2.4.3] "This method is the de-facto standard of weight initialization when using rectified nonlinearities and can be formulated as [See Eqn. 2.12] where n_in denotes the number of inputs of the current layer") based on the derived mean-variance mapping function (g-layer) corresponding to said each layer, such that a mean (μOut) and a variance (vout) of respective output signals of said each layer is zero and one, respectively.([p. 
35 §3] "Batch normalization can be seen as yet another layer that can be inserted into the model architecture […] we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" [p. 49 §3.5] "After applying batch normalization, all the intermediate layers will have restored their zero mean and unit variance property" See also Equation 3 where mean is a function of (l-1) (the previous layer output). See also FIG. 2.8 on p. 21). However, Schilling does not explicitly teach wherein the weight initialization technique is employed to improve deep learning of neural network models and adapt to a plurality of different and unique neural network architectures. Glorot, in the same field of endeavor, teaches wherein the weight initialization technique is employed to improve deep learning of neural network models and adapt to a plurality of different and unique neural network architectures([p. 249 Abstract] "We provide a new initialization scheme that brings substantially faster convergence"). Schilling as well as Glorot are directed towards neural network optimization. Therefore, Schilling as well as Glorot are reasonably pertinent analogous art. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Schilling with the teachings of Glorot by using the weight initialization in Glorot. Schilling explicitly cites Glorot for the weight initialization and Glorot provides as additional motivation for combination ([p. 249 Abstract] "We provide a new initialization scheme that brings substantially faster convergence"). This motivation for combination also applies to the remaining claims which depend on this combination. 
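The two mechanisms the rejection combines, Schilling's batch normalization with running statistics (Eqn. 3.2) and the He-style initialization of Eqn. 2.12, can be illustrated with a brief sketch. This is a minimal, hypothetical rendering for the reader, not code from the application or the cited references; all names and the momentum constant are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He initialization (cf. Schilling Eqn. 2.12): weights drawn from
    N(0, 2/n_in), the de-facto standard for rectified nonlinearities."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

class BatchNorm1D:
    """Batch normalization with running statistics. During training the
    batch mean/variance are used; at test time the running averages stand
    in for the population statistics (cf. Schilling Eqn. 3.2)."""
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)    # learnable scale
        self.beta = np.zeros(n_features)    # learnable shift
        self.run_mean = np.zeros(n_features)
        self.run_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mean = self.momentum * self.run_mean + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:
            mu, var = self.run_mean, self.run_var
        # Normalize to zero mean and unit variance, then scale and shift.
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```

Calling `forward(x, training=False)` substitutes the stored running averages for the batch statistics, which is the Eqn. 3.2 test-time behavior the rejection points to when reading the running averages as a "mapping table."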
Regarding claim 3, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the mean-variance mapping functions (g-layer) corresponding to the respective layers of the neural network are derived using data analytics based on any one of the following: a weight parameter associated with the respective layer, a type of said respective layer, an activation function associated with said respective layer or any combination thereof.(Schilling [p. 22 §2.4.3] "This method is the de-facto standard of weight initialization when using rectified nonlinearities and can be formulated as [See Eqn. 2.12] where n_in denotes the number of inputs of the current layer" [p. 74 §7] "Comparison of weight initializations [...] It is immediate from this visualization that high variance weight initialization performs strictly worse than the baseline or low variance initialization. Again, this result is in accordance to the properties of batch normalization, namely that smaller weights lead to larger gradients and vice versa. Interestingly, the low variance initialization actually seems to improve batch normalized models by a very slight margin. This result calls for further investigation whether increasingly smaller weights lead to favorable convergence properties. The magnitude of the gradients in case of high variance weight initialization are simply not sufficient to reach favorable parameter states fast enough"). Regarding claim 4, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the mean-variance mapping function (g-layer) corresponding to any layer of the neural network having no associated weight parameter (θ) is representative of a function derived to map a mean and a variance of input signal of the any layer with a mean and a variance of output signal after propagation through said any layer.(Schilling [pp. 
24-25 §2.9.3] "It is common practice to periodically insert a pooling layer in between successive convolutional layers [...] Since the pooling layer does not have any learnable parameters, the backward pass is merely an upsampling operation of the upstream derivatives. In case of the max-pooling operation, it is common practice to keep track of the index of the maximum activation so that the gradient can be routed towards its origin during backpropagation" Schilling explicitly teaches that the pooling layers have no associated weight parameters.). Regarding claim 5, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the mean-variance mapping function (g-layer) corresponding to any layer of the neural network having an associated weight parameter (θ) is representative of a function derived to map a mean and a variance of an input signal of the any layer and the weight parameter (θ) associated with said any layer with a mean and a variance of an output signal after propagation through said any layer.(Schilling [p. 35 §3] "Batch normalization can be seen as yet another layer that can be inserted into the model architecture […] we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" [p. 49 §3.5] "After applying batch normalization, all the intermediate layers will have restored their zero mean and unit variance property" See also Equation 3 where mean is a function of (l-1) (the previous layer output). See also FIG. 2.8 on p. 21). 
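Claims 4 and 5 distinguish mapping functions for layers without and with an associated weight parameter. A standard mean-variance propagation result for a fully connected layer makes the distinction concrete; the sketch below is illustrative only (the application's actual g-layer functions are not of record, and the linear-layer formula is a textbook derivation assumed here for the weighted case).

```python
def g_linear(mu_in, v_in, n_in, w_var):
    """Analytic mean-variance map for a fully connected layer y = W x with
    zero-mean i.i.d. weights of variance w_var (assumed form of a g-layer
    for a layer WITH a weight parameter, per claim 5):
    E[y] = 0 and Var[y] = n_in * w_var * (v_in + mu_in**2)."""
    return 0.0, n_in * w_var * (v_in + mu_in ** 2)

def g_identity(mu_in, v_in):
    """Fallback for a layer with no pre-defined mapping function: the mean
    and variance are assumed to remain unchanged after propagation
    (claim 1's last wherein clause)."""
    return mu_in, v_in

# With He-scaled weights (w_var = 2/n_in) and a zero-mean, unit-variance
# input, the pre-activation variance comes out to 2 before the rectifier
# halves it:
print(g_linear(0.0, 1.0, 256, 2.0 / 256))   # (0.0, 2.0)
```

This is the variance-propagation argument that makes He initialization "the de-facto standard ... when using rectified nonlinearities" in the quoted §2.4.3.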
Regarding claim 6, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the neural network is an untrained neural network, further wherein each layer of the neural network is connected with its next layer or any subsequent layer of said neural network, such that an output of any layer (L) of said neural network is an input of its next layer (L+1) or a subsequent layer connected directly to said any layer (L), and a mean (μOut) and a variance (vout) of an output signal of said any layer (L) is a mean (μin) and a variance (vin) of an input signal of the next layer (L+1) or the subsequent layer. (Schilling [p. 35 §3] "Batch normalization can be seen as yet another layer that can be inserted into the model architecture […] we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" [p. 49 §3.5] "After applying batch normalization, all the intermediate layers will have restored their zero mean and unit variance property" See also Equation 3 where mean is a function of (l-1) (the previous layer output). See also FIG. 2.8 on p. 21). Regarding claim 8, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the evaluating of the weight initialization technique for selecting the initial value of the respective weight parameter (θ) associated with the each layer determined to have associated weight parameter (θ) comprises: a. 
computing and incorporating a mean (μin) and a variance (vin) of an input signal of a layer (L) out of the each layer determined to have associated weight parameter (θ) in the derived mean-variance mapping function (g-layer) corresponding to the layer (L), wherein the derived mean-variance mapping function maps the mean (μin) and the variance (vin) and the weight parameter (θ) associated with said layer (L) with a mean (μOut) and a variance (vout) of an output signal after propagation through said layer (L);(Schilling [p. 35 §3] "Batch normalization can be seen as yet another layer that can be inserted into the model architecture […] we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" [p. 49 §3.5] "After applying batch normalization, all the intermediate layers will have restored their zero mean and unit variance property" See also Equation 3 where mean is a function of (l-1) (the previous layer output). See also FIG. 2.8 on p. 21) b. evaluating a weight distribution for the weight parameter (θ) associated with the layer (L) and ascertaining a sampling range for said weight parameter (θ); c. selecting the initial value of the weight parameter (θ) from the ascertained sampling range such that the mean (μOut) and variance (vout) of the output signal of said layer (L) is zero and one, respectively on incorporating the selected initial value in said derived mean-variance mapping function (g-layer); and d. repeating a-c for the each layer determined to have associated weight parameter (θ). (Schilling [p. 49 §3.5] "Batch normalization alleviates some of the hopelessness by assuming that if x is drawn from a unit normal distribution, all the downstream layers will be normally distributed because the intermediate transformations are linear" [p. 52] "The goal of batch normalization is to achieve a stable distribution of activation values throughout training" [p. 
56 §5.4] "Weight Initialization [...] To assess how the scale of the weight initialization affects the training behavior, we train both the vanilla and batch normalized network with initial values drawn from distributions with different variances"). Regarding claim 9, the combination of Schilling and Glorot teaches The method as claimed in claim 8, wherein the mean (μin) and the variance (vin) of the input signal of the layer (L) is same as a mean (μOut) and a variance (vout) of an output signal of any preceding layer of the neural network directly providing input to said layer (L).(Schilling [p. 78 §7.2] "in the convolutional case, we normalize each feature map over the current mini-batch and learn the scale and shift parameters per feature map, rather than per activation" [p. 38 §3] "we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" Convolutional layers have weight parameters; each feature map is both an output signal of one layer and an input signal to the following layer.). Regarding claim 10, the combination of Schilling and Glorot teaches The method as claimed in claim 9, wherein the mean (μOut) and the variance (vout) of the output signal of the any preceding layer is computed using a mean-variance mapping function (g-layer) corresponding to said any preceding layer if no weight parameter (θ) is associated with said any preceding layer; or the mean (μOut) and the variance (vout) of the output signal of the any preceding layer is zero and one respectively, if said any preceding layer has an associated weight parameter (θ). (Schilling [p. 78 §7.2] "in the convolutional case, we normalize each feature map over the current mini-batch and learn the scale and shift parameters per feature map, rather than per activation" [p. 38 §3] "we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one"). 
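The steps a-d of claim 8, together with the statistics hand-off of claims 9 and 10, amount to a layer-by-layer loop: propagate (mean, variance) forward, and at each weighted layer choose an initial weight scale so the mapped output statistics come out as (0, 1). The sketch below is one hypothetical reading of that loop; the dictionary layout, layer names, and the linear-layer variance formula are all assumptions, not the application's implementation.

```python
def initialize_network(layers, mu, v):
    """Illustrative sketch of the claim 8 loop. For each layer with a
    weight parameter, solve for an initial weight variance so that the
    mapped output statistics are (0, 1) (steps a-c); for weightless
    layers, just propagate the statistics through the layer's g-function
    (claims 10-11); repeat for every weighted layer (step d)."""
    chosen = {}
    for name, layer in layers:
        if layer["has_weights"]:
            n_in = layer["n_in"]
            # Assumed linear-layer map: Var_out = n_in * w_var * (v + mu^2).
            # Setting Var_out = 1 and solving for w_var gives step c.
            w_var = 1.0 / (n_in * (v + mu ** 2))
            chosen[name] = w_var
            mu, v = 0.0, 1.0          # output statistics after this layer
        else:
            # No weight parameter: apply the layer's mean-variance map
            # (identity here, per claim 1's fallback assumption).
            mu, v = layer["g"](mu, v)
    return chosen, (mu, v)

layers = [
    ("fc1",  {"has_weights": True,  "n_in": 784}),
    ("pool", {"has_weights": False, "g": lambda m, s: (m, s)}),
    ("fc2",  {"has_weights": True,  "n_in": 256}),
]
scales, final_stats = initialize_network(layers, mu=0.5, v=2.0)
```

Per claim 9, the (mu, v) carried between iterations is exactly the preceding layer's output statistics serving as the next layer's input statistics.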
Regarding claim 11, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein a mean and a variance of an output signal of any layer of the neural network having no associated weight parameter (θ) is computed using the derived mean-variance mapping function (g-layer) corresponding to said any layer by: computing a mean and a variance of the input signal of said any layer, wherein the mean and the variance of the input signal of said any layer is same as a mean and a variance of an output signal of any preceding layer of the neural network directly providing input to said any layer; and(Schilling [p. 78 §7.2] "in the convolutional case, we normalize each feature map over the current mini-batch and learn the scale and shift parameters per feature map, rather than per activation" [p. 38 §3] "we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one" See also Table 6.3 which shows that each convolutional layer is followed by a pooling layer which has no associated weight parameter ([pp. 24-25 §2.9.3] "the pooling layer does not have any learnable parameters").) incorporating the computed mean and the variance of the input signal of said any layer in the derived mean-variance mapping function (g-layer) to compute the mean and variance of the output signal after propagating through said any layer.(Schilling See also Table 6.3 which shows that each convolutional layer is followed by a pooling layer which has no associated weight parameter. Convolutional layer having output feature map with zero mean and unit variance subsequent to max pooling is interpreted as synonymous with computing the mean and variance of the output signal after propagating through said any layer.). 
Regarding claim 12, the combination of Schilling and Glorot teaches The method as claimed in claim 8, wherein the mean and the variance of the input signal of the layer (L) is computed by aggregation of input data if the layer (L) is an input layer.(Schilling [p. 17 §2.2] "the network’s input X consists of a variety of grayscale or color images and their corresponding ground truth labels y. An image Xi can be viewed as a c × h × w tensor of numbers with c color channels and corresponding height h and width w [...] it is common to split the entire dataset into three subsets called the training set, the test set, and the validation set. Figure 6.1 provides a schematic overview of these dataset splits" [p. 23 §2.6.1] "Batch gradient descent computes the gradient of the loss function for the entire training dataset x" See FIG. 2 which shows how input data X is aggregated for input layer.). Regarding claims 13, 16-19, and 21-24, claims 13, 16-19, and 21-24 are directed towards a system for performing the method of claims 1, 3-6, 8, and 10-12, respectively. Therefore, the rejections applied to claims 1, 3-6, 8, and 10-12 also apply to claims 13, 16-19, and 21-24. Regarding claim 25, claim 25 is directed towards a computer program product for performing the method of claim 1. Therefore, the rejection applied to claim 1 also applies to claim 25. Claims 2 and 15 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Schilling and Glorot and Wang (“Look-up Table Unit Activation Function for Deep Convolutional Neural Networks”, 2018). Regarding claim 2, the combination of Schilling and Glorot teaches The method as claimed in claim 1. However, the combination of Schilling and Glorot doesn't explicitly teach wherein the derived mean-variance mapping functions (g-layer) mapped to corresponding layers of the neural network are stored in a mean-variance mapping table. 
Wang, in the same field of endeavor, teaches the derived mean-variance mapping functions (g-layer) mapped to corresponding layers of the neural network are stored in a mean-variance mapping table. ([p. 1226 §3.1] "We introduce a novel activation function that we name Look-up Table Unit (LuTU). The function is controlled by a look-up table" [p. 1227 §3.1.1] "following batch normalization layer, the activation function has a fixed input distribution (zero mean and unit variance)"). The combination of Schilling and Glorot as well as Wang are directed towards convolutional neural networks with batch normalization. Therefore, the combination of Schilling and Glorot as well as Wang are reasonably pertinent analogous art. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Schilling and Glorot with the teachings of Wang by using a batch normalized lookup table activation function as at least one of the activation functions in Schilling. Wang provides as additional motivation for combination ([p. 1232 §5] "we introduced a novel activation function that is highly flexible and learned with network training. We visualized how this method can learn complex distributions more effectively than the ReLU function, and experimentally verified that it is able to improve the performance of deep CNN models."). This motivation for combination also applies to the remaining claims which depend on this combination. Regarding claim 15, claim 15 is directed towards a system for performing the method of claim 2. Therefore, the rejection applied to claim 2 also applies to claim 15. Claims 7 and 20 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Schilling and Glorot and in further view of Amiri (US 20230022401 A1). 
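The mean-variance mapping table of claim 2 can be pictured as a registry of pre-defined g-layer functions keyed by layer type, with a derive-and-store fallback for types not yet in the table. The sketch below is hypothetical (the fallback shown is simply claim 1's unchanged-statistics assumption, not a real data-analytics derivation, and the entries are illustrative):

```python
# Pre-defined g-layer functions keyed by layer type: each maps input
# statistics (and, for weighted layers, the weight variance) to output
# statistics. Entries here are illustrative assumptions.
mapping_table = {
    "linear":    lambda mu, v, n_in, w_var: (0.0, n_in * w_var * (v + mu ** 2)),
    "batchnorm": lambda mu, v, *_: (0.0, 1.0),   # restores zero mean, unit variance
}

def get_g_layer(layer_type):
    """Return the pre-defined g-layer; if none exists, derive one (here:
    assume unchanged statistics) and store it back in the table."""
    if layer_type not in mapping_table:
        mapping_table[layer_type] = lambda mu, v, *_: (mu, v)
    return mapping_table[layer_type]

g_pool = get_g_layer("maxpool")      # not pre-defined: identity fallback
print(g_pool(0.3, 1.5))              # (0.3, 1.5)
```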
Regarding claim 7, the combination of Schilling and Glorot teaches The method as claimed in claim 1, wherein the step of determining association of the weight parameter (θ) with the respective layers of the neural network comprises: identifying a type of the layer based on analysis of the layer; (Schilling [pp. 24-25 §2.9.3] "It is common practice to periodically insert a pooling layer in between successive convolutional layers [...] Since the pooling layer does not have any learnable parameters, the backward pass is merely an upsampling operation of the upstream derivatives. In case of the max-pooling operation, it is common practice to keep track of the index of the maximum activation so that the gradient can be routed towards its origin during backpropagation" FIG. 27 shows a neural network architecture with determined types having determined and known weight parameter associations. §2.9.1 and 2.9.2 describe layers having weight parameters. 2.9.3 describes layers not having weight parameters.). However, the combination of Schilling and Glorot doesn't explicitly teach and determining association of weight parameter (θ) with the layer based on the identified type of the layer by accessing a predefined database, said predefined database comprising information associated with types of layers having weights and not having weights. Amiri, in the same field of endeavor, teaches and determining association of weight parameter (θ) with the layer based on the identified type of the layer by accessing a predefined database, said predefined database comprising information associated with types of layers having weights and not having weights.([¶0040] "corresponding parameters and weights are stored in the model database 214 to be used in the prediction phase. When the best forecasters (denoted by ft in FIG. 2) are determined, time series data and the forecaster are provided to TS classifier 210 for training the TS classifier 210. 
The best forecaster need not be a different forecaster and may be the same forecaster with different parameters or hyper-parameters e.g., parameters may be the trainable weights of a DNN, or trainable parameters of a linear regression model, while the hyper-parameters may be describing the number of hidden layers, type of activation functions"). The combination of Schilling and Glorot as well as Amiri are directed towards neural network processing. Therefore, the combination of Schilling and Glorot as well as Amiri are reasonably pertinent analogous art. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Schilling and Glorot with the teachings of Amiri by using a database comprising information associated with types of layers having weights and not having weights. Amiri teaches that the database is used to determine a best forecaster for the particular application and further provides as motivation for combination ([¶0020] “The forecaster of various embodiments of the disclosure offer, in some cases, over 40% improvement over conventional forecasters because the best forecaster for the time series is selected based on the output of a time series classifier.”). Regarding claim 20, claim 20 is directed towards a system for performing the method of claim 7. Therefore, the rejection applied to claim 7 also applies to claim 20. Claim 14 is rejected under 35 U.S.C. §103 as being unpatentable over the combination of Schilling and Glorot and Lidman (US 20200379923 A1). Regarding claim 14, the combination of Schilling and Glorot teaches The system as claimed in claim 13. However, the combination of Schilling and Glorot doesn't explicitly teach wherein the weight initialization engine comprises an interface unit executed by the processor, said interface unit configured to facilitate user interaction, and receive the neural network model. 
Lidman, in the same field of endeavor, teaches The system as claimed in claim 13, wherein the weight initialization engine comprises an interface unit executed by the processor, said interface unit configured to facilitate user interaction, and receive the neural network model. ([¶0024] "the user device 110 may receive the neural network models 102 (including updated neural network models) from the deep learning environment 101 at any time."). The combination of Schilling and Glorot as well as Lidman are directed towards neural network processing. Therefore, the combination of Schilling and Glorot as well as Lidman are reasonably pertinent analogous art. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Schilling and Glorot with the teachings of Lidman by receiving the neural network model through a user interface. Lidman provides as additional motivation for combination ([¶0021] “the user device 110 may use the neural network models 102 to generate inferences about a user and/or content the user is viewing or listening to”). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Klambauer (“Self-Normalizing Neural Networks”, 2017) is directed towards a method of mapping mean and variance from one layer to the next and building a mean variance mapping table. Krahenbuhl (“DATA-DEPENDENT INITIALIZATIONS OF CONVOLUTIONAL NEURAL NETWORKS”, 2016) is directed towards a layer-wise initialization method involving mapping sample mean and variance per layer. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124

Prosecution Timeline

Jul 01, 2022
Application Filed
Jul 05, 2025
Non-Final Rejection — §101, §103, §112
Oct 06, 2025
Response Filed
Oct 23, 2025
Final Rejection — §101, §103, §112
Dec 30, 2025
Request for Continued Examination
Jan 16, 2026
Response after Non-Final Action
Feb 27, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561604
SYSTEM AND METHOD FOR ITERATIVE DATA CLUSTERING USING MACHINE LEARNING
2y 5m to grant Granted Feb 24, 2026
Patent 12547878
Highly Efficient Convolutional Neural Networks
2y 5m to grant Granted Feb 10, 2026
Patent 12536426
Smooth Continuous Piecewise Constructed Activation Functions
2y 5m to grant Granted Jan 27, 2026
Patent 12518143
FEEDFORWARD GENERATIVE NEURAL NETWORKS
2y 5m to grant Granted Jan 06, 2026
Patent 12505340
STASH BALANCING IN MODEL PARALLELISM
2y 5m to grant Granted Dec 23, 2025


Prosecution Projections

3-4
Expected OA Rounds
52%
Grant Probability
90%
With Interview (+38.2%)
4y 7m
Median Time to Grant
High
PTA Risk
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
