DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the Preliminary Amendment filed on 5/6/2024. Claims 1-20 are pending in the case. Claims 1, 10, and 19-20 are independent claims.
Claim Rejections - 35 U.S.C. § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. § 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-2, 8-11, and 17-20 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Agostinelli et al. (“Learning Activation Functions to Improve Deep Neural Networks,” 21 April 2015, https://arxiv.org/abs/1412.6830).
As to independent claim 1, Agostinelli discloses a neural network model training method, comprising:
obtaining training data (“The CIFAR-10 and CIFAR-100 datasets,” page 4 section “3.1 CIFAR” line 1);
training a neural network model based on the training data (“The networks were trained 5 times using different random initializations,” page 5 paragraph 2 line 3), wherein an activation function of the neural network model comprises at least one piecewise function (“Here we define the adaptive piecewise linear (APL) activation unit. Our method formulates the activation function h_i(x) of an APL unit i as a sum of hinge-shaped functions, h_i(x) = max(0, x) + Σ_{s=1}^{S} a_i^s max(0, −x + b_i^s),” page 2 section “2 Adaptive Piecewise Linear Units” lines 1-3), and the piecewise function comprises a plurality of trainable parameters (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7); and
updating the plurality of trainable parameters of the at least one piecewise function in a process of training the neural network model, to obtain a target neural network model (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
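For illustration only, and forming no part of the grounds of rejection, the quoted APL formulation may be sketched in NumPy as follows; the function and variable names are the examiner's own shorthand for the quoted symbols h_i(x), a_i^s, and b_i^s:

```python
import numpy as np

def apl_unit(x, a, b):
    """Adaptive piecewise linear (APL) activation for one unit i.

    Implements h_i(x) = max(0, x) + sum_{s=1}^{S} a[s] * max(0, -x + b[s]),
    where a[s] (segment slopes) and b[s] (hinge locations) are the trainable
    parameters learned by gradient descent (Agostinelli, section 2).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Each hinge term a[s] * max(0, -x + b[s]) is computed by broadcasting
    # over the S hinges, then summed.
    hinges = a[:, None] * np.maximum(0.0, -x + b[:, None])
    return np.maximum(0.0, x) + hinges.sum(axis=0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
# With all a[s] = 0 the unit reduces to the standard ReLU max(0, x).
print(apl_unit(x, a=[0.0, 0.0], b=[1.0, -1.0]))  # the ReLU values 0, 0, 0, 1.5
# A nonzero slope adds a second linear piece on the negative side.
print(apl_unit(x, a=[0.5], b=[0.0]))
```

As the quoted passage states, S is a hyperparameter fixed in advance, while a and b would be updated by the training loop.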
As to dependent claim 2, Agostinelli further discloses a method wherein the at least one piecewise function is a piecewise linear function, and parameters of the at least one piecewise function comprise one or more of the following: a quantity of boundary points, a right boundary, a left boundary, a slope of a range with a maximum domain, a slope of a range with a minimum domain, or a function value corresponding to a boundary point (“The number of hinges, S, is a hyperparameter set in advance, while the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 4-7).
As to dependent claim 8, Agostinelli further discloses a method wherein the plurality of trainable parameters of the at least one piecewise function comprise: the right boundary, the left boundary, the slope of the range with the maximum domain, the slope of the range with the minimum domain, or the function value corresponding to the boundary point (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
As to dependent claim 9, Agostinelli further discloses a method wherein a quantity of segments of the piecewise function is any value from 6 to 18 (“S = 10,” page 6 section “3.3 Effects of APL Unit Hyperparameters” Table 3 line 7; “Any continuous piecewise-linear function g(x) can be expressed by Equation 1 for some S,” page 2 section “2 Adaptive Piecewise Linear Units” theorem 1 lines 1-3).
As to independent claim 10, Agostinelli discloses a data processing method, comprising:
obtaining to-be-processed data, wherein the data comprises image data, voice data, or text data (“The CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009) are 32x32 color images that have 10 and 100 classes, respectively,” page 4 section “3.1 CIFAR” lines 1-2); and
processing the to-be-processed data by using a target neural network model, to obtain a processing result of the to-be-processed data (“For testing we just take the center 32 x 32 image. To the best of our knowledge, the results we report for data augmentation using the network-in-network architecture are the best results reported for CIFAR-10 and CIFAR-100 for any method,” page 4 section “3.1 CIFAR” paragraph 3 lines 3-5), wherein the target neural network model is obtained by training a neural network model based on training data (“The networks were trained 5 times using different random initializations,” page 5 paragraph 2 line 3), an activation function of the neural network model comprises at least one piecewise function (“Here we define the adaptive piecewise linear (APL) activation unit. Our method formulates the activation function h_i(x) of an APL unit i as a sum of hinge-shaped functions, h_i(x) = max(0, x) + Σ_{s=1}^{S} a_i^s max(0, −x + b_i^s),” page 2 section “2 Adaptive Piecewise Linear Units” lines 1-3), an activation function of the target neural network model comprises at least one target piecewise function, and the target piecewise function is obtained by updating a plurality of trainable parameters of the piecewise function in a process of training the neural network model (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
As to dependent claim 11, Agostinelli further discloses a method wherein the at least one piecewise function is a piecewise linear function, and parameters of the at least one piecewise function comprise one or more of the following: a quantity of boundary points, a right boundary, a left boundary, a slope of a range with a maximum domain, a slope of a range with a minimum domain, or a function value corresponding to a boundary point (“The number of hinges, S, is a hyperparameter set in advance, while the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 4-7).
As to dependent claim 17, Agostinelli further discloses a method wherein the plurality of trainable parameters of the at least one piecewise function comprise: the right boundary, the left boundary, the slope of the range with the maximum domain, the slope of the range with the minimum domain, or the function value corresponding to the boundary point (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
As to dependent claim 18, Agostinelli further discloses a method wherein a quantity of segments of the piecewise function is any value from 6 to 18 (“S = 10,” page 6 section “3.3 Effects of APL Unit Hyperparameters” Table 3 line 7; “Any continuous piecewise-linear function g(x) can be expressed by Equation 1 for some S,” page 2 section “2 Adaptive Piecewise Linear Units” theorem 1 lines 1-3).
As to independent claim 19, Agostinelli discloses a neural network model training apparatus, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the operations (“Experiments were performed using the software package CAFFE (Jia et al., 2014),” page 4 section “3 Experiments” line 1) of:
obtaining training data (“The CIFAR-10 and CIFAR-100 datasets,” page 4 section “3.1 CIFAR” line 1);
training a neural network model based on the training data (“The networks were trained 5 times using different random initializations,” page 5 paragraph 2 line 3), wherein an activation function of the neural network model comprises at least one piecewise function (“Here we define the adaptive piecewise linear (APL) activation unit. Our method formulates the activation function h_i(x) of an APL unit i as a sum of hinge-shaped functions, h_i(x) = max(0, x) + Σ_{s=1}^{S} a_i^s max(0, −x + b_i^s),” page 2 section “2 Adaptive Piecewise Linear Units” lines 1-3), and the piecewise function comprises a plurality of trainable parameters (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7); and
updating the plurality of trainable parameters of the at least one piecewise function in a process of training the neural network model, to obtain a target neural network model (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
As to independent claim 20, Agostinelli discloses a data processing apparatus, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the operations (“Experiments were performed using the software package CAFFE (Jia et al., 2014),” page 4 section “3 Experiments” line 1) of:
obtaining to-be-processed data, wherein the data comprises image data, voice data, or text data (“The CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009) are 32x32 color images that have 10 and 100 classes, respectively,” page 4 section “3.1 CIFAR” lines 1-2); and
processing the to-be-processed data by using a target neural network model, to obtain a processing result of the to-be-processed data (“For testing we just take the center 32 x 32 image. To the best of our knowledge, the results we report for data augmentation using the network-in-network architecture are the best results reported for CIFAR-10 and CIFAR-100 for any method,” page 4 section “3.1 CIFAR” paragraph 3 lines 3-5), wherein the target neural network model is obtained by training a neural network model based on training data (“The networks were trained 5 times using different random initializations,” page 5 paragraph 2 line 3), an activation function of the neural network model comprises at least one piecewise function (“Here we define the adaptive piecewise linear (APL) activation unit. Our method formulates the activation function h_i(x) of an APL unit i as a sum of hinge-shaped functions, h_i(x) = max(0, x) + Σ_{s=1}^{S} a_i^s max(0, −x + b_i^s),” page 2 section “2 Adaptive Piecewise Linear Units” lines 1-3), an activation function of the target neural network model comprises at least one target piecewise function, and the target piecewise function is obtained by updating a plurality of trainable parameters of the piecewise function in a process of training the neural network model (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7).
Claim Rejections - 35 U.S.C. § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 C.F.R. § 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.
Claims 3-7 and 12-16 are rejected under 35 U.S.C. § 103 as being unpatentable over Agostinelli in view of Kingma et al. (“Adam: A Method for Stochastic Optimization,” 30 January 2017, https://arxiv.org/abs/1412.6980).
As to dependent claim 3, the rejection of claim 2 is incorporated.
Agostinelli does not appear to expressly teach a method wherein the process of training the neural network model comprises a first phase and a second phase, and the first phase is performed before the second phase; and
the updating the plurality of trainable parameters of the at least one piecewise function in a process of training the neural network model comprises:
updating, in the second phase, the plurality of trainable parameters of the at least one piecewise function based on gradients of the plurality of trainable parameters of the at least one piecewise function, wherein
initial values of the right boundary and the left boundary of the at least one piecewise function in the second phase are determined based on distribution of a feature input to the at least one piecewise function in the first phase.
Kingma teaches a method wherein the process of training the neural network model comprises a first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate,” page 2 Algorithm 1) and a second phase (step “Update parameters,” page 2 Algorithm 1), and the first phase is performed before the second phase (page 2 Algorithm 1); and
the updating the plurality of trainable parameters of the at least one [] function in a process of training the neural network model comprises:
updating, in the second phase, the plurality of trainable parameters of the at least one [] function based on gradients of the plurality of trainable parameters of the at least one [] function (step “Update parameters,” page 2 Algorithm 1), wherein
initial values of the right boundary and the left boundary of the at least one [] function in the second phase are determined based on distribution of a feature input to the at least one [] function in the first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate,” page 2 Algorithm 1).
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the training of Agostinelli to comprise the algorithm of Kingma. (1) The Examiner finds that the prior art included each claim element listed above, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference. (2) The Examiner finds that one of ordinary skill in the art could have combined the elements as claimed by known software development methods, and that in combination, each element merely performs the same function as it does separately. (3) The Examiner finds that one of ordinary skill in the art would have recognized that the results of the combination were predictable, namely efficiently optimizing each piece of the piecewise function (Kingma abstract). Therefore, the rationale to support a conclusion that the claim would have been obvious is the combining of prior art elements according to known methods to yield predictable results to one of ordinary skill in the art. See MPEP § 2143(I)(A).
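For context, and forming no part of the grounds of rejection, the per-iteration updates of Kingma's Algorithm 1 relied upon above may be sketched as follows; the function name and the quadratic example are the examiner's own, with the default hyper-parameter values taken from Kingma section 2:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of Kingma's Algorithm 1 (Adam).

    The moment estimates are updated first (the steps mapped above to the
    "first phase"), then the parameters are updated from the bias-corrected
    estimates (the step mapped to the "second phase").
    """
    m = beta1 * m + (1 - beta1) * grad        # update biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # update biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)              # compute bias-corrected first moment estimate
    v_hat = v / (1 - beta2 ** t)              # compute bias-corrected second raw moment estimate
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # update parameters
    return theta, m, v

# Illustrative use on f(theta) = theta**2, whose gradient is 2*theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.1)
print(theta)  # driven toward the minimum at 0
```
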
As to dependent claim 4, the rejection of claim 3 is incorporated. Agostinelli/Kingma further teaches a method wherein the plurality of trainable parameters of the at least one piecewise function remain unchanged in the first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate” are separate from step “Update parameters,” Kingma page 2 Algorithm 1).
As to dependent claim 5, the rejection of claim 3 is incorporated. Agostinelli/Kingma further teaches a method wherein the distribution of the feature input to the at least one piecewise function in the first phase is represented by a predicted average value of the feature (steps “Update biased first moment estimate” and “Compute bias-corrected first moment estimate,” Kingma page 2 Algorithm 1) and a predicted standard deviation of the feature (steps “Update biased second raw moment estimate” and “Compute bias-corrected second raw moment estimate,” Kingma page 2 Algorithm 1) that are obtained through a last iteration in the first phase, and the predicted average value of the feature and the predicted standard deviation of the feature are determined by using a moving average method (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4).
As to dependent claim 6, the rejection of claim 5 is incorporated. Agostinelli/Kingma further teaches a method wherein the predicted average value of the feature and the predicted standard deviation of the feature respectively satisfy the following formulas:
Rmean_j+1 = Rmean_j * a + mean(x) * (1 − a) (steps “Update biased first moment estimate” and “Compute bias-corrected first moment estimate,” Kingma page 2 Algorithm 1)
Rstd_j+1 = Rstd_j * b + std(x) * (1 − b) (steps “Update biased second raw moment estimate” and “Compute bias-corrected second raw moment estimate,” Kingma page 2 Algorithm 1), wherein
Rmean_j represents a predicted average value of the feature obtained through a jth iteration, Rmean_j+1 represents a predicted average value of the feature obtained through a (j+1)th iteration, Rstd_j represents a predicted standard deviation of the feature obtained through the jth iteration, Rstd_j+1 represents a predicted standard deviation of the feature obtained through the (j+1)th iteration, and j is an integer greater than or equal to 0 (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4);
when j=0, Rmean_0 represents an initial value of a predicted average value of the feature, Rstd_0 represents an initial value of a predicted standard deviation of the feature, Rmean_0=0, and Rstd_0=0 (“these moving averages are initialized as (vectors of) 0’s,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 4-5); and
mean(x) represents an average value of the feature, std(x) represents a standard deviation of the feature (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4), a represents a weight parameter of Rmean_j, and b represents a weight parameter of Rstd_j (“the hyper-parameters β_1, β_2 ∊ [0, 1) control the exponential decay rates of these moving averages,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 2-3).
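For illustration only, and forming no part of the grounds of rejection, the moving-average formulas recited in claim 6 may be sketched as follows; the decay weights a = b = 0.9 are an illustrative choice of the examiner, not a value from either reference:

```python
def update_running_stats(r_mean, r_std, batch_mean, batch_std, a=0.9, b=0.9):
    """Exponential moving averages in the form recited by claim 6.

    Rmean_{j+1} = Rmean_j * a + mean(x) * (1 - a)
    Rstd_{j+1}  = Rstd_j  * b + std(x)  * (1 - b)

    The decay weights a and b play the role of Kingma's beta_1 and beta_2.
    """
    r_mean = r_mean * a + batch_mean * (1 - a)
    r_std = r_std * b + batch_std * (1 - b)
    return r_mean, r_std

# Both averages start at 0, as recited (Rmean_0 = 0, Rstd_0 = 0).
r_mean, r_std = 0.0, 0.0
for batch_mean, batch_std in [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]:
    r_mean, r_std = update_running_stats(r_mean, r_std, batch_mean, batch_std)
print(r_mean, r_std)  # drifts toward the batch statistics 1.0 and 2.0
```
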
As to dependent claim 7, the rejection of claim 5 is incorporated. Agostinelli/Kingma further teaches a method wherein the initial value of the right boundary RB of the at least one piecewise function in the second phase satisfies the following formula:
RB = Rmean + c * Rstd (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” Agostinelli page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7), and
the initial value of the left boundary LB of the at least one piecewise function in the second phase satisfies the following formula:
LB = Rmean - c * Rstd (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” Agostinelli page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7), wherein
Rmean represents the predicted average value of the feature obtained through the last iteration in the first phase, Rstd represents the predicted standard deviation of the feature obtained through the last iteration in the first phase (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4), and c represents a parameter (“the hyper-parameters β_1, β_2 ∊ [0, 1) control the exponential decay rates of these moving averages,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 2-3).
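For illustration only, and forming no part of the grounds of rejection, the boundary-initialization formulas recited in claim 7 may be sketched as follows; the value c = 3.0 is an illustrative choice of the examiner, as the claim leaves c an unspecified parameter:

```python
def init_boundaries(r_mean, r_std, c=3.0):
    """Initial right/left boundaries per the formulas recited in claim 7.

    RB = Rmean + c * Rstd and LB = Rmean - c * Rstd, so the piecewise
    function's boundaries bracket c standard deviations around the running
    mean of the input feature observed in the first phase.
    """
    rb = r_mean + c * r_std
    lb = r_mean - c * r_std
    return lb, rb

lb, rb = init_boundaries(r_mean=0.5, r_std=1.0)
print(lb, rb)  # -2.5 3.5
```
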
As to dependent claim 12, the rejection of claim 11 is incorporated.
Agostinelli does not appear to expressly teach a method wherein the process of training the neural network model comprises a first phase and a second phase, and the first phase is performed before the second phase; and
that the target piecewise function is obtained by updating a plurality of trainable parameters of the piecewise function in a process of training the neural network model comprises:
the target piecewise function is obtained by updating, in the second phase, the plurality of trainable parameters of the piecewise function based on gradients of the plurality of trainable parameters of the piecewise function, wherein initial values of the right boundary and the left boundary of the piecewise function in the second phase are determined based on distribution of a feature input to the piecewise function in the first phase.
Kingma teaches a method wherein the process of training the neural network model comprises a first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate,” page 2 Algorithm 1) and a second phase (step “Update parameters,” page 2 Algorithm 1), and the first phase is performed before the second phase (page 2 Algorithm 1); and
that the target [] function is obtained by updating a plurality of trainable parameters of the [] function in a process of training the neural network model comprises:
the target [] function is obtained by updating, in the second phase, the plurality of trainable parameters of the [] function based on gradients of the plurality of trainable parameters of the [] function (step “Update parameters,” page 2 Algorithm 1), wherein initial values of the right boundary and the left boundary of the [] function in the second phase are determined based on distribution of a feature input to the [] function in the first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate,” page 2 Algorithm 1).
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the training of Agostinelli to comprise the algorithm of Kingma. (1) The Examiner finds that the prior art included each claim element listed above, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference. (2) The Examiner finds that one of ordinary skill in the art could have combined the elements as claimed by known software development methods, and that in combination, each element merely performs the same function as it does separately. (3) The Examiner finds that one of ordinary skill in the art would have recognized that the results of the combination were predictable, namely efficiently optimizing each piece of the piecewise function (Kingma abstract). Therefore, the rationale to support a conclusion that the claim would have been obvious is the combining of prior art elements according to known methods to yield predictable results to one of ordinary skill in the art. See MPEP § 2143(I)(A).
As to dependent claim 13, the rejection of claim 12 is incorporated. Agostinelli/Kingma further teaches a method wherein the plurality of trainable parameters of the at least one piecewise function remain unchanged in the first phase (steps “Update biased first moment estimate” through “Compute bias-corrected second raw moment estimate” are separate from step “Update parameters,” Kingma page 2 Algorithm 1).
As to dependent claim 14, the rejection of claim 12 is incorporated. Agostinelli/Kingma further teaches a method wherein the distribution of the feature input to the at least one piecewise function in the first phase is represented by a predicted average value of the feature (steps “Update biased first moment estimate” and “Compute bias-corrected first moment estimate,” Kingma page 2 Algorithm 1) and a predicted standard deviation of the feature (steps “Update biased second raw moment estimate” and “Compute bias-corrected second raw moment estimate,” Kingma page 2 Algorithm 1) that are obtained through a last iteration in the first phase, and the predicted average value of the feature and the predicted standard deviation of the feature are determined by using a moving average method (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4).
As to dependent claim 15, the rejection of claim 14 is incorporated. Agostinelli/Kingma further teaches a method wherein the predicted average value of the feature and the predicted standard deviation of the feature respectively satisfy the following formulas:
Rmean_j+1 = Rmean_j * a + mean(x) * (1 − a) (steps “Update biased first moment estimate” and “Compute bias-corrected first moment estimate,” Kingma page 2 Algorithm 1)
Rstd_j+1 = Rstd_j * b + std(x) * (1 − b) (steps “Update biased second raw moment estimate” and “Compute bias-corrected second raw moment estimate,” Kingma page 2 Algorithm 1), wherein
Rmean_j represents a predicted average value of the feature obtained through a jth iteration, Rmean_j+1 represents a predicted average value of the feature obtained through a (j+1)th iteration, Rstd_j represents a predicted standard deviation of the feature obtained through the jth iteration, Rstd_j+1 represents a predicted standard deviation of the feature obtained through the (j+1)th iteration, and j is an integer greater than or equal to 0 (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4);
when j=0, Rmean_0 represents an initial value of a predicted average value of the feature, Rstd_0 represents an initial value of a predicted standard deviation of the feature, Rmean_0=0, and Rstd_0=0 (“these moving averages are initialized as (vectors of) 0’s,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 4-5); and
mean(x) represents an average value of the feature, std(x) represents a standard deviation of the feature (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4), a represents a weight parameter of Rmean_j, and b represents a weight parameter of Rstd_j (“the hyper-parameters β_1, β_2 ∊ [0, 1) control the exponential decay rates of these moving averages,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 2-3).
As to dependent claim 16, the rejection of claim 14 is incorporated. Agostinelli/Kingma further teaches a method wherein the initial value of the right boundary RB of the at least one piecewise function in the second phase satisfies the following formula:
RB = Rmean + c * Rstd (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” Agostinelli page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7), and
the initial value of the left boundary LB of the at least one piecewise function in the second phase satisfies the following formula:
LB = Rmean - c * Rstd (“the variables a_i^s, b_i^s for i ∊ 1, …, S are learned using standard gradient descent during training. The a_i^s variables control the slopes of the linear segments, while the b_i^s variables determine the locations of the hinges,” Agostinelli page 2 section “2 Adaptive Piecewise Linear Units” lines 5-7), wherein
Rmean represents the predicted average value of the feature obtained through the last iteration in the first phase, Rstd represents the predicted standard deviation of the feature obtained through the last iteration in the first phase (“The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 3-4), and c represents a parameter (“the hyper-parameters β_1, β_2 ∊ [0, 1) control the exponential decay rates of these moving averages,” Kingma page 2 section “2 Algorithm” paragraph 2 lines 2-3).
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant’s disclosure:
Bohra et al., “Learning Activation Functions in Deep (Spline) Neural Networks,” 19 November 2020, https://ieeexplore.ieee.org/document/9264754, disclosing piecewise polynomial activation functions.
Applicant is required under 37 C.F.R. § 1.111(c) to consider these references fully when responding to this action.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).
In the interests of compact prosecution, Applicant is invited to contact the examiner via electronic media pursuant to USPTO policy outlined in MPEP § 502.03. All electronic communication must be authorized in writing. Applicant may wish to file an Internet Communications Authorization Form PTO/SB/439. Applicant may wish to request an interview using the Interview Practice website: http://www.uspto.gov/patent/laws-and-regulations/interview-practice.
Applicant is reminded Internet e-mail may not be used for communication for matters under 35 U.S.C. § 132 or which otherwise require a signature. A reply to an Office action may NOT be communicated by Applicant to the USPTO via Internet e-mail. If such a reply is submitted by Applicant via Internet e-mail, a paper copy will be placed in the appropriate patent application file with an indication that the reply is NOT ENTERED. See MPEP § 502.03(II).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ryan Barrett, whose telephone number is 571-270-3311. The examiner can normally be reached from 9:00 am to 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at 571-431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Ryan Barrett/
Primary Examiner, Art Unit 2148