Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
1. Claims 1-2, 4-5, 9-10, 12-13, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Vaswani et al. (US 2021/0390410 A1) (“Vaswani”) in view of Huo et al., “Large Batch Optimization for Deep Learning Using New Complete Layer-Wise Adaptive Rate Scaling,” The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) (“Huo”) and Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497v3 [cs.CV] 6 Jan 2016 (“Ren”).
As to claim 1, Vaswani teaches a method for training an object detection system comprising multiple neural network layers, [[0018]: “As a particular example, the neural network 150 can be configured to process an image to generate a classification output that includes a respective score corresponding to each of multiple categories… In some cases, the categories may be classes of objects…, and the image may belong to a category if it depicts an object included in the object class corresponding to the category” Furthermore, “training” is disclosed in [0071], and multiple layers of a neural network are generally shown in FIG. 1, as further discussed below.] wherein the method comprises:
providing a training image as input to the object detection system, [[0017]: “the system 100 can process the input image 102 using a computer vision neural network 150 that is configured to process an input that includes an image to generate a corresponding output.” [0071]: “The system can repeatedly perform the process 400 on inputs selected from a set of training data as part of a conventional machine learning training technique to train the attention layers and the output layer(s) of the neural network.”] wherein the object detection system comprises a backbone network and a head network, the backbone network comprising multiple convolutional layers and multiple self-attention layers; [[0022]: “In particular, the computer vision neural network 150 includes a backbone neural network 110 that processes the input image 102 to generate a feature representation 130 of the input image 102 and an output neural network 140 that processes the feature representation 130 to generate the output 152 for the computer vision task.” That is, the output neural network corresponds to a head network. [0024]: “As a particular example, the backbone neural network 110 can include multiple residual blocks (also referred to as “layer stacks”) that are each configured to receive a stack input and to generate a stack output. Each block can include a first convolutional neural network layer that reduces a dimensionality of the stack input, a local self-attention layer that operates on the reduced-dimensionality stack input, and a second convolutional neural network layer that increases the dimensionality of the output of the local self-attention layer.”]
processing the training image by the object detection system, wherein the processing comprises performing convolution processing on the training image by using the multiple convolutional layers to obtain a convolution representation, performing self-attention processing on the convolution representation by using the multiple self-attention layers to obtain a feature map, [[0024]: “Each block can include a first convolutional neural network layer that reduces a dimensionality of the stack input, a local self-attention layer that operates on the reduced-dimensionality stack input.” [0049]: “In some implementations, the local self-attention layer determines the context blocks 240 corresponding to each query block 220 by processing the layer input using two-dimensional or three-dimensional convolutions.” [0022]: “The feature representation can be, e.g., one or more tensor of numeric values that represent learned properties of the input image 102. For example, the feature representation 130 can be a single feature map having smaller spatial dimensions than the input image but with a larger number of channels than the input image.”] and processing the feature map by using the head network to obtain a detection result of a target object in the training image; [[0022]: “an output neural network 140 that processes the feature representation 130 to generate the output 152 for the computer vision task.” [0018]: “As a particular example, the neural network 150 can be configured to process an image to generate a classification output that includes a respective score corresponding to each of multiple categories… In some cases, the categories may be classes of objects…, and the image may belong to a category if it depicts an object included in the object class corresponding to the category”].
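For illustration only (not part of the rejection, and not an implementation from any cited reference), the pipeline described in the passages quoted above, i.e., convolution processing, then self-attention over the convolution representation, then a head network producing a detection result, can be sketched schematically; every function below is a simplified one-dimensional stand-in:

```python
import math

# Toy sketch of a backbone (convolution followed by self-attention)
# feeding a head network. All functions are simplified stand-ins.

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (correlation) as a stand-in conv layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def self_attention(features):
    """Single-head self-attention over a 1-D feature sequence, using each
    scalar feature as its own query, key, and value."""
    out = []
    for q in features:
        scores = [q * key for key in features]       # dot-product scores
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # softmax
        z = sum(weights)
        out.append(sum(w / z * v for w, v in zip(weights, features)))
    return out

def head(feature_map):
    """Toy head: score = mean activation; 'detect' if above a threshold."""
    score = sum(feature_map) / len(feature_map)
    return {"score": score, "detected": score > 0.5}

image_row = [0.0, 1.0, 1.0, 0.0, 1.0]
conv_repr = conv1d(image_row, [0.5, 0.5])   # convolution representation
feature_map = self_attention(conv_repr)     # self-attention processing
result = head(feature_map)                  # detection result
```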
Vaswani does not explicitly teach the remaining limitations of “determining a gradient norm of each neural network layer based on object annotation data and the detection result corresponding to the training image; and updating, for each neural network layer, parameter values of the neural network layer based on an average of gradient norms and the gradient norm of the neural network layer.”
Huo teaches "determining a gradient norm of each neural network layer based on object annotation data and the detection result corresponding to the training image" [Page 7884 (2nd page), bottom paragraph of left column: "Let ∇_k f_i(w_t) denote the stochastic gradient with respect to the parameters at layer k and γ denote its corresponding learning rate at layer k…. At each iteration, the learning rate γ_k at layer k is updated using Complete Layer-wise Adaptive Rate Scaling (CLARS) as follows:" Here, in equation (5), ‖∇_k f_i(w_t)‖_2 is the L2 norm of the gradient. Note that this is performed for each layer. See page 7884, left column, paragraph above equation (3); see also page 7886, right column, bottom paragraph: "we know that the upper bound of learning rate γ_k at each layer." Furthermore, the gradient is based on a loss function, as described on page 7884, top paragraph. Thus, although the specific aspect of being based on object annotation data and the detection result corresponding to the training image is not taught, a corresponding generalized concept is taught.] and "updating, for each neural network layer, parameter values of the neural network layer based on an average of gradient norms and the gradient norm of the neural network layer" [In equation (5), (1/B) Σ_{i∈I_t} ‖∇_k f_i(w_t)‖_2 is an average over the B samples of the mini-batch I_t. This learning rate is used to update the weights w (i.e., parameter values) as shown in equation (4), above. See also Algorithm 1, lines 5-6.]
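For illustration, the layer-wise rate scaling described in the quoted passage can be sketched as follows. This is a schematic reading of the cited equation (5), not the authors' code; the name base_lr standing in for γ and the toy numeric values are assumptions made for this sketch:

```python
import math

# Schematic sketch of layer-wise adaptive rate scaling: each layer's
# learning rate is scaled by the ratio of its weight norm to the average,
# over a mini-batch of B samples, of the per-sample gradient norms at
# that layer.

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def layerwise_lr(base_lr, layer_weights, per_sample_layer_grads):
    """layer_weights: list over layers of weight vectors.
    per_sample_layer_grads[i][k]: gradient of sample i at layer k."""
    num_layers = len(layer_weights)
    B = len(per_sample_layer_grads)
    lrs = []
    for k in range(num_layers):
        # (1/B) * sum over the mini-batch of per-sample gradient norms
        avg_grad_norm = sum(l2_norm(per_sample_layer_grads[i][k])
                            for i in range(B)) / B
        lrs.append(base_lr * l2_norm(layer_weights[k]) / avg_grad_norm)
    return lrs

weights = [[3.0, 4.0], [1.0, 0.0]]          # two layers
grads = [[[1.0, 0.0], [0.0, 2.0]],          # sample 1: per-layer gradients
         [[3.0, 0.0], [0.0, 4.0]]]          # sample 2
lrs = layerwise_lr(0.1, weights, grads)
# Layer 0: ||w|| = 5, average ||g|| = (1+3)/2 = 2, so lr = 0.1*5/2 = 0.25
# Layer 1: ||w|| = 1, average ||g|| = (2+4)/2 = 3, so lr = 0.1*1/3
```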
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Vaswani with the teachings of Huo by implementing the steps of "determining a gradient norm of each neural network layer" and "updating, for each neural network layer, parameter values of the neural network layer based on an average of gradient norms and the gradient norm of the neural network layer." The motivation for doing so would have been to alleviate training difficulties caused by layer-wise gradient variance (see § 1, paragraph 3, item 2: "…CLARS alleviate the training difficulties caused by layer-wise gradient variance.").
The combination of references thus far does not teach the limitation of “based on object annotation data and the detection result corresponding to the training image” for the determination of the gradients.
Ren teaches "based on object annotation data and the detection result corresponding to the training image" [§ 3.1, paragraph 2: "This feature is fed into two sibling fully connected layers—a box-regression layer (reg) and a box-classification layer (cls)." § 3.1.2 teaches training based on annotated and predicted boxes and labels: "For training RPNs, we assign a binary class label (of being an object or not) to each anchor… We assign a positive label to two kinds of anchors…with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors." That is, the label and ground-truth boxes are the object annotation data, and include both a bounding box and a classification annotation. Furthermore, this annotation is compared to the output as shown in the loss function shown in equation (1), which is described below: "Here, i is the index of an anchor in a mini-batch and pi is the predicted probability of anchor i being an object. The ground-truth label p*i is 1 if the anchor is positive, and is 0 if the anchor is negative. ti is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t*i is that of the ground-truth box associated with a positive anchor. The classification loss Lcls is log loss over two classes (object vs. not object). For the regression loss, we use Lreg… The outputs of the cls and reg layers consist of {pi} and {ti} respectively." That is, the outputs pi and ti are detection results comprising a classification and a bounding box, whereas p* and t* are the ground truth annotations.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Ren by implementing the determination of the gradients to be “based on object annotation data and the detection result corresponding to the training image.” Doing so would have enabled the model to be trained for both classification and region proposal bounding box (see Ren, § 3.1: “A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.” FIG. 3: “Our method detects objects in a wide range of scales and aspect ratios.”).
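For illustration, the loss of Ren's equation (1), as quoted above, can be sketched as follows. The smooth-L1 form for Lreg is the one Ren cites; the specific anchor probabilities and box coordinates below are hypothetical values chosen for this sketch:

```python
import math

# Sketch of a multi-task RPN loss in the form of the quoted equation (1):
# a log (cross-entropy) classification term comparing predictions p with
# ground-truth labels p*, plus a box-regression term comparing predicted
# coordinates t with ground-truth t*, active only for positive anchors.

def log_loss(p, p_star):
    eps = 1e-12  # numerical guard against log(0)
    return -(p_star * math.log(p + eps)
             + (1 - p_star) * math.log(1 - p + eps))

def smooth_l1(t, t_star):
    total = 0.0
    for a, b in zip(t, t_star):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    n_cls = len(p)
    n_reg = max(1, len(t))
    cls = sum(log_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    # p_star gates the regression term: only positive anchors contribute
    reg = sum(ps * smooth_l1(ti, ts)
              for ps, ti, ts in zip(p_star, t, t_star)) / n_reg
    return cls + lam * reg

# Two anchors: one positive (label 1), one negative (label 0).
p      = [0.9, 0.2]
p_star = [1, 0]
t      = [[0.1, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
t_star = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = rpn_loss(p, p_star, t, t_star)
```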
As to claim 2, the combination of Vaswani, Huo, and Ren teaches the method of claim 1, as set forth above.
Ren further teaches “wherein the detection result comprises a classification result and a detection bounding box of the target object, and wherein the object annotation data comprises a classification annotation result and an annotation bounding box.” [§ 3.1, paragraph 2: “This feature is fed into two sibling fully connected layers—a box-regression layer (reg) and a box-classification layer (cls).” § 3.1.2 teaches training based on annotated and protected boxes and labels: “For training RPNs, we assign a binary class label (of being an object or not) to each anchor… We assign a positive label to two kinds of anchors…with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors.” That is, the label and ground-truth boxes are the object annotation data, and includes both a bounding box and a classification annotation. Furthermore, this annotation is compared to the output as shown in the loss function shown in equation (1), which is described below: “Here, i is the index of an anchor in a mini-batch and pi is the predicted probability of anchor i being an object. The ground-truth label p*i is 1 if the anchor is positive, and is 0 if the anchor is negative. ti is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t*i is that of the ground-truth box associated with a positive anchor. The classification loss Lcls is log loss over two classes (object vs. not object). For the regression loss, we use Lreg… The outputs of the cls and reg layers consist of {pi} and {ti} respectively.” That is, the outputs pi and ti are detection results comprising a classification and a bounding box, whereas p* and t* are the ground truth annotations.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Vaswani, Huo, and Ren, including the teachings of Ren discussed above for the instant dependent claim, so as to have also arrived at the limitations of the instant dependent claim. The motivation for doing so is the same as the one given for the teachings of Ren in the rejection of the parent claim, since the teachings discussed for the instant dependent claim are part of the techniques already discussed in the rejection of the parent claim.
As to claim 4, the combination of Vaswani, Huo, and Ren teaches the method of claim 1, as set forth above.
Ren further teaches:
wherein the head network comprises a region proposal network (RPN) and a classification and regression layer, [FIG. 3, which teaches an RPN comprising a classification layer (cls layer) and a regression layer (reg layer).] and wherein processing the feature map by using the head network comprises:
determining, by using the RPN based on the feature map, a plurality of proposed regions that are predicted to comprise the target object; [§ 3.1.1: "At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors." That is, as shown in FIG. 3, there are k anchor boxes, which are a plurality of proposed regions.]
determining, by using the classification and regression layer and based on a region feature of each proposed region, a target object category and a bounding box that correspond to the proposed region; [§ 3.1.1: “At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal.”] and
using the target object category and the bounding box for each proposed region as the detection result. [§ 3.1.1: “At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Vaswani, Huo, and Ren so as to have also arrived at the limitations of the instant dependent claim. The motivation for doing so is the same as the one given for Ren in the rejection of the parent claim, since the teachings of Ren discussed for this claim are part of the general techniques discussed in the rejection of the parent independent claim.
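For illustration, the output sizes quoted from Ren § 3.1.1 can be checked schematically as follows (the grid dimensions are hypothetical; k = 9 matches the anchor count the paper uses):

```python
# Illustrative shape check (not reference code) of the RPN output sizes:
# with k anchors per sliding-window location, the cls layer emits 2k
# objectness scores and the reg layer emits 4k box-coordinate outputs
# per location.

def rpn_output_sizes(k, grid_h, grid_w):
    cls_per_loc = 2 * k      # object / not-object score per anchor
    reg_per_loc = 4 * k      # 4 parameterized coordinates per anchor
    locations = grid_h * grid_w
    return {
        "proposals": locations * k,
        "cls_outputs": locations * cls_per_loc,
        "reg_outputs": locations * reg_per_loc,
    }

sizes = rpn_output_sizes(k=9, grid_h=40, grid_w=60)
# 40*60 = 2400 locations, each with 9 anchors: 21600 proposals in total
```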
As to claim 5, the combination of Vaswani, Huo, and Ren teaches the method of claim 1, as set forth above.
Huo further teaches wherein determining a gradient norm of each neural network layer based on object annotation data and the detection result corresponding to the training image, comprises:
calculating, by using a back propagation technique, a gradient of each neural network layer based on the object annotation data and the detection result; [Page 7884 (2nd page), bottom paragraph of left column: "Let ∇_k f_i(w_t) denote the stochastic gradient with respect to the parameters at layer k and γ denote its corresponding learning rate at layer k…. At each iteration, the learning rate γ_k at layer k is updated using Complete Layer-wise Adaptive Rate Scaling (CLARS) as follows:" Here, in equation (5), ‖∇_k f_i(w_t)‖_2 is the L2 norm of the gradient. Note that this is performed for each layer. See page 7884, left column, paragraph above equation (3); see also page 7886, right column, bottom paragraph: "we know that the upper bound of learning rate γ_k at each layer." That is, the gradient is computed for each layer. Since the gradient is computed for the layers and the weights of each layer are updated, this use of gradient descent is considered to be a backpropagation technique.] and
calculating a norm of the gradient of each neural network layer as the gradient norm of each neural network layer. [As noted in the rejection of the parent independent claim, in equation (5), ‖∇_k f_i(w_t)‖_2 is the L2 norm of the gradient, and the same operations are used for each layer.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Vaswani, Huo, and Ren, including the teachings of Huo discussed above, so as to have also arrived at the limitations of the instant dependent claim. The motivation for doing so is the same as the one given for the teachings of Huo in the rejection of the parent claim, since the teachings discussed for the instant dependent claim are part of the techniques already discussed in the rejection of the parent claim.
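For illustration only, the per-layer gradient and norm computation discussed above can be shown with a minimal backpropagation sketch. This is a two-layer scalar linear model invented for this sketch, not any cited system:

```python
# Minimal backpropagation sketch: a two-layer linear model
# y = w2 * (w1 * x) with squared-error loss. The chain rule yields one
# gradient per layer, and the norm of each layer's gradient is then
# taken (in the scalar case, the L2 norm is the absolute value).

def forward_backward(x, target, w1, w2):
    h = w1 * x                         # layer 1 output
    y = w2 * h                         # layer 2 output
    loss = 0.5 * (y - target) ** 2
    dy = y - target                    # dLoss/dy
    grad_w2 = dy * h                   # backpropagate into layer 2
    grad_w1 = dy * w2 * x              # backpropagate into layer 1
    return loss, {"layer1": grad_w1, "layer2": grad_w2}

loss, grads = forward_backward(x=1.0, target=1.0, w1=2.0, w2=3.0)
grad_norms = {k: abs(g) for k, g in grads.items()}
avg_norm = sum(grad_norms.values()) / len(grad_norms)
```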
As to claims 9-10 and 12-13, these claims are directed to a system for performing operations that are the same or substantially the same as those of claims 1-2 and 4-5, respectively. Therefore, the rejections made to claims 1-2 and 4-5 are applied to claims 9-10 and 12-13, respectively.
Furthermore, Vaswani teaches “A system, comprising: one or more computers; and
one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform operations for training an object detection system comprising multiple neural network layers” [[0073]: “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus.” [0074]: “The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.”]
As to claims 17-18 and 20, these claims are directed to a computer-readable medium storing instructions for performing operations that are the same or substantially the same as those of claims 1-2 and 4-5, respectively. Therefore, the rejections made to claims 1-2 and 4-5 are applied to claims 17-18 and 20, respectively.
Furthermore, Vaswani teaches “A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations for training an object detection system comprising multiple neural network layers” [[0073]: “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus.” [0074]: “The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.”]
2. Claims 3, 11, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Vaswani in view of Huo and Ren, and further in view of Tran et al. (US 2019/0066326 A1) (“Tran”).
As to claim 3, the combination of Vaswani, Huo, and Ren teaches the method of claim 1, wherein the convolution representation comprises C two-dimensional matrices, [Vaswani, [0024]: “Each block can include a first convolutional neural network layer that reduces a dimensionality of the stack input, a local self-attention layer that operates on the reduced-dimensionality stack input.” That is, the “reduced-dimensionality stack input” corresponds to a two-dimensional matrix. See also Vaswani, [0061]: “As shown in Table 1, the neural network maps an input image of size s×s to an output that includes a respective score for each of 1000 categories. The architecture includes a backbone neural network that includes (i) an initial layer block that includes a 7×7 convolution with stride 2 and a 3×3 max pooling layer with stride 2.” That is, the convolution produces outputs of resolution s/4 x s/4 (see the original application of this publication for a clearer print of the table).] and wherein performing self-attention processing comprises:
performing, by using the multiple self-attention layers, self-attention processing on C vectors obtained […]; [Vaswani, [0024]: “Each block can include a first convolutional neural network layer that reduces a dimensionality of the stack input, a local self-attention layer that operates on the reduced-dimensionality stack input.” Vaswani, [0061]: “(ii) four local self-attention layer blocks that each reduce the spatial resolution of the each include multiple sets of local self-attention layers that are each preceded and followed by a 1×1 convolution. The second, third, and fourth self-attention layer blocks each reduce the spatial resolution of the input to the block, e.g., by having an attention downsampling layer as the first local self-attention layer in the block.” In regards to flattening, see [0048]: “As a particular example, the system can flatten each (b, b) block into a sequence of b2 elements and process the image through the layers of the neural network as a five-dimensional tensor”]
Ren further teaches "by performing flattening processing based on the C two-dimensional matrices, to obtain Z vectors" [§ 3.1: "Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following)." See also FIG. 3. That is, the 2D feature map is mapped (flattened) to a vector of length 256.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Vaswani, Huo, and Ren so as to have also arrived at the above limitation of “by performing flattening processing based on the C two-dimensional matrices, to obtain Z vectors.” The motivation for doing so is the same as the one given for Ren in the rejection of the parent claim, since the teachings of Ren discussed for this claim are part of the general techniques discussed in the rejection of the parent independent claim.
Tran teaches "respectively performing truncation and stack processing on the Z vectors to obtain Z two-dimensional matrices as the feature map." [[0031]: "The feature weighting system 130 can consist of a reshaping layer 131, a set of fully connected layers 132, a softmax layer 133, and a reshaping layer 134. In one embodiment, the reshaping layer 131 can resize the combined feature map 120 of size W×H×C into a one-dimensional (1D) vector of size 1×(W·H·C), which can then be passed through a set of fully connected layers 132 of various output sizes, e.g., 1024, 512, 256, and 128 dimensional vectors for example. The output from fully connected layers 132 can then be passed to a softmax layer 133 to compute a score vector (where each entry value is between zero and one). The score vector can then be resized by the reshaping layer 134 to have the size of W×H (or the same spatial dimension as the combined feature map 120)." That is, the reshape operation from 1D vectors to W×H size matrices corresponds to a truncation and stack processing. Note that the score map is a feature map itself, but it is also used to update (obtain) the combined feature map. See [0030].]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references with the teachings of Tran by implementing the reshaping technique of Tran so as to arrive at the claimed invention. The motivation for doing so would have been to obtain a weighted feature map, which enables prediction of characteristics (see Tran, [0030]: "hence the score map 140 can be used to weight or multiply along each channel of the combined feature map 120 to obtain the (spatially) weighted feature map 150. The weighted feature map 150 can be fed to a pose estimation CNN 160 to predict a pose 170.").
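For illustration, the flatten and reshape round trip described in the quoted passage can be sketched as follows. This is not Tran's implementation, and the channel-last memory ordering chosen here is an assumption made for this sketch:

```python
# Illustrative sketch of resizing a W x H x C feature map into a
# 1 x (W*H*C) vector and rebuilding C matrices of size W x H from a
# vector of the same length (channel-last traversal order assumed).

def flatten(feature_map):
    """feature_map[w][h][c] -> flat list of length W*H*C."""
    return [v for row in feature_map for col in row for v in col]

def reshape_to_matrices(vec, W, H, C):
    """Inverse: rebuild C matrices of size W x H from the flat vector."""
    mats = [[[0.0] * H for _ in range(W)] for _ in range(C)]
    i = 0
    for w in range(W):
        for h in range(H):
            for c in range(C):
                mats[c][w][h] = vec[i]
                i += 1
    return mats

W, H, C = 2, 2, 3
fmap = [[[float(w * 100 + h * 10 + c) for c in range(C)]
         for h in range(H)] for w in range(W)]
flat = flatten(fmap)                        # 1D vector of length 12
mats = reshape_to_matrices(flat, W, H, C)   # 3 matrices of size 2 x 2
```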
As to claims 11 and 19, these claims recite further limitations that are the same or substantially the same as those of claim 3. Therefore, the rejection made to claim 3 is applied to claims 11 and 19.
Allowable Subject Matter
Claims 6-8 and 14-16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
The prior art of record does not teach or fairly suggest the following limitations of dependent claims 6 and 14, in combination with the remaining limitations of the claims, including those incorporated from their independent claims:
calculating an average of multiple gradient norms corresponding, respectively, to the multiple neural network layers; and
updating, for each neural network layer, the parameter values of the neural network layer based on a ratio of the gradient norm of the neural network layer to the average of multiple gradient norms.
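For illustration only, one reading of the quoted limitations can be sketched as follows. Scaling the update step of each layer by the ratio of that layer's gradient norm to the average of all layers' gradient norms is an assumption made for this sketch, not applicant's disclosed implementation:

```python
import math

# Sketch: compute the average of the per-layer gradient norms, then
# scale each layer's parameter update by the ratio of that layer's
# gradient norm to the average.

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def update_layers(weights, grads, lr=0.1):
    norms = [l2_norm(g) for g in grads]
    avg = sum(norms) / len(norms)        # average across the layers
    new_weights = []
    for w, g, n in zip(weights, grads, norms):
        ratio = n / avg                  # this layer's norm / average norm
        new_weights.append([wi - lr * ratio * gi
                            for wi, gi in zip(w, g)])
    return new_weights

weights = [[1.0, 1.0], [2.0, 2.0]]
grads   = [[3.0, 4.0], [0.0, 5.0]]       # both layer norms are 5
updated = update_layers(weights, grads)
```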
The closest prior art of record is discussed below.
Vaswani, as discussed above, teaches the general context of an object detection model, but does not teach the use of a gradient norm.
Huo, as discussed above, teaches the use of a gradient norm, but does not teach “an average of multiple gradient norms corresponding, respectively, to the multiple neural network layers” and “a ratio of the gradient norm of the neural network layer to the average of multiple gradient norms.” Instead, Huo teaches that the gradient norms are averaged with respect to the batch count.
The differences between the claimed invention and the prior art are such that the claimed invention as a whole would not have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. For example, the following references do not teach the limitations described above.
Chen et al., “GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks,” Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018 teaches a gradient norm technique involving the average of gradient norms. However, the average is across different tasks, and not across different layers of the neural network. Furthermore, this reference does not explicitly use the ratio of one norm to the average of multiple norms. Therefore, Chen does not teach “an average of multiple gradient norms corresponding, respectively, to the multiple neural network layers” and “a ratio of the gradient norm of the neural network layer to the average of multiple gradient norms.”
Nakamura et al., “Adaptive Weight Decay for Deep Neural Networks,” in IEEE Access, vol. 7, pp. 118857-118865, 2019, doi: 10.1109/ACCESS.2019.2937139 teaches an average of gradient norms in a single layer (see Algorithm 1), but does not teach “an average of multiple gradient norms corresponding, respectively, to the multiple neural network layers” and “a ratio of the gradient norm of the neural network layer to the average of multiple gradient norms.”
Dependent claims 7-8 and 15-16 are patentable over the art at least on the basis of their claim dependency on claim 6 or 14.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following documents depict the state of the art.
Ginsburg et al., “Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments,” arXiv:1905.11286v3 [cs.LG] 6 Feb 2020 (“Ginsburg”) teaches the use of gradient norms as a normalization factor.
Chen et al. (US 2019/0130275 A1) teaches the GradNorm technique in a patent publication corresponding to the GradNorm paper described above.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Y.D.H./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124