Prosecution Insights
Last updated: April 19, 2026
Application No. 17/085,264

METHOD, ACCELERATOR, AND ELECTRONIC DEVICE WITH TENSOR PROCESSING

Final Rejection (§103, §DP)
Filed: Oct 30, 2020
Examiner: ALSHAHARI, SADIK AHMED
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: Samsung Electronics Co., Ltd.
OA Round: 5 (Final)
Grant Probability: 35% (At Risk)
Projected OA Rounds: 6-7
Projected Time to Grant: 4y 5m
Grant Probability With Interview: 82%

Examiner Intelligence

Career Allow Rate: 35% (12 granted / 34 resolved; -19.7% vs TC avg)
Interview Lift: strong, +47.1% (allowance among resolved cases with an interview vs. without)
Avg Prosecution: 4y 5m typical timeline (24 applications currently pending)
Total Applications: 58 across all art units (career history)

Statute-Specific Performance

§101: 31.8% (-8.2% vs TC avg)
§103: 41.7% (+1.7% vs TC avg)
§102: 4.1% (-35.9% vs TC avg)
§112: 16.7% (-23.3% vs TC avg)
Tech Center average values are estimates. Based on career data from 34 resolved cases.

Office Action

§103 §DP
DETAILED ACTION

Status of Claims

Claim(s) 1, 5-17, and 20-22 are pending and are examined herein. Claim(s) 1, 8, 17, and 22 have been amended. Claim(s) 2-4 and 18-19 are canceled. Claim(s) 1, 5-17, and 20-22 remain rejected under Nonstatutory Double Patenting (NSDP) and 35 U.S.C. § 103.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Acknowledgment is made of applicant's claim for domestic priority to US Provisional Application No. 62/117,728, filed on November 24, 2020.

Response to Amendment

The amendment filed on June 10, 2025 has been entered. Claims 1, 5-17, and 20-22 are pending in the application. Applicant's amendments to the claims have been fully considered and are addressed in the rejections below.

Response to Arguments

Applicant's arguments with respect to the rejection under 35 U.S.C. § 103, filed on 06/10/2025 (see Remarks, pp. 8-11), have been fully considered but are not persuasive. Applicant argues that the cited references do not teach or suggest the limitation "wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. and the number of input channels of the kernel is equal to the number of channels of the target tensor." as currently amended in claim 1. The examiner respectfully disagrees.

First, Applicant has amended the claim to recite the limitation "wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. and the number of input channels of the kernel is equal to the number of channels of the target tensor." Claims 3 and 4 have been canceled, and not every element of claims 3 and 4 has been incorporated into amended claim 1 as previously presented. Claim 4 was previously dependent on claim 3. Accordingly, the scope of the claim is not the same as previously presented.

Second, Applicant asserts that the cited references do not teach the amended claim limitations. Applicant merely restates the claim language and provides only a conclusory statement that the relied-upon art does not teach it. Applicant's remarks amount to a general allegation under 37 CFR 1.111(b) and fail to comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. For example, Applicant's argument does not provide a factual explanation of how Luo's disclosure differs from the claimed feature. Applicant merely states that Luo "merely describes" normalizing a target feature map set based on variance and mean, without addressing Luo's explicit description of the channel dimension, the correspondence between channels and convolution kernels, and the normalization unit operating across each channel, each of which maps to the claimed elements. Accordingly, Applicant's arguments fail to describe what distinguishes the claim features from the cited prior art and instead amount to mere conclusory statements.

Lastly, Huang and Luo teach the claim limitations as currently amended and under the broadest reasonable interpretation (BRI) of the claim language.
Specifically, Huang describes ([0073]-[0076]) that the number of input channels of the filter (i.e., the kernel's size) corresponds to the number of channels of the input tensor (e.g., the target tensor). Luo, in combination with Huang, further teaches ([0143]-[0146], [0260]-[0263]) a dimension normalization unit that normalizes a feature map set output from a network layer along a channel dimension, based on the number of channels corresponding to the feature map set. Luo further teaches that each channel corresponds to at least one feature map, and that the number of channels of the feature map set is identical to the number of convolution kernels. Thus, the per-channel normalization, where the mean and variance are computed over the channel dimension, demonstrates that the number of normalization unit elements along the channel dimension corresponds to the number of channels of the feature map set, and that the number of channels of the convolution kernel corresponds to the number of channels of the feature map set, as required by the claim. Furthermore, the Examiner notes that under the BRI, the term "target tensor" is broadly interpreted as corresponding to the "feature maps" or "input tensor." The "target tensor" broadly encompasses the feature map output by a network layer, as taught by the cited references.

Applicant's arguments for independent claims 17 and 22, and for all remaining dependent claims, mainly refer back to the same arguments presented for independent claim 1 and make the same assertion that the cited references do not teach the claims for the same reasons (see Remarks, pp. 11-17). The Examiner notes that these remarks merely incorporate the same general allegation arguments made for claim 1 and are unpersuasive for the same reasons discussed above. Accordingly, Applicant's arguments are not persuasive, and the rejection under 35 U.S.C. § 103 is maintained. The examiner refers to the updated prior art rejection under 35 U.S.C. § 103 for more details.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159.
See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1, 5, 7, 13-17, and 21 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-5, 7, 11-12, 14-17, and 20 of copending Application No. 17/091,338 in view of Banner et al. (NPL: "Scalable methods for 8-bit training of neural networks" (2018)), Na et al. (Pub. No.: US 20180174044 A1), Yang et al. (Pub. No.: US 20220129740 A1), and Ross et al. (Pub. No.: US 20200160226 A1). Although the claims at issue are not identical, they are not patentably distinct from each other. Corresponding claims/features are shown in the table below, and the rejection based on an obviousness analysis is further described below.

Claims of the Present Application (filed 06/10/2025) compared with claims of Copending Application No. 17/091,338 (filed 04/25/2025):

Present claim 1
(Currently amended): A processor-implemented tensor processing method, comprising: receiving a request to process a neural network including a normalization layer by an accelerator; and generating an instruction executable by the accelerator in response to the request, wherein, by executing the instruction, the accelerator is configured to determine an intermediate tensor corresponding to a result of performing a portion of operations included in the normalization layer, by performing, in a channel axis direction, a convolution based on: a target tensor on which the portion of operations is to be performed; and a kernel having a number of input channels and a number of output channels determined based on the target tensor and including elements of scaling values determined based on the target tensor, wherein the accelerator is configured to, for determining the intermediate tensor, perform normalization in the normalization layer of the neural network by replacing a calculation of a square sum or a square mean with the convolution performed in the channel axis direction, including extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel, wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. and the number of input channels of the kernel is equal to the number of channels of the target tensor.

[Examiner's note: A difference between the instant application and the copending application is that the intermediate tensors of the copending application are determined by subtracting average values from the result of the convolution operation. On the other hand, the intermediate tensor of the instant application is determined by extracting diagonal elements from the result of the convolution operation. Furthermore, the instant application differs from the copending application in the limitation of replacing a square root calculation in the normalization process. The same applies to independent claims 17 and 22.]

Copending claim 1 (Currently amended): A processor-implemented tensor processing method, comprising: receiving a request to process a neural network including a normalization layer by an accelerator; and generating an instruction executable by the accelerator in response to the request, wherein, by executing the instruction, the accelerator is configured to determine an intermediate tensor corresponding to a result of a portion of operations of the normalization layer, by performing, in a channel axis direction, a convolution based on an input tensor and a kernel such that an element in the intermediate tensor is determined based on elements of a plurality of channels of the input tensor, wherein the input tensor is of the normalization layer, a number of input channels of the kernel is determined based on the input tensor, and at least a portion of scaling values of elements of the kernel have absolute values equal to an inverse of either one of a number of input channels of the input tensor and a number of elements included in a same channel of the input tensor.

Copending claim 2 (Original): The method of claim 1, wherein the intermediate tensor is determined by subtracting an average value of one or more elements of the input tensor from a value of each of the one or more elements through the convolution, and an output tensor corresponding to an output of the normalization layer is determined based on the intermediate tensor.

Copending claim 3 (Previously presented): The method of claim 1, wherein the number of input channels and a number of output channels of the kernel are equal to the number of channels of the input tensor, and diagonal elements of the kernel have different scaling values from scaling values of remaining elements of the kernel.

Present claim 5: The method of claim 1, wherein the number of output channels of the kernel is determined based on a width length of the target tensor.
Copending claim 3 (Previously presented): The method of claim 1, wherein the number of input channels and a number of output channels of the kernel are equal to the number of channels of the input tensor, and diagonal elements of the kernel have different scaling values from scaling values of remaining elements of the kernel.

Present claim 7: The method of claim 1, wherein a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor.
Copending claim 7 (Previously presented): The method of claim 1, wherein the number of input channels of the kernel is equal to the number of channels of the input tensor, and the scaling values of the elements of the kernel correspond to the inverse of the number of channels of the input tensor.

Present claim 13: The method of claim 1, wherein the convolution is performed between the kernel and an input tensor transformed such that elements included in the same channel are arranged in a line, and the intermediate tensor is determined by transforming elements determined as a result of the convolution to the same form as the input tensor.
Copending claim 5 (Original): The method of claim 1, wherein the convolution is performed between the kernel and a transformed input tensor transformed such that elements included in a same channel of the input tensor are arranged in the channel axis direction, and the intermediate tensor is determined by transforming elements determined as a result of the convolution to a same form as the input tensor.

Present claim 14: The method of claim 1, wherein the convolution is performed in the accelerator such that the target tensor is not transmitted outside the accelerator to perform an operation according to the normalization layer.
Copending claim 11 (Original): The method of claim 1, wherein the convolution is performed in the accelerator such that the input tensor is not transmitted externally from the accelerator for performing an operation according to the normalization layer.

Present claim 15: The method of claim 1, wherein the accelerator is included in either one of: a user terminal into which data to be inferred using the neural network is input; and a server that receives the data to be inferred from the user terminal.
Copending claim 12 (Original): The method of claim 1, wherein the accelerator is included in either one or both of a user terminal configured to receive data to be inferred using the neural network, and a server configured to receive the data to be inferred from the user terminal.

Present claim 16: A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method of claim 1.
Copending claim 14 (Original): A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, configure the one or more processors to perform the method of claim 1.

Present claim 17 (Currently amended): An accelerator, comprising: one or more processors configured to: determine a target tensor on which a portion of operations included in a normalization layer in a neural network is to be performed; determine a kernel having a number of input channels and a number of output channels determined based on the target tensor and including elements of scaling values determined based on the target tensor; and determine an intermediate tensor corresponding to a result of performing the portion of operations by performing, in a channel axis direction, a convolution based on the target tensor and the kernel, wherein the accelerator is configured to, for determining the intermediate tensor, perform normalization in the normalization layer of the neural network by replacing a calculation of a square sum or a square mean with the convolution performed in the channel axis direction, including extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel, and wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor, and the number of input channels of the kernel is equal to the number of channels of the target tensor.
Copending claim 15 (Currently amended): An accelerator, comprising: one or more processors configured to: obtain an input tensor of a normalization layer included in a neural network, obtain a kernel having a number of input channels determined based on the input tensor and including at least a portion of elements of scaling values having absolute values equal to an inverse of either one of a number of input channels of the input tensor and a number of elements included in a same channel of the input tensor, and determine an intermediate tensor corresponding to a result of a portion of operations of the normalization layer, by performing, in a channel axis direction, a convolution which is based on the input tensor and the kernel such that an element in the intermediate tensor is determined based on elements of a plurality of channels of the input tensor.
Copending claim 16 (Original): The accelerator of claim 15, wherein the one or more processors are configured to determine the intermediate tensor by subtracting an average value of one or more elements of the input tensor from a value of each of the one or more elements through the convolution, and an output tensor corresponding to an output of the normalization layer is determined based on the intermediate tensor.
Copending claim 17 (Previously presented): The accelerator of claim 15, wherein the number of input channels and a number of output channels of the kernel are equal to the number of channels of the input tensor, and diagonal elements of the kernel have different scaling values from scaling values of remaining elements of the kernel.

Present claim 21: The accelerator of claim 17, wherein a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor.
Copending claim 20 (Currently amended): The accelerator of claim 15, wherein the number of input channels of the kernel is equal to the number of channels of the input tensor, and the scaling values of the elements of the kernel correspond to an inverse of the number of channels of the input tensor.

As shown in the table above, the claims of the instant application are not identical to the claims of the copending application, but they are not patentably distinct from each other.
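Before turning to the obviousness analysis, the core manipulation recited in present claim 1 (a channel-axis convolution standing in for a square-sum calculation, with diagonal extraction from the result tensor) can be illustrated numerically. The NumPy sketch below is one illustrative reading of the claim language under stated assumptions (a 1x1-convolution formulation, and the target tensor reused as the kernel, cf. present claim 7); it is not code from the application or the cited references.

    # Illustrative sketch only (assumptions noted above, not the applicant's code):
    # replacing a per-position square-sum over channels with a channel-axis
    # matrix product (the 1x1-convolution case) followed by diagonal extraction.
    import numpy as np

    rng = np.random.default_rng(0)
    C, H, W = 4, 3, 3                      # channels, height, width (arbitrary)
    target = rng.standard_normal((C, H, W))

    # Flatten spatial positions: each column is the C-channel vector at one position.
    X = target.reshape(C, H * W)           # shape (C, N), N = H*W positions

    # Reuse the target tensor as the kernel: a 1x1 convolution with C input
    # channels and N output channels is then the matrix product X.T @ X.
    result = X.T @ X                       # result tensor, shape (N, N)

    # The diagonal holds, per position, the sum over channels of squared
    # elements, i.e. the square sum the normalization would otherwise compute.
    square_sum = np.diag(result)           # shape (N,)
    assert np.allclose(square_sum, (X ** 2).sum(axis=0))

    # The normalization unit spans the channel axis, so it has C elements per
    # position (number of elements == number of channels of the target tensor).
    square_mean = square_sum / C
    normalized = X / np.sqrt(square_mean + 1e-6)

On this reading, the off-diagonal entries of the result tensor are discarded; only the diagonal (the channel-wise square sums) feeds the normalization, which matches the "extracting diagonal elements" language.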
Regarding independent claims 1 and 17: as shown above, the copending application teaches the claim limitation "wherein the accelerator is configured to, for determining the intermediate tensor, perform normalization in the normalization layer of the neural network using the convolution in the channel axis direction." Specifically, it describes determining an intermediate tensor by performing a convolution in the channel axis direction based on an input tensor from the normalization layer and a kernel. Thus, the copending application teaches performing normalization within the normalization layer and performing convolution in the channel axis direction.

With respect to the amended feature of the present application, which recites "wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. and the number of input channels of the kernel is equal to the number of channels of the target tensor.", the corresponding elements are recited in claim 3 (which incorporates features of parent claim 1) of the copending application. Since the normalization operation of the present application is replaced with a convolution operation along the channel axis, the normalization unit conceptually represents the set of elements that operates across the channels of the target tensor (i.e., input tensor). Thus, the number of input channels and the number of output channels of the kernel are equal to the number of channels of the input tensor (i.e., target tensor).

Regarding the differences between the instant application and the copending application, the copending application does not explicitly teach the following limitations: (1) the replacement of the square sum or square mean calculation of the normalization operation; and (2) extracting diagonal elements from a result tensor. However, these differences are obvious in view of Banner et al. (NPL: "Scalable methods for 8-bit training of neural networks" (2018)) and Na et al. (US 20180174044 A1).

Regarding amended claim 1: Banner teaches "perform normalization in the normalization layer of the neural network by replacing a calculation of a square sum or square mean" (Banner teaches a scalable method to train a neural network by avoiding and replacing standard Batch Normalization (BN) with Range BN. This includes replacing normalization operations, including the sum of squares and square root operations, with 8-bit matrix multiplication (i.e., a few maximum and minimum operations). See Banner [pp. 3 and 6, Sections 3 & 6.2].) Therefore, it would have been prima facie obvious to one ordinarily skilled in the art of machine learning to have modified the copending claims to incorporate the Range BN method/step as taught by Banner. One would have been motivated to make such a combination in order to enable low-precision operations, with multiplications becoming 16 times faster and at least 15 times more energy efficient. Doing so would enable major performance benefits in terms of speed, memory, and energy (Banner [Sec. 7]).

Na teaches "extracting diagonal elements from a result tensor" (Na, Fig. 7, [0085]: "A transpose of the feature vector h_{i,t−1} is multiplied by the weight matrix W_{hi}. A '2n×2n' matrix is determined as a result of multiplication, and a principal diagonal element 710 of the '2n×2n' matrix is extracted based on diag( ), and accordingly the second term of FIG. 6 is determined. For example, a portion of the result of multiplication is extracted to determine the second term.") Therefore, it would have been prima facie obvious to one ordinarily skilled in the art of machine learning to have modified the copending claims to incorporate the diagonal extraction method/step as taught by Na. One would have been motivated to make such a combination in order to efficiently capture the relationship between data or models, simplify data representation, and improve computational efficiency (Na [0077]-[0096]).

Regarding claim 17: the claim recites similar limitations as claim 1; therefore, the same rationale applies to claim 17.

Regarding dependent claims 5 and 7: the copending application does not explicitly recite the following limitations: (1) the number of output channels of the kernel is determined based on a width length of the target tensor; (2) a scaling value of each of the elements included in the kernel includes a runtime value corresponding to the target tensor; and (3) a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor. However, these differences are obvious in view of Banner et al. (NPL: "Scalable methods for 8-bit training of neural networks" (2018)), Na et al. (US 20180174044 A1), Yang et al. (US 20220129740 A1), and Ross et al. (Pub. No.: US 20200160226 A1).

Regarding claim 5: Ross teaches "the number of output channels of the kernel is determined based on a width length of the target tensor" (Ross, [0164]: "The aisle-major order accesses elements in a three-dimensional (3D) kernel-sized tile of the input tensor first along an axis corresponding to the depth of the 3D tile and subsequently along axes corresponding to the width and height of the 3D tile.") Therefore, it would have been prima facie obvious to one ordinarily skilled in the art of machine learning to have modified the copending claims to incorporate the kernel method/step as taught by Ross. One would have been motivated to make such a combination in order to obtain an optimized approach that makes the convolution operations more efficient, reduces the burden on I/O operations, and improves computational efficiency (Ross [0096]).

Regarding claim 7: Yang teaches "a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor" (Yang, [0033]: "When a given input tensor to the convolutional layer 120 is received, the system 100 generates a respective input-dependent weight for each of the multiple kernels from the input tensor. The input-dependent weight is referred to 'input-dependent' because the system 100 generates the weight based on the input tensor to the convolutional layer 120, with different input tensors resulting in different weights for the various kernels of the conditional convolutional layer 120. Generating the input-dependent weights will be described in more detail below with reference to FIG. 3.")

Regarding claim 21: the claim recites similar limitations as claim 7; therefore, the same rationale applies to claim 21.
Therefore, it would have been prima facie obvious to one ordinarily skilled in the art of machine learning to have modified the copending claims to incorporate the kernel technique/method as taught by Yang, in order to improve the accuracy of all baseline neural networks with a small relative increase in inference cost (<10%), and because doing so would increase the size and performance of the convolutional neural network with only minimal increases in computational overhead (Yang [0007]).

This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented. As noted above, a timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome a provisional rejection based on nonstatutory double patenting provided the reference application either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 5, 13, 14-17, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (Pub. No.: US 20200410337 A1) in view of AlBahar et al. (NPL: "Guided Image-to-Image Translation with Bi-Directional Feature Transformation" (2019)), further in view of Banner et al. (NPL: "Scalable methods for 8-bit training of neural networks" (2018)), and further in view of Na et al. (Pub. No.: US 20180174044 A1).

Regarding amended claim 1, Huang discloses the following:

A processor-implemented tensor processing method, comprising: (Huang, [0025]: "FIG. 21 is a flow chart illustrating an example of a method for accelerating a tensor operation by performing sub-operations of the tensor operation in parallel on multiple computing engines according to certain embodiment.")

receiving a request to process a neural network (Huang, [0145]: "At block 2110, a host system may receive a neural network model that includes a first tensor operation, such as a convolution operation.")

including a normalization layer (Huang, [0145]: "Each layer 2212 may include two sub-layers that perform matrix multiplications and element-wise transformations. ... A residual connection may be used around each of the two sub-layers, followed by layer normalization. A residual connection adds the input to the output of the sub-layer, and is a way of making training deep networks easier. Layer normalization is a normalization method in deep learning that is similar to batch normalization. The output of each sub-layer may be written as LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer.")

by an accelerator; (Huang, [0081]: "FIG. 7 is a block diagram illustrating an example of an integrated circuit device for performing neural network operations, such as tensor operations, according to certain embodiments. The example shown in FIG. 7 includes an accelerator 702.")

and generating an instruction executable by the accelerator in response to the request, (Huang, [0147]: "At block 2140, the compiler may assign a second sub-operation in the two or more sub-operations to a second computing engine in the two or more computing engines. At block 2150, the compiler may generate instructions (e.g., machine code) for performing the first sub-operation by the first computing engine and for performing the second sub-operation by the second computing engine in parallel. Optionally, at block 2160, the compiler may generate instructions for making an inference based on a result of the first sub-operation and/or a result of the second sub-operation." [0164]-[0165]: "In the example of FIG. 24, the acceleration engine 2412 is a neural network accelerator and the compiler 2430 is for compiling a neural network description into instructions to be executed by the acceleration engine 2412.")

by executing the instruction, the accelerator is configured to (Huang, [0164]: "In various examples, the acceleration engine 2412 can execute program code to perform certain operations.")

the accelerator is configured to determine an intermediate tensor (Huang, [0098]: "Processing element array 710 can output intermediate results, which represent the outputs of individual layers of the neural network. ... Accelerator 702 can store the intermediate results in memory subsystem 704 for inputting into processing element array 710 to compute results for the next layer of the neural network.")

corresponding to a result of performing a portion of operations (Huang, [0130]: "A tensor operation may be divided such that each sub-operation may generate a portion of the output tensor (e.g., output feature maps) ..." [0145]: "The tensor operation may be used to generate an output tensor that includes a set of output feature maps using a set of input feature maps and a set of filters." [0146]: "Each of the two or more sub-operations may generate a portion of the set of output feature maps. In some embodiments, the portion of the set of output feature maps may include a fraction of a total number of output feature maps in the set of output feature maps, where a sub-operation may generate the portion of the set of output feature maps using the set of input feature maps and a fraction of a total number of filters in the set of filters.")

included in the normalization layer, (Huang, [0151]: "Each layer 2212 may include two sub-layers that perform matrix multiplications and element-wise transformations. The first sub-layer may include a multi-head self-attention network, and the second sub-layer may include a position-wise fully connected feed-forward network. A residual connection may be used around each of the two sub-layers, followed by layer normalization. A residual connection adds the input to the output of the sub-layer, and is a way of making training deep networks easier. Layer normalization is a normalization method in deep learning that is similar to batch normalization. The output of each sub-layer may be written as LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer.")

by performing, in a channel axis direction, a convolution based on: a target tensor on which the portion of operations is to be performed; and a kernel (Huang, [0033]: "In one example, each computing engine may perform a convolution operation on a portion of the input feature maps in a shorter time period to generate a portion of each of the output feature maps in the tensor output. In another example, each computing engine may perform a convolution operation on the input feature maps using a portion of the filters for a portion of the output channels in a shorter time period to generate a fraction of the number of output feature maps." [0138]: "According to certain embodiments, the C channels of input feature maps 1930 used for the convolution operation may be divided into L groups, where each group may include N×H×W/L pixels. In addition, the M 3-D filters 1940 (corresponding to output channels) used for the convolution operation may be divided into K groups, where each group may include M/K 3-D filters or output channels. As such, the convolution operation may be divided into L×K sub-operations, where each sub-operation may use M/K 3-D filters and a portion of input feature maps 1930 that include C channels each including N×H×W/L pixel values to generate a portion (e.g., a few rows) of each output feature map on M/K output channels, where each output channel may include N/L output feature maps each including E×F pixels." Further described in [0067] and [0135]-[0146]. [Examiner's note: target tensor (i.e., the input data or input feature maps); the input feature maps are divided into C channels, which are the portions of the input data used to perform the convolution operation; kernel (i.e., a filter or 3-D filter), used to perform a portion of the convolution operation; channel axis direction (i.e., the L groups of sub-operations performing convolution based on the C channels).])

a kernel having a number of input channels and a number of output channels determined based on the target tensor and including elements of scaling values determined based on the target tensor, (Huang, [0073]: "Each input may have an associated weight (w), which may be assigned based on the importance of the input relative to other inputs." "More specifically, as shown in FIG. 5, for a 3-D input 520-1, . . . , or 520-N and a 3-D filter 510-1, . . . , or 510-M, the C 2-D filters (each with dimensions R×S) in 3-D filter 510-m may correspond to the C channels of 2-D input feature maps (each with dimensions H×W) in the 3-D input, and the convolution operation between each 2-D filter of the C 2-D filters and the corresponding channel of the C channels of 2-D input feature maps may be performed. ... W_{c,m}^{r,s} is a weight corresponding to a pixel at a location (r, s) of a 2-D filter of index c in the 3-D filter of index m." [0130]: "As described above with respect to FIGS. 5, 6, and 9, a tensor operation, such as a convolution operation, may use an input tensor that includes N (e.g., one or more) 3-D inputs each including C channels of input feature maps (each with dimensions H×W), and filters that include M 3-D filters each including C channels of 2-D filters (each with dimensions R×S). Thus, the input tensor may include N×C×H×W pixel values, and the filters may include a total of M×C×R×S weight values. As also described above, the C input channels (each including N×H×W pixel values) may be mapped to the rows of the PE array and the M output channels or 3-D filters (each including C×R×S weight values) may be mapped to the columns of the PE array. ... where each sub-operation may use M/K 3-D filters and input feature maps 1730 that include C channels each including N×H×W pixel values to generate output feature maps on M/K output channels, where each output channel may include N output feature maps each including E×F pixels." [0143]: "For example, a first sub-operation may be performed by a first accelerator 2010-1 using a PE array 2020-1. First accelerator 2010-1 may use the C/K 2-D filters in each of the M 3-D filters 2040 and a portion of input feature maps 2030 that includes C/K channels of input feature maps to generate partial sum feature maps 2050-1 for the output feature maps on the M output channels, where each output channel may include N partial sum feature maps each including E×F pixels." Furthermore, see paragraphs [0072], [0132]-[0136], and [0154]. [Examiner's note: the input channels and output channels of the filter (kernel) correspond to the dimensions of the input tensor (e.g., the input feature maps), and the weight values of the filter are the scaling values.])

wherein the accelerator is configured to, for determining the intermediate tensor, (Huang, [0098]: "Processing element array 710 can output intermediate results, which represent the outputs of individual layers of the neural network. ... Accelerator 702 can store the intermediate results in memory subsystem 704 for inputting into processing element array 710 to compute results for the next layer of the neural network. Processing element array 710 can further output final results from a last layer of the neural network." [0133]: "The output feature maps generated by the K accelerators are the final output feature maps of the convolution operation ... The output feature maps generated by each of the K accelerators can be saved to a part of the memory space for the output feature maps of the convolution operation and can be used to make a prediction or decision.")

... including extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel. (Huang, Figs. 4A-4E, [0069]-[0071]: "FIG. 4A illustrates an example input matrix 410 that includes the example input pixel data. Input matrix 410 may include a 6×6 pixel array, where each element of the pixel array may include a real number, such as an integer number or a floating point number. FIG. 4B illustrates an example filter 420. Filter 420 may include a 3×3 matrix, where each element of the matrix represents a weight of the filter. Filter 420 may be used to extract certain features from input matrix 410. Input matrix 410 and filter 420 may be convoluted to generate an output matrix 430 as shown in FIG. 4C. ... A non-linear activation function (e.g., ReLU, sigmoid, tanh, etc.) may then be applied to output matrix 430 to generate a matrix 440 as shown in FIG. 4D. ... a max pooling operation may be applied to matrix 440 ... Thus, a feature map 450 with four elements 9, 2, 6, and 5 may be generated from the 6×6 input matrix 410 after the convolution, non-linear activation, and pooling operations.")

[Examiner's note: Under the broadest reasonable interpretation (BRI), performing the convolution operation based on the input tensor (410) and the filter matrix (420), then down-sampling to extract elements from a result matrix 440 (i.e., the feature map 450 represents the extracted, e.g., diagonal, elements {9, 2, 6, and 5} of matrix 440), reads on extracting elements from a result tensor determined by the convolution based on the target tensor and the kernel (i.e., the convolution operation performed on the input matrix 410 and the filter matrix 420).]

... wherein ... the number of input channels of the kernel is equal to the number of channels of the target tensor. (Huang, [0073]-[0076]: "... the C 2-D filters (each with dimensions R×S) in 3-D filter 510-m may correspond to the C channels of 2-D input feature maps (each with dimensions H×W) in the 3-D input, and the convolution operation between each 2-D filter of the C 2-D filters and the corresponding channel of the C channels of 2-D input feature maps may be performed. ... As illustrated, input data 620 includes 3 input feature maps 622, 622, and 624 (e.g., input channels), each corresponding to an input channel. The filters include a first set of filters 610-1 and second set of filters 610-2, where first set of filters 610-1 may include three 2-D filters 612-1, 614-1, and 616-1 and second set of filters 610-2 may include three 2-D filters 612-2, 614-2, and 616-2. Each 2-D filter 612-1, 614-1, or 616-1 in first set of filters 610-1 may convolve with the corresponding input feature map 622, 622, or 624,")

[Examiner's note: Huang describes a 3-D convolution where the kernel depth equals the number of input tensor channels. The C channels of 2-D input feature maps (e.g., the 3 input feature maps 622, 622, and 624) correspond to the 3-D filter (a set of 2-D filters). Thereby, the number of input channels of the filter (i.e., kernel) is equal to the number of channels of the input tensor (e.g., target tensor).]

Huang does not appear to explicitly teach: perform normalization in the normalization layer of the neural network by replacing a calculation of a square sum or a square mean with the convolution performed in the channel axis direction, including extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel, and wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor.
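As context for the mappings above, the shape relationships Huang does describe (kernel input-channel count C matching the input tensor's channel count, and M filters yielding M output channels of E×F feature maps) can be checked with a short NumPy sketch. The dimensions below are arbitrary illustrative assumptions, and the loop-based convolution merely stands in for Huang's PE-array implementation.

    # Illustrative shape check (not from Huang): a direct convolution where the
    # kernel's input-channel count C must equal the input tensor's channel count.
    import numpy as np

    N, C, H, W = 1, 3, 6, 6                 # batch, channels, height, width
    M, R, S = 4, 3, 3                       # filters (output channels), filter height/width
    x = np.random.rand(N, C, H, W)
    filters = np.random.rand(M, C, R, S)    # each 3-D filter has C 2-D filters

    E, F = H - R + 1, W - S + 1             # output feature map size (stride 1, no padding)
    y = np.zeros((N, M, E, F))
    for n in range(N):
        for m in range(M):
            for i in range(E):
                for j in range(F):
                    # multiply-accumulate across all C channels and the R×S window
                    y[n, m, i, j] = np.sum(x[n, :, i:i+R, j:j+S] * filters[m])

    assert y.shape == (N, M, E, F)          # M output feature maps of E×F pixels each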
Hereinafter, Huang in view of AlBahar teaches: perform normalization in the normalization layer of the neural network by replacing a calculation of a square sum or square mean with the convolution performed in the channel axis direction, (AlBahar, [pp. 9018-9019, Figs. 2 & 3]: "In (c) we replace every normalization layer with our novel feature transformation (FT) layer that manipulates the input using scaling and shifting parameters generated from the guide using a parameter generator (PG). We denote this uni-directional scheme as uFT. In this work, we propose (d) a bi-directional feature transformation scheme denoted as bFT. In bFT, the input is manipulated using scaling and shifting parameters generated from the guide and the guide is also manipulated using scaling and shifting parameters generated from the input. ... In place of every normalization layer in the encoder, we add our novel FT layer. This layer scales and shifts the normalized feature of that layer as shown in Figure 4. The scaling and shifting parameters are generated using a parameter generation model of two convolution layers with a bottleneck of 100 dimension. ... In our proposed bFT scheme, we replace every normalization layer with our proposed FT layer. At l-th layer, the guidance feature representation manipulates the input feature representation as shown in Eqn. 1, and at the same time is manipulated by that input feature representation.")

[Examiner's note: Under the broadest reasonable interpretation of the claim in light of the specification, the claim limitation suggests that typical normalization layers are replaced with the convolution (e.g., replaced with one or two convolutional layers); see Applicant's specification, paragraphs [0095] and [0097]. AlBahar replaces the normalization layer with an FT layer using a parameter generator (PG) consisting of convolution layers. The replaced normalization layer involves normalization operations such as the mean and standard deviation; the standard deviation is obtained by taking the square root of the variance, and the variance involves determining the sum of squared differences.]

Therefore, at the effective filing date of the claimed invention, it would have been prima facie obvious to one of ordinary skill in the art to modify the method/system of Huang to incorporate the bFT method/scheme as taught by AlBahar. One would have been motivated to make such a combination in order to better capture local contents and help improve the performance of the end task. Doing so would provide competitive or better performance than the state of the art (AlBahar [Intro]).

As noted above, Huang in view of AlBahar teaches replacing a normalization layer with a convolution operation. While AlBahar does not appear to explicitly indicate that the replaced normalization layer involves "a calculation of a square sum or a square mean," it would have been obvious in view of Banner that normalization involves calculating a square sum or square mean (e.g., variance). Hereinafter, Banner, in combination with Huang in view of AlBahar, teaches: perform normalization in the normalization layer of the neural network by replacing a square sum or a square mean (Banner teaches avoiding and replacing standard Batch Normalization (BN) with Range BN. [Pp. 1-2, Intro]: "The traditional batch normalization [11] implementation requires the computation of the sum of squares, square-root and reciprocal operations; these require high precision (to avoid zero variance) and a large dynamic range." [P. 3, Sec. 3]: "The term Var[x_d] involves sums of squares that can lead to numerical instability as well as to arithmetic overflows when dealing with large values. The Range BN method replaces the above term by normalizing according to the range of the input distribution (i.e., max(·) − min(·)), making it more tolerant to quantization." [P. 6, Sec. 6.2]: "Replacing the sum of squares and square root operations in standard batch-norm by a few maximum and minimum operations has a major benefit in low-precision implementations.")

Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang and AlBahar before them, to incorporate the Range BN method, which replaces normalization operations, as taught by Banner. One would have been motivated to make such a combination in order to enable low-precision operations, with multiplications becoming 16 times faster and at least 15 times more energy efficient. Doing so would enable major performance benefits in terms of speed, memory, and energy (Banner [Sec. 7]).

As described above, Huang describes extracting elements, which may include diagonal elements, from a result tensor determined by the convolution operation. However, the combination of Huang, AlBahar, and Banner does not appear to explicitly teach: extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel.

However, Na, in combination with Huang, AlBahar, and Banner, teaches the limitation: extracting diagonal elements from a result tensor determined by the convolution based on the target tensor and the kernel. (Na, Fig. 7, [0084]-[0086]: "The feature vector h_{i,t−1} output from the current layer at the previous time includes a partial feature vector h_{i,t−1}^A corresponding to the first model and a partial feature vector h_{i,t−1}^B corresponding to the second model. The partial feature vector h_{i,t−1}^A is a feature vector output from an i-th hidden layer in the first model at the previous time, and the partial feature vector h_{i,t−1}^B is a feature vector output from an i-th hidden layer in the second model at the previous time. A transpose of the feature vector h_{i,t−1} is multiplied by the weight matrix W_{hi}. A '2n×2n' matrix is determined as a result of the multiplication, and a principal diagonal element 710 of the '2n×2n' matrix is extracted based on diag( ), and accordingly the second term of FIG. 6 is determined. For example, a portion of the result of multiplication is extracted to determine the second term." Further described in [0098].)

Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, and Banner before them, to incorporate the diagonal extraction method as taught by Na. One would have been motivated to make such a combination in order to efficiently capture the relationship between data or models, simplify data representation, and improve computational efficiency (Na [0077]-[0096]).
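For orientation on the Banner combination relied on above, the sketch below contrasts variance-based batch-norm statistics with a range-based surrogate in the spirit of Banner's Range BN, which the Office action characterizes as replacing sum-of-squares and square-root operations over the data with a few maximum and minimum operations. This is a paraphrase for illustration, not Banner's code; the Gaussian-motivated scale factor C(n) = 1/sqrt(2 ln n) follows the paper's description as summarized here, and the tensor shapes are assumptions.

    # Illustrative contrast (assumptions noted above, not Banner's code):
    # variance-based batch norm vs. a range-based estimate that avoids
    # sums of squares and square roots over the data.
    import numpy as np

    def standard_bn(x, eps=1e-5):
        # Requires a sum of squares (variance) and a square root: costly and
        # overflow-prone in low-precision arithmetic.
        mu = x.mean(axis=0)
        var = ((x - mu) ** 2).mean(axis=0)
        return (x - mu) / np.sqrt(var + eps)

    def range_bn(x, eps=1e-5):
        # Replaces the variance term with the input range scaled by
        # C(n) = 1 / sqrt(2 ln n); only max, min, and a per-batch constant.
        n = x.shape[0]
        mu = x.mean(axis=0)
        scale = (x.max(axis=0) - x.min(axis=0)) / np.sqrt(2.0 * np.log(n))
        return (x - mu) / (scale + eps)

    x = np.random.randn(256, 8)            # batch of 256, 8 features (assumed shapes)
    print(np.abs(standard_bn(x) - range_bn(x)).mean())  # small for Gaussian-like inputs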
While Huang describes that the convolution filter size (i.e., number of channels) corresponds to the number of channels of the input tensor (i.e., target tensor), and AlBahar describes the FiLM that performs channel-wise normalization, the combination of Huang, AlBahar, Banner, and NA does not appear to explicitly suggest: wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. However, Luo, in combination with Huang, AlBahar, Banner, and NA, teaches the limitation: wherein a number of elements of a normalization unit applied to the target tensor is equal to a number of channels of the target tensor. and the number of input channels of the kernel is equal to the number of channels of the target tensor. (Luo, [0143]-[0153] “At step 120, a feature map set output by means of a network layer in the deep neural network is normalized from at least one dimension to obtain at least one dimension variance and at least one dimension mean. The feature map set includes at least one feature map, the feature map set corresponds to at least one channel, and each channel corresponds to at least one feature map. For example, if the network layer is a convolutional layer, the number of channels corresponding to the generated feature map set is identical to the number of convolution kernels, and if the convolutional layer has two convolution kernels, the feature map set corresponding to two channels is generated. …. C represents the number of channels corresponding to a feature map set (i.e., the number of channels corresponding to the network layer in step 120).” [0093] “when obtaining a channel dimension variance and a channel dimension mean corresponding to the channel dimension based on the spatial dimension variance and the spatial dimension mean, the dimension normalization unit is configured to obtain the channel dimension mean based on the spatial dimension mean by using the number of channels corresponding to the feature map set as a variable, and obtain the channel dimension variance based on the spatial dimension mean, the spatial dimension variance, and the channel dimension mean by using the number of channels corresponding to the feature map set as the variable.” Further described [0176]-[0180] and [0260]-[0270].) [Examiner’s Note: Target tensor channels (i.e., the number of channels in the feature map set) and normalization unit elements (i.e., normalization along each channel of the feature map).] Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, and NA before them, to incorporate the adaptive normalization method as taught by Luo. One would have been motivated to make such a combination in order to reduce the amount of calculation and improve processing speed (Luo [0185]). Regarding Original Claim 5, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above, and further teaches. wherein the number of output channels of the kernel is determined based on a width length of the target tensor. (Huang, [0072] “Multiple (e.g., M) 3-D filters 510-1, . . . and 510-M each having C 2-D filters of dimensions R×S may be convolved with the N 3-D inputs 520-1, . . . , and 520-N (e.g., N batches of C input feature maps of dimensions H×W) to generate multiple (e.g., N) 3-D outputs 530-1, . . . , and 530-N, where each of the 3-D outputs 530-1, . . 
. , and 530-N may include M output feature maps (also referred to as output channels). Each 3-D filter 510-1, . . . , or 510-M (with dimensions C×R×S) may be applied to a 3-D input 520-1, . . . , or 520-N (with dimensions C×H×W) to generate an output feature map (with dimensions E×F as described above with respect to FIGS. 3A and 3B) in a 3-D output 530-1, . . . , or 530-N that includes M output feature maps, and thus M 3-D filters may be used to generate the M output feature maps in a 3-D output for a 3-D input.” [0134] “As illustrated, a convolution operation to be performed by a PE array 1820 may use N 3-D inputs each including C channels of 2-D input feature maps (each with dimensions H×W) and 3-D filters 1840 that include M 3-D filters each including C channels of 2-D filters (each with dimensions R×S) to generate output feature maps 1850 that include M output channels of output feature maps. Each output channel may include N output feature maps that each include E×F pixels.”) [Examiner’s note: the output channels of the filter (i.e., output feature map) determined based on the spatial dimensions H and W (i.e., width and height) of the input feature map (i.e., target tensor). Where the filter (i.e., kernel) is the matrices used in convolution operation, and HxW refers to the width and height of the input tensor (i.e., input feature maps).] Regarding Original Claim 13, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above, and further teaches: wherein the convolution is performed between the kernel and an input tensor transformed such that elements included in the same channel are arranged in a line, (Huang, [0086] “Processing element array 710 includes multiple processing elements 711, arranged in rows and columns, such that results output by one processing element 711 can be input directly into another processing element 711. [0101] “Each row of PE array 820 may process one input data set comprising multiple input data elements, such as a one-dimensional vector representing a flattened multi-dimensional matrix.” [0104] “For example, the H×W pixels in each 2-D input feature map may be flattened to form a one-dimensional vector and mapped to a row of the PE array.”) and the intermediate tensor is determined by transforming elements determined as a result of the convolution to the same form as the input tensor. (Huang, [0107]-[0108] “The products in each column may be accumulated to generate a second partial sum vector PSUM0,1 (932) that includes four partial sum sub-vectors for the four output feature maps. Each element in the 16 2-D filters may be loaded into PE array 910 and multiplied with the elements in the one-dimensional vector to generate a partial sum vector that includes four partial sum sub-values for the four output feature maps until a partial sum vector PSUMR-1,S-1 (934) that corresponds to the element (R−1, S−1) in each 2-D filter and includes four partial sum sub-vectors for the four output feature maps is generated. The partial sum sub-vectors in partial sum vectors PSUM0,0 (930), PSUM0,1 (932), . . . 
Regarding Original Claim 13, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above, and further teaches: wherein the convolution is performed between the kernel and an input tensor transformed such that elements included in the same channel are arranged in a line, (Huang, [0086] “Processing element array 710 includes multiple processing elements 711, arranged in rows and columns, such that results output by one processing element 711 can be input directly into another processing element 711.” [0101] “Each row of PE array 820 may process one input data set comprising multiple input data elements, such as a one-dimensional vector representing a flattened multi-dimensional matrix.” [0104] “For example, the H×W pixels in each 2-D input feature map may be flattened to form a one-dimensional vector and mapped to a row of the PE array.”) and the intermediate tensor is determined by transforming elements determined as a result of the convolution to the same form as the input tensor. (Huang, [0107]-[0108] “The products in each column may be accumulated to generate a second partial sum vector PSUM0,1 (932) that includes four partial sum sub-vectors for the four output feature maps. Each element in the 16 2-D filters may be loaded into PE array 910 and multiplied with the elements in the one-dimensional vector to generate a partial sum vector that includes four partial sum sub-values for the four output feature maps until a partial sum vector PSUMR-1,S-1 (934) that corresponds to the element (R−1, S−1) in each 2-D filter and includes four partial sum sub-vectors for the four output feature maps is generated. The partial sum sub-vectors in partial sum vectors PSUM0,0 (930), PSUM0,1 (932), . . . and PSUMR-1,S-1 (934) and corresponding to each respective output feature map may be accumulated to generate a respective vector 940, 942, 944, or 946 that may correspond to a flattened output feature map.”) [Examiner’s Note: the filter and input feature maps flattened into one-dimensional vectors and arranged along rows of the PE array read on the claim limitation “the kernel and an input tensor transformed such that elements included in the same channel are arranged in a line.” The accumulated partial sums correspond to a flattened output feature map (i.e., the transformed intermediate tensor).]
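For illustration, the flattening Huang describes can be sketched as an im2col-style rearrangement, with the convolution performed as a matrix product and the result reshaped back. This is a toy reconstruction under assumed shapes, not Huang's PE-array implementation:

```python
import numpy as np

# Toy input: C channels of H x W, one kernel of C x R x S.
C, H, W = 3, 4, 4
R, S = 3, 3
E, F = H - R + 1, W - S + 1
x = np.random.randn(C, H, W)
k = np.random.randn(C, R, S)

# im2col-style transform: each column holds one receptive field,
# with same-channel elements kept contiguous within the column
# (elements of the same channel "arranged in a line").
cols = np.stack([
    x[:, e:e + R, f:f + S].reshape(-1)
    for e in range(E) for f in range(F)
], axis=1)                                  # shape (C*R*S, E*F)

# Convolution as a single dot product over the flattened data.
flat_out = k.reshape(1, -1) @ cols          # shape (1, E*F)

# Transform the result back to the same (spatial) form as the input,
# yielding the intermediate tensor as a 2-D feature map.
intermediate = flat_out.reshape(E, F)
```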
Regarding Original Claim 15, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above, and further teaches: wherein the accelerator is included in either one of: a user terminal into which data to be inferred using the neural network is input; and a server that receives the data to be inferred from the user terminal. (Huang, FIG. 10A illustrates an example of a series of operations for making an inference using a neural network model. [0187] “The network 2600 can be used to process data. For example, input data can be received at one of the nodes 2602 a-2602 h or from other networks 2608 with which the network 2600 can communicate. In this example, the input data can be directed to a node in the network 2600 that includes an acceleration engine, for the acceleration engine to operate on and produce a result. The result can then be transferred to the node or other network from which the input data was received.” [0193] “The application 2632 may allow the user(s) to interact with the service provider computer(s) to, for example, access web content (e.g., web pages, music, video, etc.).” [0199] “The data stores 2630 may include permanent or transitory data used and/or operated on by the operating system 2628, applications 2632, or drivers 2634. … The information in the data stores 2630 may, in some implementations, be provided over the network(s) 2608 to user devices.” [0202] “The node(s) 2602 a-2602 h may also contain network device(s) 2624 that allow the node(s) 2602 a-2602 h to communicate with a stored database, another computing device or server, user terminals and/or other devices on the network(s) 2600.”)
Regarding Original Claim 16, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above, and further teaches: A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method of claim 1. (Huang, [0200]-[0206] “Alternatively or additionally, computer-readable communication media may include computer-readable instructions, program modules or other data transmitted within a data signal, such as a carrier wave or other transmission. … The modules described herein may be software modules, hardware modules or a suitable combination thereof. If the modules are software modules, the modules can be embodied on a non-transitory computer readable medium and processed by a processor in any of the computer systems described herein.”)
Regarding Amended Claim 17, the claim recites substantially similar limitations as corresponding claim 1 and is rejected for similar reasons as claim 1 using similar teachings and rationale.
Regarding Amended Claim 22, the claim recites substantially similar limitations as corresponding claim 1 and is rejected for similar reasons as claim 1 using similar teachings and rationale. Claim 22 further recites: An electronic device, comprising: a host processor configured to generate an instruction executable by an accelerator, (Huang, [0159] “FIG. 24 includes a block diagram illustrating an example of a host system 2400 on which a compiler 2430, such as is described herein, can run. The illustrated host system 2400 is an example of a computing device, and includes a processor 2402, a processor memory 2404, at least one storage device 2406, various Input/Output (I/O) devices 2408, and at least one network interface 2410. In the example of FIG. 24, the host system 2400 also includes an acceleration engine 2412, which is an integrated circuit device that can accelerate certain operations or computations performed by the host system 2400.”)
Claim(s) 6, 7, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, and NA as described above, and further in view of Yang et al. (US 20220129740 A1).
Regarding Original Claim 6, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above. The combination of Huang, AlBahar, Banner, and NA does not appear to explicitly teach: wherein a scaling value of each of the elements included in the kernel includes a runtime value corresponding to the target tensor. However, Yang, in combination with Huang, AlBahar, Banner, and NA, teaches the limitation: wherein a scaling value of each of the elements included in the kernel includes a runtime value corresponding to the target tensor. (Yang, [0033] “When a given input tensor to the convolutional layer 120 is received, the system 100 generates a respective input-dependent weight for each of the multiple kernels from the input tensor. The input-dependent weight is referred to “input-dependent” because the system 100 generates the weight based on the input tensor to the convolutional layer 120, with different input tensors resulting in different weights for the various kernels of the conditional convolutional layer 120. Generating the input-dependent weights will be described in more detail below with reference to FIG. 3. The system determines, from the input tensor, a respective input-dependent weight for each of the plurality of kernels (step 304). See [0049].”) [Examiner’s Note: “runtime” indicates that the kernel elements’ scaling values (i.e., weight values) are determined based on the input tensor.] Therefore, at the effective filing date, it would have been prima facie obvious to one ordinarily skilled in the art, having the teaching of Huang, AlBahar, Banner, and NA before them, to incorporate the kernel technique/method as taught by Yang. One would have been motivated to make such a combination in order to improve the accuracy of all baseline neural networks with a small relative increase in inference cost (<10%). Doing so would increase the size and performance of the convolutional neural network with only minimal increases in computational overhead (Yang [0007]).
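A schematic sketch of the input-dependent weighting Yang describes: a runtime value is computed from the input tensor and used to scale the kernels, so different inputs yield different effective kernels. The routing function below (global average pool plus sigmoid) is an assumption for illustration only, not Yang's disclosed implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical conditional layer: K candidate kernels, combined with
# weights computed from the input tensor at runtime.
K, C, R, S = 3, 4, 3, 3
kernels = np.random.randn(K, C, R, S)
routing_w = np.random.randn(K, C)           # illustrative routing params

def runtime_kernel(x):
    # x: input tensor of shape (C, H, W). Its global average pool
    # determines a weight per candidate kernel, so different inputs
    # yield different effective kernels ("input-dependent").
    pooled = x.mean(axis=(1, 2))            # shape (C,)
    alphas = sigmoid(routing_w @ pooled)    # shape (K,), runtime values
    return np.tensordot(alphas, kernels, axes=1)  # shape (C, R, S)

x = np.random.randn(C, 8, 8)
k_eff = runtime_kernel(x)                   # kernel scaled per input
```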
Regarding Original Claim 7, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above. The combination of Huang, AlBahar, Banner, and NA does not appear to explicitly teach: wherein a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor. However, Yang, in combination with Huang, AlBahar, Banner, and NA, teaches the limitation: wherein a scaling value of each of the elements included in the kernel is equal to a value of a corresponding element in the target tensor. (Yang, [0033] “When a given input tensor to the convolutional layer 120 is received, the system 100 generates a respective input-dependent weight for each of the multiple kernels from the input tensor. The input-dependent weight is referred to “input-dependent” because the system 100 generates the weight based on the input tensor to the convolutional layer 120, with different input tensors resulting in different weights for the various kernels of the conditional convolutional layer 120. Generating the input-dependent weights will be described in more detail below with reference to FIG. 3.” See [0049].) [Examiner’s note: this limitation is broadly interpreted as each scaling value included in the kernel (e.g., a respective input-dependent weight for each kernel) corresponding to an element of, or being based on, the input tensor (i.e., the claimed target tensor).] The same motivation that was utilized for combining Huang, AlBahar, Banner, NA, and Yang as set forth in claim 6 is equally applicable to claim 7.
Regarding Original Claim 20, the claim recites substantially similar limitations as corresponding claim 6 and is rejected for similar reasons as claim 6 using similar teachings and rationale.
Regarding Original Claim 21, the claim recites substantially similar limitations as corresponding claim 7 and is rejected for similar reasons as claim 7 using similar teachings and rationale.
Claim(s) 8 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, and NA as described above, and further in view of Petrizio et al. (US 20200394519 A1).
Regarding Currently Amended Claim 8, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above. The combination of Huang, AlBahar, Banner, and NA does not appear to explicitly teach: wherein the target tensor is determined based on: an average subtraction tensor comprising values determined by subtracting a value of each of elements included in an input tensor of the normalization layer from an average value of the elements; and a constant value determined based on a number of elements of a normalization unit applied to the target tensor. However, Petrizio, in combination with Huang, AlBahar, Banner, and NA, teaches the limitations: wherein the target tensor is determined based on: an average subtraction tensor comprising values determined by subtracting a value of each of elements included in an input tensor of the normalization layer from an average value of the elements; (Petrizio, [0013] & [0014] “All training data elements, all input data elements, and/or all input matrices of individual convolution layers may then be normalized using the global values of the standard deviation and of the average value.” [0015] “The normalization operation may generally involve an addition, subtraction, division, and/or multiplication of the input matrix with the first normalization value and/or the second normalization value and/or with further normalization values. In particular, the normalization operation may involve a subtraction of an average value from inputs of the input matrix and/or a division of the inputs of the input matrix by the standard deviation.”
[0017] “Alternatively or additionally, the modified shift matrix is determined based on the original shift matrix, based on the ascertained standard deviation and based on the ascertained average value. The standard deviation and the average value used for normalizing the input matrix may thus be integrated into the modified filter matrix and/or the modified shift matrix, so that a normalization of the input matrix need not be carried out separately.”) [Note: the input matrix may refer to a one-dimensional or multidimensional input tensor. See [0010].] and a constant value determined based on the number of elements of a normalization unit applied to the target tensor. (Petrizio, [0053]–[0061] “A modified shift matrix b̃ is determined in a further step S3, based on an original shift matrix b of convolution layer 12 a through 12 c, based on first normalization value v, and based on second normalization value w. In conventional neural networks 10 and/or in a conventional method for operating neural network 10, for normalizing the input matrix, it is customary to subtract average value μ from each entry of input matrix I and to divide the result of this difference by standard deviation σ. The result of this normalization is then convoluted with original filter matrix f and linearly shifted with original shift matrix b, as indicated in the following equation: [equation: f ∗ ((I − μ̃)/σ) + b], where μ̃ is a normalization matrix and/or average value matrix whose inputs all have average value μ and have the same dimension as input matrix I. Original shift matrix b has the same dimension as the result of the convolution of the normalized input matrix with original filter matrix f, and all inputs of shift matrix b have a constant value b.”) Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, and NA before them, to incorporate the normalization operations as taught by Petrizio. One would have been motivated to make such a combination in order to further increase efficiency of the method for operating the neural network and/or to further reduce the computing effort and/or the computing time (Petrizio [0026]).
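Petrizio's folding of the normalization into modified filter and shift matrices can be verified numerically. The sketch below assumes a 1-D valid convolution with a scalar global mean and standard deviation (a simplification of Petrizio's matrices; the names f_mod and b_mod are illustrative):

```python
import numpy as np

# Simplifying assumptions: 1-D "valid" convolution, scalar global
# mean mu and standard deviation sigma (Petrizio's global values).
I = np.random.randn(16)
f = np.random.randn(5)
b = 0.7
mu, sigma = I.mean(), I.std()

# Separate normalization followed by convolution and shift.
ref = np.convolve((I - mu) / sigma, f, mode="valid") + b

# Modified filter and shift with the normalization folded in:
# f~ = f / sigma, b~ = b - (mu / sigma) * sum(f).
f_mod = f / sigma
b_mod = b - (mu / sigma) * f.sum()
fused = np.convolve(I, f_mod, mode="valid") + b_mod

assert np.allclose(ref, fused)  # identical outputs, one pass fewer
```

Because the two paths produce identical outputs, the separate normalization pass can be dispensed with, which is the efficiency rationale the examiner cites from Petrizio [0026].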
Regarding Original Claim 12, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above. The combination of Huang, AlBahar, Banner, and NA does not appear to explicitly teach: wherein the normalization layer is configured to perform normalization using either one or both of an average and a variance determined based on values of one or more elements included in the target tensor. However, Petrizio, in combination with Huang, AlBahar, Banner, and NA, teaches the limitations: wherein the normalization layer is configured to perform normalization using either one or both of an average and a variance determined based on values of one or more elements included in the target tensor. (Petrizio, [0013] & [0014] “The first normalization value and/or the second normalization value may be a scalar, a vector, a matrix, and/or a tensor. The first normalization value and the second normalization value may in each case generally be a value, such as a statistical value, that may be derived from inputs of the input matrix and/or from the training data set. For example, the first normalization value may correlate with a standard deviation, and/or the second normalization value may correlate with an average value. The average value and the standard deviation may be global values and/or may be ascertained based on an overall training data set that includes multiple training data elements.”) The same motivation that was utilized for combining Huang, AlBahar, Banner, NA, and Petrizio as set forth in claim 8 is equally applicable to claim 12.
Claim(s) 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, NA, and Petrizio as described above, and further in view of Pan et al. (NPL: "Switchable whitening for deep representation learning." 2019).
Regarding Original Claim 9, the combination of Huang, AlBahar, Banner, NA, and Petrizio teaches the elements of claim 8 as outlined above, and further teaches: a second kernel having a number of input channels and a number of output channels (Huang, [0054] “Each matrix 230 may be processed by a second convolution layer 235 using a second set of filters. Second convolution layer 235 may perform convolutions on matrix 230 using the second set of filters to generate multiple output matrices 240.” [0072] “Multiple (e.g., M) 3-D filters 510-1, . . . and 510-M each having C 2-D filters of dimensions R×S may be convolved with the N 3-D inputs 520-1, . . . , and 520-N (e.g., N batches of C input feature maps of dimensions H×W) to generate multiple (e.g., N) 3-D outputs 530-1, . . . , and 530-N, where each of the 3-D outputs 530-1, . . . , and 530-N may include M output feature maps (also referred to as output channels).”) [Examiner’s Note: multiple convolution operations are applied to each input tensor with a second set of filters (i.e., kernels). Each filter is applied to the input matrix, which consists of input feature maps (input channels), to obtain output feature maps (output channels).] wherein the target tensor is determined by performing, in a channel axis direction, a convolution based on: the average subtraction tensor; (Petrizio, [0060] “In conventional neural networks 10 and/or in a conventional method for operating neural network 10, for normalizing the input matrix, it is customary to subtract average value μ from each entry of input matrix I and to divide the result of this difference by standard deviation σ.” [0082] “step S1 by subtracting average value μ from the inputs of each training data element, and dividing the result of this subtraction by standard deviation σ. The training data elements may in a manner of speaking represent an input matrix I, as described in the preceding figures, for input layer 12 a.”) and a second kernel having a number of input channels and a number of output channels determined based on the average subtraction tensor … (Petrizio, [0020]–[0021] “the step of converting the input matrix into the output matrix includes a convolution of the input matrix with the modified filter matrix, and an addition of the modified shift matrix to the convoluted input matrix. As explained above, the modified filter matrix and the modified shift matrix may contain the normalization operation for normalizing the input matrix.
the step of ascertaining the modified filter matrix includes forming a ratio of inputs of the original filter matrix to the ascertained standard deviation.” [0060] “The result of this normalization is then convoluted with original filter matrix f and linearly shifted with original shift matrix b, as indicated in the following equation: [equation: f ∗ ((I − μ̃)/σ) + b], where μ̃ is a normalization matrix and/or average value matrix whose inputs all have average value μ and have the same dimension as input matrix I. Original shift matrix b has the same dimension as the result of the convolution of the normalized input matrix with original filter matrix f, and all inputs of shift matrix b have a constant value b.”) [Examiner’s Note: the “multidimensional tensor” includes spatial dimensions of the data and channel dimensions, and the modified filter matrix reads on the “second kernel”.] The combination of Huang, AlBahar, Banner, NA, and Petrizio does not appear to explicitly teach: the average subtraction tensor; and a second kernel having a number of input channels and a number of output channels determined based on the average subtraction tensor and including diagonal elements of scaling values determined based on the constant value. However, Pan, in combination with Huang, AlBahar, Banner, NA, and Petrizio, teaches the limitations: the average subtraction tensor; and a second kernel having a number of input channels and a number of output channels determined based on the average subtraction tensor and including diagonal elements of scaling values determined based on the constant value. (Pan, [Pp. 1864-1865, Section 3] “Normalization. Existing normalization techniques generally perform standardization. For example, Batch Normalization (BN) [10] centers and scales activations using the mean and variance estimated over a mini-batch, accelerating training and enhancing generalization. Our discussion is mainly based on CNNs, where the data have four dimensions. Let X ∈ R^(C×NHW) be the data matrix of a mini-batch, where N, C, H, W indicate the number of samples, number of channels, height, and width respectively. Here N, H and W are viewed as a single dimension for convenience. Let matrix X_n ∈ R^(C×HW) be the nth sample in the mini-batch, where n ∈ {1, 2, ..., N}. Then the whitening transformation φ: R^(C×HW) → R^(C×HW) for a sample X_n could be formulated as φ(X_n) = Σ^(−1/2)(X_n − µ·1^T) (1), where µ and Σ are the mean vector and the covariance matrix calculated from the data, and 1 is a column vector of all ones. … In the covariance matrix Σ, the diagonal elements are the variance for each channel, while the off-diagonal elements are the correlation between channels. Therefore, by simply setting the off-diagonal elements to zeros, the left multiplication of Σ^(−1/2) equals dividing by the standard deviation, so that Eq. (1) becomes standardization.”) Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, NA, and Petrizio before them, to incorporate the Switchable Whitening method as taught by Pan. One would have been motivated to make such a combination in order to achieve consistent improvements over previous normalization methods in a number of computer vision tasks, including classification, segmentation, domain adaptation, and image style transfer (Pan [Conclusion]).
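Pan's observation that zeroing the off-diagonal entries of Σ collapses whitening into per-channel standardization can be checked directly; in that degenerate case Σ^(−1/2) is a diagonal matrix whose diagonal carries the per-channel scaling values, which is the structure the examiner maps onto the claimed second kernel. A small NumPy sketch under assumed shapes:

```python
import numpy as np

# Data matrix X with C channels and N*H*W flattened positions.
C, NHW = 4, 256
X = np.random.randn(C, NHW)

mu = X.mean(axis=1, keepdims=True)          # per-channel mean vector
Xc = X - mu                                 # X_n - mu * 1^T
cov = (Xc @ Xc.T) / NHW                     # covariance, shape (C, C)

# Keep only the diagonal (per-channel variances): Sigma^(-1/2) then
# acts as division by the per-channel standard deviation, i.e.,
# whitening degenerates to standardization.
diag = np.diag(np.diag(cov))
std_via_whitening = np.linalg.inv(np.sqrt(diag)) @ Xc
std_direct = Xc / np.sqrt(np.diag(cov))[:, None]

assert np.allclose(std_via_whitening, std_direct)
```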
Claim(s) 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan as described above, and further in view of Son et al. (Pub. No.: US 20180285715 A1).
Regarding Original Claim 10, the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan teaches the elements of claim 9 as outlined above. Petrizio further teaches: wherein the number of input channels and the number of output channels of the second kernel are equal to the number of elements of the normalization unit, (Petrizio, [0014] “All training data elements, all input data elements, and/or all input matrices of individual convolution layers may then be normalized using the global values of the standard deviation and of the average value.”) Huang teaches: the diagonal elements in the second kernel have different scaling values from those of the remaining elements. (Huang, Fig. 4B, Filter 420, [0054] “Each matrix 230 may be processed by a second convolution layer 235 using a second set of filters. Different filters may be used to detect or extract different features from the input pixel array. … a processing node of a neural network layer … may apply a different set of weights {w_i} to generate a different weighted sum y = Σ_(i=0..n) x_i·w_i for each input dataset {x_i} … where different weight matrices may be used for the multiple attention heads … etc.”) [Note: Fig. 4B, filter 420, includes diagonal weight values, i.e., scaling values (0, −4, 0), where the weight value “−4” differs from the values of the remaining elements of filter 420.] The combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan does not appear to explicitly teach: the number of input channels and the number of output channels of the second kernel are equal to the number of elements of the normalization unit. However, Son, in combination with Huang, AlBahar, Banner, NA, and Petrizio, teaches the limitation: the number of input channels and the number of output channels of the second kernel are equal to the number of elements of the normalization unit, (Son, [0099]–[0100] “The kernel set 301 includes D kernels respectively corresponding to the D output channels, each including C kernel feature maps corresponding to C input channels. … The CNN processing apparatus performs an operation between the input 208 and a kernel corresponding to a first output channel of the kernel set 301 to generate an output feature map corresponding to the first output channel. Likewise, the CNN processing apparatus performs operations between each of the D kernels of the kernel set 301 and the input 208 to generate each of the respective output feature maps corresponding to D output channels.”) Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan before them, to incorporate the kernel method/step as taught by Son. One would have been motivated to make such a combination in order to reduce input elements overlapping in terms of the number of times that the same data is loaded during the multiple convolution operations of the convolutional layer, thereby improving a performance associated with a speed of processing the convolution operations (Son [0103]).
Claim(s) 11 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan as described above, and further in view of Shao et al.
(NPL: "Channel Equilibrium Networks for Learning Deep Representation." 2020). Regarding Previously presented Claim 11, the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan teaches the elements of claim 9 as outlined above. The combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan does not appear to explicitly teach: wherein the constant value is equal to a square root of the number of elements of a normalization unit applied to the target tensor, and the scaling values of the second kernel are equal to an inverse of the square root. However, Shao, in combination with Huang, AlBahar, Banner, NA, Petrizio, and Pan, teaches the limitations: wherein the constant value is equal to a square root of the number of elements of a normalization unit applied to the target tensor, and the scaling values of the second kernel are equal to an inverse of the square root. (Shao, [Pp. 3-4, Section: 3] “In the following descriptions, we treat Σ - 1 / 2 in Eqn.(4) as batch decorrelation (BD) and treat D i a g v n - 1 / 2 as instance reweighting (IR). The former one performs decorrelation by using a covariance matrix estimated in an entire minibatch, while the latter one adjusts correlations among feature channels by reweighting each channel with the inverse square root of an adaptive variance for each instance. Integrating both of them yields a dynamic decorrelation operator conditioned on each instance in the CE bock whose forward representation is illustrated in Fig.2(b).” [P. 4, Section 3.2.] “The IR branch returns an inverse square root of an adaptive instance inverse, denoted as D i a g v n - 1 / 2 , which is used to adjusts correlations among feature channels. It needs to satisfy two requirements. ….., where s - 1 / 2 represents the magnitude of the inverse square root of variance. … etc.” [Pp. 15-16, Section: E] “Moving average in inference. Unlike previous methods in manual architecture design that do not depend on batch estimated statistics, the proposed CE block requires computing the inverse square root of a batch covariance matrix Σ and a glob variance scale s in Eqn.(8) in each training step. To make the output depend only on the input, deterministically in inference, we use the moving average to calculate the population estimate of Σ - 1 / 2 and s - 1 / 2 by following the below updating rules: {equation (32)} where s and Σ are the variance scale and covariance calculated within each mini-batch during training, and m denotes the momentum of moving average. It is worth noting that since Σ - 1 / 2   is fixed during inference, the BD branch does not introduce extra costs in memory or computation except for a simple linear transformation ( Σ ˆ - 1 / 2   x ~ n i j ).) [Examiner’s Note: the diagonal elements of the matrix represent the inverse square root of adaptive instance variance for each channel.] Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, NA, Petrizio, and Pan, before them, to incorporate the Channel Equilibrium method as taught by Shao. One would have been motivated to make such a combination in order reduce computational complexity and accelerate Inference (Shao [Section 3]). Claim(s) 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, and NA as described above, and further in view of Dikici et al., (US 20200301994 A1). 
Claim(s) 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Huang, AlBahar, Banner, and NA as described above, and further in view of Dikici et al. (US 20200301994 A1).
Regarding Original Claim 14, the combination of Huang, AlBahar, Banner, and NA teaches the elements of claim 1 as outlined above. Huang, AlBahar, Banner, and NA do not appear to explicitly teach: wherein the convolution is performed in the accelerator such that the target tensor is not transmitted outside the accelerator to perform an operation according to the normalization layer. However, Dikici, in combination with Huang, AlBahar, Banner, and NA, teaches the limitation: wherein the convolution is performed in the accelerator such that the target tensor is not transmitted outside the accelerator to perform an operation according to the normalization layer. (Dikici, Fig. 25 shows that the normalization unit and convolution operations (e.g., tensor operations) are internally included/processed in the DNN accelerator. Furthermore, Dikici [0113] “the system 2200 for performing a convolution transpose between an input tensor and a filter may form part of a DNN accelerator. A DNN accelerator comprises hardware logic configured to process input data to a DNN in accordance with the layers of the DNN.” [0016] “DNN accelerator 2500 of FIG. 25 also comprises an element-wise operations module 2506, an activation module 2508, a normalisation module 2510, a pooling module 2512, and an output module 2515. … Specifically, together the convolution engine 2202, the accumulators 2204 and the accumulation buffer 2206 can implement or process all or a portion of a convolution layer, a fully connected layer or a convolution transpose layer. The activation module 2508 can process or implement an activation layer. The normalisation module 2510 can process or implement a normalisation layer.”) Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the combination of Huang, AlBahar, Banner, and NA before them, to incorporate the implementation of the DNN accelerator as taught by Dikici. One would have been motivated to make such a combination in order to obtain an integrated circuit that would increase computational performance, reduce latency, and/or reduce power consumption (Dikici [0148]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SADIK ALSHAHARI whose telephone number is (703) 756-4749. The examiner can normally be reached Monday through Friday, 9 A.M.-6 P.M. ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.A.A./ Examiner, Art Unit 2121
/Li B. Zhen/ Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Oct 30, 2020
Application Filed
Oct 24, 2023
Non-Final Rejection — §103, §DP
Jan 30, 2024
Response Filed
Feb 23, 2024
Interview Requested
Mar 05, 2024
Non-Final Rejection — §103, §DP
Mar 06, 2024
Applicant Interview (Telephonic)
Mar 20, 2024
Examiner Interview Summary
Jun 13, 2024
Response Filed
Jul 16, 2024
Interview Requested
Jul 22, 2024
Applicant Interview (Telephonic)
Jul 23, 2024
Final Rejection — §103, §DP
Oct 02, 2024
Examiner Interview Summary
Oct 02, 2024
Applicant Interview (Telephonic)
Oct 29, 2024
Request for Continued Examination
Nov 04, 2024
Response after Non-Final Action
Jan 30, 2025
Non-Final Rejection — §103, §DP
May 07, 2025
Examiner Interview Summary
May 07, 2025
Examiner Interview (Telephonic)
Jun 10, 2025
Response Filed
Aug 14, 2025
Final Rejection — §103, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596930
SENSOR COMPENSATION USING BACKPROPAGATION
2y 5m to grant · Granted Apr 07, 2026
Patent 12493786
Visual Analytics System to Assess, Understand, and Improve Deep Neural Networks
2y 5m to grant · Granted Dec 09, 2025
Patent 12462199
ADAPTIVE FILTER BASED LEARNING MODEL FOR TIME SERIES SENSOR SIGNAL CLASSIFICATION ON EDGE DEVICES
2y 5m to grant · Granted Nov 04, 2025
Patent 12437199
Activation Compression Method for Deep Learning Acceleration
2y 5m to grant · Granted Oct 07, 2025
Patent 12430552
Processing Data Batches in a Multi-Layer Network
2y 5m to grant · Granted Sep 30, 2025
Study what changed to get past this examiner. Based on this examiner's 5 most recent grants.


Prosecution Projections

6-7
Expected OA Rounds
35%
Grant Probability
82%
With Interview (+47.1%)
4y 5m
Median Time to Grant
High
PTA Risk
Based on 34 resolved cases by this examiner. Grant probability derived from career allow rate.
