Prosecution Insights
Last updated: April 19, 2026
Application No. 16/170,360

METHOD AND APPARATUS FOR PERFORMING OPERATIONS IN CONVOLUTIONAL NEURAL NETWORK

Non-Final OA (§103, §112)

Filed: Oct 25, 2018
Examiner: BOSTWICK, SIDNEY VINCENT
Art Unit: 2124
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nanjing Horizon Robotics Technology Co. Ltd.
OA Round: 9 (Non-Final)

Grant Probability: 52% (Moderate)
OA Rounds: 9-10
To Grant: 4y 7m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 52% of resolved cases (71 granted / 136 resolved; -2.8% vs TC avg)
Interview Lift: +38.2% on resolved cases with interview
Avg Prosecution: 4y 7m typical timeline; 68 currently pending
Total Applications: 204 across all art units

Statute-Specific Performance

§101: 24.4% (-15.6% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)

Tech Center averages are estimates; based on career data from 136 resolved cases.

Office Action

§103 §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 9/16/2025 has been entered.

Remarks

This Office Action is responsive to Applicant's Amendment filed on September 16, 2025, in which claims 1, 5-6, 8, 13, 15, and 19 are currently amended. Claims 20-21 are newly added. Claims 1, 5, 6, 8-13, and 15-21 are currently pending.

Response to Arguments

Applicant's arguments with respect to the rejection of claims 1, 5, 6, 8-13, and 15-21 under 35 U.S.C. 101 have been considered and are persuasive. The rejection under 35 U.S.C. 101 has been withdrawn as necessitated by Applicant's arguments and the amendments made to the claims.

Applicant's arguments with respect to the rejection of claims 1, 5, 6, 8-13, and 15-21 under 35 U.S.C. 103 have been considered but are not persuasive.

With respect to Applicant's arguments on pp. 20-21 of the Remarks submitted 9/16/2025 that "the 'processor cores' that Yang splits based on the current performance of the processor are not equivalent to a 'convolutional layer'", Examiner respectfully disagrees. Applicant has misconstrued Yang. Yang is not merely directed towards splitting processor cores for parallelization; Yang accomplishes this by blocking/unrolling the filter weight matrices of individual layers, as explicitly stated in Yang ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy, for each string we continue to pack the lower level buffers into the lowest available level of memory hierarchy, always adding the unpacked buffer with the highest number of accesses. When the current memory level does not have enough remaining space to fit the added buffer, we place that and all subsequent buffers into the next level of the memory hierarchy until it becomes full"). Examiner further notes that, according to amended claim 1 for example, the layer selection only needs to be based on one element selected from a large list of elements that includes the broadly recited metrics "a requirement on operation parallelism" and "a design of the convolution neural network", under which the layer selection in Yang categorically falls.

With respect to Applicant's arguments on pp. 21-22 of the Remarks submitted 9/16/2025 that Yang does not disclose "selecting a layer in the convolutional neural network", Examiner respectfully disagrees. Applicant appears to have interpreted "selecting a layer" very narrowly, as Yang explicitly discloses that the partitioning is performed for individual (selected) layers ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy"; note the singular "a CNN layer"). Examiner also reiterates that, according to amended claim 1 for example, the layer selection only needs to be based on one element selected from a large list of elements, including the broadly recited metrics "a requirement on operation parallelism" and "a design of the convolution neural network", under which the layer selection in Yang categorically falls ([p. 7 §3.5], quoted above).

With respect to Applicant's arguments on p. 23 of the Remarks submitted 9/16/2025 that Yang "only performs multiple levels of nested splitting on the channel C dimension and the convolution kernel K dimension, respectively, without constructing a new parameter consisting of both the channel dimension C and the convolution kernel number dimension K", Examiner notes that the instant claims do not recite "constructing a new parameter consisting of both the channel dimension C and the convolution kernel number dimension K"; rather, the instant claims split an existing weight parameter matrix, which is synonymous with Yang, who, by Applicant's admission on p. 23 of the Remarks submitted 9/16/2025, splits/unrolls the filter matrix into smaller partitions, as also illustrated in FIG. 1 of Yang. Examiner further notes that Applicant's arguments on pp. 23-26 are directed towards the instant specification, and towards Applicant's narrow interpretation of the instant claims in view of the instant specification, rather than towards the scope of the claim language itself.

With respect to Applicant's arguments on p. 27 of the Remarks submitted 9/16/2025 that "in Yang, no array representation is involved", Examiner respectfully disagrees. FIG. 1 of Yang explicitly shows multidimensional arrays, and Yang explicitly teaches that the partitioning is operable to fit fragments of the weight filter in memory ([p. 6 §3.4] "How do we find the right size for this memory block? We leverage the fact that large SRAMs are built from smaller memory arrays, so the energy increase as the memory gets larger is mostly from the energy to communicate the data to the output port from the array where it was stored").
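The packing heuristic the Examiner quotes from Yang ([p. 7 §3.5]) can be sketched as a simple greedy procedure: place the buffer with the highest access count into the lowest memory level with remaining space, and once a level overflows, spill all subsequent buffers to the next level. The sketch below is illustrative only; the buffer names, sizes, access counts, and level capacities are hypothetical and are not taken from Yang or from the record.

```python
def pack_buffers(buffers, level_capacities):
    """Greedy packing per the quoted heuristic.

    buffers: list of (name, size, accesses); level_capacities: list of sizes.
    Returns {buffer name: memory level index}.
    """
    placement = {}
    level = 0
    free = list(level_capacities)
    # Always add the unpacked buffer with the highest number of accesses first.
    for name, size, _accesses in sorted(buffers, key=lambda b: -b[2]):
        # If the current level cannot fit this buffer, this and all subsequent
        # buffers go to the next level (the level index never moves back down).
        while level < len(free) and free[level] < size:
            level += 1
        if level >= len(free):
            raise MemoryError(f"no memory level can hold buffer {name}")
        placement[name] = level
        free[level] -= size
    return placement

# Hypothetical example: kernel/input/output buffers and a 3-level hierarchy.
placement = pack_buffers(
    [("KB0", 2, 900), ("IB0", 4, 700), ("OB0", 8, 400)],
    level_capacities=[4, 16, 256],
)
print(placement)
```

Here KB0 (most accessed) lands in level 0, after which IB0 no longer fits there and spills, together with OB0, to level 1, mirroring the "place that and all subsequent buffers into the next level" language.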
With respect to Applicant's arguments that Yang splits the input image and not the weight matrix, Examiner respectfully disagrees. One of ordinary skill in the art would recognize that in convolutional neural networks the kernels are in the weight matrix and not the input image (also known as the input feature map), such that it would be impossible to split the input image by kernels. Yang, however, explicitly splits the weight matrix by kernels K ([p. 2 §2] "These K Fw×Fh×C stencil coefficients are the “weights” of the convolutional layer."). This is explicitly shown in Algorithm 1 of Yang, which splits the weight matrix into the respective dimensions K, C, Y, X, Fh, and Fw using a nested for loop, in which, for each iteration of an outer loop, at least one iteration of an inner loop is performed by definition.

With respect to Applicant's arguments that the claimed invention "simultaneously splits the weight parameters from two dimensions C and K and directly obtain the operation parameter array with crossing rows and columns through the synergetic splitting of two dimensions C and K", Examiner notes that this language is not present in the instant claims and appears to be a narrow interpretation of the claims.

With respect to Applicant's arguments that Yang does not teach splitting a weight parameter matrix in at least one of depth and number of kernels, Examiner respectfully disagrees. Yang explicitly teaches that K represents the weight parameter kernels being split, and a depth direction could reasonably be interpreted as C, Y, or X. Yang, moreover, explicitly teaches that the depth direction is C ([p. 2 §2] "Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension, which we define as C, or the number of channels.").
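The six-deep loop nest the Examiner attributes to Algorithm 1 of Yang (dimensions K, C, Y, X, Fh, Fw) can be sketched as the following minimal, unblocked reference implementation. This is not Yang's code; the shapes and the "valid" convolution convention are illustrative assumptions.

```python
import numpy as np

def conv_layer(inp, weights):
    """Unblocked 6-loop convolutional layer.

    inp: (C, Y, X) input feature map; weights: (K, C, Fh, Fw);
    returns (K, Y-Fh+1, X-Fw+1), 'valid' convolution, stride 1.
    """
    K, C, Fh, Fw = weights.shape
    _, Y, X = inp.shape
    out = np.zeros((K, Y - Fh + 1, X - Fw + 1))
    for k in range(K):                       # kernel dimension K
        for c in range(C):                   # channel/depth dimension C
            for y in range(Y - Fh + 1):      # image dimension Y
                for x in range(X - Fw + 1):  # image dimension X
                    for fh in range(Fh):     # filter height Fh
                        for fw in range(Fw): # filter width Fw
                            out[k, y, x] += (
                                inp[c, y + fh, x + fw] * weights[k, c, fh, fw]
                            )
    return out
```

Blocking in Yang's sense amounts to splitting each of these loops into inner/outer pairs and reordering them, which is why, for each iteration of an outer loop, at least one inner-loop iteration necessarily executes.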
Examiner further notes that the rejection is not in view of Yang alone, but in view of the combination of Yang and Henry, where Henry has been introduced to reinforce the obviousness. Examiner asserts that the interpretation is very reasonable in view of the prior art and that the rejection should be maintained.

With respect to Applicant's argument that Yang does not teach the performing and generating steps, Examiner respectfully disagrees. The cited portion of Yang explicitly teaches these steps ([p. 4] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C"). Examiner also asserts that the rejection is not in view of Yang alone but in view of the combination of Yang and Henry. Henry also explicitly teaches performing partial operations to generate an output, at least at paragraph 535. For at least these reasons, and those further detailed below, Examiner asserts that the interpretation of the prior art is reasonable and that it is appropriate to maintain the rejection under 35 U.S.C. 103.

With respect to Applicant's arguments that Yang just splits a nested loop into an inner loop and an outer loop of a conventional convolutional neural network, and that this differentiates it from the instant claims, Examiner respectfully notes that while this may summarize the algorithm producing the end result of Yang, FIG. 1 of Yang explicitly shows that the intention of Yang is to split weight matrices, and Yang even explicitly teaches that these weight matrices are split in kernel dimensions, channel dimensions, x and y dimensions, etc., to be able to perform large-scale operations on limited hardware ([Abstract] "Convolutional Neural Networks (CNNs) are the state of the art solution for many computer vision problems, and many researchers have explored optimized implementations. Most implementations heuristically block the computation to deal with the large data sizes and high data reuse of CNNs. This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests."). Yang explicitly teaches that the method presented is not a traditional CNN method ([Abstract] "Compared to traditional CNN CPU implementations based on highly-tuned, hand-optimized BLAS libraries, our x86 programs implementing the optimal blocking reduce the number of memory accesses by up to 90%.").

One of ordinary skill in the art would recognize that the values of K, C, Y, X, etc. in Algorithm 1 of Yang could be set arbitrarily, such that blocks without interdependence could be readily achieved. Examiner notes that this is a moot point, as the claims do not recite "splitting the weight parameter matrix into a plurality of submatrices without any interdependence" as argued on p. 21 of the Remarks submitted 6/10/2024. Similarly, the instant specification does not make any mention of splitting the weight matrices into submatrices without interdependence. For at least these reasons, Applicant's conclusion that Yang fails to disclose splitting of loops based on said arguments is flawed.

With respect to Applicant's arguments that Henry does not teach "the weight parameter matrix is split according to each partial input data such that the operational parameter array obtained by the splitting has a number of columns equal to the number of the received plurality of partial input data, and all the operational parameters in each column correspond to the same one or more channels as one of the plurality of partial input data", Examiner respectfully disagrees. Yang explicitly teaches that each input image is comprised of partial data ([p. 2 §2] "image dimensions x and y, and c for color channels"), such that all of the pixels in the x, y, and c dimensions form an input image of the set of input images ([p. 1 §1] "the input data to a layer often consists of tens to hundreds of images").
With respect to Applicant's arguments that the claimed invention is directed towards kernels and not layers, Examiner notes that claim 1 of the instant application explicitly recites "splitting a weight parameter matrix of a selected layer in the convolutional neural network". Examiner further notes that the combination of Yang and Henry explicitly teaches splitting along a kernel dimension, and one of ordinary skill in the art would readily recognize that CNN layers are comprised of kernels.

With respect to Applicant's arguments on p. 19 of the Remarks submitted 5/28/2025 that Henry does not disclose "that the weight parameter matrix is split collaboratively in both the channel C dimension and the kernel-number K dimension during splitting", Examiner notes that the limitation mapped to Henry in the Non-Final Office Action mailed 2/28/2025 does not mention anything about splitting collaboratively in both the channel C dimension and the kernel-number K dimension. Applicant appears to be taking Henry out of the context of the combination of Yang and Henry as intended in the Non-Final Office Action mailed 2/28/2025, and applying a narrow interpretation of the instant specification to Henry alone (outside of the combination).

Applicant's arguments directed towards Henry do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. Applicant's arguments directed towards Stanford do not comply with 37 CFR 1.111(c) for the same reasons.
The remaining arguments are moot in view of a new ground of rejection set forth below.

Claim Objections

Claims 1, 8, 13, 15, 20, and 21 are objected to because of the following informalities:

Regarding claims 1, 13, and 15, "in response to that the number of the kernels of the weight parameter exceeds a second threshold" should read "in response to the number of the kernels of the weight parameter exceeding a second threshold".

Regarding claim 8, "in response to that the operational parameter array" should read "in response to the operational parameter array".

Regarding claim 20, "In response to that the number of the kernels of the weight parameter is greater than or equal to a first predetermined number" should read "In response to the number of kernels of the weight parameter being greater than or equal to a first predetermined number".

Regarding claim 21, "in response to that the selected layer receives a plurality of partial input data" should read "in response to the selected layer receiving a plurality of partial input data".

Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1, 5-6, 8-13, and 15-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding claims 1, 13, and 15, "the plurality of operational parameters obtained by the splitting have different depths and/or different numbers of kernels" is indefinite. It is unclear to one of ordinary skill in the art what the depths and/or numbers of kernels of the plurality of operational parameters are different relative to. It is unclear whether the claim limitation should be interpreted as "each operational parameter of the plurality of operational parameters obtained by the splitting has a different depth and/or different number of kernels", or as "the plurality of operational parameters obtained by the splitting have a different depth than the depth of the convolutional neural network and/or a number of kernels of the convolutional neural network", or something else altogether. As these interpretations are contradictory, the scope of the claim cannot be reasonably determined and the claim is indefinite. In the interest of further examination, the claim is interpreted as "the plurality of operational parameters obtained by the splitting have a different depth than the depth of the convolutional neural network and/or a number of kernels of the convolutional neural network".

The remaining claims are rejected with respect to their dependence on the rejected claims.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention, in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 5, 6, 8-11, 13, 15, 19, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Yang ("A Systematic Approach to Blocking Convolutional Neural Networks", 2016) and Liang ("FP-BNN: Binarized neural network on FPGA", 2017).
[Image: FIG. 1 of Yang]

Regarding claim 1, Yang teaches:

A computer implemented method for performing operations in a convolutional neural network, comprising: ([Abstract] "This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests. Using this model we automatically derive optimized blockings for common networks that improve the energy efficiency of custom hardware implementations by up to an order of magnitude")

selecting a layer in the convolutional neural network based on at least one of a capacity of a high-speed memory, a capacity of the high-speed memory reserved for a weight parameter of the layer, a capacity of the high-speed memory currently available for the weight parameter of the layer, an arrangement of multipliers and adders, a requirement on operation parallelism, a design of the convolution neural network, or current performance of a processor and/or an operating system; ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy, for each string we continue to pack the lower level buffers into the lowest available level of memory hierarchy, always adding the unpacked buffer with the highest number of accesses. When the current memory level does not have enough remaining space to fit the added buffer, we place that and all subsequent buffers into the next level of the memory hierarchy until it becomes full" See also the layer in FIG. 1, interpreted as a selected CNN layer)

splitting the weight parameter of the selected layer in the convolutional neural network in a dimension of a depth and a dimension of a number of kernels, to obtain an operational parameter array including a plurality of operational parameters, wherein the weight parameter of the selected layer is an array of one or more parameters represented by both the dimension of the depth and the dimension of the number of the kernels; the operational parameters in the operational parameter array consist of the one or more parameters in the array of the weight parameter; ([p. 2 §2] "A convolutional layer (Conv) corresponds to a filter bank. In the standard case of 3D input and output, a convolutional layer maps a C×X×Y input to a K×X×Y output using K shift-invariant 3D stencils, where each stencil is of the size Fw×Fh×C (i.e., a set of K 3-dimensional convolutions). These K Fw×Fh×C stencil coefficients are the “weights” of the convolutional layer. Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension, which we define as C, or the number of channels. Typically the dimensions of the kernels are much smaller than the image dimensions." [p. 4 §3.1] "The computation being performed by a convolutional layer can be easily expressed as a 6 layer loop nest as shown in Algorithm 1...blocking can be thought of as simply splitting a number of loops, and then exchanging the order in which these split loops are executed" Yang explicitly states that the 3x3 convolution is blocked (split) into 1x1 fragments of the 3x3 convolution (see also FIG. 1), where each fragment is three-dimensional, having a depth dimension and a kernel (K) dimension.)
respective operational parameters in each row of the operational parameter array are from a same subset of a set of the kernels of the weight parameter and have different channels respectively, ("Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension," A row of a filter fragment in Yang is interpreted as the elements along the channel dimension (C0, C1, C2, etc.), each having a different channel respective to a particular kernel (K).)

respective operational parameters in each column of the operational parameter array are from different subsets of the set of the kernels of the weight parameter respectively and have the same channel(s); ([p. 3 §2.2] "1 for k0 = 0 : K do 2 for c0 = 0 : C do" The first element of every kernel corresponds to the same channel C=C0 but to different kernel subsets (for each kernel from 0 to K), which is evident in FIG. 1 and can be verified with line 2 of Algorithm 1.)

and the plurality of operational parameters obtained by the splitting have different depths and/or different numbers of kernels; ("Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension," See also FIG. 1, where each blocking level has different depths and/or different numbers of kernels)

storing the operational parameter array in the high-speed memory; ([p. 4 §3.2] "While in the final design the input, kernel, and output data at each level of the memory hierarchy may be stored together, for this analysis it is convenient to think of them as separate memory structures. Thus we will consider a memory for kernel coefficients KB (kernel buffer), input image data IB, and output data OB. Since these memories exist at multiple levels in the memory hierarchy, we use KB0, IB0, OB0, to indicate the kernel, input, and output memory that is closest to the compute unit, and each buffer at level i, (e.g. IBi) fetches its data from the buffer at level i+1 (IBi+1)" [p. 5 §3.2] "partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy. For maximum reuse of kernel partial outputs, the output buffer contains all elements that are computed by the inner loops")

obtaining each operational parameter in the operational parameter array stored in the high-speed memory; ([p. 4 §3.2] "fetches its data from the buffer at level i+1 (IBi+1)" [p. 5 §3.2] "partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy. For maximum reuse of kernel partial outputs, the output buffer contains all elements that are computed by the inner loops" [p. 6] "We compute the number of accesses to each memory by introducing refetch rate RRi, the number of times a piece of data is fetched from a certain buffer after initially being loaded into that buffer (Table 2)")

performing, by hardware in parallel using the each operational parameter in the operational parameter array, operations of the selected layer on data of input data of the selected layer that are in a channel corresponding to a channel of the each operational parameter, to obtain a partial operation result array including a plurality of partial operation results; ([p. 4] "Figure 1: Hierarchical blocking of a single convolutional layer. The six-dimensional overall problem domain (X,Y,C,Fw,Fh,K) depicted in Figure 1 is blocked to three levels in the input domain ({X,Y,C}{0,1,2}), and two levels in the set of kernels (K) which correspond to the third dimension of the output domain ({X,Y}{0,1,2},{K}{0,1}). Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 11 §5.3] "As discussed in Section 3.3, there are two different schemes for parallelizing a problem on a multiple-core chip. Figure 9 demonstrates how the two parallelization schemes, with different schedules and memory hierarchies, affect the energy efficiency of a system with up to eight cores. We chose the top four blocking schedules from the single core problem discussed in the previous section and we evaluated the two parallelization methods for each of the four schedules as applied to layer Conv1.")

and generating one or more output data of the selected layer based on the partial operational result array, ([p. 4 §3] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C").

While Yang explicitly anticipates splitting the weight filter based on memory capacity thresholds ([p. 6] "Memory access energy per 16 bits (pJ/16b) for various memory sizes and word lengths. For memory size in the range of 0.25KB to 16MB, we use SRAM. When the memory size exceeds 16MB, we use DRAM"), Yang does not explicitly teach wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter exceeds a second threshold, splitting the weight parameter, such that a number of kernels of the each operational parameter in the operational parameter array obtained by the splitting is less than or equal to the second threshold, wherein the second threshold is set according to the capacity of the high-speed memory, a size of each of the kernels, and parameters relating to the hardware supporting the operations of the convolutional neural network.
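For illustration, one reading of the claimed two-dimensional split (kernel-number K and depth C) can be sketched as blocking a (K, C, Fh, Fw) weight tensor into a 2-D array of sub-weights, where each row draws from one kernel subset and each column from one channel subset. The block sizes and the helper name `split_weight` are hypothetical, not taken from the claims or the cited art.

```python
import numpy as np

def split_weight(weights, k_block, c_block):
    """Block a (K, C, Fh, Fw) weight tensor in the K and C dimensions.

    Returns a nested list: rows index kernel subsets, columns index
    channel subsets, matching the row/column semantics mapped above.
    """
    K, C, Fh, Fw = weights.shape
    return [[weights[k:k + k_block, c:c + c_block]
             for c in range(0, C, c_block)]    # columns: channel subsets
            for k in range(0, K, k_block)]     # rows: kernel subsets

# Hypothetical sizes: 8 kernels of depth 6, split into 4-kernel x 2-channel blocks.
w = np.arange(8 * 6 * 3 * 3, dtype=float).reshape(8, 6, 3, 3)
array = split_weight(w, k_block=4, c_block=2)
print(len(array), len(array[0]))   # rows x columns of the operational array
print(array[0][0].shape)           # shape of one operational parameter block
```

Every block in row i comes from the same kernel subset with different channels, and every block in column j covers the same channels across different kernel subsets, which is the array structure the mapping attributes to the blocked weights.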
Liang, in the same field of endeavor, teaches wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter exceeds a second threshold, splitting the weight parameter, such that a number of kernels of the each operational parameter in the operational parameter array obtained by the splitting is less than or equal to the second threshold, ([p. 4] "For operations in BN layers, the number of operations (NOP) has a linear relationship with the number of output channels Nout" [p. 9 §5] "The number of tiled output channels equals to the number of PE channels NPE […] It will take [ Nin / TNin ] iterations to get the intermediate result accumulated for one output. This will be repeated by [ Nout / NPE ] times for all outputs get done" In a CNN, each kernel is tied to a respective output channel, such that the number of output channels is necessarily, by definition, equal to the number of kernels. Liang explicitly teaches that only NPE output channels can be processed per iteration, such that if Nout (the total number of kernels of the each operational parameter in the operational parameter array obtained by the splitting) is greater than NPE (the second threshold), then Nout needs to be split by NPE.)

wherein the second threshold is set according to the capacity of the high-speed memory, a size of each of the kernels, and parameters relating to the hardware supporting the operations of the convolutional neural network. ([p. 10 §6.2] "we keep the parallelism of memories identical to the number of PE channels NPE, and the width of each memory equals to PEsize").

Yang and Liang are both directed towards partitioning convolutional neural network filter weight matrices; therefore, Yang and Liang are analogous art in the same field of endeavor.
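The output-channel tiling the Examiner reads into Liang (split Nout whenever it exceeds the PE-channel count NPE, so that each tile holds at most NPE kernels) reduces to a ceiling division. The function name and the numeric values below are illustrative assumptions.

```python
import math

def tile_output_channels(n_out, n_pe):
    """Split n_out output channels (kernels) into tiles of at most n_pe.

    Returns the per-tile channel counts; there are ceil(n_out / n_pe)
    tiles, matching Liang's [ Nout / NPE ] repetition count as mapped.
    """
    n_tiles = math.ceil(n_out / n_pe)
    return [min(n_pe, n_out - t * n_pe) for t in range(n_tiles)]

print(tile_output_channels(n_out=70, n_pe=32))  # → [32, 32, 6]
```

Each tile's size stays at or below the threshold n_pe, and the tile count is exactly the number of iterations Liang's scheme would repeat to cover all outputs.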
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang with the teachings of Liang by partitioning the filter weight matrix by a kernel threshold. Liang provides additional motivation for the combination ([p. 2] "To improve resource usage […] we introduce FP-BNN, a BNN acceleration system design on FPGA, with related optimizations").

Regarding claim 5, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter comprises: splitting the weight parameter in a case where the weight parameter has a number of channels exceeding a third threshold, such that each operational parameter in the operational parameter array obtained by the splitting has a number of channels less than or equal to the third threshold. (Yang [p. 5 §3.2] "When a new C loop Ci is added, a series of images and kernels are streamed and Ci channels reductions are being performed on the same set of outputs. Therefore those partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy" [p. 5 §3.3] "Suppose we apply parallelism for S cores at a given level p by unrolling that loop p across the processors. The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" The third threshold is interpreted as Cp−1, such that if Cp is equal to Cp−1 no splitting occurs. Yang explicitly teaches that the partitioning may occur as a function of channels (Cp = Cp−1).)
Regarding claim 6, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter comprises: splitting the weight parameter in a case where the weight parameter has a number of channels greater than or equal to a second predetermined number, such that the operational parameter array obtained by the splitting has a number of columns equal to a multiple of the second predetermined number. (Yang [p. 5 §3.3] "The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" S is interpreted as the multiple of the predetermined number (Cp−1).). Regarding claim 8, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter further comprises: in response to that the operational parameter array obtained by the splitting includes an operational parameter having a size exceeding a first threshold, subdividing at least a row and/or column including the operational parameter having the size exceeding the first threshold in at least one of a dimension of depth and number of kernels, such that each operational parameter in an operational parameter array obtained by the subdividing has a size less than or equal to the first threshold. (Yang [p. 5 §3.3] "The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" S is interpreted as the multiple of the predetermined number (Cp−1). See also Figure 1. Cp is interpreted as the operational parameter exceeding the threshold, and Cp−1 as an operational parameter having a size less than Cp.).
Regarding claim 9, the combination of Yang and Liang teaches The method of claim 1 wherein each partial operation result in the partial operation result array corresponds to one output data of the selected layer. (Yang [p. 4] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" Output pixel is interpreted as synonymous with one output data of the selected layer.). Regarding claim 10, the combination of Yang and Liang teaches The method of claim 1 wherein generating the output data comprises: compressing the partial operation result array into one column by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner when the partial operation result array includes a plurality of columns, each partial operation result in the compressed partial operation result array corresponding to an output data of the selected layer. (Yang [p. 2] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 5 §3.2] "The inner loop takes a small amount of input data with block size X0Y0C0 and convolves it with K0 kernels to create some partial outputs with block size X0Y0K0. A complete output cannot be generated until all the channels of the input are processed for that kernel and the output pixel is generated, which will happen only when all of the channels (C2 loop) finish" For the iteration where X0 = 1, the partial output is compressed into a single column. Yang explicitly teaches that the partial results are accumulated (summed) from the channels (rows).).
Regarding claim 11, the combination of Yang and Liang teaches The method of claim 1 wherein generating the output data comprises: compressing the partial operation result array into one row by combining all the partial operation results in each column of the partial operation result array in the depth direction when the partial operation result array includes a plurality of rows, each partial operation result in the compressed partial operation result array corresponding to an output data of the selected layer. (Yang [p. 2] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 5 §3.2] "The inner loop takes a small amount of input data with block size X0Y0C0 and convolves it with K0 kernels to create some partial outputs with block size X0Y0K0. A complete output cannot be generated until all the channels of the input are processed for that kernel and the output pixel is generated, which will happen only when all of the channels (C2 loop) finish" For the iteration where X0 = 1, the partial output is compressed into a single row. See Figure 1 for how the partial operation results correspond to the output.). Regarding claims 13 and 15, claims 13 and 15 are directed towards an apparatus for performing the method of claim 1. Therefore, the rejection applied to claim 1 also applies to claims 13 and 15. Claims 13 and 15 also recite additional elements, including a processor to perform the method ([p. 1 §1] "Early attempts [20, 1, 24, 2] to optimize CPU and GPU CNN implementations treated the convolutional layers as matrix multiplication and used an optimized BLAS matrix matrix-multiplication (GEMM) routine") as well as memory to store the instructions performed by the processor ([p. 1 §1] "the design of the memory hierarchy and how the data is choreographed has a dramatic effect on the energy required for the computation.").
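The row-wise and column-wise compressions recited in claims 10 and 11 above are reductions over a two-dimensional array of partial results. A minimal NumPy sketch, with axis choices assumed from the claim wording and the claim 11 "combining in the depth direction" modeled as a sum for concreteness (the claim may instead denote depth-wise concatenation):

```python
import numpy as np

# Hypothetical 3x4 array of partial operation results; the row/column
# semantics (e.g. rows as kernel groups, columns as channel groups) are
# assumed for illustration only.
partials = np.arange(12.0).reshape(3, 4)

# Claim 10 reading: add up each row point-to-point -> one column.
one_column = partials.sum(axis=1, keepdims=True)  # shape (3, 1)

# Claim 11 reading: combine each column -> one row (modeled as a sum).
one_row = partials.sum(axis=0, keepdims=True)     # shape (1, 4)
```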
Regarding claim 19, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter matrix comprises: splitting the weight parameter matrix in a case where a size of the weight parameter matrix exceeds a first threshold, such that the each operational parameter in the operational parameter array obtained by the splitting has a size less than or equal to the first threshold. (Yang [p. 5 §3.2] "When a new C loop Ci is added, a series of images and kernels are streamed and Ci channels reductions are being performed on the same set of outputs. Therefore those partial outputs are being reduced Ci/Ci-1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy" [p. 5 §3.2] "Suppose we apply parallelism for S cores at a given level p by unrolling that loop p across the processors. The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" The threshold is interpreted as Cp−1, such that if Cp is equal to Cp−1, no splitting occurs. Yang explicitly teaches that the partitioning may occur as a function of operational parameters (Cp = Cp−1).). Regarding claim 20, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter is greater than or equal to a first predetermined number, splitting the weight parameter, such that the operational parameter array obtained by the splitting has a number of rows equal to a multiple of the first predetermined number. (Liang [p. 9 §5] "Considering the datapath shared among different layers, NPE should be a common divisor of Nout of different layers").
Regarding claim 21, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter comprises: in response to that the selected layer receives a plurality of partial input data, any two of which do not have the same channel, and the plurality of partial input data collectively correspond to the complete input data of the selected layer, splitting the weight parameter according to each of the partial input data, such that the operational parameter array obtained by the splitting has a number of columns equal to a number of the received plurality of partial input data, and the respective operational parameters in the each column correspond to the same one or more channels as one of the plurality of partial input data. (Liang [p. 9 §5] "For FC layers, one tiled input will be fed into all PE channels. For better resource utilization we set TNin as close as possible to PEsize. It will take [ Nin / TNin ] iterations to get the intermediate result accumulated for one output. This will be repeated by [ Nout / NPE ] times for all outputs get done [...] Considering the datapath shared among different layers, NPE should be a common divisor of Nout of different layers, TNin for each layer should preferably be best a sub-multiple of Nin, and PEsize should be a big value and also close to Lin to best explore the resource utilization" TNin is interpreted as partial input data, any two of which do not have the same channel.). Claim 12 is rejected under 35 U.S.C. §103 as being unpatentable over the combination of Yang and Liang and in further view of Stanford (“CS231n Convolutional Neural Networks for Visual Recognition”, 2015). Regarding claim 12, the combination of Yang and Liang teaches The method of claim 1.
However, the combination of Yang and Liang doesn't explicitly teach wherein generating the output data comprises: generating an output data of the selected layer by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner and then combining, in the depth direction, all the partial operation results in each column of the partial operation result array compressed by the adding up, or by combining all the partial operation results in each column of the partial operation result array in the depth direction and then adding up all the partial operation results in each row of the partial operation result array compressed by the combining in a point-to-point manner, when the partial operation result array includes a plurality of rows and a plurality of columns. Stanford, in the same field of endeavor, teaches generating the output data comprises: generating an output data of the selected layer by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner and then combining, in the depth direction, all the partial operation results in each column of the partial operation result array compressed by the adding up, or by combining all the partial operation results in each column of the partial operation result array in the depth direction and then adding up all the partial operation results in each row of the partial operation result array compressed by the combining in a point-to-point manner, when the partial operation result array includes a plurality of rows and a plurality of columns. ([p. 11] "The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias." See FIG. on p. 12.).
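The im2col lowering that Stanford describes (convolution as matrix multiplication) can be sketched generically as follows; this is a minimal single-channel, stride-1, no-padding illustration, not code from any cited reference:

```python
import numpy as np

def im2col(x, k):
    """Unroll each k x k receptive field of the 2-D input x into a column,
    so the whole convolution becomes one matrix multiplication."""
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)   # toy 4x4 input
w = np.ones((3, 3))                 # toy 3x3 filter
# One GEMM call produces the full 2x2 output map.
gemm_out = (w.ravel() @ im2col(x, 3)).reshape(2, 2)
```

The GEMM result equals the direct sliding-window convolution, which is the BLAS-reuse benefit the citation relies on.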
The combination of Yang and Liang as well as Stanford are directed towards accelerating convolutional neural networks. Therefore, the combination of Yang and Liang as well as Stanford are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Yang and Liang with the teachings of Stanford. While the content on the Stanford convnet course website would be considered well-understood by one of ordinary skill in the art, a motivation for combination with regard to matrix factorization of convolutional neural networks has been provided ([p. 13 "Implementation as Matrix Multiplication"] "the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API). Moreover, the same im2col idea can be reused to perform the pooling operation"). Claims 16, 17, and 18 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Yang, Liang, and Stanford, and further in view of Henry (US20180189633A1). Regarding claim 16, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then adding results obtained by the operations in a point-to-point manner. Henry, in the same field of endeavor, teaches providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then adding results obtained by the operations in a point-to-point manner
([¶0110] "Each data word functions as the output value (also sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as a weight associated with a connection coming into a neuron of the instant layer of the network" [¶0599] "each iteration of loop 2 implicates a horizontal 2D input slice (i.e., all C channels of a given row of the H rows of the input 5802) and a horizontal 2D filter slice (i.e., all C channels of a given row of the R rows of the filter 5804). The column-channel-sum is the result of, for each channel of all the C channels, convolving the channel's portion of the implicated horizontal 2D input slice and the channel's portion of the implicated horizontal 2D filter slice to generate a column-sum, and continually accumulating all of the C channel's column-sums to produce the column-channel-sum" Accumulating row by row in a column-channel-sum fashion is interpreted as synonymous with adding results obtained by the operations in a point-to-point manner.). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other.
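Henry's column-channel-sum (¶0599), as characterized above, accumulates per-channel column-sums across all C channels. A minimal NumPy sketch under assumed shapes (one horizontal input slice and one matching horizontal filter slice; all names are illustrative, not Henry's):

```python
import numpy as np

rng = np.random.default_rng(0)
C, W, R = 3, 6, 3                           # channels, input width, filter width
input_slice = rng.standard_normal((C, W))   # all C channels of one input row
filter_slice = rng.standard_normal((C, R))  # all C channels of one filter row

# For each channel, slide the filter row across the input row (column-sums),
# and continually accumulate across channels into the column-channel-sum.
col_channel_sum = np.zeros(W - R + 1)
for c in range(C):
    for j in range(W - R + 1):
        col_channel_sum[j] += input_slice[c, j:j + R] @ filter_slice[c]
```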
Henry provides as additional motivation for combination ([¶0524] "Partitioning the NPU 126 array, a data RAM 122 row, weight RAM 124 row, and data RAM 122 row into their respective G NPU blocks 5906, input blocks 5902, filter blocks 5904, and output blocks 5908 each of size B facilitates the NNU 121 convolving the input 5802 with the filters 5804 to generate the output 5808 in an efficient manner. In particular, the partitioning, in conjunction with the layout of the input data and filter weights within the data RAM 122 and weight RAM 124, facilitates a unique nested loop structure that advantageously uses the rotater mux-reg 208 structure of the NNU 121 to rotate the input blocks 5902 associated with all the C channels of the input 5802 so that each of F of the G NPU blocks 5906 associated with the F filters 5804 “see” (i.e., to receive) all C channels of the input 5802 for convolving with its corresponding filter 5804. More specifically, the NNU 121 reads the input blocks 5902 of a row of the data RAM 122 into the mux-regs 208 and then, using the rotater formed by the mux-regs 208, rotates the input blocks 5902 through at least C adjacent NPU blocks 5906. This enables each NPU 126 to perform multiply-accumulate operations of all the channels of a row of its corresponding filter 5804 with all the channels of a row of the input 5802 (e.g., to perform a column-channel-sum, as described below with respect to FIG. 60) before another row of the input 5802 is read into the mux-regs 208"). Regarding claim 17, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach, further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then directly providing the partial output data to a yet next layer. 
Henry, in the same field of endeavor, teaches The method of claim 12, further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then directly providing the partial output data to a yet next layer. ([¶0110] "Each data word functions as the output value (also sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as a weight associated with a connection coming into a neuron of the instant layer of the network" [¶0149] "the architectural program writes the weight words for the next layer to the weight RAM 124 while the NNU 121 is performing the hidden layer computations for the current layer so that the NNU 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete" Current and instant layer in Henry are interpreted as synonymous such that a previous, instant, and next layer are three layers among which intermediate results are explicitly relayed using the weight parameters.). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other.
Henry provides the same additional motivation for combination as reproduced above with respect to claim 16 (¶0524). Regarding claim 18, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach further comprising: providing the partial operation results to a next layer; and adding the partial input data received in the next layer first in a point-to-point manner to obtain a complete input data, and then performing conventional operations on the complete input data.
Henry, in the same field of endeavor, teaches The method of claim 12, further comprising: providing the partial operation results to a next layer; and adding the partial input data received in the next layer first in a point-to-point manner to obtain a complete input data, and then performing conventional operations on the complete input data ([¶0149] "the architectural program writes the weight words for the next layer to the weight RAM 124 while the NNU 121 is performing the hidden layer computations for the current layer so that the NNU 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete"). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other. Henry provides the same additional motivation for combination as reproduced above with respect to claim 16 (¶0524). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhang (“Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks”, 2015) partitions along the output-feature/kernel dimension using a tile T_M with 0 < T_M < M and M/T_M passes. Umuroglu (“FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”, 2017) formalizes folding factors across the output-channel (kernel) dimension. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124 /MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124

Prosecution Timeline

Oct 25, 2018
Application Filed
Aug 15, 2022
Non-Final Rejection — §103, §112
Nov 29, 2022
Response Filed
Dec 08, 2022
Final Rejection — §103, §112
Feb 14, 2023
Response after Non-Final Action
Mar 15, 2023
Request for Continued Examination
Mar 20, 2023
Response after Non-Final Action
Apr 29, 2023
Non-Final Rejection — §103, §112
Jul 03, 2023
Response Filed
Jul 31, 2023
Final Rejection — §103, §112
Oct 05, 2023
Response after Non-Final Action
Oct 16, 2023
Response after Non-Final Action
Nov 09, 2023
Response after Non-Final Action
Dec 11, 2023
Request for Continued Examination
Dec 18, 2023
Response after Non-Final Action
Feb 26, 2024
Non-Final Rejection — §103, §112
Jun 10, 2024
Response Filed
Jul 20, 2024
Final Rejection — §103, §112
Sep 30, 2024
Response after Non-Final Action
Oct 30, 2024
Request for Continued Examination
Nov 04, 2024
Response after Non-Final Action
Feb 25, 2025
Non-Final Rejection — §103, §112
May 28, 2025
Response Filed
Jul 14, 2025
Final Rejection — §103, §112
Sep 16, 2025
Response after Non-Final Action
Oct 13, 2025
Request for Continued Examination
Oct 16, 2025
Response after Non-Final Action
Dec 30, 2025
Non-Final Rejection — §103, §112
Apr 03, 2026
Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561604
SYSTEM AND METHOD FOR ITERATIVE DATA CLUSTERING USING MACHINE LEARNING
2y 5m to grant Granted Feb 24, 2026
Patent 12547878
Highly Efficient Convolutional Neural Networks
2y 5m to grant Granted Feb 10, 2026
Patent 12536426
Smooth Continuous Piecewise Constructed Activation Functions
2y 5m to grant Granted Jan 27, 2026
Patent 12518143
FEEDFORWARD GENERATIVE NEURAL NETWORKS
2y 5m to grant Granted Jan 06, 2026
Patent 12505340
STASH BALANCING IN MODEL PARALLELISM
2y 5m to grant Granted Dec 23, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

9-10
Expected OA Rounds
52%
Grant Probability
90%
With Interview (+38.2%)
4y 7m
Median Time to Grant
High
PTA Risk
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
