Prosecution Insights
Last updated: April 19, 2026
Application No. 16/170,360

METHOD AND APPARATUS FOR PERFORMING OPERATIONS IN CONVOLUTIONAL NEURAL NETWORK

Non-Final OA (§103, §112)

Filed: Oct 25, 2018
Examiner: BOSTWICK, SIDNEY VINCENT
Art Unit: 2124
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nanjing Horizon Robotics Technology Co. Ltd.
OA Round: 9 (Non-Final)

Grant Probability: 52% (Moderate)
OA Rounds: 9-10
To Grant: 4y 7m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 52% of resolved cases (71 granted / 136 resolved; -2.8% vs TC avg)
Interview Lift: +38.2% on resolved cases with interview
Avg Prosecution: 4y 7m typical timeline; 68 currently pending
Total Applications: 204 across all art units

Statute-Specific Performance

§101: 24.4% (-15.6% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)

Tech Center averages are estimates; based on career data from 136 resolved cases.

Office Action

§103 §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 9/16/2025 has been entered.

Remarks

This Office Action is responsive to Applicant's Amendment filed on September 16, 2025, in which claims 1, 5-6, 8, 13, 15, and 19 are currently amended. Claims 20-21 are newly added. Claims 1, 5, 6, 8-13, and 15-21 are currently pending.

Response to Arguments

Applicant's arguments with respect to the rejection of claims 1, 5, 6, 8-13, and 15-21 under 35 U.S.C. 101 have been considered and are persuasive. The rejection under 35 U.S.C. 101 has been withdrawn as necessitated by Applicant's arguments and the amendments made to the claims.

Applicant's arguments with respect to the rejection of claims 1, 5, 6, 8-13, and 15-21 under 35 U.S.C. 103 have been considered but are not persuasive.

With respect to Applicant's arguments on pp. 20-21 of the Remarks submitted 9/16/2025 that "the 'processor cores' that Yang splits based on the current performance of the processor are not equivalent to a 'convolutional layer'", Examiner respectfully disagrees. Applicant has misconstrued Yang. Yang is not merely directed towards splitting processor cores for parallelization; Yang accomplishes this by blocking/unrolling the filter weight matrices of individual layers, as explicitly stated in Yang ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy, for each string we continue to pack the lower level buffers into the lowest available level of memory hierarchy, always adding the unpacked buffer with the highest number of accesses. When the current memory level does not have enough remaining space to fit the added buffer, we place that and all subsequent buffers into the next level of the memory hierarchy until it becomes full"). Examiner further notes that, according to amended claim 1 for example, the layer selection only needs to be based on one element selected from a large list of elements that includes the broadly recited metrics "a requirement on operation parallelism" and "a design of the convolution neural network", under which the layer selection in Yang categorically falls.

With respect to Applicant's arguments on pp. 21-22 of the Remarks submitted 9/16/2025 that Yang does not disclose "selecting a layer in the convolutional neural network", Examiner respectfully disagrees. Applicant appears to have interpreted "selecting a layer" very narrowly, as Yang explicitly discloses that the partitioning is performed for individual (selected) layers ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy"; note the singular "a CNN layer"). Examiner also reiterates that, according to amended claim 1 for example, the layer selection only needs to be based on one element selected from a large list of elements, including the broadly recited metrics "a requirement on operation parallelism" and "a design of the convolution neural network", under which the layer selection in Yang categorically falls ([p. 7 §3.5], quoted above).

With respect to Applicant's arguments on p. 23 of the Remarks submitted 9/16/2025 that Yang "only performs multiple levels of nested splitting on the channel C dimension and the convolution kernel K dimension, respectively, without constructing a new parameter consisting of both the channel dimension C and the convolution kernel number dimension K", Examiner notes that the instant claims do not recite "constructing a new parameter consisting of both the channel dimension C and the convolution kernel number dimension K"; rather, the instant claims split an existing weight parameter matrix, which is synonymous with Yang, who, by Applicant's admission on p. 23 of the Remarks submitted 9/16/2025, splits/unrolls the filter matrix into smaller partitions, as also illustrated in FIG. 1 of Yang. Examiner further notes that Applicant's arguments on pp. 23-26 are directed towards the instant specification, and towards Applicant's narrow interpretation of the instant claims in view of the instant specification, rather than towards the scope of the claim language itself.

With respect to Applicant's arguments on p. 27 of the Remarks submitted 9/16/2025 that "in Yang, no array representation is involved", Examiner respectfully disagrees. FIG. 1 of Yang explicitly shows multidimensional arrays, and Yang explicitly teaches that the partitioning is operable to fit fragments of the weight filter in memory ([p. 6 §3.4] "How do we find the right size for this memory block? We leverage the fact that large SRAMs are built from smaller memory arrays, so the energy increase as the memory gets larger is mostly from the energy to communicate the data to the output port from the array where it was stored").
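The packing heuristic the Examiner quotes from Yang ([p. 7 §3.5]) can be sketched as a simple greedy procedure: place the buffer with the highest access count into the lowest memory level with remaining space, and once a level overflows, spill all subsequent buffers to the next level. The sketch below is illustrative only; the buffer names, sizes, access counts, and level capacities are hypothetical and are not taken from Yang or from the record.

```python
def pack_buffers(buffers, level_capacities):
    """Greedy packing per the quoted heuristic.

    buffers: list of (name, size, accesses); level_capacities: list of sizes.
    Returns {buffer name: memory level index}.
    """
    placement = {}
    level = 0
    free = list(level_capacities)
    # Always add the unpacked buffer with the highest number of accesses first.
    for name, size, _accesses in sorted(buffers, key=lambda b: -b[2]):
        # If the current level cannot fit this buffer, this and all subsequent
        # buffers go to the next level (the level index never moves back down).
        while level < len(free) and free[level] < size:
            level += 1
        if level >= len(free):
            raise MemoryError(f"no memory level can hold buffer {name}")
        placement[name] = level
        free[level] -= size
    return placement

# Hypothetical example: kernel/input/output buffers and a 3-level hierarchy.
placement = pack_buffers(
    [("KB0", 2, 900), ("IB0", 4, 700), ("OB0", 8, 400)],
    level_capacities=[4, 16, 256],
)
print(placement)
```

Here KB0 (most accessed) lands in level 0, after which IB0 no longer fits there and spills, together with OB0, to level 1, mirroring the "place that and all subsequent buffers into the next level" language.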
With respect to Applicant's arguments that Yang splits the input image and not the weight matrix, Examiner respectfully disagrees. One of ordinary skill in the art would recognize that in convolutional neural networks the kernels are in the weight matrix and not the input image (also known as the input feature map), such that it would be impossible to split the input image by kernels. Yang, however, explicitly splits the weight matrix by kernels K ([p. 2 §2] "These K Fw×Fh×C stencil coefficients are the “weights” of the convolutional layer."). This is explicitly shown in Algorithm 1 of Yang, which splits the weight matrix into the respective dimensions K, C, Y, X, Fh, and Fw using a nested for loop, in which, for each iteration of an outer loop, at least one iteration of an inner loop is performed by definition.

With respect to Applicant's arguments that the claimed invention "simultaneously splits the weight parameters from two dimensions C and K and directly obtain the operation parameter array with crossing rows and columns through the synergetic splitting of two dimensions C and K", Examiner notes that this language is not present in the instant claims and appears to be a narrow interpretation of the claims.

With respect to Applicant's arguments that Yang does not teach splitting a weight parameter matrix in at least one of depth and number of kernels, Examiner respectfully disagrees. Yang explicitly teaches that K represents the weight parameter kernels being split, and a depth direction could reasonably be interpreted as C, Y, or X. Yang, moreover, explicitly teaches that the depth direction is C ([p. 2 §2] "Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension, which we define as C, or the number of channels.").
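The six-deep loop nest the Examiner attributes to Algorithm 1 of Yang (dimensions K, C, Y, X, Fh, Fw) can be sketched as the following minimal, unblocked reference implementation. This is not Yang's code; the shapes and the "valid" convolution convention are illustrative assumptions.

```python
import numpy as np

def conv_layer(inp, weights):
    """Unblocked 6-loop convolutional layer.

    inp: (C, Y, X) input feature map; weights: (K, C, Fh, Fw);
    returns (K, Y-Fh+1, X-Fw+1), 'valid' convolution, stride 1.
    """
    K, C, Fh, Fw = weights.shape
    _, Y, X = inp.shape
    out = np.zeros((K, Y - Fh + 1, X - Fw + 1))
    for k in range(K):                       # kernel dimension K
        for c in range(C):                   # channel/depth dimension C
            for y in range(Y - Fh + 1):      # image dimension Y
                for x in range(X - Fw + 1):  # image dimension X
                    for fh in range(Fh):     # filter height Fh
                        for fw in range(Fw): # filter width Fw
                            out[k, y, x] += (
                                inp[c, y + fh, x + fw] * weights[k, c, fh, fw]
                            )
    return out
```

Blocking in Yang's sense amounts to splitting each of these loops into inner/outer pairs and reordering them, which is why, for each iteration of an outer loop, at least one inner-loop iteration necessarily executes.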
Examiner further notes that the rejection is not in view of Yang alone, but in view of the combination of Yang and Henry, where Henry has been introduced to reinforce the obviousness. Examiner asserts that the interpretation is very reasonable in view of the prior art and that the rejection should be maintained.

With respect to Applicant's argument that Yang does not teach the performing and generating steps, Examiner respectfully disagrees. The cited portion of Yang explicitly teaches these steps ([p. 4] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C"). Examiner also asserts that the rejection is not in view of Yang alone but in view of the combination of Yang and Henry. Henry also explicitly teaches performing partial operations to generate an output, at least at paragraph 535. For at least these reasons, and those further detailed below, Examiner asserts that the interpretation of the prior art is reasonable and that it is appropriate to maintain the rejection under 35 U.S.C. 103.

With respect to Applicant's arguments that Yang just splits a nested loop into an inner loop and an outer loop of a conventional convolutional neural network, and that this differentiates it from the instant claims, Examiner respectfully notes that while this may summarize the algorithm producing the end result of Yang, FIG. 1 of Yang explicitly shows that the intention of Yang is to split weight matrices, and Yang even explicitly teaches that these weight matrices are split in kernel dimensions, channel dimensions, x and y dimensions, etc., to be able to perform large-scale operations on limited hardware ([Abstract] "Convolutional Neural Networks (CNNs) are the state of the art solution for many computer vision problems, and many researchers have explored optimized implementations. Most implementations heuristically block the computation to deal with the large data sizes and high data reuse of CNNs. This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests."). Yang explicitly teaches that the method presented is not a traditional CNN method ([Abstract] "Compared to traditional CNN CPU implementations based on highly-tuned, hand-optimized BLAS libraries, our x86 programs implementing the optimal blocking reduce the number of memory accesses by up to 90%.").

One of ordinary skill in the art would recognize that the values of K, C, Y, X, etc. in Algorithm 1 of Yang could be set arbitrarily, such that blocks without interdependence could be readily achieved. Examiner notes that this is a moot point, as the claims do not recite "splitting the weight parameter matrix into a plurality of submatrices without any interdependence" as argued on p. 21 of the Remarks submitted 6/10/2024. Similarly, the instant specification does not make any mention of splitting the weight matrices into submatrices without interdependence. For at least these reasons, Applicant's conclusion that Yang fails to disclose splitting of loops based on said arguments is flawed.

With respect to Applicant's arguments that Henry does not teach "the weight parameter matrix is split according to each partial input data such that the operational parameter array obtained by the splitting has a number of columns equal to the number of the received plurality of partial input data, and all the operational parameters in each column correspond to the same one or more channels as one of the plurality of partial input data", Examiner respectfully disagrees. Yang explicitly teaches that each input image is comprised of partial data ([p. 2 §2] "image dimensions x and y, and c for color channels"), such that all of the pixels in the x, y, and c dimensions form an input image of the set of input images ([p. 1 §1] "the input data to a layer often consists of tens to hundreds of images").
With respect to Applicant's arguments that the claimed invention is directed towards kernels and not layers, Examiner notes that claim 1 of the instant application explicitly recites "splitting a weight parameter matrix of a selected layer in the convolutional neural network". Examiner further notes that the combination of Yang and Henry explicitly teaches splitting along a kernel dimension, and one of ordinary skill in the art would readily recognize that CNN layers are comprised of kernels.

With respect to Applicant's arguments on p. 19 of the Remarks submitted 5/28/2025 that Henry does not disclose "that the weight parameter matrix is split collaboratively in both the channel C dimension and the kernel-number K dimension during splitting", Examiner notes that the limitation mapped to Henry in the Non-Final Office Action mailed 2/28/2025 does not mention anything about splitting collaboratively in both the channel C dimension and the kernel-number K dimension. Applicant appears to be taking Henry out of the context of the combination of Yang and Henry as intended in the Non-Final Office Action mailed 2/28/2025, and applying a narrow interpretation of the instant specification to Henry alone (outside of the combination).

Applicant's arguments directed towards Henry do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. Applicant's arguments directed towards Stanford do not comply with 37 CFR 1.111(c) for the same reasons.
The remaining arguments are moot in view of a new ground of rejection set forth below.

Claim Objections

Claims 1, 8, 13, 15, 20, and 21 are objected to because of the following informalities:

Regarding claims 1, 13, and 15, "in response to that the number of the kernels of the weight parameter exceeds a second threshold" should read "in response to the number of the kernels of the weight parameter exceeding a second threshold".

Regarding claim 8, "in response to that the operational parameter array" should read "in response to the operational parameter array".

Regarding claim 20, "In response to that the number of the kernels of the weight parameter is greater than or equal to a first predetermined number" should read "In response to the number of kernels of the weight parameter being greater than or equal to a first predetermined number".

Regarding claim 21, "in response to that the selected layer receives a plurality of partial input data" should read "in response to the selected layer receiving a plurality of partial input data".

Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1, 5-6, 8-13, and 15-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding claims 1, 13, and 15, "the plurality of operational parameters obtained by the splitting have different depths and/or different numbers of kernels" is indefinite. It is unclear to one of ordinary skill in the art what the depths and/or numbers of kernels of the plurality of operational parameters are different relative to. It is unclear whether the claim limitation should be interpreted as "each operational parameter of the plurality of operational parameters obtained by the splitting has a different depth and/or different number of kernels", or as "the plurality of operational parameters obtained by the splitting have a different depth than the depth of the convolutional neural network and/or a number of kernels of the convolutional neural network", or something else altogether. As these interpretations are contradictory, the scope of the claim cannot be reasonably determined and the claim is indefinite. In the interest of further examination, the claim is interpreted as "the plurality of operational parameters obtained by the splitting have a different depth than the depth of the convolutional neural network and/or a number of kernels of the convolutional neural network".

The remaining claims are rejected with respect to their dependence on the rejected claims.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention, in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 5, 6, 8-11, 13, 15, 19, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Yang ("A Systematic Approach to Blocking Convolutional Neural Networks", 2016) and Liang ("FP-BNN: Binarized neural network on FPGA", 2017).
[Image: FIG. 1 of Yang]

Regarding claim 1, Yang teaches:

A computer implemented method for performing operations in a convolutional neural network, comprising: ([Abstract] "This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests. Using this model we automatically derive optimized blockings for common networks that improve the energy efficiency of custom hardware implementations by up to an order of magnitude")

selecting a layer in the convolutional neural network based on at least one of a capacity of a high-speed memory, a capacity of the high-speed memory reserved for a weight parameter of the layer, a capacity of the high-speed memory currently available for the weight parameter of the layer, an arrangement of multipliers and adders, a requirement on operation parallelism, a design of the convolution neural network, or current performance of a processor and/or an operating system; ([p. 7 §3.5] "To optimize a CNN layer for a fixed memory hierarchy, for each string we continue to pack the lower level buffers into the lowest available level of memory hierarchy, always adding the unpacked buffer with the highest number of accesses. When the current memory level does not have enough remaining space to fit the added buffer, we place that and all subsequent buffers into the next level of the memory hierarchy until it becomes full" See also the layer in FIG. 1, interpreted as a selected CNN layer)

splitting the weight parameter of the selected layer in the convolutional neural network in a dimension of a depth and a dimension of a number of kernels, to obtain an operational parameter array including a plurality of operational parameters, wherein the weight parameter of the selected layer is an array of one or more parameters represented by both the dimension of the depth and the dimension of the number of the kernels; the operational parameters in the operational parameter array consist of the one or more parameters in the array of the weight parameter; ([p. 2 §2] "A convolutional layer (Conv) corresponds to a filter bank. In the standard case of 3D input and output, a convolutional layer maps a C×X×Y input to a K×X×Y output using K shift-invariant 3D stencils, where each stencil is of the size Fw×Fh×C (i.e., a set of K 3-dimensional convolutions). These K Fw×Fh×C stencil coefficients are the “weights” of the convolutional layer. Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension, which we define as C, or the number of channels. Typically the dimensions of the kernels are much smaller than the image dimensions." [p. 4 §3.1] "The computation being performed by a convolutional layer can be easily expressed as a 6 layer loop nest as shown in Algorithm 1...blocking can be thought of as simply splitting a number of loops, and then exchanging the order in which these split loops are executed" Yang explicitly states that the 3x3 convolution is blocked (split) into 1x1 fragments of the 3x3 convolution (see also FIG. 1), where each fragment is three-dimensional, having a depth dimension and a kernel (K) dimension.)
respective operational parameters in each row of the operational parameter array are from a same subset of a set of the kernels of the weight parameter and have different channels respectively, ("Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension," A row of a filter fragment in Yang is interpreted as the elements along the channel dimension (C0, C1, C2, etc.), each having a different channel respective to a particular kernel (K).)

respective operational parameters in each column of the operational parameter array are from different subsets of the set of the kernels of the weight parameter respectively and have the same channel(s); ([p. 3 §2.2] "1 for k0 = 0 : K do 2 for c0 = 0 : C do" The first element of every kernel corresponds to the same channel C=C0 but to different kernel subsets (for each kernel from 0 to K), which is evident in FIG. 1 and can be verified with line 2 of Algorithm 1.)

and the plurality of operational parameters obtained by the splitting have different depths and/or different numbers of kernels; ("Here, (X,Y) and (Fw,Fh) are the image and kernel width and height dimensions and both image and kernels have the same depth dimension," See also FIG. 1, where each blocking level has different depths and/or different numbers of kernels)

storing the operational parameter array in the high-speed memory; ([p. 4 §3.2] "While in the final design the input, kernel, and output data at each level of the memory hierarchy may be stored together, for this analysis it is convenient to think of them as separate memory structures. Thus we will consider a memory for kernel coefficients KB (kernel buffer), input image data IB, and output data OB. Since these memories exist at multiple levels in the memory hierarchy, we use KB0, IB0, OB0, to indicate the kernel, input, and output memory that is closest to the compute unit, and each buffer at level i, (e.g. IBi) fetches its data from the buffer at level i+1 (IBi+1)" [p. 5 §3.2] "partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy. For maximum reuse of kernel partial outputs, the output buffer contains all elements that are computed by the inner loops")

obtaining each operational parameter in the operational parameter array stored in the high-speed memory; ([p. 4 §3.2] "fetches its data from the buffer at level i+1 (IBi+1)" [p. 5 §3.2] "partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy. For maximum reuse of kernel partial outputs, the output buffer contains all elements that are computed by the inner loops" [p. 6] "We compute the number of accesses to each memory by introducing refetch rate RRi, the number of times a piece of data is fetched from a certain buffer after initially being loaded into that buffer (Table 2)")

performing, by hardware in parallel using the each operational parameter in the operational parameter array, operations of the selected layer on data of input data of the selected layer that are in a channel corresponding to a channel of the each operational parameter, to obtain a partial operation result array including a plurality of partial operation results; ([p. 4] "Figure 1: Hierarchical blocking of a single convolutional layer. The six-dimensional overall problem domain (X,Y,C,Fw,Fh,K) depicted in Figure 1 is blocked to three levels in the input domain ({X,Y,C}{0,1,2}), and two levels in the set of kernels (K) which correspond to the third dimension of the output domain ({X,Y}{0,1,2},{K}{0,1}). Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 11 §5.3] "As discussed in Section 3.3, there are two different schemes for parallelizing a problem on a multiple-core chip. Figure 9 demonstrates how the two parallelization schemes, with different schedules and memory hierarchies, affect the energy efficiency of a system with up to eight cores. We chose the top four blocking schedules from the single core problem discussed in the previous section and we evaluated the two parallelization methods for each of the four schedules as applied to layer Conv1.")

and generating one or more output data of the selected layer based on the partial operational result array, ([p. 4 §3] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C").

While Yang explicitly anticipates splitting the weight filter based on memory capacity thresholds ([p. 6] "Memory access energy per 16 bits (pJ/16b) for various memory sizes and word lengths. For memory size in the range of 0.25KB to 16MB, we use SRAM. When the memory size exceeds 16MB, we use DRAM"), Yang does not explicitly teach wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter exceeds a second threshold, splitting the weight parameter, such that a number of kernels of the each operational parameter in the operational parameter array obtained by the splitting is less than or equal to the second threshold, wherein the second threshold is set according to the capacity of the high-speed memory, a size of each of the kernels, and parameters relating to the hardware supporting the operations of the convolutional neural network.
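For illustration, one reading of the claimed two-dimensional split (kernel-number K and depth C) can be sketched as blocking a (K, C, Fh, Fw) weight tensor into a 2-D array of sub-weights, where each row draws from one kernel subset and each column from one channel subset. The block sizes and the helper name `split_weight` are hypothetical, not taken from the claims or the cited art.

```python
import numpy as np

def split_weight(weights, k_block, c_block):
    """Block a (K, C, Fh, Fw) weight tensor in the K and C dimensions.

    Returns a nested list: rows index kernel subsets, columns index
    channel subsets, matching the row/column semantics mapped above.
    """
    K, C, Fh, Fw = weights.shape
    return [[weights[k:k + k_block, c:c + c_block]
             for c in range(0, C, c_block)]    # columns: channel subsets
            for k in range(0, K, k_block)]     # rows: kernel subsets

# Hypothetical sizes: 8 kernels of depth 6, split into 4-kernel x 2-channel blocks.
w = np.arange(8 * 6 * 3 * 3, dtype=float).reshape(8, 6, 3, 3)
array = split_weight(w, k_block=4, c_block=2)
print(len(array), len(array[0]))   # rows x columns of the operational array
print(array[0][0].shape)           # shape of one operational parameter block
```

Every block in row i comes from the same kernel subset with different channels, and every block in column j covers the same channels across different kernel subsets, which is the array structure the mapping attributes to the blocked weights.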
Liang, in the same field of endeavor, teaches wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter exceeds a second threshold, splitting the weight parameter, such that a number of kernels of the each operational parameter in the operational parameter array obtained by the splitting is less than or equal to the second threshold, ([p. 4] "For operations in BN layers, the number of operations (NOP) has a linear relationship with the number of output channels Nout" [p. 9 §5] "The number of tiled output channels equals to the number of PE channels NPE […] It will take [ Nin / TNin ] iterations to get the intermediate result accumulated for one output. This will be repeated by [ Nout / NPE ] times for all outputs get done" In a CNN, each kernel is tied to a respective output channel, such that the number of output channels is necessarily, by definition, equal to the number of kernels. Liang explicitly teaches that only NPE output channels can be processed per iteration, such that if Nout (the total number of kernels of the each operational parameter in the operational parameter array obtained by the splitting) is greater than NPE (the second threshold), then Nout needs to be split by NPE.)

wherein the second threshold is set according to the capacity of the high-speed memory, a size of each of the kernels, and parameters relating to the hardware supporting the operations of the convolutional neural network. ([p. 10 §6.2] "we keep the parallelism of memories identical to the number of PE channels NPE, and the width of each memory equals to PEsize").

Yang and Liang are both directed towards partitioning convolutional neural network filter weight matrices; therefore, Yang and Liang are analogous art in the same field of endeavor.
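The output-channel tiling the Examiner reads into Liang (split Nout whenever it exceeds the PE-channel count NPE, so that each tile holds at most NPE kernels) reduces to a ceiling division. The function name and the numeric values below are illustrative assumptions.

```python
import math

def tile_output_channels(n_out, n_pe):
    """Split n_out output channels (kernels) into tiles of at most n_pe.

    Returns the per-tile channel counts; there are ceil(n_out / n_pe)
    tiles, matching Liang's [ Nout / NPE ] repetition count as mapped.
    """
    n_tiles = math.ceil(n_out / n_pe)
    return [min(n_pe, n_out - t * n_pe) for t in range(n_tiles)]

print(tile_output_channels(n_out=70, n_pe=32))  # → [32, 32, 6]
```

Each tile's size stays at or below the threshold n_pe, and the tile count is exactly the number of iterations Liang's scheme would repeat to cover all outputs.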
It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang with the teachings of Liang by partitioning the filter weight matrix by a kernel threshold. Liang provides additional motivation for the combination ([p. 2] "To improve resource usage […] we introduce FP-BNN, a BNN acceleration system design on FPGA, with related optimizations").

Regarding claim 5, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter comprises: splitting the weight parameter in a case where the weight parameter has a number of channels exceeding a third threshold, such that each operational parameter in the operational parameter array obtained by the splitting has a number of channels less than or equal to the third threshold. (Yang [p. 5 §3.2] "When a new C loop Ci is added, a series of images and kernels are streamed and Ci channels reductions are being performed on the same set of outputs. Therefore those partial outputs are being reduced Ci/Ci−1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy" [p. 5 §3.3] "Suppose we apply parallelism for S cores at a given level p by unrolling that loop p across the processors. The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" The third threshold is interpreted as Cp−1, such that if Cp is equal to Cp−1 no splitting occurs. Yang explicitly teaches that the partitioning may occur as a function of channels (Cp = Cp−1).)
Regarding claim 6, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter comprises: splitting the weight parameter in a case where the weight parameter has a number of channels greater than or equal to a second predetermined number, such that the operational parameter array obtained by the splitting has a number of columns equal to a multiple of the second predetermined number. (Yang [p. 5 §3.3] "The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" S is interpreted as the multiple of the predetermined number (Cp−1).). Regarding claim 8, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter further comprises: in response to that the operational parameter array obtained by the splitting includes an operational parameter having a size exceeding a first threshold, subdividing at least a row and/or column including the operational parameter having the size exceeding the first threshold in at least one of a dimension of depth and number of kernels, such that each operational parameter in an operational parameter array obtained by the subdividing has a size less than or equal to the first threshold. (Yang [p. 5 §3.3] "The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" S is interpreted as the multiple of the predetermined number (Cp−1). See also Figure 1. Cp is interpreted as the operational parameter exceeding the threshold, and Cp−1 as an operational parameter having a size less than Cp.).
Regarding claim 9, the combination of Yang and Liang teaches The method of claim 1 wherein each partial operation result in the partial operation result array corresponds to one output data of the selected layer. (Yang [p. 4] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" Output pixel is interpreted as synonymous with one output data of the selected layer.). Regarding claim 10, the combination of Yang and Liang teaches The method of claim 1 wherein generating the output data comprises: compressing the partial operation result array into one column by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner when the partial operation result array includes a plurality of columns, each partial operation result in the compressed partial operation result array corresponding to an output data of the selected layer. (Yang [p. 2] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 5 §3.2] "The inner loop takes a small amount of input data with block size X0Y0C0 and convolves it with K0 kernels to create some partial outputs with block size X0Y0K0. A complete output cannot be generated until all the channels of the input are processed for that kernel and the output pixel is generated, which will happen only when all of the channels (C2 loop) finish" For the iteration where X0 = 1, the partial output is compressed into a single column. Yang explicitly teaches that the partial results are accumulated (summed) from the channels (rows).).
Regarding claim 11, the combination of Yang and Liang teaches The method of claim 1 wherein generating the output data comprises: compressing the partial operation result array into one row by combining all the partial operation results in each column of the partial operation result array in the depth direction when the partial operation result array includes a plurality of rows, each partial operation result in the compressed partial operation result array corresponding to an output data of the selected layer. (Yang [p. 2] "Partial results for each output pixel are accumulated hierarchically across the three levels of blocking in C" [p. 5 §3.2] "The inner loop takes a small amount of input data with block size X0Y0C0 and convolves it with K0 kernels to create some partial outputs with block size X0Y0K0. A complete output cannot be generated until all the channels of the input are processed for that kernel and the output pixel is generated, which will happen only when all of the channels (C2 loop) finish" For the iteration where X0 = 1, the partial output is compressed into a single row. See Figure 1 for how the partial operation results correspond to the output.). Regarding claims 13 and 15, claims 13 and 15 are directed towards an apparatus for performing the method of claim 1. Therefore, the rejection applied to claim 1 also applies to claims 13 and 15. Claims 13 and 15 also recite additional elements, including a processor to perform the method ([p. 1 §1] "Early attempts [20, 1, 24, 2] to optimize CPU and GPU CNN implementations treated the convolutional layers as matrix multiplication and used an optimized BLAS matrix matrix-multiplication (GEMM) routine") as well as memory to store the instructions performed by the processor ([p. 1 §1] "the design of the memory hierarchy and how the data is choreographed has a dramatic effect on the energy required for the computation.").
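The row-wise and column-wise compressions recited in claims 10 and 11 above are reductions over a two-dimensional array of partial results. A minimal NumPy sketch, with axis choices assumed from the claim wording and the claim 11 "combining in the depth direction" modeled as a sum for concreteness (the claim may instead denote depth-wise concatenation):

```python
import numpy as np

# Hypothetical 3x4 array of partial operation results; the row/column
# semantics (e.g. rows as kernel groups, columns as channel groups) are
# assumed for illustration only.
partials = np.arange(12.0).reshape(3, 4)

# Claim 10 reading: add up each row point-to-point -> one column.
one_column = partials.sum(axis=1, keepdims=True)  # shape (3, 1)

# Claim 11 reading: combine each column -> one row (modeled as a sum).
one_row = partials.sum(axis=0, keepdims=True)     # shape (1, 4)
```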
Regarding claim 19, the combination of Yang and Liang teaches The method of claim 1 wherein splitting the weight parameter matrix comprises: splitting the weight parameter matrix in a case where a size of the weight parameter matrix exceeds a first threshold, such that the each operational parameter in the operational parameter array obtained by the splitting has a size less than or equal to the first threshold. (Yang [p. 5 §3.2] "When a new C loop Ci is added, a series of images and kernels are streamed and Ci channels reductions are being performed on the same set of outputs. Therefore those partial outputs are being reduced Ci/Ci-1 times, and should be stored in a new output buffer to prevent these fetches from going to a larger memory at a higher level in the memory hierarchy" [p. 5 §3.2] "Suppose we apply parallelism for S cores at a given level p by unrolling that loop p across the processors. The first constraint is that we need to block the application such that the dimension being unrolled, e.g. Cp, is S times that of the previous level, Cp−1. The parallelism can be performed by partitioning the problem across the input XY, the kernels K, or the channels C" The threshold is interpreted as Cp−1, such that if Cp is equal to Cp−1, no splitting occurs. Yang explicitly teaches that the partitioning may occur as a function of operational parameters (Cp = Cp−1).). Regarding claim 20, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter comprises: in response to that the number of the kernels of the weight parameter is greater than or equal to a first predetermined number, splitting the weight parameter, such that the operational parameter array obtained by the splitting has a number of rows equal to a multiple of the first predetermined number. (Liang [p. 9 §5] "Considering the datapath shared among different layers, NPE should be a common divisor of Nout of different layers").
Regarding claim 21, the combination of Yang and Liang teaches The method of claim 1, wherein splitting the weight parameter comprises: in response to that the selected layer receives a plurality of partial input data, any two of which do not have the same channel, and the plurality of partial input data collectively correspond to the complete input data of the selected layer, splitting the weight parameter according to each of the partial input data, such that the operational parameter array obtained by the splitting has a number of columns equal to a number of the received plurality of partial input data, and the respective operational parameters in the each column correspond to the same one or more channels as one of the plurality of partial input data. (Liang [p. 9 §5] "For FC layers, one tiled input will be fed into all PE channels. For better resource utilization we set TNin as close as possible to PEsize. It will take [ Nin / TNin ] iterations to get the intermediate result accumulated for one output. This will be repeated by [ Nout / NPE ] times for all outputs get done [...] Considering the datapath shared among different layers, NPE should be a common divisor of Nout of different layers, TNin for each layer should preferably be best a sub-multiple of Nin, and PEsize should be a big value and also close to Lin to best explore the resource utilization" TNin is interpreted as partial input data, any two of which do not have the same channel.). Claim 12 is rejected under 35 U.S.C. §103 as being unpatentable over the combination of Yang and Liang and in further view of Stanford (“CS231n Convolutional Neural Networks for Visual Recognition”, 2015). Regarding claim 12, the combination of Yang and Liang teaches The method of claim 1.
However, the combination of Yang and Liang doesn't explicitly teach wherein generating the output data comprises: generating an output data of the selected layer by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner and then combining, in the depth direction, all the partial operation results in each column of the partial operation result array compressed by the adding up, or by combining all the partial operation results in each column of the partial operation result array in the depth direction and then adding up all the partial operation results in each row of the partial operation result array compressed by the combining in a point-to-point manner, when the partial operation result array includes a plurality of rows and a plurality of columns. Stanford, in the same field of endeavor, teaches generating the output data comprises: generating an output data of the selected layer by adding up all the partial operation results in each row of the partial operation result array in a point-to-point manner and then combining, in the depth direction, all the partial operation results in each column of the partial operation result array compressed by the adding up, or by combining all the partial operation results in each column of the partial operation result array in the depth direction and then adding up all the partial operation results in each row of the partial operation result array compressed by the combining in a point-to-point manner, when the partial operation result array includes a plurality of rows and a plurality of columns. ([p. 11] "The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias." See FIG. on p. 12.).
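The im2col lowering that Stanford describes (convolution as matrix multiplication) can be sketched generically as follows; this is a minimal single-channel, stride-1, no-padding illustration, not code from any cited reference:

```python
import numpy as np

def im2col(x, k):
    """Unroll each k x k receptive field of the 2-D input x into a column,
    so the whole convolution becomes one matrix multiplication."""
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)   # toy 4x4 input
w = np.ones((3, 3))                 # toy 3x3 filter
# One GEMM call produces the full 2x2 output map.
gemm_out = (w.ravel() @ im2col(x, 3)).reshape(2, 2)
```

The GEMM result equals the direct sliding-window convolution, which is the BLAS-reuse benefit the citation relies on.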
The combination of Yang and Liang as well as Stanford are directed towards accelerating convolutional neural networks. Therefore, the combination of Yang and Liang as well as Stanford are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Yang and Liang with the teachings of Stanford. While the content on the Stanford convnet course website would be considered well-understood by one of ordinary skill in the art, a motivation for combination with regard to matrix factorization of convolutional neural networks has been provided ([p. 13 "Implementation as Matrix Multiplication"] "the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API). Moreover, the same im2col idea can be reused to perform the pooling operation"). Claims 16, 17, and 18 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Yang, Liang, and Stanford, and further in view of Henry (US20180189633A1). Regarding claim 16, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then adding results obtained by the operations in a point-to-point manner. Henry, in the same field of endeavor, teaches providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then adding results obtained by the operations in a point-to-point manner
([¶0110] "Each data word functions as the output value (also sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as a weight associated with a connection coming into a neuron of the instant layer of the network" [¶0599] "each iteration of loop 2 implicates a horizontal 2D input slice (i.e., all C channels of a given row of the H rows of the input 5802) and a horizontal 2D filter slice (i.e., all C channels of a given row of the R rows of the filter 5804). The column-channel-sum is the result of, for each channel of all the C channels, convolving the channel's portion of the implicated horizontal 2D input slice and the channel's portion of the implicated horizontal 2D filter slice to generate a column-sum, and continually accumulating all of the C channel's column-sums to produce the column-channel-sum" Accumulating row by row in a column-channel-sum fashion is interpreted as synonymous with adding results obtained by the operations in a point-to-point manner.). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other.
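Henry's column-channel-sum (¶0599), as characterized above, accumulates per-channel column-sums across all C channels. A minimal NumPy sketch under assumed shapes (one horizontal input slice and one matching horizontal filter slice; all names are illustrative, not Henry's):

```python
import numpy as np

rng = np.random.default_rng(0)
C, W, R = 3, 6, 3                           # channels, input width, filter width
input_slice = rng.standard_normal((C, W))   # all C channels of one input row
filter_slice = rng.standard_normal((C, R))  # all C channels of one filter row

# For each channel, slide the filter row across the input row (column-sums),
# and continually accumulate across channels into the column-channel-sum.
col_channel_sum = np.zeros(W - R + 1)
for c in range(C):
    for j in range(W - R + 1):
        col_channel_sum[j] += input_slice[c, j:j + R] @ filter_slice[c]
```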
Henry provides as additional motivation for combination ([¶0524] "Partitioning the NPU 126 array, a data RAM 122 row, weight RAM 124 row, and data RAM 122 row into their respective G NPU blocks 5906, input blocks 5902, filter blocks 5904, and output blocks 5908 each of size B facilitates the NNU 121 convolving the input 5802 with the filters 5804 to generate the output 5808 in an efficient manner. In particular, the partitioning, in conjunction with the layout of the input data and filter weights within the data RAM 122 and weight RAM 124, facilitates a unique nested loop structure that advantageously uses the rotater mux-reg 208 structure of the NNU 121 to rotate the input blocks 5902 associated with all the C channels of the input 5802 so that each of F of the G NPU blocks 5906 associated with the F filters 5804 “see” (i.e., to receive) all C channels of the input 5802 for convolving with its corresponding filter 5804. More specifically, the NNU 121 reads the input blocks 5902 of a row of the data RAM 122 into the mux-regs 208 and then, using the rotater formed by the mux-regs 208, rotates the input blocks 5902 through at least C adjacent NPU blocks 5906. This enables each NPU 126 to perform multiply-accumulate operations of all the channels of a row of its corresponding filter 5804 with all the channels of a row of the input 5802 (e.g., to perform a column-channel-sum, as described below with respect to FIG. 60) before another row of the input 5802 is read into the mux-regs 208"). Regarding claim 17, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach, further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then directly providing the partial output data to a yet next layer. 
Henry, in the same field of endeavor, teaches The method of claim 12, further comprising: providing the partial operation results to a next layer; and in the next layer, using a weight parameter of the next layer to perform operations on each partial input data, and then directly providing the partial output data to a yet next layer. ([¶0110] "Each data word functions as the output value (also sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as a weight associated with a connection coming into a neuron of the instant layer of the network" [¶0149] "the architectural program writes the weight words for the next layer to the weight RAM 124 while the NNU 121 is performing the hidden layer computations for the current layer so that the NNU 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete" Current and instant layer in Henry are interpreted as synonymous such that a previous, instant, and next layer are three layers among which intermediate results are explicitly relayed using the weight parameters.). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other.
Henry provides the same additional motivation for combination as reproduced above with respect to claim 16 (¶0524). Regarding claim 18, the combination of Yang, Liang, and Stanford teaches The method of claim 12. However, the combination of Yang, Liang, and Stanford doesn't explicitly teach further comprising: providing the partial operation results to a next layer; and adding the partial input data received in the next layer first in a point-to-point manner to obtain a complete input data, and then performing conventional operations on the complete input data.
Henry, in the same field of endeavor, teaches The method of claim 12, further comprising: providing the partial operation results to a next layer; and adding the partial input data received in the next layer first in a point-to-point manner to obtain a complete input data, and then performing conventional operations on the complete input data ([¶0149] "the architectural program writes the weight words for the next layer to the weight RAM 124 while the NNU 121 is performing the hidden layer computations for the current layer so that the NNU 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete"). The combination of Yang, Liang, and Stanford as well as Henry are directed towards neural network accelerators with an emphasis on CNNs. Therefore, Yang, Liang, and Stanford as well as Henry are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Yang, Liang, and Stanford with the teachings of Henry by performing the loop partitioning in an explicitly column-first fashion. Algorithm 600 in Henry is substantially similar to Algorithm 1 in Yang, and it would have been trivial for one of ordinary skill in the art to modify either to match the other. Henry provides the same additional motivation for combination as reproduced above with respect to claim 16 (¶0524). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhang (“Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks”, 2015) partitions along the output-feature/kernel dimension using a tile T_M with 0 < T_M < M and M/T_M passes. Umuroglu (“FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”, 2017) formalizes folding factors across the output-channel (kernel) dimension. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124 /MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124

Prosecution Timeline

Oct 25, 2018
Application Filed
Aug 15, 2022
Non-Final Rejection — §103, §112
Nov 29, 2022
Response Filed
Dec 08, 2022
Final Rejection — §103, §112
Feb 14, 2023
Response after Non-Final Action
Mar 15, 2023
Request for Continued Examination
Mar 20, 2023
Response after Non-Final Action
Apr 29, 2023
Non-Final Rejection — §103, §112
Jul 03, 2023
Response Filed
Jul 31, 2023
Final Rejection — §103, §112
Oct 05, 2023
Response after Non-Final Action
Oct 16, 2023
Response after Non-Final Action
Nov 09, 2023
Response after Non-Final Action
Dec 11, 2023
Request for Continued Examination
Dec 18, 2023
Response after Non-Final Action
Feb 26, 2024
Non-Final Rejection — §103, §112
Jun 10, 2024
Response Filed
Jul 20, 2024
Final Rejection — §103, §112
Sep 30, 2024
Response after Non-Final Action
Oct 30, 2024
Request for Continued Examination
Nov 04, 2024
Response after Non-Final Action
Feb 25, 2025
Non-Final Rejection — §103, §112
May 28, 2025
Response Filed
Jul 14, 2025
Final Rejection — §103, §112
Sep 16, 2025
Response after Non-Final Action
Oct 13, 2025
Request for Continued Examination
Oct 16, 2025
Response after Non-Final Action
Dec 30, 2025
Non-Final Rejection — §103, §112
Apr 03, 2026
Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561604
SYSTEM AND METHOD FOR ITERATIVE DATA CLUSTERING USING MACHINE LEARNING
2y 5m to grant Granted Feb 24, 2026
Patent 12547878
Highly Efficient Convolutional Neural Networks
2y 5m to grant Granted Feb 10, 2026
Patent 12536426
Smooth Continuous Piecewise Constructed Activation Functions
2y 5m to grant Granted Jan 27, 2026
Patent 12518143
FEEDFORWARD GENERATIVE NEURAL NETWORKS
2y 5m to grant Granted Jan 06, 2026
Patent 12505340
STASH BALANCING IN MODEL PARALLELISM
2y 5m to grant Granted Dec 23, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

9-10
Expected OA Rounds
52%
Grant Probability
90%
With Interview (+38.2%)
4y 7m
Median Time to Grant
High
PTA Risk
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
