DETAILED ACTION
This Office Action is in response to the amendments filed on 11/18/2025.
Claims 1, 5, 7, 10, 14, 16, and 19 are currently amended.
Claims 1-19 and 21 are currently pending in this application and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Allowable Subject Matter
Claims 5-9 and 14-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Response to Arguments
In reference to Applicant’s arguments on page(s) 10-11 regarding rejections made under 35 U.S.C. 102:
Claims 1-4, 10-13, 19, and 21 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Non-Patent Literature "Mini-batch Serialization: CNN Training with Inter-layer Data Reuse" by Lym et al. (hereinafter "Lym").
Applicant respectfully traverses these rejections for the reasons set forth below.
In particular, a second layer group is determined based on operation overheads of the neural network. The second layer group comprises at least two first layer groups with different first batch sizes, where the buffer requirement of the second layer group is less than or equal to the capacity of the on-chip memory. The Office Action fails to cite any passages in Lym that disclose, teach, or suggest the above-identified limitations.
In summary, the cited portions of Lym describe reducing the sub-batch size of one group to that of the adjacent group. The Office Action appears to offer the one group and the adjacent group as two "first layer groups with different first batch sizes" that are included in the "second layer group" recited in the claim. The Office Action further appears to offer the statement that "the sub-batch does not exceed the on-chip buffer capacity" in Lym as teaching that "a buffer requirement of each second layer group is less than or equal to the capacity of the on-chip memory" as recited in claim 1. However, the cited portions of Lym merely discuss that the sub-batch size of each group (offered to teach "the first layer group," as discussed previously) does not exceed the on-chip buffer capacity, but are silent on the buffer requirement of "the second layer group (that includes at least two first layer groups)" being less than or equal to the capacity of the on-chip memory. In other words, the cited portions of Lym merely discuss the sub-batch size of each first layer group, but not the buffer requirement of the second layer group that includes at least two first layer groups.
In addition, the Office Action fails to cite any passages in Lym disclosing that the second layer group (including the at least two first layer groups, as discussed previously) is determined based on operation overheads of the neural network.
Examiner’s response:
Applicant’s arguments have been fully considered but are moot in light of the amendments made to the claims.
Applicant argues that the prior art reference of Lym does not teach the amended limitation of creating a second layer group that is determined based on operation overheads of the neural network. Examiner agrees: Lym is deficient in this teaching. However, a new search was performed, new art was found, and the new art is applied to the claims below.
Applicant argues that the provided reference of Lym is silent on the buffer requirement of the second layer group that contains at least two first layer groups. Examiner disagrees. Examiner cited section 3 (Mini-Batch Serialization) of the Lym reference, which highlights the processing of batches that do not exceed an on-chip buffer capacity. It can further be seen in that same section of Lym, both in the text and in figures 4 and 5, that the layer groups support layers that have different batch sizes. Figure 5 clearly shows that the block sizes differ across layers. For example, in Group 1 there are ten blocks of size 3 and one block of size 2; in Group 2 there are blocks of size 6 and size 2; in Group 3 there are two blocks of size 11 and one block of size 10; and in Group 4 there are four blocks of size 16. All of these layers adhere to the processing limitation of the on-chip buffer capacity.
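Examiner's note (illustrative only): the relationship discussed above, in which each layer group's sub-batch size is limited by the on-chip buffer capacity so that different groups end up with different sub-batch sizes, can be sketched as follows. The capacity and footprint values below are hypothetical and are not taken from Lym.

```python
# Illustrative sketch (hypothetical values): each layer group picks the largest
# sub-batch size whose total activation footprint fits the on-chip buffer, so
# adjacent groups may end up with different sub-batch sizes.

BUFFER_CAPACITY = 96  # hypothetical on-chip buffer capacity (arbitrary units)

# per-sample activation footprint of each layer group (hypothetical)
group_footprints = [30, 12, 8, 6]

def max_sub_batch(footprint, capacity=BUFFER_CAPACITY):
    """Largest sub-batch size whose total footprint fits on chip."""
    return capacity // footprint

sub_batch_sizes = [max_sub_batch(f) for f in group_footprints]
print(sub_batch_sizes)  # different sub-batch sizes per group: [3, 8, 12, 16]
```

Under this toy model, later groups with smaller per-sample footprints support larger sub-batches, consistent with the varying block sizes visible across the groups of Lym's Figure 5.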
The rejections made under 35 U.S.C. 102 are withdrawn, and new grounds of rejection are presented below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-4, 10-13, 19, and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lym et al. (Lym, Sangkug; Behroozi, Armand; Wen, Wei; Li, Ge; Kwon, Yongkee; Erez, Mattan. (2018). Mini-batch Serialization: CNN Training with Inter-layer Data Reuse. 10.48550/arXiv.1810.00307, hereinafter Lym), in view of Sui et al. (US 20190303762 A1, hereinafter Sui).
Regarding Claim 1:
Lym teaches
A neural network scheduling method, wherein the method comprises: determining a first batch size corresponding to each layer of one or more layers in a neural network (Lym [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity."; [Section 3, p. 4]: "We do this by varying the number of samples per sub-batch across layers such that layers that can support more samples require fewer iterations and can benefit from the greater parallelism and locality");
forming, through grouping based on the first batch size, the neural network into a neural network comprising at least one first layer group, wherein each first layer group comprises at least one layer in the neural network, first batch sizes corresponding to layers in each first layer group are the same, and a buffer requirement of each first layer group is less than or equal to a capacity of an on-chip memory (Lym [Section 3, p. 5]: "The MBS algorithm forms initial layer groups by grouping adjacent layers that require the same number of sub-batch iterations. This is shown in Fig. 4 where grey vertical bars represent the data volume required for the inter-layer data per layer (or one multi-branch module block) of ResNet50, and the red line represents the resulting minimal sub-batch iteration count for each layer."; [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity.");
wherein each second layer group comprises at least one first layer group, a buffer requirement of each second layer group is less than or equal to the capacity of the on-chip memory, and at least one second layer group comprises at least two first layer groups with different first batch sizes (Lym [Section 3, p. 5]: "Then, layer groups are merged to improve overall locality: groups are merged by reducing the sub-batch size of one group to that of an adjacent group. The first group then requires more iterations (with more weight and gradient accesses), but inter-layer reuse increases across the two layers where the groups meet."; [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity.");
and scheduling the neural network based on a grouping result of the second layer group (Lym [Section 3, p. 6]: "The mini-batch is then processed in several sub-batch iterations (⌈mini-batch size / sub-batch size⌉) within each group as shown in Fig. 5, which emphasizes how locality is increased and memory traffic reduced across features and weights").
Lym does not distinctly disclose
forming, through grouping based on a grouping result of the first layer group, the neural network into a neural network comprising at least one second layer group that is determined based on operation overheads of the neural network
However, Sui teaches
forming, through grouping based on a grouping result of the first layer group, the neural network into a neural network comprising at least one second layer group that is determined based on operation overheads of the neural network (Sui [0022]: “neural network calculation can obtain the highest processing efficiency on a computational platform flexibly based on various limits. The optimization method of the present invention, through reconstructing the computational graph, by greatly reusing the shared input and/or intermediate data, avoids unnecessary bandwidth saturation. Through reasonable arrangement of storage and/or reading, time-consuming data rearrangement operation is pruned, and certain subsequent operations can be merged into a previous operation. Thus, various operations and I/O process are optimized in the implementations of neural networks, the overall calculation efficiency is improved”; [0088]: “FIG. 11 shows an example of grouping operation according to an embodiment of the present invention. Under the condition that there is limit for hardware processing capability, etc., grouping processing can be carried out for layers, the number of the parameters and computational complexity are reduced by setting group parameters”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, and having the teachings of Lym and Sui before him or her, to modify the approach that significantly reduces memory traffic by partially serializing mini-batch processing across groups of layers of Lym to include the methods for neural network computational graph optimization based on operational constraints as shown in Sui. The motivation for doing so would have been to use the operational constraints of Sui as a basis for the approach of memory traffic reduction of Lym (Sui [0022]: “neural network calculation can obtain the highest processing efficiency on a computational platform flexibly based on various limits. The optimization method of the present invention, through reconstructing the computational graph, by greatly reusing the shared input and/or intermediate data, avoids unnecessary bandwidth saturation. Through reasonable arrangement of storage and/or reading, time-consuming data rearrangement operation is pruned, and certain subsequent operations can be merged into a previous operation. Thus, various operations and I/O process are optimized in the implementations of neural networks, the overall calculation efficiency is improved”).
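Examiner's note (illustrative only): the group-merging step of Lym quoted above (reducing the sub-batch size of one group to that of an adjacent group, subject to the on-chip buffer capacity) can be sketched with a simplified cost model. The buffer capacity, footprints, and the exact merge criterion used here are assumptions for illustration, not Lym's actual algorithm.

```python
# Simplified illustration of the merging step cited from Lym [Section 3, p. 5]:
# adjacent layer groups are merged by reducing the sub-batch size of one group
# to that of its neighbor; the merged group must still fit the on-chip buffer.
# All values and the fit criterion are hypothetical.

CAPACITY = 160  # hypothetical on-chip buffer capacity (arbitrary units)

def merge_adjacent(groups, capacity=CAPACITY):
    """groups: list of (sub_batch_size, per_sample_footprint) tuples.
    Merge each adjacent pair whose combined footprint, at the smaller of the
    two sub-batch sizes, still fits the buffer capacity."""
    merged = [groups[0]]
    for size, footprint in groups[1:]:
        prev_size, prev_footprint = merged[-1]
        common = min(prev_size, size)  # reduce to the smaller sub-batch size
        if common * (prev_footprint + footprint) <= capacity:
            merged[-1] = (common, prev_footprint + footprint)  # merge
        else:
            merged.append((size, footprint))
    return merged

# Four initial groups with differing sub-batch sizes; the first three merge.
print(merge_adjacent([(3, 30), (8, 12), (12, 8), (16, 6)]))  # [(3, 50), (16, 6)]
```

The merged group runs at the smaller sub-batch size and therefore requires more iterations, mirroring the trade-off Lym describes between extra weight/gradient accesses and increased inter-layer reuse.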
Regarding Claim 2:
Lym teaches
The method according to claim 1, wherein the determining a first batch size corresponding to each layer of the one or more layers in a neural network comprises: determining, for a buffer requirement of each layer of the one or more layers in the neural network and the capacity of the on-chip memory, the first batch size corresponding to each layer of the one or more layers in the neural network (Lym [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity.").
Regarding Claim 3:
Lym teaches
The method according to claim 2, wherein the determining, for a buffer requirement of each layer of the one or more layers in the neural network and the capacity of the on-chip memory, the first batch size corresponding to each layer of the one or more layers in the neural network comprises: determining, for one or more pieces of input data and one or more pieces of output data of each layer of the one or more layers in the neural network and the capacity of the on-chip memory, the first batch size corresponding to each layer of the one or more layers in the neural network, wherein at least one piece of input data or at least one piece of output data of at least one layer in the neural network is stored in an off-chip memory (Lym [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity."; [Section 3, p. 4]: "We do this by varying the number of samples per sub-batch across layers such that layers that can support more samples require fewer iterations and can benefit from the greater parallelism and locality"; [Section 3, p. 6]: "The mini-batch is then processed in several sub-batch iterations (⌈mini-batch size / sub-batch size⌉) within each group as shown in Fig. 5, which emphasizes how locality is increased and memory traffic reduced across features and weights."; (EN): it can be seen in Fig. 5, denoted by the red and blue arrows, that data is loaded from and stored to an off-chip memory).
Regarding Claim 4:
Lym teaches
The method according to claim 3, wherein the determining, for one or more pieces of input data and one or more pieces of output data of each layer of the one or more layers in the neural network and the capacity of the on-chip memory, the first batch size corresponding to each layer of the one or more layers in the neural network comprises: adjusting storage locations of one or more pieces of input data or one or more pieces of output data of at least one layer in the neural network based on operation overheads of the neural network, wherein the storage location comprises the on-chip memory or the off-chip memory (Lym [Section 1, p. 4]: "data needed during back propagation is stored off chip as well. Otherwise, layer output data is only written and later read from main memory between groups."; [Section 2, p. 1]: "The convolution outputs and the activations are stored in off-chip memory for reuse in back propagation, because their large storage requirements and long data reuse distance prevent on-chip buffering.");
in a process of adjusting the storage location, obtaining storage locations that are of one or more pieces of input data and one or more pieces of output data of each layer of the one or more layers in the neural network (Lym [Section 1, p. 4]: "data needed during back propagation is stored off chip as well. Otherwise, layer output data is only written and later read from main memory between groups."; [Section 2, p. 1]: "The convolution outputs and the activations are stored in off-chip memory for reuse in back propagation, because their large storage requirements and long data reuse distance prevent on-chip buffering.");
and determining the first batch size corresponding to each layer of the one or more layers in the neural network based on the storage locations of the one or more pieces of input data and the one or more pieces of output data of each layer of the one or more layers in the neural network and the capacity of the on-chip memory (Lym [Section 3, p. 5]: "Then, layer groups are merged to improve overall locality: groups are merged by reducing the sub-batch size of one group to that of an adjacent group. The first group then requires more iterations (with more weight and gradient accesses), but inter-layer reuse increases across the two layers where the groups meet."; [Section 3, p. 3]: "An improvement on full serialization is to process multiple samples at a time (a sub-batch) to provide some intra-layer weight reuse and extra parallelism, as long as the footprint at any point in the sub-batch does not exceed the on-chip buffer capacity.").
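Examiner's note (illustrative only): the storage-placement rationale in the passages of Lym cited above (tensors with large storage requirements and long data reuse distance are kept off chip) can be illustrated with a minimal sketch. All names, sizes, and the placement heuristic below are hypothetical.

```python
# Illustrative sketch (hypothetical values) of the placement idea cited from
# Lym [Section 2, p. 1]: tensors that are large or have a long reuse distance
# are placed off chip; small, quickly reused tensors stay in the on-chip buffer.

def place(tensors, on_chip_capacity=64):
    """tensors: dict of name -> (size, reuse_distance).
    Returns a dict of name -> storage location."""
    placement = {}
    used = 0
    # consider cheap tensors (small size x short reuse distance) first
    for name, (size, reuse) in sorted(tensors.items(),
                                      key=lambda t: t[1][0] * t[1][1]):
        if used + size <= on_chip_capacity and reuse <= 2:
            placement[name] = "on-chip"
            used += size
        else:
            placement[name] = "off-chip"
    return placement

tensors = {
    "activations_fwd": (40, 1),   # reused immediately by the next layer
    "conv_outputs":    (48, 50),  # long reuse distance (back propagation)
}
print(place(tensors))  # conv_outputs lands off chip; activations_fwd on chip
```

Once placements are fixed, the remaining on-chip capacity determines the first batch size per layer, which is the determination step recited in the claim.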
Regarding Claim 10:
Due to claim language similar to that of claim 1, claim 10 is rejected for the same reasons as presented above in the rejection of claim 1, with the exception of the limitation(s) covered below.
Lym teaches
one or more non-transitory computer-readable storage medium coupled to the at least one processor and storing programming instructions for execution by the at least one processor, wherein the programming instructions, when executed, cause the apparatus to perform operations (Lym [Section 4.2, p. 1]: “In addition to the systolic cores, the WaveCore CNN training accelerator contains several more structures and units. Fig. 9 illustrates the overall architecture of one core of the processor. There are two such cores in our proposed design that are connected by an on-chip network”; [Section 4.2, p. 4 Main Memory]: “The off-chip memory is connected to memory controllers, which communicate with the on-chip buffers via the crossbar switches. Our baseline WaveCore uses a single HBM2 stack with 4 dice (Joi, 2016), which provides 8GiB off-chip DRAM with 300GiB/s data bandwidth over 8 channels (4 channels per core). We choose HBM2 because it is used by other modern training accelerators”; (EN): Table 4 shows a configuration of the off-chip memory, and Fig. 9 shows the per-core architecture of the accelerator, which is coupled to the off-chip memory).
Regarding Claim 11:
Due to claim language similar to that of claim 2, claim 11 is rejected for the same reasons as presented above in the rejection of claim 2.
Regarding Claim 12:
Due to claim language similar to that of claim 3, claim 12 is rejected for the same reasons as presented above in the rejection of claim 3.
Regarding Claim 13:
Due to claim language similar to that of claim 4, claim 13 is rejected for the same reasons as presented above in the rejection of claim 4.
Regarding Claim 19:
Due to claim language similar to that of claims 1 and 10, claim 19 is rejected for the same reasons as presented above in the rejection of claims 1 and 10.
Regarding Claim 21:
Due to claim language similar to that of claims 2 and 11, claim 21 is rejected for the same reasons as presented above in the rejection of claims 2 and 11.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 11216726 B2 – Batch Processing In A Neural Network Processor
US 10019668 B1 – Systems and methods for receiving a batch of neural network inputs to be processed using a neural network on a hardware circuit
Y. Shen, M. Ferdman and P. Milder, "Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer," 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, USA, 2017, pp. 93-100, doi: 10.1109/FCCM.2017.47. – a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to COREY M SACKALOSKY whose telephone number is (703)756-1590. The examiner can normally be reached M-F 7:30am-3:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at (571) 272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/COREY M SACKALOSKY/Examiner, Art Unit 2128
/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128