DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-2, 4-5, 10-14, 16 and 18-19 have been amended. Claims 1-20 have been examined.
Response to Arguments
Applicant's arguments filed 12/17/2025 have been fully considered but they are not persuasive.
On pp. 9-10 of the 12/17/2025 remarks, Applicant argues that cited art of record Shah does not rely on profiling of layer execution times to form subgraphs. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). The rejection does not rely upon Shah for a teaching of profiling. Instead, the rejection relies upon the teaching of cited art of record Cai. Cai also teaches using profiling data for partitioning a computation graph into subgraphs (e.g. see ¶ 0003). The rejection is not based upon Shah alone, but instead relies upon the combined teaching of the references.
On p. 10 of the remarks, Applicant argues that Cai’s profile data is not used to group layers. However, the rejection is based upon the combination of references. Shah is relied upon to teach grouping layers using layer-wise execution data, while Cai is relied upon to specifically teach the act of execution profiling. The references combine to teach the claimed limitations.
Further on p. 10 of the remarks, Applicant argues that Shah and Cai fail to teach pipelined execution of subgraphs. However, Shah ¶ 0057 discusses the use of execution data to schedule subgraph pipelines (e.g. “keep the compute pipelines as busy as possible”). Applicant’s argument is not persuasive.
Further on p. 10 of the remarks, Applicant argues that Shah and Cai do not teach or suggest “intermediate activations generated within a compute circuit of the multiple compute circuits remain in on-chip memory during execution of its assigned subgraph.” However, Shah ¶ 0057 teaches arranging subgraphs to reduce data transfers between tiles and “keep all data in L1 memory.” As depicted in Fig. 2B, L1 memory resides on-chip in each tile. Applicant’s argument is not persuasive.
Additional arguments on p. 11 are based upon previous arguments and are not persuasive for the reasons provided above.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The term “approximately equal” in lines 6 and 10 of claim 1 is a relative term which renders the claim indefinite. The term “approximately” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The term is found in ¶ 0021 of published application 20230153583 in the context of “in a manner that minimizes the differences between compute times of the subgraphs.” For the purpose of further examination, the limitation will be interpreted accordingly.
Independent claims 13 and 16 contain limitations similar to claim 1 and are rejected for the same reasons indicated above. Dependent claims 2-12, 14-15 and 17-20 are rejected by virtue of their dependence from a rejected parent claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4 and 7-18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 20210342673 by Shah et al. ("Shah") in view of U.S. Patent Application Publication 20210073625 by Cai et al. ("Cai").
In regard to claim 1, Shah discloses:
1. A method comprising: See Shah, Fig. 1.
gathering first layers of a neural network graph by a data processing system into groups of layers using layer-wise … data that indicate respective execution times of the layers … Shah, ¶ 0023, “For each data transfer, the compiler analyzes all possible data transfer paths and generates a non-colliding data transfer path. Because the compiler can determine these at compile-time rather than at run-time, it can ensure that the data transfer instructions are not conflicting …” ¶ 0032, “In the example of FIG. 1B, the computations performed by layers 102A-D are allocated to groups 182A-D of Tiles as indicated. The allocation is not required to be 1:1. For example, multiple layers could be allocated to a single Tile or vice versa.” Also ¶ 0053, “The compiler knows when regions A/B will be ready and the instructions to implement layer K+1 will execute after that time.”
Shah does not expressly disclose the following limitations which are taught by Cai:
Shah does not expressly disclose profile data. See Cai ¶ 0045, “For example, a computation graph may include subgraphs that are commonly used in many machine learning models as their components. For example, the commonly used subgraphs can include MobileNets layers, ResNet layers, Region Proposal Network, etc. In some embodiments, prior history of execution, experiments, or simulations of a certain subgraph on accelerators can identify which accelerator is optimal for processing the certain subgraph.” Also ¶ 0046, “The operation profiling information may include execution time or speed information and delay information of an accelerator for executing a certain operation such as a convolution, matrix multiplication, etc.” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Cai’s profile data with Shah’s execution data in order to identify which accelerator is optimal for processing the certain subgraph as suggested by Cai.
Shah and Cai also teach:
the profile data obtained by a compiler, and Shah, ¶ 0042, “In order to statically schedule the instructions in a deterministic phase, the compiler typically will know the duration of each instruction (i.e., how long each instruction takes to execute), the capabilities of each Tile (which Tiles can execute which instructions), the topology of data transfer paths to and from Tiles (including between Tiles, and between Tiles and on-chip memory), and the computations required and their dependencies (i.e., the MLN description). With this information, the compiler can schedule unconditional start times for the Tile instructions.” Also see Cai, ¶ 0029, “The compiler is a program or computer software that transforms computer codes written in one programming language into NPU instructions to create an executable program.” Note that since a compiler generates the executable program, associated profile data must be utilized by the compiler in order to produce the associated subgraphs. Also ¶ 0035, “In some embodiments, scheduler 210 may be implemented within a compiler.”
based on profiled compute times of the layers, forming the groups [ in a manner that minimizes the differences between compute times of the subgraphs] according to the profile data. Shah, ¶ 0057, “the compiler may optimize to increase the utilization of compute resources in the Tiles—to keep the compute pipelines as busy as possible. … the compiler may optimize with respect to data transfer to reduce the wait times of computations.” Also see Cai, ¶ 0045, “In some embodiments, graph partitioner 320 can partition a computation graph into multiple subgraphs that are executed on different accelerators based on the subgraph profiling information to optimize performance in executing the computation graph.” Also ¶ 0046, “The operation profiling information may include execution time or speed information and delay information of an accelerator for executing a certain operation such as a convolution, matrix multiplication, etc.”
Shah also discloses:
wherein each group is a subgraph of one or more of the layers of the neural network; Shah, Fig. 3, depicting an optimized graph 335 which is then partitioned 328 into subgraphs for tile execution. Also see ¶ 0032, “In the example of FIG. 1B, the computations performed by layers 102A-D are allocated to groups 182A-D of Tiles as indicated. The allocation is not required to be 1:1. For example, multiple layers could be allocated to a single Tile or vice versa.”
compiling the neural network graph into instructions for pipelined execution of subgraphs across multiple compute circuits based on the [ minimized compute time differences] of the subgraphs; See Shah Fig. 2A, depicting compute circuits. Also Fig. 3 element 320 “Compiler,” along with ¶ 0055, “The compiler 320 receives the optimized graph 335 and produces the resulting computer program 350.” Also ¶ 0064, “The groups of Tiles G may operate in a pipelined manner.”
wherein the compiling includes designating, for each first subgraph of the subgraphs having output activations that are input activations of a second subgraph of the subgraphs, operations of the first subgraph to be performed by a first compute circuit of the plurality of compute circuits and operations of the second subgraph to be performed by a second compute circuit of the plurality of compute circuits; and See Shah Fig. 1B where elements 102A-D depict a neural network (i.e. “MLN”) whereby “output activations” flow from 1 layer to “input activations” of a following layer. Also ¶ 0032, “In each mesh, the Tiles 180 are organized in a regular pattern and the interconnections within each mesh provide data transfer paths between Tiles in the mesh. … In the example of FIG. 1B, the computations performed by layers 102A-D are allocated to groups 182A-D of Tiles as indicated.” Also see Fig. 3 and ¶ 0058, “Partitioning 328 concerns mapping the computations in the MLN to an implementation on the MLA. This includes determining which computations are allocated to which Tiles and how data flows through the mesh of Tiles during computation.” Also ¶ 0062, “For example, an MLN with fifteen layers may be allocated between five groups of Tiles in which the first group of Tiles G1 may perform computations associated with layers 1, 2, and 3; the second group of Tiles G2 may perform computations associated with layers 4, 5, and 6; the third group of Tiles G3 may perform computations associated with layers 7, 8, and 9; and so on.”
configuring the multiple compute circuits to execute the instructions such that intermediate activations generated within a compute circuit of the multiple compute circuits remain in on-chip memory during execution of its assigned subgraph. See Shah ¶ 0028, “The compiler 120 receives a description of a machine learning network 100 and generates a computer program 150 that implements the machine learning network using MLA 170.” Also Fig. 2B and ¶ 0047, “Each Tile 280 includes an L1 memory 282.” Also ¶ 0057, “It may also allocate computations to Tiles in order to reduce data transfers between Tiles in the same mesh, to reduce data transfers from outside the MLA and/or to reduce data transfers that cross the boundary of the mesh (e.g., reducing data transfers between L1 and L2 memory and trying to keep all data in L1 memory).”
In regard to claim 2, Shah also discloses:
2. The method of claim 1, further comprising:
gathering second layers of the neural network into another subgraph, wherein the second layers are serially connected and include layer 1 through layer N, and each layer 1 through layer N specifies generation of respective output activations based on respective input activations; wherein the compiling includes for the other subgraph: decomposing the input activations to layer 1 into a plurality of tiles; specifying for first layer processing, tile-by-tile processing of the plurality of tiles; and specifying for each layer M processing for 2 ≤M ≤(N - 1), tile-by-tile processing of output tiles from layer M as input tiles to layer (M + 1), where M is in an intermediate layer and N is a last layer. See Shah, Fig. 4 and ¶ 0062-0064, e.g. “For example, an MLN with fifteen layers may be allocated between five groups of Tiles in which the first group of Tiles G1 may perform computations associated with layers 1, 2, and 3; the second group of Tiles G2 may perform computations associated with layers 4, 5, and 6; the third group of Tiles G3 may perform computations associated with layers 7, 8, and 9; and so on. … The groups of Tiles implementing the intermediate layers of the MLN each perform computations on the outputs from the group of Tiles implementing the previous layer in the MLN and generate outputs to the groups of Tiles implementing the subsequent layer of the MLN.”
In regard to claim 3, Shah also discloses:
3. The method of claim 2, wherein the compiling includes specifying stitching of output tiles from layer N into a complete set of output activations of the other subgraph. Shah, ¶ 0053, “The layer K output may be stored in L2 memory and then retrieved from L2 memory as needed for the next layer's computations.”
In regard to claim 4, Shah also discloses:
4. The method of claim 2, wherein: the compute circuits are programmable logic circuits; the programmable logic circuits include the on-chip memory coupled to the compute circuits; and See Fig. 2A, depicting programmable logic circuits coupled to memory 274.
the compiling includes specifying that the output tiles from each layer M remain in the on-chip memory for processing as input tiles in layer (M + 1). Shah Fig. 1B elements 102A-D depict a neural network with 4 layers whereby output from one layer is provided as input for the next layer. Also see ¶ 0068, “The L2 memories 274 and Tiles 280 are interconnected nodes for data transfers to implement an MLN.”
In regard to claim 7, Shah discloses:
7. The method of claim 1, further comprising:
gathering second layers of the neural network into another subgraph, wherein the second layers are serially connected and include a first layer configured to generate first output activations based on first input activations and to provide the first output activations as second input activations to a second layer of the other subgraph; Shah, ¶ 0032, “In each mesh, the Tiles 180 are organized in a regular pattern and the interconnections within each mesh provide data transfer paths between Tiles in the mesh.” Also see Shah, Fig. 4 and ¶ 0062-0064, e.g. “For example, an MLN with fifteen layers may be allocated between five groups of Tiles in which the first group of Tiles G1 may perform computations associated with layers 1, 2, and 3; the second group of Tiles G2 may perform computations associated with layers 4, 5, and 6; the third group of Tiles G3 may perform computations associated with layers 7, 8, and 9; and so on. … The groups of Tiles implementing the intermediate layers of the MLN each perform computations on the outputs from the group of Tiles implementing the previous layer in the MLN and generate outputs to the groups of Tiles implementing the subsequent layer of the MLN.”
wherein the compiling includes for the other subgraph:
decomposing the first input activations into a plurality of tiles that includes a first input tile and a second input tile; Shah, ¶ 0032, “For example, multiple layers could be allocated to a single Tile or vice versa.” Here, the “vice versa” provides multiple tiles for a single layer.
specifying first layer processing of the first input tile and the second input tile by one compute circuit at different times or by two compute circuits in parallel; and ¶ 0035, “In FIG. 1C, the deterministic and non-deterministic phases are shown as alternating. This is not required. For example, deterministic and non-deterministic phases may execute concurrently.”
specifying, for a first output tile generated from the first layer processing of the first input tile and for a second output tile generated from the first layer processing of the second input tile, second layer processing of the first output tile and the second output tile without stitching the first output tile and the second output tile, wherein the second layer processing is by one compute circuit at different times or by two compute circuits in parallel. Shah, ¶ 0047, “In FIG. 2B, the L1 memory 282 may receive data from any of its adjacent Tiles and/or from L2 memory if it is an edge Tile. Similarly, it may transfer data to any of its adjacent Tiles and/or to L2 memory if it is an edge Tile. The data transfer operations are controlled by data transfer instructions received and executed by the Tiles.” Also ¶ 0064, “The groups of Tiles implementing the intermediate layers of the MLN each perform computations on the outputs from the group of Tiles implementing the previous layer in the MLN and generate outputs to the groups of Tiles implementing the subsequent layer of the MLN.”
In regard to claim 8, Shah discloses:
8. The method of claim 1, wherein:
the first and second compute circuits are disposed on a programmable device, the first compute circuit is coupled to a first on-platform memory bank, and the second compute circuit is coupled to a second on-platform memory bank; and See Fig. 2A, depicting compute circuits on a programmable device coupled to memory 274. Also Fig. 6, depicting multiple on-platform L2 memory banks.
the compiling includes generating instructions executable by a host processor coupled to the programmable device and when executed cause the host processor to initiate moving the output activations from the first on-platform memory bank to the second on-platform memory bank in response to completion of the operations associated with the first subgraph. Shah, Fig. 2A, depicting MLA 270 which consists of mosaics 272A-N. Any transfer of data among mosaics necessarily requires movement between respective memory banks 274 associated with a particular mosaic.
In regard to claim 9, Shah discloses:
9. The method of claim 1, wherein:
the first and second compute circuits are disposed on a programmable device, the first compute circuit is coupled to a first on-platform memory bank, and the second compute circuit is coupled to a second on-platform memory bank; and See Fig. 2A, depicting compute circuits on a programmable device coupled to memory 274. Also Fig. 6, depicting multiple on-platform L2 memory banks.
the compiling includes generating instructions executable by an on-device controller and when executed cause the on-device controller to initiate moving the output activations from the first on-platform memory bank to the second on-platform memory bank in response to completion of the operations associated with the first subgraph. Shah, ¶ 0036, “The Tiles may have circuitry that synchronizes the Tiles. For example, each Tile may monitor when it is ready to begin execution of a deterministic phase 152B and then actual execution begins when all Tiles signal that they are ready. Alternatively, an external controller may synchronize the Tiles and start the deterministic phase 152B when all Tiles are ready.”
In regard to claim 10, Shah also discloses:
10. The method of claim 1, wherein the compiler determines tile dimensions for inter-layer tile-by-tile processing based on profiled compute-time information and available on-chip memory capacity to minimize off-chip transfers. See Shah, Fig. 1B and ¶ 0032, “In the example of FIG. 1B, the computations performed by layers 102A-D are allocated to groups 182A-D of Tiles as indicated.” Also ¶ 0063, “The groups G may be limited to blocks of specific shapes (e.g., rectangles or squares) or the shape of the blocks may be unconstrained and may comprise arbitrary shapes.” Also ¶ 0067, “The compiler 120 identifies 510 specific Tiles for inclusion in the group of Tiles Gi having the size Ti. … For example, the Tiles may be selected to minimize data transfer times between the groups.”
In regard to claim 11, Cai also teaches:
11. The method of claim 1, wherein the profile data specify for each layer a respective number of cycles of a clock signal. Cai, ¶ 0046, “3) subgraph profiling information per accelerator. … The operation profiling information may include execution time.”
In regard to claim 12, Cai also teaches:
12. The method of claim 1, wherein the profile data specify for each layer a respective elapsed real time. Cai, ¶ 0046, “3) subgraph profiling information per accelerator. … The operation profiling information may include execution time.”
In regard to claim 13, Shah discloses:
13. A system, comprising: a computer storage arrangement configured with program code that when executed by one or more processors causes the one or more processors to perform operations including: See Shah, p. 11 at claim 20, “A non-transitory computer readable storage medium storing instructions for implementing a machine learning network on a machine learning accelerator (MLA), … the instructions when executed by one or more processors causing the one or more processors to perform steps …”
…
an arrangement of one or more processors coupled to the computer storage arrangement and configured to communicate the program code to another computer storage arrangement in response to download instructions. Shah, ¶ 0059, “The resulting computer program 350 may be loaded into memory for execution on a machine learning accelerator 370.”
All further limitations of claim 13 have been addressed in the above rejection of claim 1.
In regard to claims 14-15, parent claim 13 is addressed above.
All further limitations of claims 14-15 have been addressed in the above rejections of claims 2-3, respectively.
In regard to claim 16, Shah discloses:
16. A method comprising: Shah, Fig. 1A, depicting a method.
All further limitations of claim 16 have been addressed in the above rejection of claims 1-2.
In regard to claims 17-18, parent claim 16 is addressed above.
All further limitations of claims 17-18 have been addressed in the above rejections of claims 3-4, respectively.
Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Shah in view of Cai and further in view of U.S. Patent 9436760 to Tacchi et al. ("Tacchi").
In regard to claim 5, Shah does not expressly disclose:
5. The method of claim 2, wherein the gathering includes: determining a quantity of the on-chip memory available to a compute circuit; and determining a tile size of the plurality of tiles of the input activations of layer 1 based on the quantity of on-chip memory available. This is taught by Tacchi. See col. 20, lines 16-20 “In some embodiments, tile size may be selected based on the amount of available memory at various levels of a memory hierarchy, such that a given tile can fit within a targeted level of the hierarchy, like the level 2 or level 3 cache.” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Tacchi’s tile size selection with Shah’s memory in order to ensure that a tile can fit in the memory as suggested by Tacchi.
In regard to claim 19, parent claim 16 is addressed above.
All further limitations of claim 19 have been addressed in the above rejection of claim 5.
Claims 6 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shah in view of Cai and further in view of U.S. Patent 10019668 to Woo ("Woo").
In regard to claim 6, Shah does not expressly disclose:
6. The method of claim 2, wherein the compiling includes: determining respective activation sizes of the layers of the neural network; and wherein the gathering the second layers of the neural network into groups of layers includes limiting the subgraphs to layers having activation sizes greater than a threshold. This is taught by Woo. See col. 15, lines 27-36, “Circuit 100 can compare the total memory usage for each group of layers to the 500 MB on-chip storage capacity. Circuit 100 can then determine a partitioning or grouping of layers that form a sequence of superlayers based on the results of the comparison. Circuit 100 determines the partitioning of the layers into a sequence of superlayers so as to not exceed the threshold storage capacity of the on-chip memory (500 MB) when a hardware circuit processes a batch of neural network inputs for the working sets.” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Woo’s partitioning in Shah’s compiler in order to avoid exceeding storage capacity as suggested by Woo.
In regard to claim 20, parent claim 16 is addressed above.
All further limitations of claim 20 have been addressed in the above rejection of claim 6.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James D Rutten whose telephone number is (571)272-3703. The examiner can normally be reached M-F 9:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/James D. Rutten/Primary Examiner, Art Unit 2121