Prosecution Insights
Last updated: April 19, 2026
Application No. 17/919,164

EFFICIENTLY ALLOCATING MEMORY ON NEURAL NETWORK COMPUTE TILES

Final Rejection §103
Filed: Oct 14, 2022
Examiner: GRUSZKA, DANIEL PATRICK
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: Google LLC
OA Round: 2 (Final)
Grant Probability: Favorable
Expected OA Rounds: 3-4
Time to Grant: 3y 3m

Examiner Intelligence

Career Allow Rate: 0% (grants 0% of cases; 0 granted / 0 resolved; -55.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; based on resolved cases with interview)
Avg Prosecution: 3y 3m (typical timeline)
Career History: 32 total applications across all art units; 32 currently pending

Statute-Specific Performance

§101: 38.3% (-1.7% vs TC avg)
§103: 42.3% (+2.3% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 7.4% (-32.6% vs TC avg)
Deltas are measured against the Tech Center average estimate • Based on career data from 0 resolved cases
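As a quick sanity check on the table above (a sketch only; the tool's underlying dataset is not part of this page), each statute's rate and its delta jointly imply a Tech Center baseline, since the delta is defined as examiner rate minus TC average:

```python
# Check: for each statute, tc_avg = examiner_rate - delta.
# Rates and deltas are taken from the Statute-Specific Performance table above.
rates = {          # statute: (examiner rate %, delta vs TC avg %)
    "101": (38.3, -1.7),
    "103": (42.3, +2.3),
    "102": (12.0, -28.0),
    "112": (7.4, -32.6),
}

implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in rates.items()}
print(implied_tc_avg)  # every statute here implies the same 40.0% baseline
```

Interestingly, all four pairs back out the same 40.0% Tech Center average, which suggests the deltas are computed against a single blended baseline rather than per-statute averages.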

Office Action

§103
Notice of Pre-AIA or AIA Status

This Final communication is in response to Application No. 17/919,164 filed 10/14/2022. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The amendment filed 12/04/2025, which provides amendments to claims 1, 4, 11-12, and 20, has been entered. Claims 1-21 are pending. The amendment to the claims has overcome the 101 rejection.

Response to Arguments

Applicant's arguments with respect to 35 U.S.C § 103 filed 12/04/2025 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 7-9, 11-14, 17 are rejected under 35 U.S.C. 103 as being unpatentable over Thomas (US 12,165,069 B1) in view of Narayanaswami (US 2018/0197068 A1).

Regarding claim 1, Thomas teaches:

obtaining data indicating a neural network comprising a plurality of layers; (Col 4 lines 65-67: "The compiler, in some embodiments, is a software application that is responsible for receiving a trained neural network and converting that network into instructions for loading the weight values onto the neural network computation circuit and instructions for the circuit to execute")

assigning, from among a plurality of computing units that each include a respective addressable memory unit, a subset of the plurality of computing units to at least partially perform inference computations associated with the layer; (Col 5 lines 2-3: "The compiler assigns each layer to a particular number of cores," where cores are similar to computing units, and Col 3 lines 34-37: "The output bus carries the computation node outputs from the post-processing units back to the cores, to be stored in the memory of the core and used as inputs for the next layer of neural network computation nodes." Here we can see each core has memory.)

determining a memory size and a common memory address for the respective addressable memory unit of each computing unit in the assigned subset; and (Col 35 lines 12-18: "identifies activations that must be stored in the memory at the same time, a maximum activation size on any core for each activation, an available space in each core after weight allocation, a set of cores on which each activation is stored, an effective core size for each activation, a number of memory banks in each core, and a number of memory words in each bank," and Col 6 lines 5-7: "each core stores data for each layer at a same memory location (i.e., a same memory unit and a same location in the memory unit)")

generating a single shared instruction for the assigned subset, the shared instruction comprising a memory allocation instruction, wherein the single shared instruction is configured to be executed by each computing unit in the assigned subset; and (Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network.")

broadcasting the single shared instruction to the assigned subset, wherein the memory allocation instruction, when executed by each computing unit of the assigned subset, causes the computing unit to store a result of performing inference computations associated with the layer in the determined common memory address with the determined memory size (Col 17 lines 14-21: "The fabric controller also broadcasts these instructions in some embodiments, while including certain bits that specify the difference in setup between the clusters (or whether certain clusters even need to act on the instructions). Some embodiments broadcast the instructions only to the clusters involved in the computation (which could include clusters with source cores, destination cores, or both).")

Thomas does not teach the single shared instruction. However, Narayanaswami does teach this ([0067]: "In general, when a single compute tile includes multiple MAC operators 108, the operators collectively provide single instruction multiple data (SIMD) functionality by each sharing a single activation input to perform their respective computations. SIMD generally means that all parallel units (multiple MAC operators 108 in a single system 100) share the same instruction (based on the deep loop nest), but each MAC operator 108 executes the instruction on different data elements of tensor 404 and 406.")

Thomas and Narayanaswami are considered analogous art to the claimed invention because they are in the same field of endeavor, being neural network architecture. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory management system and instructions of Thomas with the shared instructions of Narayanaswami. One would want to do this to enhance acceleration (Narayanaswami [0067]).

Regarding claim 7, Thomas in view of Narayanaswami teaches claim 1 as outlined above. Thomas further teaches: data identifying one or more computing units of the subset of the plurality of computing units to which the memory allocation instruction applies. (Col 17 lines 28-31: "In addition, these cluster controllers 625-640 determines which of its cores require the instructions and provides these instructions to the core controllers for these identified cores.")

Regarding claim 8, Thomas in view of Narayanaswami teaches claim 7 as outlined above. Thomas further teaches: the data identifying the one or more computing units is binary indication data. (Col 17 lines 10-14: "The fabric controller also broadcasts these instructions in some embodiments, while including certain bits that specify the difference in setup between the clusters (or whether certain clusters even need to act on the instructions).")

Regarding claim 9, Thomas in view of Narayanaswami teaches claim 1 as outlined above. Thomas further teaches: the memory allocation instruction further comprises, for each computing unit of the subset of the plurality of computing units: data tracking the common memory address of the respective stored result generated by the computing unit.
(Col 16 lines 21-32: "the specified memory location stores arguments such as the source cores for the computations (i.e., the cores that will perform the dot product calculations) and the destination cores for the output values (i.e., the cores to which the output values are stored), the memory locations in the cores at which to find the weight and/or input values for the computations (in some embodiments, the weight values are loaded into memory initially such that these memory locations are the same across all of the source cores), information for calculating the non-linear activation function for the layer (e.g., the lookup table mapping information), etc.")

Regarding claim 11, Thomas in view of Narayanaswami teaches claim 1 as outlined above. Thomas further teaches: providing the single shared instructions to the plurality of computing units. (Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network." And as mentioned above, Narayanaswami teaches the shared instruction.)

Regarding claim 12, Thomas teaches:

providing a set of instructions for performing inference computations for a plurality of layers of a neural network to a system comprising a plurality of computing units, each computing unit including a respective addressable memory, wherein the set of instructions comprises: (Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network.")

a first single shared memory allocation instruction for a first subset of the plurality of computing units, the first single shared memory allocation instruction associated with a first layer and identifying a first common memory address, wherein the first single shared memory allocation instruction is absent identifiers specific to individual computing units within the first subset (Col 34 lines 53-59: "For example, a particular core implementing a first layer whose output is the input for a second layer also implemented by the particular core has a first memory unit identified to store activations (input) for the first layer and a second, different memory unit identified to store the output of the first layer that are the activations (input) used for the second layer.")

a second memory allocation instruction associated with a second layer, the second memory allocation instruction identifying a second memory address and a second subset of the plurality of computing units, wherein the second memory address differs from the first common memory address, and the second subset differs from the first subset; and (Col 34 lines 53-59: "For example, a particular core implementing a first layer whose output is the input for a second layer also implemented by the particular core has a first memory unit identified to store activations (input) for the first layer and a second, different memory unit identified to store the output of the first layer that are the activations (input) used for the second layer.")

broadcasting the first single shared memory allocation instruction to the first subset to cause each computing unit in the first subset to output results of inference computations associated with the first layer to the first common memory address (Col 17 lines 14-21: "The fabric controller also broadcasts these instructions in some embodiments, while including certain bits that specify the difference in setup between the clusters (or whether certain clusters even need to act on the instructions). Some embodiments broadcast the instructions only to the clusters involved in the computation (which could include clusters with source cores, destination cores, or both)," and Col 34 lines 60-67 and Col 35 lines 1-3: "Once the memory units are identified (at 1940), the process generates (at 1950) configuration data specifying the identified memory units for storing activations for each layer of the machine-trained network. In some embodiments, the default memory locations are maintained for any layers that are not connected to layers implemented by a same core as no additional cycles need to be added for reading and writing for those layers. In some embodiments, this configuration data generation is included as part of the configuration data process described above in relation to FIG. 10 and process 1000.")

providing the second memory allocation instruction to the second subset to cause each computing unit in the second subset to output results of inference computations associated with the second layer to the second memory address (Col 15 lines 49-53: "Once the first portion of the network is completed, the system controller 610 provides the fabric 600 with the instructions for the second portion (e.g., a second layer, or a second pass of the first layer), and so on until the chip fabric has fully executed the network.")

Thomas does not teach the single shared instruction. However, Narayanaswami does teach this ([0067]: "In general, when a single compute tile includes multiple MAC operators 108, the operators collectively provide single instruction multiple data (SIMD) functionality by each sharing a single activation input to perform their respective computations.
SIMD generally means that all parallel units (multiple MAC operators 108 in a single system 100) share the same instruction (based on the deep loop nest), but each MAC operator 108 executes the instruction on different data elements of tensor 404 and 406.")

Thomas and Narayanaswami are considered analogous art to the claimed invention because they are in the same field of endeavor, being neural network architecture. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory management system and instructions of Thomas with the shared instructions of Narayanaswami. One would want to do this to enhance acceleration (Narayanaswami [0067]).

Regarding claim 13, Thomas in view of Narayanaswami teaches claim 12 as outlined above. Thomas further teaches:

the first subset of the plurality of computing units corresponds to a subset of the plurality of computing units across which inference computations associated with the first layer in the plurality of layers are to be distributed; and (Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network.")

the second subset of the plurality of computing units corresponds to a subset of the plurality of computing units across which inference computations associated with the second layer in the plurality of layers are to be distributed.
(Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network.")

Regarding claim 14, Thomas in view of Narayanaswami teaches claim 12 as outlined above. Thomas further teaches:

for each computing unit in the first subset, allocate the first memory size at the respective memory address in the respective computing unit's addressable memory based on the first memory address; and (Col 35 lines 12-18: "identifies activations that must be stored in the memory at the same time, a maximum activation size on any core for each activation, an available space in each core after weight allocation, a set of cores on which each activation is stored, an effective core size for each activation, a number of memory banks in each core, and a number of memory words in each bank," and Col 6 lines 5-7: "each core stores data for each layer at a same memory location (i.e., a same memory unit and a same location in the memory unit)")

for each computing unit in the second subset, allocate the second memory size at the respective memory address in the respective computing unit's addressable memory based on the second memory address.
(Col 35 lines 12-18: "identifies activations that must be stored in the memory at the same time, a maximum activation size on any core for each activation, an available space in each core after weight allocation, a set of cores on which each activation is stored, an effective core size for each activation, a number of memory banks in each core, and a number of memory words in each bank," and Col 6 lines 5-7: "each core stores data for each layer at a same memory location (i.e., a same memory unit and a same location in the memory unit)")

Regarding claim 17, Thomas in view of Narayanaswami teaches claim 12 as outlined above. Thomas further teaches: the set of instructions further include the one or more memory allocation instructions associated with each of one or more layers in the plurality of layers different from the first and second layers. (Claim 10: "each core comprises corresponding pluralities of memory units for storing activations associated with layers of the neural network, wherein the program instructions include a specification of particular memory units to store input values associated with each layer of the neural network," and Claim 11: "for each layer of the neural network, the program instructions specify a same set of memory units in each core to store the activations associated with each layer of the neural network.")

Claims 2-6, 10, 15-16, 18-21 are rejected under 35 U.S.C. 103 as being unpatentable over Thomas and Narayanaswami, further in view of Afzal (US 12,169,786 B1).

Regarding claim 2, Thomas in view of Narayanaswami teaches claim 1 as outlined above. Afzal then teaches: determining a layer type of each of the plurality of layers of the neural network based on the obtained data indicating the neural network. (Abstract: "Depending on which type of neural network is being executed and the memory behavior of the specific neural network, a memory configuration can be selected accordingly.")

Thomas, Narayanaswami, and Afzal are considered analogous art to the claimed invention because they are in the same field of endeavor, being memory management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory management system of Thomas with the layer memory management of Afzal. One would want to do this because different neural network types need different memory configurations (Afzal abstract).

Regarding claim 3, Thomas in view of Narayanaswami and Afzal teaches claim 2 as outlined above. Afzal further teaches: the selecting is based at least in part on the determined layer types. (Abstract: "Depending on which type of neural network is being executed and the memory behavior of the specific neural network, a memory configuration can be selected accordingly.")

Regarding claim 4, Thomas in view of Narayanaswami and Afzal teaches claim 2 as outlined above. Afzal further teaches: response to determining the layer type of the layer being a fully-connected layer, determining an extra memory address different from the common memory address of the computing unit, wherein the single shared instruction further comprises an aggregation instruction that, when executed by each of the subset of the plurality of computing units for the layer, causes the computing unit to aggregate one or more results associated with another layer preceding the layer and store the aggregated results in the determined extra memory address in the addressable memory of the computing unit. (Col 9 lines 21-28: "The configuration 510 is designed for fully connected or recurrent neural networks. In configuration 510, there is one activation buffer 502, two weight buffers 504 and 506, and one output buffer 508. In fully connected and recurrent neural networks, there tends to be a greater usage of weights.
Accordingly, configuration 510 features more weight buffers than activation or output buffers." And as mentioned above, Narayanaswami teaches the shared instruction.)

Regarding claim 5, Thomas in view of Narayanaswami and Afzal teaches claim 2 as outlined above. Afzal further teaches: determining the memory size for the respective addressable memory unit of each computing unit in the subset of the plurality of computing units assigned for the layer comprises: determining the memory size for the respective addressable memory unit of each computing unit in the subset of the plurality of computing units assigned for the layer based at least in part on the determined layer type of the respective layer. (Col 9 lines 60-66: "From the examples in FIGS. 5A-5C and the table above, it will be apparent that the total number of each buffer type, the size of each individual buffer, as well as the total amount of memory for any particular buffer type, can vary depending on the overall memory requirements for any given type of data as well as the frequency with which the data is expected to be accessed.")

Regarding claim 6, Thomas in view of Narayanaswami teaches claim 1 as outlined above. Afzal further teaches: assigning, from among a plurality of computing units that each include a respective addressable memory unit, a second subset of the plurality of computing units to at least partially perform inference computations associated with the layer; and generating one or more memory allocation instructions each for a corresponding computing unit in the second subset of the plurality of computing units.
(Claim 1: "first neural network layer is a convolutional layer, and wherein the second neural network layer is a fully connected layer; a plurality of memory banks; and a memory manager configured to allocate the plurality of memory banks according to a first configuration and a second configuration, the first configuration and the second configuration each including an activation buffer that is dedicated to storing values representing input activations produced by an activation function of a neural network, a weight buffer that is dedicated to storing weights of the neural network, and an output buffer that is dedicated to storing values representing output activations of the neural network, each output activation being a result of a computation involving the input activations and the weights")

Regarding claim 10, Thomas in view of Narayanaswami and Afzal teaches claim 4 as outlined above. Thomas further teaches: the aggregation instruction further comprises data specifying, for each computing unit of the subset of the plurality of computing units for the layer, whether each of the respective results associated with the other layer preceding the layer will be aggregated in the extra memory address in the addressable memory of the computing unit of the layer (Col 38 lines 43-48: "Stage 2302 illustrates the execution of a third layer of the machine-trained network that uses the activations 2330b from the second layer (layer 2) stored in memory unit 2310b to produce the activations 2330c of the third layer (layer 3) stored in memory unit 2310c (the memory unit powered on during the execution of the previous layer of the machine-trained network).")

Regarding claim 15, Thomas in view of Narayanaswami teaches claim 14 as outlined above. Afzal further teaches: the first memory size is larger than the second memory size.
(Col 11 lines 11-15: "it can be seen that a larger amount of activation buffer memory and a smaller amount of weight buffer memory are required to optimally implement convolutional neural networks in comparison to fully connected neural networks.")

Thomas, Narayanaswami, and Afzal are considered analogous art to the claimed invention because they are in the same field of endeavor, being memory management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the memory management system of Thomas with the layer memory management of Afzal. One would want to do this because different neural network types need different memory configurations (Afzal abstract).

Regarding claim 16, Thomas in view of Narayanaswami teaches claim 12 as outlined above. Afzal further teaches: the first layer in the plurality of layers comprises a fully-connected layer and the second layer in the plurality of layers comprises an element-wise layer. (Col 18 lines 15-17: "the context executed at block 710 may execute a convolutional layer, and the next context could execute a fully connected layer.")

Regarding claim 18, Thomas in view of Narayanaswami and Afzal teaches claim 16 as outlined above. Afzal further teaches: the set of instructions further comprises a first aggregation instruction associated with the first layer. (Claim 2: "the first neural network layer, and wherein the memory manager switches from the first configuration to the second configuration based on execution, by the at least one processing unit, of an instruction in the program code implementing the first neural network layer, the instruction setting a configuration register in the computing system to a value that represents the second configuration.")

Regarding claim 19, Thomas in view of Narayanaswami and Afzal teaches claim 18 as outlined above.
Afzal further teaches: the first aggregation instruction associated with the first layer further causes the system to, when executed by each computing unit of the first subset, allocate an extra memory address associated with respective computing units in the first subset, the extra memory address being different from the first memory address. (Col 9 lines 21-28: "The configuration 510 is designed for fully connected or recurrent neural networks. In configuration 510, there is one activation buffer 502, two weight buffers 504 and 506, and one output buffer 508. In fully connected and recurrent neural networks, there tends to be a greater usage of weights. Accordingly, configuration 510 features more weight buffers than activation or output buffers.")

Regarding claim 20, Thomas in view of Narayanaswami and Afzal teaches claim 19 as outlined above. Thomas further teaches: the first aggregation instruction further comprises data determining, for each computing unit of the first subset, whether each of the respective results of inference computations associated with a preceding layer of the first layer will be aggregated in a respective memory address of the computing unit based on the extra memory address. (Col 38 lines 54-62: "As in stage 2301, an instruction to power on an additional memory unit 2310d is received during the execution of the third layer so that the memory unit will be available by the time the execution of the next layer begins (in some embodiment the time required for powering on may be greater than the execution of a particular layer and the instruction to power on would be received in the execution of a previous layer in the machine-trained network).")

Regarding claim 21, Thomas in view of Narayanaswami and Afzal teaches claim 20 as outlined above.
Afzal further teaches: in response to determining a result of inference computations associated with the preceding layer of the first layer will be aggregated, the first aggregation instruction associated with the first layer further causes the system to: aggregate the result of inference computations associated with the preceding layer in a respective memory address of a corresponding computing unit based on the extra memory address. (Col 38 lines 54-62: "As in stage 2301, an instruction to power on an additional memory unit 2310d is received during the execution of the third layer so that the memory unit will be available by the time the execution of the next layer begins (in some embodiment the time required for powering on may be greater than the execution of a particular layer and the instruction to power on would be received in the execution of a previous layer in the machine-trained network).")

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL PATRICK GRUSZKA, whose telephone number is (571) 272-5259. The examiner can normally be reached M-F 9:00 AM - 6:00 PM ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DANIEL GRUSZKA/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121
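To make the disputed mechanism concrete, here is a minimal, purely illustrative sketch of what claim 1 recites: one shared memory-allocation instruction, carrying no per-unit identifiers, is broadcast to an assigned subset of compute tiles, and each tile stores its own layer result at the same common address. This code is not from the application or any cited reference; all class and function names are hypothetical. Contrast it with Thomas's broadcast, which embeds per-cluster setup bits in the instruction.

```python
from dataclasses import dataclass, field

@dataclass
class SharedAllocInstruction:
    # One instruction for the whole subset: a common address and size,
    # with no identifiers specific to individual computing units
    # (the absence argued under claim 12).
    common_addr: int
    size: int

@dataclass
class ComputeTile:
    tile_id: int
    memory: dict = field(default_factory=dict)

    def execute(self, instr: SharedAllocInstruction, layer_result: bytes) -> None:
        # Every tile interprets the same instruction identically:
        # reserve `size` bytes at `common_addr` and store its own result there.
        self.memory[instr.common_addr] = layer_result[: instr.size]

def broadcast(instr: SharedAllocInstruction, subset: list) -> None:
    # SIMD-style broadcast: the single shared instruction reaches every
    # tile in the assigned subset, unchanged (no per-tile setup bits).
    for tile in subset:
        tile.execute(instr, layer_result=bytes([tile.tile_id]) * instr.size)

tiles = [ComputeTile(i) for i in range(8)]
subset = tiles[:4]                      # tiles assigned to this layer
instr = SharedAllocInstruction(common_addr=0x1000, size=4)
broadcast(instr, subset)

# Each assigned tile now holds its own (different) data at the SAME address;
# unassigned tiles are untouched.
print({addr for t in subset for addr in t.memory})  # → {4096}
```

The distinction the applicant appears to press is exactly the part this sketch isolates: the instruction object itself is identical for every unit, whereas in Thomas the broadcast instruction carries "certain bits that specify the difference in setup between the clusters."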

Prosecution Timeline

Oct 14, 2022
Application Filed
Aug 21, 2025
Non-Final Rejection — §103
Dec 04, 2025
Response Filed
Feb 25, 2026
Final Rejection — §103 (current)
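The Office Action sets a shortened statutory period of three months from the Feb 25, 2026 mailing date, with an absolute six-month statutory cutoff. A sketch of that calendar arithmetic (date math only, using a simple month-add helper; not legal advice, and it ignores weekend/holiday rollover under 37 CFR 1.7):

```python
import calendar
import datetime

def add_months(d: datetime.date, months: int) -> datetime.date:
    # Advance `months` whole months, clamping to the last day of the
    # target month (e.g., Jan 31 + 1 month -> Feb 28/29).
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

mailed = datetime.date(2026, 2, 25)            # Final Rejection mailing date
shortened_statutory = add_months(mailed, 3)    # reply due without extensions
statutory_maximum = add_months(mailed, 6)      # absolute cutoff with extensions

print(shortened_statutory)   # → 2026-05-25
print(statutory_maximum)     # → 2026-08-25
```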


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: Favorable
Median Time to Grant: 3y 3m
PTA Risk: Moderate
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
