Prosecution Insights
Last updated: April 19, 2026
Application No. 17/077,720

MODULAR NEURAL NETWORK COMPUTING APPARATUS WITH DISTRIBUTED NEURAL NETWORK STORAGE

Status: Final Rejection (§103)
Filed: Oct 22, 2020
Examiner: RAMESH, TIRUMALE K
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 6 (Final)

Grant Probability: 18% (At Risk)
Projected OA Rounds: 7-8
Projected Time to Grant: 4y 5m
Grant Probability With Interview: 20%

Examiner Intelligence

Career Allowance Rate: 18% (7 granted / 40 resolved; -37.5% vs TC average) — grants only 18% of cases
Interview Lift: +2.1% (minimal), based on resolved cases with interview
Typical Timeline: 4y 5m average prosecution
Currently Pending: 40
Total Applications: 80 (across all art units)

Statute-Specific Performance

§101: 30.7% (-9.3% vs TC avg)
§103: 59.1% (+19.1% vs TC avg)
§102: 3.7% (-36.3% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)

Black line = Tech Center average estimate. Based on career data from 40 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Response to Amendment/Reply (Submitted on 12/08/2025)

The examiner submits that the applicant has made substantial claim amendments after a non-final rejection (amending all of claims 1, 3-15, 17-20, and 22) and has added new claim 23. Applicant's arguments with respect to claims 1 and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The examiner submits that the new reference "Firuzan" has been used as a primary reference. In summary, the new primary reference "Firuzan," in combination with reference "Modha," teaches the entire instant application.

In regard to the §103 rejections: the examiner submits that the applicant has amended the claims substantially, and it appears that the focus of the claims may have shifted toward the connectivity of the inference cores, implemented as a router NoC, and specifically as a 2-D hopping network. As a result, applicant's arguments with respect to claims 1 and 15 on Page 9 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. In summary, the examiner has used the new reference "Firuzan" to teach the amendments.

In conclusion, the examiner rejects independent claims 1 and 15 and all dependent claims 3-14, 17-20, and 22-23 under §103, and designates this Office action a FINAL REJECTION under §103.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-15, 17-20 and 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Arash Firuzan et al. (hereinafter Firuzan), "Reconfigurable Network-on-Chip for 3D Neural Network Accelerators," 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), in view of Dharmendra Modha et al. (hereinafter Modha), US 2019/0325295 A1.

In regard to claim 1. (Currently Amended) Firuzan discloses:

- A neural inference processor comprising: a plurality of neural inference cores, wherein each neural inference core of the plurality of neural inference cores is connected to directly adjacent neural inference cores on the neural inference processor, and wherein each neural inference core comprises:

In [I, Page 2]: architecture is provided with a mapping and topology construction algorithm that maps different partitions of an input neural network into the NoC clusters (and cluster nodes) and finds an appropriate inter-cluster topology.

In [I, Page 2]: we propose a cluster-based reconfigurable NoC architecture for parallel neural network accelerators. The PEs, which act as the nodes of the proposed NoC, are arranged as several clusters. The nodes inside a cluster are connected by broadcast-based topology, whereas the clusters themselves are connected by a reconfigurable network.
The reconfigurable architecture can dynamically adapt the inter-cluster topology to the current inter-PE and memory-PE connections.

In [II, Page 2]: Cluster-based reconfigurable NoC. The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1), that allows changing the inter-router connections dynamically. [Figure 1]

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

(BRI: a system where multiple processing elements (PEs) each handle a subset of neurons in parallel, with these neurons mapped onto cluster nodes, is a fundamental representation of a neural inference core or a neuromorphic core. The core as represented in Fig. 1 is also a "cluster" that has reconfigurable interconnection through the router (switch matrix).)

In [V a, Page 4]: Partition grouping. Once the neurons are partitioned, the partitions should be mapped onto the NoC nodes. The mapping is done by a two-step procedure. In the first step, the partitions that should be mapped onto the same cluster are selected and bundled into partition groups. Afterwards, each partition group is assigned to a NoC cluster. Among different approaches, the grouping policy that this paper applies puts adjacent partitions in the same group, so partition groups, and hence clusters, mostly contain neurons from the same layer.
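The partition-grouping step quoted above — bundling adjacent partitions so each cluster mostly holds neurons from the same layer, then assigning each group to a NoC cluster — can be sketched as follows. This is a minimal illustration, not code from Firuzan; the function and parameter names are hypothetical.

```python
def group_partitions(partitions, cluster_size):
    """Bundle adjacent neural-network partitions into groups of at most
    `cluster_size`, then assign each group to one NoC cluster.

    `partitions` is an ordered list of (layer_index, partition_id) pairs;
    adjacency in the list approximates adjacency in the network, so each
    group (and hence each cluster) mostly holds same-layer neurons.
    """
    # Step 1: bundle adjacent partitions into fixed-size groups.
    groups = [partitions[i:i + cluster_size]
              for i in range(0, len(partitions), cluster_size)]
    # Step 2: assign each partition group to a NoC cluster by index.
    return {cluster_id: group for cluster_id, group in enumerate(groups)}

# Example: six partitions drawn from two layers, clusters of 3 nodes each.
parts = [(0, 'p0'), (0, 'p1'), (0, 'p2'), (1, 'p3'), (1, 'p4'), (1, 'p5')]
mapping = group_partitions(parts, cluster_size=3)
# Cluster 0 receives all layer-0 partitions; cluster 1 all layer-1 ones,
# so the PEs of a cluster share common input data, as the quote notes.
```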
This way, the PEs of the same cluster have common input data. It can facilitate traffic management, in that one PE of each cluster can get the data and broadcast it to the others internally.

- a core control configured to direct operations on the neural inference core;

In [IV, Page 4]: we ignore the initialization and control procedures that the host initiates to control the entire memory-side neural information processing and focus on managing the on-chip communication during the inference phase of neural network execution.

In [IV, Page 4]: [Figure 2] As Figure 2 shows, the routers in this design have three bidirectional ports to the bus, local PE/memory, and the adjacent configuration switch. When a router receives a packet, it can take three actions: (1) send the packet to its local processor, (2) broadcast it over the local bus to send it to other nodes of the cluster, and (3) send it to its inter-cluster output port to pass it to other clusters.

- an activation network-on-chip (NOC) router connected via a hopping network of the neural inference processor to respective activation NOC routers of all other neural inference cores of the plurality of neural inference cores, wherein the activation NOC router is configured to:

In [II, Page 2]: The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1), that allows changing the inter-router connections dynamically. [Figure 1]

In [II, Page 3]: A configuration switch consists of several transistor switches, like the switch boxes in an FPGA, which can be configured to set up permanent long links between different routes.
Actually, they can be considered as simple 4×4 crossbars.

In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. [Figure 6]

In [II, Page 3]: By changing the three main parameters of the topology, i.e. cluster size (number of nodes in each cluster), network corridor width (the number of switches between two adjacent clusters at each row and column), and the number of clusters in the network, various reconfigurable NoCs can be implemented.

In [V, Page 5]: in the form of a communication task graph (CTG), as input, and maps each task onto a NoC node in such a way that the total traffic load across all NoC links is minimized. In other words, partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters.

(BRI: within the context of designing or describing a mosaic-style or tiled neural architecture (such as a 2D mesh of heterogeneous cores), specifying the number of nodes per cluster along with the switch connectivity between adjacent clusters in rows and columns provides sufficient information to define the topology and structure of the entire set of neural inference cores.)

- receive, via the hopping network, the input activations, and send, via the hopping network, the output activations;

In [II, Page 2]: The most common form of neural networks is the feedforward Multi-Layer Perceptron (MLP) model; an n-layer MLP consists of one input layer, n-2 intermediate (hidden) layers, and one output layer. Each layer consists of a set of basic processing elements called neurons. An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data.
In [II, Page 2]: Each neuron computes the dot product of the input and weight vectors and passes the result to an activation function to produce the final output. The weight vector of each neuron is determined during an offline or online training phase.

- a model NOC router connected via a 2-D hopping network of the neural inference processor to respective model NOC routers of all other neural inference cores of the plurality of neural inference cores,

In [II, Page 2]: we show that our reconfigurable NoC is a promising architecture to connect PEs to each other and also to memory controllers.

In [II, Page 2]: The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1), that allows changing the inter-router connections dynamically. [Figure 1]

In [II, Page 3]: A configuration switch consists of several transistor switches, like the switch boxes in an FPGA, which can be configured to set up permanent long links between different routes. Actually, they can be considered as simple 4×4 crossbars.

In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. [Figure 6]

In [I, Page 1]: PEs need to communicate to exchange data based on the dataflow model of the neural network (e.g. sending the result of one neuron to all neurons of the next layer), get input data from memory, and get a new set of weights to (re)configure their datapath. The latter case occurs when the target neural network is too large to fit the multiprocessor architecture and different neural network partitions share the PEs in time.
In this case, upon switching to a new neural network partition, the corresponding array of weights should be fetched from memory.

In [II, Page 3]: By changing the three main parameters of the topology, i.e. cluster size (number of nodes in each cluster), network corridor width (the number of switches between two adjacent clusters at each row and column), and the number of clusters in the network, various reconfigurable NoCs can be implemented.

In [V, Page 5]: in the form of a communication task graph (CTG), as input, and maps each task onto a NoC node in such a way that the total traffic load across all NoC links is minimized. In other words, partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters.

(BRI: within the context of designing or describing a mosaic-style or tiled neural architecture (such as a 2D mesh of heterogeneous cores), specifying the number of nodes per cluster along with the switch connectivity between adjacent clusters in rows and columns provides sufficient information to define the topology and structure of the entire set of neural inference cores.)

- wherein the model NOC router is configured to: multicast, via the 2-D hopping network, network model data of the neural inference core to the respective model NOC routers of the other neural inference cores;

In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. [Figure 6]

In [V, Page 6]: Multicast support. If multiple edges of the CTG carry the same data, the algorithm tries to build a multicast tree for them to reduce the link usage.

In [V, Page 6]: all CTG edges of a node that multicast a data item are sorted in the increasing order of the Manhattan distance to their destination clusters.
Then, the path for the edge with the shortest distance is set up by Dijkstra's algorithm. Suppose this path is constructed from cluster Cs to Cd1 and passes through N intermediate nodes IN1 to INN. Intermediate nodes can be either clusters or configuration switches. For the second edge that goes from Cs to Cd2, the shortest-path algorithm not only finds the path with the minimum weight between clusters Cs and Cd2, but also calculates the minimum weight from all of the intermediate nodes of the previous path, i.e. IN1 to INN, to Cd2.

- and receive, via the 2-D hopping network, from the respective model NOC routers of the other neural inference cores, respective network model data of the other neural inference cores;

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

(BRI: a system where multiple processing elements (PEs) each handle a subset of neurons in parallel, with these neurons mapped onto cluster nodes, is a fundamental representation of a neural inference core or a neuromorphic core.)

In [IV, Page 4]: As Figure 2 shows, the routers in this design have three bidirectional ports to the bus, local PE/memory, and the adjacent configuration switch. When a router receives a packet, it can take three actions: (1) send the packet to its local processor, (2) broadcast it over the local bus to send it to other nodes of the cluster, and (3) send it to its inter-cluster output port to pass it to other clusters.
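The incremental multicast-tree construction Firuzan describes — route the nearest destination first, then let later destinations branch off nodes already in the tree rather than taking dedicated paths — can be approximated with a short sketch. This is an illustrative simplification on an unweighted grid (BFS in place of weighted Dijkstra); all names are hypothetical, not from the reference.

```python
from collections import deque

def shortest_path(links, src, dsts):
    """BFS from `src` to the nearest node in the set `dsts` over an
    unweighted adjacency dict `links`; returns the node path src..found."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node in dsts:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]  # src ... nearest node of dsts
        for nxt in links[node]:
            if nxt not in prev:
                prev[nxt] = node
                frontier.append(nxt)
    raise ValueError("destination unreachable")

def multicast_tree(links, source, destinations):
    """Destinations are handled in increasing distance from the source;
    each new path may attach to ANY node already in the tree, which is
    what saves links versus one dedicated path per destination."""
    tree_nodes, tree_links = {source}, set()
    order = sorted(destinations,
                   key=lambda d: len(shortest_path(links, source, {d})))
    for dst in order:
        # BFS outward from dst until it touches the existing tree,
        # i.e. pick the cheapest attachment point on the tree.
        path = shortest_path(links, dst, tree_nodes)
        for a, b in zip(path, path[1:]):
            tree_links.add((b, a))  # orient edges from the tree toward dst
        tree_nodes.update(path)
    return tree_links

# Example on a 3x3 grid: dedicated paths from (0,0) to (0,2) and (1,2)
# would use 2 + 3 = 5 links; the shared tree uses only 3.
grid = {}
for x in range(3):
    for y in range(3):
        grid[(x, y)] = [(x + dx, y + dy)
                        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= x + dx < 3 and 0 <= y + dy < 3]
tree = multicast_tree(grid, (0, 0), [(0, 2), (1, 2)])
assert len(tree) == 3
```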
[Figure 2]

In [IV, Page 4]: Among different approaches, the grouping policy that this paper applies puts adjacent partitions in the same group, so partition groups, and hence clusters, mostly contain neurons from the same layer. This way, the PEs of the same cluster have common input data. It can facilitate traffic management, in that one PE of each cluster can get the data and broadcast it to the others internally.

In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. [Figure 6]

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

In [I, Page 1]: when the target neural network is too large to fit the multiprocessor architecture and different neural network partitions share the PEs in time. In this case, upon switching to a new neural network partition, the corresponding array of weights should be fetched from memory.

In [I, Page 2]: architecture is provided with a mapping and topology construction algorithm that maps different partitions of an input neural network into the NoC clusters (and cluster nodes) and finds an appropriate inter-cluster topology.

(BRI: a mapping and topology construction algorithm that assigns neural network partitions to NoC clusters and determines the inter-cluster topology constitutes a distributed neural network (NN) model.)

In [V, Page 4]: a scheduling scheme is further needed to determine the order of execution of different neural network partitions on the architecture.
In this case, upon switching the partitions, the synaptic weights of the new partition should be read from the memory, which makes the on-chip traffic more complex.

(BRI: assigning specific synaptic weights to that PE, together with the partitioning, is a form of model parallelism.)

- a partial sum NOC interface configured to send first data to and receive second data from the directly adjacent neural inference cores;

In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. To accelerate the algorithm, this step only considers those intermediate nodes of the previous path that lie along one of the shortest paths between Cs and Cd2. If the new path expands the previous path at node IN1 and IN1 is a configuration switch, the internal crossbar of the switch should connect the input from which the path enters to two outputs, each related to one edge. [Figure 6]

In [I, Page 1]: Most neural network NoCs come with some variations of the well-known mesh topology [7-8]. The mesh topology has a regular interconnection structure.

In [I, Page 1]: In addition, as a considerable portion of the traffic of neural networks is in the form of multicast of short messages (to exchange partial results), a NoC that supports multicast routing is critical for a neural network parallel accelerator.
(BRI: partial results are partial sums.)

Firuzan does not explicitly disclose:

- an activation memory configured to store input activations of the neural inference core and output activations of the neural inference core;
- a model memory configured to store a distinct portion of an artificial neural network model assigned to the neural inference core, such that the artificial neural network model is distributed across the plurality of neural inference cores, wherein the artificial neural network model comprises synaptic weights, neuron parameters, and neural network instructions;
- a weight buffer memory that stores a neural network weight matrix;
- a vector matrix multiplier unit configured to generate partial sums from activation vectors of input activations of the activation memory and the neural network weight matrix; a partial sum memory;
- a vector-vector unit to perform vector-to-vector operations on the partial sums using the partial sum memory and the received second data from the directly adjacent neural inference cores;
- and an activation function unit configured to generate the output activations by applying non-linear functions to the partial sums.

However, Modha discloses:

- an activation memory configured to store input activations of the neural inference core and output activations of the neural inference core

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103.
A U×N parameter tensor block contains the U parameters that specify each of the N neuron activation functions that are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.

In [0021]: Multiple neural cores may be tiled in a neural core array. In some embodiments, the array is 2-dimensional.

In [0025]: Chip 200 includes computation logic 202, which may include one or more neural cores configured to implement intermediate processing layers within a multi-layer neural network. Chip 200 includes model memory 203 for storing the neural network model, which may include configuration parameters for computation logic 202.

- model memory configured to store a distinct portion of an artificial neural network model assigned to the neural inference core, such that the artificial neural network model is distributed across the plurality of neural inference cores

In [0025]: Chip 200 includes computation logic 202, which may include one or more neural cores configured to implement intermediate processing layers within a multi-layer neural network. Chip 200 includes model memory 203 for storing the neural network model, which may include configuration parameters for computation logic 202.

In [0035]: computation is distributed among multiple cores 621. Data memory and model memory are also distributed among multiple cores 621 without having corresponding chip-level entities. Accordingly, input 611 and output 612 are coupled via the on-chip network to the multiple data memory entities on the various cores 621. Likewise, input 631 is coupled via the on-chip network to the multiple model memory entities on the various cores 621. Controller logic and instruction memory are partially distributed among multiple cores 621. Accordingly, there is both chip-level controller logic 604 and instruction memory 605 and corresponding per-core entities.
In [0046]: In some embodiments the computation implemented by each neural core may be reconfigured online by loading a different set of parameters from the neural network model memory. As noted above, the neural network model memory may be local to each neural core.

(BRI: The model memory that stores a distinct portion of an artificial neural network model (specifically weights, biases, and parameters) for a dedicated inference core is typically referred to as local scratchpad memory, or local memory that stores a distinct portion of an artificial neural network (ANN) model.)

- wherein the artificial neural network model comprises synaptic weights, neuron parameters, and neural network instructions

In [0045]: In some embodiments, the parameters required to compute each intermediate layer are stored in the neural network model memory. For example, in some embodiments, the parameters include synaptic weights or synaptic activation functions.

In [005]: In various embodiments, communication with neural cores is provided through one or more on-chip networks. In various embodiments, the on-chip network is used to distribute the neural network model from centralized model memory to the neural cores. In various embodiments, the on-chip network is used to distribute the controller instructions from centralized instruction memory to the neural cores. In various embodiments, the on-chip network is used to distribute input data to the neural cores and to aggregate output data from the neural cores.

In [0055]: the controller is hierarchical, having components that execute instructions at multiple levels of granularity (e.g., centralized chip-level, distributed core-level, and zero or more levels in between).
In some embodiments, centralized controller components execute chip-level instructions to distribute core-level instructions to the controller components in each neural core.

In [0056]: Chip-level and core-level instructions ensure that the entire chip operation and each core's operations are pipelined for very high throughput. In various embodiments, the instruction set architecture includes control instructions to orchestrate the chip's operation. For example, instructions may include generating neural network memory addresses and read/write operations, specifying the computation operations to be executed on the data, specifying the routing of data between cores and between cores and memories, generating input, output, and data memory addresses, and read/write operations.

- a weight buffer memory that stores a neural network weight matrix

In [0019]: Each neural network layer is associated with a weight tensor, parameter tensor, input tensor, output tensor, and intermediate tensor. The weight tensor contains all of the weights that connect inputs to the layer.

- a vector matrix multiplier unit configured to generate partial sums from activation vectors of input activations of the activation memory and the neural network weight matrix; a partial sum memory

In [0020]: neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103.

In [0019]: The intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums.
(BRI: intermediate tensors can contain data that a layer produces as intermediate computations, including partial sums generated by a vector-matrix multiplier unit from input activations and weight matrices.)

- a vector-vector unit configured to perform vector-to-vector operations on the partial sums using the partial sum memory and the received second data from the directly adjacent neural inference cores

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103. A U×N parameter tensor block contains the U parameters that specify each of the N neuron activation functions that are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.

In [0019]: the intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums.

(BRI: within the context of a tensor multiplication, vector-to-vector operations are performed, and the stored intermediate blocks are partial sum memory.)

- and an activation function unit configured to generate the output activations by applying non-linear functions to the partial sums

In [0015]: An artificial neuron is a mathematical function whose output is a nonlinear function of a linear combination of its inputs. Two neurons are connected if the output of one is an input to the other. A weight is a scalar value encoding the strength of the connection between the output of one neuron and the input of another neuron.
In [0016]: A neuron computes its output, called an activation, by applying a nonlinear activation function to a weighted sum of its inputs. A weighted sum is an intermediate result computed by multiplying each input with the corresponding weight and accumulating the products. A partial sum is a weighted sum of a subset of inputs. A weighted sum of all inputs may be computed in stages by accumulating one or more partial sums.

The examiner interprets the theme of the invention to be a neural network inference engine with a plurality of distributed cores using a NoC router implemented as a 2-D hopping network to enable adjacency of cores and provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2-D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function unit using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

In regard to claim 3. (Currently Amended) Firuzan discloses:

- at least model NOC router network further configured to receive the distinct portion of the artificial neural network model from one of the other neural inference cores

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture.
The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

In [V, Page 5]: partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters.

In [III, Page 3]: The EMBRACE architecture is a 2-D array of interconnected neural tiles surrounded by I/O blocks. It adopts H-NoC, a hierarchical mesh-based topology that consists of three layers: 10 neural cells are connected to a router at the first layer to form a neuron module, up to 10 neuron modules are connected to an upper router to form a tile, and up to 4 tiles are connected to an upper router at the third layer to form a cluster. If a neural network does not fit in one cluster, multiple clusters are interconnected by a mesh at a higher level.

In [VI, Page 7]: the dragonfly has 256 PEs arranged as 16 groups, each containing 16 PEs. For the larger neural networks (NN1 and NN2), we use 1024 PEs by increasing the cluster count of HNOC and reconfigurable NoC and the group count of dragonfly by a factor of 4. For both network sizes, if the neural network is too large to fit the NoC, the neural network is partitioned and its execution is scheduled on the NoC.

In regard to claim 4. (Currently Amended) Firuzan discloses:

- another distinct portion of the artificial neural network model to one of the

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.
In [V, Page 5]: partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters. In [III, Page 3]: The EMBRACE architecture is a 2-D array of interconnected neural tiles surrounded by I/O blocks. It adopts H-NoC, a hierarchical mesh-based topology, that consists of three layers; 10 neural cells are connected to a router at the first layer to form a neuron module, up to 10 neuron modules are connected to an upper router to form a tile, and up to 4 tiles are connected to an upper router at the third layer to form a cluster. If a neural network does not fit in one cluster, multiple clusters are interconnected by a mesh at a higher level. In [VI, Page 7]: the dragonfly has 256 PEs arranged as 16 groups, each containing 16 PEs. For the larger neural networks (NN1 and NN2), we use 1024 PEs by increasing the cluster count of HNOC and reconfigurable NoC and the group count of dragonfly by a factor of 4. For both network sizes, if the neural network is too large to fit the NoC, the neural network is partitioned and its execution is scheduled on the NoC. Firuzan does not explicitly disclose: - wherein the neural inference core further stores an entirety of the artificial neural network model, However, Modha discloses: - wherein the neural inference core further stores an entirety of the artificial neural network model, In [0025]: With the memory 202, 201, 205 provided on chip 200 for the neural network model, transient data, and controller instructions, there is no need for off-chip memory access during computation. In [0025]: Accordingly, chip 200 is fast and energy-efficient in comparison to alternative approaches that do not provide such on-chip memory.
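The design-flow steps quoted from Firuzan [V, Pages 4-5] (partition by per-PE capacity, then map heavily communicating groups onto adjacent clusters) can be sketched as follows. This is an illustrative sketch by the editor, not code from the reference; the function names, the greedy placement policy, and the traffic-dictionary representation are all assumptions.

```python
# Hypothetical sketch of the two quoted design-flow steps:
# (1) partition a layer's neurons by per-PE parallel capacity, and
# (2) place partition groups with heavier mutual traffic on
#     consecutive (adjacent) cluster indices.

def partition_neurons(num_neurons, pe_capacity):
    """Split neuron indices into partitions of at most pe_capacity each."""
    return [list(range(i, min(i + pe_capacity, num_neurons)))
            for i in range(0, num_neurons, pe_capacity)]

def map_groups_to_clusters(groups, traffic):
    """Greedy placement: order groups by total traffic volume so heavy
    communicators land on consecutive cluster indices."""
    order = sorted(range(len(groups)),
                   key=lambda g: -sum(traffic.get((g, h), 0) +
                                      traffic.get((h, g), 0)
                                      for h in range(len(groups))))
    return {g: cluster for cluster, g in enumerate(order)}

parts = partition_neurons(10, 4)   # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A real mapper would minimize total link load over the NoC topology (as the reference's CTG-based mapping does); the greedy ordering above only conveys the idea that high-traffic groups should end up near each other.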
The examiner interprets that the theme of the invention is to provide a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores to provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha, as the combination can provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]). In regard to claim 5. (Currently Amended) Firuzan discloses: - wherein the sent first comprises at least one partial sum of the partial sums In [II, Page 2]: An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data. Consequently, all neurons in any given layer i receive the same set of inputs from layer i-1. Associated with each input, each neuron keeps a weight that specifies the impact of the input in the final output. Each neuron computes the dot product of the input and weight vectors and passes the result to an activation function to produce the final output. In regard to claim 6.
(Currently Amended) Firuzan does not explicitly disclose: - a[[t]] However, Modha discloses: - a[[t]] In [0052]: communication with neural cores is provided through one or more on-chip network. In [0052]: in various embodiments, the on-chip network is used to distribute the controller instructions from centralized instruction memory to the neural cores (BRI: an on-chip network used to distribute controller instructions from a centralized instruction memory to neural cores represents an instruction delivery network). The examiner interprets that the theme of the invention is to provide a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores to provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha, as the combination can provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]). In regard to claim 7: (Currently Amended) Firuzan does not explicitly disclose: - wherein the weight buffer memory is further configured to receive and store the synaptic weights associated with distinct portion from the model NOC router However, Modha discloses: - wherein the weight buffer memory is further configured to receive and store the synaptic weights associated with distinct portion from the model NOC router In [0021]: Multiple neural cores may be tiled in a neural core array.
In some embodiments, the array is 2-dimensional. In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102. In [0038]: Each neural core implements a part of the larger neural network model for a given problem. Each neural core receives a portion of the overall chip input, and a portion of the overall neural network model. In [0040]: Individual portions of the overall neural network model are distributed to the neural cores. The examiner interprets that the theme of the invention is to provide a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores to provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha, as the combination can provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).
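The computation quoted above from Modha [0020] (an M×1 input block multiplied with an M×N weight block) and the per-neuron dot product quoted from Firuzan [II, Page 2] can both be illustrated by one small sketch. This is editor-supplied illustration, not code from either reference; function names are assumptions.

```python
# Minimal sketch of the weighted-sum ("partial sum") computation the
# cited passages describe: each neuron's dot product of input and weight
# vectors, and the vector-matrix form yielding one partial sum per output.

def neuron_partial_sum(inputs, weights):
    """Dot product of an input vector with one neuron's weight vector."""
    return sum(x * w for x, w in zip(inputs, weights))

def vector_matrix_partial_sums(inputs, weight_matrix):
    """M-element input times an MxN weight matrix -> N partial sums."""
    return [neuron_partial_sum(inputs, col) for col in zip(*weight_matrix)]

x = [1.0, 2.0]
W = [[1.0, 0.0],   # row i = weights from input i to each of N outputs
     [0.5, 2.0]]
vector_matrix_partial_sums(x, W)   # -> [2.0, 4.0]
```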
In regard to claim 8: (Currently Amended) Firuzan does not explicitly disclose: - wherein the vector matrix multiplier unit generates the partial sums via vector matrix multiplication of one or more of the synaptic weights from [[its]] the weight buffer and one or more of the activations vectors from [[its]] the activation memory. However, Modha discloses: - wherein the vector matrix multiplier unit generates the partial sums via vector matrix multiplication of one or more of the synaptic weights from [[its]] the weight buffer and one or more of the activations vectors from [[its]] the activation memory. In [0019]: Each neural network layer is associated with a weight tensor, parameter tensor, input tensor, output tensor, and intermediate tensor. In [0019]: The intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums. In [0045]: In some embodiments, the parameters required to compute each intermediate layer are stored in the neural network model memory. For example, in some embodiments, the parameters include synaptic weights or synaptic activation functions. In regard to claim 9. (Currently Amended) Firuzan discloses: - wherein the vector-vector unit generates partial sum vectors from the vector-to-vector operations In [II, Page 2]: An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data. Consequently, all neurons in any given layer i receive the same set of inputs from layer i-1. Associated with each input, each neuron keeps a weight that specifies the impact of the input in the final output. Each neuron computes the dot product of the input and weight vectors and passes the result to an activation function to produce the final output. The weight vector of each neuron is determined during an offline or online training phase.
The focus of this paper is on the traffic management of the execution (inference) phase of the neural networks. (A dot product is a vector-to-vector operation.) In regard to claim 10: (Currently Amended) Firuzan does not explicitly disclose: - wherein the activation function unit stores output activations in the activation memory. However, Modha discloses: - wherein the activation function unit stores output activations in the activation memory. In [0045]: the parameters required to compute each intermediate layer are stored in the neural network model memory. For example, in some embodiments, the parameters include synaptic weights or synaptic activation functions. (BRI: the model memory acting like an activation memory) The examiner interprets that the theme of the invention is to provide a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores to provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha, as the combination can provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]). In regard to claim 11. (Currently Amended) Firuzan does not explicitly disclose: - further comprising a However, Modha discloses: - further comprising a In [0055]: in various embodiments, controller logic is provided on-chip.
In some embodiments, the control logic is implemented as a programmable controller that orchestrates the entire chip's operation. In [0055]: In some embodiments, the controller is distributed among the neural cores, each executing a programmable microcode at the core level. In some embodiments, the controller is hierarchical, having components that execute instructions at multiple levels of granularity. In [0055]: In some embodiments, centralized controller components execute chip-level instructions to distribute core-level instructions to the controller components in each neural core. (BRI: control logic that orchestrates chip core operations at various granularities, ranging from microcode (low-level) to core-level management, functions as an internal supervisor) The examiner interprets that the theme of the invention is to provide a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores to provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha, as the combination can provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).
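The hierarchical control quoted above from Modha [0055] (centralized components executing chip-level instructions that distribute core-level instructions to per-core controllers) can be sketched as a simple fan-out. This is an editor-supplied toy model under assumed data structures, not the reference's actual controller design.

```python
# Toy sketch of hierarchical instruction distribution: a chip-level
# program is split into per-core instruction streams. An instruction
# without an explicit "cores" field is broadcast to every core.
# The dict-based instruction format is an illustrative assumption.

def distribute_instructions(chip_program, num_cores):
    """Split a chip-level program into per-core instruction streams."""
    per_core = {c: [] for c in range(num_cores)}
    for instr in chip_program:
        targets = instr.get("cores", range(num_cores))  # default: all cores
        for c in targets:
            per_core[c].append(instr["op"])
    return per_core

streams = distribute_instructions(
    [{"op": "load_weights"}, {"op": "mac", "cores": [0]}], num_cores=2)
# -> {0: ["load_weights", "mac"], 1: ["load_weights"]}
```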
In regard to claim 12: (Currently Amended) Firuzan discloses: - comprising a frame buffer configured to temporarily store the input activations and output activations In [I, Page 2]: the majority of the MLP and CNN models used in various smart services of Google feature between 5M to 100M synaptic weights [14] which impose both excessive memory capacity and memory/network bandwidth. In this paper, we implement the reconfigurable NoC on the logic layer of a 3D memory-on-logic structure. In this scheme, multiple memory layers are stacked on top of a logic layer. All layers are divided into multiple partitions and the vertically adjacent memory and logic form an independent memory channel. The logic layer of each channel accommodates a memory controller and multiple PEs. The proposed topology considers the PEs inside a channel as a cluster and interconnects the clusters by a reconfigurable topology. (BRI: A frame buffer, in the context of deep learning hardware and neural network acceleration, is a specialized, high-bandwidth block of memory (often on-chip SRAM or dedicated VRAM) used to store the complete 3D feature maps (input/output activations) of a convolutional layer. The 3D memory-on-logic represents a frame buffer) In regard to claim 13. (Currently Amended) Firuzan does not explicitly disclose: - wherein the supervisor is configured to be accessed by a host via a memory-mapped interface. However, Modha discloses: - wherein the supervisor is configured to be accessed by a host via a memory-mapped interface. In [0057]: Referring now to FIG. 8, a method of operating a neural inference chip is illustrated according to embodiments of the present disclosure. At 801, input data are written to a second memory of the neural inference chip. In some embodiments, input data are written by a host of the neural inference chip. At 802, the input data are provided to a plurality of neural cores of the neural inference chip.
For each of a plurality of layers of a neural network defined by a neural network model in a first memory of the neural inference chip: at 803, a portion of the neural network model is provided from the first memory to the plurality of neural cores; at 804, a portion of instructions are provided from a fourth memory of the neural inference chip to the neural cores; and, at 805, transforming the input data into output data by the plurality of neural cores. At 806, the output data from the plurality of neural cores are aggregated. At 807, the aggregated output is written to the second memory. In some embodiments, intermediate results are communicated among the plurality of neural cores. In some embodiments, the aggregated output data are read from the second memory by a host of the neural inference chip. In [0032]: With reference now to FIG. 5, a neural inference chip according to embodiments of the present disclosure is depicted. Chip 500 includes a data memory 501 for storing data during operation of the chip. Memory 501 accommodates input 511 and output 512, which in some embodiments are addressable from off-chip. In [0032]: Chip 500 includes instruction memory 505 for storing instructions that are executed by the control logic. Instruction memory 505 includes input 551, which in some embodiments is addressable from off-chip. An on-chip network 506 is provided for interconnecting these components. (BRI: memory that is addressable from off-chip, particularly when mapped into the address space of a host CPU, represents a Host Memory-Mapped Interface (often called MMIO - Memory-Mapped I/O). 
A "supervisor" component (often a central management core or Direct Memory Access controller) is configured to manage these data transfers, and this supervisor is accessible to the host via a memory-mapped interface) In regard to claim 14: (Currently Amended) Firuzan discloses: - wherein the plurality of neural inference cores [[is]] are organized in a grid of two or more dimensions with at least one row and at least one column. In [III, Page 3]: A mesh NoC on the logic die is used to interconnect PEs. Our paper uses the same platform, but shows that a reconfigurable topology can yield better power-efficiency than the mesh and other conventional topologies. In [VI, Page 7]: the reconfigurable cluster-based NoC has 32 clusters of size 8 and corridor width of 2. H-NoC has 8 neurons in each neuron facility, 8 neuron facilities in each tile, and 8 tiles in each cluster, and 4 clusters. The mesh is 8×8 with a concentration degree of 4. In [II, Page 3]: By changing the three main parameters of the topology, i.e. cluster size (number of nodes in each cluster), network corridor width (the number of switches between two adjacent clusters at each row and column), and the number of clusters in the network, various reconfigurable NoCs can be implemented. 
In regard to claim 15: (Currently Amended) Firuzan discloses: - A method comprising: executing, by a neural inference processor, using a plurality of neural inference cores of the neural inference processor, an artificial neural network model to perform an inferencing function, wherein each neural inference core[[s,]] of the plurality of neural inference cores is connected to directly adjacent neural inference cores on the neural inference processor, and wherein each neural inference core comprises: In [I, Page 2]: architecture is provided with a mapping and topology construction algorithm that maps different partitions of an input neural network into the NoC clusters (and cluster nodes) and finds an appropriate inter-cluster topology. In [I, Page 2]: we propose a cluster-based reconfigurable NoC architecture for parallel neural network accelerators. The PEs, which act as the nodes of the proposed NoC, are arranged as several clusters. The nodes inside a cluster are connected by broadcast-based topology, whereas the clusters themselves are connected by a reconfigurable network. The reconfigurable architecture can dynamically adapt the inter-cluster topology to the current inter-PE and memory-PE connections. In [II, Page 2]: Cluster-based reconfigurable NoC The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1) that allow changing the inter-router connections dynamically. In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture.
The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters. (BRI: a system where multiple processing elements (PEs) each handle a subset of neurons in parallel, with these neurons mapped onto cluster nodes, is a fundamental representation of a neural inference core or a neuromorphic core. The core as represented in Fig 1 is also a “cluster” that has reconfigurable interconnection through the router (switch matrix)) In [V.a, Page 4]: Partition grouping. Once the neurons are partitioned, the partitions should be mapped onto the NoC nodes. The mapping is done by a two-step procedure. In the first step, the partitions that should be mapped onto the same cluster are selected and bundled into partition groups. Afterwards, each partition group is assigned to a NoC cluster. Among different approaches, the grouping policy that this paper applies puts adjacent partitions in the same group, so partition groups, and hence clusters, mostly contain neurons from the same layer. This way, the PEs of the same cluster have common input data. It can facilitate traffic management, in that one PE of each cluster can get the data and broadcast it to the others internally. - a core control that directs operations on the neural inference core; In [IV, Page 4]: we ignore the initialization and control procedures that the host initiates to control the entire memory-side neural information processing and focus on managing the on-chip communication during the inference phase of neural network execution. In [IV, Page 4]: As Figure 2 shows, the routers in this design have three bidirectional ports to the bus, local PE/memory, and the adjacent configuration switch.
When a router receives a packet, it can take three actions: (1) send the packet to its local processor, (2) broadcast it over the local bus to send it to other nodes of the cluster, and (3) send it to its inter-cluster output port to pass it to other clusters. - an activation network-on-chip (NOC) router connected via a hopping network of the neural inference processor to respective activation NOC routers of all other neural inference cores of the plurality of neural inference cores, wherein the activation NOC router is configured to: In [II, Page 2]: The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1) that allow changing the inter-router connections dynamically. In [II, Page 3]: A Configuration switch consists of several transistor switches, like the switch boxes in an FPGA, which can be configured to set up permanent long links between different routes. Actually, they can be considered as simple 4×4 crossbars. In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. In [II, Page 3]: By changing the three main parameters of the topology, i.e. cluster size (number of nodes in each cluster), network corridor width (the number of switches between two adjacent clusters at each row and column), and the number of clusters in the network, various reconfigurable NoCs can be implemented. In [V, Page 5]: in the form of a communication task graph (CTG), as input and maps each task onto a NoC node in such a way that the total traffic load across all NoC links is minimized.
In other words, partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters. (BRI: within the context of designing or describing a mosaic-style or tiled neural architecture (such as a 2D mesh of heterogeneous cores), specifying the number of nodes per cluster along with the switch connectivity between adjacent clusters in rows and columns provides sufficient information to define the topology and structure of the entire set of neural inference cores) - receives, via the hopping network, the input activations, and sends, via the hopping network, the output activations; In [II, Page 2]: The most common form of neural networks is the feedforward Multi-Layer Perceptron (MLP) model; an n-layer MLP consists of one input layer, n-2 intermediate (hidden layers), and one output layer. Each layer consists of a set of basic processing elements called neurons. An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data. In [II, Page 2]: Each neuron computes the dot product of the input and weight vectors and passes the result to an activation function to produce the final output. The weight vector of each neuron is determined during an offline or online training phase. - a model NOC router connected via a 2-D hopping network of the neural inference processor to respective model NOC routers of all other neural inference cores of the plurality of neural inference cores, wherein the model NOC router: In [II, Page 2]: we show that our reconfigurable NoC is a promising architecture to connect PEs to each other and also to memory controllers. In [II, Page 2]: The reconfigurable topology adopted in this paper is a modified version of the baseline reconfigurable topology we presented in a prior work [11]. 
In the reconfigurable NoC in [11], routers (squares in Figure 1) are not connected directly to each other, but through a simple logic, called configuration switch (circles in Figure 1) that allow changing the inter-router connections dynamically. In [II, Page 3]: A Configuration switch consists of several transistor switches, like the switch boxes in an FPGA, which can be configured to set up permanent long links between different routes. Actually, they can be considered as simple 4×4 crossbars. In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. In [I, Page 1]: PEs need to communicate to exchange data based on the dataflow model of the neural network (e.g. sending the result of one neuron to all neurons of the next layer), get input data from memory, and get a new set of weights to (re)configure their datapath. The latter case occurs when the target neural network is too large to fit the multiprocessor architecture and different neural network partitions share the PEs in time. In this case, upon switching to a new neural network partition, the corresponding array of weights should be fetched from memory. In [II, Page 3]: By changing the three main parameters of the topology, i.e. cluster size (number of nodes in each cluster), network corridor width (the number of switches between two adjacent clusters at each row and column), and the number of clusters in the network, various reconfigurable NoCs can be implemented. In [V, Page 5]: in the form of a communication task graph (CTG), as input and maps each task onto a NoC node in such a way that the total traffic load across all NoC links is minimized.
In other words, partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters. (BRI: within the context of designing or describing a mosaic-style or tiled neural architecture (such as a 2D mesh of heterogeneous cores), specifying the number of nodes per cluster along with the switch connectivity between adjacent clusters in rows and columns provides sufficient information to define the topology and structure of the entire set of neural inference cores) - multicasts, via the 2-D hopping network, network model data of the neural inference core to the respective model NOC routers of the other neural inference cores, In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path. In [V, Page 6]: Multicast support. If multiple edges of the CTG carry the same data, the algorithm tries to build a multicast tree for them to reduce the link usage. In [V, Page 6]: all CTG edges of a node that multicast a data are sorted in the increasing order of the Manhattan distance to their destination clusters. Then, the path for the edge with the shortest distance is set up by Dijkstra’s algorithm. Suppose this path is constructed from cluster Cs to Cd1 and passes through N intermediate nodes IN1 to INN. Intermediate nodes can be either cluster or configuration switch. For the second edge that goes from Cs to Cd2, the shortest path algorithm not only finds the path with the minimum weight between clusters Cs and Cd2, but also calculates the minimum weight from all of the intermediate nodes of the previous path, i.e. IN1 to INN, to Cd2.
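The multicast-tree construction quoted above from Firuzan [V, Page 6] (destinations sorted by Manhattan distance, with later branches allowed to start from nodes already on the tree) can be illustrated with a simplified sketch. This editor-supplied version replaces the Dijkstra-based path search with direct Manhattan distances; it only demonstrates why a shared tree uses fewer links than dedicated paths.

```python
# Hedged sketch of the multicast-tree idea: grow branches from the
# nearest node already in the tree so shared hops are paid for once.
# This is a simplification of the reference's Dijkstra-based scheme.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def multicast_tree_links(source, destinations):
    """Total links used when each branch may start at any tree node."""
    tree, links = [source], 0
    for dst in sorted(destinations, key=lambda d: manhattan(source, d)):
        nearest = min(tree, key=lambda n: manhattan(n, dst))
        links += manhattan(nearest, dst)
        tree.append(dst)
    return links

dedicated = sum(manhattan((0, 0), d) for d in [(3, 0), (3, 2)])  # 3 + 5 = 8
shared = multicast_tree_links((0, 0), [(3, 0), (3, 2)])          # 3 + 2 = 5
```

Here the branch to (3, 2) reuses the node (3, 0) already in the tree, mirroring the Figure 6 example where a multicast tree from node A to nodes B and C saves links over two dedicated paths.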
- and receives via the 2-D hopping network, from the respective model NOC routers of the other neural inference cores respective network model data of the other neural inference cores; In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters. (BRI: a system where multiple processing elements (PEs) each handle a subset of neurons in parallel, with these neurons mapped onto cluster nodes, is a fundamental representation of a neural inference core or a neuromorphic core) In [IV, Page 4]: As Figure 2 shows, the routers in this design have three bidirectional ports to the bus, local PE/memory, and the adjacent configuration switch. When a router receives a packet, it can take three actions: (1) send the packet to its local processor, (2) broadcast it over the local bus to send it to other nodes of the cluster, and (3) send it to its inter-cluster output port to pass it to other clusters. In [IV, Page 4]: Among different approaches, the grouping policy that this paper applies puts adjacent partitions in the same group, so partition groups, and hence clusters, mostly contain neurons from the same layer. This way, the PEs of the same cluster have common input data. It can facilitate traffic management, in that one PE of each cluster can get the data and broadcast it to the others internally. In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path.
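The three router actions quoted above from Firuzan [IV, Page 4] (deliver to the local PE, broadcast on the cluster bus, or forward toward another cluster) can be modeled as a small decision function. This is a toy illustration by the editor; the packet fields and return labels are assumptions, not structures from the reference.

```python
# Toy model of the three router actions: (1) local PE delivery,
# (2) local-bus broadcast within the cluster, (3) inter-cluster forward.

def route_packet(packet, router_cluster, router_node):
    """Decide one of the three quoted actions for an incoming packet."""
    if packet["cluster"] != router_cluster:
        return "inter-cluster"           # (3) pass it to other clusters
    if packet.get("broadcast"):
        return "local-bus-broadcast"     # (2) all nodes of this cluster
    if packet["node"] == router_node:
        return "local-pe"                # (1) this router's own PE
    return "local-bus-broadcast"         # reach a sibling node via the bus

route_packet({"cluster": 2, "node": 5}, router_cluster=1, router_node=0)
# -> "inter-cluster"
```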
In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture. The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters. In [I, Page 1]: when the target neural network is too large to fit the multiprocessor architecture and different neural network partitions share the PEs in time. In this case, upon switching to a new neural network partition, the corresponding array of weights should be fetched from memory. In [I, Page 2]: architecture is provided with a mapping and topology construction algorithm that maps different partitions of an input neural network into the NoC clusters (and cluster nodes) and finds an appropriate inter-cluster topology. (BRI: a mapping and topology construction algorithm that assigns neural network partitions to NoC clusters and determines the inter-cluster topology constitutes a distributed neural network (NN) model.) In [V, Page 4]: a scheduling scheme is further needed to determine the order of execution of different neural network partitions on the architecture. In this case, upon switching the partitions, the synaptic weights of the new partition should be read from the memory, which makes the on-chip traffic more complex. (BRI: assigning specific synaptic weights to that PE is a form of model parallelism and partitioning) - a partial sum NOC interface that sends first data to and receives second data from the directly adjacent neural inference cores; In [V, Page 6]: Figure 6 shows how a multicast tree from node A to nodes B and C can save links, compared to the case where each communication has its dedicated path.
To accelerate the algorithm, this step only considers those intermediate nodes of the previous path that lie along one of the shortest paths between Cs and Cd2. If the new path expands the previous path at node IN1 and IN1 is a configuration switch, the internal crossbar of the switch should connect the input from which the path enters to two outputs, each related to one edge.

[Image: media_image3.png (greyscale)]

In [1, Page 1]: Most neural network NoCs come with some variations of the well-known mesh topology [7-8]. The mesh topology has a regular interconnection structure.

In [1, Page 1]: In addition, as a considerable portion of the traffic of neural networks is in the form of multicast of short messages (to exchange partial results), a NoC that supports multicast routing is critical for a neural network parallel accelerator.

In [II, Page 2]: An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data. Consequently, all neurons in any given layer i receive the same set of inputs from layer i-1. Associated with each input, each neuron keeps a weight that specifies the impact of the input in the final output. Each neuron computes the dot product of the input and weight vectors and passes the result to an activation function to produce the final output. The weight vector of each neuron is determined during an offline or online training phase. The focus of this paper is on the traffic management of the execution (inference) phase of the neural networks. (BRI: each neuron in a neural network computes the dot product of the input vector and the weight vector, which acts as a weighted sum of inputs. This weighted sum represents a "partial sum".
Within the iterative, layer-to-layer computation of dot products, a partial sum memory is used to compute the next layer's partial sums.)

Firuzan does not explicitly disclose:

- an activation memory that stores input activations of the neural inference core and[[,]] output activations of the neural inference core
- model memory that stores a distinct portion of an artificial neural network model assigned to the neural inference core, such that the artificial neural network model is distributed across the plurality of neural inference cores,
- wherein the artificial neural network model comprises[[ing]] synaptic weights, neuron parameters, and neural network instructions
- a weight buffer memory that stores a neural network weight matrix;
- a vector matrix multiplier that generates partial sums from activation vectors of input activations of the activation memory and the neural network weight matrix; a partial sum memory
- a vector-vector unit that performs vector-to-vector operations on the partial sums using the partial sum memory and the received second data from the directly adjacent neural inference cores;
- and an activation function that generates the output activations by applying non-linear functions to the partial sums

However, Modha discloses:

- an activation memory that stores input activations of the neural inference core and[[,]] output activations of the neural inference core

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103.
A U×N parameter tensor block contains the U parameters that specify each of the N neuron activation functions that are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.

In [0021]: Multiple neural cores may be tiled in a neural core array. In some embodiments, the array is 2-dimensional.

In [0025]: Chip 200 includes computation logic 202, which may include one or more neural cores configured to implement intermediate processing layers within a multi-layer neural network. Chip 200 includes model memory 203 for storing the neural network model, which may include configuration parameters for computation logic 202.

- model memory that stores a distinct portion of an artificial neural network model assigned to the neural inference core, such that the artificial neural network model is distributed across the plurality of neural inference cores,

In [0025]: Chip 200 includes computation logic 202, which may include one or more neural cores configured to implement intermediate processing layers within a multi-layer neural network. Chip 200 includes model memory 203 for storing the neural network model, which may include configuration parameters for computation logic 202.

In [0035]: computation is distributed among multiple cores 621. Data memory and model memory are also distributed among multiple cores 621 without having corresponding chip-level entities. Accordingly, input 611 and output 612 are coupled via the on-chip network to the multiple data memory entities on the various cores 621. Likewise, input 631 is coupled via the on-chip network to the multiple model memory entities on the various cores 621. Controller logic and instruction memory are partially distributed among multiple cores 621. Accordingly, there is both chip-level controller logic 604 and instruction memory 605 and corresponding per-core entities.
In [0046]: In some embodiments the computation implemented by each neural core may be reconfigured online by loading a different set of parameters from the neural network model memory. As noted above, the neural network model memory may be local to each neural core. (BRI: The model memory that stores a distinct portion of an artificial neural network model (specifically weights, biases, and parameters) for a dedicated inference core is typically referred to as local scratchpad memory or local memory that stores a distinct portion of an artificial neural network (ANN) model)

- wherein the artificial neural network model comprises[[ing]] synaptic weights, neuron parameters, and neural network instructions

In [0045]: In some embodiments, the parameters required to compute each intermediate layer are stored in the neural network model memory. For example, in some embodiments, the parameters include synaptic weights or synaptic activation functions.

In [005]: In various embodiments, communication with neural cores is provided through one or more on-chip network. In various embodiments, the on-chip network is used to distribute the neural network model from centralized model memory to the neural cores. In various embodiments, the on-chip network is used to distribute the controller instructions from centralized instruction memory to the neural cores. In various embodiments, the on-chip network is used to distribute input data to the neural cores and to aggregate output data from the neural cores.

In [0055]: the controller is hierarchical, having components that execute instructions at multiple levels of granularity (e.g., centralized chip-level, distributed core-level, and zero or more levels in between).
In some embodiments, centralized controller components execute chip-level instructions to distribute core-level instructions to the controller components in each neural core.

In [0056]: Chip-level and core-level instructions ensure that the entire chip operation and each core's operations are pipelined for very high throughput. In various embodiments, the instruction set architecture includes control instructions to orchestrate the chip's operation. For example, instructions may include generating neural network memory addresses and read/write operations, specifying the computation operations to be executed on the data, specifying the routing of data between cores and between cores and memories, generating input, output, and data memory addresses, and read/write operations.

- a weight buffer memory that stores a neural network weight matrix;

In [0019]: Each neural network layer is associated with a weight tensor, parameter tensor, input tensor, output tensor, and intermediate tensor. The weight tensor contains all of the weights that connect inputs to the layer.

- a vector matrix multiplier unit that generates partial sums from activation vectors of input activations of the activation memory and the neural network weight matrix; a partial sum memory;

In [0020]: neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103.

In [0019]: The intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums.
(BRI: intermediate tensors can contain data that a layer produces as intermediate computations, including partial sums generated by a vector-matrix multiplier unit from input activations and weight matrices)

- a vector-vector unit that performs vector-to-vector operations on the partial sums using the partial sum memory and the received second data from the directly adjacent neural inference cores;

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103. A U×N parameter tensor block contains the U parameters that specify each of the N neuron activation functions that are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.

In [0019]: the intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums. (Within the context of a tensor multiplication, vector-to-vector operations are performed and the stored intermediate blocks serve as the partial sum memory)

- and an activation function that generates the output activations by applying non-linear functions to the partial sums

In [0015]: An artificial neuron is a mathematical function whose output is a nonlinear function of a linear combination of its inputs. Two neurons are connected if the output of one is an input to the other. A weight is a scalar value encoding the strength of the connection between the output of one neuron and the input of another neuron.
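The computation cited from Modha ([0019], [0020]) can be sketched in plain Python: an M-element input vector is multiplied by an M×N weight block, the products are accumulated into an N-element intermediate block of partial sums, partial sums can be further accumulated into a full weighted sum, and a nonlinear activation produces the output activations. All function names below are illustrative assumptions, not taken from the references.

```python
import math

def vector_matrix_partial_sums(inputs, weights):
    """M inputs x (M x N) weight block -> N partial sums (the intermediate block)."""
    n = len(weights[0])
    partial = [0.0] * n
    for i, x in enumerate(inputs):
        for j in range(n):
            partial[j] += x * weights[i][j]
    return partial

def accumulate(*partial_sum_blocks):
    """A weighted sum of all inputs may be computed in stages by accumulating
    one or more partial-sum blocks elementwise (cf. Modha [0016])."""
    return [sum(vals) for vals in zip(*partial_sum_blocks)]

def activate(intermediate):
    """Apply a nonlinear activation (sigmoid, chosen here for illustration)
    to the weighted sums to produce the output activations."""
    return [1.0 / (1.0 + math.exp(-v)) for v in intermediate]
```

For example, two partial-sum blocks computed on different cores (or over different input subsets) can be accumulated with `accumulate` before `activate` is applied, which mirrors the staged accumulation the quoted passages describe.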
In [0016]: A neuron computes its output, called an activation, by applying a nonlinear activation function to a weighted sum of its inputs. A weighted sum is an intermediate result computed by multiplying each input with the corresponding weight and accumulating the products. A partial sum is a weighted sum of a subset of inputs. A weighted sum of all inputs may be computed in stages by accumulating one or more partial sums.

The examiner interprets the theme of the invention as providing a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores and provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

In regard to claim 17 (Currently Amended), Firuzan discloses:

- receiving, by the model NoC router, the distinct portion of the artificial neural network model from one of the other neural inference cores

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture.
The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

In [V, Page 5]: partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters.

In [III, Page 3]: The EMBRACE architecture is a 2-D array of interconnected neural tiles surrounded by I/O blocks. It adopts H-NoC, a hierarchical mesh-based topology, that consists of three layers; 10 neural cells are connected to a router at the first layer to form a neuron module, up to 10 neuron modules are connected to an upper router to form a tile, and up to 4 tiles are connected to an upper router at the third layer to form a cluster. If a neural network does not fit in one cluster, multiple clusters are interconnected by a mesh at a higher level.

In [VI, Page 7]: the dragonfly has 256 PEs arranged as 16 groups, each containing 16 PEs. For the larger neural networks (NN1 and NN2), we use 1024 PEs by increasing the cluster count of HNOC and reconfigurable NoC and the group count of dragonfly by a factor of 4. For both network sizes, if the neural network is too large to fit the NoC, the neural network is partitioned and its execution is scheduled on the NoC.

In regard to claim 18 (Currently Amended), Firuzan discloses:

- sending, by the model NOC router of the neural inference core, another distinct portion of the artificial neural network model [[from]] to one of the other neural inference cores

In [V, Page 4]: We propose a design flow to implement a neural network on the reconfigurable NoC architecture.
The main steps of the design flow are partitioning the neural network based on the number of neurons that each PE can process in parallel, mapping neurons onto cluster nodes, and customizing the inter-cluster topology for the neural network by establishing reconfigurable links between clusters.

In [V, Page 5]: partition groups must be mapped onto network clusters in a way that groups with heavier communication are mapped onto adjacent (or nearby) clusters.

In [III, Page 3]: The EMBRACE architecture is a 2-D array of interconnected neural tiles surrounded by I/O blocks. It adopts H-NoC, a hierarchical mesh-based topology, that consists of three layers; 10 neural cells are connected to a router at the first layer to form a neuron module, up to 10 neuron modules are connected to an upper router to form a tile, and up to 4 tiles are connected to an upper router at the third layer to form a cluster. If a neural network does not fit in one cluster, multiple clusters are interconnected by a mesh at a higher level.

In [VI, Page 7]: the dragonfly has 256 PEs arranged as 16 groups, each containing 16 PEs. For the larger neural networks (NN1 and NN2), we use 1024 PEs by increasing the cluster count of HNOC and reconfigurable NoC and the group count of dragonfly by a factor of 4.
For both network sizes, if the neural network is too large to fit the NoC, the neural network is partitioned and its execution is scheduled on the NoC.

Firuzan does not explicitly disclose:

- storing, by the neural inference core, an entirety of the artificial neural network model

However, Modha discloses:

- storing, by the neural inference core, an entirety of the artificial neural network model

In [0025]: With the memory 202, 201, 205 provided on chip 200 for the neural network model, transient data, and controller instructions, there is no need for off-chip memory access during computation.

In [0025]: Accordingly, chip 200 is fast and energy-efficient in comparison to alternative approaches that do not provide such on-chip memory.

The examiner interprets the theme of the invention as providing a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores and provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

In regard to claim 19.
(Currently Amended), Firuzan discloses:

- wherein the sent first data comprises at least one partial sum of the partial sums

In [II, Page 2]: An individual neuron is connected to several neurons in the previous layer, from which it receives data, and several neurons in the next layer, to which it sends data. Consequently, all neurons in any given layer i receive the same set of inputs from layer i-1. Associated with each input, each neuron keeps a weight that specifies the impact of the input in the final output.

In regard to claim 20 (Currently Amended), Firuzan does not explicitly disclose:

- wherein the vector matrix multiplier unit generates the partial sums via vector matrix multiplication of one or more of the synaptic weights from [[its]] the weight buffer and one or more of the activation vectors from [[its]] the activation memory.

However, Modha discloses:

- wherein the vector matrix multiplier unit generates the partial sums via vector matrix multiplication of one or more of the synaptic weights from [[its]] the weight buffer and one or more of the activation vectors from [[its]] the activation memory.

In [0019]: Each neural network layer is associated with a weight tensor, parameter tensor, input tensor, output tensor, and intermediate tensor.

In [0019]: The intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums.

In [0045]: In some embodiments, the parameters required to compute each intermediate layer are stored in the neural network model memory. For example, in some embodiments, the parameters include synaptic weights or synaptic activation functions.

The examiner interprets the theme of the invention as providing a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores and provide optimized performance.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

In regard to claim 22 (Currently Amended), Firuzan does not explicitly disclose:

- wherein the neural network weight matrix comprises the synaptic weights associated with the distinct portion

However, Modha discloses:

- wherein the neural network weight matrix comprises the synaptic weights associated with the distinct portion

In [0021]: Multiple neural cores may be tiled in a neural core array. In some embodiments, the array is 2-dimensional.

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N. To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102.

In [0038]: Each neural core implements a part of the larger neural network model for a given problem. Each neural core receives a portion of the overall chip input, and a portion of the overall neural network model.

In [0040]: Individual portions of the overall neural network model are distributed to the neural cores.
(a neural core array is generally considered a synapse array (or, more accurately, a neuromorphic chip comprised of both neurons and synapses) in the context of neuromorphic hardware. These arrays are designed to mimic biological brain structure.)

The examiner interprets the theme of the invention as providing a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores and provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

In regard to claim 23 (New), Firuzan does not explicitly disclose:

- generating, by the vector-vector unit, partial sum vectors from vector-to-vector operations.

However, Modha discloses:

- generating, by the vector-vector unit, partial sum vectors from vector-to-vector operations.

In [0020]: With reference now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. A neural core 100 has M inputs and N outputs. In various embodiments, M=N.
To compute an output tensor block, a neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums that are stored in a 1×N intermediate tensor block 103. A U×N parameter tensor block contains the U parameters that specify each of the N neuron activation functions that are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.

In [0019]: the intermediate tensor contains any data that the layer produces as intermediate computations, such as partial sums.

The examiner interprets the theme of the invention as providing a neural network inference engine with a plurality of distributed cores using a NoC router that is implemented based on a 2D hopping network to enable adjacency of cores and provide optimized performance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Firuzan and Modha. Firuzan teaches an inference processor consisting of a plurality of cores within a distributed (partitioned) environment using a 2D hopping network router NoC. Modha teaches a vector-vector unit to perform vector-to-vector operations and an activation function using non-linear functions. One of ordinary skill would be motivated to combine Firuzan and Modha to provide multiple distributed neural cores that can increase the speed of neural network processing while decreasing latency between presentation of input and computation of output (Modha [0038]).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH, whose telephone number is (571) 272-4605. The examiner can normally be reached by phone. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TIRUMALE K RAMESH/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Oct 22, 2020
Application Filed
Apr 06, 2023
Non-Final Rejection — §103
Jul 11, 2023
Response Filed
Nov 02, 2023
Final Rejection — §103
May 03, 2024
Response after Non-Final Action
May 06, 2024
Applicant Interview (Telephonic)
May 07, 2024
Examiner Interview Summary
May 08, 2024
Request for Continued Examination
May 09, 2024
Response after Non-Final Action
Sep 23, 2024
Non-Final Rejection — §103
Dec 30, 2024
Response Filed
Mar 13, 2025
Final Rejection — §103
Jun 04, 2025
Applicant Interview (Telephonic)
Jun 09, 2025
Examiner Interview Summary
Jun 20, 2025
Request for Continued Examination
Jun 24, 2025
Response after Non-Final Action
Sep 05, 2025
Non-Final Rejection — §103
Oct 23, 2025
Interview Requested
Nov 03, 2025
Applicant Interview (Telephonic)
Nov 06, 2025
Examiner Interview Summary
Dec 08, 2025
Response Filed
Feb 25, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12518153
TRAINING MACHINE LEARNING SYSTEMS
2y 5m to grant Granted Jan 06, 2026
Patent 12293284
META COOPERATIVE TRAINING PARADIGMS
2y 5m to grant Granted May 06, 2025
Patent 12229651
BLOCK-BASED INFERENCE METHOD FOR MEMORY-EFFICIENT CONVOLUTIONAL NEURAL NETWORK IMPLEMENTATION AND SYSTEM THEREOF
2y 5m to grant Granted Feb 18, 2025
Patent 12131244
HARDWARE-OPTIMIZED NEURAL ARCHITECTURE SEARCH
2y 5m to grant Granted Oct 29, 2024
Patent 11803745
TERMINAL DEVICE AND METHOD FOR ESTIMATING FIREFIGHTING DATA
2y 5m to grant Granted Oct 31, 2023
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

7-8
Expected OA Rounds
18%
Grant Probability
20%
With Interview (+2.1%)
4y 5m
Median Time to Grant
High
PTA Risk
Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
