Prosecution Insights
Last updated: April 19, 2026
Application No. 17/956,985

METHODS AND DEVICES FOR CONFIGURING A NEURAL NETWORK ACCELERATOR WITH A CONFIGURABLE PIPELINE

Non-Final OA: §101, §103
Filed: Sep 30, 2022
Examiner: RAMESH, TIRUMALE K
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: Imagination Technologies Limited
OA Round: 1 (Non-Final)

Grant Probability: 18% (At Risk)
OA Rounds: 1-2
To Grant: 4y 5m
With Interview: 20%

Examiner Intelligence

Career Allow Rate: 18% (7 granted of 40 resolved; -37.5% vs TC avg). Grants only 18% of cases.
Interview Lift: +2.1% (minimal, roughly +2% with vs. without an interview, across resolved cases with an interview)
Avg Prosecution: 4y 5m (typical timeline)
Career History: 80 total applications across all art units; 40 currently pending
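The headline figures above are simple ratios of the raw counts shown. A minimal sketch of the arithmetic, assuming the dashboard rounds to whole percentages; the variable names and the inferred Tech Center average are illustrative, not the analytics provider's actual method:

    # Arithmetic behind the examiner-intelligence figures above; names are
    # illustrative and the TC average is inferred from the stated delta.
    granted, resolved = 7, 40

    career_allow_rate = granted / resolved        # 0.175, displayed as 18%
    print(f"Career allow rate: {career_allow_rate:.1%}")  # -> 17.5%

    # "-37.5% vs TC avg" implies a Tech Center average of roughly:
    tc_average = career_allow_rate + 0.375        # 0.55 (an inference, not given)

    # "+2.1% Interview Lift" reads as: allow rate among resolved cases with an
    # interview minus the allow rate among those without; the per-case split
    # behind that figure is not shown on the page.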

Statute-Specific Performance

§101: 30.7% (-9.3% vs TC avg)
§103: 59.1% (+19.1% vs TC avg)
§102: 3.7% (-36.3% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)

Tech Center averages are estimates. Based on career data from 40 resolved cases.
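Since the chart itself did not survive extraction, the implied Tech Center averages can be back-computed from the deltas. A small sketch; the dictionary values are the figures listed above, and what the percentages measure is as stated by the chart, not re-derived here:

    # Back-compute the implied Tech Center average for each statute from the
    # examiner's rate and the stated delta (rate - delta = TC average).
    examiner_rate = {"101": 30.7, "103": 59.1, "102": 3.7, "112": 5.4}
    delta_vs_tc   = {"101": -9.3, "103": 19.1, "102": -36.3, "112": -34.6}

    for statute, rate in examiner_rate.items():
        tc_avg = rate - delta_vs_tc[statute]   # e.g. 30.7 - (-9.3) = 40.0
        print(f"§{statute}: examiner {rate:.1f}% vs TC avg ~{tc_avg:.1f}%")
    # All four back out to ~40.0%, consistent with a single TC-wide baseline.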

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Step 1: According to the first part of the analysis, in the instant case, claims 1-17 and 19 are directed to a method claim, and claim 18 is directed to an apparatus (device) claim comprising one or more processors. Thus, each of the claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

In regard to claim 1:

Step 2A Prong 1: “determining an order (except ‘of the selected set of hardware processing units’) to perform the one or more neural network operations in accordance with the sequence” is a mental step or mathematical concept, interpreting the “determining” as a process that can be performed by a human. (BRI: the “determining an order”, on its own, could be interpreted as a purely mental process or a mathematical concept that can, in principle, be performed by a human mind, even if complex in practice. Merely stating that the “determination” uses a hardware processing unit may not be enough to demonstrate an improvement to computer technology, and in this context the hardware units are generic computer functions, as there is no specificity of novel hardware to perform the function.)

Additional Elements, Step 2A Prong 2: “A computer-implemented method for configuring a neural network accelerator to process input data, the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units” recited in the preamble does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “each hardware processing unit comprising hardware to accelerate performing one or more neural network operations on received data, the method comprising:” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “obtaining a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “selecting a set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations;” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “of the selected set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
“and providing the neural network accelerator with control information that causes the crossbar of the neural network accelerator to form a pipeline of the selected set of hardware processing units in the determined order to process the input data.” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “A computer-implemented method for configuring a neural network accelerator to process input data, the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units” recited in the preamble does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “each hardware processing unit comprising hardware to accelerate performing one or more neural network operations on received data, the method comprising:” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “obtaining a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “selecting a set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations;” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “the selected set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and providing the neural network accelerator with control information that causes the crossbar of the neural network accelerator to form a pipeline of the selected set of hardware processing units in the determined order to process the input data.” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 2:

Step 2A Prong 1: “and information identifying the determined order of the selected set of hardware processing units” is a mental step.

Additional Elements, Step 2A Prong 2: “wherein the control information comprises information identifying the selected set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein the control information comprises information identifying the selected set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
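For readers less familiar with the claimed subject matter, a minimal sketch of the claim 1 flow being analyzed: obtain a sequence of operations, select hardware units, determine an order, and emit crossbar control information. All names and the control-information format here are hypothetical illustrations, not the applicant's implementation:

    # Hypothetical sketch of the claim 1 flow: map a sequence of neural network
    # operations onto hardware processing units and emit control information
    # that tells the crossbar which pipeline to form. Names are illustrative.

    OP_TO_UNIT = {
        "conv": "convolution_unit",
        "relu": "activation_unit",
        "pool": "pooling_unit",
    }

    def configure_accelerator(op_sequence):
        # Select a set of hardware processing units for the operations.
        selected = [OP_TO_UNIT[op] for op in op_sequence]
        # Determine an order in accordance with the sequence (here, one-to-one).
        order = selected
        # Control information that causes the crossbar to form the pipeline.
        return {"pipeline": order}

    print(configure_accelerator(["conv", "relu", "pool"]))
    # {'pipeline': ['convolution_unit', 'activation_unit', 'pooling_unit']}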
In regard to claim 3:

Step 2A Prong 1: “and the control information comprises information identifying which input ports of the crossbar are to be connected to which output ports of the crossbar to form the pipeline” is a mental process that can be performed using pen and paper, envisioning a crossbar as a reconfigurable graph structure. “wherein the crossbar comprises a plurality of input ports and a plurality of output ports” is a mental step of describing a fundamental physical building block of an architecture.

Step 2A Prong 2: no additional elements. Step 2B: no additional elements.

In regard to claim 4:

Step 2A Prong 2: “wherein the neural network accelerator comprises a register for each output port” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and providing the control information to the neural network accelerator comprises causing a value to be written each register that identifies which input port of the plurality of input ports” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein the neural network accelerator comprises a register for each output port” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and providing the control information to the neural network accelerator comprises causing a value to be written each register that identifies which input port of the plurality of input ports” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 5:

Step 2A Prong 1: “wherein each input port is allocated a number” is a mental step that can be performed using pen and paper. The act of assigning an identifying number to a physical input port on a device (or a conceptual one in software) is a simple organizational task that a person can do mentally.

Additional Elements, Step 2A Prong 2: “and the value written to a register is the number of the input port to be connected to the corresponding output port.” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “and the value written to a register is the number of the input port to be connected to the corresponding output port.” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 6:

Step 2A Prong 2: “wherein when a hardware processing unit of the plurality of hardware processing units does not form part of the set of hardware processing units then a predetermined value is written to the register corresponding to the output port connected to that hardware processing unit to indicate that the hardware processing unit is to be disabled” does not integrate the judicial exception into a practical application. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein when a hardware processing unit of the plurality of hardware processing units does not form part of the set of hardware processing units then a predetermined value is written to the register corresponding to the output port connected to that hardware processing unit to indicate that the hardware processing unit is to be disabled” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
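Claims 4-6 describe a concrete register encoding: one register per crossbar output port, each holding the number of the input port to connect, with a predetermined value disabling unused units. A minimal sketch under assumed port numbering and an assumed sentinel value:

    # Sketch of the register encoding recited in claims 4-6: one register per
    # crossbar output port holding the number of the input port to connect,
    # with a sentinel marking a disabled unit. DISABLED and the port numbering
    # are assumptions for illustration only.

    DISABLED = 0xFF  # assumed "predetermined value" indicating a disabled unit

    def build_register_values(connections, num_output_ports):
        """connections maps output port -> input port for the desired pipeline."""
        return {
            out_port: connections.get(out_port, DISABLED)
            for out_port in range(num_output_ports)
        }

    # Route input port 2 to output port 0 and input port 0 to output port 3;
    # the units behind output ports 1 and 2 are not part of the pipeline.
    print(build_register_values({0: 2, 3: 0}, num_output_ports=4))
    # {0: 2, 1: 255, 2: 255, 3: 0}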
In regard to claim 7:

Step 2A Prong 1: “determining whether the control information is valid” is a mental step using pen and paper.

Step 2A Prong 2: “prior to providing the neural network accelerator with the control information, and only providing the neural network accelerator with the control information if it is determined that the control information is valid” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “prior to providing the neural network accelerator with the control information, and only providing the neural network accelerator with the control information if it is determined that the control information is valid” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)).

In regard to claim 8:

Step 2A Prong 1: “wherein it is determined that the control information is valid” is a mental step using pen and paper.

Additional Elements, Step 2A Prong 2: “only if when the output of a first hardware processing unit is to be the input to a second hardware processing unit, the control information indicates that the input port of the crossbar coupled to the output of the first hardware processing unit is to be connected to the output port of the crossbar coupled to the input of the second hardware processing unit.” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “only if when the output of a first hardware processing unit is to be the input to a second hardware processing unit, the control information indicates that the input port of the crossbar coupled to the output of the first hardware processing unit is to be connected to the output port of the crossbar coupled to the input of the second hardware processing unit.” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 9:

Step 2A Prong 1: “and wherein determining an order (except ‘of the selected set of hardware processing units’)” is a mental step or mathematical concept, interpreting the “determining” as a process that can be performed by a human. “comprises determining the order such that the restrictions are not contravened and only valid” is a mental step that can be performed using pen and paper.

Additional Elements, Step 2A Prong 2: “further comprising reading, from a memory, a predefined set of restrictions defining which hardware processing units can be validly connected to each other” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “of the selected set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “combinations of hardware processing units are to be connected using the crossbar” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “further comprising reading, from a memory, a predefined set of restrictions defining which hardware processing units can be validly connected to each other” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “of the selected set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “combinations of hardware processing units are to be connected using the crossbar” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
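Claims 7-9 add a validity gate: control information is only provided to the accelerator if every producer-consumer connection the pipeline needs is present and no predefined restriction is contravened. A sketch, with the link and restriction formats assumed for illustration:

    # Sketch of the validity check in claims 7-9: accept control information
    # only if all required unit-to-unit links are present and none of the
    # predefined restrictions (read from memory in claim 9) is contravened.
    # The data formats here are assumptions.

    def is_valid(control_info, required_links, restrictions):
        links = set(control_info["links"])
        if links & restrictions:
            return False               # a banned unit-to-unit connection requested
        return set(required_links) <= links   # every needed connection present

    control = {"links": [("conv_unit", "activation_unit"),
                         ("activation_unit", "pooling_unit")]}
    restrictions = {("pooling_unit", "conv_unit")}  # assumed predefined restriction

    assert is_valid(control, [("conv_unit", "activation_unit")], restrictions)
    assert not is_valid(control, [("conv_unit", "pooling_unit")], restrictions)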
In regard to claim 10:

Step 2A Prong 2: “wherein the set of hardware processing units are selected such that each hardware processing unit of the set of hardware processing units is only used once in performing the sequence of one or more neural network operations” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein the set of hardware processing units are selected such that each hardware processing unit of the set of hardware processing units is only used once in performing the sequence of one or more neural network operations” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 11:

Step 2A Prong 1: “and the control information comprises information identifying the selected data input unit” is a mental step, as the phrase is associated with data analysis and is viewed very broadly without showing significantly more.

Step 2A Prong 2: “wherein the neural network accelerator comprises a plurality of data input units configured to load the input data into the neural network accelerator” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and the method further comprises selecting one of the plurality of data input units to load the input data into the neural network accelerator based on one or more characteristics of the input data and/or the pipeline” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein the neural network accelerator comprises a plurality of data input units configured to load the input data into the neural network accelerator” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and the method further comprises selecting one of the plurality of data input units to load the input data into the neural network accelerator based on one or more characteristics of the input data and/or the pipeline” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 12:

Step 2A Prong 2: “wherein at least one of the hardware processing units in the set is configurable to transmit or receive a tensor in a selected processing order of a plurality of selectable processing orders” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and the method further comprises selecting a processing order to be used by one or more of the at least one processing units for transmitting or receiving a tensor based on the pipeline, and wherein the control information comprises information identifying the selected processing order” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein at least one of the hardware processing units in the set is configurable to transmit or receive a tensor in a selected processing order of a plurality of selectable processing orders” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and the method further comprises selecting a processing order to be used by one or more of the at least one processing units for transmitting or receiving a tensor based on the pipeline, and wherein the control information comprises information identifying the selected processing order” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
In regard to claim 13:

Step 2A Prong 1: “wherein the control information further comprises information identifying a function and/or one or more operations” is a mental step.

Additional Elements, Step 2A Prong 2: “to be implemented by one or more of the hardware processing units in the set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “to be implemented by one or more of the hardware processing units in the set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 14:

Step 2A Prong 1: “determining a second order of the second selected set of hardware processing units to perform the one or more neural network operations in accordance with the second sequence” and “and the second determined order is different than the determined order” are mental steps.

Additional Elements, Step 2A Prong 2: “obtaining a second sequence of one or more neural network operations to be performed by the neural network accelerator on second input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “selecting a second set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations of the second sequence” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and providing the neural network accelerator with second control information that causes the crossbar of the neural network accelerator to form a second pipeline of the selected second set of hardware processing units in the determined second order to process the second input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “wherein the second set of hardware processing units is the same as the set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “obtaining a second sequence of one or more neural network operations to be performed by the neural network accelerator on second input data” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “selecting a second set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations of the second sequence” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and providing the neural network accelerator with second control information that causes the crossbar of the neural network accelerator to form a second pipeline of the selected second set of hardware processing units in the determined second order to process the second input data” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “wherein the second set of hardware processing units is the same as the set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 15:

Step 2A Prong 2: “wherein the plurality of hardware processing units comprises one or more of: a convolution processing unit configured to accelerate convolution operations between input data and weight data, an activation processing unit configured to accelerate applying an activation function to data, an element-wise operations processing unit configured to accelerate performing one or more element-wise operations on a set of data, a pooling processing unit configured to accelerate applying a pooling function on data, a normalisation processing unit configured to accelerate applying a normalisation function to data, and an interleave processing unit configured to accelerate rearrangement of data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “wherein the plurality of hardware processing units comprises one or more of: a convolution processing unit configured to accelerate convolution operations between input data and weight data, an activation processing unit configured to accelerate applying an activation function to data, an element-wise operations processing unit configured to accelerate performing one or more element-wise operations on a set of data, a pooling processing unit configured to accelerate applying a pooling function on data, a normalisation processing unit configured to accelerate applying a normalisation function to data, and an interleave processing unit configured to accelerate rearrangement of data” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 16:

Step 2A Prong 2: “A method of configuring a neural network accelerator to implement a neural network, the neural network comprising a plurality of layers, each layer configured to receive input data and perform one or more neural network operations on the received input data, the method comprising:” recited in the preamble does not integrate the judicial exception into a practical application. This element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “grouping the neural network operations of the neural network into one or more sequences of neural network operations” does not integrate the judicial exception into a practical application. This element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “each sequence of neural network operations being executable by a combination of hardware processing elements; and executing the method as set forth in claim 1 to 8 for each sequence of neural network operations” does not integrate the judicial exception into a practical application. This element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).
Step 2B: “A method of configuring a neural network accelerator to implement a neural network, the neural network comprising a plurality of layers, each layer configured to receive input data and perform one or more neural network operations on the received input data, the method comprising:” recited in the preamble does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “grouping the neural network operations of the neural network into one or more sequences of neural network operations” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “each sequence of neural network operations being executable by a combination of hardware processing elements; and executing the method as set forth in claim 1 to 8 for each sequence of neural network operations” does not amount to significantly more than the judicial exception in the claim. The additional element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 17:

Step 2A Prong 2: “A non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the method as set forth in claim 1” recited in the preamble does not integrate the judicial exception into a practical application. This element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “A non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the method as set forth in claim 1” recited in the preamble does not amount to significantly more than the judicial exception in the claim. The element of “a computer system” merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 18:

Step 2A Prong 1: “determine an order (except ‘of the selected set of hardware processing units’) to perform the one or more neural network operations in accordance with the sequence” is a mental step or mathematical concept, interpreting the “determining” as a process that can be performed by a human.

Additional Elements, Step 2A Prong 2: “computing-based device for configuring a neural network accelerator to process input data” recited in the preamble does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units, each hardware processing unit comprising hardware to accelerate performing one or more neural network operations on received data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “the computing-based device comprising one or more processors configured to: obtain a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “select a set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operation” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “of the selected set of hardware processing units” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and provide the neural network accelerator with control information that causes the crossbar of the neural network accelerator to form a pipeline of the selected set of hardware processing units in the determined order to process the input data” does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “computing-based device for configuring a neural network accelerator to process input data” recited in the preamble does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units, each hardware processing unit comprising hardware to accelerate performing one or more neural network operations on received data” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “the computing-based device comprising one or more processors configured to: obtain a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “select a set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operation” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “of the selected set of hardware processing units” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)). “and provide the neural network accelerator with control information that causes the crossbar of the neural network accelerator to form a pipeline of the selected set of hardware processing units in the determined order to process the input data” does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

In regard to claim 19:

Step 2A Prong 2: “A computing-based device comprising one or more processors configured to perform the method as set forth in claim 1” recited in the preamble does not integrate the judicial exception into a practical application. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Step 2B: “A computing-based device comprising one or more processors configured to perform the method as set forth in claim 1” recited in the preamble does not amount to significantly more than the judicial exception in the claim. This additional element is merely using a computer as a tool to perform an abstract idea (see MPEP 2106.05(h)).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 9-14, and 17-19 are rejected under 35 U.S.C. 103 as unpatentable over Mihir MODY et al. (hereinafter MODY), US 2022/0391776 A1, in view of Tirumale K Ramesh et al. (hereinafter Ramesh), US 8103853 B2.

In regard to claim 1:

MODY discloses:

- A computer-implemented method for configuring a neural network accelerator to process input data: In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: Each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models.

- the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units: In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure.
- the method comprising: obtaining a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data: In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. (BRI: the primary purpose of accelerating machine learning (ML) models is to directly and significantly accelerate neural network (NN) operations, which are the most computationally intensive parts of deep learning.) In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network. Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data.)

- determining an order of the selected set of hardware processing units to perform the one or more neural network operations in accordance with the sequence: In [0028]: Once a ML model 302 is trained, the ML model 302 may be compiled and translated for a target hardware by a ML model compiler 304A, 304B, . . . 304n (collectively). In this example, the target hardware 306 is shown as a simplified version of the device shown in FIG. 2, and the target hardware 306 includes a SoC 308 with one or more cores 310A, 310B, . . . 310n, coupled to a shared memory 312. The SoC 308 is also coupled to external memory 314. The ML model compiler 304 helps prepare the ML model 302 for execution by the target hardware 306 by translating the ML model 302 to a runtime code 316A, 316B, . . . 316n (collectively 316) format that is compatible with the target hardware 306. The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on.

- and providing the neural network accelerator with control information that causes the crossbar of the neural network accelerator: In [0035]: Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML model. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively less resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B. In [0025]: The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. In [0025]: In this example, the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: within the context of an ML accelerator that is coupled with a crossbar, a runtime controller provides the control information to form a pipeline.)

MODY does not explicitly disclose:

- to form a pipeline of the selected set of hardware processing units in the determined order to process the input data.

However, Ramesh discloses:

- to form a pipeline of the selected set of hardware processing units in the determined order to process the input data: In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip. The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 5, lines 17-28]: The fabric element cell (FEC) may comprise the computing support to fabric functional elements for implementation and execution. In its generality, a FEC can be formed from a reconfigurable hardware and/or from software entities as a thread. For example, if security function is required for processing, a security fabric element may be formed by grouping FECs that may have different types of FEC to be adaptively selected. Each FEC may interact in a distributed environment with any of the other FECs. This capability may be translated to identification of fabric element cells in each functional entity and hardware units of the system-on chip. (BRI: The reconfigurable fabric is a powerful overarching architecture that facilitates adaptation of fabric cells for specific functions mapped into different hardware units, perhaps for a hardware accelerator function for a neural network operation. See the adaptation of the fabric for a security function as an example.) In [Col 3, lines 32-35]: any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. In [Col 5, lines 33-36]: as shown in FIG. 5, may include cognitive fabric element cells 34, morph fabric element cells 35, soft fabric elements cells 36, storage fabric element cells 37, and reconfigurable fabric element cells 38. In [Col 5, lines 52-55]: As shown in FIG. 5, a morph fabric element cell 35 may control morphing of another fabric element cell residing in a global network/module level to a low level chip micro architecture. (BRI: morphing from a global application to a microarchitecture provides using separate hardware units for NN acceleration operations; this may be a major trend in computing where specialized hardware is integrated at the microarchitectural level to efficiently handle the increasing demands of neural networks (NNs) and AI workloads for enhanced performance, energy efficiency and lower latency.) In [Col 7, lines 61-67]: the fabric chip system shown in FIG. 1 may also be used for hardware acceleration processing. In such a manner, general purpose reconfigurable processing elements 11 may be used for specific hardware acceleration functions. An array of such elements may offer powerful reconfigurable computing solutions for adaptive stream processing for signal processing and packet processing applications. For example, signal processing sequences as frames and packets can be processed on a number of such distributed reconfigurable processing elements 11 (continuing in [Col 8, lines 1-3]). In [Col 3, lines 54-57]: FIG. 2 comprises an architecture diagram of a distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11. In [Col 3, lines 58-61]: the integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. In [Col 3, lines 61-67]: Each distributed element may comprise a cognitive processor 93 and a switch element 9. The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto the next cognitive processor 93 via switch 9. (BRI: an orthogonally laid out switch is fundamentally a crossbar switch, as the "orthogonal" layout describes the grid-like matrix of input (X-axis) and output (Y-axis) lines with switches at their intersections (crosspoints), enabling any-to-any connection, the core definition of a crossbar. A 4-port (typically 4x4) switch uses this matrix to connect any of its four inputs to any of its four outputs, often employing scheduling/arbitration logic for simultaneous transfers.) In [Col 2, lines 15-16]: FIG. 4 shows a block diagram showing switch element configurations with 4-ports to each switch element of FIG. 3. In [Col 4, line 1]: it may be possible to bypass a series of switches 9 for one cognitive processor 93 to virtually connect to another. In [Col 7, lines 43-47]: Each chip infrastructure comprises soft processor 14 (shown in FIG. 1), distributed reconfigurable elements 11 (shown in FIG. 1), and a reconfigurable communication processor 57 to interconnect them globally to form a fabric group. (Note: A soft processor is a CPU design described in a Hardware Description Language (HDL, like Verilog/VHDL) that gets implemented in the flexible, reconfigurable fabric of a Field-Programmable Gate Array (FPGA), rather than being a fixed silicon circuit.) In [Col 7, lines 27-39]: The fabric chip mechanism may comprise a combination of instructions from the compiler and/or parser (static scheduling) which may be added with second level run-time hardware scheduling such as traditional superscalar architecture via a reconfigurable fabric manager 45. (BRI: within the context of reconfigurable computing systems, run-time hardware scheduling via a reconfigurable fabric manager can involve dynamically forming or reconfiguring a pipeline of selected hardware units to execute a task.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and in fact demonstrates a highly powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]).

In regard to claim 2:

MODY does not explicitly disclose:

- wherein the control information comprises information identifying the selected set of hardware processing units and information identifying the determined order of the selected set of hardware processing units.

However, Ramesh discloses:

- wherein the control information comprises information identifying the selected set of hardware processing units and information identifying the determined order of the selected set of hardware processing units: In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip. The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 6, lines 46-49]: the fabric morphing control 12 may provide global control of fabric element cells morphing from global application instances to a processor micro-architecture 29 (shown in FIG. 6). In [Col 3, lines 29-35]: The soft processor 14 may comprise single core or multiple cores 15. With multiple cores, cores may be allocated and reallocated at run-time to optimize for performance based on the load balancing on these core workloads. Any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores.
(BRI: the fabric morphing control offers global control over how fabric element cells group and reconfigure, and provides the selection of hardware units to match the data flow or control flow of the running hardware.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and in fact demonstrates a highly powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]).

In regard to claim 9:

MODY discloses:

- reading, from a memory, and wherein determining an order of the selected set of hardware processing units comprises determining the order such that the restrictions are not contravened: In [0028]: in some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on. In [0047]: the multi-core orchestrator 906 may simulate the execution of the ML models 902 on the target hardware. In some cases, the simulation may be subject to a number of constraints. In some cases, these constraints may be external memory bandwidth, amount of power needed, memory bandwidth, and memory sizes. In some cases, memory bandwidth and memory sizes may also take into consideration the specific types of memories available on the target hardware. In [0032]: hardware resources required by a ML model may vary depending on a portion of the ML model being executed at a particular moment. For example, a ML model such as a deep learning or neural network model may include a number of layers. The hardware resources, for example processor time, memory capacity and throughput, power, etc., required for each layer may be different. In some cases, the execution of the multiple ML models executing on two or more logical computing cores may be sequenced across to balance the hardware resources required by the ML models. (BRI: scheduling, load balancing, and resource management are specifically designed to sequence and distribute machine learning workloads across hardware resources in an order that respects constraints. The determination of the order is a core function of these systems.)

MODY does not explicitly disclose:

- and only valid combinations of hardware processing units are to be connected using the crossbar.

However, Ramesh discloses:

- and only valid combinations of hardware processing units are to be connected using the crossbar: In [Col 5, lines 45-51]: The morphing control 12 shown in FIG. 1 may provide overall control for how the fabric element cells forms into groups, reconfigure from one type to another, and monitor the status of FEC. With a cognitive brain added within a smart switch 8, the cognitive control to the morphing process may add greater optimization of resource utilization and performance. In [Col 1, lines 45-50]: the reconfigurable hardware intelligent processor may be configured to implement a distributed cognitive processor, may be configured to implement a distributed reconfigurable processor, and may be configured to provide cognitive control for at least one of allocation, reallocation, and performance monitoring. In [Col 6, lines 4-9]: the reconfigurable fabric element cell 38 may also be allocated with specific functions and mapped to hardware reconfigurable processors. The automated allocation and deployment of functions onto hardware and software may be directed by a cognitive cell application control. (BRI: morphing within the context of computing refers to the dynamic reconfiguration and re-allocation of hardware resources, which provides valid and optimized combinations of hardware processing units within a concurrent processing system. This process is part of an advanced approach to processor architecture known as polymorphic computing. A cognitive processor can be a form of polymorphic computing where the computational structure can dynamically change or reconfigure itself, adaptable to the task at hand, and the cognitive processor provides the hardware designed to mimic the brain's structure and function to support this.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. MODY teaches, within the broader concept of ML orchestration (the automation, coordination, and management from data prep to deployment and monitoring), a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, which are the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and in fact demonstrates a highly powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]).
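The constraint-aware sequencing the rejection attributes to MODY's orchestrator can be pictured with a toy scheduler: order workloads so that combined peak demand stays under a budget. Everything here (the greedy heuristic, the numbers) is an illustration of the general idea, not MODY's disclosed algorithm:

    # Toy sketch of constraint-aware sequencing: batch workloads so that their
    # combined peak bandwidth stays under a budget. A greedy heuristic over
    # made-up numbers; MODY's orchestration is not disclosed at this level.

    def sequence_models(models, bandwidth_limit):
        """models maps name -> peak bandwidth; returns batches under the limit."""
        batches, current, used = [], [], 0
        for name, bw in sorted(models.items(), key=lambda kv: -kv[1]):
            if used + bw > bandwidth_limit and current:
                batches.append(current)      # close the batch, start another
                current, used = [], 0
            current.append(name)
            used += bw
        if current:
            batches.append(current)
        return batches

    print(sequence_models({"ml_model_1": 70, "ml_model_2": 50, "ml_model_3": 20}, 100))
    # [['ml_model_1'], ['ml_model_2', 'ml_model_3']]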
For example, a ML model which uses more processing resources may be assigned to execute on a certain ML core which may have an increased amount of processing power, (BRI: the designation of certain Machine Learning (ML) models or parts of models to specific hardware cores can mean that a given hardware processing unit is used exclusively for that specific operation or set of operations within a sequence that is not repetitive in nature) In regard to claim 11: MODY discloses: - wherein the neural network accelerator comprises a plurality of data input units configured to load the input data into the neural network accelerator, and the method further comprises selecting one of the plurality of data input units to load the input data into the neural network accelerator based on one or more characteristics of the input data and/or the pipeline, and the control information comprises information identifying the selected data input unit. In [0029]: After compilation of the ML model 302 to runtime code 316 for the target hardware 306, the parameters of the ML model 302 may be stored, for example, in the external memory 314. When a ML model 302 is executed, the runtime code and parameters 316 may be loaded, for example into shared memory 312 or other memory, such as a memory dedicated to a specific ML core 310, and executed by the ML core 310. In regard to claim 13: MODY does not explicitly disclose: - wherein the control information further comprises information identifying a function and/or one or more operations to be implemented by one or more of the hardware processing units in the set of hardware processing units. However, Ramesh discloses: - wherein the control information further comprises information identifying a function and/or one or more operations to be implemented by one or more of the hardware processing units in the set of hardware processing units. In [Col 6, lines 46-49]: the fabric morphing control 12 may provide global control of fabric element cells morphing from global application instances to a processor micro-architecture 29 (shown in FIG. 6). In [Col 3, lines 29-35]: The soft processor 14 may comprise single core or multiple cores 15. With multiple cores, cores may be allocated and reallocated at run-time to optimize for performance based on the load balancing on these core workloads. Any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. (BRI: the fabric morphing control offers global control over how fabric element cells group and reconfigure) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. Within the broader concept of ML orchestration (the automation, coordination, and management of ML workflows from data preparation to deployment and monitoring), MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]).
One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]). In regard to claim 14: MODY discloses: - obtaining a second sequence of one or more neural network operations to be performed by the neural network accelerator on second input data; In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight. In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network. Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data) - selecting a second set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations of the second sequence; In [0025]: the ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: ML cores are hardware units designed specifically to accelerate machine learning operations, and their configuration effectively selects these specialized units to perform specific computations) - determining a second order of the second selected set of hardware processing units to perform the one or more neural network operations in accordance with the second sequence; In [0028]: the target hardware 306 is shown as a simplified version of the device shown in FIG. 2, and the target hardware 306 includes a SoC 308 with one or more cores 310A, 310B, . . . 310n, coupled to a shared memory 312. The SoC 308 is also coupled to external memory 314. The ML model compiler 304 helps prepare the ML model 302 for execution by the target hardware 306 by translating the ML model 302 to a runtime code 316A, 316B, . . . 316n (collectively 316) format that is compatible with the target hardware 306.
The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on. (BRI: the context of multiple cores as identified by a compiler can provide a second set of hardware units) - wherein the second set of hardware processing units is the same as the set of hardware processing units, and the second determined order is different than the determined order In [0028]: The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. (BRI: the context of multiple cores as identified by a compiler can provide a second order different from the determined order as a result of dynamically loaded information) - and providing the neural network accelerator with second control information that causes the crossbar of the neural network accelerator In [0035]: Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML model. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively less resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B. In [0025]: The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. In [0025]: In this example, the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: in the context of multiple cores as identified by a compiler, a runtime controller provides the control information to form a pipeline) However, Ramesh discloses: - to form a second pipeline of the selected second set of hardware processing units in the determined second order to process the second input data; In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip.
The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 5, lines 17-28]: The fabric element cell (FEC) may comprise the computing support to fabric functional elements for implementation and execution. In its generality, a FEC can be formed from a reconfigurable hardware and/or from software entities as a thread. For example, if security function is required for processing, a security fabric element may be formed by grouping FECs that may have different types of FEC to be adaptively selected. Each FEC may interact in a distributed environment with any of the other FECs. This capability may be translated to identification of fabric element cells in each functional entity and hardware units of the system-on chip. (BRI: the reconfigurable fabric is a powerful overarching architecture that facilitates adaptation of fabric cells for specific functions mapped onto different hardware units, such as a hardware accelerator function for a neural network operation; see the adaptation of the fabric for a security function as an example) In [Col 3, lines 32-35]: any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. In [Col 5, lines 33-36]: as shown in FIG. 5, may include cognitive fabric element cells 34, morph fabric element cells 35, soft fabric elements cells 36, storage fabric element cells 37, and reconfigurable fabric element cells 38. In [Col 5, lines 52-55]: As shown in FIG. 5, a morph fabric element cell 35 may control morphing of another fabric element cell residing in a global network/module level to a low level chip micro architecture. (BRI: morphing from a global application to a microarchitecture provides for using separate hardware units for NN acceleration operations, a major trend in computing where specialized hardware is integrated at the microarchitectural level to efficiently handle the increasing demands of neural networks (NNs) and AI workloads for enhanced performance, energy efficiency, and lower latency) In [Col 7, line 61 - Col 8, line 3]: the fabric chip system shown in FIG. 1 may also be used for hardware acceleration processing. In such a manner, general purpose reconfigurable processing elements 11 may be used for specific hardware acceleration functions. An array of such elements may offer powerful reconfigurable computing solutions for adaptive stream processing for signal processing and packet processing applications. For example, signal processing sequences as frames and packets can be processed on a number of such distributed reconfigurable processing elements 11. In [Col 3, lines 54-57]: FIG. 2 comprises an architecture diagram of a distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11, In [Col 3, lines 58-61]: the integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. In [Col 3, lines 61-67]: Each distributed element may comprise a cognitive processor 93 and a switch element 9.
The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto next cognitive processor 93 via switch 9 (BRI: an orthogonally laid out switch is fundamentally a crossbar switch, as the "orthogonal" layout describes the grid-like matrix of input (X-axis) and output (Y-axis) lines with switches at their intersections (crosspoints), enabling any-to-any connection, the core definition of a crossbar. A 4-port (typically 4x4) switch uses this matrix to connect any of its four inputs to any of its four outputs, often employing scheduling/arbitration logic for simultaneous transfers) In [Col 2, lines 15-16]: FIG. 4 shows a block diagram showing switch element configurations with 4-ports to each switch element of FIG. 3; In [Col 4, line 1]: it may be possible to bypass a series of switches 9 for one cognitive processor 93 to virtually connect to another In [Col 7, lines 43-47]: Each chip infrastructure comprises soft processor 14 (shown in FIG. 1), distributed reconfigurable elements 11 (shown in FIG. 1), and a reconfigurable communication processor 57 to interconnect them globally to form a fabric group (Note: a soft processor is a CPU design described in a hardware description language (HDL, e.g., Verilog/VHDL) that is implemented in the flexible, reconfigurable fabric of a field-programmable gate array (FPGA), rather than being a fixed silicon circuit) In [Col 7, lines 27-39]: The fabric chip mechanism may comprise a combination of instructions from the compiler and/or parser (static scheduling) which may be added with second level run-time hardware scheduling such as traditional superscalar architecture via a reconfigurable fabric manager 45. (BRI: within the context of reconfigurable computing systems, run-time hardware scheduling via a reconfigurable fabric manager can involve dynamically forming or reconfiguring a pipeline of selected hardware units to execute a task) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. Within the broader concept of ML orchestration (the automation, coordination, and management of ML workflows from data preparation to deployment and monitoring), MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, the most computationally intensive parts of deep learning.
Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]). In regard to claim 17: MODY discloses: - A non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the method as set forth in claim 1. In [0004]: - A computer-implemented method for configuring a neural network accelerator to process input data, In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: Each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit, In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. - the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit, In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure. - the method comprising: obtaining a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data; In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight.
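The [0019] passages quoted here are read as teaching that a layered model flattens into a single sequence of operations for the accelerator to perform. A minimal sketch of that reading follows; the layer names, operation labels, and helper are purely hypothetical and are not drawn from MODY.

```python
# Hypothetical layer descriptions in the spirit of MODY's [0019]:
# each layer bundles one or more operations to be run in order.
MODEL_LAYERS = [
    {"name": "first_layer",  "ops": ["matmul", "bias_add", "relu"]},
    {"name": "second_layer", "ops": ["conv2d", "relu"]},
    {"name": "third_layer",  "ops": ["matmul", "softmax"]},
]

def obtain_sequence(layers):
    """Flatten the layered model structure into the sequence of
    neural-network operations the accelerator is asked to perform."""
    return [op for layer in layers for op in layer["ops"]]

sequence = obtain_sequence(MODEL_LAYERS)
# ['matmul', 'bias_add', 'relu', 'conv2d', 'relu', 'matmul', 'softmax']
```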
In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network. Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data) - determining an order of the selected set of hardware processing units to perform the one or more neural network operations in accordance with the sequence; In [0028]: Once a ML model 302 is trained, the ML model 302 may be compiled and translated for a target hardware by a ML model compiler 304A, 304B, . . . 304n (collectively). In this example, the target hardware 306 is shown as a simplified version of the device shown in FIG. 2, and the target hardware 306 includes a SoC 308 with one or more cores 310A, 310B, . . . 310n, coupled to a shared memory 312. The SoC 308 is also coupled to external memory 314. The ML model compiler 304 helps prepare the ML model 302 for execution by the target hardware 306 by translating the ML model 302 to a runtime code 316A, 316B, . . . 316n (collectively 316) format that is compatible with the target hardware 306. The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on. - and providing the neural network accelerator with control information that causes the crossbar of the neural network accelerator In [0035]: Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML model. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively less resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B. In [0025]: The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. In [0025]: In this example, the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models.
A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: within the context of an ML accelerator coupled with a crossbar, a runtime controller provides the control information to form a pipeline) MODY does not explicitly disclose: - to form a pipeline of the selected set of hardware processing units in the determined order to process the input data. However, Ramesh discloses: - to form a pipeline of the selected set of hardware processing units in the determined order to process the input data. In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip. The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 5, lines 17-28]: The fabric element cell (FEC) may comprise the computing support to fabric functional elements for implementation and execution. In its generality, a FEC can be formed from a reconfigurable hardware and/or from software entities as a thread. For example, if security function is required for processing, a security fabric element may be formed by grouping FECs that may have different types of FEC to be adaptively selected. Each FEC may interact in a distributed environment with any of the other FECs. This capability may be translated to identification of fabric element cells in each functional entity and hardware units of the system-on chip. (BRI: the reconfigurable fabric is a powerful overarching architecture that facilitates adaptation of fabric cells for specific functions mapped onto different hardware units, such as a hardware accelerator function for a neural network operation; see the adaptation of the fabric for a security function as an example) In [Col 3, lines 32-35]: any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. In [Col 5, lines 33-36]: as shown in FIG. 5, may include cognitive fabric element cells 34, morph fabric element cells 35, soft fabric elements cells 36, storage fabric element cells 37, and reconfigurable fabric element cells 38. In [Col 5, lines 52-55]: As shown in FIG. 5, a morph fabric element cell 35 may control morphing of another fabric element cell residing in a global network/module level to a low level chip micro architecture. (BRI: morphing from a global application to a microarchitecture provides for using separate hardware units for NN acceleration operations, a major trend in computing where specialized hardware is integrated at the microarchitectural level to efficiently handle the increasing demands of neural networks (NNs) and AI workloads for enhanced performance, energy efficiency, and lower latency) In [Col 7, line 61 - Col 8, line 3]: the fabric chip system shown in FIG. 1 may also be used for hardware acceleration processing. In such a manner, general purpose reconfigurable processing elements 11 may be used for specific hardware acceleration functions.
An array of such elements may offer powerful reconfigurable computing solutions for adaptive stream processing for signal processing and packet processing applications. For example, signal processing sequences as frames and packets can be processed on a number of such distributed reconfigurable processing elements 11. In [Col 3, lines 54-57]: FIG. 2 comprises an architecture diagram of a distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11, In [Col 3, lines 58-61]: the integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. In [Col 3, lines 61-67]: Each distributed element may comprise a cognitive processor 93 and a switch element 9. The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto next cognitive processor 93 via switch 9 (BRI: an orthogonally laid out switch is fundamentally a crossbar switch, as the "orthogonal" layout describes the grid-like matrix of input (X-axis) and output (Y-axis) lines with switches at their intersections (crosspoints), enabling any-to-any connection, the core definition of a crossbar. A 4-port (typically 4x4) switch uses this matrix to connect any of its four inputs to any of its four outputs, often employing scheduling/arbitration logic for simultaneous transfers) In [Col 2, lines 15-16]: FIG. 4 shows a block diagram showing switch element configurations with 4-ports to each switch element of FIG. 3; In [Col 4, line 1]: it may be possible to bypass a series of switches 9 for one cognitive processor 93 to virtually connect to another In [Col 7, lines 43-47]: Each chip infrastructure comprises soft processor 14 (shown in FIG. 1), distributed reconfigurable elements 11 (shown in FIG. 1), and a reconfigurable communication processor 57 to interconnect them globally to form a fabric group (Note: a soft processor is a CPU design described in a hardware description language (HDL, e.g., Verilog/VHDL) that is implemented in the flexible, reconfigurable fabric of a field-programmable gate array (FPGA), rather than being a fixed silicon circuit) In [Col 7, lines 27-39]: The fabric chip mechanism may comprise a combination of instructions from the compiler and/or parser (static scheduling) which may be added with second level run-time hardware scheduling such as traditional superscalar architecture via a reconfigurable fabric manager 45. (BRI: within the context of reconfigurable computing systems, run-time hardware scheduling via a reconfigurable fabric manager can involve dynamically forming or reconfiguring a pipeline of selected hardware units to execute a task) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh.
Within the broader concept of ML orchestration (the automation, coordination, and management of ML workflows from data preparation to deployment and monitoring), MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]). In regard to claim 18: MODY discloses: - A computer-based device for configuring a neural network accelerator to process input data, In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure. In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: Each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit, In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. - the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit, In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure. - each hardware processing unit comprising hardware to accelerate performing one or more neural network operations on received data In [0020]: Nodes compute one or more functions based on the inputs received and corresponding weights and outputs a number. For example, the node may use a linear combination function which multiplies an input values from a node of the previous layer with a corresponding weight and sums across the results of the multiplication, coupled with a non-linear activation function which acts as a floor for the resulting number for output In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0020]: Nodes compute one or more functions based on the inputs received and corresponding weights and outputs a number. For example, the node may use a linear combination function which multiplies an input values from a node of the previous layer with a corresponding weight and sums across the results of the multiplication, coupled with a non-linear activation function which acts as a floor for the resulting number for output (BRI: the primary purpose of accelerating machine learning (ML) models is to directly and significantly accelerate neural network (NN) operations, which are the most computationally intensive parts of deep learning) In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network.
Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data) - the computing-based device comprising one or more processors configured to: obtain a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data; In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0019]: each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight. In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network. Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data) - select a set of hardware processing units from the plurality of hardware processing units to perform the one or more neural network operations In [0025]: the ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: ML cores are hardware units designed specifically to accelerate machine learning operations, and their configuration effectively selects these specialized units to perform specific computations) - determine an order of the selected set of hardware processing units to perform the one or more neural network operations in accordance with the sequence; In [0028]: Once a ML model 302 is trained, the ML model 302 may be compiled and translated for a target hardware by a ML model compiler 304A, 304B, . . . 304n (collectively). In this example, the target hardware 306 is shown as a simplified version of the device shown in FIG. 2, and the target hardware 306 includes a SoC 308 with one or more cores 310A, 310B, . . . 310n, coupled to a shared memory 312. The SoC 308 is also coupled to external memory 314. The ML model compiler 304 helps prepare the ML model 302 for execution by the target hardware 306 by translating the ML model 302 to a runtime code 316A, 316B, . . . 316n (collectively 316) format that is compatible with the target hardware 306. The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on. - and provide the neural network accelerator with control information that causes the crossbar of the neural network accelerator In [0035]: Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML model. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively less resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B.
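The [0035] balancing rationale can be pictured with a small scheduling sketch: stagger one model's start so its heavy layers do not coincide with another's. The per-layer load numbers and function names below are invented for illustration; MODY describes the balancing qualitatively, not this code.

```python
# Hypothetical per-layer resource demands (arbitrary units), echoing
# MODY [0035]: model_2 is heavy in its initial layers.
model_1 = [9, 9, 2, 2]
model_2 = [9, 8, 2, 1]

def peak_load(delay):
    """Combined peak demand if model_1 starts `delay` steps after model_2."""
    horizon = max(len(model_2), delay + len(model_1))
    loads = []
    for t in range(horizon):
        load = model_2[t] if t < len(model_2) else 0
        i = t - delay
        load += model_1[i] if 0 <= i < len(model_1) else 0
        loads.append(load)
    return max(loads)

# Pick the start delay that minimizes the combined peak, i.e. schedule
# model_1 after model_2's high-consumption layers have passed.
best_delay = min(range(len(model_2) + 1), key=peak_load)
```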
In [0025]: The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. In [0025]: In this example, the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: within the context of an ML accelerator coupled with a crossbar, a runtime controller provides the control information to form a pipeline) MODY does not explicitly disclose: - to form a pipeline of the selected set of hardware processing units in the determined order to process the input data. - wherein the control information comprises information identifying the selected set of hardware processing units and information identifying the determined order of the selected set of hardware processing units. [From claim 2] - wherein when a hardware processing unit of the plurality of hardware processing units does not form part of the set of hardware processing units then a predetermined value is written to the register corresponding to the output port connected to that hardware processing unit to indicate that the hardware processing unit is to be disabled. [From claim 6] However, Ramesh discloses: - to form a pipeline of the selected set of hardware processing units in the determined order to process the input data. In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip. The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 5, lines 17-28]: The fabric element cell (FEC) may comprise the computing support to fabric functional elements for implementation and execution. In its generality, a FEC can be formed from a reconfigurable hardware and/or from software entities as a thread. For example, if security function is required for processing, a security fabric element may be formed by grouping FECs that may have different types of FEC to be adaptively selected. Each FEC may interact in a distributed environment with any of the other FECs. This capability may be translated to identification of fabric element cells in each functional entity and hardware units of the system-on chip. (BRI: the reconfigurable fabric is a powerful overarching architecture that facilitates adaptation of fabric cells for specific functions mapped onto different hardware units, such as a hardware accelerator function for a neural network operation; see the adaptation of the fabric for a security function as an example)
In [Col 3, lines 32-35]: any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. In [Col 5, lines 33-36]: as shown in FIG. 5, may include cognitive fabric element cells 34, morph fabric element cells 35, soft fabric elements cells 36, storage fabric element cells 37, and reconfigurable fabric element cells 38. In [Col 5, lines 52-55]: As shown in FIG. 5, a morph fabric element cell 35 may control morphing of another fabric element cell residing in a global network/module level to a low level chip micro architecture. (BRI: morphing from a global application to a microarchitecture provides for using separate hardware units for NN acceleration operations, a major trend in computing where specialized hardware is integrated at the microarchitectural level to efficiently handle the increasing demands of neural networks (NNs) and AI workloads for enhanced performance, energy efficiency, and lower latency) In [Col 7, line 61 - Col 8, line 3]: the fabric chip system shown in FIG. 1 may also be used for hardware acceleration processing. In such a manner, general purpose reconfigurable processing elements 11 may be used for specific hardware acceleration functions. An array of such elements may offer powerful reconfigurable computing solutions for adaptive stream processing for signal processing and packet processing applications. For example, signal processing sequences as frames and packets can be processed on a number of such distributed reconfigurable processing elements 11. In [Col 3, lines 54-57]: FIG. 2 comprises an architecture diagram of a distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11, In [Col 3, lines 58-61]: the integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. In [Col 3, lines 61-67]: Each distributed element may comprise a cognitive processor 93 and a switch element 9. The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto next cognitive processor 93 via switch 9 (BRI: an orthogonally laid out switch is fundamentally a crossbar switch, as the "orthogonal" layout describes the grid-like matrix of input (X-axis) and output (Y-axis) lines with switches at their intersections (crosspoints), enabling any-to-any connection, the core definition of a crossbar. A 4-port (typically 4x4) switch uses this matrix to connect any of its four inputs to any of its four outputs, often employing scheduling/arbitration logic for simultaneous transfers) In [Col 2, lines 15-16]: FIG. 4 shows a block diagram showing switch element configurations with 4-ports to each switch element of FIG. 3; In [Col 4, line 1]: it may be possible to bypass a series of switches 9 for one cognitive processor 93 to virtually connect to another In [Col 7, lines 43-47]: Each chip infrastructure comprises soft processor 14 (shown in FIG. 1), distributed reconfigurable elements 11 (shown in FIG. 1), and a reconfigurable communication processor 57 to interconnect them globally to form a fabric group (Note: a soft processor is a CPU design described in a hardware description language (HDL, e.g., Verilog/VHDL) that is implemented in the flexible, reconfigurable fabric of a field-programmable gate array (FPGA), rather than being a fixed silicon circuit) In [Col 7, lines 27-39]: The fabric chip mechanism may comprise a combination of instructions from the compiler and/or parser (static scheduling) which may be added with second level run-time hardware scheduling such as traditional superscalar architecture via a reconfigurable fabric manager 45. (BRI: within the context of reconfigurable computing systems, run-time hardware scheduling via a reconfigurable fabric manager can involve dynamically forming or reconfiguring a pipeline of selected hardware units to execute a task) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. Within the broader concept of ML orchestration (the automation, coordination, and management of ML workflows from data preparation to deployment and monitoring), MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain (cognitive processor) added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]). In regard to claim 19: MODY discloses: - A computing-based device comprising one or more processors configured to perform the method as set forth in claim 1. In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure. In [0024]: FIG. 2 is a block diagram 200 of a device including hardware for executing ML models, in accordance with aspects of the present disclosure. The device may be system on a chip (SoC) including multiple components configured to perform different tasks. As shown, the device includes one or more central processing unit (CPU) cores 202, which may include one or more internal cache memories 204. The CPU cores 202 may be configured for general computing tasks. Claims 3-8 and 16 are rejected under 35 U.S.C. 103 as unpatentable over Mihir MODY et al. (hereinafter MODY), US 2022/0391776 A1, in view of Tirumale K. Ramesh et al. (hereinafter Ramesh), US 8103853 B2, further in view of Anuja Naik et al. (hereinafter Naik), "Efficient Network on Chip (NoC) using heterogeneous circuit switched routers," IEEE 2016 International Conference on VLSI Systems, Architectures, Technology and Applications (VLSI-SATA).
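Across claims 17-19, the disputed limitation is control information that causes the crossbar to form a pipeline of the selected units in the determined order. A minimal sketch of that data flow follows; the structure and names are hypothetical and reflect neither MODY's runtime controller nor Ramesh's fabric manager.

```python
from dataclasses import dataclass

@dataclass
class ControlInfo:
    """Hypothetical control block in the spirit of claims 17/18: which
    units are selected and in what order the crossbar should chain them."""
    selected_units: list
    order: list  # indices into selected_units

def configure_crossbar(info):
    """Derive the crossbar connections that form the pipeline: each
    connection routes one unit's output to the next unit's input."""
    chain = [info.selected_units[i] for i in info.order]
    return list(zip(chain, chain[1:]))

info = ControlInfo(
    selected_units=["convolution", "activation", "pooling"],
    order=[0, 1, 2],
)
connections = configure_crossbar(info)
# [('convolution', 'activation'), ('activation', 'pooling')]
```

Claims 3-5, treated next, narrow this to explicit input-port/output-port mappings held in per-port registers.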
In regard to claim 3: MODY and Ramesh do not explicitly disclose: - wherein the crossbar comprises a plurality of input ports and a plurality of output ports, and the control information comprises information identifying which input ports of the crossbar are to be connected to which output ports of the crossbar to form the pipeline. However, Naik discloses: - wherein the crossbar comprises a plurality of input ports and a plurality of output ports, and the control information comprises information identifying which input ports of the crossbar are to be connected to which output ports of the crossbar to form the pipeline. In [4.1, Page 1868]: Input-Stage Routing. PEs at the input side, which are connected to the input ports of input-stage switches, generate packets and write them to the input buffer of the input-stage switches. Each input of the switches has a controller logic that processes the contents of the input buffers in a first-in first-served manner. Once a packet reaches to the head of the queue, the routing unit makes the routing decision for the packet to determine the output port (and then the middle-stage switch) (BRI: the routing unit's decision provides this control information as a routing decision) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY, Ramesh, and Naik. Within the broader concept of ML orchestration (the automation, coordination, and management of ML workflows from data preparation to deployment and monitoring), MODY teaches a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). Naik teaches routing a destination address within a CLOS-based circuit switched router in which the CLOS network is formed as a multi-stage crossbar. The examiner interprets "crossbar" broadly enough to encompass a multistage crossbar during the examination process, under the principle of broadest reasonable interpretation; there is no explicit definition of "crossbar" in the specification for the examiner to use otherwise. One of ordinary skill would have had motivation to combine MODY, Ramesh, and Naik because the combination can provide improved energy efficiency for the CLOS router that uses multiple crossbar switches (Naik [Abstract, Page 1]). In regard to claim 4: MODY and Ramesh do not explicitly disclose: - wherein the neural network accelerator comprises a register for each output port, and providing the control information to the neural network accelerator comprises causing a value to be written to each register that identifies which input port of the plurality of input ports is to be connected to the corresponding output port. However, Naik discloses: - wherein the neural network accelerator comprises a register for each output port, and providing the control information to the neural network accelerator comprises causing a value to be written to each register that identifies which input port of the plurality of input ports is to be connected to the corresponding output port.
In [1, Page 1]: With increase in Very Large Scale Integration (VLSI) density, it is now possible to integrate general purpose processors, memory blocks, application specific intellectual property blocks (IP), digital signal processor (DSP), Graphic processor unit (GPU) and mixed signal functions on a single system-on-chip (SoC). In [1, Page 1]: With more applications that require battery powered embedded system units, the energy and area efficiency of the SoC is a very important factor. In [IV, Page 3]: Circuit switched router We propose the concept of lane division multiplexing (LDM) In [IV, Page 3]: Using LDM, a single port is segmented into smaller sets of bus which can be used by different data streams simultaneously. Our implementation terms a router as R(5,4) consist of 5 ports where each port is divided into 4 lanes of equal size in one direction. For example, the router shown in Figure 4 has eight lanes per port with four incoming and four outgoing lanes. We recognize switching network inside a router consumes major silicon area. To reduce this silicon area and power dissipation, we propose using multistage CLOS network where a single CLOS switch is made up of multiple small Crossbar switches. In [IV, Page 4, Input allocation unit]: Input allocation unit checks the incoming four data arriving to a router port and allocate appropriate lane to each of them. The allocation depends upon the destined direction. (i.e. data destined to go in the South will be allocated to that particular lane). The unit stores incoming data in the temporary buffers till the routing decision takes place. The allocation algorithm checks for the destination address of each input and then send it to appropriate lane and set the flag "high" for that channel. A higher flag for the lane suggests that the particular lane reserved to transmit the message cannot be used by any other data until the previous transmission is completed. Flags for all the lanes cleared every time before initiating a data transfer. Once all four input channels of a single port are allocated, data can travel to the desired output ports simultaneously. (BRI: the allocation unit's temporary buffer holds the destination address of the routing (output port)) In regard to claim 5: MODY and Ramesh do not explicitly disclose: - wherein each input port is allocated a number and the value written to a register is the number of the input port to be connected to the corresponding output port. However, Naik discloses: - wherein each input port is allocated a number and the value written to a register is the number of the input port to be connected to the corresponding output port. In [IV A, Page 3]: To simplify our design, we allocated each small Crossbar of the first stage to different ports – North, South, East, West and Tile. Figure 5 shows an example where each Crossbar can carry simultaneously 4 inputs from one direction and send them to 4 different directions except the one coming from the tile. We assume that data will not backtrack to the same router from where it arrived from. (BRI: a tile is a processing element)
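Claims 4 and 5 recite a register per crossbar output port whose written value is the number of the input port to connect. A minimal sketch of that programming model follows; the register file and port counts are hypothetical and are not Naik's LDM router.

```python
# Hypothetical register file: one register per crossbar output port.
# Writing input-port number N into output-port register P connects
# input port N to output port P (the claims 4/5 scheme).
NUM_OUTPUT_PORTS = 4

registers = [None] * NUM_OUTPUT_PORTS

def write_port_register(output_port, input_port):
    """Program one crossbar connection by a single register write."""
    registers[output_port] = input_port

# Route input port 2 to output port 0, and input port 0 to output port 3.
write_port_register(0, 2)
write_port_register(3, 0)
```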
In regard to claim 5: MODY and Ramesh do not explicitly disclose:
- wherein each input port is allocated a number and the value written to a register is the number of the input port to be connected to the corresponding output port.

However, Naik discloses this limitation. In [IV A, Page 3]: To simplify our design, we allocated each small Crossbar of the first stage to different ports – North, South, East, West and Tile. Figure 5 shows an example where each Crossbar can carry simultaneously 4 inputs from one direction and send them to 4 different directions except the one coming from the tile. We assume that data will not backtrack to the same router from where it arrived from. (BRI: a tile is a processing element.) See also the Input allocation unit passage of [IV, Page 4], quoted above in regard to claim 4. (BRI: the allocation unit's temporary buffer holds the destination address of each input; the number allocated to an input port identifies it in the routing.)
In regard to claim 6: MODY does not explicitly disclose:
- wherein when a hardware processing unit of the plurality of hardware processing units does not form part of the set of hardware processing units, a predetermined value is written to the register corresponding to the output port connected to that hardware processing unit to indicate that the hardware processing unit is to be disabled.

However, Ramesh discloses this limitation. In [Col 3, lines 54-67]: FIG. 2 comprises an architecture diagram of an distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11, network interface 4a, global cache control 16, and storage 18a. The integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. Each distributed element may comprise a cognitive processor 93 and a switch element 9. The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto next cognitive processor 93 via switch 9. It may be possible to bypass a series of switches 9 for one cognitive processor 93 [Col 4, line 1] to virtually connect to another. In [Col 4, lines 4-8]: At each switch interconnection, an edge cache 10 may be inserted that caches intermediate decisions and data by the cognitive processor 93 to effectively be used by any other cognitive processor 93 without having to access data from the source. (BRI: the edge cache and matrix switch combination can act as a hardware-implemented logic gate/multiplexer that ensures system stability by providing a default value when an expected processing unit is absent or offline. The cache itself might store this predetermined value or act as the buffer/pathway for its insertion into the register.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh. MODY teaches, within the broader concept of ML orchestration (the automation, coordination, and management from data preparation to deployment and monitoring), a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, which are the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). One of ordinary skill would have had motivation to combine MODY and Ramesh because, with a cognitive brain added within a smart switch, the cognitive control for the morphing process may add greater optimization of resource utilization and performance (Ramesh [Col 5, lines 48-51]).
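The claim 6 limitation (a predetermined register value marking an unselected unit as disabled) can be pictured with a small extension of the sketch above; the DISABLED value and the unit names are assumptions for illustration only.

    # Illustrative sketch: output ports whose hardware processing unit is not
    # in the selected set receive a predetermined "disabled" value.
    DISABLED = 0xFF  # assumed predetermined value meaning "unit disabled"

    ALL_UNITS = ["conv", "activation", "pooling", "normalisation"]
    OUTPUT_PORT_OF = {u: i for i, u in enumerate(ALL_UNITS)}

    def registers_for(selected, connections):
        """Write input-port numbers for selected units, DISABLED for the rest."""
        regs = {}
        for unit in ALL_UNITS:
            port = OUTPUT_PORT_OF[unit]
            if unit in selected:
                regs[port] = connections.get(port, DISABLED)
            else:
                regs[port] = DISABLED
        return regs

    print(registers_for({"conv", "activation"}, {0: 1, 1: 0}))
    # -> {0: 1, 1: 0, 2: 255, 3: 255}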
In regard to claim 7: MODY and Ramesh do not explicitly disclose:
- prior to providing the neural network accelerator with the control information, determining whether the control information is valid, and only providing the neural network accelerator with the control information if it is determined that the control information is valid.

However, Naik discloses this limitation. In [1, Page 1]: In the circuit switching router, a path from source to destination is established before the transmission of data and it cannot be allocated to other resources till the desired data transmission is completed. See also the Input allocation unit passage of [IV, Page 4], quoted above in regard to claim 4: the allocation algorithm checks the destination address of each input, sends it to the appropriate lane, and sets the flag "high" for that channel; flags for all the lanes are cleared before initiating a data transfer. (BRI: a flag status of "0" means the transmission is "valid" in that no blocking has occurred.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY, Ramesh and Naik for the reasons set forth above in regard to claim 3 (Naik [Abstract, Page 1]).

In regard to claim 8: MODY and Ramesh do not explicitly disclose:
- wherein it is determined that the control information is valid only if, when the output of a first hardware processing unit is to be the input to a second hardware processing unit, the control information indicates that the input port of the crossbar coupled to the output of the first hardware processing unit is to be connected to the output port of the crossbar coupled to the input of the second hardware processing unit.

However, Naik discloses this limitation in the same passages: [1, Page 1] (a source-to-destination path is established before data transmission and cannot be allocated to other resources until the transmission completes) and the Input allocation unit passage of [IV, Page 4], quoted above in regard to claim 4. (BRI: a flag status of "0" means the transmission is "valid" in that no blocking has occurred.)
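The validity condition of claims 7 and 8 (control information is provided only if each producer/consumer pair in the intended pipeline is actually wired through the crossbar) can be sketched as follows; the port maps are assumptions for illustration only.

    # Illustrative sketch: control information is valid only if, for each
    # consecutive (producer, consumer) pair, the register for the consumer's
    # output port names the producer's input port.
    IN_PORT_OF = {"conv": 0, "activation": 1, "pooling": 2}   # unit output -> crossbar input port
    OUT_PORT_OF = {"conv": 0, "activation": 1, "pooling": 2}  # crossbar output port -> unit input

    def is_valid(pipeline, registers):
        """pipeline: unit names in order; registers: {output_port: input_port}."""
        for producer, consumer in zip(pipeline, pipeline[1:]):
            if registers.get(OUT_PORT_OF[consumer]) != IN_PORT_OF[producer]:
                return False
        return True

    regs = {1: 0, 2: 1}  # intended pipeline: conv -> activation -> pooling
    assert is_valid(["conv", "activation", "pooling"], regs)
    assert not is_valid(["conv", "pooling"], regs)  # withhold invalid control info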
In regard to claim 16: MODY discloses:
- A method of configuring a neural network accelerator to implement a neural network, the neural network comprising a plurality of layers, each layer configured to receive input data and perform one or more neural network operations on the received input data, the method comprising:

In [0020]: Nodes compute one or more functions based on the inputs received and corresponding weights and outputs a number. For example, the node may use a linear combination function which multiplies an input values from a node of the previous layer with a corresponding weight and sums across the results of the multiplication, coupled with a non-linear activation function which acts as a floor for the resulting number for output. In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. Examples of neural network ML models may include LeNet, Alex Net, Mobilnet, etc. It may be understood that each implementation of a ML model may execute one or more ML algorithms and the ML model may be trained or tuned in a different way, depending on a variety of factors including, but not limited to, a type of ML model being used, parameters being used for the ML model, relationships as among the parameters, desired speed of training, etc. In this simplified example, parameter values of W, L, and iref are parameter inputs 102, 104, and 112 passed into the ML model 100. Each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc. In [0019]: For example, each node may represent a mathematical function that takes, as input (aside from the nodes of the first layer 106), output from a previous layer and a weight.

- grouping the neural network operations of the neural network into one or more sequences of neural network operations, each sequence of neural network operations being executable by a combination of hardware processing elements (see the sketch after this mapping):

In [0004]: one or more processors to receive a set of ML models, simulating running the set of ML models on a target hardware to determine resources required by the ML models of the set of ML models and timing information. In [0019]: each implementation of a ML model may execute one or more ML algorithms and the ML model may be trained or tuned in a different way, depending on a variety of factors including, but not limited to, a type of ML model being used, parameters being used for the ML model, relationships as among the parameters, desired speed of training, etc. In [0019]: inputs 102, 104, and 112 are passed into the ML model 100. Each layer (e.g., first layer 106, second layer 108, and third layer 110) includes a plurality of nodes (e.g., neurons) and generally represents a set of operations performed on the parameters, such as a set of matrix multiplications, convolutions, deconvolutions, etc.

- and executing the method as set forth in claims 1 to 8 for each sequence of neural network operations; a computer-implemented method for configuring a neural network accelerator to process input data:

In [0019], quoted above: FIG. 1 illustrates an example neural network ML model 100, such as a CNN, and each layer generally represents a set of operations performed on the parameters. In [0025]: the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models.

- the neural network accelerator comprising a plurality of hardware processing units and a crossbar coupled to each hardware processing unit of the plurality of hardware processing units:

In [0025], quoted above. In [0008]: FIG. 2 is a block diagram of a device including hardware for executing ML models, in accordance with aspects of the present disclosure.

- the method comprising: obtaining a sequence of one or more neural network operations to be performed by the neural network accelerator on the input data:

In [0019], quoted above. In [0019]: As each node of a layer (such as first layer 106) outputs to each node in a subsequent layer (such as second layer 108), ML model 100 is a fully connected neural network. Other embodiments may utilize a partially connected neural network or another neural network design which may not connect each node of a layer to each node of a subsequent layer, where some node connections may skip layers, where no feedback is provided from output to inputs (e.g. Feed Forward CNN), etc. (BRI: in the context of neural networks and their acceleration, each "layer" (e.g., first layer, second layer, third layer, etc.) typically represents a sequence of one or more neural network operations performed on the input data.)

- determining an order of the selected set of hardware processing units to perform the one or more neural network operations in accordance with the sequence:

In [0028]: Once a ML model 302 is trained, the ML model 302 may be compiled and translated for a target hardware by a ML model compiler 304A, 304B, . . . 304n (collectively). In this example, the target hardware 306 is shown as a simplified version of the device shown in FIG. 2, and the target hardware 306 includes a SoC 308 with one or more cores 310A, 310B, . . . 310n, coupled to a shared memory 312. The SoC 308 is also coupled to external memory 314. The ML model compiler 304 helps prepare the ML model 302 for execution by the target hardware 306 by translating the ML model 302 to a runtime code 316A, 316B, . . . 316n (collectively 316) format that is compatible with the target hardware 306. The ML model compiler 304 may also parameterize the ML model 302 being compiled. In some cases, the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc. In cases with multiple ML models 302 executing on multiple cores 310, the ML model compiler 304 may determine which core 310 a ML model 302 should run on.

- and providing the neural network accelerator with control information that causes the crossbar of the neural network accelerator:

In [0035]: Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML model. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively less resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B. In [0025]: The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216. (BRI: within the context of an ML accelerator coupled with a crossbar, the runtime controller provides the control information to form a pipeline.)
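The grouping step recited in claim 16 above (cutting a network's operations into sequences, each executable by one combination of hardware processing elements) might be pictured as follows; the greedy one-use-per-unit rule is an assumption for illustration only, not MODY's method.

    # Illustrative sketch: greedily cut an operation list into sequences,
    # starting a new sequence whenever a processing unit would be used twice.
    def group_into_sequences(operations):
        sequences, current, used = [], [], set()
        for op in operations:
            if op in used:           # unit already occupied: start a new sequence
                sequences.append(current)
                current, used = [], set()
            current.append(op)
            used.add(op)
        if current:
            sequences.append(current)
        return sequences

    net = ["conv", "activation", "conv", "activation", "pooling"]
    print(group_into_sequences(net))
    # -> [['conv', 'activation'], ['conv', 'activation', 'pooling']]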
MODY does not explicitly disclose:
- to form a pipeline of the selected set of hardware processing units in the determined order to process the input data;
- wherein the control information comprises information identifying the selected set of hardware processing units and information identifying the determined order of the selected set of hardware processing units [from claim 2];
- wherein when a hardware processing unit of the plurality of hardware processing units does not form part of the set of hardware processing units, a predetermined value is written to the register corresponding to the output port connected to that hardware processing unit to indicate that the hardware processing unit is to be disabled [from claim 6].

However, Ramesh discloses forming a pipeline of the selected set of hardware processing units in the determined order to process the input data. In [Col 5, lines 9-14]: FIG. 5 shows a block diagram identifying different types of fabric element cells which may be instantiated on a chip. The base entity of a fabric element may comprise a "fabric element cell" termed as FEC. The fluidity in the fabric may be demonstrated by the flexible residency of the fabric element cell within the physical entity. In [Col 5, lines 17-28]: The fabric element cell (FEC) may comprise the computing support to fabric functional elements for implementation and execution. In its generality, a FEC can be formed from a reconfigurable hardware and/or from software entities as a thread. For example, if security function is required for processing, a security fabric element may be formed by grouping FECs that may have different types of FEC to be adaptively selected. Each FEC may interact in a distributed environment with any of the other FECs. This capability may be translated to identification of fabric element cells in each functional entity and hardware units of the system-on chip. (BRI: the reconfigurable fabric is a powerful overarching architecture that facilitates adaptation of fabric cells for specific functions mapped onto different hardware units, for example a hardware accelerator function for a neural network operation; see the adaptation of the fabric for a security function as an example.) In [Col 3, lines 32-35]: any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. In [Col 5, lines 33-36]: as shown in FIG. 5, may include cognitive fabric element cells 34, morph fabric element cells 35, soft fabric elements cells 36, storage fabric element cells 37, and reconfigurable fabric element cells 38. In [Col 5, lines 52-55]: As shown in FIG. 5, a morph fabric element cell 35 may control morphing of another fabric element cell residing in a global network/module level to a low level chip micro architecture. (BRI: morphing from a global application to a microarchitecture supports using separate hardware units for NN acceleration operations; integrating specialized hardware at the microarchitectural level to handle the increasing demands of neural networks and AI workloads is a major trend in computing, for enhanced performance, energy efficiency and lower latency.) In [Col 7, line 61 - Col 8, line 3]: the fabric chip system shown in FIG. 1 may also be used for hardware acceleration processing. In such a manner, general purpose reconfigurable processing elements 11 may be used for specific hardware acceleration functions. An array of such elements may offer powerful reconfigurable computing solutions for adaptive stream processing for signal processing and packet processing applications. For example, signal processing sequences as frames and packets can be processed on a number of such distributed reconfigurable processing elements 11. In [Col 3, lines 54-57]: FIG. 2 comprises an architecture diagram of an distributed virtual connectivity switch VS including cognitive processors 93, switches 9, edge caches 10, distributed reconfigurable processors 11. In [Col 3, lines 58-61]: the integrated virtual connective switch VS may comprise a mesh connected multi-processing architecture with distributed processor switch elements 9 having at least four ports per element. In [Col 3, lines 61-67]: Each distributed element may comprise a cognitive processor 93 and a switch element 9. The cognitive processors 93 and 4-port switches 9 may be orthogonally laid and distributed. Every cognitive processor 93 may take intermediate decisions and pass it onto next cognitive processor 93 via switch 9. (BRI: an orthogonally laid out switch is fundamentally a crossbar switch, as the "orthogonal" layout describes the grid-like matrix of input (X-axis) and output (Y-axis) lines with switches at their intersections (crosspoints), enabling any-to-any connection, the core definition of a crossbar. A 4-port (typically 4x4) switch uses this matrix to connect any of its four inputs to any of its four outputs, often employing scheduling/arbitration logic for simultaneous transfers.) In [Col 2, lines 15-16]: FIG. 4 shows a block diagram showing switch element configurations with 4-ports to each switch element of FIG. 3. In [Col 4, line 1]: it may be possible to bypass a series of switches 9 for one cognitive processor 93 to virtually connect to another. In [Col 7, lines 43-47]: Each chip infrastructure comprises soft processor 14 (shown in FIG. 1), distributed reconfigurable elements 11 (shown in FIG. 1), and a reconfigurable communication processor 57 to interconnect them globally to form a fabric group. (Note: a soft processor is a CPU design described in a hardware description language (HDL, such as Verilog/VHDL) that is implemented in the flexible, reconfigurable fabric of a field-programmable gate array (FPGA), rather than being a fixed silicon circuit.) In [Col 7, lines 27-39]: The fabric chip mechanism may comprise a combination of instructions from the compiler and/or parser (static scheduling) which may be added with second level run-time hardware scheduling such as traditional superscalar architecture via a reconfigurable fabric manager 45. (BRI: within the context of reconfigurable computing systems, run-time hardware scheduling via a reconfigurable fabric manager can involve dynamically forming or reconfiguring a pipeline of selected hardware units to execute a task.)

Ramesh also discloses control information comprising information identifying the selected set of hardware processing units and the determined order of the selected set. In [Col 6, lines 46-49]: the fabric morphing control 12 may provide global control of fabric element cells morphing from global application instances to a processor micro-architecture 29 (shown in FIG. 6). In [Col 3, lines 29-35]: The soft processor 14 may comprise single core or multiple cores 15. With multiple cores, cores may be allocated and reallocated at run-time to optimize for performance based on the load balancing on these core workloads. Any custom cores for specific functions may be combined into a group of a single entity for aggregation of processing powers from the cores. (BRI: the fabric morphing control offers global control over how fabric element cells group and reconfigure.)

Ramesh further discloses the predetermined disable value of claim 6, in [Col 3, lines 54-67] and [Col 4, lines 1-8], as quoted above in regard to claim 6, together with the accompanying BRI remarks.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY and Ramesh for the reasons set forth above in regard to claim 6 (Ramesh [Col 5, lines 48-51]).
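Forming a pipeline of the selected units in the determined order, the limitation mapped to Ramesh above, amounts to chaining the crossbar registers; a minimal sketch follows, with assumed port assignments and an assumed "external" port for input and output data.

    # Illustrative sketch: emit per-output-port registers that chain an ordered
    # set of selected units into a pipeline through the crossbar.
    IN_PORT = {"conv": 0, "activation": 1, "pooling": 2, "external": 3}
    OUT_PORT = {"conv": 0, "activation": 1, "pooling": 2, "external": 3}

    def pipeline_registers(order):
        """order: unit names in execution order, e.g. ["conv", "activation"]."""
        regs = {OUT_PORT[order[0]]: IN_PORT["external"]}  # input data feeds the first unit
        for producer, consumer in zip(order, order[1:]):
            regs[OUT_PORT[consumer]] = IN_PORT[producer]
        regs[OUT_PORT["external"]] = IN_PORT[order[-1]]   # last unit drives the output
        return regs

    print(pipeline_registers(["conv", "activation", "pooling"]))
    # -> {0: 3, 1: 0, 2: 1, 3: 2}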
MODY and Ramesh do not explicitly disclose:
- wherein the crossbar comprises a plurality of input ports and a plurality of output ports, and the control information comprises information identifying which input ports of the crossbar are to be connected to which output ports of the crossbar to form the pipeline [from claim 3];
- wherein the neural network accelerator comprises a register for each output port, and providing the control information to the neural network accelerator comprises causing a value to be written to each register that identifies which input port of the plurality of input ports is to be connected to the corresponding output port [from claim 4];
- wherein each input port is allocated a number and the value written to a register is the number of the input port to be connected to the corresponding output port [from claim 5];
- prior to providing the neural network accelerator with the control information, determining whether the control information is valid, and only providing the neural network accelerator with the control information if it is determined that the control information is valid [from claim 7];
- wherein it is determined that the control information is valid only if, when the output of a first hardware processing unit is to be the input to a second hardware processing unit, the control information indicates that the input port of the crossbar coupled to the output of the first hardware processing unit is to be connected to the output port of the crossbar coupled to the input of the second hardware processing unit [from claim 8].

However, Naik discloses these limitations, as set forth in the mappings above in regard to claims 3, 4, 5, 7 and 8 (Naik [4.1, Page 1868]; [1, Page 1]; [IV, Pages 3-4]; [IV A, Page 3], together with the accompanying BRI remarks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY, Ramesh and Naik. MODY teaches, within the broader concept of ML orchestration (the automation, coordination, and management from data preparation to deployment and monitoring), a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, which are the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). Naik teaches routing via the ports of a CLOS network (a multi-stage crossbar) as a circuit-switched router. One of ordinary skill would have had motivation to combine MODY, Ramesh and Naik because the combination can provide improved energy efficiency for a CLOS router that uses multiple crossbar switches (Naik [Abstract, Page 1]).

Claims 12 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Mihir MODY et al. (hereinafter MODY), US 2022/0391776 A1, in view of Tirumale K. Ramesh et al. (hereinafter Ramesh), US 8103853 B2, and further in view of Eunjin Baek et al. (hereinafter Baek), "A Multi-Neural Network Acceleration Architecture," 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).

In regard to claim 12: MODY and Ramesh do not explicitly disclose:
- wherein at least one of the hardware processing units in the set is configurable to transmit or receive a tensor in a selected processing order of a plurality of selectable processing orders, and the method further comprises selecting a processing order to be used by one or more of the at least one processing units for transmitting or receiving a tensor based on the pipeline, and wherein the control information comprises information identifying the selected processing order.

However, Baek discloses this limitation. In [IV D, Page 949]: we propose a hardware implementation of AI-MT, but it is also possible to implement our mechanism only using software by implementing the scheduling layer between the software framework (e.g., TensorFlow) and the accelerator. In [VI, Page 952]: NVIDIA's TensorRT provides the concurrent DNN execution support, with which users can run multiple DNNs on the same GPUs simultaneously. In [VII, Page 951]: Xilinx's FPGA-based xDNN systolic-array architecture provides two operational modes, throughput and latency-optimized modes, by adopting different mapping strategies for CNN's early layers and adjusting its pipeline stages. In [1, Page 941]: To create fine-grain tasks, AI-MT first divides each layer into multiple identical sub-layers at compile time. As the size of the sub-layer is statically determined by the PE array's mapping granularity, each sub-layer's SRAM requirement is small and identical. During a sub-layer execution, we define the phase of loading its weights to the on-chip SRAM as Memory Block (MB) execution and the phase of input processing with the loaded weights as Compute Block (CB) execution. In [1, Page 941]: To dynamically execute MBs and CBs for the best resource load matching during runtime, AI-MT exploits a hardware-based sub-layer scheduler. The scheduler dynamically schedules MBs and CBs as long as their dependency is satisfied. (BRI: dynamic scheduling of CBs may employ specific hardware components to manage and select functional units.) In [IV, Page 946]: By combining three features, AI-MT can easily track the weight address for CBs to be executed.
For example, when AI-MT prefetches the weight values, the memory controller allocates block ids referring to the free list and stores the weights to the blocks. After that, AI-MT refers to the weight management table using w tail as an index and updates the value to the newly allocated weight block id. AI-MT also updates w tail to the newly allocated weight block id to keep the last weight block id for the layer. This mechanism enables to find the CB's weight address sequentially using only w head and w tail. When CB is scheduled, AI-MT refers to the w head to find the corresponding weight blocks for the CB, and updates w head to the next weight blocks by referring weight management table for the next CB. (BRI: the memory controller provides the control information through its allocation of CB blocks.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY, Ramesh and Baek. MODY teaches, within the broader concept of ML orchestration (the automation, coordination, and management from data preparation to deployment and monitoring), a hardware accelerator with a set of hardware units for neural network acceleration (see [0025]), where accelerating machine learning (ML) models provides acceleration of neural network (NN) operations, which are the most computationally intensive parts of deep learning. Ramesh teaches a reconfigurable fabric with a fabric switch forming a crossbar that facilitates forming a pipeline of execution units in the determined order, and demonstrates a powerful overarching accelerator architecture (see [Col 7, lines 61-67]). Baek teaches a tensor-based pipeline. One of ordinary skill would have had motivation to combine MODY, Ramesh and Baek because the combination can improve CB resource utilization (Baek [IV B, Page 946]).
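The claim 12 limitation (a unit that can transmit or receive a tensor in one of several selectable processing orders, with the chosen order carried in the control information) could be sketched as below; the order names and the selection rule are assumptions for illustration only, not Baek's mechanism.

    # Illustrative sketch: select a tensor processing order from the pipeline
    # composition and carry it in the control information.
    from enum import Enum

    class ProcOrder(Enum):
        ROW_MAJOR = 0
        CHANNEL_MAJOR = 1

    def select_order(pipeline):
        """Toy rule: pick an order based on which units the pipeline contains."""
        return ProcOrder.CHANNEL_MAJOR if "pooling" in pipeline else ProcOrder.ROW_MAJOR

    control_info = {"processing_order": select_order(["conv", "pooling"]).value}
    print(control_info)  # -> {'processing_order': 1}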
In regard to claim 15: MODY discloses:
- wherein the plurality of hardware processing units comprises one or more of:

In [0047]: In some cases, the multi-core orchestrator module 906 may be integrated with the ML model compilation and translation, or run separate from and in addition to the compilation and translation. (BRI: a multi-core orchestration module integrated into ML model compilation can provide use of specialized functional units for operations such as convolution, activation, pooling, normalization, element-wise operations, and interleaving.)

- a convolution processing unit configured to accelerate convolution operations between input data and weight data:

In [0019]: FIG. 1 illustrates an example neural network ML model 100, in accordance with aspects of the present disclosure. The example neural network ML model 100 is a simplified example presented to help understand how a neural network ML model 100, such as a CNN, is structured and trained. In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0028]: the ML parameters may include information that may be dynamically loaded from memory for executing the ML model 302, such as weights, layer ordering information, structure, etc.

- an activation processing unit configured to accelerate applying an activation function to data:

In [0025]: The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. In [0020]: Different weights may be applied for the input received from each node of the previous layer by the subsequent layer. In [0020]: Nodes compute one or more functions based on the inputs received and corresponding weights and outputs a number. For example, the node may use a linear combination function which multiplies an input values from a node of the previous layer with a corresponding weight and sums across the results of the multiplication, coupled with a non-linear activation function which acts as a floor for the resulting number for output.

MODY and Ramesh do not explicitly disclose:
- an element-wise operations processing unit configured to accelerate performing one or more element-wise operations on a set of data;
- a pooling processing unit configured to accelerate applying a pooling function on data;
- a normalisation processing unit configured to accelerate applying a normalisation function to data;
- and an interleave processing unit configured to accelerate rearrangement of data.

However, Baek discloses an element-wise operations processing unit configured to accelerate performing one or more element-wise operations on a set of data. In [V A, Page 949]: We take the baseline hardware parameters from recent TPU specifications [1] and scale up the number of PE arrays from two to 16. In [V A, Page 949]: The architectural parameters are summarized in Table I. (BRI: a hardware parameter that specifies the processing element dimensions in an array likely refers to an architecture designed for parallel, element-wise operations; this configuration enables the hardware to process multiple data points simultaneously, with the use of the array for vectorization facilitating Single Instruction, Multiple Data (SIMD) operations.) In [V A, Page 949]: Figure 14 shows the speedup results for all co-located neural network workloads over the FIFO mechanism (Figure 6a) using a single batch. (BRI: it is known in the art that a neural network workload processing a single batch is fundamentally a SIMD operation at the hardware level.)

Baek also discloses a pooling processing unit configured to accelerate applying a pooling function on data. In [II B, Page 942]: to start a neural network execution, our baseline architecture first loads the weight values from HBM to the unified buffers. After feeding the weight values into PE arrays, it performs the layer's operations using the arrays. Next, the intermediate results are passed to the dedicated units which perform subsequent operations (e.g., activation, normalization, pooling).

Baek likewise discloses a normalisation processing unit configured to accelerate applying a normalisation function to data, in the same passage of [II B, Page 942]: the intermediate results are passed to the dedicated units which perform subsequent operations (e.g., activation, normalization, pooling).

Baek further discloses an interleave processing unit configured to accelerate rearrangement of data. In [II B, Page 942]: the accelerator passes the output values to the buffer which can be reused as the input values for the next layer. Our baseline architecture supports a double-buffering to prefetch the weights to the PE array during its computation for hiding the fetching latency [33]. (BRI: this process is generally known as data reuse and is fundamental to maximizing performance by minimizing slow, energy-intensive data movement to and from main memory; the buffer holds temporary data as an interleaving mechanism to facilitate reuse.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine MODY, Ramesh and Baek for the reasons set forth above in regard to claim 12 (Baek [IV B, Page 946]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH, whose telephone number is (571) 272-4605. The examiner can normally be reached by phone. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TIRUMALE K RAMESH/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Sep 30, 2022
Application Filed
Jan 06, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12518153
TRAINING MACHINE LEARNING SYSTEMS
2y 5m to grant Granted Jan 06, 2026
Patent 12293284
META COOPERATIVE TRAINING PARADIGMS
2y 5m to grant Granted May 06, 2025
Patent 12229651
BLOCK-BASED INFERENCE METHOD FOR MEMORY-EFFICIENT CONVOLUTIONAL NEURAL NETWORK IMPLEMENTATION AND SYSTEM THEREOF
2y 5m to grant Granted Feb 18, 2025
Patent 12131244
HARDWARE-OPTIMIZED NEURAL ARCHITECTURE SEARCH
2y 5m to grant Granted Oct 29, 2024
Patent 11803745
TERMINAL DEVICE AND METHOD FOR ESTIMATING FIREFIGHTING DATA
2y 5m to grant Granted Oct 31, 2023
Based on the 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
18%
Grant Probability
20%
With Interview (+2.1%)
4y 5m
Median Time to Grant
Low
PTA Risk
Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
