Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy of Application No. KR10-2024-0047283, filed on 04/08/2024, has been received.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/04/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Examiner’s Note (EN)
The prior art rejections below cite particular paragraphs, columns, and/or line numbers in the references for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1-22 are rejected under 35 U.S.C. 102(a)(1) and (a)(2) as being anticipated by Vemuri et al. (US10789402B1).
Regarding Claim 1, Vemuri teaches an integrated circuit comprising: a neural processing unit (NPU) comprising a plurality of processing elements (PEs), each of the PEs comprising a multiplier-accumulator circuit configured to perform multiply-accumulate operations (Col 3, ln 27-66, "One type of programmable IC that may work for processing and accelerating data passing through the layers of DNNs are FPGAs, which have many lookup arrays, available on-chip storage, and digital signal processing units. Using these FPGA components, an exemplary software design to take in a neural network and configure the programmable IC to execute the DNN is described herein. While the present disclosure discusses a software design to configure a neural network, the present disclosure is not limited to neural networks or deep neural networks and can include other types of machine learning frameworks. … In one embodiment, the programmable IC 120 includes programmable logic 122, a DPE array 130 having multiple DPEs 1321-132N, memory 140, and control logic 150. In one embodiment, the control logic 150 configures the programmable logic 122, and the programmable logic uses run-time parameters from the control logic 150 to control the DPE array 130. For example, using a received bitstream that contains configuration data, control logic 150 can configure the programmable logic 122 (which can include a plurality of configurable logic blocks) with run-time parameters, and the programmable logic 122 controls the DPE array 130 that has any number of DPEs (132 1-132 N). For example, the programmable logic 122 can include look up tables, function generators, registers, multiplexers, and the like. In one embodiment, the programmable IC includes a DPE array 130 having any number of DPEs, and each DPE comprises specialized circuitry to connect an array of neural network units (NNU) (not illustrated). 
In one embodiment, the NNUs of the DPEs comprise non-programmable logic i.e., are hardened specialized processing elements, and comprise hardware elements including, but not limited to, program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), and multiply accumulators (MAC)").
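EN: for illustration only, the multiply-accumulate operation recited for each PE reduces to repeated steps of the form acc = acc + a·b; a dot product is a sequence of such steps. The following Python sketch uses hypothetical values and is not drawn from the reference:

```python
# Minimal sketch of a multiply-accumulate (MAC) unit: each step folds one
# product into a running accumulator, as a PE's MAC circuit would.
def mac(acc, a, b):
    """One multiply-accumulate step: return acc + a * b."""
    return acc + a * b

# A dot product computed by repeated MAC steps (illustrative operands).
acc = 0
for a, b in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, a, b)
# acc now holds 1*4 + 2*5 + 3*6
```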
a central processing unit (CPU) coupled to the NPU; (Col 4, ln 3-63, “FIG. 2 is a block diagram 200 of the compiler 114 and the HAL 116 to be used with a hardware-software interface 118 to communicate with the programmable IC 120. As mentioned with FIG. 1, the host computer 102 includes a compiler 114 and a HAL 116 for use with a DNN inference accelerator (also referred herein as a programmable IC). In one embodiment, the compiler 114 exports an application program interface (API) to the host computer 102. This exported API takes in a network description of a DNN in various framework specific formats (e.g., deploy.prototxt of the caffe framework) and generates an intermediate hardware-dependent representation of the network. The HAL 116 takes this intermediate representation of the network and programs the hardware for execution using the hardware-software interface 118”)
one or more memory circuits coupled to the NPU and the CPU, the one or more memory circuits storing instructions that, when executed by the CPU, cause the CPU to (Col 3, ln 8-27, "Embodiments herein describe a compiler and hardware-abstraction-layer architecture for a programmable integrated circuit (IC). The complexity of mapping and porting a neural network to the programmable IC is abstracted by exporting a set of application programming interfaces (APIs). A software developer with minimal know how on hardware design can attach their network description of the neural network to the API and map/port their neural networks to FPGA for acceleration. The API takes the network description of the neural network in a high level abstraction. The compiler generates a network graph and a corresponding execution sequence vector based on the network description and optimally allocates buffer handles for each of the layers in the network graph. The hardware abstraction layer, then, takes the network graph, the corresponding execution sequence vector, and the handles allocated by the compiler, sets up the hardware runtime parameters, and schedules the commands in the network graph and corresponding execution sequence vector to respective hardware blocks on a programmable IC").
compile a first neural network model of a first machine learning framework incompatible with the NPU into first machine code executable by the NPU, according to first mapping information representing mapping of elements of the first machine learning framework to functions or operations executable on at least one of the NPU or the CPU (Col 7, ln 44-50, “Operations 400 begin, at 402, with the compiler 114 receiving a network description of a neural network. In one embodiment, a user provides the network description of the neural network to an API, and the API in turn transmits the network description to the compiler 114 on the host computer 102. In some embodiments, the network description uses framework specific formats (e.g., caffe, TensorFlow).” Col 3, ln 17-27, "The compiler generates a network graph and a corresponding execution sequence vector based on the network description and optimally allocates buffer handles for each of the layers in the network graph. The hardware abstraction layer, then, takes the network graph, the corresponding execution sequence vector, and the handles allocated by the compiler, sets up the hardware runtime parameters, and schedules the commands in the network graph and corresponding execution sequence vector to respective hardware blocks on a programmable IC." Col 4-5, ln 64-5, "In one embodiment, the parser 202 provides an interface to various deep learning network frameworks 206 with an API, like an API exported by the compiler 114. The API takes inputs in the same format as the deep learning frameworks do. Accordingly, the parser 202 takes models trained using various deep learning network frameworks 206 like caffe or TensorFlow and converts them to a network graph structure. In one embodiment, the network graph structure is an XGraph. In one embodiment, the graph structure converted by the parser 202 is a directed acyclic graph with heterogeneous nodes which encode information about various network layers and their connectivity. 
An example of a directed acyclic graph is presented in FIG. 3. In one embodiment, the backend 210 of the compiler 114 works on the network graph structure (generated by the parser 202) and performs operations on the network graph structure to generate an execution sequence vector. The execution sequence vector comprises a sequential queue of the layers of the network graph structure. Details about the execution sequence vector are provided below. The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214, a job queue scheduler 216 and an IO memory optimizer 218. Each of these components in the backend 210 works to perform operations on the network graph structure and generate an execution sequence vector to pass onto the HAL 116.").
store the first machine code (Col 4, ln 54-62, "In one embodiment, the HAL 116 takes the hardware-dependent graph from the compiler 114 and sets up the hardware runtime parameters of the programmable IC 120, allocates the buffers needed by the programmable IC hardware for processing the network, and schedules the nodes in the hardware-dependent graph into respective hardware execution queues. The command scheduler 226 of the HAL 116 then invokes the programmable IC through the hardware-software interface 118." Col 7, ln 20-24 ,"In one embodiment, the HAL 116 also comprises a command scheduler 226 that efficiently dispatches commands in the execution sequence vector to the programmable IC for processing. The command scheduler is further detailed with regards to FIG. 18"; see also col. 7, ln 44-61 and Col 18, ln 3-19 , "The command scheduler 226 uses the layer classifier 1804 to segregate the commands in the execution sequence vector 1802 based on the DPE to be used for processing the command. In some embodiments, the command scheduler 226 maintains a separate command queue 228 1-228 N for each DPE 132 1-132 N of the programmable IC 120. Once the commands of the execution sequence vector 1802 are separated based on layer type, the dispatcher 1806 then pops commands from the queues, checks for any dependencies on the command, and if the dependencies are cleared for a command, the scheduler dispatches the command to the respective DPEs 132 1-132 N asynchronously and receives a corresponding response from the respective DPE upon completion of the command. Because each DPE has its own command queue 228 1-228 N for dispatch, multiple DPEs can be active simultaneously.").
send the first machine code to the NPU for execution (Col 16, ln 7-22, "As mentioned previously, the HAL 116 receives an execution sequence vector from the compiler 114, and the execution sequence vector passes to the programmable IC setup component 222, the buffer manager 224, and to the command scheduler 226. Of the components of the HAL 116, the buffer manager 224 handles both constant buffers and I/O buffers used for both hardware and software of the programmable IC 120. The buffer manager 224 allocates two kinds of buffers: constant buffers and I/O buffers. The constant buffers are read-only buffers for the programmable IC 120 and are used for trained parameters (e.g., weights for layers in the neural network to process input data). The I/O buffers are read-write buffers for the programmable IC 120 to store the intermediate outputs between layers/nodes and accordingly can be reused between layers/nodes of the neural network"; see also col. 7, ln 44-61).
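EN: for illustration only, the compile/store/send flow mapped above can be sketched as follows. All names, opcode values, and the mapping table are hypothetical and are not taken from Vemuri:

```python
# Hypothetical mapping information: framework-level op -> NPU opcode.
MAPPING_INFO = {"Conv2D": 1, "ReLU": 2, "MatMul": 3}

def compile_model(layers, mapping=MAPPING_INFO):
    """Compile a list of framework-level layer names into NPU opcodes
    according to the mapping information."""
    return [mapping[layer] for layer in layers]

class NPUStub:
    """Stand-in for the NPU: simply executes (here, echoes) received code."""
    def execute(self, machine_code):
        return list(machine_code)

class Host:
    """CPU side: compiles, stores, and dispatches machine code to the NPU."""
    def __init__(self):
        self.code_store = {}

    def store(self, model_id, machine_code):
        self.code_store[model_id] = machine_code

    def send_to_npu(self, model_id, npu):
        return npu.execute(self.code_store[model_id])

host = Host()
host.store("model-1", compile_model(["Conv2D", "ReLU", "MatMul"]))
result = host.send_to_npu("model-1", NPUStub())
```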
Regarding claim 2, Vemuri teaches wherein the instructions, when executed by the CPU, cause the CPU to: compile a second neural network model of a second machine learning framework incompatible with the NPU into second machine code executable by the NPU, according to second mapping information representing mapping of the second machine learning framework to the configuration of at least one of the NPU or the CPU (Col 7, ln 44-50, “Operations 400 begin, at 402, with the compiler 114 receiving a network description of a neural network. In one embodiment, a user provides the network description of the neural network to an API, and the API in turn transmits the network description to the compiler 114 on the host computer 102. In some embodiments, the network description uses framework specific formats (e.g., caffe, TensorFlow).” Col 3, ln 17-27, "The compiler generates a network graph and a corresponding execution sequence vector based on the network description and optimally allocates buffer handles for each of the layers in the network graph. The hardware abstraction layer, then, takes the network graph, the corresponding execution sequence vector, and the handles allocated by the compiler, sets up the hardware runtime parameters, and schedules the commands in the network graph and corresponding execution sequence vector to respective hardware blocks on a programmable IC." Col 4-5, ln 64-5, "In one embodiment, the parser 202 provides an interface to various deep learning network frameworks 206 with an API, like an API exported by the compiler 114. The API takes inputs in the same format as the deep learning frameworks do. Accordingly, the parser 202 takes models trained using various deep learning network frameworks 206 like caffe or TensorFlow and converts them to a network graph structure. In one embodiment, the network graph structure is an XGraph. 
In one embodiment, the graph structure converted by the parser 202 is a directed acyclic graph with heterogeneous nodes which encode information about various network layers and their connectivity. An example of a directed acyclic graph is presented in FIG. 3. In one embodiment, the backend 210 of the compiler 114 works on the network graph structure (generated by the parser 202) and performs operations on the network graph structure to generate an execution sequence vector. The execution sequence vector comprises a sequential queue of the layers of the network graph structure. Details about the execution sequence vector are provided below. The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214, a job queue scheduler 216 and an IO memory optimizer 218. Each of these components in the backend 210 works to perform operations on the network graph structure and generate an execution sequence vector to pass onto the HAL 116." EN: Caffe and Tensorflow are two separate frameworks denoting a first and second framework; see also col. 18, line 64 – col. 19, line 17).
store the second machine code (Col 4, ln 54-62, "In one embodiment, the HAL 116 takes the hardware-dependent graph from the compiler 114 and sets up the hardware runtime parameters of the programmable IC 120, allocates the buffers needed by the programmable IC hardware for processing the network, and schedules the nodes in the hardware-dependent graph into respective hardware execution queues. The command scheduler 226 of the HAL 116 then invokes the programmable IC through the hardware-software interface 118." Col 7, ln 20-24 ,"In one embodiment, the HAL 116 also comprises a command scheduler 226 that efficiently dispatches commands in the execution sequence vector to the programmable IC for processing. The command scheduler is further detailed with regards to FIG. 18" and Col 18, ln 3-19 , "The command scheduler 226 uses the layer classifier 1804 to segregate the commands in the execution sequence vector 1802 based on the DPE to be used for processing the command. In some embodiments, the command scheduler 226 maintains a separate command queue 228 1-228 N for each DPE 132 1-132 N of the programmable IC 120. Once the commands of the execution sequence vector 1802 are separated based on layer type, the dispatcher 1806 then pops commands from the queues, checks for any dependencies on the command, and if the dependencies are cleared for a command, the scheduler dispatches the command to the respective DPEs 132 1-132 N asynchronously and receives a corresponding response from the respective DPE upon completion of the command. Because each DPE has its own command queue 228 1-228 N for dispatch, multiple DPEs can be active simultaneously."; see also col. 18, line 64 – col. 19, line 17).
send the second machine code to the NPU for execution (Col 16, ln 7-22, "As mentioned previously, the HAL 116 receives an execution sequence vector from the compiler 114, and the execution sequence vector passes to the programmable IC setup component 222, the buffer manager 224, and to the command scheduler 226. Of the components of the HAL 116, the buffer manager 224 handles both constant buffers and I/O buffers used for both hardware and software of the programmable IC 120. The buffer manager 224 allocates two kinds of buffers: constant buffers and I/O buffers. The constant buffers are read-only buffers for the programmable IC 120 and are used for trained parameters (e.g., weights for layers in the neural network to process input data). The I/O buffers are read-write buffers for the programmable IC 120 to store the intermediate outputs between layers/nodes and accordingly can be reused between layers/nodes of the neural network"; see also col. 18, line 64 – col. 19, line 17).
Regarding claim 3, Vemuri teaches wherein the configuration of the NPU further includes at least one of:
an internal memory size of the NPU (Col 16, ln 25-39, "For the constant buffers, each layer of the network graph has its own set of constants data (e.g., weights, biases) and the buffer manager 224 loads the constant data into the constant buffers before invoking the programmable IC for inference. The buffer manager 224 allocates a pool of constant buffers and generates the layer offsets into these constant buffers. The hardware-setup block, described in further detail below, uses these layer offsets to populate the constant buffers with the constants data. The buffer manager 224 pre-allocates a pool of fixed-size buffers (e.g., 64 MB) based on the memory footprint of the constants (e.g., parameters, biases) used by the network. Each buffer is a contiguous block of memory and can host constants of multiple layers, but the constant buffers do not permit the constants data to straddle across multiple buffers").
a bitwidth of read or write operations associated with the one or more memory circuits;
a type, structure or speed of the one or more memory circuit (Col 3-4, ln 43-2, "In one embodiment, the programmable IC includes a DPE array 130 having any number of DPEs, and each DPE comprises specialized circuitry to connect an array of neural network units (NNU) (not illustrated). In one embodiment, the NNUs of the DPEs comprise non-programmable logic i.e., are hardened specialized processing elements, and comprise hardware elements including, but not limited to, program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), and multiply accumulators (MAC). The detailed circuitry within the memory 140 can include any type of volatile or nonvolatile memory. In one embodiment, the memory 140 includes an array of memory elements" and Col 16, ln 40-42, "In one embodiment of FIG. 15, the buffer manager 224 allocates constant buffers 1502 of equal sizes in memory (such as DDR memory)").
types of number formats supported by the NPU (Col 3-4, ln 43-2, "In one embodiment, the programmable IC includes a DPE array 130 having any number of DPEs, and each DPE comprises specialized circuitry to connect an array of neural network units (NNU) (not illustrated). In one embodiment, the NNUs of the DPEs comprise non-programmable logic i.e., are hardened specialized processing elements, and comprise hardware elements including, but not limited to, program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), and multiply accumulators (MAC). The detailed circuitry within the memory 140 can include any type of volatile or nonvolatile memory. In one embodiment, the memory 140 includes an array of memory elements").
a range of bitwidth supported for integer operations or floating-point operations;
an operating frequency of the NPU;
a number of the plurality of PEs (Col 3-4, ln 43-2, "In one embodiment, the programmable IC 120 includes programmable logic 122, a DPE array 130 having multiple DPEs 1321-132N, memory 140, and control logic 150. In one embodiment, the control logic 150 configures the programmable logic 122, and the programmable logic uses run-time parameters from the control logic 150 to control the DPE array 130. For example, using a received bitstream that contains configuration data, control logic 150 can configure the programmable logic 122 (which can include a plurality of configurable logic blocks) with run-time parameters, and the programmable logic 122 controls the DPE array 130 that has any number of DPEs (132 1-132 N). For example, the programmable logic 122 can include look up tables, function generators, registers, multiplexers, and the like.").
capability of special function unit circuits in the NPU (Col 5, ln 60-67, "Below is a table providing a list of OpCodes supported by the compiler 114. These opcodes correspond to various operations performed by layers of the DNN. In some embodiments, the opcodes correspond to operations resulting from an optimization by the hardware independent optimizer 212 or the hardware dependent optimizer 214. In some embodiments, the opcodes correspond to software operations" Please see Table 1).
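EN: for illustration only, the configuration attributes enumerated in claim 3 can be collected into a single record that a compiler might consult during hardware-dependent optimization. All field names and values below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class NPUConfig:
    """Hypothetical NPU configuration record covering the claim-3 attributes."""
    internal_memory_bytes: int    # internal memory size of the NPU
    memory_bus_bitwidth: int      # bitwidth of read/write operations
    memory_type: str              # type/structure/speed of the memory circuit
    number_formats: tuple         # number formats supported by the NPU
    integer_bitwidths: range      # supported integer-operation bitwidths
    clock_mhz: int                # operating frequency of the NPU
    num_pes: int                  # number of processing elements
    special_functions: tuple      # capabilities of special function units

cfg = NPUConfig(
    internal_memory_bytes=64 * 2**20,
    memory_bus_bitwidth=128,
    memory_type="DDR",
    number_formats=("int8", "fp16"),
    integer_bitwidths=range(4, 33),
    clock_mhz=800,
    num_pes=256,
    special_functions=("relu", "pool"),
)
```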
Regarding claim 4, Vemuri teaches wherein the instructions causing the CPU to compile the first neural network model into the first machine code cause the CPU to: convert the first neural network model into a framework-independent model (Col 4, ln 48-52, "The front-end parser 202 takes the network description in framework specific formats and generates a framework independent network graph"; see also Fig. 2, and Fig. 4A at Col 7, ln 39-43, “FIG. 4A illustrates example operations performed by a compiler 114 and a HAL 116 to apply a DNN such as the network graph 300 of FIG. 3 to a programmable IC 120 for execution, according to embodiments of the present disclosure”).
convert the framework-independent model into a hardware-independent graph (Fig. 4B, Col 9, Ln 1-22, “After allocating buffer handles for the neural network, at block 414 the compiler 114 optimizes the network graph using hardware-independent optimizations and hardware dependent optimizations. Optimization of the network graph can improve the efficiency of data passing through the neural network. Table 1 provided some types of optimizations performed by the compiler 114 to the generated network graph. FIGS. 5-12 also illustrate various example optimizations performed by the compiler 114 on the generated network graph. In some embodiments, the compiler 114 performs hardware independent optimizations on the network graph before performing hardware dependent optimizations. In such embodiments, if the compiler 114 performs hardware dependent optimizations before hardware independent optimizations, the compiler 114 may have to replay some hardware dependent optimizations in order to achieve the same resulting network graph or the optimized network graph may produce different output data compared to output data from a network graph optimized using hardware independent optimizations first. In some embodiments, the compiler 114 can perform any number of optimizations on the network graph to increase efficiency.” and Col 4, ln 52-54, "The backend 210 refines this framework-independent and hardware-agnostic network graph into a hardware-dependent graph" Col 5, ln 17-19, “The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214”).
convert the hardware-independent model into a hardware-dependent code (Col 9, Ln 1-22, “After allocating buffer handles for the neural network, at block 414 the compiler 114 optimizes the network graph using hardware-independent optimizations and hardware dependent optimizations. Optimization of the network graph can improve the efficiency of data passing through the neural network. Table 1 provided some types of optimizations performed by the compiler 114 to the generated network graph. FIGS. 5-12 also illustrate various example optimizations performed by the compiler 114 on the generated network graph. In some embodiments, the compiler 114 performs hardware independent optimizations on the network graph before performing hardware dependent optimizations. In such embodiments, if the compiler 114 performs hardware dependent optimizations before hardware independent optimizations, the compiler 114 may have to replay some hardware dependent optimizations in order to achieve the same resulting network graph or the optimized network graph may produce different output data compared to output data from a network graph optimized using hardware independent optimizations first. In some embodiments, the compiler 114 can perform any number of optimizations on the network graph to increase efficiency.” Col 4, ln 52-54, "The backend 210 refines this framework-independent and hardware-agnostic network graph into a hardware-dependent graph" Col 5, ln 17-19, “The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214” Col 5, ln 9-25, "In one embodiment, the backend 210 of the compiler 114 works on the network graph structure (generated by the parser 202) and performs operations on the network graph structure to generate an execution sequence vector. The execution sequence vector comprises a sequential queue of the layers of the network graph structure. Details about the execution sequence vector are provided below. 
The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214, a job queue scheduler 216 and an IO memory optimizer 218. Each of these components in the backend 210 works to perform operations on the network graph structure and generate an execution sequence vector to pass onto the HAL 116" and Col 4, ln 33-47, “FIG. 2 is a block diagram 200 of the compiler 114 and the HAL 116 to be used with a hardware-software interface 118 to communicate with the programmable IC 120. As mentioned with FIG. 1, the host computer 102 includes a compiler 114 and a HAL 116 for use with a DNN inference accelerator (also referred herein as a programmable IC). In one embodiment, the compiler 114 exports an application program interface (API) to the host computer 102. This exported API takes in a network description of a DNN in various framework specific formats (e.g., deploy.prototxt of the caffe framework) and generates an intermediate hardware-dependent representation of the network. The HAL 116 takes this intermediate representation of the network and programs the hardware for execution using the hardware-software interface 118. In one embodiment, the compiler 114 has two components: the front-end parser 202 and the backend 210. The front-end parser 202 takes the network description in framework specific formats and generates a framework independent network graph. The backend 210 refines this framework-independent and hardware-agnostic network graph into a hardware-dependent graph. In one embodiment, the HAL 116 takes the hardware-dependent graph from the compiler 114 and sets up the hardware runtime parameters of the programmable IC 120, allocates the buffers needed by the programmable IC hardware for processing the network, and schedules the nodes in the hardware-dependent graph into respective hardware execution queues. The command scheduler 226 of the HAL 116 then invokes the programmable IC through the hardware-software interface 118”).
convert the hardware-dependent code into the first machine code (Col 8, ln 15-54, “At 408, operations 400 continue with the HAL 116 configuring the IC based on the execution sequence vector. In some embodiments, configuring the IC based on the execution sequence vector includes the HAL 116 calibrating a plurality of hardware runtime parameters of the programmable IC based on the execution sequence vector. Once the compiler 114 generates the execution sequence vector, the compiler 114 passes the execution sequence vector to the HAL 116 for further processing. In some embodiment, once the HAL 116 receives the execution sequence vector, the HAL 116 begins to setup the hardware components of the programmable IC 120, and in some embodiments, setup includes calibrating the hardware runtime parameters. In some embodiments, the HAL 116 allocates buffers on the programmable IC 120 required by both hardware components and software components based on the execution sequence vector. In such embodiments, the execution sequence vector also includes information about buffer nodes of the network graph. In one embodiment, the HAL 116 keeps track of a list of pointers for allocated buffers corresponding to the buffer nodes of the network graph. In some embodiments, configuring the IC based on the execution sequence vector includes the HAL 116 scheduling the plurality of commands of the execution sequence vector for a plurality of components of the programmable IC. Because the commands in the execution sequence vector correspond to the operations of the layer nodes of the network graph, the HAL 116 schedules when to transmit the commands of the execution sequence vector to the programmable IC 120. When the programmable IC 120 receives the commands from the HAL 116 via the hardware-software interface 118, the programmable IC begins executing the operation corresponding to the command. The operation is based on the layer nodes of the network graph. 
In one embodiment, the plurality of components of the programmable IC 120 include the programmable logic 122 with the plurality of controllers, the DPE array 130, the memory 140, and the control logic 150. Further details about the HAL 116 scheduling the commands of the execution sequence vector are provided with respect to FIG. 18-20").
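EN: for illustration only, the four-stage lowering recited in claim 4 (framework model, framework-independent model, hardware-independent graph, hardware-dependent code, machine code) can be sketched as a chain of small translation passes. The stage implementations, model structure, and opcode values are hypothetical:

```python
def to_framework_independent(model):
    """Strip framework-specific metadata; keep only op types and edges."""
    return [(op["type"], op.get("inputs", [])) for op in model["ops"]]

def to_hw_independent_graph(ops):
    """Represent the ops as a directed acyclic graph (adjacency map)."""
    return {i: {"op": op, "inputs": ins} for i, (op, ins) in enumerate(ops)}

def to_hw_dependent_code(graph, opcode_table):
    """Lower graph nodes to hardware-dependent opcodes (node order here
    stands in for a topological ordering)."""
    return [opcode_table[node["op"]] for node in graph.values()]

def to_machine_code(hw_code):
    """Pack opcodes into bytes as a stand-in for the final encoding."""
    return bytes(hw_code)

OPCODES = {"conv": 1, "relu": 2}
model = {"framework": "caffe",
         "ops": [{"type": "conv"}, {"type": "relu", "inputs": [0]}]}
mc = to_machine_code(
    to_hw_dependent_code(
        to_hw_independent_graph(to_framework_independent(model)), OPCODES))
```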
Regarding claim 5, Vemuri teaches wherein the instructions to compile the first neural network cause the CPU to perform at least one of optimization or verification of the machine code (Col 5, ln 26-59, "To improve the efficiency of the DNN, the compiler 114 can perform several layers of optimizations and layer fusion operations onto the network graph structure. Consequently, the network graph structure has updated layers and buffers and is structured with the HAL 116. In one embodiment, the hardware independent optimizer 212 performs optimizations (also referred herein as optimization rules) of the DNN that do not require or impact the hardware aspects of the DNN. Some of these optimizations performed by the hardware independent optimizer 212 include: parallel 1×1 convolutions fuse optimizations, software fuse optimizations, dropout optimizations, reshape optimizations, flatten optimizations, concatenation layer optimizations, custom layer optimizations, and prior box optimizations. Further, in one embodiment, the hardware dependent optimizer 214 performs optimizations of the DNN that do use or impact the hardware aspects of the DNN. Some of these optimizations performed by the hardware dependent optimizer 214 include: convolution+ReLU optimizations, hardware fusion optimization, CReLU optimizations, ElementWise (sometimes shortened to “Eltwise”) Addition optimizations, ReLU optimizations, 3D separable convolution optimizations, and deconvolution optimizations. In one embodiment, the optimizations performed by the hardware independent optimizer 212 include removal of layers used in the training phase of the DNN. 
With training layer removal optimization, the backend 210 of the compiler 114, specifically the hardware independent optimizer 212, identifies all the layers in the network graph which are not used during the interference phase and removes them" and Col 17, ln 32-45, "In one embodiment, the buffer handle is a string notation to represent input and output buffers of each layer and indicates blocks of memory dedicated to corresponding buffers. The buffer manager 224 allocates a continuous block of memory for each unique buffer handle, and maintains a dictionary of buffer handles and the corresponding pointers to the contiguous block of memory. The buffer manager 224 parses through the execution sequence vector, and for each layer, checks the input and output handle occurrence in the dictionary. If the dictionary returns a miss on the check, the buffer manager 224 allocates a contiguous block of memory for the handle and registers the address of the block allocated along the handle with the dictionary").
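EN: For the applicant's convenience, the buffer-handle mechanism quoted above (Col 17, ln 32-45) — check the dictionary for each layer's input/output handle, allocate a contiguous block on a miss, and register the block with the dictionary — can be sketched as follows. This is an illustrative sketch only; the function and variable names do not appear in Vemuri, and offsets stand in for the reference's pointers.

```python
# Illustrative sketch of the cited buffer-handle dictionary:
# each unique handle maps to one contiguous block; a dictionary
# miss triggers allocation and registration of the block.

def allocate_buffers(execution_sequence, block_size=1024):
    """Walk the layer sequence; allocate one contiguous block
    per unique input/output buffer handle."""
    handle_table = {}          # handle -> (offset, size); stands in for pointers
    next_offset = 0
    for layer in execution_sequence:
        for handle in (layer["input"], layer["output"]):
            if handle not in handle_table:      # dictionary miss
                handle_table[handle] = (next_offset, block_size)
                next_offset += block_size
    return handle_table

seq = [{"input": "data", "output": "conv1"},
       {"input": "conv1", "output": "relu1"}]
table = allocate_buffers(seq)
# "conv1" is reused as the second layer's input, so only the
# three unique handles receive blocks.
```

Note that reusing a registered handle (here "conv1") is what lets buffers be shared between layers rather than reallocated.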
Regarding claim 6, Vemuri teaches wherein the instructions to optimize the machine code cause the CPU to perform at least one of: perform pruning (Col 5, ln 53-59, "In one embodiment, the optimizations performed by the hardware independent optimizer 212 include removal of layers used in the training phase of the DNN. With training layer removal optimization, the backend 210 of the compiler 114, specifically the hardware independent optimizer 212, identifies all the layers in the network graph which are not used during the interference phase and removes them").
perform quantization (Col 7, ln 11-19, "The programmable IC setup component 222 converts the weights and parameters of the DNN to fixed point format and loads them into the constant buffers managed by the buffer manager 224 using the pointers and offsets in the execution sequence vector. In one embodiment, the programmable IC setup component 222 uses a prescribed layer, optimized for hardware performance, for the data in the constant buffers managed by the buffer manager 224").
perform retraining,
perform compression (Col 10, ln 45-60, "One type of optimization performed by the hardware independent optimizer 212 is a parallel [1×1] convolution fusion optimization, which is illustrated in FIGS. 5A and 5B. With a parallel convolution fusion optimization, the backend 210 of the compiler 114, specifically the hardware independent optimizer 212, identifies network topology regions of the network graphs where multiple convolution layers take the same input buffer and write to different output buffers and merge these convolution layers into one layer. The merged convolution layer attaches to an output buffer with a size enough to hold the output of all the convolution layers merged. Also, the hardware independent optimizer 212 of the backend 210 registers the offsets of each convolution layer's output into the new output buffer for processing of downstream layers in the network graph" and Col 11, ln 12-28, "FIGS. 6A-B depict another example optimization of a network graph performed by the compiler 114 to generate the execution sequence vector, according to embodiments of the present disclosure. In one embodiment, FIGS. 6A and 6B illustrate an example pre-execute fusion optimization. With a pre-execute fusion optimization, the backend 210 of the compiler 114, specifically the hardware independent optimizer 212, looks up for a pattern of convolution layers followed by batch-norm layers followed by scale layers, and fuses the three layers into one convolution layer, by merging the parameters and weights of the input convolution, batch-norm, and scale layers. This optimization gets rids of the buffers connecting the layers, and therefore reduces the buffer requirements to execute the network. In some embodiments, the pre-execute fusion optimization applies to convolution layers, batch-norm layers, and scale layers of any order, combination, or arrangement").
perform an artificial intelligence (AI)-based optimization algorithm,
or perform knowledge distillation.
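EN: For the applicant's convenience, the pre-execute fusion cited above (Col 11, ln 12-28) — merging the parameters and weights of convolution, batch-norm, and scale layers into one convolution — can be sketched as the standard parameter fold below. This is an illustrative sketch only; the names and shapes are assumptions and do not appear in Vemuri, and a 1×1 convolution on a single pixel is used so the fold reduces to per-channel arithmetic.

```python
import math

# Illustrative fold of batch-norm + scale parameters into a
# convolution's weights and bias. Because convolution is linear,
# conv -> batch-norm -> scale on any input equals the single
# fused convolution, removing the intermediate buffers.

def fuse_conv_bn_scale(W, b, mean, var, gamma, beta, eps=1e-5):
    """W: per-output-channel weight rows; b: per-channel bias.
    mean/var are batch-norm statistics; gamma/beta the scale
    layer's parameters, all per output channel."""
    W_fused, b_fused = [], []
    for row, bi, m, v, g, be in zip(W, b, mean, var, gamma, beta):
        factor = g / math.sqrt(v + eps)          # BN + scale, folded
        W_fused.append([w * factor for w in row])
        b_fused.append((bi - m) * factor + be)
    return W_fused, b_fused
```

The fold is exact: for each output channel, gamma*((W·x + b - mean)/sqrt(var + eps)) + beta equals (W*factor)·x + ((b - mean)*factor + beta).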
Regarding claim 7, Vemuri teaches wherein the instructions to compile the first neural network cause the CPU to analyze parameter information of each layer of the first neural network model (Col 7, ln 11-15, "The programmable IC setup component 222 converts the weights and parameters of the DNN to fixed point format and loads them into the constant buffers managed by the buffer manager 224 using the pointers and offsets in the execution sequence vector" and Col 11, ln 12-23, "FIGS. 6A and 6B illustrate an example pre-execute fusion optimization. With a pre-execute fusion optimization, the backend 210 of the compiler 114, specifically the hardware independent optimizer 212, looks up for a pattern of convolution layers followed by batch-norm layers followed by scale layers, and fuses the three layers into one convolution layer, by merging the parameters and weights of the input convolution, batch-norm, and scale layers." Col 16, ln 16-39, "The constant buffers are read-only buffers for the programmable IC 120 and are used for trained parameters (e.g., weights for layers in the neural network to process input data). The I/O buffers are read-write buffers for the programmable IC 120 to store the intermediate outputs between layers/nodes and accordingly can be reused between layers/nodes of the neural network … For the constant buffers, each layer of the network graph has its own set of constants data (e.g., weights, biases) and the buffer manager 224 loads the constant data into the constant buffers before invoking the programmable IC for inference. The buffer manager 224 allocates a pool of constant buffers and generates the layer offsets into these constant buffers. The hardware-setup block, described in further detail below, uses these layer offsets to populate the constant buffers with the constants data. 
The buffer manager 224 pre-allocates a pool of fixed-size buffers (e.g., 64 MB) based on the memory footprint of the constants (e.g., parameters, biases) used by the network. Each buffer is a contiguous block of memory and can host constants of multiple layers, but the constant buffers do not permit the constants data to straddle across multiple buffers").
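EN: For the applicant's convenience, the constant-buffer placement quoted above (Col 16) — a pool of fixed-size contiguous buffers in which each layer's constants may share a buffer but may not straddle two buffers — can be sketched as follows. This is an illustrative sketch only; the names are not from Vemuri, and small unit sizes stand in for the reference's 64 MB buffers.

```python
# Illustrative sketch of the cited fixed-size constant-buffer pool:
# each layer's constants occupy a contiguous span inside exactly one
# buffer; a span that would straddle starts the next buffer instead.

BUFFER_SIZE = 64  # stand-in for the 64 MB buffers in the reference

def place_constants(layer_sizes, buffer_size=BUFFER_SIZE):
    """Return per-layer (buffer_index, offset) placements."""
    placements, buf_idx, offset = [], 0, 0
    for size in layer_sizes:
        if size > buffer_size:
            raise ValueError("layer constants exceed one buffer")
        if offset + size > buffer_size:   # would straddle: new buffer
            buf_idx, offset = buf_idx + 1, 0
        placements.append((buf_idx, offset))
        offset += size
    return placements

# Layers of 40, 30, 20 units: the 30-unit constants would straddle
# the first 64-unit buffer, so they start buffer 1 instead.
```

These (buffer, offset) placements correspond to the layer offsets the hardware-setup block uses to populate the constant buffers.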
Regarding claim 8, Vemuri teaches wherein the instructions to compile the first neural network cause the CPU to analyze sizes of weight parameters (Col 7, ln 11-15, "The programmable IC setup component 222 converts the weights and parameters of the DNN to fixed point format and loads them into the constant buffers managed by the buffer manager 224 using the pointers and offsets in the execution sequence vector." and Col 16, ln 11-29, "Of the components of the HAL 116, the buffer manager 224 handles both constant buffers and I/O buffers used for both hardware and software of the programmable IC 120. The buffer manager 224 allocates two kinds of buffers: constant buffers and I/O buffers. The constant buffers are read-only buffers for the programmable IC 120 and are used for trained parameters (e.g., weights for layers in the neural network to process input data). The I/O buffers are read-write buffers for the programmable IC 120 to store the intermediate outputs between layers/nodes and accordingly can be reused between layers/nodes of the neural network. The following discussion further describes the differences between constant buffers and the I/O buffers, especially as to the data organization of each type of buffer. For the constant buffers, each layer of the network graph has its own set of constants data (e.g., weights, biases) and the buffer manager 224 loads the constant data into the constant buffers before invoking the programmable IC for inference").
and feature map parameters of each layer in the first neural network model (Col 15, ln 36-42, "In one embodiment, the backend 210 of the compiler 114 further comprises a IO memory optimizer 218, and this IO memory optimizer 218 allocates a set of buffer handles along with the sizes, which can be used for storing I/O (also referred herein as activations) between layers while reusing the buffers between layers of the network graph." Col 4, ln 26-32, "In one embodiment, the memory 106 includes an array of memory elements. In one embodiment, the memory 106 stores input image data, such as input feature maps, and activation outputs from various and/or previous layers of the DNN. Details about the compiler 114, the HAL 116, and the hardware-software interface 118 are provided below with regards with FIG. 2." Col 15, ln 42-54, "In one embodiment, a buffer handle is a string notation to represent input and output buffers of each layer and indicates blocks of memory dedicated to corresponding buffers. The backend 210 loads the buffer handles and corresponding sizes onto the execution sequence vector from the job queue scheduler 216. In one embodiment, the backend 210 may make design choices such as: (1) the backend 210 can initialize the buffer sizes of all the buffer handles to the size of the largest buffer for IO activations, and can attach all the buffer handles to the same size; and (2) the backend 210 cannot reuse buffer handles attached to layers optimized for software execution (e.g., layers that are not hardware-accelerated)." EN: The instant application’s specification defines a feature map as an activation parameter and as node data in paragraphs [0042] and [0068], respectively).
Regarding claim 9, Vemuri teaches wherein the instructions to compile the first neural network cause the CPU to analyze connectivity between layers in the first neural network model (Col 4, ln 63 - Col 5, ln 51, "In one embodiment, the parser 202 provides an interface to various deep learning network frameworks 206 with an API, like an API exported by the compiler 114. The API takes inputs in the same format as the deep learning frameworks do. Accordingly, the parser 202 takes models trained using various deep learning network frameworks 206 like caffe or TensorFlow and converts them to a network graph structure. In one embodiment, the network graph structure is an XGraph. In one embodiment, the graph structure converted by the parser 202 is a directed acyclic graph with heterogeneous nodes which encode information about various network layers and their connectivity. An example of a directed acyclic graph is presented in FIG. 3. In one embodiment, the backend 210 of the compiler 114 works on the network graph structure (generated by the parser 202) and performs operations on the network graph structure to generate an execution sequence vector. The execution sequence vector comprises a sequential queue of the layers of the network graph structure. Details about the execution sequence vector are provided below. The backend 210 comprises a hardware independent optimizer 212, a hardware dependent optimizer 214, a job queue scheduler 216 and an IO memory optimizer 218. Each of these components in the backend 210 works to perform operations on the network graph structure and generate an execution sequence vector to pass onto the HAL 116. To improve the efficiency of the DNN, the compiler 114 can perform several layers of optimizations and layer fusion operations onto the network graph structure. Consequently, the network graph structure has updated layers and buffers and is structured with the HAL 116. 
In one embodiment, the hardware independent optimizer 212 performs optimizations (also referred herein as optimization rules) of the DNN that do not require or impact the hardware aspects of the DNN. Some of these optimizations performed by the hardware independent optimizer 212 include: parallel 1×1 convolutions fuse optimizations, software fuse optimizations, dropout optimizations, reshape optimizations, flatten optimizations, concatenation layer optimizations, custom layer optimizations, and prior box optimizations. Further, in one embodiment, the hardware dependent optimizer 214 performs optimizations of the DNN that do use or impact the hardware aspects of the DNN. Some of these optimizations performed by the hardware dependent optimizer 214 include: convolution+ReLU optimizations, hardware fusion optimization, CReLU optimizations, ElementWise (sometimes shortened to “Eltwise”) Addition optimizations, ReLU optimizations, 3D separable convolution optimizations, and deconvolution optimizations").
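EN: For the applicant's convenience, the relationship quoted above between the directed acyclic network graph and the execution sequence vector (a sequential queue of the graph's layers) can be sketched with a topological ordering. This is an illustrative sketch only — a topological sort is one standard way to linearize a DAG so that each layer follows its inputs; Vemuri's job queue scheduler 216 may order layers differently, and the graph and names below are assumptions.

```python
from collections import deque

# Illustrative linearization of a directed acyclic network graph
# into a sequential execution vector: Kahn's algorithm emits each
# node only after all of its predecessors have been emitted.

def execution_sequence(graph):
    """graph: {node: [successor, ...]} for a DAG; returns one
    valid sequential ordering of the layer nodes."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in graph.get(n, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

g = {"data": ["conv1"], "conv1": ["relu1", "conv2"],
     "relu1": ["concat"], "conv2": ["concat"], "concat": []}
seq = execution_sequence(g)
```

Any ordering this produces respects the layer connectivity encoded in the graph, which is the property the execution sequence vector relies on.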
Regarding claim 10, Vemuri teaches a non-transitory computer readable storage medium storing instructions thereon, the instructions when executed by a central processing unit (CPU) cause the CPU to: (Col 1, ln 49-51, “Aspects of the present disclosure also provide apparatus, methods, processing systems, and computer readable mediums for performing the operations described above”).
The remaining limitations are similar to claim 1 and are rejected under the same rationale.
Claims 11-18 are medium claims reciting limitations similar to claims 1, 4, 2, and 5-9, respectively, and are rejected under the same rationale.
Claims 19-22 are method claims reciting limitations similar to claims 1, 2, 5, and 4, respectively, and are rejected under the same rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chen et al. (TVM: An Automated End-to-End Optimizing Compiler for Deep Learning): discloses a similar framework of abstraction and layer-by-layer breakdown to the hardware level, providing a hardware-independent API for various machine learning frameworks.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMIR DARWISH whose telephone number is (571)272-4779. The examiner can normally be reached Monday through Thursday, 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emerson Puente can be reached on 571-272-3652. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/A.E.D./Examiner, Art Unit 2187
/LEWIS A BULLOCK JR/Supervisory Patent Examiner, Art Unit 2199