DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/10/2025 has been entered.
Response to Arguments
Applicant's arguments filed 11/10/2025 have been fully considered and they are partially persuasive.
Regarding applicant's remarks directed to the rejection of claims under 35 U.S.C. § 102:
Alleged lack of teaching of a process of configuring a portion of the FPGA as an ML processor while configuring another portion of the FPGA as ML accelerators for offloading tasks from the ML processor
In Remarks p. 9, Applicant contends:
“For example, Gschwend has never disclosed a process of configuring a portion of FPGA as ML processor while configuring another portion of FPGA as ML accelerators for offloading tasks from the ML processor.”
The relevant claim limitations appear to be “configuring a portion of the FPGA to be a machine learning ("ML") processor capable of processing computational operations offloaded from a microcontroller ("MCU") in accordance with information stored in the second NVM; and configuring another portion of the FPGA to be one or more ML accelerators capable of performing one or more dedicated neural network operations for offloading tasks from the ML processor in accordance with information stored in the second NVM” in claim 14.
As noted in the previous Office Action, Gschwend teaches (emphasis added):
Gschwend teaches configuring a portion of the FPGA to be a machine learning processor capable of processing computational operations
(Gschwend, pg. 33 4.1.3, "This project focuses on the proof-of-concept implementation of an FPGA-accelerated embedded CNN. First and foremost, the challenge in this chapter consists of fitting a complete CNN for image classification on ImageNet onto the low-power Zynq XC-7Z045 with decent performance")
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [configuring a portion of the FPGA ie processing elements to be a machine learning processor capable of processing computational operations ie dot-products/intra-kernel multiplications]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
[Image omitted: media_image1.png]
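For illustration only (not part of Gschwend or the claims), the following C++ sketch shows the kind of loop-level parallelism described in the passage above, assuming a Vivado HLS style flow; the function name, array names, and sizes are hypothetical. Unrolling the output-channel loop maps independent dot-products onto parallel processing elements in the FPGA fabric.

// Illustrative sketch only: hypothetical names and sizes, Vivado HLS assumed.
// Unrolling the output-channel loop maps independent dot-products onto
// parallel processing elements, per the parallelization opportunities above.
constexpr int CI = 16;  // input channels (hypothetical)
constexpr int CO = 16;  // output channels (hypothetical)
constexpr int K  = 3;   // kernel size

void conv_pixel(const float in[CI][K][K],
                const float weights[CO][CI][K][K],
                const float bias[CO],
                float out[CO]) {
  for (int co = 0; co < CO; ++co) {
#pragma HLS UNROLL  // independent output channels -> parallel processing elements
    float acc = bias[co];
    for (int ci = 0; ci < CI; ++ci)
      for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
          acc += in[ci][ky][kx] * weights[co][ci][ky][kx];  // intra-kernel multiplications
    out[co] = acc;
  }
}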
Gschwend teaches offloaded from a microcontroller ("MCU") in accordance with information stored in the second NVM:
(Gschwend, pg. 31 4.1.1, pg. 32 Fig. 4.1, “The Zynqbox has been designed by Supercomputing Systems AG for the evaluation of high-performance image processing algorithms, especially in automotive settings. The embedded platform is based on the Xilinx Zynq-7000 All Programmable System-on-Chip (SoC), which combines a dual-core ARM Cortex-A9 processor [offloaded from a microcontroller; wherein Examiner interprets offloaded as the FPGA is communicated to perform the computational operations instead of the microcontroller and the computational operations are performed with the weights obtained from the Wcache (second NVM)] with programmable FPGA fabric in a single device. The Zynqbox includes a Xilinx Zynq XC-7Z045 SoC, 1 GB DDR3 memory for the ARM processor, 768 MB independent DDR3 memory for the programmable logic, and plenty of connection options (Serial Camera Interfaces, USB, CAN, Gigabit Ethernet). The Kintex-7 FPGA fabric of the SoC features 350k logic cells, 218k LUTs, 2180 kB Block RAM and 900 DSP slices. The CPU runs at up to 1 GHz, boots a standard Linux operating system and is connected to the programmable logic via high-performance AXI4 ports for data exchange and control. Figure 4.1 shows a schematic overview of the Zynqbox platform [92].”)
Gschwend teaches the memory type can be nonvolatile memory
(Gschwend, pg. 51, "Memory Type and Style Memories in the FPGA hardware are described as arrays in the high-level source code. Only statically declared arrays with a fixed size are supported for synthesis. The mapping between the C/C++ arrays and the underlying hardware can be influenced with a number of compiler directives. The memory type (RAM, ROM [nonvolatile], FIFO) and implementation style (Block RAM, Distributed RAM, Shift Register) can be chosen by using the previously introduced #pragma HLS RESOURCE directive.")
[Image omitted: media_image2.png]
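For illustration only (not part of the cited reference), a minimal sketch of how the #pragma HLS RESOURCE directive quoted above might map a coefficient array onto a read-only Block RAM, assuming a Vivado HLS flow; the function and array names are hypothetical.

// Illustrative sketch only: hypothetical names, Vivado HLS assumed.
// The RESOURCE directive selects the memory type (here a single-port ROM)
// and implementation style (Block RAM) for a statically declared array.
float lookup_weight(int idx) {
  static const float weight_rom[1024] = { /* trained coefficients */ };
#pragma HLS RESOURCE variable=weight_rom core=ROM_1P_BRAM
  return weight_rom[idx];
}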
Gschwend teaches and configuring another portion of the FPGA to be one or more ML accelerators capable of performing one or more dedicated neural network operations for offloading tasks from the ML processor in accordance with information stored in the second NVM.
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [configuring another portion of the FPGA to be one or more ML accelerators ie different processing elements capable of performing one or more dedicated neural network operations for offloading tasks from the ML processor in accordance with information stored in the second NVM; wherein the parallelization of the different processing elements is another portion of the FPGA and offloading computations to other processing elements and parallelizing the executions using different processing elements is interpreted to be accelerating the executions]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
After careful consideration, the argument is found unpersuasive. The processing elements of Gschwend are on the FPGA and are therefore portions of the FPGA, and the computations are offloaded to be performed on the FPGA. The same manner of offloading between different portions of the FPGA (i.e., from the ML processor to the "another portion of the FPGA") occurs between processing elements, since different processing elements are relied upon to compute different computations of the layer (which, in this case, can be interpreted as parallelization). Examiner emphasizes that the claim recites that the ML processor (i.e., a portion of the FPGA) offloads tasks to the "another portion of the FPGA." Thus, Examiner applies the teachings of Gschwend in the same context of offloading as recited in the claim.
Lastly, the arguments are directed to newly amended limitations that were not previously examined. Therefore, applicant's arguments are rendered moot. The Examiner refers to the rejection under 35 U.S.C. § 103 in the current Office action for further details.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 14-34 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim 14 and analogous claim 27 recite "obtaining a trained model file for a machine learning operation from a model information block situated in a field programmable gate arrays ("FPGA")". The specification and figure 3 of the instant application disclose "[0047] FIG. 3 is a block diagram 300 illustrating a software architecture of implementing neural network in FPGA in accordance with one embodiment of the present invention. Diagram 300 includes a block of Tensorflow™ Flatbuffers™ file or flatbuffers file 302, model information block 306, and coefficients block 308. In one aspect, FPGA can be programmed to be COS implementing neural network operations or ML processing in accordance with the information in flatbuffers file 302. It should be noted that the underlying concept of the exemplary embodiment(s) of the present invention would not change if one or more blocks (circuit or elements) were added to or removed from diagram 300."
[Image omitted: media_image3.png]
As disclosed in the specification and shown in figure 3, the model information block (306) does not store the trained model file itself; the model information block merely stores information parsed from the trained model file (see 302 in figure 3). Thus, the trained model file (Tensorflow Flatbuffers File 302) cannot be obtained from the model information block (Model Information 306).
Claims 15-19 are further rejected by virtue of their dependency from claim 14.
Claims 28-34 are further rejected by virtue of their dependency from claim 27.
Claim 20 recites “wherein each of the ML accelerators includes an arbitrator configured to read data of previous layer.” The specification of the instant application discloses ([0041], “…PSRAM block 220 includes an arbitrator and PSRAM. PSRAM, in one example, includes a DRAM macro block with an on-chip refresh circuit.”) and ([0043], “PSRAM block 220 reads data from previous layer as input data and writes data of current layer as output.”) However, the specification does not explicitly disclose that the arbitrator is configured to read data of the previous layer.
Claims 21-26 are further rejected by virtue of their dependency from claim 20.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 20-24 and 26 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gschwend, David, "Zynqnet: An fpga-accelerated embedded convolutional neural network" ("Gschwend").
In regards to claim 20,
Gschwend teaches A semiconductor device able to be selectively programmed for parallel processing logic operations, comprising:
an input memory for buffering
(Gschwend, pg. 62 5.2.1, “The final ZynqNet FPGA Accelerator contains NPE = 16 processing units, which concurrently operate on the calculation of different output feature maps. Each processing unit contains a fully pipelined 3×3 multiply-accumulate unit with 9 separate floating-point multipliers and a subsequent adder tree for the summation of their products. This results in a total of 144 floating-point multipliers and 128 floating-point adders, which constitute the computational core of the accelerator. The processing units are fed from on-chip caches. In total, up to 1.7 MB parameters (442 000 single-precision floating-point weights) and 133 kB image data are buffered in the on-chip Block RAM [an input memory for buffering]. When synthesized for the Zynq XC-7Z045 FPGA, this configuration results in the resource requirements and device utilization figures shown in table 5.2. The fact that more than 90 % of all Block RAM resources and more than 80 % of the DSP slices are utilized highlights the good fit of the architecture to the given FPGA and is a result from the co-optimization of both the FPGA architecture and the ZynqNet CNN.”)
Gschwend teaches input signals from an external component;
(Gschwend, pg. 30, “These settings add a substantial amount of variation to the images, and were chosen to approximately emulate the reduced quality of webcam images, preparing the network for actual input images during demonstrations [input signals from an external component; wherein the input signals ie input images are provided via the CAM (camera) in fig. 4.1]. In addition to the increased amount of images in the augmented dataset, the final trainings were run for 60 epochs instead of 30 epochs, effectively showing the network each image 60 times in 6 variations. This resulted in another gain of 3.1 % accuracy”)
[Image omitted: media_image4.png]
Gschwend teaches a microcontroller ("MCU") configured to provide a stream of pre-processed data in accordance with the input signals;
(Gschwend, “The Zynqbox has been designed by Supercomputing Systems AG for the evaluation of high-performance image processing algorithms, especially in automotive settings. The embedded platform is based on the Xilinx Zynq-7000 All Programmable System-on-Chip (SoC), which combines a dual-core ARM Cortex-A9 processor [a microcontroller ("MCU") configured to] with programmable FPGA fabric in a single device. The Zynqbox includes a Xilinx Zynq XC-7Z045 SoC, 1 GB DDR3 memory for the ARM processor, 768 MB independent DDR3 memory for the programmable logic, and plenty of connection options (Serial Camera Interfaces [provide a stream of pre-processed data in accordance with the input signals; wherein a stream of pre-processed data in accordance with the input signals is the stream of input images provided by the camera], USB, CAN, Gigabit Ethernet). The Kintex-7 FPGA fabric of the SoC features 350k logic cells, 218k LUTs, 2180 kB Block RAM and 900 DSP slices. The CPU runs at up to 1 GHz, boots a standard Linux operating system and is connected to the programmable logic via high-performance AXI4 ports for data exchange and control. Figure 4.1 shows a schematic overview of the Zynqbox platform [92].”)
Gschwend teaches and a first portion of configurable logic blocks ("LBs") of a field programmable gate arrays ("FPGA"), coupled to the MCU, configured to be programmed to behave as a machine learning processor
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [a first portion of configurable logic blocks ("LBs") ie processing elements of a field programmable gate arrays ("FPGA"), coupled to the MCU (see Fig. 4.1), configured to be programmed to behave as a machine learning processor ie perform computations of the CNN]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
Gschwend teaches containing a memory controller, wherein the memory controller includes a local memory to cache a portion of coefficients
(Gschwend, pg. 84 fig. E.1 teaches the MemoryController coupled to the WCache, which stores the coefficients)
Gschwend teaches obtained from a dynamic random-access memory ("DRAM").
(Gschwend, pg. 41 Fig. 4.4 teaches loading weights from the DRAM to the WCache in the FPGA)
Gschwend teaches wherein each of the ML accelerators includes an arbitrator configured to read data of previous layer.
(Gschwend, Figure E.1 and figure 4.4 teach each ML accelerator (processing element) reading data (weights) for each layer)
[Images omitted: media_image5.png, media_image6.png]
In regards to claim 21,
Gschwend teaches The device of claim 20,
Gschwend teaches wherein the local memory is a static RAM ("SRAM") configured to store addresses for accessing DRAM.
(Gschwend, “The Zynqbox has been designed by Supercomputing Systems AG for the evaluation of high-performance image processing algorithms, especially in automotive settings. The embedded platform is based on the Xilinx Zynq-7000 All Programmable System-on-Chip (SoC), which combines a dual-core ARM Cortex-A9 processor with programmable FPGA fabric in a single device. The Zynqbox includes a Xilinx Zynq XC-7Z045 SoC, 1 GB DDR3 memory for the ARM processor, 768 MB independent DDR3 memory for the programmable logic, and plenty of connection options (Serial Camera Interfaces, USB, CAN, Gigabit Ethernet). The Kintex-7 FPGA fabric of the SoC features 350k logic cells, 218k LUTs, 2180 kB Block RAM [local memory ie WCache is a static RAM ("SRAM") ie Block RAM configured to store addresses for accessing DRAM] and 900 DSP slices. The CPU runs at up to 1 GHz, boots a standard Linux operating system and is connected to the programmable logic via high-performance AXI4 ports for data exchange and control. Figure 4.1 shows a schematic overview of the Zynqbox platform [92].”)
In regards to claim 22,
Gschwend teaches The device of claim 20,
Gschwend teaches wherein the local memory stores addresses for facilitating DRAM data burst mode.
(Gschwend, pg. 38 Section 4.2.4, “Need for on-chip Caching [local memory stores addresses] Looking at the pseudo-code in algorithm 1, it can be seen that multiple memory locations are read and written more than once. Accesses into main memory are expensive, both in terms of latency and energy. They cannot be completely avoided because the on-chip memory is not big enough to hold all CNN parameters as well as the intermediate feature maps. However, the goal is to minimize the number of reads and writes to the external memory by maximizing on-chip data reuse. Furthermore, all unavoidable memory operations should be linear in order to facilitate burst mode transfers [facilitating DRAM data burst mode]. Caches allow both the linearization of memory accesses, as well as the temporary storage of values that will be reused shortly after.”)
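For illustration only (not part of the cited reference), a minimal sketch of the on-chip caching idea described above, assuming a Vivado HLS flow; the port, array, and function names are hypothetical. A single linear copy from external memory can be inferred as a burst transfer, after which the cached coefficients are reused on-chip.

#include <cstring>

// Illustrative sketch only: hypothetical names and sizes, Vivado HLS assumed.
// One linear memcpy over the AXI master port is burst-friendly; the weights
// are then reused many times from the on-chip cache (Block RAM).
void load_weight_cache(const float *dram_weights, float wcache[4096], int n) {
#pragma HLS INTERFACE m_axi port=dram_weights offset=slave bundle=gmem
  std::memcpy(wcache, dram_weights, n * sizeof(float));  // linear access -> burst mode
}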
In regards to claim 23,
Gschwend teaches The device of claim 20,
Gschwend teaches wherein the memory controller is configured to reorder trained machine learning and neural network model coefficients in a sequential addressing order.
Examiner interprets sequential addressing to be utilizing the neural network coefficients in order of the neural network layers in light of the specification, (“[0076] Layers with multiple inputs and outputs also benefit from a memory controller which offers multiple address caching since these multiple inputs and outputs are located at different addresses themselves. Layers with sequential addressing (such as standard convolutions) include a larger number of inputs at a time for calculating one output. Layer computation tends to use the same address space repeatedly with different coefficients for calculating each layer output.”)
(Gschwend, pg. 41 fig. 4.4 teaches that, for each layer, the layer config is transmitted to the FPGA and set for all blocks, and the weights are loaded per layer)
In regards to claim 24,
Gschwend teaches The device of claim 20,
Gschwend teaches wherein the memory controller facilitates to temporally maintain read addresses for in-progress read operations.
(Gschwend, pg. 49, "[Image omitted: media_image7.png] Assuming that array has enough read ports, the loop could be partially unrolled to allow parallel read accesses. However, this is prevented because the function readArray can only be called sequentially. Adding the function instantiation directive creates four copies of readArray, and unrolling by a factor of 4 becomes possible: [Image omitted: media_image8.png] Dataflow #pragma HLS DATAFLOW activates the dataflow optimization used for task-level parallelism in Vivado HLS. By default, the compiler always tries to minimize latency and improve concurrency by scheduling operations as soon as possible. Data dependencies limit this type of parallelism: By default, a process A must finish all write accesses to an array before it is considered finished and a second process B can start consuming the data [temporally maintain read addresses for in-progress read operations; wherein the MemoryController manages the dataflow (restricting read/write access) in the FPGA].")
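For illustration only (not part of the cited reference), a minimal sketch of the dataflow optimization described above, assuming a Vivado HLS flow; the function and stream names are hypothetical. With #pragma HLS DATAFLOW, the consumer can start reading as soon as data is available in the connecting stream rather than waiting for the producer to finish all writes.

#include "hls_stream.h"

// Illustrative sketch only: hypothetical names, Vivado HLS assumed.
static void produce(hls::stream<float> &link, const float *in, int n) {
  for (int i = 0; i < n; ++i) link.write(in[i] * 2.0f);    // process A
}
static void consume(hls::stream<float> &link, float *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = link.read() + 1.0f; // process B
}
void top(const float *in, float *out, int n) {
#pragma HLS DATAFLOW                 // run A and B as overlapping tasks
  hls::stream<float> link("link");
#pragma HLS STREAM variable=link depth=64
  produce(link, in, n);
  consume(link, out, n);
}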
Claim 26 is rejected under the same rationale under 35 U.S.C. 102, as it is substantially similar to claim 20.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 14-16, 18-19, 27-31, and 33-34 are rejected under 35 U.S.C. 103 as being unpatentable over Gschwend, David, "Zynqnet: An fpga-accelerated embedded convolutional neural network" ("Gschwend") in view of Farabet, Clément, et al., "Hardware accelerated convolutional neural networks for synthetic vision systems," Proceedings of 2010 IEEE International Symposium on Circuits and Systems, IEEE, 2010 ("Farabet").
In regards to claim 14,
Gschwend teaches A method for processing data via a dedicated neural network processor, comprising:
obtaining a trained model file for a machine learning operation;
(Gschwend, pg. 8 paragraph 2, "Network Specification In order to fully describe a convolutional neural network, the following information is required:
1. a topological description of the network graph
2. a list of layers and their settings
3. the weights and biases in each layer
4. (optionally) a training protocol
In Caffe, the network description and the layer settings are stored in a JSON-like, human-readable text format called .prototxt. The weights are saved in binary .caffemodel files. The training protocol is also supplied in .prototxt format and includes settings such as the base learning rate, the learning rate schedule, the batch size, the optimization algorithm, as well as the random seeds for training initialization. These settings are only needed if the network is to be trained from scratch or finetuned, which refers to the process of adapting a trained network to a different dataset. For inference, where a fully trained network is utilized for forward-computation on new input data, the network description and the trained weights are sufficient [obtaining a trained model file for a machine learning operation].")
Gschwend teaches extracting model information from the trained model file
(Gschwend, pg. 84, Fig. E.1 teaches extracting model information, i.e., layer, num_weights, etc., from the given input (trained model file))
[Image omitted: media_image9.png]
Gschwend teaches and storing the model information in an onboard first [nonvolatile] memory [("NVM")] in the FPGA;
(Gschwend, pg. 84, Fig. E.1 teaches storing the model information, i.e., setLayer Config, in a first memory of the FPGA)
[Image omitted: media_image10.png]
Gschwend teaches parsing coefficients representing model layer weights and bias from the trained model and storing the coefficients in a second [N]VM in the FPGA; and
(Gschwend, pg. 84 Fig. E.1, pg. 40, "Weights Cache (WCache) [storing the coefficients in a second [N]VM in the FPGA] is the final and biggest cache. It holds all the ci × co filters of the current layer. The accelerator benefits massively from the low parameter count in ZynqNet CNN, which allows all weights to be kept on-chip. The maximum number of 384 × 112 × (3 × 3) = 387 072 weights plus 112 bias values [parsing coefficients representing model layer weights and bias from the trained model; wherein the weights and bias are obtained from the setLayer Config] is required in layer fire8/squeeze3x3. Layer conv10 requires a comparable number of 736 × 512 × (1 × 1) = 376 832 parameters plus 512 bias values. Due to implementation details, the cache is implemented with a capacity of 16 × 3 × 1024 × 9 = 442 368 elements.")
Gschwend teaches configuring a portion of the FPGA to be a machine learning processor capable of processing computational operations
(Gschwend, pg. 33 4.1.3, "This project focuses on the proof-of-concept implementation of an FPGA-accelerated embedded CNN. First and foremost, the challenge in this chapter consists of fitting a complete CNN for image classification on ImageNet onto the low-power Zynq XC-7Z045 with decent performance")
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [configuring a portion of the FPGA ie processing elements to be a machine learning processor capable of processing computational operations ie dot-products/intra-kernel multiplications]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
[Image omitted: media_image1.png]
Gschwend teaches offloaded from a microcontroller ("MCU") in accordance with information stored in the second NVM:
(Gschwend, pg. 31 4.1.1, pg. 32 Fig. 4.1, “The Zynqbox has been designed by Supercomputing Systems AG for the evaluation of high-performance image processing algorithms, especially in automotive settings. The embedded platform is based on the Xilinx Zynq-7000 All Programmable System-on-Chip (SoC), which combines a dual-core ARM Cortex-A9 processor [offloaded from a microcontroller; wherein Examiner interprets offloaded as the FPGA is communicated to perform the computational operations instead of the microcontroller and the computational operations are performed with the weights obtained from the Wcache (second NVM)] with programmable FPGA fabric in a single device. The Zynqbox includes a Xilinx Zynq XC-7Z045 SoC, 1 GB DDR3 memory for the ARM processor, 768 MB independent DDR3 memory for the programmable logic, and plenty of connection options (Serial Camera Interfaces, USB, CAN, Gigabit Ethernet). The Kintex-7 FPGA fabric of the SoC features 350k logic cells, 218k LUTs, 2180 kB Block RAM and 900 DSP slices. The CPU runs at up to 1 GHz, boots a standard Linux operating system and is connected to the programmable logic via high-performance AXI4 ports for data exchange and control. Figure 4.1 shows a schematic overview of the Zynqbox platform [92].”)
Gschwend teaches the memory type can be nonvolatile memory
(Gschwend, pg. 51, "Memory Type and Style Memories in the FPGA hardware are described as arrays in the high-level source code. Only statically declared arrays with a fixed size are supported for synthesis. The mapping between the C/C++ arrays and the underlying hardware can be influenced with a number of compiler directives. The memory type (RAM, ROM [nonvolatile], FIFO) and implementation style (Block RAM, Distributed RAM, Shift Register) can be chosen by using the previously introduced #pragma HLS RESOURCE directive.")
[Image omitted: media_image2.png]
Gschwend teaches and configuring another portion of the FPGA to be one or more ML accelerators capable of performing one or more dedicated neural network operations for offloading tasks from the ML processor in accordance with information stored in the second NVM.
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [configuring another portion of the FPGA to be one or more ML accelerators ie different processing elements capable of performing one or more dedicated neural network operations for offloading tasks from the ML processor in accordance with information stored in the second NVM; wherein the parallelization of the different processing elements is another portion of the FPGA and offloading computations to other processing elements and parallelizing the executions using different processing elements is interpreted to be accelerating the executions]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
However, Gschwend does not explicitly teach from a model information block situated in a field programmable gate arrays ("FPGA")
Farabet teaches from a model information block situated in a field programmable gate arrays ("FPGA");
Examiner's note: Examiner interprets the model information block as software in light of the specification ("[0047] FIG. 3 is a block diagram 300 illustrating a software architecture of implementing neural network in FPGA in accordance with one embodiment of the present invention. Diagram 300 includes a block of TensorflowTM FlatbufferTM file or flatbuffers file 302, model information block 306, and coefficients block 308. In one aspect, FPGA can be programmed to be COS implementing neural network operations or ML processing in accordance with the information in flatbuffers file 302. It should be noted that the underlying concept of the exemplary embodiment(s) of the present invention would not change if one or more blocks (circuit or elements) were added to or removed from diagram 300.")
(Farabet, Section III, “A fully-digital coded hardware implementation of a scalable ConvNet [4] has been developed and implemented. Small analog versions of ConvNets have been implemented, but at the time were not able to scale [7]. We believe a fully-digital implementation with current FPGA and ASIC technologies is the easiest way to get a software-compatible object-recognition networks, that is easy to setup and operate, use reduced power consumption and provides high numeric precision. The entire system is coded in hardware description languages (HDL), and is targeted for ASIC synthesis or programmable hardware like FPGAs [from a model information block situated in a field programmable gate arrays ("FPGA"); wherein digital coded hardware implementation of the neural network is interpreted to be the neural network coded (software) on the FPGA].”)
Gschwend and Farabet are both considered to be analogous to the claimed invention because they are in the same field of embedding hardware with neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gschwend to incorporate the teachings of Farabet in order to provide a fully-digital implementation of a neural network on an FPGA to reduce power consumption and provide high numeric precision (Farabet, Section III., “We believe a fully-digital implementation with current FPGA and ASIC technologies is the easiest way to get a software-compatible object-recognition networks, that is easy to setup and operate, use reduced power consumption and provides high numeric precision.”)
In regards to claim 15,
Gschwend and Farabet teach The method of claim 14,
Gschwend teaches further comprising: retrieving one or more model information for the first NVM; and
(Gschwend, pg. 57, "The files also contain all the relative address offsets of the corresponding memory-mapped registers. However, the driver relies on the Userspace I/O (UIO) kernel module, which in turn relies on the correct device tree being loaded into the kernel at boot time. Neither of these requirements is fulfilled in the default SCS Zynqbox installation, and the advanced project time did not allow to fix this. Therefore, we had to patch the low-level driver functions to directly access the Zynq's memory bus to talk to the FPGA-based block, instead of using the elegant UIO module. In Linux, the root user can directly access the physical memory bus without going through the virtual-to-physical address translation by reading and writing the /dev/mem character file. The physical address range which is assigned to the accelerator's AXI4-Lite interface can be found in the Address Editor in Vivado Design Suite's Block Design tool. The corresponding section of the /dev/mem file can then be memory-mapped into the application's own memory space using int fd = open("/dev/mem", O_RDWR); volatile uint32_t* axilite = (uint32_t*)mmap(NULL, AXILITE_LENGTH, PROT_READ|PROT_WRITE, MAP_SHARED, fd, AXILITE_BASEADDR) [retrieving the model information for the first NVM; wherein fig. E.1 teaches AXILITE inputting the model information and storing the model information ie setLayer Config in the Image Cache ie first NVM]; All subsequent reads and writes of *(axilite + byte_offset) are mapped into the /dev/mem file, and from there directly onto the memory bus. This method has been successfully implemented, and the communication between the FPGA accelerator and the CPU-based software is fully functional. The only drawback is the requirement for root privileges when running the ZynqNet Embedded CNN.")
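For illustration only, the quoted /dev/mem approach could be extended as in the sketch below to show how per-layer model information might be written into a memory-mapped layer register map; the base address, register offsets, and values are assumptions, not taken from Gschwend or the instant application.

#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Illustrative sketch only: the base address, length, register offsets, and
// values below are hypothetical and not taken from Gschwend.
constexpr off_t  AXILITE_BASEADDR   = 0x43C00000;
constexpr size_t AXILITE_LENGTH     = 0x1000;
constexpr size_t REG_LAYER_WIDTH    = 0x10 / 4;  // word offset (assumed)
constexpr size_t REG_LAYER_CHANNELS = 0x14 / 4;  // word offset (assumed)
constexpr size_t REG_START          = 0x00 / 4;  // word offset (assumed)

int main() {
  int fd = open("/dev/mem", O_RDWR);
  if (fd < 0) return 1;
  void *mem = mmap(nullptr, AXILITE_LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, AXILITE_BASEADDR);
  if (mem == MAP_FAILED) return 1;
  volatile uint32_t *axilite = static_cast<volatile uint32_t *>(mem);
  // Forward per-layer model information into the layer register map.
  *(axilite + REG_LAYER_WIDTH)    = 256;  // e.g. feature-map width (assumed)
  *(axilite + REG_LAYER_CHANNELS) = 64;   // e.g. channel count (assumed)
  *(axilite + REG_START)          = 1;    // start the layer computation
  munmap(mem, AXILITE_LENGTH);
  close(fd);
  return 0;
}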
Gschwend teaches forwarding the model information to layer register map in the FPGA.
(Gschwend, pg. 84 fig. E.1 teaches loading the model information into the registers of the processing elements ie machine learning processor wherein the processing element is in the FPGA; further the model information is provided per layer)
[Image omitted: media_image1.png]
In regards to claim 16,
Gschwend and Farabet teach The method of claim 15,
Gschwend teaches further comprising: retrieving the model information from the layer register map
(Gschwend, pg. 57, "The files also contain all the relative address offsets of the corresponding memory-mapped registers [from the layer register map]. However, the driver relies on the Userspace I/O (UIO) kernel module, which in turn relies on the correct device tree being loaded into the kernel at boot time. Neither of these requirements is fulfilled in the default SCS Zynqbox installation, and the advanced project time did not allow to fix this. Therefore, we had to patch the low-level driver functions to directly access the Zynq's memory bus to talk to the FPGA-based block, instead of using the elegant UIO module. In Linux, the root user can directly access the physical memory bus without going through the virtual-to-physical address translation by reading and writing the /dev/mem character file. The physical address range which is assigned to the accelerator's AXI4-Lite interface can be found in the Address Editor in Vivado Design Suite's Block Design tool. The corresponding section of the /dev/mem file can then be memory-mapped into the application's own memory space using int fd = open("/dev/mem", O_RDWR); volatile uint32_t* axilite = (uint32_t*)mmap(NULL, AXILITE_LENGTH, PROT_READ|PROT_WRITE, MAP_SHARED, fd, AXILITE_BASEADDR) [retrieving the model information; wherein fig. E.1 teaches AXILITE inputting the model information]; All subsequent reads and writes of *(axilite + byte_offset) are mapped into the /dev/mem file, and from there directly onto the memory bus. This method has been successfully implemented, and the communication between the FPGA accelerator and the CPU-based software is fully functional. The only drawback is the requirement for root privileges when running the ZynqNet Embedded CNN.")
Gschwend teaches and sending the model information to the machine learning processor; and
(Gschwend, pg. 84 fig. E.1 teaches loading the model information into the registers of the processing elements ie machine learning processor wherein the processing element is in the FPGA; further the model information is provided per layer)
[Image omitted: media_image1.png]
Gschwend teaches performing machine learning process in accordance with the model information and the coefficients from the second NVM.
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [performing machine learning process ie execution of the processing elements to compute operations of the CNN in accordance with the model information and the coefficients from the second NVM; wherein fig. E.1 teaches sending the weights and bias from the weight cache to the processing element]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
[Image omitted: media_image11.png]
In regards to claim 18,
Gschwend and Farabet teach The method of claim 14,
Gschwend teaches further comprising programming a first portion of configurable logic blocks ("LBs") of the FPGA to perform functions of the machine learning processor for facilitating offloading computational tasks.
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [a first portion of configurable logic blocks ("LBs") ie processing elements of the FPGA to perform functions ie computations for the CNN of the machine learning processor for facilitating offloading computational tasks ie parallelization of the computations between processing elements]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
In regards to claim 19,
Gschwend and Farabet teach The method of claim 18,
Gschwend teaches further comprising programming a second portion of configurable LBs of the FPGA to perform functions of the MCU for offloading computational tasks to one or more secondary computing units.
(Gschwend, pg. 84 fig. E.1, pg. 38, “Parallelization Opportunities The nested loops can be a source of loop-level parallelism: independent loop iterations can be partially or fully unrolled and executed in parallel on different processing elements [programming a second portion of configurable LBs ie different processing elements of the FPGA to perform functions of the MCU for offloading computational tasks to one or more secondary computing units; wherein the parallelization of the different processing elements is a second portion of configurable LBs and offloading computations to other processing elements ie secondary computing units]. The following sources of loop-level parallelism can be exploited in the ZynqNet CNN:
• independence of layers when applied to different image frames
• independence of dot-products at different pixel positions (y, x)
• independence of input channels ci
• independence of output channels co
• independence of intra-kernel multiplications”)
Claim 27 is rejected under the same rationale under 35 U.S.C. 103, as it is substantially similar to claim 14.
Claims 28 and 29 are rejected under the same rationale under 35 U.S.C. 103, as they are substantially similar to claim 15.
Claims 30 and 31 are rejected under the same rationale under 35 U.S.C. 103, as they are substantially similar to claim 16.
Claim 33 is rejected under the same rationale under 35 U.S.C. 103, as it is substantially similar to claim 18.
Claim 34 is rejected under the same rationale under 35 U.S.C. 103, as it is substantially similar to claim 19.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 17 and 32 are rejected under 35 U.S.C. 103 as being unpatentable over Gschwend in view of Farabet and further in view of Sinha, Anugraha, et al., "Quantized deep learning models on low-power edge devices for robotic systems."
In regards to claim 17,
Gschwend and Farabet teach The method of claim 14,
Gschwend teaches wherein obtaining a trained model file includes extracting model information from flatbufferTM of TensorflowTM
(Gschwend, pg. 8 paragraph 1, “Neural Network Training Frameworks There are many popular software frameworks specifically built for the design and training of neural networks, including, among others, the Neural Network Toolbox for MATLAB [27], Theano [28] with the extensions Lasagne [29] and Keras [30], Torch [31], TensorFlow [32] [wherein obtaining a trained model file includes extracting model information from… TensorflowTM] and Caffe [33]. Most of these frameworks can utilize one or multiple GPUs in order to heavily accelerate the training of neural networks. For this thesis, the Caffe framework has been used due to its maturity, its support in the GPU-based training system NVidia DIGITS, [34] and most importantly because of the excellent availability of network descriptions and pretrained network topologies in native Caffe format.”)
However, Gschwend does not teach [wherein obtaining a trained model file includes extracting model information from] flatbufferTM of TensorflowTM
Sinha teaches [wherein obtaining a trained model file includes extracting model information from] flatbufferTM of TensorflowTM
(Sinha, Section 3.4, “The model was trained and saved using the Keras API.3 For conversion and quantization we used TensorFlow Lite,4 an open source deep learning framework for on-device inference, based on TensorFlow[10]. The saved model was loaded and converted to a TensorFlow Lite FlatBuffer file [flatbufferTM of TensorflowTM] using a post-training weight quantization conversion technique.”)
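For illustration only (not part of Sinha or the claims), a minimal C++ sketch of loading a model stored in the TensorFlow Lite FlatBuffer format and reading out tensor metadata, roughly the kind of model information a parser would extract; the file name "model.tflite" is hypothetical.

#include <cstdio>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Illustrative sketch only: "model.tflite" is a hypothetical file name for a
// model converted to the TensorFlow Lite FlatBuffer format as described above.
int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (!model) return 1;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;

  // Walk the tensors to extract model information (names and sizes).
  for (size_t t = 0; t < interpreter->tensors_size(); ++t) {
    const TfLiteTensor *tensor = interpreter->tensor(static_cast<int>(t));
    if (tensor && tensor->name)
      std::printf("tensor %zu: %s, bytes=%zu\n", t, tensor->name, tensor->bytes);
  }
  return 0;
}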
Sinha is considered to be analogous to the claimed invention because they are in the same field of embedding hardware with neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gschwend and Farabet to incorporate the teachings of Sinha in order to provide a device embedded with a neural network at a very low cost (Tensorflow lite is open-source) (Sinha, Section I., “Deep learning on edge devices has major benefits that could have great impact on the developing world:
• Low latency: Fast on-device inference
• Privacy: Data is processed on-device
• Connectivity: Fully offline
• Power consumption: Low-power, low-cost
We utilize a device that is available at a very low cost (approximately $15 per microcontroller) and requires low power consumption in working mode.”)
Claim 32 is rejected under the same rationale under 35 U.S.C. 103, as it is substantially similar to claim 17.
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Gschwend in view of U.S. Pub. No. US20190190538A1 to Park et al. ("Park").
In regards to claim 25,
Gschwend teaches The device of claim 20,
Park teaches wherein the memory controller
(Park, “[0070] Memory controller 718 generally represents any type or form of device capable of handling memory or data or controlling communication between one or more components of computing system 710. For example, in certain embodiments memory controller 718 may control communication between processor 714, system memory 716, and I/O controller 720 via communication infrastructure 712.”)
Park teaches facilitates to compress and decompress trained machine learning and neural network model coefficients for conserving storage space.
(Park, “[0041] In various embodiments, reducing memory bandwidth consumption [conserving storage space] may directly translate to accelerated computation in bandwidth-limited systems. Compressing data, which may reduce a size of the data, may serve to reduce bandwidth usage and therefore eliminate a memory-bandwidth bottleneck. In some embodiments, data may be compressed before being written to memory, and the compressed data may be read from memory and decompressed on an accelerator before being used in neural network computations. Compression may be applied to an entire set of parameters for a neural network layer or may be selectively applied, for example, to a subset of parameters of a neural network or layer. Certain data, such as filter weights, may be compressed and cached locally for all or a portion of the processing involved in a particular neural network layer (or set of neural network layers) [compress… trained machine learning and neural network model coefficients].”)
(Park, “[0043] At decompression step 420, data that is compressed in DDR 402 or SRAM 404 may be transferred to on-board decompression logic of an accelerator and may be cached for access by, or streamed directly to, network-layer logical units 435. For example, network-layer logical units 435 may request decompressed parameters for a network layer from a cache (e.g., SRAM 404), which may store compressed and/or decompressed data) and may receive a stream of decompressed data directly from decompression logic. Alternatively, network-layer logical units 435 may read or receive compressed parameters and may perform decompression [decompress… trained machine learning and neural network model coefficients] within layer processing logic before using the parameters for layer processing operations.”)
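For illustration only (this is not Park's implementation), a generic sketch of the compress-before-write / decompress-before-use idea described in the cited paragraphs, using zlib; the function and buffer names are hypothetical, and a real accelerator would typically use a hardware-friendly scheme.

#include <vector>
#include <zlib.h>

// Illustrative sketch only: not Park's implementation. Coefficients are
// compressed before being written to memory and decompressed before use,
// trading some computation for reduced storage and memory bandwidth.
std::vector<Bytef> compress_weights(const std::vector<Bytef> &raw) {
  uLongf packed_len = compressBound(raw.size());
  std::vector<Bytef> packed(packed_len);
  compress(packed.data(), &packed_len, raw.data(), raw.size());
  packed.resize(packed_len);  // keep only the actual compressed bytes
  return packed;
}

std::vector<Bytef> decompress_weights(const std::vector<Bytef> &packed,
                                      uLongf original_size) {
  std::vector<Bytef> raw(original_size);
  uncompress(raw.data(), &original_size, packed.data(), packed.size());
  return raw;
}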
Park is considered to be analogous to the claimed invention because they are in the same field of embedding hardware with neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gschwend to incorporate the teachings of Park in order to reduce memory bandwidth and eliminate any memory-bandwidth bottlenecks (Park, “[0041] In various embodiments, reducing memory bandwidth consumption may directly translate to accelerated computation in bandwidth-limited systems. Compressing data, which may reduce a size of the data, may serve to reduce bandwidth usage and therefore eliminate a memory-bandwidth bottleneck.”)
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US Pub. No. US20190244095A1 to Huang et al. teaches a deep learning FPGA converter.
US Pub. No. US20200257955A1 to Apple teaches a customizable chip for AI applications.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASMINE THAI whose telephone number is (703)756-5904. The examiner can normally be reached M-F 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.T.T./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129