Office Action Analysis: 18122665 — PROCESSING METHOD IN A CONVOLUTIONAL NEURAL NETWORK ACCELERATOR, AND ASSOCIATED ACCELERATOR

Office Action

§102
DETAILED ACTION
This action is in response to the application filed on 3/16/2023. Claims 1-10 are pending in the application and have been examined.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xiao et al. (“Neuronlink: An Efficient Chip-to-Chip Interconnect for Large-Scale Neural Network Accelerators” [2020], as disclosed in IDS, hereinafter “Xiao”)


Regarding Claim 1,
Xiao discloses A processing method in a convolutional neural network accelerator comprising an array of unitary processing blocks (Xiao [Section III Subsection B]; “In DNNs, each hidden/output neuron is connected to a small region/all of the input neurons (e.g., convolutional layers/fully connected layers). The outputs of the neurons in one layer form the inputs of the neurons in the next layer. Communications occur between different layers, mainly including one to-one, one-to-many, and many-to-one patterns, as indicated in Fig. 2.” wherein the convolutional layers thus read on an array of unitary processing blocks)
each unitary processing block comprising a router and a unitary computing element PE associated with a set of respective local memories, the unitary computing element making it possible to perform computing operations from among multiplications and accumulations on data stored in its local memories (Xiao [Figure 10]; 

    PNG
    media_image1.png
    291
    722
    media_image1.png
    Greyscale
 
Xiao [Section V]; “In this section, we present a large-scale DNN accelerator with the proposed interconnect, NeuronLink. The overall architecture of the DNN accelerator is shown in Fig. 10. The accelerator consists of four chips [see Fig. 10(a)]. These chips are connected through NeuronLink and organized in a ring approach. Each chip consists of 16 processing nodes, which are organized in a 2-D mesh NoC [see Fig. 10(b)]. For a high bandwidth off-chip communication, like data exchanging with the main memory (DRAM), each chip maintains a PCIe interface, as indicated in Fig. 10(b). Each node is made up of eDRAM buffers to store input feature maps, four digital processing units (DPUs) to perform shift-and-add, activation, and max-pooling operations, and eight analog processing units (APUs) to perform in situ MAC operations, connected with a shared bus [see Fig. 10(c)]. Each APU contains several crossbar arrays, DACs, and ADCs, connected with a shared bus [see Fig. 10(d)]. Our architecture is targeted on DNN inference, which is the dominant application in many fields. Fig. 11 shows the mapping of the ResNet-18 model [1] to the proposed large-scale DNN accelerator. The circular and square represent the router and processing node, respectively. The number inside a circle indicates the layer number of ResNet-18 model. The arrow represents the data movement inside a chip or cross chips. The dark green circle indicates that this router transfers data from the local processing node to the main memory (DRAM) through the PCIe interface” wherein the combination of the Router and Node together form a unit block; wherein the processing units performing shift-and-add, activation, and pooling operations thus read on computing operations stored in its local memories)
the router making it possible to carry out multiple independent data routing operations in parallel to separate outputs of the router (Xiao [Figure 5]; “East output,” “West output,” “South output,” “North output”
Xiao [Section III Subsection B]; “one-to many”; 
Xiao [Section IV Subsection B]; “Our router architecture is shown in Fig. 5, which is the optimization of a conventional router. It consists of five input ports, five output ports, VC buffers, a route computation unit, a VC allocator, a switch allocator, and a 5 × 5 crossbar switch”
Xiao [Section V]; “On-Chip Interconnection Network Performance,” “The architecture of the on-chip network is shown in Fig. 10(b), which is a 2-D 4 × 4 mesh NoC with the optimized router (NeuronLink-R, see Fig. 5) described in Section IV-B.” “different packet types (unicast and multicast)”))
said method comprising the following steps carried out in parallel by one and the same unitary processing block during one and the same respective processing cycle clocked by a clock of the accelerator: receiving and/or transmitting, through the router of the unitary block, first and second data from or to neighbouring unitary blocks in the array in first and second directions selected, on the basis of said data, from among at least the vertical and horizontal directions in the array (Xiao [Fig. 11, Section V]; “The arrow represents the data movement inside a chip or cross chips”; Section IV.A, “At the transmitting side, the data link layer receives packets from the transaction layer (i.e., NoCs) through a bus. The bus consists of data, request, and acknowledge signals. The acknowledge signal (4 bits) indicates if the data link layer receives packets. The header flit of the received packet is analyzed first; packet priority, multicast type, and destination address information from the header flit are available. Then, the body flits are stored in corresponding VCs according to their priorities. With these messages and credit mechanisms, the credit management (CRM) unit decides which VC to be sent if there are several requests. Finally, the packet to be sent is stored into an asynchronous first-input first-output (FIFO) and backed up in the retry buffer.” “At the receiving side, the data processed by the physical medium attachment (PMA) sublayer is recovered to normal data after block boundary alignment and channel bonding. Then, the data are descrambled to normal order and sent to the decoder. After that, the elastic buffer is used to synchronize data from the recovery clock to the local clock.”)
the elementary computing unit performing one of said computing operations in relation to data stored in said set of local memories during at least one previous processing cycle (Xiao [Section V]; “Section V, “Each node is made up of eDRAM buffers to store input feature maps, four digital processing units (DPUs) to perform shift-and-add, activation and max-pooling operations, and eight analog processing units (APUs) to perform in situ MAC operations, connected with a shared bus.”)

Regarding Claim 2,
Xiao teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Xiao further discloses wherein said router comprises a block of parallel routing controllers, a block of parallel arbitrators, a block of parallel switches and a block of parallel input buffers, the router being able to receive and process various data communication requests in parallel (Xiao [Section IV Subsection B]; “B. On-Chip VCs Routing Optimization Our router architecture is shown in Fig. 5, which is the optimization of a conventional router. It consists of five input ports, five output ports, VC buffers, a route computation unit, a VC allocator, a switch allocator, and a 5 ×5 crossbar switch. The dashed rectangle indicates that the component is with our optimization. Details about the VC routing optimizations are given in the following. 1) Scoring Crossbar Arbitration: Fig. 6(a) illustrates a conventional round-robin crossbar arbitration method. For each input port, if there exist packets of different priorities in the VCs, the highest priority one will be forwarded to the crossbar arbitration unit by the VC allocator. Then, packets from different input ports are granted by priority comparator and round-robin arbitration to access the routing resource. This multilevel arbitration method would introduce long latency and consume more hardware resources.”
Xiao [Figure 5];

    PNG
    media_image2.png
    241
    340
    media_image2.png
    Greyscale

Xiao [Figure 6]; 

    PNG
    media_image3.png
    390
    346
    media_image3.png
    Greyscale

wherein the disclosed router comprising route computation of which crossbar priority arbitration and round-robin arbitration are conducted in parallel thus reads on a router comprising a block of parallel routing controllers and a block of parallel arbitrators; wherein Fig.5’s disclosed 5x5 crossbar switch can be read as a parallel switch block; wherein the VC buffer comprising “East input”, “West input”, “South input”… can be read as a parallel input buffers block; wherein the router processing data communications from the respective cardinal buffers is performed in Figure 6)

Regarding Claim 3,
Xiao teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Xiao further discloses wherein said accelerator comprises a global control block, a computing control block and a communication control block, the communication control is performed independently of the computing control, the computing controller making it possible to control the computing operations carried out by the unitary computing elements, and the read and write operations from and to the associated local memories, the communication controller managing the data transfers between a global memory and the local memories, and the data transfers between the processing blocks (Xiao [Section IV Subsection A]; “The physical layer receives data and commands from the data link layer. We utilize asynchronous FIFOs to trans fer data and asynchronous handshake mechanism to transfer commands. Asynchronous handshake approach is introduced to address the issue that commands need a higher priority than data (e.g., for retransmission, commands are required to transmit before the data). Then, the synchronous header, check header, and package header are added to the data or com mands by the encoder. After that, scrambler utilizes the IEEE 802.3 standard to avoid successive “0" or “1" value and balance the number of “0" and “1." The encoded message will not be scrambled. Therefore, these two steps can be executed simultaneously to improve parallelism. Finally, control fields are generated by the byte striping module to avoid mistakes on the chips. At the receiving side, the data processed by the physical medium attachment (PMA) sublayer is recovered to normal data after block boundary alignment and channel bonding. Then, the data are descrambled to normal order and sent to the decoder. After that, the elastic buffer is used to synchronize data from the recovery clock to the local clock. Command interface and data interface transfer command or data to the data link layer by recognizing the sync header. For a command, the CRM in the data link layer analyzes it and responds accordingly. For data, it is buffered and sent to different VCs depending on the priority and address. Once the data fails to pass the check by the decoder, the CRM generates retry command to require the transport side to retry data and abandon any data until it receives the retry data. “
Xiao [Figure 4];

    PNG
    media_image4.png
    329
    345
    media_image4.png
    Greyscale

wherein the accelerator comprises a Gearbox (read as global control block), a CMD interface (computing control block), Data interface (communication control block), both of which are performed independently of one another (communication control performed independently of the computing control), wherein data transfer commands and data to the data link layers to subsequently analyze and respond accordingly to transmitted commands or buffer and transmit data appropriately dependent on priority thus reads on the computing controller performing read and writes of information to manage data transfers between global memory and local memories)

Regarding Claim 4,
Xiao teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Xiao further discloses wherein a unitary block performs transmission of a type selected between broadcast and multicast on the basis of a header of the packet to be transmitted (Xiao [Section IV Subsection A]; “At the transmitting side, the data link layer receives packets from the transaction layer (i.e., NoCs) through a bus. The bus consists of data, request, and acknowledge signals. The acknowledge signal (4 bits) indicates if the data link layer receives packets. The header flit of the received packet is analyzed first, packet priority, multicast type, and destination address information from the header flit are available.”)
wherein the unitary block applies at least one of said rules: for a packet to be transmitted in broadcast mode from a neighbouring unitary block located in a given direction with respect to said block having to perform the transmission, said block transmits the packet in the course of a cycle in all directions except for that of said neighbouring block; for a packet to be transmitted in multicast mode: if the packet comes from the PE of the unitary block, the multicast implemented by the block is bidirectional in two opposite directions; if not, the multicast implemented by the block is unidirectional, directed opposite to the neighbouring processing block from which said packet originates (Xiao [Figure 5];

    PNG
    media_image5.png
    244
    343
    media_image5.png
    Greyscale

wherein the cardinal direction directions by which packet routing is performed thus reads implicitly on a rule for packets to be transmitted in broadcast in all directions except that of said neighboring block, since multicast transmissions are inherently bidirectional in nature; thus, unidirectional broadcast mode would already disclose propagating a packet without retransmission of the packet to the sending block)

Regarding Claim 5,
Xiao teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Xiao further discloses wherein, in the case of at least two simultaneous transmission requests in one and the same direction by a unitary block during a processing cycle, the priority between said requests is arbitrated, the request arbitrated as having priority is transmitted in said direction and the other request is stored and then transmitted in said direction in a subsequent processing cycle (Xiao [Section IV Subsection A]; “one of the key features of NeuronLink is that we introduce transmission priority to ease the congestion in chip-to-chip communication for large-scale NN applications”
Xiao [Section IV Subsection B1]; “performing crossbar priority arbitration and round-robin arbitration in parallel with the scoring mechanism. The related score of the packets from each port is computed based on both their priority (els) and the current round-robin factor (reslast) as weight, and the one gets the highest score will be granted and transmitted to the next router”)

Claims 6-10 recite a convolution neural accelerator apparatus comprising an array of unitary processing blocks and a clock to perform the same disclosed processing method of Claims 1-5 respectively. Claims 6-10 are thus rejected for reasons set forth in the rejection of Claims 1-5 respectively. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
“Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” (IEEE Journal on Emerging and Selected Topics in Circuits and Systems [2019], as disclosed in IDS, Chen et al.) which discloses a processing method in a convolutional neural network accelerated directed towards parallel data routings comprising data computational operations for transmissions
“A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm” (IEEE Journal of Solid-State Circuits [2020], as disclosed in IDS, Zimmer et al.) which discloses selected-type transmissions between broadcast and multicast
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN J KIM whose telephone number is (571) 272-0523. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kieu Vu can be reached on (571) 272-4057. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JONATHAN J KIM/Examiner, Art Unit 2141                                                                                                                                                                                                        
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141
Read full office action
PROCESSING METHOD IN A CONVOLUTIONAL NEURAL NETWORK ACCELERATOR, AND ASSOCIATED ACCELERATOR

This examiner grants 33% of cases after interview

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

AI Strategy Recommendation

Prosecution Projections

Ready to respond to this office action?

PROCESSING METHOD IN A CONVOLUTIONAL NEURAL NETWORK ACCELERATOR, AND ASSOCIATED ACCELERATOR

This examiner grants 33% of cases after interview

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

AI Strategy Recommendation

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email