DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
2. Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1) and further in view of Tong (US 20200366626 A1).
As to claim 1, Chen teaches one or more processors including multiple clusters (the nodes may include a router, a main processor, a memory device and a processing element located in the memory device; the nodes are organized into node groups, where the node groups are organized in a grid, which may be a mesh or torus topology, para[0015]),
to perform an all-to-all communication procedure in which each of the processing elements communicates data packets in parallel for all-to-all data communication between the multiple clusters (In an embodiment, each of the nodes in each node group are directly connected to each other in an all-to-all fashion. For example, intra-group links 210 in node group 102 directly connect each node to each other node in the group. Further, inter-group links directly connect each node in each node group to a node in each neighboring node group, para[0019], ln 1-12/ inter-group links directly connecting, in parallel, each node in each node group to a node in each neighboring node group in the M dimensional grid, the nodes each including a router. The method includes transmitting a packet from a first node in a first location in a first node group to a second node in a second location within the first node group and transmitting the packet from the second node in the second location in the first node group to a third node in a corresponding second location in a second node group, para[0006], ln 7-17/ links between nodes in different node groups, called inter-group links, are provided between nodes in neighboring node groups, where the inter-group links are parallel direct connections from each node in each node group to a node in a neighboring node group. The position of the node within each neighboring node group receiving the inter-group link may be the same, thus providing parallel connection from each node to each of the neighboring node groups, para[0013], ln 8-15).
Parker teaches and each of the multiple clusters communicates the data packets to all other clusters of the multiple clusters (processor interconnect networks are used in multiprocessor computer systems to transfer data from one processor to another, or from one group of processors to another group; each group is treated as a very high-radix router, and a single dimension flattened butterfly (all-to-all) connects all of the groups to form the second layer of the dragonfly topology example presented here, Sec: The dragonfly network topology, ln 3-10/ to increase the terminal bandwidth of a high-radix network such as a Dragonfly, channel slicing can be employed; rather than make the channels wider, which would decrease the router radix, multiple networks can be connected in parallel to add capacity; similarly, the dragonfly topology in some embodiments can also utilize parallel networks to add capacity to the network; in addition, the dragonfly networks described so far assumed uniform bandwidth to all nodes in the network, Sec: To increase the terminal bandwidth, ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen with Parker to incorporate the above feature because this achieves the benefits of a very high-radix router without requiring hundreds or thousands of ports per node and increases the physical length of the global channels, exploiting the capabilities of emerging optical signaling technology.
Tong teaches processing elements that are each configured to generate data packets in parallel (FIG. 5 is an architectural diagram of a server according to an embodiment of the present invention. As shown in FIG. 5, the server may be a server cluster including a master server 011 and a plurality of computing servers 012. Each computing server 012 may be configured to generate write operation information based on a to-be-updated forwarding entry, and encapsulate the write operation information into a write operation packet. The master server 011 may be configured to deliver the write operation packet to a line card box 02. The plurality of computing servers 012 may generate write operation packets in parallel, to ensure a rate of generating write operation packets, para[0073], ln 14/ the NP of the line card box may include a plurality of processor cores. The plurality of processor cores may process the write operation packet in parallel, in other words, obtain the write operation information from the write operation packet in parallel, para[0084], ln 1-6/ where the write operation information includes write operation data and a write operation address, the write operation data is used to indicate the to-be-updated forwarding entry, and the write operation address is used to indicate an address to which the write operation data is to be written, para[0021], ln 3-9).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen and Parker with Tong to incorporate the above feature because this ensures the forwarding entry is correctly updated, thereby improving reliability in updating the forwarding entry.
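For context on the claim 1 limitations mapped above (processing elements that generate data packets in parallel and clusters that communicate the packets to all other clusters), the following is a minimal illustrative sketch. The cluster and element counts, dictionary packet format, and use of a Python thread pool are assumptions made for illustration only; the sketch does not represent the claimed system or any cited reference's implementation.

```python
# Illustrative sketch only: a flat all-to-all exchange in which every
# processing element of every cluster generates one packet per peer
# cluster "in parallel" (simulated here with a thread pool).
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

NUM_CLUSTERS = 4          # assumed values for illustration
ELEMENTS_PER_CLUSTER = 2

def generate_packets(cluster_id, element_id):
    """Each processing element builds one distinct packet per destination cluster."""
    return [
        {"src_cluster": cluster_id, "src_element": element_id,
         "dst_cluster": dst, "payload": f"data-{cluster_id}.{element_id}->{dst}"}
        for dst in range(NUM_CLUSTERS) if dst != cluster_id
    ]

# All processing elements generate their packets concurrently.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(generate_packets, c, e)
               for c in range(NUM_CLUSTERS)
               for e in range(ELEMENTS_PER_CLUSTER)]
    packets = [p for f in futures for p in f.result()]

# "Network": every cluster delivers its packets to all other clusters.
inbox = defaultdict(list)
for p in packets:
    inbox[p["dst_cluster"]].append(p)

for dst in sorted(inbox):
    print(f"cluster {dst} received {len(inbox[dst])} packets")
```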
3. Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1) and further in view of Mineo (US 9401774 B1).
As to claim 2, Mineo teaches the processing elements of the multiple clusters are graphics processing units (GPUs)( Each node 302 could also represent, for example, a group of nodes in a hierarchy or a storage cluster. Alternatively, each node 302 could represent different types of processing elements, for example, a CPU or a GPU. Alternatively still, the node 302 may represent an interface to another type of network. For example, an element for translating between the HPC interconnect fabric and Ethernet. An “all-to-all” connection as used herein refers to connections within the system 300 which provide dedicated, unshared, arbitration-free communication between each node 302 of the system 300 and each of the remaining nodes 302 of the system 300. For clarity, a transmitter portion of each node 302 is illustrated on the left hand side of the drawing and a receiver portion of each node 302 is illustrated in the right hand side of the drawing, i.e. P1_TX and P1_RX are two portions of the same node, col 4, ln 50-67/ The routing of the signals from the transmission portions of the nodes to the receiver portions of the nodes is therefore asymmetric. Specifically for each connection on the transmission side, the W transmitters in bank j (1≦j≦M) of node k (1≦k≦N) are connected to the j.sup.th input port k in the same group. However, the W receivers of the bank j (1≦j≦M) of node k (1≦k≦N) are connected to the output port k of the AWGR group j. The input ports of the AWGRs are numbered the same as the W nodes in the same group, and are repeated M times in one group, while the output ports of the AWGRs are numbered repetitively from 1 to N for all the M groups, col 5, ln 50-60).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker and Tong with Mineo to incorporate the above feature because this provides an all-to-all connection between the nodes of the network using a wavelength routing device and a limited number of wavelengths.
4. Claims 3 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1) and further in view of Billa (US 20210097082 A1).
As to claim 3, Billa teaches the processing elements are each configured to communicate the data packets in parallel intra-cluster (DPUs 17 interface and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity such that any of storage nodes 12 or compute nodes 13 may communicate packet data for a given packet flow to any other of the servers using any of a number of parallel data paths within the data center 10. For example, in some example network architectures, DPUs spray individual packets for packet flows between the DPUs and across some or all of the multiple parallel data paths in the data center switch fabric 14 and reorder the packets for delivery to the destinations so as to provide full mesh connectivity, para[0044]/ each highly programmable DPU 17 comprises a network interface (e.g., Ethernet) to connect to a network to send and receive stream data units (e.g., data packets), one or more host interfaces (e.g., Peripheral Component Interconnect-Express (PCI-e)) to connect to one or more application processors (e.g., a CPU or a graphics processing unit (GPU)), para[0036], ln 1-10/ DPU 17 operates as a new type of processor separate from any CPU or GPU of computing device 13, para[0083], ln 1-3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker and Tong with Billa to incorporate the above feature because this provides a high-level controller for configuring and managing the routing and switching infrastructure of the data center.
As to claim 8, Billa teaches wherein the all-to-all communication procedure comprises: a first stage of intra-cluster parallel data communication between respective processing elements of each of the multiple clusters, and data is coalesced for inter-cluster data exchange; a second stage of the inter-cluster data exchange for the all-to-all data communication between the multiple clusters (DPUs 17 interface and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity such that any of storage nodes 12 or compute nodes 13 may communicate packet data for a given packet flow to any other of the servers using any of a number of parallel data paths within the data center 10. For example, in some example network architectures, DPUs spray individual packets for packet flows between the DPUs and across some or all of the multiple parallel data paths in the data center switch fabric 14 and reorder the packets for delivery to the destinations so as to provide full mesh connectivity, para[0044]/ DPU 17 operates as a new type of processor separate from any CPU or GPU of computing device 13, para[0083], ln 1-3); and a third stage of intra-cluster data distribution to the respective processing elements of each of the multiple clusters (Analytics service control nodes 25 present one or more interfaces (e.g., APIs) with which general analytics software tools 23 interact to direct analytics processing of data from data sources 19 via one or more clusters of one or more DPU-enhanced compute nodes 13 and, in some examples, one or more DPU-enhanced storage nodes 12, para[0047], ln 5-14/ In one example, data ingestion engine 31 reads rows of tables of data from data sources 19 and distributes the rows of data to compute nodes 13 via DPUs 17 using distribution keys for storage and subsequent, high-speed analytics processing. Alternatively, in some implementations, data ingestion engine 31 may horizontally slice each table of data within data sources 19 into N slices and allocate each slice to one of compute nodes 13 of cluster 42 identified by analytics service control node 25 for servicing the request. In one example, the number of slices N is the same as the number of compute nodes 13 selected for the cluster servicing the request. Each compute node 13 reads the slice or slices from data sources 19 assigned to the compute node for retrieval, para[0054], ln 10-24) for the same reason as to claim 3 above.
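The three-stage procedure recited in claim 8 (intra-cluster communication with coalescing, inter-cluster exchange, intra-cluster distribution) can be illustrated with a short sketch. All sizes, data structures and the single-process simulation below are assumptions for illustration only and do not reproduce the applicant's method or Billa's DPU-based system.

```python
# Illustrative three-stage all-to-all sketch (assumed sizes; not the
# claimed method): (1) elements coalesce per-destination data inside
# their own cluster, (2) clusters exchange one message per cluster pair,
# (3) each cluster distributes received data back to its elements.
NUM_CLUSTERS, ELEMS = 3, 2

# Stage 0: each element holds one item destined for every remote cluster.
data = {(c, e): {dst: f"c{c}e{e}->c{dst}" for dst in range(NUM_CLUSTERS) if dst != c}
        for c in range(NUM_CLUSTERS) for e in range(ELEMS)}

# Stage 1: intra-cluster coalescing into one send buffer per destination cluster.
send_buf = {c: {dst: [] for dst in range(NUM_CLUSTERS) if dst != c} for c in range(NUM_CLUSTERS)}
for (c, e), items in data.items():
    for dst, item in items.items():
        send_buf[c][dst].append(item)

# Stage 2: inter-cluster exchange -- a single coalesced message per cluster pair.
recv_buf = {c: {} for c in range(NUM_CLUSTERS)}
for src in range(NUM_CLUSTERS):
    for dst, message in send_buf[src].items():
        recv_buf[dst][src] = message            # one message per (src, dst) pair

# Stage 3: intra-cluster distribution of the received data to local elements.
for c in range(NUM_CLUSTERS):
    for e in range(ELEMS):
        local_view = list(recv_buf[c].values())
        print(f"cluster {c}, element {e} sees {sum(len(m) for m in local_view)} items")
```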
5. Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1), in view of Billa (US 20210097082 A1) and further in view of Bataineh (US 20140140341 A1).
As to claim 4, Bataineh teaches the data packets include at least GET requests or PUT requests communicated by the processing elements in parallel intra-cluster (FIG. 1 shows a system 10 comprising multiple groups of nodes in which each of the groups of nodes 12 is connected to all of the others (illustrated by the lines between groups of nodes). Where traffic is uniformly distributed all paths are equally loaded as shown on the left hand side of the Figure. Where traffic is between pairs of groups of nodes 12 (shown in heavier lines on the right hand side of the Figure) many of the links are unused (thinner lines) with minimal routing. Adaptive routing algorithms select between minimal and non-minimal routing according to network load. This choice can be biased to favor minimal or non-minimal routing, for example, so that minimal routing can be preferred when the load is lower. In general, global communication patterns (all-to-all or FFT for example) perform well with minimal routing and local-communication patterns (nearest neighbor for example) perform well with non-minimal (or some element of non-minimal) routing, para[0002], ln 17-30 to para[0003], ln 1-6/ This simulation consisted of each endpoint injecting messages of size 64 bytes to 128K bytes. Each message consisted of cache-line sized GET request packets to random (evenly distributed) destinations in the network, para[0057], ln 1-6).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker, Tong and Billa with Bataineh to incorporate the above feature because increasing the minimal bias of the routing algorithm results in minimal routing of a higher percentage of traffic, which improves performance and cost effectiveness.
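The traffic pattern quoted from Bataineh above (cache-line-sized GET request packets injected to random, evenly distributed destinations) can be sketched as follows; the node count, message size, and packet fields are assumed for illustration and are not taken from the reference.

```python
# Illustrative sketch of the cited traffic pattern: each endpoint splits a
# message into cache-line-sized GET request packets addressed to random,
# evenly distributed destinations. Field names and sizes are assumptions.
import random

NUM_NODES = 8
CACHE_LINE_BYTES = 64

def make_get_requests(src, message_bytes):
    """Split a message into cache-line-sized GET requests to random peers."""
    num_packets = message_bytes // CACHE_LINE_BYTES
    return [{"type": "GET", "src": src,
             "dst": random.choice([n for n in range(NUM_NODES) if n != src]),
             "size": CACHE_LINE_BYTES}
            for _ in range(num_packets)]

requests = [pkt for src in range(NUM_NODES) for pkt in make_get_requests(src, 1024)]
print(len(requests), "GET request packets generated")   # 8 nodes x 16 packets = 128
```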
6. Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1) and further in view of Froese (US 20220166705 A1).
As to claim 5, Froese teaches a single data message is communicated between a pair of the multiple clusters for the all-to-all data communication between the multiple clusters( It should be noted that each column may have identical connections with the all-to-all column bus connections for a single column, and there may be a two clock delay per tile, resulting in a six clock delay to get from the top row to the bottom row. It should also be understood that both row and column buses both use the aforementioned credit-based protocol to determine when they are able to send. In the case of row buses, the source port maintains credit count, para[0056], ln 1-10/ There may be multiple local links between pairs of switches within a group and there may be multiple links between pairs of groups. A packet may be routed directly from its source group to its destination group or it may be routed through one other group (an intermediate group) on its way from the source to the destination group, para[0207], ln 9-15).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker and Tong with Froese to incorporate the above feature because this reduces the maximum latency of small communications in the presence of large communications.
As to claim 15, it is rejected for the same reason as to claims 1 and 5 above.
7. Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1), in view of Froese (US 20220166705 A1) and further in view of Ajima (US 20080089329 A1).
As to claim 6, Ajima teaches the data packets are coalesced in a send buffer from which the single data message is generated for inter-cluster communication between the pair of the multiple clusters( When one or more packets are stored in the buffer memories in the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c, the switch circuit 118 successively acquires the packets from the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c. Then, the switch circuit 118 determines the destination (from the switch circuit 118) of each of the acquired packets on the basis of the destination address contained in the packet. The destination (from the switch circuit 118) of each of the acquired packets is one of the input circuit 112 and the transmission circuits 115, 117a, 117b, and 117c, para[0062]/ if the sixteen nodes in FIG. 1 (i.e., the first to fourth nodes 11, 12, 13, and 14 in the computer cluster 10, the first to fourth nodes 21, 22, 23, and 24 in the computer cluster 20, the first to fourth nodes 31, 32, 33, and 34 in the computer cluster 30, and the first to fourth nodes 41, 42, 43, and 44 in the computer cluster 40) are respectively arranged in sixteen lattice points in a two-dimensional single-layer lattice with the dimensions of 4.times.4, transmission of a packet from the node 11 to the node 43 needs six operations of relaying the packet. However, in the interconnection network according to the present invention in which four computer clusters each containing four nodes are respective, para[0042], ln 14).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker, Tong and Froese with Ajima to incorporate the above feature because this reduces the number of operations of relaying a packet.
As to claim 7, Ajima teaches the single data message is communicated from the send buffer to a receive buffer for the inter-cluster communication between the pair of the multiple clusters (para[0062]/para[0042], ln 14) for the same reason as to claim 6 above.
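As context for the send-buffer/receive-buffer limitations of claims 6 and 7, the following sketch shows one way small packets could be coalesced into a single message and unpacked on the receiving side. The length-prefixed framing, function names, and buffer handling are illustrative assumptions, not the claimed format or Ajima's circuitry.

```python
# Illustrative sketch only: small per-element packets are coalesced into one
# contiguous send buffer, which is transferred as a single message into the
# peer cluster's receive buffer and then unpacked. The 4-byte length-prefix
# framing below is an assumption, not a format from the claims or references.
import struct

def coalesce(packets):
    """Pack a list of byte packets into one length-prefixed send buffer."""
    buf = bytearray()
    for p in packets:
        buf += struct.pack("!I", len(p)) + p     # 4-byte big-endian length prefix
    return bytes(buf)

def unpack(recv_buffer):
    """Recover the individual packets from the received single message."""
    packets, offset = [], 0
    while offset < len(recv_buffer):
        (length,) = struct.unpack_from("!I", recv_buffer, offset)
        offset += 4
        packets.append(recv_buffer[offset:offset + length])
        offset += length
    return packets

send_buffer = coalesce([b"packet-A", b"packet-B", b"packet-C"])
receive_buffer = bytes(send_buffer)               # stands in for the inter-cluster transfer
print([p.decode() for p in unpack(receive_buffer)])
```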
8. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20140044015 A1) in view of Parker (EP 2451127 A1), in view of Tong (US 20200366626 A1) and further in view of Sharapov (US 7333444 B1).
As to claim 9, Sharapov teaches the all-to-all communication procedure is performed in a number of steps that is twice a number of clustering levels plus one additional step(This network topology is based on multiple levels of all-to-all clustering within subsets of the overall network, col 6, ln 40-45).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen, Parker and Tong with Sharapov to incorporate the above feature because connecting system components together into highly connected networks through direct point-to-point links provides a high degree of connectivity with low communication latency.
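Claim 9's recited step count can be checked with simple arithmetic. The sketch below merely evaluates the recited relationship (steps equal twice the number of clustering levels plus one); the mapping of steps to particular phases is not characterized here, and the function name is an assumption for illustration.

```python
# Illustrative arithmetic only: the recited step count is twice the number
# of clustering levels plus one additional step.
def all_to_all_steps(clustering_levels: int) -> int:
    return 2 * clustering_levels + 1

# e.g. one clustering level -> 3 steps, two levels -> 5 steps, three -> 7.
print([all_to_all_steps(n) for n in (1, 2, 3)])   # prints [3, 5, 7]
```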
9. Claims 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Kaldewey (US 20200125368 A1) and further in view of Mineo (US 9401774 B1).
As to claim 10, Parker teaches and each of the multiple clusters communicates the data packets to all other clusters of the multiple clusters (processor interconnect networks are used in multiprocessor computer systems to transfer data from one processor to another, or from one group of processors to another group; each group is treated as a very high-radix router, and a single dimension flattened butterfly (all-to-all) connects all of the groups to form the second layer of the dragonfly topology example presented here, Sec: The dragonfly network topology, ln 3-10/ to increase the terminal bandwidth of a high-radix network such as a Dragonfly, channel slicing can be employed; rather than make the channels wider, which would decrease the router radix, multiple networks can be connected in parallel to add capacity; similarly, the dragonfly topology in some embodiments can also utilize parallel networks to add capacity to the network; in addition, the dragonfly networks described so far assumed uniform bandwidth to all nodes in the network, Sec: To increase the terminal bandwidth, ln 1-10).
Kaldewey teaches multiple graphics processing units (GPUs) distributed in clusters, the multiple GPUs configured to communicate data intra-cluster between respective GPUs of each of the clusters and to exchange the data inter-cluster (In a hierarchical exchange, the operations processor 134 may logically group the GPUs 102A-102N into sets of M GPUs, where M is less than the number of GPUs N. At the outset of probing the hash table 144, each set S of the GPUs 102A-102N may collectively include a full copy of the probe table 142 across a number of partitions in the GPU memory 122 of the GPUs in the set, so that the probe table partitions are replicated in each set S. When exchanging probe table data between GPUs, the partitions may be exchanged (e.g., after filtering in a round robin fashion) within each set S. This approach may reduce the number of passes of probe table data from N to M, which may be desirable for GPU-GPU connection topologies that may not provide fast bisection bandwidth between all GPUs, and are effectively limited by communication throughput. For example, the pressure from an all-to-all interconnect between the GPUs 102A-102N may be offloaded, as the GPUs may only communicate all-to-all during the first iteration of probing the hash table 144, but in the next M−1 iterations, the GPUs may communicate within the sets, para[0098]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker with Kaldewey to incorporate the above feature because this increases the effective memory capacity available to the system and may also leverage the processing capabilities of multiple GPUs executing in parallel when processing a join operation to reduce the runtime of a join relative to conventional approaches.
Mineo teaches for an all-to-all data communication between the clusters; distribute the data intra-cluster to the respective GPUs of each of the clusters( The network 300 of the present invention is illustrated in FIG. 4. The network 300 includes N nodes 302 for which all-to-all connection is provided utilizing AWGRs 304. The nodes are identified in FIG. 4 as P1-PN. The nodes 302 represent, for example a processing element along with memory and a network interface (e.g. a computer, blade, or rack), and a WDM optical interconnect link including a transmitter and a receiver. Each node 302 could also represent, for example, a group of nodes in a hierarchy or a storage cluster. Alternatively, each node 302 could represent different types of processing elements, for example, a CPU or a GPU. Alternatively still, the node 302 may represent an interface to another type of network. For example, an element for translating between the HPC interconnect fabric and Ethernet. An “all-to-all” connection as used herein refers to connections within the system 300 which provide dedicated, unshared, arbitration-free communication between each node 302 of the system 300 and each of the remaining nodes 302 of the system 300. For clarity, a transmitter portion of each node 302 is illustrated on the left hand side of the drawing and a receiver portion of each node 302 is illustrated in the right hand side of the drawing, i.e. P1_TX and P1_RX are two portions of the same node, specifically the transmitter and receiver portions respectively, col 4, ln 43-67).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker and Kaldewey with Mineo to incorporate the above feature because this provides arbitration-free all-to-all connection between the nodes of the network utilizing wavelength routing devices and utilizing a limited number of wavelengths for routing optical signals to the nodes of the network.
As to claim 11, Kaldewey teaches the multiple GPUs are configured to coalesce the data intra-cluster for exchange of the data inter-cluster (para[0098]) for the same reason as to claim 10 above.
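The set-based exchange described in the Kaldewey paragraph cited above (logically grouping N GPUs into sets of M so that probe-table passes drop from N to M) can be sketched as follows; the GPU count, set size, slice assignment, and rotation scheme are simplifying assumptions for illustration rather than the reference's actual implementation.

```python
# Illustrative sketch of a set-based ("hierarchical") exchange: N GPUs are
# logically grouped into sets of M. After an initial all-to-all, each set
# collectively holds all partitions, split M ways across its members; later
# passes only rotate slices within a set, so each GPU needs M passes, not N.
N, M = 8, 4                                        # 8 "GPUs", sets of 4 (assumed)
partitions = [f"partition-{i}" for i in range(N)]  # one partition per GPU initially

sets = [list(range(s, s + M)) for s in range(0, N, M)]

# Pass 0: all-to-all so that each set collectively holds all N partitions,
# re-partitioned M ways across its members (simplified as equal slices).
held = {}
for members in sets:
    for rank, gpu in enumerate(members):
        held[gpu] = [partitions[i] for i in range(N) if i % M == rank]

# Passes 1..M-1: rotate the slices within each set (no inter-set traffic).
seen = {gpu: set(held[gpu]) for gpu in range(N)}
for _ in range(M - 1):
    for members in sets:
        rotated = [held[members[-1]]] + [held[g] for g in members[:-1]]
        for gpu, slice_ in zip(members, rotated):
            held[gpu] = slice_
            seen[gpu].update(slice_)

assert all(len(seen[gpu]) == N for gpu in range(N))  # every GPU saw every partition
print("each GPU processed all", N, "partitions in", M, "passes")
```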
10. Claims 12, 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Kaldewey (US 20200125368 A1), in view of Mineo (US 9401774 B1) and further in view of Ajima (US 20080089329 A1).
As to claim 12, Ajima teaches the data packets are coalesced in a send buffer from which the single data message is generated for inter-cluster communication between the pair of the multiple clusters( When one or more packets are stored in the buffer memories in the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c, the switch circuit 118 successively acquires the packets from the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c. Then, the switch circuit 118 determines the destination (from the switch circuit 118) of each of the acquired packets on the basis of the destination address contained in the packet. The destination (from the switch circuit 118) of each of the acquired packets is one of the input circuit 112 and the transmission circuits 115, 117a, 117b, and 117c, para[0062]/ if the sixteen nodes in FIG. 1 (i.e., the first to fourth nodes 11, 12, 13, and 14 in the computer cluster 10, the first to fourth nodes 21, 22, 23, and 24 in the computer cluster 20, the first to fourth nodes 31, 32, 33, and 34 in the computer cluster 30, and the first to fourth nodes 41, 42, 43, and 44 in the computer cluster 40) are respectively arranged in sixteen lattice points in a two-dimensional single-layer lattice with the dimensions of 4.times.4, transmission of a packet from the node 11 to the node 43 needs six operations of relaying the packet. However, in the interconnection network according to the present invention in which four computer clusters each containing four nodes are respective, para[0042], ln 14).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker, Kaldewey and Mineo with Ajima to incorporate the above feature because this reduces the number of operations of relaying a packet.
As to claim 13, Ajima teaches wherein the second stage comprises a single data message being communicated between a pair of the clusters for the inter-cluster data exchange (para[0062]/para[0042], ln 14) for the same reason as to claim 12 above.
As to claim 14, Ajima teaches the single data message is communicated from a send buffer to a receive buffer for the inter-cluster data exchange between the pair of the clusters (para[0062]/para[0042], ln 14) for the same reason as to claim 12 above.
11. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Klenk (US 20210037107 A1) and further in view of Mineo (US 9401774 B1).
As to claim 15, Parker teaches and each of the multiple clusters communicates the data packets to all other clusters of the multiple clusters (processor interconnect networks are used in multiprocessor computer systems to transfer data from one processor to another, or from one group of processors to another group; each group is treated as a very high-radix router, and a single dimension flattened butterfly (all-to-all) connects all of the groups to form the second layer of the dragonfly topology example presented here, Sec: The dragonfly network topology, ln 3-10/ to increase the terminal bandwidth of a high-radix network such as a Dragonfly, channel slicing can be employed; rather than make the channels wider, which would decrease the router radix, multiple networks can be connected in parallel to add capacity; similarly, the dragonfly topology in some embodiments can also utilize parallel networks to add capacity to the network; in addition, the dragonfly networks described so far assumed uniform bandwidth to all nodes in the network, Sec: To increase the terminal bandwidth, ln 1-10).
Klenk teaches generating, by each of multiple GPUs distributed in cluster, data packets in parallel( the PPU 300 is a graphics processing unit (GPU), para[0063], ln 8-10/ general processing clusters (GPCs) 350, para[0065], ln 3-6/ in large-scale cluster computing environments where PPUs 300 process very large datasets and/or run applications for extended periods, para[0084], ln 6-7/ the parallel processing unit in the endpoint to generate a data packet associated with the load/store instruction that is forwarded to the network device 110. A description of an exemplary parallel processing unit is set forth below before discussing the detailed methods for performing a network computation, para[0062], ln 9-15/ then the All-to-All primitive causes each endpoint to receive one element of each row such that the matrix is transposed and each endpoint stores a column of the matrix. , para[0149], ln 37-42).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker with Klenk to incorporate the above feature because this reduces the complexity of the task and reduces network latency while increasing the effective network bandwidth.
Mineo teaches multiple GPUs distributed in clusters for all-to-all data communication between the clusters ( The network 300 of the present invention is illustrated in FIG. 4. The network 300 includes N nodes 302 for which all-to-all connection is provided utilizing AWGRs 304. The nodes are identified in FIG. 4 as P1-PN. The nodes 302 represent, for example a processing element along with memory and a network interface (e.g. a computer, blade, or rack), and a WDM optical interconnect link including a transmitter and a receiver. Each node 302 could also represent, for example, a group of nodes in a hierarchy or a storage cluster. Alternatively, each node 302 could represent different types of processing elements, for example, a CPU or a GPU. Alternatively still, the node 302 may represent an interface to another type of network. For example, an element for translating between the HPC interconnect fabric and Ethernet. An “all-to-all” connection as used herein refers to connections within the system 300 which provide dedicated, unshared, arbitration-free communication between each node 302 of the system 300 and each of the remaining nodes 302 of the system 300. For clarity, a transmitter portion of each node 302 is illustrated on the left hand side of the drawing and a receiver portion of each node 302 is illustrated in the right hand side of the drawing, i.e. P1_TX and P1_RX are two portions of the same node, specifically the transmitter and receiver portions respectively, col 4, ln 43-67).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker and Klenk with Mineo to incorporate the above feature because this provides arbitration-free all-to-all connection between the nodes of the network utilizing wavelength routing devices and utilizing a limited number of wavelengths for routing optical signals to the nodes of the network.
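The effect described in the Klenk passage cited above, in which an All-to-All primitive leaves each endpoint holding a column of a matrix whose rows were originally distributed one per endpoint, can be sketched in a few lines; the 3x3 matrix and dictionary bookkeeping are assumptions for illustration, and no real GPU library API is used or implied.

```python
# Illustrative sketch of the cited effect: if endpoint i initially holds row i
# of a matrix and every endpoint sends element j of its row to endpoint j,
# then after the all-to-all endpoint j holds column j -- i.e. the matrix has
# been transposed across endpoints. Plain Python only; no GPU library is used.
matrix = [[10, 11, 12],
          [20, 21, 22],
          [30, 31, 32]]
rows = {i: matrix[i] for i in range(3)}          # endpoint i holds row i

received = {j: [None] * 3 for j in range(3)}
for i, row in rows.items():                      # all-to-all: element j of row i -> endpoint j
    for j, value in enumerate(row):
        received[j][i] = value

assert [received[j] for j in range(3)] == [list(col) for col in zip(*matrix)]
print(received)                                   # endpoint j now holds column j
```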
12. Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Klenk (US 20210037107 A1), in view of Mineo (US 9401774 B1) and further in view of Ajima (US 20080089329 A1).
As to claim 16, Ajima teaches coalescing the data packets in a send buffer from which the single data message is generated for inter-cluster communication between the pair of the clusters( When one or more packets are stored in the buffer memories in the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c, the switch circuit 118 successively acquires the packets from the output circuit 113 and the reception circuits 114, 116a, 116b, and 116c. Then, the switch circuit 118 determines the destination (from the switch circuit 118) of each of the acquired packets on the basis of the destination address contained in the packet. The destination (from the switch circuit 118) of each of the acquired packets is one of the input circuit 112 and the transmission circuits 115, 117a, 117b, and 117c, para[0062]/ if the sixteen nodes in FIG. 1 (i.e., the first to fourth nodes 11, 12, 13, and 14 in the computer cluster 10, the first to fourth nodes 21, 22, 23, and 24 in the computer cluster 20, the first to fourth nodes 31, 32, 33, and 34 in the computer cluster 30, and the first to fourth nodes 41, 42, 43, and 44 in the computer cluster 40) are respectively arranged in sixteen lattice points in a two-dimensional single-layer lattice with the dimensions of 4.times.4, transmission of a packet from the node 11 to the node 43 needs six operations of relaying the packet. However, in the interconnection network according to the present invention in which four computer clusters each containing four nodes are respective, para[0042], ln 14).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker, Klenk and Mineo with Ajima to incorporate the above feature because this reduces the number of operations of relaying a packet.
13. Claims 17, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Klenk (US 20210037107 A1), in view of Mineo (US 9401774 B1), in view of Ajima (US 20080089329 A1) and further in view of Billa (US 20210097082 A1).
As to claim 17, Billa teaches communicating the single data message from the send buffer to a receive buffer for the inter-cluster communication between the pair of the clusters( In some example implementations, DPUs 17 interface and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity such that any of storage nodes 12 or compute nodes 13 may communicate packet data for a given packet flow to any other of the servers using any of a number of parallel data paths within the data center 10. For example, in some example network architectures, DPUs spray individual packets for packet flows between the DPUs and across some or all of the multiple parallel data paths in the data center switch fabric 14 and reorder the packets for delivery to the destinations so as to provide full mesh connectivity, para[0044], ln 1-20).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker, Klenk, Mineo and Ajima with Billa to incorporate the above feature because this reduces the number of operations of relaying a packet.
As to claim 18, Billa teaches the all-to-all communication procedure comprises a first stage of intra-cluster parallel data communication between respective GPUs of each of the clusters( para[0044], ln 1-20) for the same reason as to claim 17 above.
As to claim 20, Billa teaches the all-to-all communication procedure comprises a third stage of intra-cluster data distribution to the respective GPUs of each of the clusters ( para[0047], ln 5-14/ para[0054], ln 10-24) for the same reason as to claim 3 above.
14. Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Parker (EP 2451127 A1) in view of Klenk (US 20210037107 A1), in view of Mineo (US 9401774 B1), in view of Ajima (US 20080089329 A1), in view of Billa (US 20210097082 A1) and further in view of Howard (WO 2005111843 A2).
As to claim 21, Howard teaches the data packets are communicated asynchronously from each of the multiple clusters to all other clusters of the multiple clusters (Manifold and Hyper-Manifold Level All-to-All Cross-Communication: in considering the data present on the nodes in each cascade group, each node has all of the data from every node in the group; thus, data may be exchanged between each cascade group similarly to data exchanges between nodes: each node exchanges only with its corresponding node in the other groups, Sec: Manifold and Hyper-Manifold, ln 1-12/ The exchange process can be organized into 3 distinct steps: 1) All compute nodes in a cascade group exchange among their member nodes. 2) All cascade groups connected to a single top level hyper-manifold channel exchange. 3) The nodes on each top level channel exchange. After step 1, each node in a cascade group has a copy of all the data on the group. Thus, step 2 proceeds by having each node exchange only with its counterpart in the other groups, Sec: Type I Manifold All-to-All, ln 11-33/ specific packets of data are moved from one compute node to every other compute node, Sec: Mersenne Prime Cascade, ln 8-11/ using a processing thread to handle asynchronous input to the compute node; using a second processing thread to process a job of the cascade; and using a processing thread to handle asynchronous output from the compute node, claim 35, ln 2-5).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Parker, Klenk, Mineo, Ajima and Billa with Howard to incorporate the above feature because this reduces communication latency within a compute node of a cascade.
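The asynchronous exchange described in the Howard citation above (dedicated processing threads for asynchronous input and output at each compute node) can be sketched with ordinary threads and queues; the cluster count, queue-based links, and thread layout below are illustrative assumptions, not the reference's cascade implementation or the claimed method.

```python
# Illustrative sketch only: each cluster uses a dedicated output thread and a
# dedicated input thread so packets are communicated asynchronously to all
# other clusters. The Queue-based "network" and counts are assumed values.
import threading
import queue

NUM_CLUSTERS = 3
links = {(s, d): queue.Queue() for s in range(NUM_CLUSTERS)
         for d in range(NUM_CLUSTERS) if s != d}

def output_thread(src):
    """Asynchronously send one packet from `src` to every other cluster."""
    for dst in range(NUM_CLUSTERS):
        if dst != src:
            links[(src, dst)].put(f"packet {src}->{dst}")

def input_thread(dst, inbox):
    """Asynchronously collect one packet from every other cluster."""
    for src in range(NUM_CLUSTERS):
        if src != dst:
            inbox.append(links[(src, dst)].get())   # blocks until the packet arrives

inboxes = {c: [] for c in range(NUM_CLUSTERS)}
threads = [threading.Thread(target=output_thread, args=(c,)) for c in range(NUM_CLUSTERS)]
threads += [threading.Thread(target=input_thread, args=(c, inboxes[c])) for c in range(NUM_CLUSTERS)]
for t in threads: t.start()
for t in threads: t.join()
print({c: sorted(inboxes[c]) for c in range(NUM_CLUSTERS)})
```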
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Conclusion
US 20140140341 A1 teaches a system 10 comprising multiple groups of nodes in which each of the groups of nodes 12 is connected to all of the others (illustrated by the lines between groups of nodes).
US 20140044015 A1 teaches that, in an embodiment, each of the nodes in each node group are directly connected to each other in an all-to-all fashion. For example, intra-group links 210 in node group 102 directly connect each node to each other node in the group. Further, inter-group links directly connect each node in each node group to a node in each neighboring node group. For example, node 120 is connected directly to nodes 130 and 150 in neighboring node groups 104 and 108, by inter-group links 212 and 214, respectively.
US 6456838 B1 teaches that all-to-all communication is one type of collective communication. In all-to-all communication, every node in a group sends a message to each other node in the group. Depending on the nature of the message to be sent, all-to-all communication can be further classified as all-to-all broadcast and all-to-all personalized exchange. In all-to-all broadcast, every node sends the same message to all other nodes. In all-to-all personalized exchange, every node sends a distinct message to every other node. All-to-all broadcast and all-to-all personalized exchange may be used in networking and parallel computational applications. For example, all-to-all broadcasting may be used in performing matrix multiplication, LU-factorization, and Householder transformations. US 6567858 B1 teaches that in the communication pattern known as all-to-all personalized exchange, every processor in a processor group sends a distinct message to every other processor in the group. All-to-all personalized exchange occurs in many important parallel computing/networking applications, such as matrix transposition and fast Fourier transform (FFT).
US 20080089329 A1 teaches In order to perform parallel processing, an interconnection network in which a plurality of nodes each having a processor and a communication device are mutually linked is constructed. In the interconnection network, data processing proceeds while transmitting packets between the nodes. The interconnection network may be the all-to-all-connection (fully-connected) type, the tree type, the star type, the ring type, the mesh type, the torus type, the hypercube type, or the like.
AU 2967892 A teaches that, where links between processor elements are bi-directional, all-to-all communications are realized for use in a parallel computer having an n-dimensional torus network in an a1 x a2 x ... x an rectangular parallelepiped by sequentially phasing amax P 8 predetermined transmission/reception phases and by transmitting a message to the message-terminating processor element.
US 20170085439 A1 teaches In the following, a method will be proposed that prevents path contention in all-to-all communication between the group of partially cut-out nodes, by using a topology of a Latin square fat tree in a parallel distributed processing system, and partially cutting out a group of nodes N to submit a job. In the following, all-to-all communication between a group of partially cut-out nodes may be also referred to as “part-to-part communication”.
US 20180183857 A1 teaches All nodes coupled to a switch can send data concurrently in order to maximally utilize the all-to-all connections across switches so each node in a collection of nodes has the same data. This is followed by a data exchange across nodes in different groups in such a way that each group has the data of every other group. In order to avoid network congestion, each node sends data to (and receive data from) only those groups to which their switch is in direct communication.
US 20090003344 A1 teaches The system and method for providing an asynchronous broadcast call in a communicator where ordered delivery of data packets is maintained between compute nodes in a parallel computing system where packet header space is limited will now be described according to the preferred embodiments of the present invention
US 20210037107 A1 teaches causing the parallel processing unit in the endpoint to generate a data packet associated with the load/store instruction that is forwarded to the network device 110.
US 20090006808 A1 teaches The ASIC nodes 10 (FIG. 1) comprising the parallel computer system that are interconnected by multiple independent networks optimally maximize packet communications throughput the system with minimal latency. As mentioned herein, in one embodiment of the invention, the multiple networks include three high-speed networks for parallel algorithm message passing, including the Torus with direct memory access (DMA), collective network, and a Global Asynchronous network that provides global barrier and notification functions. wherein said direct memory access (DMA) element is operable for Direct Memory Access functions for point-to-point, multicast, and all-to-all communications amongst said nodes.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LECHI TRUONG whose telephone number is (571) 272-3767. The examiner can normally be reached 10 AM-8 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Young Kevin, can be reached at (571) 270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LECHI TRUONG/Primary Examiner, Art Unit 2194