Last updated: May 29, 2026
Application No. 18/307,212
OFFLOADING NETWORK COMMUNICATION OPERATION SYNCHRONIZATIONS TO ACCELERATOR STREAMS

Final Rejection §103
Filed: Apr 26, 2023
Examiner: SEYE, ABDOU K
Art Unit: 2198
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Hewlett Packard Enterprise Development LP
OA Round: 2 (Final)
Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has an 82% grant rate with a +27.5% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 583 resolved cases, 2023–2026
Examiner Intelligence

SEYE, ABDOU K
Career Allowance Rate: 82% (above average), 480 granted / 583 resolved, +27.3% vs TC avg
Interview Lift: +27.5% (strong), resolved cases with interview
Typical timeline: 3y 3m avg prosecution, 17 currently pending
Career history: 622 total applications across all art units
Statute-Specific Performance

§101: 5.6% (-34.4% vs TC avg)
§103: 89.7% (+49.7% vs TC avg)
§102: 1.4% (-38.6% vs TC avg)
§112: 2.0% (-38.0% vs TC avg)
Based on career data from 583 resolved cases
Office Action

§103
DETAILED ACTION

Statement of claims
The present amended application includes:
Claims 1-7, 9-11, 13-15, 17 and 19 were amended. Claims 21-22 were added.
Claims 1-22 remain pending in the application. Claims 1-22 are being considered on the merits.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 03/12/2026. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.


Response to Arguments
Rejections of Claims under 35 U.S.C. § 103

Applicant argues that:

“Doi and the Michael Lefleane article, as contended by the Office Action, this hypothetical combination fails to disclose or render obvious all of the elements of amended claim 17.” and “Claims 1 and 11 For at least the same reasons that are set forth above, amended independent claims 1 and 11 overcome the corresponding § 103 rejections”.
In response, the Examiner respectfully disagrees and submits that Applicant's arguments with respect to the newly added limitations have been considered but are moot, because the arguments do not apply to the newly cited reference, Akshay Venkatesh, “MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling”, which is used in the current rejection and is added only as directly corresponding evidence to support the prior common knowledge finding as stated above.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 7-22 are rejected under 35 U.S.C. 103 as being unpatentable over Doi (US 2018/0067894, Doi hereinafter) in view of Akshay Venkatesh, “MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling”, 2017 (Akshay hereinafter).

As to claim 17, Doi teaches a non-transitory machine-readable storage medium to store instructions that, when executed by a machine (e.g., see para [0132]: “each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device”), cause the machine to (e.g., see FIG. 1, para [0040]: “An exemplary overhead hiding processing system 100 to which the present description can be applied is shown in accordance with one embodiment. The overhead hiding processing system 100 includes at least a first memory 104, a first central processing unit (CPU) 106, a second memory 108, and a second CPU 110 operatively coupled to other components via a system bus 102. A network adapter 120, a first graphics processing unit (GPU) 130, a second GPU 140, and a network adapter 150, are operatively coupled to the system bus 102”):
enqueue a sequence (e.g., “first and second”) of operations in a queue (e.g., see FIG. 15, para [0085], wherein “A first kernel overhead (e.g., K5) and a second kernel overhead (e.g., K6) are loaded in a queue of the second thread 1520”, and “workloads need to be enqueued on the CUDA stream before the CUDA stream” in paras [0074] and [0078]) associated with an accelerator (e.g., “GPU”, FIG. 15; see “The method of claim 4, wherein the accelerator is a graphical processing unit (GPU).”), wherein the sequence of operations is to be processed by the accelerator (e.g., see FIG. 15, para [0084] The overhead hiding implementation 1500 depicts a first thread 1510 and a second thread 1520 of a host (CPU) 1502. The GPU 1530 is invoked by the host 1502 to perform one or more operations of the host 1502. The first thread 1510 can be used for executing procedures on the host 1502 and synchronizing data streams, such as, e.g., CUDA streams for GPUs. The second thread 1520 can be used for managing the GPUs. The second thread 1520 includes a plurality of kernels 1522 (e.g., K1, K2, K3, K4, K5, K6) and a plurality of memory components/elements 1524 (e.g., M1, M2). The GPU 1530 can include a main stream 1532. Data dependency is present between K2, M1, and M2. Additionally, data dependency is present between K3, K4, and M2.
[0085] All the GPU kernels are launched asynchronously and executed when all the input data is ready. A first kernel overhead (e.g., K5) and a second kernel overhead (e.g., K6) are loaded in a queue of the second thread 1520. A dummy kernel overhead is loaded between the first and second kernel overheads (K5, K6) in the queue of second thread 1520. A waiting process 1512 is loaded in the queue of the first thread 1510, the waiting process 1512 remaining active until a previous kernel of the first and second kernel overheads ends. Memory copy overheads (e.g., M1, M2) related to the previous kernel in the queue of the first thread 1510 are then allocated. Moreover, a stop process 1514 is allocated in the queue of the first thread 1510, the stop process 1514 configured to stop a dummy kernel 1540, the dummy kernel 1540 related to the dummy kernel overhead.
[0086] The first and second kernel overheads (K5, K6) can be put or copied or transferred from the second thread 1520 in a queue of the main stream 1532 of the GPU 1530 while a previous kernel is executed on the GPU 1530 so that overheads are hidden behind kernel execution or memory copy. A dummy kernel 1540 is launched from the second thread 1520 in the main stream 1532 when synchronization with the host 1502 is required. The dummy kernel 1540 waits with a spin loop until a counter is set to a proper value. After the procedure on the CPU host 1502 is executed on the first thread 1510, the first thread 1510 launches a special kernel 1514 to set a counter to a proper value) , and 
enqueuing the sequence of operations (e.g., see para [0068], “when the CUDA stream is used, there are overheads of preparation of data or parameters to enqueue memory copy or kernel execution on the CUDA stream, which blocks CPU core.”, and para [0086], “The first and second kernel overheads (K5, K6) can be put or copied or transferred from the second thread 1520 in a queue of the main stream 1532 of the GPU 1530”; see FIG. 15) comprises enqueuing a compute kernel to cause the accelerator to, responsive to completion of processing of the compute kernel, write a value to a memory (e.g., para [0074], “workloads need to be enqueued on the CUDA stream before the CUDA stream is in a “not empty” state.”;
para [0086], wherein “The first and second kernel overheads (K5, K6) can be put or copied or transferred from the second thread 1520 in a queue of the main stream 1532 of the GPU 1530 while a previous kernel is executed on the GPU 1530 so that overheads are hidden behind kernel execution or memory copy. A dummy kernel 1540 is launched from the second thread 1520 in the main stream 1532 when synchronization with the host 1502 is required. The dummy kernel 1540 waits with a spin loop until a counter is set to a proper value. After the procedure on the CPU host 1502 is executed on the first thread 1510, the first thread 1510 launches a special kernel 1514 to set a counter to a proper value.” Thus, setting “a counter to a proper value” for “memory copy” includes writing a value to a memory) associated with a network interface (e.g., para [0132]: “A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.”); and
to cause the network interface to, responsive to the value being written to the memory (e.g., “memory copy”, FIG. 9), initiate a network communication (e.g., “data is exchanged between GPUs on different nodes by using, e.g., MPI communication. The kernel for the Wilson-Dirac calculation of the boundary lattice sites is blocked with the spin loop kernel until the boundary data is received on the host (CPU) and stored in the memory on the GPU.”
Thus, the “MPI communication” represents a network communication).
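
For illustration only, the spin-loop synchronization quoted from Doi paras [0085]-[0086] can be modeled in a few lines of Python. This is a minimal sketch and not Doi's implementation: the names `run_stream`, `demo`, and `counter` are hypothetical, a Python thread stands in for the GPU main stream, and a short sleep stands in for the host-side procedure.

```python
import threading
import time


def run_stream(ops, log):
    """Run queued operations in FIFO order, as one in-order stream would."""
    for name, fn in ops:
        fn()
        log.append(name)


def demo():
    counter = {"value": 0}   # hypothetical flag shared by host and "GPU"
    log = []

    def dummy_kernel():
        # Spin until the host sets the counter to the expected value.
        while counter["value"] != 1:
            time.sleep(0.001)

    # K5 and K6 stand in for kernels queued around the dummy kernel.
    ops = [("K5", lambda: None), ("dummy", dummy_kernel), ("K6", lambda: None)]
    gpu = threading.Thread(target=run_stream, args=(ops, log))
    gpu.start()

    time.sleep(0.05)         # assumed host-side work, e.g. receiving data
    log.append("host_done")
    counter["value"] = 1     # releases the dummy kernel
    gpu.join()
    return log
```

In this model the work queued behind the dummy kernel (K6) cannot start until the host sets the counter, which is the host-to-GPU synchronization the quoted passage describes.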

However, Doi does not teach enqueueing a trigger event, enqueue a first entry in a deferred work queue associated with the network interface, and enqueue a second entry other than the first entry in the deferred work queue to cause the network interface to, responsive to a completion of the network communication, provide a signal to the accelerator indicating the completion.
Akshay teaches enqueueing a trigger event, enqueue a first entry in a deferred work queue associated with the network interface (e.g., see page 155, “common receive CQ was used for all peers, then a send from any MPI rank can unblock any arbitrary stream blocked on the shared receive CQ”), and enqueue a second entry other than the first entry in the deferred work queue to cause the network interface to, responsive to a completion of the network communication (e.g., see pages 154 and 155, “Issuing operations on the stream acquire a notion of enqueuing work and progressing/monitoring stream operations and its completion at a later point referred as enqueue phase calls and progress phase calls, respectively”, “Connection Establishment: As MPI-GDS MPI_Recv operations are required to potentially block CUDA stream until a matching send arrives, it is necessary to make use of separate receive completion queues (CQ) in order to make use of GDS calls such as gds_stream_wait_cq”), provide a signal to the accelerator indicating the completion (e.g., see page 155, FIG. 3, “The use of StWtCQ ensures that subsequent operations enqueued in the stream and do not touch send buffer until send has completed and signaled by the generation of the completion event associated with it.” for “on large-scale GPU accelerators” on page 151).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Doi with those of Akshay because both references are directed to related systems addressing similar technical problems within the same field and seek to improve system performance, reliability, and efficiency.

Doi et al. disclose enqueuing a sequence of operations in a queue associated with an accelerator of the machine, wherein the sequence of operations is to be processed by the accelerator, while Akshay et al. teach enqueuing a second entry other than the first entry in the deferred work queue to cause the network interface to, responsive to a completion of the network communication, provide a signal to the accelerator indicating the completion.

Incorporating the teachings of Akshay et al. into the system of Doi et al. would have been a predictable and logical modification, yielding improved operational robustness and efficiency without requiring undue experimentation.

Such a combination would merely involve the substitution or integration of known elements performing their established functions, as taught by Akshay et al., into the system of Doi et al., consistent with design incentives and market demands for improved performance and scalability. Moreover, Akshay et al. explicitly recognize the benefit of being able to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract), which would naturally be desirable in the system of Doi et al.
Accordingly, one of ordinary skill in the art would have had a reasonable expectation of success in combining Doi et al. with Akshay et al., and the combination represents no more than the predictable use of prior art elements according to their known functions.
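
As a reading aid, the claim 17 sequence discussed above (a trigger written by the accelerator, a first deferred entry that starts the network communication, and a second deferred entry that signals completion back to the accelerator) can be sketched as a single-threaded Python model. Everything here is an assumption made for illustration: the `Nic` class, its `trigger` and `completions` counters, and `enqueue_deferred` are hypothetical names and do not correspond to any API in the application, Doi, or Akshay.

```python
class Nic:
    """Toy network interface with a deferred work queue (hypothetical)."""

    def __init__(self):
        self.trigger = 0        # memory the accelerator writes to
        self.completions = 0    # completions counted by the NIC
        self.deferred = []      # entries: [counter_name, threshold, action, fired]

    def enqueue_deferred(self, counter_name, threshold, action):
        self.deferred.append([counter_name, threshold, action, False])

    def poll(self):
        """Fire every deferred entry whose counter has reached its threshold."""
        progressed = True
        while progressed:
            progressed = False
            for entry in self.deferred:
                name, threshold, action, fired = entry
                if not fired and getattr(self, name) >= threshold:
                    entry[3] = True
                    action()
                    progressed = True


def demo():
    nic = Nic()
    gpu_flag = {"value": 0}     # memory visible to the accelerator
    events = []

    def send():                 # first entry: the network communication
        events.append("send")
        nic.completions += 1

    def signal():               # second entry: completion signal to the GPU
        gpu_flag["value"] = 1
        events.append("signal")

    nic.enqueue_deferred("trigger", 1, send)
    nic.enqueue_deferred("completions", 1, signal)
    nic.poll()                  # nothing fires before the trigger is written

    events.append("kernel")     # compute kernel completes
    nic.trigger = 1             # stream-ordered write of the trigger value
    nic.poll()
    if gpu_flag["value"] == 1:  # a stream wait on the flag would now pass
        events.append("next_kernel")
    return events
```

The host only enqueues work in this sketch; the ordering between kernel, send, and completion signal is enforced by the counters rather than by host-side blocking.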

As to claim 18, Doi teaches wherein the network communication comprises a Message Passing Interface (MPI) operation associated with processing of the kernel (e.g., page 11, “GPUHost Networking: Some projects attempt to support GPU networking through helper threads on the host CPU. FLAT [26] allows for the automatic generation of CPU MPI codes from GPU kernels using custom compiler extensions. Distributed Computing for GPU Networks (DCGN) [36] exposes an MPI-like interface for GPU kernels to pass messages to GPUs on remote nodes.”).

As to claim 19, Doi does not teach wherein the instructions, when executed by the machine, further cause the processor to enqueue the trigger event and enqueue the entry in the network interface using Message Passing Interface (MPI) communications. However, Akshay teaches wherein the instructions, when executed by the machine, further cause the machine to enqueue the trigger event and enqueue first and second entries in the network interface (see rejection of claim 17 above).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


As to claim 20, Doi does not teach wherein MPI communications comprise MPI remote memory access (RMA) communications. However, Akshay teaches wherein MPI communications comprise MPI remote memory access (RMA) communications (e.g., see page 153, “MPI-GDS can be extended to include collective and RMA operations”).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 21, Doi does not teach wherein the network communication interface is associated with an input/output (I/O) space, wherein the I/O space is inaccessible to the accelerator, and the method further comprises the network communication interface storing, in the I/O space, a completion count for operations performed by the network communication interface. However, Akshay teaches wherein the network communication interface is associated with an input/output (I/O) space, wherein the I/O space is inaccessible to the accelerator, and the method further comprises the network communication interface storing, in the I/O space, a completion count for operations performed by the network communication interface (e.g., page 153, “generates a Completion Queue Entry (CQE) to provide the status of a WQE.” for “cuStreamWaitValue32 blocks a given stream until the value at a specific memory changes to a specified value while cuStreamWriteValue32 writes a specific value to a specific memory address in CUDA stream order.” on page 154).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


As to claim 22, Doi does not teach wherein the enqueueing the second operation comprising enqueueing, by the host processor, a write operation to cause the network communication interface to write, to a memory space accessible to the accelerator, a value corresponding to the signal. However, Akshay teaches wherein the enqueueing the second operation comprising enqueueing, by the host processor, a write operation to cause the network communication interface to write, to a memory space accessible to the accelerator, a value corresponding to the signal (e.g., see pages 154 and 155, “cuStreamWaitValue32 blocks a given stream until the value at a specific memory changes to a specified value while cuStreamWriteValue32 writes a specific value to a specific memory address in CUDA stream order.” and “1) Connection Establishment: As MPI-GDS MPI_Recv operations are required to potentially block CUDA stream until a matching send arrives, it is necessary to make use of separate receive completion queues (CQ) in order to make use of GDS calls such as gds_stream_wait_cq”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).
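
The stream-ordered wait and write semantics quoted from Akshay (page 154) for cuStreamWaitValue32 and cuStreamWriteValue32 can be illustrated with a small Python model. This is a sketch under stated assumptions and does not call the CUDA driver API: `StreamModel` and its methods are hypothetical, memory is a plain dictionary, and the wait is modeled as a simple equality test.

```python
from collections import deque


class StreamModel:
    """Toy in-order stream; a model only, not the CUDA driver API."""

    def __init__(self, memory):
        self.memory = memory    # dict standing in for device-visible memory
        self.ops = deque()      # pending operations, oldest first
        self.trace = []         # operations that have completed

    def launch(self, name):
        self.ops.append(("kernel", name, None))

    def write_value(self, addr, value):
        self.ops.append(("write", addr, value))

    def wait_value(self, addr, value):
        self.ops.append(("wait", addr, value))

    def progress(self):
        """Run operations in order; stop at the first unsatisfied wait."""
        while self.ops:
            kind, a, b = self.ops[0]
            if kind == "wait" and self.memory.get(a) != b:
                return False    # stream is blocked on this wait
            self.ops.popleft()
            if kind == "write":
                self.memory[a] = b
            self.trace.append((kind, a))
        return True


def demo():
    memory = {}
    s = StreamModel(memory)
    s.launch("pack")
    s.write_value("doorbell", 1)    # value written in stream order
    s.wait_value("recv_flag", 1)    # blocks until another agent writes 1
    s.launch("unpack")
    blocked = not s.progress()
    trace_when_blocked = list(s.trace)
    memory["recv_flag"] = 1         # e.g. written by the network interface
    finished = s.progress()
    return blocked, trace_when_blocked, finished, s.trace
```

In the model, the "unpack" kernel runs only after some other agent writes the awaited value into memory the stream can observe, which is the behavior the claim 22 mapping relies on.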


As to claim 1, see rejection of claim 17 above. Doi further teaches enqueueing, by a host processor of a compute node, a stream of first operations to be executed by an accelerator of the compute node, wherein the stream is associated with a compute kernel boundary, synchronizing a network operation to the compute kernel boundary (e.g., see FIG. 15, para [0084]: “The GPU 1530 is invoked by the host 1502 to perform one or more operations of the host 1502. The first thread 1510 can be used for executing procedures on the host 1502 and synchronizing data streams, such as, e.g., CUDA streams for GPUs. The second thread 1520 can be used for managing the GPUs. The second thread 1520 includes a plurality of kernels 1522 (e.g., K1, K2, K3, K4, K5, K6) and a plurality of memory components/elements 1524 (e.g., M1, M2). The GPU 1530 can include a main stream 1532. Data dependency is present between K2, M1, and M2. Additionally, data dependency is present between K3, K4, and M2”) and offloading, by the host processor, wherein the offloading comprises: enqueueing, by the host processor of the compute node, a first network communication operation (e.g., para [0085]: “All the GPU kernels are launched asynchronously and executed when all the input data is ready. A first kernel overhead (e.g., K5) and a second kernel overhead (e.g., K6) are loaded in a queue of the second thread 1520. A dummy kernel overhead is loaded between the first and second kernel overheads (K5, K6) in the queue of second thread 1520. A waiting process 1512 is loaded in the queue of the first thread 1510, the waiting process 1512 remaining active until a previous kernel of the first and second kernel overheads ends. Memory copy overheads (e.g., M1, M2) related to the previous kernel in the queue of the first thread 1510 are then allocated. Moreover, a stop process 1514 is allocated in the queue of the first thread 1510, the stop process 1514 configured to stop a dummy kernel 1540, the dummy kernel 1540 related to the dummy kernel overhead.”).
However, Doi does not explicitly teach offloading to the accelerator, enqueueing, by the host processor and to a deferred work queue with a network communication interface, the synchronizing to the accelerator, enqueueing, by the host processor and to the deferred work queue, a second operation other than the first network communication operation to cause the network communication interface to provide, to the accelerator, a signal to represent a completion of the first network communication operation; the first network communication operation to be performed by the network communication interface, adding, by the host processor and to the stream, a third operation to synchronize the first network communication operation with the compute kernel boundary.

Akshay teaches offloading to the accelerator, enqueueing, by the host processor and to a deferred work queue with a network communication interface (see rejection of claim 17 above), the synchronizing to the accelerator (e.g., see page 152, “Proposes a new MPI runtime that leverages GPUDirect-aSync to decouple the CPU-GPU control flow and offload the computation, communication and synchronization tasks to GPU”, Fig. 2, Existing CPU Control Flow and the envisioned GPU Control Flow), enqueueing, by the host processor and to the deferred work queue, a second operation other than the first network communication operation to cause the network communication interface to provide, to the accelerator, a signal to represent a completion of the first network communication operation (see rejection of claim 17 above); the first network communication operation to be performed by the network communication interface, adding, by the host processor and to the stream, a third operation to synchronize the first network communication operation with the compute kernel boundary (see page 152, “Proposes a new MPI runtime that leverages GPUDirect-aSync to decouple the CPU-GPU control flow and offload the computation, communication and synchronization tasks to GPU”, Fig. 2, Existing CPU Control Flow and the envisioned GPU Control Flow).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


As to claim 2, Doi does not teach wherein the compute kernel boundary corresponds to a completion of execution of a compute kernel by the accelerator; the network communication interface initiates the network communication operation responsive to a trigger; and adding the second operation comprises adding, by the host processor, one of a trigger kernel or a write value stream operation to the stream to cause the accelerator to provide the trigger. However, Akshay teaches wherein the compute kernel boundary corresponds to a completion of execution of a compute kernel by the accelerator (e.g., see rejection of claim 1 above); the network communication interface initiates the network communication operation responsive to a trigger; and adding the third operation comprises adding (e.g., page 153, “III. OFFLOADING MPI OPERATIONS TO CUDA STREAMS” and “To offload MPI-GDS operations to the GPU new calls introduced by GPUDirect-aSync (GDS) need to be leveraged. In this section, we present a brief description of GDS and related CUDA calls and the designs of MPI Point-to-point (Pt2Pt) communication protocols to illustrate how we decouple the CPU-GPU control flow of communication by exploiting GDS.” on page 154), by the host processor, one of a trigger kernel or a write value stream operation to the stream to cause the accelerator to provide the trigger (e.g., page 159, “an MPI-GDS send operation posted after a kernel launch on the same stream returns immediately from CPU’s perspective but actual data movement will be triggered by the GPU after the kernel completes” and “This realizes receive completion and frees the stream to move on to subsequent enqueued operations”, “rdma-write-with-immediate can be used instead (immediate being required to generate a completion event on which the receiver is waiting in the StWtCQ call).” on page 156).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).
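
The behavior quoted from Akshay page 159, where a send posted after a kernel launch “returns immediately from CPU’s perspective” while the data movement waits for the kernel, can be pictured with the following Python sketch. It is an illustrative model only: `demo`, the worker thread, and the event names are hypothetical, and a `threading.Event` stands in for a long-running kernel so that the ordering is deterministic.

```python
import queue
import threading


def demo():
    events = []
    stream = queue.Queue()                  # stand-in for one in-order stream
    kernel_may_finish = threading.Event()

    def worker():
        # Execute queued operations strictly in the order they were posted.
        while True:
            op = stream.get()
            if op is None:
                break
            op()

    def kernel():
        kernel_may_finish.wait()            # models a long-running kernel
        events.append("kernel_done")

    def send():
        events.append("data_moved")         # runs only after the kernel

    gpu = threading.Thread(target=worker)
    gpu.start()

    stream.put(kernel)                      # host launches the kernel
    stream.put(send)                        # host posts the stream-ordered send
    events.append("host_returned")          # host continues without blocking

    kernel_may_finish.set()
    stream.put(None)                        # shut the worker down
    gpu.join()
    return events
```

The host records its return before either stream operation has completed, while the send still observes the kernel boundary because both sit in the same in-order queue.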


As to claim 7, Doi does not teach wherein enqueueing the second operation comprises enqueuing, to the deferred work queue, an operation to cause the network communication interface to write a value to an accelerator-attached memory to provide the signal. However, Akshay teaches wherein enqueueing the second operation comprises enqueuing, to the deferred work queue, an operation to cause the network communication interface to write a value to an accelerator-attached memory to provide the signal (e.g., page 155, “StWtCQ operation, CuStWtVal can be issued which serves the same purpose which allows for variations during progress phase of the receive operation.”, “enqueue phase calls can omit CuMemAsync and directly register the user buffer with the HCA before issuing StQSnd operation and StWtCQ followed by a CuEvntRcrd(AppSt, SndCmpEvnt). The use of StWtCQ ensures that subsequent operations enqueued in the stream and do not touch send buffer until send has completed and signaled by the generation of the completion event associated with it”).


As to claim 8, Doi does not teach wherein the first network communication operation comprises an operation to communicate data corresponding to a result of the processing of a compute kernel by the accelerator. However, Akshay teaches wherein the first network communication operation comprises an operation to communicate data corresponding to a result of the processing of a compute kernel by the accelerator (e.g., page 155, “result in receiving sent data to the user specified receive buffer until the stream has arrived at a point where all previously issued operations on the CUDA stream have completed (safe point).”, “StWtCQ operation, CuStWtVal can be issued which serves the same purpose which allows for variations during progress phase of the receive operation.” and “Fig. 9. Comparison of GDS-based Send/Receive Operations with GPU kernels with Traditional MPI+CUDA Application Kernel” on page 159).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 9, Doi teaches wherein enqueueing the first network communication operation and adding the second operation comprise generating, by the host processor, Message Passing Interface (MPI) application programming interface (API) calls (e.g., para [0072], “Therefore, most of the function cudaStreamSynchronize can be removed and then CUDA overheads appearing at the beginning of the procedures can be overlapped behind previous kernel execution, except for the case when synchronization between the GPU and the host (CPU) is necessary, for example, collective operations between GPUs or MPI communications. Multiple streams can be synchronized on the same GPU or other GPUs without stopping the CUDA stream by using CUDA event APIs.”).

As to claim 10, Doi does not teach wherein enqueueing the first network communication operation and adding the second operation comprise using, by the host processor, Message Passing Interface (MPI) active remote memory interface (RMA)-based communication. However, Akshay teaches wherein enqueueing the first network communication operation and adding the second operation comprise using, by the host processor, Message Passing Interface (MPI) active remote memory interface (RMA)-based communication (e.g., see page 153, “conceptually MPI-GDS can be extended to include collective and RMA operations.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 11, see rejection of claim 1 above. Doi further teaches an apparatus comprising: a network communication interface; a graphics processing unit (GPU); and a processor other than the GPU to: enqueue a first network communication operation (see rejection of claim 1, FIG. 1). Akshay further teaches enqueue a stream of operations to be executed by the GPU, wherein the stream of operations comprises a compute kernel associated with a compute kernel boundary and a synchronization event to be executed by the GPU to synchronize the first network communication operation with the compute kernel boundary (e.g., see page 152, “MPI runtime that leverages GPUDirect-aSync to decouple the CPU-GPU control flow and offload the computation, communication and synchronization tasks to GPU • Introduces, MPI-GDS, the notion of offloading communication calls to the GPU while maintaining CUDA stream ordering”, FIG. 2, b) Envisioned GPU Control Flow).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


As to claim 12, Doi further teaches wherein: the GPU comprises a plurality of cores and a GPU processor (e.g., para [0034]: “Today's supercomputer trends are becoming more general, focusing on common architectures or architectures similar to consumer products. Green computing, with its emphasis on decreasing the power consumption of large-scale supercomputer systems, is another big trend. One major trend in exascale supercomputing is a hybrid architecture featuring accelerators outside of the host CPU cores (e.g., many core chips, FPGAs, or GPUs), and many hybrid systems are ranked on the Top 500 and Green 500 lists. In one or more embodiments, hybrid GPU systems are described by examining the optimization of lattice QCD simulations on, e.g., NVIDIA's Kepler architecture GPUs.”). However, Doi does not teach the compute kernel boundary corresponds to a completion of execution of a compute kernel by at least one core of the plurality of cores, the network communication interface to initiate the first network communication operation in response to a trigger; and the control GPU processor executes the synchronization event to provide the trigger. Akshay teaches the compute kernel boundary corresponds to a completion of execution of a compute kernel by at least one core of the plurality of cores, the network communication interface to initiate the first network communication operation in response to a trigger; and the control GPU processor executes the synchronization event to provide the trigger (e.g., see pages 151 and 152, “synchronization between CUDA (e.g., cudaStreamSynchronize and cudaDeviceSynchronize) and MPI (e.g., MPI_Wait, MPI_Waitall, etc.) are required to satisfy potential dependencies between computation and communication phases in the applications.”, “MPI runtime that leverages GPUDirect-aSync to decouple the CPU-GPU control flow and offload the computation, communication and synchronization tasks to GPU • Introduces, MPI-GDS, the notion of offloading communication calls to the GPU while maintaining CUDA stream ordering”, Fig. 2, “the envisioned GPU Control Flow”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 13, Doi further teaches wherein: the GPU comprises a plurality of cores and a GPU control processor; the compute kernel boundary corresponds to an initiation of execution of a compute kernel by at least one core of the plurality of cores (e.g., para [0034]: Today's supercomputer trends are becoming more general, focusing on common architectures or architectures similar to consumer products. Green computing, with its emphasis on decreasing the power consumption of large-scale supercomputer systems, is another big trend. One major trend in exascale supercomputing is a hybrid architecture featuring accelerators outside of the host CPU cores (e.g., many core chips, FPGAs, or GPUs), and many hybrid systems are ranked on the Top 500 and Green 500 lists. In one or more embodiments, hybrid GPU systems are described by examining the optimization of lattice QCD simulations on, e.g., NVIDIA's Kepler architecture GPUs.). However, Doi does not teach the GPU control processor executes the synchronization event to cause the GPU control processor to wait for the signal before the GPU control processor initiates execution of the compute kernel. Akshay teaches the GPU control processor executes the synchronization event to cause the GPU control processor to wait for the signal before the GPU control processor initiates execution of the compute kernel (e.g., see page 152, “MPI runtime that leverages GPUDirect-aSync to decouple the CPU-GPU control flow and offload the computation, communication and synchronization tasks to GPU”, Fig. 2, envisioned GPU Control Flow, and page 158, “MPI_Send and MPI_Recv operations with CUDA kernel and stream synchronize calls before the MPI_Send and after the MPI_Recv, respectively. These benefits are mainly due to hiding the kernel launch overhead and keeping the GPU busy for the entire application run. This highlights the effectiveness of GPU handling communication operations that satisfy stream dependencies.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 14, Doi does not teach wherein: the network communication interface to, responsive to execution of the second operation, write a value to a the GPU control processor polls the GPU-attached memory to detect the signal update; and the GPU control processor initiates execution of the compute kernel responsive to the polling detecting the signal. However, Akshay teaches the network communication interface to, responsive to execution of the second operation, write a value to a the GPU control processor polls the GPU-attached memory to detect the signal update; and the GPU control processor initiates execution of the compute kernel responsive to the polling detecting the signal (see Fig. 2, “(b) Envisioned GPU Control Flow”; page 153, “the completion of a communication operation is detected by polling on a Completion Queue (CQ). On completing an operation, the network adapter generates a Completion Queue Entry (CQE) to provide the status of a WQE.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).
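For illustration only, the completion-queue polling relied on in the Akshay citation for claim 14 can be sketched as a toy Python model. This sketch is not taken from Akshay, Doi, or the application; `CompletionQueue`, `post_completion`, `poll`, and `wait_then_launch` are hypothetical names, and the bounded poll count is an assumption made so the example terminates.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Cqe = Tuple[int, str]  # (work-request id, status); a hypothetical CQE layout


class CompletionQueue:
    """Toy stand-in for a completion queue written by a network adapter."""

    def __init__(self) -> None:
        self.entries: Deque[Cqe] = deque()

    def post_completion(self, wqe_id: int, status: str = "ok") -> None:
        # Adapter side: generate a completion entry for a finished operation.
        self.entries.append((wqe_id, status))

    def poll(self) -> Optional[Cqe]:
        # Consumer side: return the oldest entry, or None if the queue is empty.
        return self.entries.popleft() if self.entries else None


def wait_then_launch(cq: CompletionQueue, wqe_id: int,
                     kernel: Callable[[], None], max_polls: int) -> bool:
    """Poll up to max_polls times; run the kernel once the completion is seen.

    Non-matching entries are simply discarded in this simplified model.
    """
    for _ in range(max_polls):
        cqe = cq.poll()
        if cqe is not None and cqe[0] == wqe_id:
            kernel()
            return True
    return False


log = []
cq = CompletionQueue()
launched_early = wait_then_launch(cq, 1, lambda: log.append("kernel"), 3)
cq.post_completion(1)
launched = wait_then_launch(cq, 1, lambda: log.append("kernel"), 3)
```

In this model the kernel does not run until the adapter has posted the matching completion entry, which is the ordering dependency the cited passage describes.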

As to claim 15, Doi does not teach a plurality of compute nodes, wherein the processor comprises a given compute node of the plurality of compute nodes, and wherein each compute node of the plurality of compute nodes is associated with a different operating system instance of a plurality of operating system instances. However, Akshay teaches a plurality of compute nodes, wherein the processor comprises a given compute node of the plurality of compute nodes, and wherein each compute node of the plurality of compute nodes is associated with a different operating system instance of a plurality of operating system instances (e.g., see page 153, “network adapters to directly access GPU device memory when moving data between GPUs on different nodes of a cluster. This removes additional copies through host memory and also removes the CPU dependency in data movement”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


As to claim 16, Doi does not teach wherein: the processor is associated with a first process; the network communication interface comprises a network interface controller; and the first network communication operation comprises at least one of a send operation to communicate, to a second process, data associated with the first process and representing compute kernel output data, or a receive operation to communicate, to the first process, data associated with the second process and representing compute kernel input data. However, Akshay teaches wherein: the processor is associated with a first process; the network communication interface comprises a network interface controller; and the first network communication operation comprises at least one of a send operation to communicate, to a second process, data associated with the first process and representing compute kernel output data, or a receive operation to communicate, to the first process, data associated with the second process and representing compute kernel input data (e.g., see page 154, “sequence such as Kernel A, MPI_Send, Kernel B will simply issue basic operations required starting the sequence but progress and completion necessities of protocols required for realizing this are embedded in MPIX Wait stream completion”, Fig. 3, Eager Protocol with GDS feature for Point-to-Point communication).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).


Claims 3-6 are rejected under 35 U.S.C. 103 as being unpatentable over Doi (US 2018/0067894, hereinafter Doi) in view of Akshay Venkatesh, “MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling”, 2017 (hereinafter Akshay), as applied to claim 1 above, and further in view of Michael LeBeane et al., “GPU Triggered Networking for Intra-Kernel Communications”, 2017 (hereinafter Michael).

As to claim 3, Doi does not teach wherein: enqueueing the first network communication operation comprises enqueueing, by the host processor, an entry to the deferred work queue; the entry contains a command corresponding to the first network communication operation; and the entry identifies a trigger counter of the network communication interface and a threshold value for a count value of the trigger counter; and the method further comprises the accelerator providing the trigger, wherein the accelerator providing the trigger comprises the accelerator changing the count value to be the same as or greater than the threshold value. However, Akshay further teaches enqueueing the first network communication operation comprises enqueueing, by the host processor, an entry to the deferred work queue; the entry contains a command corresponding to the first network communication operation (e.g., see rejection of claim 1 above).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

Michael teaches the entry identifies a trigger counter of the network communication interface and a threshold value for a count value of the trigger counter; and the method further comprises the accelerator providing the trigger, wherein the accelerator providing the trigger comprises the accelerator changing the count value to be the same as or greater than the threshold value (e.g., see page 4, 3.1 Overview, “Figure 4 shows the steps involved in performing a GPU-TN enhanced networking operation on the initiator. The CPU first creates the network operation, allocates memory for the message buffer, and sends the command to the NIC (1). The CPU is responsible for creating the network operation using the triggered operations API (see Section 4) and registering it with the NIC. The network runtime library allocates a trigger entry to represent the state of a triggered operation on the NIC and appends this entry to a list of all registered entries called the trigger list. A trigger entry is composed of the following fields:” and “Once a trigger entry has been allocated and is visible to the NIC, the GPU kernel is launched and is provided one or more tags, along with a memory-mapped address with which to activate trigger operations. We will refer to this address as the trigger address. During kernel execution, the GPU will populate the send buffer with data to send to another node (2). After the send buffer is populated, the GPU notifies the NIC that the triggered put operation is ready by performing a posted write operation to the memory-mapped trigger address, supplying the tag of the message that it wants to initiate (3). This write is routed to the NIC and placed in a FIFO associated with the trigger address. The NIC pops entries from the FIFO and searches the trigger list for a tag match on a trigger entry. When a match is found, the NIC increments the counter value associated with the matching trigger entry. When the counter value becomes greater than or equal to the CPU-provided threshold, the NIC performs the associated network operation (4).”).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Doi and Akshay by adopting the teachings of Michael in order “to support efficient networking from the GPU by offloading the serial communications runtime and network packet creation to the CPU, while still allowing the GPU to initiate the network operation directly by performing a simple memory-mapped write operation of a tag to a particular address”, which “enables efficient networking from within a kernel. This avoids the high hardware scheduler cost present in kernel-boundary networking solutions and enables more fine-grained messaging capabilities” (see Michael, page 4).
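For illustration only, the tag-matching and counter-threshold behavior quoted from Michael above can be sketched as a toy Python model. This sketch is not code from Michael or from the application; `TriggerEntry`, `Nic`, `register`, `gpu_write_tag`, and `progress` are hypothetical names, and the entry carries only the fields the quoted passage mentions (tag, counter, threshold, associated operation).

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class TriggerEntry:
    """Hypothetical trigger entry: tag, CPU-provided threshold, operation."""
    tag: int
    threshold: int
    operation: Callable[[], None]
    counter: int = 0
    fired: bool = False


class Nic:
    """Toy model of the NIC-side trigger list and trigger-address FIFO."""

    def __init__(self) -> None:
        self.trigger_list: List[TriggerEntry] = []
        self.fifo: Deque[int] = deque()

    def register(self, entry: TriggerEntry) -> None:
        # CPU side: append the entry to the list of registered entries.
        self.trigger_list.append(entry)

    def gpu_write_tag(self, tag: int) -> None:
        # GPU side: a posted write of a tag lands in the FIFO.
        self.fifo.append(tag)

    def progress(self) -> None:
        # NIC side: pop tags, match an entry, increment its counter, and
        # perform the operation once the counter reaches the threshold.
        while self.fifo:
            tag = self.fifo.popleft()
            for entry in self.trigger_list:
                if entry.tag == tag:
                    entry.counter += 1
                    if entry.counter >= entry.threshold and not entry.fired:
                        entry.fired = True
                        entry.operation()
                    break


sent: List[str] = []
nic = Nic()
nic.register(TriggerEntry(tag=7, threshold=2, operation=lambda: sent.append("put")))
nic.gpu_write_tag(7)
nic.progress()
after_first_write = list(sent)  # counter is 1, below the threshold of 2
nic.gpu_write_tag(7)
nic.progress()
```

With a threshold of 2, the first tag write only increments the counter; the operation is performed on the second write, when the counter becomes equal to the threshold.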

As to claim 4, Doi does not teach wherein: the compute kernel boundary corresponds to an initiation of execution of a compute kernel by the accelerator; enqueueing the first network communication operation comprises enqueueing, by the host processor, a first entry to a deferred work queue of the network communication interface; the first entry identifies a completion counter of the network communication interface to provide a count value to indicate completion of the first network communication operation; and adding the second operation comprises adding, by the host processor, a wait kernel to the stream to cause the accelerator to initiate execution of the compute kernel responsive to the completion counter providing the count value. However, Akshay further teaches the compute kernel boundary corresponds to an initiation of execution of a compute kernel by the accelerator; enqueueing the first network communication operation comprises enqueueing, by the host processor, the first entry to a deferred work queue (see rejection of claim 1 above); the first entry identifies a completion counter of the network communication interface; adding the second operation comprises adding, by the host processor, a wait kernel to the stream to cause the accelerator to initiate execution of the compute kernel responsive to the completion (e.g., see page 155, “MPI-GDS MPI_Recv operations are required to potentially block CUDA stream until a matching send arrives, it is necessary to make use of separate receive completion queues (CQ) in order to make use of GDS calls such as gds_stream_wait_cq.”, “The use of StWtCQ ensures that subsequent operations enqueued in the stream and do not touch send buffer until send has completed and signaled by the generation of the completion event associated with it. Progress phase calls can include querying if SndCmpEvnt has completed and polling the associated completion queue to remove the completion entry generated by the HCA.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

Michael teaches to provide a count value to indicate completion of the first network communication operation (e.g., see page 4, “Figure 4: Overview of a GPU triggered operation in GPU-TN. The CPU initializes the network operation, which is triggered by the GPU from within a kernel when the message is ready to be sent.” and “Figure 5: Tag matching behavior of trigger entries. GPU provided tags are matched to a CPU registered trigger entry. The network operation is ready when the counter reaches the threshold.”); and the counter providing the count value (e.g., see “Figure 3: Overview of the control flow of different networking strategies on the GPU. GPU Triggered Networking (GPU-TN) utilizes the CPU to construct a command packet for the GPU to initiate, bypassing an expensive control flow switch on the critical path. GPU-TN supports flexible, intra-kernel networking using triggered operation semantics on the NIC. Note that the time spent is not drawn to scale.” and Figure 5, quoted above).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Doi and Akshay by adopting the teachings of Michael in order “to support efficient networking from the GPU by offloading the serial communications runtime and network packet creation to the CPU, while still allowing the GPU to initiate the network operation directly by performing a simple memory-mapped write operation of a tag to a particular address”, which “enables efficient networking from within a kernel. This avoids the high hardware scheduler cost present in kernel-boundary networking solutions and enables more fine-grained messaging capabilities” (see Michael, page 4).



As to claim 5, Doi does not teach wherein offloading the synchronizing to the accelerator further comprises: chaining the third operation to the first network communication operation. However, Akshay teaches wherein offloading the synchronizing to the accelerator further comprises: chaining the third operation to the first network communication operation (see page 158, “MPI_Send and MPI_Recv operations with CUDA kernel and stream synchronize calls before the MPI_Send and after the MPI_Recv, respectively. These benefits are mainly due to hiding the kernel launch overhead and keeping the GPU busy for the entire application run.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

As to claim 6, Doi does not teach wherein chaining the third operation to the first network communication operation comprises: enqueueing, by the host processor, a second entry to the deferred work queue, wherein the second entry identifies the completion counter as being a trigger counter to initiate the third operation. However, Akshay teaches chaining the third operation to the first network communication operation comprises: enqueueing, by the host processor, a second entry to the deferred work queue, wherein the second entry identifies the completion (see page 155, “a LoopBack-based design (LB). In LB, receiver-side enqueue phase calls includes CuEvntRcrd(AppSt, RcvRdyEvnt) call followed by ibv_post_recv from self and StWtCQ on receive completion queue associated with source rank. This way the specific receive operation can be satisfied when progress conditions are met as discussed later. Please note that, during initialization of a stream-based communicator, loopback connections, regular all-to-all connections and separate completion queues are established in order to circumvent the case of unblocking a CUDA stream incorrectly, which can occur if common completion queue is used. Alternatively, instead of the StWtCQ operation, CuStWtVal can be issued which serves the same purpose which allows for variations during progress phase of the receive operation.”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Doi by adopting the teachings of Akshay in order to “realize point-to-point communication operations that guarantee stream-ordering while achieving good performance” (see Akshay, abstract).

Michael teaches the counter as being a trigger counter to initiate the third operation (e.g., see page 4, “• Counter: A counter collecting the number of writes”, and page 11, “Triggered Operations: Our work makes use of triggered network operations to improve the networking performance of GPUs. Triggered operations were introduced in the Portals 4 network programming API [34] as a way to build efficient sequences of operations that can be progressed by the NIC.”).

Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Doi and Akshay by adopting the teachings of Michael in order “to support efficient networking from the GPU by offloading the serial communications runtime and network packet creation to the CPU, while still allowing the GPU to initiate the network operation directly by performing a simple memory-mapped write operation of a tag to a particular address”, which “enables efficient networking from within a kernel. This avoids the high hardware scheduler cost present in kernel-boundary networking solutions and enables more fine-grained messaging capabilities” (see Michael, page 4).




Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.



Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDOU K SEYE whose telephone number is (571)270-1062. The examiner can normally be reached M-F 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital, can be reached at (571) 272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ABDOU K SEYE/Examiner, Art Unit 2198         


/PIERRE VITAL/Supervisory Patent Examiner, Art Unit 2198
Prosecution Timeline

Apr 26, 2023: Application Filed
Oct 24, 2025: Non-Final Rejection mailed (§103)
Jan 27, 2026: Applicant Interview (Telephonic)
Jan 27, 2026: Examiner Interview Summary
Jan 29, 2026: Response Filed
May 06, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/405,550
Patent 12639140
REAL-TIME DATA PROCESSING PIPELINE AND PACING CONTROL SYSTEMS AND METHODS
2y 4m to grant Granted May 26, 2026
17/392,297
Patent 12632272
ADAPTIVE VIRTUAL DESKTOP SESSION PLACEMENT ON HOST SERVERS VIA USER LOGOFF PREDICTION
4y 9m to grant Granted May 19, 2026
18/610,083
Patent 12598527
Real-Time Any-G SON
2y 0m to grant Granted Apr 07, 2026
17/683,713
Patent 12587456
MACHINE LEARNING BASED EVENT MONITORING
4y 0m to grant Granted Mar 24, 2026
19/171,788
Patent 12585512
CUSTOMIZED SOCKET APPLICATION PROGRAMMING INTERFACE FUNCTIONS
11m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82%
With Interview (+27.5%): 99%
Median Time to Grant: 3y 3m (~2m remaining)
PTA Risk: Moderate
Based on 583 resolved cases by this examiner. Grant probability derived from career allowance rate.