Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Examiner Notes
Examiner cites particular columns and line numbers in the references as applied to the claims below for convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references cited in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 1/06/2026 has been entered.
Response to Amendment
The Amendment filed 1/06/2026 has been entered. Claims 1-28 remain pending and are addressed in the present Office Action. The Amendments to the claims have overcome all Objections set forth in the previous Office Action.
The Amendments to the Claims have been fully considered and are sufficient to overcome the non-statutory double patenting rejections set forth in the previous Office Action.
Claim Objections
Claims 6 and 20 are objected to because of the following informalities:
In claim 6, line 2, “first API is to add a semaphore signal node to the graph code, the circuitry is to” should be corrected to recite “first API is to add a semaphore signal node to the graph code and the circuitry is to”.
In claim 20, line 2, “the other API” should be corrected to recite “the second API”.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-3, 5-18, 20-21, and 23-27 are rejected under 35 U.S.C. 103 as being unpatentable over Ashwathnarayan et al. (U.S. Pub. No. 2020/0364088), hereinafter Ashwathnarayan, in view of the CUDA C++ Programming Guide (NPL Document V – previously provided with the Office Action dated 4/23/2024) and Mancisidor (U.S. Patent No. 6,519,623).
Regarding claim 1, Ashwathnarayan teaches one or more processors, comprising:
circuitry ([0638]: "processor, comprising: one or more circuits to create a signal to be used to coordinate at least two heterogeneous processing cores in response to performing one or more instructions associated with one or more application programming interfaces (APIs)"; [0679]: "non-limiting examples of a signal includes: semaphores"; [0071] – “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing.”) to, in response to a call to a first application programming interface (API) of a first software library ([0061] – “capabilities of parallel computing platform and application programming interface model external semaphores and parallel computing platform and application programming interface model streams are enhanced using techniques described herein. In at least one embodiment, a parallel computing platform and application programming interface model stream can wait and signal synchronization object by treating it as a type of external semaphore.”; [0071] - “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing. […] parallel computing platform and API models can be implemented using CUDA, Open Computing Language (OpenCL), DirectCompute, C++ Accelerated Massive Parallelism (C++ AMP), and more.”; [0090] – “external semaphore wait/signal APIs can designate hand-off points for parallel computing platform and application programming interface model access to shared buffer and application can invoke external memory APIs”; [0128] – “SignalExternalSemaphoreAsync() is a supported API. 
In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “applications invokes SignalExternalSemaphoresAsync( ) […] API enqueues a signal operation in a parallel computing platform and application programming interface model stream.”; [0549] – “an application generates instructions (e.g., in form of API calls) that cause driver kernel to generate one or more tasks for execution by PPU 4000 and driver kernel outputs tasks to one or more streams”; Parallel computing platform and application programming interface, such as CUDA (“first software library”), comprises an API, SignalExternalSemaphoreAsync(), which when invoked (is “called”) by an application, enqueues a signal operation on an externally allocated semaphore.), generate one or more [operations] in […] code based, at least in part, on one or more parameters of the first API, wherein the one or more [operations] are to cause the […] code to update a semaphore ([0128] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal.
In at least one embodiment, a parameter stream refers to stream to enqueue signal operations in.”) created by a second API of a second software library ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores.” Externally allocated (“created”) Vulkan (“a second API of a second software library”) semaphores are imported by parallel computing platform and application programming interface (comprising the “first software library”) and signaled (“updated”).).
Ashwathnarayan fails to expressly teach an API to generate one or more nodes in graph code, wherein the one or more nodes are to cause the graph code to update a semaphore, and at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, the CUDA C++ Programming Guide teaches an API to generate one or more nodes in graph code based, at least in part, on parameters of the API, wherein the one or more nodes are to cause the graph code to perform an operation (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&a, graph, NULL, 0, &nodeParams”.).
Ashwathnarayan and CUDA C++ Programming Guide are considered to be analogous art to the claimed invention because they are in the same field as the claimed invention of executing programs written for a parallel computing platform and application programming interface. Ashwathnarayan teaches the parallel computing platform and application programming interface model may be CUDA ([0061]). Ashwathnarayan also teaches the SignalExternalSemaphoreAsync() API within the parallel computing platform and application programming interface model enqueues a signaling operation in a stream ([0128] and [0129]). The CUDA C++ Programming Guide teaches that an operation forms a node within a CUDA graph (Section 3.2.6.6.1). Further, the CUDA C++ Programming Guide teaches CUDA graphs are a model for work submission providing several advantages over the work submission mechanism of streams, which include reducing CPU launch costs and enabling optimization by presenting the whole workflow (Section 3.2.6.6). Lastly, Ashwathnarayan suggests using synchronization objects (e.g., external semaphores – see [0061]) in graph-based execution frameworks ([0064]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.).
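For illustration only, the graph-based work-submission model described in the cited sections of the CUDA C++ Programming Guide (an operation forms a node, dependencies form edges, and an operation may run only once the nodes it depends on are complete) can be sketched as a minimal Python model. The names `Graph`, `add_node`, and `launch` below are hypothetical stand-ins and do not appear in the cited references; this is not the actual CUDA Graph API.

```python
# Simplified model of graph-based work submission (cf. the quoted
# Section 3.2.6.6.1): an operation forms a node, dependencies form
# edges, and an operation may be scheduled only once the nodes it
# depends on are complete.
class Graph:
    def __init__(self):
        self.nodes = []                      # (operation, dependencies)

    def add_node(self, op, deps=()):
        """Add an operation to the graph; deps are earlier nodes (edges)."""
        node = (op, tuple(deps))
        self.nodes.append(node)
        return node

    def launch(self):
        """Run every node after all of its dependencies have run."""
        done, order, pending = set(), [], list(self.nodes)
        while pending:
            for node in pending:
                op, deps = node
                if all(d in done for d in deps):
                    op()                     # run the operation
                    done.add(node)
                    order.append(node)
                    pending.remove(node)
                    break
            else:
                raise RuntimeError("dependency cycle in graph")
        return order

# A semaphore-signal operation is just another node in the graph,
# analogous to forming a node from an enqueued signal operation.
semaphore = {"value": 0}
g = Graph()
a = g.add_node(lambda: None)                               # e.g., a kernel
b = g.add_node(lambda: semaphore.update(value=1), deps=[a])  # signal node
order = g.launch()                                         # a runs before b
```

In this sketch the dependency edge forces the semaphore update to occur only after the node it depends on completes, mirroring how a signal operation constrained by graph dependencies would behave.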
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, Mancisidor teaches at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore (Col. 4, lines 49-54 – “The semaphore application programming interface according to the present invention is illustrated in FIG. 4. The first field 402 contains the identifier of the semaphore to be modified by this operation. Operation field 404 contains an op code describing the operation to be performed on the semaphore.”; Col. 5, lines 3-64 – “Operation field 404 of the preferred embodiment is implemented in a 32 bit field that is divided into 4 bytes. The first byte 420 contains the operation to be performed if the semaphore value is positive. Second byte 422 contains the operation to be performed if the semaphore value is zero and field 424 the operation to perform if the semaphore value is negative. Field 426 contains flags that modify the operations of the previous field. A zero in any of the three most significant bytes (420, 422, 424) indicates to the semaphore function that no operation is to be performed when the value is positive, zero or negative respectively, though the old value will be returned. The first three byte positions can contain indicators to perform the following operations. The semaphore value V can be operated upon as follows: decrement (V=V-1) increment (V=V+1) set to zero (V=0) set to one (V=1) set (V=C), where (C=value 406) add (V=V+C). In addition to invoking these operations based upon the value of semaphore V, all the operations can be performed regardless of the value of V.”).
Mancisidor is considered to be analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor of using semaphores for program synchronization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the one or more parameters of the first API taught by Ashwathnarayan in view of CUDA C++ Programming Guide to incorporate the at least one parameter of the API which specifies an operation to be performed on the semaphore as taught by Mancisidor. Doing so would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67).
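For illustration only, the operation-field semantics quoted from Mancisidor (separate op codes selected by whether the semaphore value is positive, zero, or negative; a zero op code meaning no operation; the old value always returned) can be modeled as follows. The function and op-code names are hypothetical and do not reproduce Mancisidor's actual 32-bit encoding.

```python
# Op codes for the three most significant bytes of the operation field.
# NOP is 0, matching the quoted semantics that a zero byte means no
# operation is performed for that value range (the old value is still
# returned).
NOP, DEC, INC, SET_ZERO, SET_ONE, SET_C, ADD_C = range(7)

def semaphore_op(value, op_positive, op_zero, op_negative, c=0):
    """Apply the op code selected by the sign of the semaphore value.

    Mirrors the quoted fields: one op when the value is positive, one
    when zero, one when negative; the old value is always reported.
    """
    old = value
    op = op_positive if value > 0 else op_zero if value == 0 else op_negative
    if op == DEC:
        value -= 1            # V = V - 1
    elif op == INC:
        value += 1            # V = V + 1
    elif op == SET_ZERO:
        value = 0             # V = 0
    elif op == SET_ONE:
        value = 1             # V = 1
    elif op == SET_C:
        value = c             # V = C
    elif op == ADD_C:
        value += c            # V = V + C
    return old, value         # old value is returned in every case

# A Dijkstra-style counting-semaphore "P" built from the generic op:
# decrement only when the value is positive, otherwise do nothing.
old, new = semaphore_op(3, op_positive=DEC, op_zero=NOP, op_negative=NOP)
```

Because each value range carries its own op code, one generic primitive of this shape can emulate different operating systems' semaphore behaviors, which is the rationale relied on above.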
Regarding claim 2, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead generate a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 3, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the first API is to add a semaphore signal operation ([0128]-[0129] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream. [ ... ] In at least one embodiment, API enqueues a signal operation in a parallel computing platform and application programming interface model stream. […] GPU executes a previously enqueued signal operation").
CUDA C++ Programming Guide further teaches an API adds a node to the graph code (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate and add a node to the graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 5, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the semaphore is to be allocated by the second API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”) and the first API is to add a semaphore signal [operation] to the […] code that is to perform a signal operation based, at least in part, on the semaphore, when the semaphore signal [operation] is performed ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation.”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”).
CUDA C++ Programming Guide further teaches an API to add a node to the graph code that performs an operation when the node is performed (Section 3.2.6.6. CUDA Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. […] An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed as taught by Ashwathnarayan, to instead generate a node in graph code executed as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 6, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the first API is to add a semaphore signal [operation] to the […] code ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation.”), the circuitry is to perform a third API to update the semaphore signal ([0116]-[0119] – “WaitExternalSemaphoresAsync() is a supported API [ ... ] semantics of waiting on a semaphore depend on the type of object [ ... ] In at least one embodiment, […] waiting on semaphore waits until semaphore reaches a signaled state. In at least one embodiment, a semaphore reaches a signaled state and is then reset to an unsignaled state. In at least one embodiment, for every signal operation, there is exactly one corresponding wait operation.").
CUDA C++ Programming Guide further teaches an API adds a node to the graph code (Section 3.2.6.6. CUDA Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. […] An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph), and a third API updates the node of a graph (Section 3.2.6.6.4. Updating Instantiated Graphs – “CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. This is much more efficient than re-instantiation. […] CUDA provides two mechanisms for updating instantiated graphs, whole graph update and individual node update.”; Section 3.2.6.6.4.3. Individual node updates – “Instantiated graph node parameters can be updated directly. This eliminates the overhead of instantiation as well as the overhead of creating a new cudaGraph_t. […] The following methods are available for updating cudaGraphExec_t nodes: cudaGraphExecKernelNodeSetParams()”.).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node, and nodes of the graph can be updated by additional APIs after instantiation as taught by CUDA C++ Programming Guide. Using graphs enables optimizations to be performed on work submitted in CUDA, reduces costs for setting up and launching the work, and enables re-submission of updated work without the overhead of having to redefine and instantiate the work (CUDA C++ Programming Guide: Section 3.2.6.6. and Section 3.2.6.6.4.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
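For illustration only, the "individual node update" mechanism quoted from Section 3.2.6.6.4.3, in which an instantiated graph node's parameters are modified in place rather than rebuilding the whole graph, can be sketched as a minimal model. The class and method names below are hypothetical stand-ins for the cudaGraphExec*NodeSetParams() family, not the actual CUDA runtime API.

```python
# Minimal model of updating one node's parameters in an instantiated
# graph without re-instantiating the whole graph.
class ExecGraph:
    def __init__(self, node_params):
        # node_params maps node id -> parameter dict, captured at
        # instantiation time (the "snapshot" of the graph template).
        self.node_params = {k: dict(v) for k, v in node_params.items()}
        self.launch_count = 0

    def set_node_params(self, node_id, **params):
        """Update a single node's parameters directly, in place;
        nothing else in the instantiated graph is rebuilt."""
        self.node_params[node_id].update(params)

    def launch(self):
        """Launch the instantiated graph; returns the parameters each
        node would run with on this launch."""
        self.launch_count += 1
        return {k: dict(v) for k, v in self.node_params.items()}

exec_graph = ExecGraph({"signal": {"sem_value": 1}})
first = exec_graph.launch()
exec_graph.set_node_params("signal", sem_value=2)   # in-place node update
second = exec_graph.launch()                        # relaunch, no rebuild
```

The point of the mechanism, as relied on in the rationale above, is that the second launch runs with the updated parameter while avoiding the overhead of redefining and re-instantiating the graph.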
Regarding claim 7, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches […] the first API is to set one or more parameters of a semaphore signal ([0128] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream”; [0132] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state.”; [0133] – “In at least one embodiment, […] then semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”; [0135] – “ SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal.”) […].
CUDA C++ Programming Guide further teaches wherein the graph code is executable graph code (Section 3.2.6.6. CUDA Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph. An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.”), and an API to set parameters of a node in the executable graph code (Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph with parameter &nodeParams; Section 3.2.6.6.4. Updating Instantiated Graphs – “CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. […] CUDA provides two mechanisms for updating instantiated graphs, whole graph update and individual node update.”; Section 3.2.6.6.4.3. Individual node update – “Instantiated graph node parameters can be updated directly. […] The following methods are available for updating cudaGraphExec_t nodes: cudaGraphExecKernelNodeSetParams()”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream for execution as taught by Ashwathnarayan, to instead add a node in executable graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node, and to set parameters of the node in executable graph code generated from the graph code as taught by CUDA C++ Programming Guide. Doing so enables optimizations to be performed on work submitted in CUDA, reduces costs for setting up and launching the work, and enables re-submission of updated work without the overhead of having to redefine and instantiate the work (CUDA C++ Programming Guide: Section 3.2.6.6. and Section 3.2.6.6.4.).
Regarding claim 8, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”), the second API is a graphics rendering API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores”; [0089] – “Vulkan (or other Graphics API like DX)”), […].
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead add a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Mancisidor further teaches the semaphore is a counting semaphore (Table 1 and Col. 6, lines 3-43 – “Table 1 illustrates the mapping between semaphores of several different operating systems to the emulated semaphore of the present invention”. Table 1 shows the mapping of a Dijkstra counting semaphore to the emulated generic semaphore, where the semaphore can be incremented and decremented (i.e., the generic semaphore is treated as a counting semaphore).).
It would have been obvious to one of ordinary skill in the art to have modified the teachings of Ashwathnarayan such that the semaphore may be a counting semaphore as taught by Mancisidor. Incorporating the methods of Mancisidor would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities, including the traditional Dijkstra counting semaphore (Mancisidor: Col. 2, lines 7-16), and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67). Further, Ashwathnarayan suggests setting a max count for a semaphore, implying the semaphore could be a counting semaphore as is known in the art (see Ashwathnarayan: [0580]).
Regarding claim 9, Ashwathnarayan teaches a system, comprising:
one or more processors ([0648] – “A system, comprising one or more memories to store instructions that, as a result of execution by one or more processors, cause the system to: create a signal to be used to coordinate at least two heterogeneous processing cores in response to performing one or more instructions associated with one or more application programming interfaces (APIs) based, at least in part, on one or more attributes associated with the at least two heterogeneous processing cores.”; [0679]: "non-limiting examples of a signal includes: semaphores") to perform a first application programming interface (API) of a first software library ([0061] – “capabilities of parallel computing platform and application programming interface model external semaphores and parallel computing platform and application programming interface model streams are enhanced using techniques described herein. In at least one embodiment, a parallel computing platform and application programming interface model stream can wait and signal synchronization object by treating it as a type of external semaphore.”; [0071] - “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing. […] parallel computing platform and API models can be implemented using CUDA, Open Computing Language (OpenCL), DirectCompute, C++ Accelerated Massive Parallelism (C++ AMP), and more.”; [0090] – “external semaphore wait/signal APIs can designate hand-off points for parallel computing platform and application programming interface model access to shared buffer and application can invoke external memory APIs”; [0128] – “SignalExternalSemaphoreAsync() is a supported API. 
In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream.” Parallel computing platform and application programming interface, such as CUDA (“first software library”), comprises an API to signal an externally allocated semaphore.) to generate one or more [operations] in […] code based, at least in part, on one or more parameters of the first API, wherein the one or more [operations] are to cause the […] code to update a semaphore ([0128] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal. 
In at least one embodiment, a parameter stream refers to stream to enqueue signal operations in.”) created by a second API of a second software library ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores.” Externally allocated (“created”) Vulkan (“a second API of a second software library”) semaphores are imported by parallel computing platform and application programming interface (comprising the “first software library”) and signaled (“updated”).) […]; and
one or more memories to store the […] code ([0648] – “A system, comprising one or more memories to store instructions”; [0185] – “data storage 1505, which may be used to store code (e.g. graph code)").
Ashwathnarayan fails to expressly teach an API to generate one or more nodes in graph code, wherein the one or more nodes are to cause the graph code to update a semaphore and at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, the CUDA C++ Programming Guide teaches an API to generate one or more nodes in graph code based, at least in part, on parameters of the API, wherein the one or more nodes are to cause the graph code to perform an operation (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&a, graph, NULL, 0, &nodeParams”.).
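For illustration, the graph-definition pattern the Guide describes — each operation forms a node, and dependencies form edges that constrain when an operation may execute — can be sketched with a small standalone C++ model. The `Graph`, `Node`, and `addNode` names below are hypothetical stand-ins mirroring the shape of `cudaGraphAddKernelNode` (node handle out, dependency list in); this is not the actual CUDA runtime API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model of the Guide's graph structure: operations as nodes,
// dependencies as edges. Hypothetical types, not the CUDA runtime API.
struct Graph {
    struct Node {
        std::function<void()> op;
        std::vector<int> deps;  // indices of nodes this one depends on
        bool done;
    };
    std::vector<Node> nodes;

    // Mirrors the shape of cudaGraphAddKernelNode: returns a handle
    // (here, an index) for the new node; dependencies are passed in.
    int addNode(std::function<void()> op, std::vector<int> deps = {}) {
        nodes.push_back({std::move(op), std::move(deps), false});
        return static_cast<int>(nodes.size()) - 1;
    }

    // Launch phase: an operation may be scheduled at any time once the
    // nodes on which it depends are complete (simple topological run).
    void launch() {
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (auto& n : nodes) {
                if (n.done) continue;
                bool ready = true;
                for (int d : n.deps) ready = ready && nodes[d].done;
                if (ready) { n.op(); n.done = true; progressed = true; }
            }
        }
    }
};
```

As in the Guide's example, the definition phase (building nodes and edges) is separate from the execution phase (`launch()`), which is what permits whole-workflow optimization before any work runs.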
Ashwathnarayan and CUDA C++ Programming Guide are considered to be analogous art to the claimed invention because they are in the same field as the claimed invention of executing programs written for a parallel computing platform and application programming interface. Ashwathnarayan teaches the parallel computing platform and application programming interface model may be CUDA ([0061]). Ashwathnarayan also teaches the SignalExternalSemaphoreAsync() API within the parallel computing platform and application programming interface model enqueues a signaling operation in a stream ([0128] and [0129]). The CUDA C++ Programming Guide teaches that an operation forms a node within a CUDA graph (Section 3.2.6.6.1). Further, the CUDA C++ Programming Guide teaches CUDA graphs are a model for work submission providing several advantages over the work submission mechanism of streams which include reducing CPU launch costs and enabling optimization by presenting the whole workflow (Section 3.2.6.6). Lastly, Ashwathnarayan suggests using synchronization objects (e.g., external semaphores – see [0061]) in graph-based execution frameworks ([0064]). Therefore, it would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, Mancisidor teaches at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore (Col. 4, lines 49-54 – “The semaphore application programming interface according to the present invention is illustrated in FIG. 4. The first field 402 contains the identifier of the semaphore to be modified by this operation. Operation field 404 contains an op code describing the operation to be performed on the semaphore.”; Col. 5, lines 3-64 – “Operation field 404 of the preferred embodiment is implemented in a 32 bit field that is divided into 4 bytes. The first byte 420 contains the operation to be performed if the semaphore value is positive. Second byte 422 contains the operation to be performed if the semaphore value is zero and field 424 the operation to perform if the semaphore value is negative. Field 426 contains flags that modify the operations of the previous field. A zero in any of the three most significant bytes (420, 422, 424) indicates to the semaphore function that no operation is to be performed when the value is positive, zero or negative respectively, though the old value will be returned. The first three byte positions can contain indicators to perform the following operations. The semaphore value V can be operated upon as follows: decrement (V=V-1) increment (V=V+1) set to zero (V=0) set to one (V=1) set (V=C), where (C=value 406) add (V=V+C). In addition to invoking these operations based upon the value of semaphore V, all the operations can be performed regardless of the value of V.”).
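The semantics Mancisidor assigns to operation field 404 — three bytes selecting the operation to apply when the semaphore value is positive, zero, or negative, with the old value always returned — can be sketched in a few lines of C++. The op-code numbering, `makeOpField`, and `applySemOp` names below are hypothetical; the patent specifies the field layout and operation set, not this encoding.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of Mancisidor's operation field 404: a 32-bit field
// whose three high bytes select the op applied when the semaphore value
// is positive, zero, or negative respectively (0 = no-op). Op codes and
// names are hypothetical.
enum Op : uint8_t { NOP = 0, DEC, INC, SET_ZERO, SET_ONE, SET_C, ADD_C };

// Packs the per-sign op codes (flags byte 426 left zero here).
constexpr uint32_t makeOpField(Op ifPos, Op ifZero, Op ifNeg) {
    return (uint32_t(ifPos) << 24) | (uint32_t(ifZero) << 16) |
           (uint32_t(ifNeg) << 8);
}

// Applies the op selected by the sign of v; the old value is returned
// even when no operation is performed, as the patent describes.
int applySemOp(int& v, uint32_t opField, int c = 0) {
    int old = v;
    Op op = Op((v > 0 ? opField >> 24 : v == 0 ? opField >> 16
                                               : opField >> 8) & 0xFF);
    switch (op) {
        case DEC:      v = v - 1; break;  // V = V - 1
        case INC:      v = v + 1; break;  // V = V + 1
        case SET_ZERO: v = 0;     break;  // V = 0
        case SET_ONE:  v = 1;     break;  // V = 1
        case SET_C:    v = c;     break;  // V = C (value field 406)
        case ADD_C:    v = v + c; break;  // V = V + C
        case NOP:      break;             // no-op; old value still returned
    }
    return old;
}
```

Under this model, a classic Dijkstra P operation is simply the field `makeOpField(DEC, NOP, NOP)`: decrement when positive, otherwise leave the value untouched.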
Mancisidor is considered to be analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor of using semaphores for program synchronization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the one or more parameters of the first API taught by Ashwathnarayan in view of CUDA C++ Programming Guide to incorporate the at least one parameter of the API which specifies an operation to be performed on the semaphore as taught by Mancisidor. Doing so would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67).
Regarding claim 10, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the system of claim 9. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”), and the semaphore is to be allocated by the second API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead generate a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Regarding claim 11, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the system of claim 9. Ashwathnarayan further teaches wherein the first API is to return one or more parameters of a semaphore signal [operation] in the […] code in response to an API call to get the one or more parameters ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0136] – “SignalExternalSemaphoresAsync() returns a result value as an output. In at least one embodiment, an output may indicate success or failure states that may include, but are not limited by: not initialized, invalid handle, not supported.” SignalExternalSemaphoreAsync() is the first API, which in response to being called, returns (i.e., gets) a state parameter indicating success or failure of the semaphore signal operation.).
CUDA C++ Programming Guide further teaches an operation is a node in the graph code (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate and add a node to the graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Regarding claim 12, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the system of claim 9. Ashwathnarayan further teaches wherein the first API is to add a semaphore signal operation ([0128]-[0129] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream. […] In at least one embodiment, API enqueues a signal operation in a parallel computing platform and application programming interface model stream. […] GPU executes a previously enqueued signal operation").
CUDA C++ Programming Guide further teaches an API adds a node to the graph code (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate and add a node to the graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Regarding claim 13, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the system of claim 9. Ashwathnarayan further teaches wherein the one or more memories are to store the semaphore ([0110] – “an implementation of ImportExternalSemaphore( ) maps resources into parallel computing platform and application programming interface model's address space and those resources can be accessed at time of signal and wait, it will be subsequently freed at time of DestroyExternalSemaphore(). In at least one embodiment, a resource would be mapping of semaphores”; [0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0065] – “semaphore in system RAM).”) and the second API is to use code not included in the graph code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).” Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted as not being included in the graph code which updates the semaphore as taught by Ashwathnarayan in view of CUDA C++ Programming Guide.).
Regarding claim 14, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the system of claim 9. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”), and the second API is a graphics rendering API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead generate a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Regarding claim 15, Ashwathnarayan teaches a non-transitory machine-readable medium having stored thereon ([0670] – “a machine-readable medium having stored thereon one or more application programming interfaces (APIs),") a first application programming interface (API) of a first software library ([0061] – “capabilities of parallel computing platform and application programming interface model external semaphores and parallel computing platform and application programming interface model streams are enhanced using techniques described herein. In at least one embodiment, a parallel computing platform and application programming interface model stream can wait and signal synchronization object by treating it as a type of external semaphore.”; [0071] - “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing. […] parallel computing platform and API models can be implemented using CUDA, Open Computing Language (OpenCL), DirectCompute, C++ Accelerated Massive Parallelism (C++ AMP), and more.”; [0090] – “external semaphore wait/signal APIs can designate hand-off points for parallel computing platform and application programming interface model access to shared buffer and application can invoke external memory APIs”; [0128] – “SignalExternalSemaphoreAsync() is a supported API. 
In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream.” Parallel computing platform and application programming interface, such as CUDA (“first software library”), comprises an API to signal an externally allocated semaphore.), which if performed by one or more processors ([0670] – “(APIs), which if performed by one or more processors, cause the one or more processors to at least: create a signal to be used to coordinate at least two heterogeneous processing cores in response to performing one or more instructions associated with the one or more APIs”), is to generate one or more [operations] in […] code based, at least in part, on one or more parameters of the first API, wherein the one or more [operations] are to cause […] code to at least update a semaphore ([0128] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal. 
In at least one embodiment, a parameter stream refers to stream to enqueue signal operations in.”) created by a second API of a second software library ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores.” Externally allocated (“created”) Vulkan (“a second API of a second software library”) semaphores are imported by parallel computing platform and application programming interface (comprising the “first software library”) and signaled (“updated”).) […].
Ashwathnarayan fails to expressly teach an API to generate one or more nodes in graph code, wherein the one or more nodes are to cause the graph code to update a semaphore and at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, the CUDA C++ Programming Guide teaches an API to generate one or more nodes in graph code based, at least in part, on parameters of the API, wherein the one or more nodes are to cause the graph code to perform an operation (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&a, graph, NULL, 0, &nodeParams”.).
Ashwathnarayan and CUDA C++ Programming Guide are considered to be analogous art to the claimed invention because they are in the same field as the claimed invention of executing programs written for a parallel computing platform and application programming interface. Ashwathnarayan teaches the parallel computing platform and application programming interface model may be CUDA ([0061]). Ashwathnarayan also teaches the SignalExternalSemaphoreAsync() API within the parallel computing platform and application programming interface model enqueues a signaling operation in a stream ([0128] and [0129]). The CUDA C++ Programming Guide teaches that an operation forms a node within a CUDA graph (Section 3.2.6.6.1). Further, the CUDA C++ Programming Guide teaches CUDA graphs are a model for work submission providing several advantages over the work submission mechanism of streams which include reducing CPU launch costs and enabling optimization by presenting the whole workflow (Section 3.2.6.6). Lastly, Ashwathnarayan suggests using synchronization objects (e.g., external semaphores – see [0061]) in graph-based execution frameworks ([0064]). Therefore, it would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Doing so enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, Mancisidor teaches at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore (Col. 4, lines 49-54 – “The semaphore application programming interface according to the present invention is illustrated in FIG. 4. The first field 402 contains the identifier of the semaphore to be modified by this operation. Operation field 404 contains an op code describing the operation to be performed on the semaphore.”; Col. 5, lines 3-64 – “Operation field 404 of the preferred embodiment is implemented in a 32 bit field that is divided into 4 bytes. The first byte 420 contains the operation to be performed if the semaphore value is positive. Second byte 422 contains the operation to be performed if the semaphore value is zero and field 424 the operation to perform if the semaphore value is negative. Field 426 contains flags that modify the operations of the previous field. A zero in any of the three most significant bytes (420, 422, 424) indicates to the semaphore function that no operation is to be performed when the value is positive, zero or negative respectively, though the old value will be returned. The first three byte positions can contain indicators to perform the following operations. The semaphore value V can be operated upon as follows: decrement (V=V-1) increment (V=V+1) set to zero (V=0) set to one (V=1) set (V=C), where (C=value 406) add (V=V+C). In addition to invoking these operations based upon the value of semaphore V, all the operations can be performed regardless of the value of V.”).
Mancisidor is considered to be analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor of using semaphores for program synchronization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the one or more parameters of the first API taught by Ashwathnarayan in view of CUDA C++ Programming Guide to incorporate the at least one parameter of the API which specifies an operation to be performed on the semaphore as taught by Mancisidor. Doing so would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67).
Regarding claim 16, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the non-transitory machine-readable medium of claim 15. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”), and the semaphore is a binary semaphore ([0119] – “In at least one embodiment, […] waiting on semaphore waits until semaphore reaches a signaled state. In at least one embodiment, a semaphore reaches a singled state and is then reset to an unsigned state. In at least one embodiment, for every signal operation, there is exactly one corresponding wait operation.”; [0132] – “In at least one embodiment, semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state.” The semaphore object may be a binary semaphore with two states – signaled and unsignaled.) to be allocated by the second API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead generate a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Regarding claim 17, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the non-transitory machine-readable medium of claim 15. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”), and the semaphore is […] to be allocated by the second API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead generate a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph-based execution frameworks ([0064]).
Mancisidor further teaches the semaphore is a counting semaphore (Table 1 and Col. 6, lines 3-43 – “Table 1 illustrates the mapping between semaphores of several different operating systems to the emulated semaphore of the present invention”. Table 1 shows the mapping of a Dijkstra counting semaphore to the emulated generic semaphore, where the semaphore can be incremented and decremented (i.e., the generic semaphore is treated as a counting semaphore).).
It would have been obvious to one of ordinary skill in the art to have modified the teachings of Ashwathnarayan such that the semaphore may be a counting semaphore as taught by Mancisidor. Incorporating the methods of Mancisidor would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities, including the traditional Dijkstra counting semaphore (Mancisidor; Col. 2, lines 7-16), and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67). Further, Ashwathnarayan suggests setting a max count for a semaphore, implying the semaphore could be a counting semaphore as is known in the art (see Ashwathnarayan: [0580]).
Regarding claim 18, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the non-transitory machine-readable medium of claim 15. Ashwathnarayan further teaches wherein the semaphore is to be allocated by the second API based, at least in part, on code not included in the graph code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan ( or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).” Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted not to be included in the graph code which updates the semaphore as taught by Ashwathnarayan in view of CUDA C++ Programming Guide.), and the first API is to add a semaphore signal [operation] to the […] code that is to change a value of the semaphore when the semaphore signal [operation] is performed ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation.”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. 
In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”).
CUDA C++ Programming Guide further teaches an API is to add a node to the graph code (Section 3.2.6.6. CUDA Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. […] An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 20, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the non-transitory machine-readable medium of claim 15, wherein the semaphore is […] to be allocated by the other API based, at least in part, on code not included in the graph code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan ( or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).” Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted not to be included in the graph code which updates the semaphore as taught by Ashwathnarayan in view of CUDA C++ Programming Guide.).
Mancisidor further teaches the semaphore is a counting semaphore (Table 1 and Col. 6, lines 3-43 – “Table 1 illustrates the mapping between semaphores of several different operating systems to the emulated semaphore of the present invention”. Table 1 shows the mapping of a Dijkstra counting semaphore to the emulated generic semaphore, where the semaphore can be incremented and decremented (i.e., the generic semaphore is treated as a counting semaphore).).
It would have been obvious to one of ordinary skill in the art to have modified the teachings of Ashwathnarayan such that the semaphore may be a counting semaphore as taught by Mancisidor. Incorporating the methods of Mancisidor would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities, including the traditional Dijkstra counting semaphore (Mancisidor; Col. 2, lines 7-16), and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67). Further, Ashwathnarayan suggests setting a max count for a semaphore, implying the semaphore could be a counting semaphore as is known in the art (see Ashwathnarayan: [0580]).
Regarding claim 21, Ashwathnarayan teaches a method, comprising:
causing, based at least in part on a first application programming interface (API) of a first software library ([0061] – “capabilities of parallel computing platform and application programming interface model external semaphores and parallel computing platform and application programming interface model streams are enhanced using techniques described herein. In at least one embodiment, a parallel computing platform and application programming interface model stream can wait and signal synchronization object by treating it as a type of external semaphore.”; [0071] - “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing. […] parallel computing platform and API models can be implemented using CUDA, Open Computing Language (OpenCL), DirectCompute, C++ Accelerated Massive Parallelism (C++ AMP), and more.”; [0090] – “external semaphore wait/signal APIs can designate hand-off points for parallel computing platform and application programming interface model access to shared buffer and application can invoke external memory APIs”; [0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream.” Parallel computing platform and application programming interface, such as CUDA (“first software library”), comprises an API to signal an externally allocated semaphore.), a semaphore signal [operation] to be added to […] code based, at least in part, on one or more parameters of the first API ([0128] – “SignalExternalSemaphoresAsync() is a supported API. 
In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream. […] GPU executes a previously enqueued signal operation”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal. In at least one embodiment, a parameter stream refers to stream to enqueue signal operations in.”), to change a value of a semaphore ([0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”) created by a second API of a second software library […], the second API to use code not included in the […] code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan (or other Graphics API like DX)”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. 
In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).”; [0128] – “externally allocated semaphore objects” Externally allocated (“created”) Vulkan (“a second API of a second software library”) semaphores are imported by parallel computing platform and application programming interface (comprising the “first software library”) and signaled (“change a value”). Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted not to be included in the code which updates the semaphore.); and
[…] setting one or more parameters of a semaphore signal ([0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”) […];
wherein the executable […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”).
Ashwathnarayan fails to expressly teach an API to add one or more nodes in graph code, wherein the one or more nodes are to cause the graph code to change a value of a semaphore and at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore; generating executable graph code based, at least in part, on the graph code, and setting parameters of a node of the executable graph code; and wherein the executable graph code is performed by one or more GPUs.
However, CUDA C++ Programming Guide teaches API to add one or more nodes in graph code based, at least in part, on parameters of the API, wherein the one or more nodes are to cause the graph code to perform an operation (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&a, graph, NULL, 0, &nodeParams”.); generating executable graph code based, at least in part, on the graph code (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph.”), and setting parameters of a node of the executable graph code (Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&nodeParams”; Section 3.2.6.6.4 Updating Instantiated Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. 
In situations where the workflow is not changing, the overhead of definition and instantiation can be amortized over many executions, and graphs provide a clear advantage over streams. […] CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. This is much more efficient than re-instantiation. […] CUDA provides two mechanisms for updating instantiated graphs, whole graph update and individual node update.”; Section 3.2.6.6.4.3. Individual node update – “Instantiated graph node parameters can be updated directly. This eliminates the overhead of instantiation as well as the overhead of creating a new cudaGraph_t.”); and wherein the executable graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU […] An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.”).
Ashwathnarayan and CUDA C++ Programming Guide are considered to be analogous art to the claimed invention because they are in the same field as the claimed invention of executing programs written for a parallel computing platform and application programming interface. Ashwathnarayan teaches the parallel computing platform and application programming interface model may be CUDA ([0061]). Ashwathnarayan also teaches the SignalExternalSemaphoresAsync() API within the parallel computing platform and application programming interface model enqueues a signaling operation in a stream ([0128] and [0129]). The CUDA C++ Programming Guide teaches that an operation forms a node within a CUDA graph (Section 3.2.6.6.1). Further, the CUDA C++ Programming Guide teaches CUDA graphs are a model for work submission providing several advantages over the work submission mechanism of streams which include reducing CPU launch costs and enabling optimization by presenting the whole workflow (Section 3.2.6.6). Further, since graph definition, instantiation, and execution are separated, parameters of nodes of an executable graph can be updated without having to create and instantiate a new graph, which reduces overhead, and allows for execution of the graph multiple times unlike streams (Section 3.2.6.6.4. and 3.2.6.6.4.3.). Lastly, Ashwathnarayan suggests using synchronization objects (e.g., external semaphores – see [0061]) in graph-based execution frameworks ([0064]). Therefore, it would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node, and to set parameters of the node in executable graph code generated from the graph code as taught by CUDA C++ Programming Guide. 
Doing so enables optimizations to be performed on work submitted in CUDA, reduces costs for setting up and launching the work, and enables re-submission of updated work without the overhead of having to redefine and instantiate the work (CUDA C++ Programming Guide: Section 3.2.6.6. and Section 3.2.6.6.4.).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore.
However, Mancisidor teaches at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore (Col. 4, lines 49-54 – “The semaphore application programming interface according to the present invention is illustrated in FIG. 4. The first field 402 contains the identifier of the semaphore to be modified by this operation. Operation field 404 contains an op code describing the operation to be performed on the semaphore.”; Col. 5, lines 3-64 – “Operation field 404 of the preferred embodiment is implemented in a 32 bit field that is divided into 4 bytes. The first byte 420 contains the operation to be performed if the semaphore value is positive. Second byte 422 contains the operation to be performed if the semaphore value is zero and field 424 the operation to perform if the semaphore value is negative. Field 426 contains flags that modify the operations of the previous field. A zero in any of the three most significant bytes (420, 422, 424) indicates to the semaphore function that no operation is to be performed when the value is positive, zero or negative respectively, though the old value will be returned. The first three byte positions can contain indicators to perform the following operations. The semaphore value V can be operated upon as follows: decrement (V=V-1) increment (V=V+1) set to zero (V=0) set to one (V=1) set (V=C), where (C=value 406) add (V=V+C). In addition to invoking these operations based upon the value of semaphore V, all the operations can be performed regardless of the value of V.”).
Mancisidor is considered to be analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor of using semaphores for program synchronization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the one or more parameters of the first API taught by Ashwathnarayan in view of CUDA C++ Programming Guide to incorporate the at least one parameter of the API which specifies an operation to be performed on the semaphore as taught by Mancisidor. Doing so would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67).
Regarding claim 23, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the method of claim 21. Ashwathnarayan further teaches wherein the semaphore is to be allocated by the second API ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan ( or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”), and causing the […] code to change a value of the semaphore includes causing the […] code to change a value of the semaphore when the semaphore signal [operation] of the […] code is to be performed ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync( ) enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream. […] GPU executes a previously enqueued signal operation”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”).
CUDA C++ Programming Guide teaches causing graph code to perform an operation when a node of the graph code corresponding to the operation is to be performed (Section 3.2.6.6. – CUDA Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. […] An executable graph may be launched into a stream, similar to any other CUDA work.”; Section 3.2.6.6.1. Graph Structure – “An operation forms a node in a graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete. Scheduling is left up to the CUDA system.”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead generate a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 24, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the method of claim 21. Ashwathnarayan further teaches wherein the second API is to use code not included in the graph code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan ( or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).” Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted not to be included in the graph code which updates the semaphore as taught by Ashwathnarayan in view of CUDA C++ Programming Guide.).
Regarding claim 25, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the method of claim 21. Ashwathnarayan further teaches wherein the method further includes […] and setting one or more parameters of a semaphore signal ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation.”; [0132]-[0133] – “semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”) […].
CUDA C++ Programming Guide further teaches generating executable graph code based, at least in part, on the graph code (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph.”), and setting parameters of a node of the executable graph code (Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph based on parameters of the API, e.g., “&nodeParams”; Section 3.2.6.6.4 Updating Instantiated Graphs – “Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. In situations where the workflow is not changing, the overhead of definition and instantiation can be amortized over many executions, and graphs provide a clear advantage over streams. […] CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. This is much more efficient than re-instantiation. […] CUDA provides two mechanisms for updating instantiated graphs, whole graph update and individual node update.”; Section 3.2.6.6.4.3. Individual node update – “Instantiated graph node parameters can be updated directly. This eliminates the overhead of instantiation as well as the overhead of creating a new cudaGraph_t.”).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream for execution as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node, and to set parameters of the node in executable graph code generated from the graph code as taught by CUDA C++ Programming Guide. Doing so enables optimizations to be performed on work submitted in CUDA, reduces costs for setting up and launching the work, and enables re-submission of updated work without the overhead of having to redefine and instantiate the work (CUDA C++ Programming Guide: Section 3.2.6.6. and Section 3.2.6.6.4.).
Regarding claim 26, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the method of claim 21, wherein the semaphore is […] to be allocated by the second API based, at least in part, on code not included in the graph code ([0063] – “parallel computing platform and application programming interface model interops supports externally allocated semaphore, such as Vulkan semaphores.”; [0089] – “Vulkan ( or other Graphics API like DX)”; [0128] – “externally allocated semaphore objects”; [0111] – “parallel computing platform and application programming interface model external semaphores supports importing Vulkan & D3D12 semaphores. In at least one embodiment, to differentiate between already supported types and SciSync, _EXTERNAL_SEMAPHORE_HANDLE_TYPE_SciSync can be implemented as a new type to externalSemaphoreHandleType. In at least one embodiment, this is set by application before importing SciSync via ImportExternalSemaphore( ).” Since an already created, externally allocated, semaphore may be imported, the code used to allocate or create the semaphore is interpreted not to be included in the graph code which updates the semaphore as taught by Ashwathnarayan in view of CUDA C++ Programming Guide.), and the method further includes adding, based at least in part on the first API, a semaphore signal [operation] to the […] code that is to change a value of the […] semaphore when the semaphore signal [operation] is performed ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “GPU executes a previously enqueued signal operation”; [0132]-[0133] – “In at least one embodiment, semantics of signaling a semaphore depend on type of object. In at least one embodiment, […] signaling semaphore sets it to a signaled state. 
In at least one embodiment, […] semaphore is set to value specified in EXTERNAL_SEMAPHORE_PARAMS::params::fence::value.”).
CUDA C++ Programming Guide further teaches an API which adds a node to graph code that is to perform an operation when the node is performed (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them. […] An executable graph may be launched into a stream, similar to any other CUDA work.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each call to API cudaGraphAddKernelNode adds a node to the graph.).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Mancisidor further teaches the semaphore is a counting semaphore (Table 1 and Col. 6, lines 3-43 – “Table 1 illustrates the mapping between semaphores of several different operating systems to the emulated semaphore of the present invention”. Table 1 shows the mapping of a Dijkstra counting semaphore to the emulated generic semaphore, where the semaphore can be incremented and decremented (i.e., the generic semaphore is treated as a counting semaphore).).
It would have been obvious to one of ordinary skill in the art to have modified the teachings of Ashwathnarayan such that the semaphore may be a counting semaphore as taught by Mancisidor. Incorporating the methods of Mancisidor would provide a generic semaphore operation that is able to support semaphores with multiple different operating system personalities, including the traditional Dijkstra counting semaphore (Mancisidor: Col. 2, lines 7-16), and emulate semaphore APIs from multiple operating system personalities, which further allows a single resource to be used by applications using different operating system personalities (Mancisidor: Col. 2, lines 52-67). Further, Ashwathnarayan suggests setting a max count for a semaphore, implying the semaphore could be a counting semaphore as is known in the art (see Ashwathnarayan: [0580]).
Regarding claim 27, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the processor of claim 1. Ashwathnarayan further teaches wherein the first API is to add an external semaphore signal [operation to a stream] based, at least in part, on an API call ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream.”; [0549] – “an application generates instructions (e.g., in form of API calls) that cause driver kernel to generate one or more tasks for execution by PPU 4000 and driver kernel outputs tasks to one or more streams being processed by PPU 4000.” SignalExternalSemaphoresAsync() is a particular API call of the first API (the API of the “parallel computing platform and application programming interface”).).
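For illustration only, the stream-based signal operation cited above from Ashwathnarayan's description of SignalExternalSemaphoresAsync() corresponds in shape to the public CUDA runtime call cudaSignalExternalSemaphoresAsync; the sketch below assumes an already imported external semaphore handle (the import itself requires a platform-specific handle, e.g. from Vulkan, and is not shown):

```cuda
#include <cuda_runtime.h>

// Sketch only: extSem is assumed to have been created by an external API
// (e.g., Vulkan) and imported via cudaImportExternalSemaphore beforehand.
void signalImportedSemaphore(cudaExternalSemaphore_t extSem,
                             cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};
    params.params.fence.value = 1;  // value the semaphore is set to on signal

    // Enqueues the signal operation on the specified stream; the GPU
    // performs the signal after preceding work in the stream completes.
    cudaSignalExternalSemaphoresAsync(&extSem, &params,
                                      /*numExtSems=*/1, stream);
}
```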
CUDA C++ Programming Guide further teaches an API is to add a node to a graph based, at least in part, on an API call (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each call to API cudaGraphAddKernelNode adds a node to the graph.).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Claims 4, 19, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor as applied to claims 1 and 15 above, and further in view of CUDA Runtime API (NPL Document U – previously provided with the Office Action dated 4/23/2024).
Regarding claim 4, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the one or more processors of claim 1. Ashwathnarayan further teaches wherein the […] code is to be performed, at least in part, by one or more graphics processing units (GPUs) ([0129] – “GPU executes a previously enqueued signal operation”; [0230] – “GPU(s) 1808 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's parallel computing platform and application programming interface model).”) and the first API is to add a semaphore signal [operation to the] code based, at least in part, on a parameter that specifies a [stream] to which to add the semaphore signal [operation] ([0129] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. […] a parameter stream refers to stream to enqueue signal operations in.”).
CUDA C++ Programming Guide further teaches the CUDA graph code is performed at least in part by one or more GPUs (Section 3.2.6.6. CUDA Graphs – “CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. […] execution of the kernel on the GPU”), and an API adds a node to the graph code (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream to be executed at least in part by a GPU as taught by Ashwathnarayan, to instead add a node in graph code executed at least in part by a GPU as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach a parameter that specifies a graph to which to add the node.
However, CUDA Runtime API teaches a parameter that specifies a graph to which to add the node (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "graph" specifying which graph the node is to be added to).
The CUDA Runtime API is considered analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor in ordering the execution of dependent operations within a graph. The CUDA Runtime API discloses a list of APIs to add different types of nodes to a CUDA graph, many of which require a parameter corresponding to the graph to which a node should be added (Section 5.29). Therefore, it would have been obvious to one of ordinary skill prior to the filing date of the claimed invention to have replaced the parameter specifying a stream to which to add the semaphore signal operation as disclosed by Ashwathnarayan to instead specify a graph to which to add the node corresponding to the operation. Properly defining nodes and their edges representing dependencies between nodes in a graph ensures proper order of execution of the operations (CUDA C++ Programming Guide: Section 3.2.6.6.1). Further, using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work when compared to work submitted through streams (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Regarding claim 19, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the non-transitory machine-readable medium of claim 15. Ashwathnarayan further teaches wherein the first API is to add a semaphore signal [operation] to the […] code based, at least in part, on a first parameter that specifies a [stream] to which to add the semaphore signal [operation] ([0129] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. […] a parameter stream refers to stream to enqueue signal operations in.”), a second parameter that specifies one or more parameters of the semaphore signal [operation] ([0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. In at least one embodiment a parameter extSemArray refers to external semaphores to be signaled. In at least one embodiment, a parameter paramsArray refers to array of semaphore parameters. In at least one embodiment, a parameter numExtSems refers to a number of semaphores to signal.”), […].
CUDA C++ Programming Guide further teaches an API to add a node to the graph code (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach a first parameter that specifies a graph to which to add the node, a second parameter that specifies parameters of the node, and a third parameter that specifies one or more dependencies of the semaphore signal node.
However, CUDA Runtime API teaches a first parameter that specifies a graph to which to add the node (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "graph" specifying which graph the node is to be added to), a second parameter that specifies parameters of the node (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "pNodeParams" specifying the parameters for the node being added), and a third parameter that specifies one or more dependencies of the semaphore signal node (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "pDependencies" specifying the dependencies for the node).
The CUDA Runtime API is considered analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor in ordering the execution of dependent operations within a graph. The CUDA Runtime API discloses a list of APIs to add different types of nodes to a CUDA graph, many of which require a parameter corresponding to the graph to which a node should be added, parameters for the node itself, and parameters for the dependencies of the node (Section 5.29). Therefore, it would have been obvious to one of ordinary skill prior to the filing date of the claimed invention to have added the node corresponding to the semaphore signal operation as disclosed by Ashwathnarayan in view of the CUDA C++ Programming Guide based on the parameters specifying the graph to add the node to, specifying parameters of the node, and specifying dependencies of the node as taught by CUDA Runtime API. Properly defining nodes and their edges representing dependencies between nodes in a graph ensures proper order of execution of the operations (CUDA C++ Programming Guide: Section 3.2.6.6.1). Further, using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work when compared to work submitted through streams (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
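For illustration only, the three-parameter pattern discussed above (a parameter specifying the graph, a parameter specifying the node's dependencies, and a parameter specifying the node's own parameters) appears in the public CUDA runtime call cudaGraphAddExternalSemaphoresSignalNode; in this sketch the imported semaphore handle and the predecessor node are assumed to exist already:

```cuda
#include <cuda_runtime.h>

// Sketch of adding a semaphore signal node to a graph based on: the graph
// to add to, the node's dependencies, and the node's parameters.
cudaGraphNode_t addSignalNode(cudaGraph_t graph,
                              cudaGraphNode_t dep,  // node this one depends on
                              cudaExternalSemaphore_t extSem) {
    cudaExternalSemaphoreSignalParams sigParams = {};
    sigParams.params.fence.value = 1;  // value set when the node executes

    cudaExternalSemaphoreSignalNodeParams nodeParams = {};
    nodeParams.extSemArray = &extSem;   // semaphores to signal
    nodeParams.paramsArray = &sigParams;
    nodeParams.numExtSems = 1;

    cudaGraphNode_t node;
    cudaGraphAddExternalSemaphoresSignalNode(&node, graph,
                                             /*pDependencies=*/&dep,
                                             /*numDependencies=*/1,
                                             &nodeParams);
    return node;
}
```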
Regarding claim 28, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the processor of claim 1. Ashwathnarayan further teaches wherein the first API is to add an external semaphore signal [operation] to a [stream] based, at least in part, on an API call that includes a [stream] identifier parameter ([0129] – “SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0135] – “In at least one embodiment, SignalExternalSemaphoresAsync() accepts one or more parameters as input parameters. […] a parameter stream refers to stream to enqueue signal operations in.”), […].
CUDA C++ Programming Guide further teaches an API to add a node to the graph code based, at least in part, on an API call (Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each API cudaGraphAddKernelNode (e.g., “cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);”) adds a node to the graph).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs instead of streams enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach the API call includes a graph identifier parameter, a dependencies parameter, and a node parameters parameter.
However, the CUDA Runtime API teaches the API call includes a graph identifier parameter (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "graph" specifying which graph the node is to be added to), a dependencies parameter (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "pDependencies" specifying the dependencies for the node), and a node parameters parameter (Section 5.29 Graph Management: cudaGraphAddKernelNode() API has parameter "pNodeParams" specifying the parameters for the node being added).
The CUDA Runtime API is considered analogous art to the claimed invention because it is reasonably pertinent to the problem faced by the inventor in ordering the execution of dependent operations within a graph. The CUDA Runtime API discloses a list of APIs to add different types of nodes to a CUDA graph, many of which require a parameter corresponding to the graph to which a node should be added, parameters for the node itself, and parameters for the dependencies of the node (CUDA Runtime API: Section 5.29). Therefore, it would have been obvious to one of ordinary skill prior to the filing date of the claimed invention to have added the node corresponding to the semaphore signal operation as disclosed by Ashwathnarayan in view of the CUDA C++ Programming Guide based on parameters specifying the graph to add the node to, dependencies of the node, and parameters of the node as taught by CUDA Runtime API. Properly defining nodes and their edges representing dependencies between nodes in a graph ensures proper order of execution of the operations (CUDA C++ Programming Guide: Section 3.2.6.6.1). Further, using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work when compared to work submitted through streams (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor as applied to claim 21 above, and further in view of Jane et al. (U.S. Pub. No. 2020/0104968), hereinafter Jane.
Regarding claim 22, the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor teaches the method of claim 21. Ashwathnarayan further teaches wherein the first API is to add the semaphore signal [operation] to the […] code ([0128] – “SignalExternalSemaphoresAsync() is a supported API. In at least one embodiment, SignalExternalSemaphoresAsync() enqueues a signal operation on a set of externally allocated semaphore objects in a specified stream.”; [0129] – “API enqueues a signal operation in a parallel computing platform and application programming interface model stream. […] GPU executes a previously enqueued signal operation.”) based, in part, on a GPU driver […], wherein the GPU driver is software, firmware, or a combination thereof performed by a GPU ([0085] – “frame level API library (e.g., NVMedia) and parallel computing platform and application programming interface model are APIs at different levels, […] parallel computing platform and application programming interface model, on other hand, being a general purpose driver need not recognize these higher level constructs and accepts datatypes recognized directly by HW or GPU.”; [0071] – “a parallel computing platform and API model refers to an API model that can be used by software developers and software engineers to write code that uses a graphics processing unit (GPU) for general purpose processing. In at least one embodiment, a parallel computing platform and API model is a software layer that gives a programmer or developer direct access to a GPU's virtual instruction set and/or parallel computational elements. […] In at least one embodiment, parallel computing platform and API model (e.g., CUDA) is a software platform”).
CUDA C++ Programming Guide further teaches an API to add a node to graph code (Section 3.2.6.6. CUDA Graphs – “During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.”; Section 3.2.6.6.1 Graph Structure: "an operation forms a node in the graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations. An operation may be scheduled at any time once the nodes on which it depends are complete."; Section 3.2.6.6.2 Creating a Graph Using Graph APIs: in the example code, each call to API cudaGraphAddKernelNode adds a node to the graph.).
It would have been obvious to one of ordinary skill in the art to have modified the first API which enqueues a semaphore signal operation in a stream as taught by Ashwathnarayan, to instead add a node in graph code as taught by CUDA C++ Programming Guide, where the semaphore signal operation forms the node. Using graphs enables optimizations to be performed on work submitted in CUDA and reduces costs for setting up and launching the work (CUDA C++ Programming Guide: Section 3.2.6.6.). Ashwathnarayan also suggests implementing synchronization objects which are waited on and signaled (e.g., external semaphores – [0061]) into graph based execution frameworks ([0064]).
The combination of Ashwathnarayan in view of CUDA C++ Programming Guide fails to expressly teach the GPU driver receiving a call to the first API.
However, Jane teaches the GPU driver receiving a call to the first API ([0003] – “User space applications typically utilize a graphics application program interface (API) to access (e.g., indirect or near-direct access) a GPU for the purposes of improving graphics and compute operations. To access the GPU, a user space application institutes API calls that generate a series of commands for a GPU to execute.”; [0026] – “a graphics driver translates API calls into commands a graphics processor is able to execute”; [0031] – “The graphics processor system 112 includes one or more graphics processors (e.g., GPUs)”; [0033] – “The user space driver 102 receives graphics API calls from application 101 and maps the graphics API calls to operations understood and executable by the graphics processor system 112. For example, the user space driver 102 can translate the API calls into commands encoded within command buffers before being transferred to kernel driver 103.”; [0034] – “the graphics processor firmware 104 obtains commands that processor system 110 submits for execution”; [0035] – “After scheduling the commands, in FIG. 1, the graphics processor firmware 104 sends command streams to the graphics processor hardware 105. 
The graphics processor hardware 105 then executes the commands within the command streams according to the order the graphics processor hardware 105 receives the commands.”; [0036] – “Rather than having the kernel driver 103 and/or graphics processor firmware 104 submit commands to the graphics processor hardware 105 according to the same order kernel driver 103 receives commands, the kernel driver 103 and/or graphics processor firmware 104 is able to perform out-of-order command scheduling that submits commands to the graphics processor hardware 105 according to command dependency.”; [0085] – “The presence of multiple instances of a graphics driver (user space graphics drivers 905A and 905B, kernel graphics drivers 910A and 910B, and graphics driver firmware 925 in the microcontroller firmware 920) indicates the various options for implementing the graphics driver. As a matter of technical possibility any of the three shown drivers might independently operate as a sole graphics driver.”).
Jane is considered to be analogous art to the claimed invention because it is in the same field as the claimed invention of executing programs written for a parallel computing platform and application programming interface. Therefore, it would have been obvious to one of ordinary skill in the art to have modified the first API which adds a semaphore signal node to a graph based, in part, on a GPU driver as taught by Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor such that it is based on the GPU driver receiving a call to the first API as taught by Jane. Submitting the API calls to the GPU driver of Jane allows the GPU driver to translate received API calls into commands executable by the GPU and to reorder the commands based on dependencies between the commands before submission to the GPU hardware, thereby reducing processing latency and utilizing the GPU’s parallel architecture (Jane: [0003], [0026]).
Response to Arguments
Applicant’s arguments with respect to the rejection of the claims under 35 U.S.C. 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Specifically, Applicant argues the combination of Ashwathnarayan in view of the CUDA C++ Programming Guide fails to teach the limitation "at least one of the one or more parameters of the first API specify an operation to be performed on the semaphore" as recited in claims 1, 9, 15, and 21.
However, the Examiner has relied upon new reference Mancisidor (U.S. Patent No. 6,519,623) to teach the argued limitation. See the rejections of claims 1, 9, 15, and 21 in the section titled Claim Rejections - 35 USC § 103 for additional details regarding the combination of Ashwathnarayan in view of CUDA C++ Programming Guide and Mancisidor.
No additional arguments were presented for any of claims 2-8, 10-14, 16-20, or 22-28.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Jones et al. (U.S. Pub. No. 2021/0248115) teaches a graph API may comprise functions for adding nodes to a graph which define the node, associate the node with an operation/function, and specify how each node is related to other nodes of the graph (see [0079]).
Joffe et al. (U.S. Patent No. 7,437,535) teaches function parameters of a function for interfacing with a co-processor, including an operation type which informs the co-processor whether to perform a get-semaphore or a release-semaphore operation (see Col. 4, lines 51-59 and Col. 8, lines 59-64).
Aydin et al. (U.S. Patent No. 6,212,572) teaches a function with a parameter that permits a selection between an operation P to decrement the value of a semaphore and an operation V to increment the value of a semaphore (see Col. 5, lines 6-23).
Wilt et al. (U.S. Patent No. 8,402,229) teaches a method for sharing graphics objects between a CUDA API and a graphics API using a semaphore to synchronize accesses (see Abstract).
Dixon et al. (U.S. Pub. No. 2009/0217294) teaches it is known in the art how to combine multiple API calls into a single API call that performs the same operations as the multiple API calls, as multiple API calls may be inefficient (see [0006]).
McCloghrie et al. (U.S. Patent No. 6,286,052) teaches one of ordinary skill in the art would recognize two or more API calls may be combined into a single call, or that any one call may be broken down into multiple calls (see Col. 19, lines 58-63).
Nickolls et al. (U.S. Patent No. 7,861,060) teaches it is known in the art that communication between a CPU and GPU can be managed by a driver program on the CPU that supports an API which defines function calls supported by the GPU. A programmer invokes those functions using calls from the API in an application program, and particulars of the API, such as names and parameters of particular function calls, are a matter of design choice, such that a person of ordinary skill in the art is capable of creating suitable APIs (see Col. 25, line 65 - Col. 26, line 61).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JENNIFER MARIE GUTMAN whose telephone number is (703)756-1572. The examiner can normally be reached M-F: 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kevin Young can be reached at 571-270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JENNIFER MARIE GUTMAN/Examiner, Art Unit 2194 /KEVIN L YOUNG/Supervisory Patent Examiner, Art Unit 2194