DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
(Submitted on 10/27/2025)
- On Page 7, the applicant traverses the rejections and argues that the references GUO and Shafiq do not teach the amended limitation of claims 1, 8 and 15 that recites “during runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke second function implemented outside of the library, prior to the completion of the first function and then resume the execution of the first function to write the second data to memory”.
- On Page 8, the applicant argues that the examiner improperly applied the broadest reasonable interpretation (BRI) by treating GUO's teaching of a compile-time call as a runtime call and by treating the absence of an operator in a library as the "second function outside the library".
Examiner’s Response:
Applicant’s arguments with respect to claims 1, 8, and 15 on Page 7 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The examiner recognizes that the amended claim limitations of claims 1, 8, and 15 describe a programming concept in which a function in a library executes another function, implemented outside of the library, at a specific, defined point during its runtime. This is commonly known as a callback or hook mechanism and is taught by the reference “Kerr”. No new references are added.
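To illustrate the callback/hook mechanism characterized above, the following is a minimal sketch, not drawn from either reference; all function names are hypothetical. A library-provided "first function" invokes a caller-supplied "second function" at a defined point during its runtime, prior to its own completion, and then resumes to produce the second data:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Illustrative sketch of a callback/hook mechanism. The library's
// "first function" generates first data, passes control to a
// caller-supplied "second function" (implemented outside the library)
// at a defined hook point, then resumes to return the second data.
std::vector<int> library_first_function(
    const std::vector<int>& input,
    const std::function<std::vector<int>(const std::vector<int>&)>& hook) {
    // Step 1: the first function generates "first data".
    std::vector<int> first_data;
    for (int v : input) first_data.push_back(v * 2);

    // Step 2: defined hook point — the second function, implemented
    // outside the library, operates on the first data.
    std::vector<int> second_data = hook(first_data);

    // Step 3: the first function resumes and hands back the second data
    // for the caller to store.
    return second_data;
}
```

A client supplies the second function as a lambda or function object; the library never needs to know its implementation.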
The examiner has considered the applicant’s argument on Page 8. Because GUO is no longer relied upon to teach runtime execution or the second function implemented outside the library, the examiner submits that the applicant’s arguments are moot.
In conclusion, the examiner rejects claims 1, 8, and 15, and all dependent claims 2-7, 9-14, and 16-20, under § 103, and this rejection is made FINAL.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 103 as unpatentable over Zhibin GUO et al. (hereinafter GUO), US 2020/0210821 A1, in view of Kerr et al. (hereinafter Kerr), US 2021/0103433 A1.
In regard to claim 1. (Currently Amended)
GUO discloses:
generate first data responsive to execution of first function in the library comprising a plurality of functions;
In [0067]:
When the data processing method of the current example is applied to an isomorphic electronic device, a processor of the isomorphic electronic device may try to classify a plurality of network layers supported by the processor into one subnet.
In [0015]:
obtaining a return value of a preset fusion attribute function of each network layer;
In [0106]:
By way of illustration, it may be added to each network layer that a function mfus_supported( ) returns true or false to indicate whether a fusion operation is supported. mfus_supported( ) is a predefined fusion attribute function which may determine whether logic for each operator of the network layers and an interface for calling the logic exist in the preset function library.
In [0101]:
Specifically, the fusion attribute of each network layer includes a first fusion attribute and a second fusion attribute. The electronic device may include a first processor and a second processor. For instance, the electronic device may predetermine whether each network layer can be supported by the first processor to perform a fusion operation. For each network layer, the electronic device may look up the logic of each operator of the network layer, logic for the fusion operator, and the interface for calling the logic in a preset function library associated with the first processor. If there are logic for all the operators of the network layer, the logic for the fusion operator, and the interface for calling the logic in the preset function library, the electronic device determines that the first processor supports the network layer performing the fusion operation, and determines that the function attribute of the network layer is the first fusion attribute.
In [0105]:
if the return value of the network layer is a first return value, determining that the fusion attribute of the network layer is the first fusion attribute; and if the return value of the network layer is a second return value, determining that the fusion attribute of the network layer is the second fusion attribute.
In [0106]:
By way of illustration, it may be added to each network layer that a function mfus_supported( ) returns true or false to indicate whether a fusion operation is supported. mfus_supported( ) is a predefined fusion attribute function which may determine whether logic for each operator of the network layers and an interface for calling the logic exist in the preset function library. If the logic for each operator of the network layers and the interface for calling the logic exist in the preset function library, the function returns true, which refers to the first function attribute. If the logic for each operator of the network layers and the interface for calling the logic does not exist in the preset function library, the function returns false, which refers to the second function attribute.
GUO does not explicitly disclose:
- An apparatus comprising: a processor comprising circuitry configured to:
- during runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke second function implemented outside of the library, prior to the completion of the first function
- execute the second function to perform one or more operations on the first data to generate second data;
However, Kerr discloses:
- An apparatus comprising: a processor comprising circuitry configured to:
In [0058]:
- during runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke second function implemented outside of the library, prior to the completion of the first function
In [0043]:
In at least one embodiment, at application runtime an active component can expose an application programming interface (API) in which a client application can supply a function object for some or all callbacks. In at least one embodiment, this can include one or more CUDA device function objects. In at least one embodiment, these source representations can be compiled by an active component with a real time compiler yielding an intermediate representation
In [0043]:
In at least one embodiment, an active component can perform a compiler in-lining pass to insert these functions at appropriate call sites within a containing kernel. In at least one embodiment, once in-lined, inter-procedural optimization can be performed, followed by just-in-time code generation procedures. In at least one embodiment, this optimization and code generation can yield a complete GPU binary. In at least one embodiment, in-lining and optimization can occur prior to register allocation and final instruction scheduling, such that kernels fused by an active component have zero additional overhead
In [0047]:
In at least one embodiment, an abstraction layer can be used between frameworks for runtime kernel generation.
In [0047]:
In at least one embodiment, for kernel fusion elementwise operations can be fused with GEMMS, convolutions, and reductions, and successive convolutional operations may also be fused. In at least one embodiment, there can be a rich interface provided between framework and code generator.
In [0049]:
In at least one embodiment, a deep learning template library can be used that can provide reusable abstractions as C++ templates. In at least one embodiment, template arguments can include tile sizes, input types, accumulate types, and math operations. In at least one embodiment, orthogonal components can be decoupled where possible. In at least one embodiment, optimization directives and special functions can be represented as intrinsics and code generator passes. In at least one embodiment, deployment for binary can occur through a static library, with an API into compiled template instantiations, or for source code as a header-only library.
In [0052]:
In at least one embodiment, a user might try to combine an operation with a matrix multiply. In at least one embodiment, a matrix multiply has a very regular structured data access pattern. In at least one embodiment, a hardware vendor could leave hooks in the code that are opportunities for custom functionality to be injected and compiled. In at least one embodiment, a compiler tool chain can compile this workload into an intermediate representation, such as an LVM-IR. In at least one embodiment, these hooks, or fusion sites, can remain as device function calls in this intermediate representation. In at least one embodiment, an application programmer can define a function, based at least in part on semantics of what is possible at each of these hooks, and this compiler can take their function compiled to the intermediate representation, in-line it at the call site, within the compute-limited workload, and apply a specialized compiler variant to produce final lowered machine code. In at least one embodiment, types of operations to be included can depend at least in part upon a respective application and use case.
In [0046]:
a user or application will not receive a high level form of this matrix multiply, or direct access to a specialized compiler. In at least one embodiment, a user will still be able to fuse their operation with a vendor-supplied implementation and achieve the best performance.
(BRI: overall, the concept is known as a callback or hook mechanism. An API that allows a client application to supply a function object for callbacks inherently relies on first-class functions, i.e., function objects such as the CUDA device function objects of [0043], as a core part of its design. The user or application interacts with a library that handles low-level, high-performance operations (such as matrix multiplication) without being exposed to the intricate details or requiring direct access to specialized compilers; the user-defined function injected at the hook is therefore the second function implemented outside of the library. This approach balances the need for high performance with user flexibility, allowing users to define their broader computational or application logic while relying on the library for the most intensive computational kernels.)
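The BRI commentary above can be sketched as follows. This is an illustrative example only, not Kerr's actual API; the class, method, and epilogue names are hypothetical. A library exposes an API through which a client registers a first-class function object for a fusion site; the library's own routine invokes it at the defined point before writing the result:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Illustrative sketch: a vendor-style library whose API accepts a
// function object (first-class function) for a defined fusion site.
class FusionLibrary {
 public:
    using Epilogue = std::function<float(float)>;

    // Client-facing API: supply a function object to be invoked as a
    // callback at the fusion site.
    void set_epilogue(Epilogue fn) { epilogue_ = std::move(fn); }

    // "First function": computes a dot product, invokes the hook at a
    // defined fusion point prior to completion, then resumes and writes
    // the second data to memory.
    float dot(const float* a, const float* b, int n, float* out) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) acc += a[i] * b[i];
        if (epilogue_) acc = epilogue_(acc);  // defined fusion point
        *out = acc;                           // resume: write to memory
        return acc;
    }

 private:
    Epilogue epilogue_;
};
```

The client never sees the library's internals; it only defines its own logic (here, a ReLU-style epilogue) and relies on the library for the intensive kernel.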
- execute the second function to perform one or more operations on the first data to generate second data;
In [0055]:
In at least one embodiment, a compilation process 500 can be used as illustrated in FIG. 5. In at least one embodiment, one or more function objects are received 502, at application runtime, through an API. In at least one embodiment, these function objects are provided by, or for, this application. In at least one embodiment, each function object is compiled 504 to obtain an intermediate representation. In at least one embodiment, an intermediate, or partially-compiled, representation of a compute-limited operation, such as a convolution kernel, is obtained 506. In at least one embodiment, an in-lining pass is performed 508 in order to insert compiled function(s) at one or more call locations corresponding to hooks in the partially compiled kernel. In at least one embodiment, one or more optimizations are performed. In at least one embodiment, a final compilation is performed 512 in order to generate output, which may take the form of final executable code.
(BRI: a final compilation stage, including the linking phase, is performed to generate the output as a single, final executable. Function fusion is an optimization technique that occurs during the main compilation phase; it produces modified intermediate or assembly code that then passes through the remaining standard compilation stages, resulting in execution of the second function.)
- and resume execution of the first function to write the second data to a memory device.
In [0045]:
In at least one embodiment, kernel fusion is performed in part to obtain improved performance for applications that target various accelerator processor architectures
In [0045]:
improve performance by taking advantage of locality. In at least one embodiment, if a matrix multiply computes a result then corresponding data is resident on a chip, and another operation can use this data without having to fetch this data from device memory.
(BRI: this represents a key concept in high-performance computing often described using terms such as register reuse, data locality, or kernel fusion. Computing a result and keeping that data resident on chip, so that another operation can consume it before the final result is stored, corresponds to resuming the first function to write the second data to memory.)
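The locality point in Kerr [0045], as characterized above, can be sketched as follows. This is an illustrative example only; the function name and the scale/bias/ReLU operations are hypothetical choices, not drawn from either reference. In the fused form, the intermediate result of the first operation stays in a local variable (on-chip, in the GPU analogy) and the second operation consumes it immediately, so that only the final second data is written to memory:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative fused kernel: the first operation's result stays in a
// register-resident local variable, the second operation consumes it
// without a round trip to memory, and only the final "second data" is
// written out.
void fused_scale_bias_relu(const std::vector<float>& x, float scale,
                           float bias, std::vector<float>& out) {
    out.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        // First operation's result, resident in a local (register).
        float first_data = x[i] * scale;
        // Second operation uses the on-chip value directly.
        float second_data = std::max(first_data + bias, 0.0f);
        // Single write of the second data to memory.
        out[i] = second_data;
    }
}
```

An unfused version would write `first_data` to memory and re-read it for the second operation; fusion eliminates that intermediate memory traffic.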
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the present application, to combine GUO and Kerr. GUO teaches fusing points in a library comprising multiple functions. Kerr teaches processor circuitry executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library. One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
In regard to claim 2. (Previously Presented)
GUO discloses:
- wherein the second function is configured to modify data generated as a result of execution of at least one operation of the first function to generate the second data
In [0086]:
Alternatively, the fusion attribute of the subnet includes a first fusion attribute and a second fusion attribute.
In [0086]:
the fusion attribute of each subnet may indicate whether the subnet may run in the first processor. For instance, when the first processor supports a fusion operation of the subnet, the fusion attribute of the subnet may be the first fusion attribute. When the first processor does not support the fusion operation of the subnet, the fusion attribute of the subnet may be the second fusion attribute.
In regard to claim 8. (Currently Amended)
GUO discloses:
- first data responsive to execution of a first function in [[the]] a library comprising a plurality of functions;
In [0067]:
When the data processing method of the current example is applied to an isomorphic electronic device, a processor of the isomorphic electronic device may try to classify a plurality of network layers supported by the processor into one subnet.
In [0015]:
obtaining a return value of a preset fusion attribute function of each network layer;
In [0106]:
By way of illustration, it may be added to each network layer that a function mfus_supported( ) returns true or false to indicate whether a fusion operation is supported. mfus_supported( ) is a predefined fusion attribute function which may determine whether logic for each operator of the network layers and an interface for calling the logic exist in the preset function library.
In [0101]:
Specifically, the fusion attribute of each network layer includes a first fusion attribute and a second fusion attribute. The electronic device may include a first processor and a second processor. For instance, the electronic device may predetermine whether each network layer can be supported by the first processor to perform a fusion operation. For each network layer, the electronic device may look up the logic of each operator of the network layer, logic for the fusion operator, and the interface for calling the logic in a preset function library associated with the first processor. If there are logic for all the operators of the network layer, the logic for the fusion operator, and the interface for calling the logic in the preset function library, the electronic device determines that the first processor supports the network layer performing the fusion operation, and determines that the function attribute of the network layer is the first fusion attribute.
GUO does not explicitly disclose:
- method comprising:
generating, by circuitry of a processor,
- during runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke second function implemented outside of the library, prior to the completion of the first function
- execute the second function to perform one or more operations on the first data to generate second data;
- and returning, by the circuitry of the processor, execution to the first function to write the second data to a memory device.
However, Kerr discloses:
- method comprising:
generating, by circuitry of a processor,
In [0058]:
- during runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke second function implemented outside of the library, prior to the completion of the first function
In [0043]:
In at least one embodiment, at application runtime an active component can expose an application programming interface (API) in which a client application can supply a function object for some or all callbacks. In at least one embodiment, this can include one or more CUDA device function objects. In at least one embodiment, these source representations can be compiled by an active component with a real time compiler yielding an intermediate representation
In [0043]:
In at least one embodiment, an active component can perform a compiler in-lining pass to insert these functions at appropriate call sites within a containing kernel. In at least one embodiment, once in-lined, inter-procedural optimization can be performed, followed by just-in-time code generation procedures. In at least one embodiment, this optimization and code generation can yield a complete GPU binary. In at least one embodiment, in-lining and optimization can occur prior to register allocation and final instruction scheduling, such that kernels fused by an active component have zero additional overhead
In [0047]:
In at least one embodiment, an abstraction layer can be used between frameworks for runtime kernel generation.
In [0047]:
In at least one embodiment, for kernel fusion elementwise operations can be fused with GEMMS, convolutions, and reductions, and successive convolutional operations may also be fused. In at least one embodiment, there can be a rich interface provided between framework and code generator.
In [0049]:
In at least one embodiment, a deep learning template library can be used that can provide reusable abstractions as C++ templates. In at least one embodiment, template arguments can include tile sizes, input types, accumulate types, and math operations. In at least one embodiment, orthogonal components can be decoupled where possible. In at least one embodiment, optimization directives and special functions can be represented as intrinsics and code generator passes. In at least one embodiment, deployment for binary can occur through a static library, with an API into compiled template instantiations, or for source code as a header-only library.
In [0052]:
In at least one embodiment, a user might try to combine an operation with a matrix multiply. In at least one embodiment, a matrix multiply has a very regular structured data access pattern. In at least one embodiment, a hardware vendor could leave hooks in the code that are opportunities for custom functionality to be injected and compiled. In at least one embodiment, a compiler tool chain can compile this workload into an intermediate representation, such as an LVM-IR. In at least one embodiment, these hooks, or fusion sites, can remain as device function calls in this intermediate representation. In at least one embodiment, an application programmer can define a function, based at least in part on semantics of what is possible at each of these hooks, and this compiler can take their function compiled to the intermediate representation, in-line it at the call site, within the compute-limited workload, and apply a specialized compiler variant to produce final lowered machine code. In at least one embodiment, types of operations to be included can depend at least in part upon a respective application and use case.
In [0046]:
a user or application will not receive a high level form of this matrix multiply, or direct access to a specialized compiler. In at least one embodiment, a user will still be able to fuse their operation with a vendor-supplied implementation and achieve the best performance.
- execute the second function to perform one or more operations on the first data to generate second data;
In [0055]:
In at least one embodiment, a compilation process 500 can be used as illustrated in FIG. 5. In at least one embodiment, one or more function objects are received 502, at application runtime, through an API. In at least one embodiment, these function objects are provided by, or for, this application. In at least one embodiment, each function object is compiled 504 to obtain an intermediate representation. In at least one embodiment, an intermediate, or partially-compiled, representation of a compute-limited operation, such as a convolution kernel, is obtained 506. In at least one embodiment, an in-lining pass is performed 508 in order to insert compiled function(s) at one or more call locations corresponding to hooks in the partially compiled kernel. In at least one embodiment, one or more optimizations are performed. In at least one embodiment, a final compilation is performed 512 in order to generate output, which may take the form of final executable code.
(BRI: a final compilation stage (specifically the linking phase) is performed to generate the output as a single, final executable file. Function fusion is an optimization technique that occurs during the main compilation phase, resulting in modified intermediate or assembly code which is then processed through the remaining standard compilation stages and result in executing the second function)
- and returning, by the circuitry of the processor, execution to the first function to write the second data to a memory device.
In [0045]:
In at least one embodiment, kernel fusion is performed in part to obtain improved performance for applications that target various accelerator processor architectures
In [0045]:
improve performance by taking advantage of locality. In at least one embodiment, if a matrix multiply computes a result then corresponding data is resident on a chip, and another operation can use this data without having to fetch this data from device memory.
(BRI: represents a key concept often described using terms like register reuse, data locality, or kernel fusion in high-performance computing. Computing a result and ensuring the data is resident on a chip is a result of writing to the memory)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the present application, to combine GUO and Kerr. GUO teaches fusing points in a library comprising multiple functions. Kerr teaches processor circuitry executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library. One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
In regard to claim 9. (Currently Amended)
GUO discloses:
- wherein the second function is configured to modify data generated as a result of execution of at least one operation of the first function to generate the second data
In [0086]:
Alternatively, the fusion attribute of the subnet includes a first fusion attribute and a second fusion attribute.
In [0086]:
the fusion attribute of each subnet may indicate whether the subnet may run in the first processor. For instance, when the first processor supports a fusion operation of the subnet, the fusion attribute of the subnet may be the first fusion attribute. When the first processor does not support the fusion operation of the subnet, the fusion attribute of the subnet may be the second fusion attribute.
In regard to claim 15. (Currently Amended)
GUO discloses:
- generate first data responsive to execution of a first function
In [0067]:
When the data processing method of the current example is applied to an isomorphic electronic device, a processor of the isomorphic electronic device may try to classify a plurality of network layers supported by the processor into one subnet.
In [0015]:
obtaining a return value of a preset fusion attribute function of each network layer;
In [0106]:
By way of illustration, it may be added to each network layer that a function mfus_supported( ) returns true or false to indicate whether a fusion operation is supported. mfus_supported( ) is a predefined fusion attribute function which may determine whether logic for each operator of the network layers and an interface for calling the logic exist in the preset function library.
In [0101]:
Specifically, the fusion attribute of each network layer includes a first fusion attribute and a second fusion attribute. The electronic device may include a first processor and a second processor. For instance, the electronic device may predetermine whether each network layer can be supported by the first processor to perform a fusion operation. For each network layer, the electronic device may look up the logic of each operator of the network layer, logic for the fusion operator, and the interface for calling the logic in a preset function library associated with the first processor. If there are logic for all the operators of the network layer, the logic for the fusion operator, and the interface for calling the logic in the preset function library, the electronic device determines that the first processor supports the network layer performing the fusion operation, and determines that the function attribute of the network layer is the first fusion attribute.
GUO does not explicitly disclose:
- A system comprising: a memory ; a first processor comprising circuitry; and a second processor comprising circuitry configured to:
- During runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library, prior to completion of the first function
- execute the second function to perform one or more operations on the first data to generate second data;
- and resume execution of the first function to write the second data to a memory device.
However, Kerr discloses:
- A system comprising: a memory ; a first processor comprising circuitry; and a second processor comprising circuitry configured to:
In [0058]:
- During runtime execution of the first function, execute a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library, prior to completion of the first function
In [0043]:
In at least one embodiment, at application runtime an active component can expose an application programming interface (API) in which a client application can supply a function object for some or all callbacks. In at least one embodiment, this can include one or more CUDA device function objects. In at least one embodiment, these source representations can be compiled by an active component with a real time compiler yielding an intermediate representation
In [0043]:
In at least one embodiment, an active component can perform a compiler in-lining pass to insert these functions at appropriate call sites within a containing kernel. In at least one embodiment, once in-lined, inter-procedural optimization can be performed, followed by just-in-time code generation procedures. In at least one embodiment, this optimization and code generation can yield a complete GPU binary. In at least one embodiment, in-lining and optimization can occur prior to register allocation and final instruction scheduling, such that kernels fused by an active component have zero additional overhead
In [0047]:
In at least one embodiment, an abstraction layer can be used between frameworks for runtime kernel generation.
In [0047]:
In at least one embodiment, for kernel fusion elementwise operations can be fused with GEMMS, convolutions, and reductions, and successive convolutional operations may also be fused. In at least one embodiment, there can be a rich interface provided between framework and code generator.
In [0049]:
In at least one embodiment, a deep learning template library can be used that can provide reusable abstractions as C++ templates. In at least one embodiment, template arguments can include tile sizes, input types, accumulate types, and math operations. In at least one embodiment, orthogonal components can be decoupled where possible. In at least one embodiment, optimization directives and special functions can be represented as intrinsics and code generator passes. In at least one embodiment, deployment for binary can occur through a static library, with an API into compiled template instantiations, or for source code as a header-only library.
In [0052]:
In at least one embodiment, a user might try to combine an operation with a matrix multiply. In at least one embodiment, a matrix multiply has a very regular structured data access pattern. In at least one embodiment, a hardware vendor could leave hooks in the code that are opportunities for custom functionality to be injected and compiled. In at least one embodiment, a compiler tool chain can compile this workload into an intermediate representation, such as an LVM-IR. In at least one embodiment, these hooks, or fusion sites, can remain as device function calls in this intermediate representation. In at least one embodiment, an application programmer can define a function, based at least in part on semantics of what is possible at each of these hooks, and this compiler can take their function compiled to the intermediate representation, in-line it at the call site, within the compute-limited workload, and apply a specialized compiler variant to produce final lowered machine code. In at least one embodiment, types of operations to be included can depend at least in part upon a respective application and use case.
In [0046]:
a user or application will not receive a high level form of this matrix multiply, or direct access to a specialized compiler. In at least one embodiment, a user will still be able to fuse their operation with a vendor-supplied implementation and achieve the best performance.
- execute the second function to perform one or more operations on the first data to generate second data;
In [0055]:
In at least one embodiment, a compilation process 500 can be used as illustrated in FIG. 5. In at least one embodiment, one or more function objects are received 502, at application runtime, through an API. In at least one embodiment, these function objects are provided by, or for, this application. In at least one embodiment, each function object is compiled 504 to obtain an intermediate representation. In at least one embodiment, an intermediate, or partially-compiled, representation of a compute-limited operation, such as a convolution kernel, is obtained 506. In at least one embodiment, an in-lining pass is performed 508 in order to insert compiled function(s) at one or more call locations corresponding to hooks in the partially compiled kernel. In at least one embodiment, one or more optimizations are performed. In at least one embodiment, a final compilation is performed 512 in order to generate output, which may take the form of final executable code.
(BRI: a final compilation stage (specifically the linking phase) is performed to generate the output as a single, final executable file. Function fusion is an optimization technique that occurs during the main compilation phase, resulting in modified intermediate or assembly code which is then processed through the remaining standard compilation stages and results in execution of the second function.)
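The in-lining pass in the quoted process 500 can be sketched as follows, with the intermediate representation modeled as a plain list of opcode strings (all structures and names are illustrative, not from the cited references): each placeholder hook call in the partially compiled kernel is replaced by the body of the user-supplied function before the final compilation step.

```python
# Toy sketch of the in-lining pass from the quoted process 500: a
# partially compiled kernel contains a placeholder call at each hook,
# which is replaced by the compiled user function's body. The "IR" here
# is just a list of strings; all names are illustrative.

partial_kernel_ir = [
    "load a", "load b", "multiply",
    "call @hook",          # fusing point left as a device function call
    "store result",
]

user_function_ir = ["max 0"]  # user op compiled to IR (e.g. a ReLU)

def inline_at_hooks(kernel_ir, fn_ir, hook_name="@hook"):
    """Replace each placeholder hook call with the function body."""
    out = []
    for op in kernel_ir:
        if op == f"call {hook_name}":
            out.extend(fn_ir)   # in-line the compiled user function
        else:
            out.append(op)
    return out

final_ir = inline_at_hooks(partial_kernel_ir, user_function_ir)
# final_ir == ["load a", "load b", "multiply", "max 0", "store result"]
```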
- resume execution of the first function to write the second data to a memory device.
In [0045]:
In at least one embodiment, kernel fusion is performed in part to obtain improved performance for applications that target various accelerator processor architectures
In [0045]:
improve performance by taking advantage of locality. In at least one embodiment, if a matrix multiply computes a result then corresponding data is resident on a chip, and another operation can use this data without having to fetch this data from device memory.
(BRI: represents a key concept often described using terms like register reuse, data locality, or kernel fusion in high-performance computing. Computing a result such that the corresponding data is resident on a chip is the result of writing that data to memory.)
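The locality benefit described above can be shown with a short sketch that models memory traffic with simple counters (a toy model with illustrative names, not drawn from the cited references): fusing a second operation into the first lets it consume the result while it is still "on chip," avoiding a write-then-reload of the intermediate.

```python
# Sketch of the locality benefit of kernel fusion. The unfused version
# spills its intermediate to "device memory" and fetches it back; the
# fused version applies the second op immediately. Traffic is modeled
# with counters; all names are illustrative.

def unfused(data, op1, op2):
    traffic = {"writes": 0, "reads": 0}
    tmp = [op1(x) for x in data]
    traffic["writes"] += len(tmp)   # spill intermediate to device memory
    traffic["reads"] += len(tmp)    # fetch it back for the second op
    return [op2(x) for x in tmp], traffic

def fused(data, op1, op2):
    traffic = {"writes": 0, "reads": 0}
    # second op applied immediately, while op1's result is still local
    return [op2(op1(x)) for x in data], traffic

square, inc = (lambda x: x * x), (lambda x: x + 1)
out_a, t_a = unfused([1, 2, 3], square, inc)
out_b, t_b = fused([1, 2, 3], square, inc)
# out_a == out_b == [2, 5, 10]; the fused version has no intermediate traffic
```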
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO and Kerr.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches a processor circuitry and executing a function call at a defined fusing point
within the first function to invoke a second function implemented outside of the library.
One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
In regard to claim 16. (Previously Presented)
GUO discloses:
- wherein the second function is configured to modify data generated as a result of execution of at least one operation of the first function to generate the second data
In [0086]:
Alternatively, the fusion attribute of the subnet includes a first fusion attribute and a second fusion attribute.
In [0086]:
the fusion attribute of each subnet may indicate whether the subnet may run in the first processor. For instance, when the first processor supports a fusion operation of the subnet, the fusion attribute of the subnet may be the first fusion attribute. When the first processor does not support the fusion operation of the subnet, the fusion attribute of the subnet may be the second fusion attribute.
Claims 3-4, 10-11 and 17-18 are rejected under 35 U.S.C. 103 as unpatentable over
Zhibin GUO et al. (hereinafter GUO) US 2020/0210821 A1,
in view of Kerr et al. (hereinafter Kerr) US 2021/0103433 A1,
and further in view of Corkum et al. (hereinafter Corkum) US 2018/0126553 A1.
In regard to claim 3. (Previously Presented)
GUO does not explicitly disclose:
- the one or more operations at least comprise a tear-down operation, a data load operation, and a setup operation.
However, Kerr discloses:
- a data load operation
in [0198]:
graphics multiprocessor 1734 includes an internal cache memory to perform load and store operations,
in [0227]:
if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data.
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO and Kerr.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches a processor circuitry and executing a function call at a defined fusing point
within the first function to invoke a second function implemented outside of the library.
One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
GUO, and Kerr do not explicitly disclose:
- a tear-down operation
- a setup operation
However, Corkum discloses:
- a tear-down operation
in [0010]:
a system 100 includes: a base 110, a robotic arm 120, an end effector 140, a camera 150, and a controller 160,
in [0041]:
the system 100 can be loaded with an object feature classifier in the form of a neural network,
in [0120]:
the system 100 transitions into an unlocked state in which actuators in the robotic system support the weight of the robotic system and the end effector 140,
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
- a setup operation
in [0121]:
In one implementation, the operator places a template within an operating field of the robotic arm, such as on a fixture or a surface—near the robotic arm—on which units of similar objects will be arranged during future autonomous operating periods of the system 100. The operator then initializes a setup period at the system 100 (e.g., through a user interface executing on a computing device connected to the system 100),
in [0076]:
controller 160 “unlocks” joints in the arm and records position sensor and/or optical data during manual manipulation of the arm during a setup routine
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO, Kerr, and Corkum.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches a processor circuitry and executing a function call at a defined fusing point
within the first function to invoke a second function implemented outside of the library.
Corkum teaches tear-down and setup operations.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Corkum, as the combination can provide increased locational accuracy (Corkum [0017]).
In regard to claim 4. (Previously Presented)
GUO, and Kerr do not explicitly disclose:
- the tear-down operation is performed to export a fusing point state of a given fusing point from a first location to a second location.
However, Corkum discloses:
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
In regard to claim 10. (Previously Presented)
GUO does not explicitly disclose:
- the one or more operations at least comprise a tear-down operation, a data load operation, and a setup operation.
However, Kerr discloses:
- a data load operation
in [0198]:
graphics multiprocessor 1734 includes an internal cache memory to perform load and store operations,
in [0227]:
if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data.
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO and Kerr.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
GUO, and Kerr do not explicitly disclose:
- a tear-down operation
- a setup operation
However, Corkum discloses:
- a tear-down operation
in [0010]:
a system 100 includes: a base 110, a robotic arm 120, an end effector 140, a camera 150, and a controller 160,
in [0041]:
the system 100 can be loaded with an object feature classifier in the form of a neural network,
in [0120]:
the system 100 transitions into an unlocked state in which actuators in the robotic system support the weight of the robotic system and the end effector 140,
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
- a setup operation
in [0121]:
In one implementation, the operator places a template within an operating field of the robotic arm, such as on a fixture or a surface—near the robotic arm—on which units of similar objects will be arranged during future autonomous operating periods of the system 100. The operator then initializes a setup period at the system 100 (e.g., through a user interface executing on a computing device connected to the system 100),
in [0076]:
controller 160 “unlocks” joints in the arm and records position sensor and/or optical data during manual manipulation of the arm during a setup routine
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO, Kerr, and Corkum.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Corkum teaches tear-down and setup operations.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Corkum, as the combination can provide increased locational accuracy (Corkum [0017]).
In regard to claim 11. (Previously Presented)
GUO does not explicitly disclose:
- the one or more operations at least comprise a tear-down operation, a data load operation, and a setup operation.
However, Kerr discloses:
- a data load operation
in [0198]:
graphics multiprocessor 1734 includes an internal cache memory to perform load and store operations,
In [0227]:
if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data.
GUO, and Kerr do not explicitly disclose:
- a tear-down operation
- a setup operation
However, Corkum discloses:
- a tear-down operation
in [0010]:
a system 100 includes: a base 110, a robotic arm 120, an end effector 140, a camera 150, and a controller 160,
in [0041]:
the system 100 can be loaded with an object feature classifier in the form of a neural network,
In [0120]:
the system 100 transitions into an unlocked state in which actuators in the robotic system support the weight of the robotic system and the end effector 140,
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
- a setup operation
in [0121]:
In one implementation, the operator places a template within an operating field of the robotic arm, such as on a fixture or a surface—near the robotic arm—on which units of similar objects will be arranged during future autonomous operating periods of the system 100. The operator then initializes a setup period at the system 100 (e.g., through a user interface executing on a computing device connected to the system 100),
In [0076]:
controller 160 “unlocks” joints in the arm and records position sensor and/or optical data during manual manipulation of the arm during a setup routine
In regard to claim 17. (Previously Presented)
GUO does not explicitly disclose:
- the one or more operations at least comprise a tear-down operation, a data load operation, and a setup operation.
However, Kerr discloses:
- a data load operation
in [0198]:
graphics multiprocessor 1734 includes an internal cache memory to perform load and store operations,
in [0227]:
if a data load misses in data cache, there may be dependent operations in flight in pipeline that have left scheduler with temporarily incorrect data.
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO and Kerr.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
One of ordinary skill would have been motivated to combine GUO and Kerr to provide improved performance with efficient utilization of the target hardware (Kerr [0045]).
GUO, and Kerr do not explicitly disclose:
- a tear-down operation
- a setup operation
However, Corkum discloses:
- a tear-down operation
in [0010]:
a system 100 includes: a base 110, a robotic arm 120, an end effector 140, a camera 150, and a controller 160,
in [0041]:
the system 100 can be loaded with an object feature classifier in the form of a neural network,
in [0120]:
the system 100 transitions into an unlocked state in which actuators in the robotic system support the weight of the robotic system and the end effector 140,
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
- a setup operation
in [0121]:
In one implementation, the operator places a template within an operating field of the robotic arm, such as on a fixture or a surface—near the robotic arm—on which units of similar objects will be arranged during future autonomous operating periods of the system 100. The operator then initializes a setup period at the system 100 (e.g., through a user interface executing on a computing device connected to the system 100),
in [0076]:
controller 160 “unlocks” joints in the arm and records position sensor and/or optical data during manual manipulation of the arm during a setup routine
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO, Kerr, and Corkum.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Corkum teaches tear-down and setup operations.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Corkum, as the combination can provide increased locational accuracy (Corkum [0017]).
In regard to claim 18. (Previously Presented)
GUO, and Kerr do not explicitly disclose:
- the tear-down operation is performed to export a fusing point state of a given fusing point from a first location to a second location.
However, Corkum discloses:
in [0133]:
the system 100 can: fuse a known offset between the end effector 140 and the camera 150 and the interaction image to determine a point, line, or area in real space at which the end effector 140 engages a template object; map this point, line, or area onto the initial image to identify the template object or a feature of the interaction image in a pre-interaction state; and map this point, line, or area onto the final image to identify the template object or feature in a post-interaction state.
Claims 5-6, 12-13, and 19 are rejected under 35 U.S.C. 103 as unpatentable over
Zhibin GUO et al. (hereinafter GUO) US 2020/0210821 A1,
in view of Kerr et al. (hereinafter Kerr) US 2021/0103433 A1,
and further in view of Shafiq et al. (hereinafter Shafiq) US 2021/0182036 A1.
In regard to claim 5. (Currently Amended)
GUO does not explicitly disclose:
- the circuitry of the processor is configured to:
However, Kerr discloses:
- the circuitry of the processor is configured to:
In [0058]:
GUO and Kerr do not explicitly disclose:
- receive a first representation of a machine learning model;
- generate a second representation of the machine learning model
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
- and cause the second representation of the machine learning model to be executed on a target computing device.
However, Shafiq discloses:
- receive a first representation of a machine learning model;
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- generate a second representation of the machine learning model
in [0027]:
The application may be a software application using a neural network model. The neural network model may be a model which is adapted via machine learning during a training phase and then deployed during an inference phase.
in [0028]:
It should be understood that fusions may be performed according to an optimization pass performed by the compiler. This pass is performed in order to process a neural network model as described in the machine learning framework,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b),
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d).
in [0034]:
the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
(BRI: the first and second operators correspond to the first and second representations)
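The quoted operator-fusion arithmetic, g(f(a,b),d), can be written out as a short runnable sketch, with toy scalar functions standing in for the Conv and Batchnorm tensor operators of Shafiq's example (all names and function bodies are illustrative, not from the cited references).

```python
# Runnable sketch of the quoted fusion arithmetic g(f(a,b), d), using
# scalar stand-ins for the tensor operators. f stands in for Conv and
# g for Batchnorm; both bodies are illustrative only.

def f(a, b):          # stands in for Conv(a, b)
    return a * b

def g(c, d):          # stands in for Batchnorm(c, d)
    return c + d

def fused_operator(a, b, d):
    """Fused form: applies f, then g to f's output and d, as one unit."""
    return g(f(a, b), d)

# The fused operator agrees with applying the operators separately:
assert fused_operator(2, 3, 4) == g(f(2, 3), 4) == 10
```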
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
in [0009]:
In yet another aspect, the neural network includes a convolution layer and the above-mentioned condition specifies a constraint on at least one of: a shape of a kernel of the convolution layer; a size of the kernel of convolution layer; and a data type of an execution kernel associated with the fused operator,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b). For example, the function can be f(a,b)=conv(a,b).
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d). For example, the function can be g(c,d)=Batchnorm(c,d). Then, the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- and cause the second representation of the machine learning model to be executed on a target computing device.
in [0036]:
The pattern file 204 indicates a list of fusion patterns associated with a target hardware execution device, also referred to as a target hardware platform. The fusion patterns may represent sets of operators that can be performed together in a particular way by the hardware execution device as a unitary operation.
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the present application to combine GUO, Kerr, and Shafiq.
GUO teaches fusing points in the library, the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Shafiq teaches a processor circuitry.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Shafiq, as the combination can provide performance improvements resulting from a plurality of fused operations (Shafiq [0055]).
In regard to claim 6. (Previously Presented)
GUO and Kerr do not explicitly disclose:
- to link the one or more layers of the first representation of the machine learning model to one or more fusing points of the plurality of fusing points of the library, the circuitry of the processor is further configured to:
- link the first function within the library from a first layer of the first representation of the machine learning model;
- and link a first fusing point within the first function to a second layer of the first representation of the machine learning model.
However, Shafiq discloses:
- to link the one or more layers of the first representation of the machine learning model to one or more fusing points of the plurality of fusing points of the library, the circuitry of the processor is further configured to:
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- link the first function within the library from a first layer of the first representation of the machine learning model;
in [0008]:
each fusion pattern in the list of fusion patterns is associated with a condition for generating a fused operator.
in [0014]:
each fused operator comprising at least two operators of the plurality of operators which can be fused,
in [0008]:
the condition relates to at least one of: a memory allocation requirement associated with the fused operator; a size of a feature map input to a layer of the neural network; and a size of a filter of a layer of the neural network,
in [0044]:
Also illustrated in FIG. 2 is a fused kernel library 206. The fused kernel library 206 (also referred to as a fused operator kernel library) contains the supported underlying fused kernels corresponding to each fused operator for the target hardware platform. Also illustrated in FIG. 2 is a kernel library 208, which contains the kernels corresponding to each individual operator,
in [0032]:
The operations are performed by respective operators. The computation graph generally represents a neural network, that is, a plurality of computation operations to be performed by the neural network. The computation graph 202 includes various nodes, labeled op1, op2, op3, op4, op5, op6. Each of these nodes is associated with an operator of the neural network,
in [0067]:
the dataflow of computations can include the dataflow of computations performed by those nodes which the fused operator represents.
- and link a first fusing point within the first function to a second layer of the first representation of the machine learning model.
in [0008]:
each fusion pattern in the list of fusion patterns is associated with a condition for generating a fused operator.
in [0014]:
each fused operator comprising at least two operators of the plurality of operators which can be fused,
in [0008]:
the condition relates to at least one of: a memory allocation requirement associated with the fused operator; a size of a feature map input to a layer of the neural network; and a size of a filter of a layer of the neural network,
in [0044]:
Also illustrated in FIG. 2 is a fused kernel library 206. The fused kernel library 206 (also referred to as a fused operator kernel library) contains the supported underlying fused kernels corresponding to each fused operator for the target hardware platform. Also illustrated in FIG. 2 is a kernel library 208, which contains the kernels corresponding to each individual operator,
in [0032]:
The operations are performed by respective operators. The computation graph generally represents a neural network, that is, a plurality of computation operations to be performed by the neural network. The computation graph 202 includes various nodes, labeled op1, op2, op3, op4, op5, op6. Each of these nodes is associated with an operator of the neural network,
in [0067]:
the dataflow of computations can include the dataflow of computations performed by those nodes which the fused operator represents.
In regard to claim 12. (Currently Amended)
GUO does not explicitly disclose:
- receiving, by the circuitry of the processor,
However, Kerr discloses:
- receiving, by the circuitry of the processor,
In [0058]:
GUO and Kerr do not explicitly disclose:
- receive a first representation of a machine learning model;
- generate a second representation of the machine learning model
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
- and cause the second representation of the machine learning model to be executed on a target computing device.
However, Shafiq discloses:
- receive a first representation of a machine learning model;
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- generate a second representation of the machine learning model
in [0027]:
The application may be a software application using a neural network model. The neural network model may be a model which is adapted via machine learning during a training phase and then deployed during an inference phase.
in [0028]:
It should be understood that fusions may be performed according to an optimization pass performed by the compiler. This pass is performed in order to process a neural network model as described in the machine learning framework,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b),
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d).
in [0034]:
the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
(BRI: the first and second operators correspond to the first and second representations)
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
in [0009]:
In yet another aspect, the neural network includes a convolution layer and the above-mentioned condition specifies a constraint on at least one of: a shape of a kernel of the convolution layer; a size of the kernel of convolution layer; and a data type of an execution kernel associated with the fused operator,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b). For example, the function can be f(a,b)=conv(a,b).
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d). For example, the function can be g(c,d)=Batchnorm(c,d). Then, the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- and cause the second representation of the machine learning model to be executed on a target computing device.
in [0036]:
The pattern file 204 indicates a list of fusion patterns associated with a target hardware execution device, also referred to as a target hardware platform. The fusion patterns may represent sets of operators that can be performed together in a particular way by the hardware execution device as a unitary operation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr and Shafiq.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Shafiq teaches processor circuitry.
One of ordinary skill would have been motivated to combine GUO, Kerr and Shafiq because the combination can provide performance improvements resulting from a plurality of fused operations (Shafiq [0055]).
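As an illustrative aside (the names below are hypothetical and not drawn from Kerr), the callback/hook mechanism at issue — a library function that invokes an externally supplied second function at a defined fusing point during its runtime, then resumes and writes the result to memory — can be sketched as:

```python
# Sketch of a callback/hook at a defined fusing point: the library's
# first_function invokes a caller-supplied second function mid-execution,
# prior to its own completion, then resumes and writes the result to memory.

memory = []  # stand-in for the memory the first function writes to

def first_function(x, fusing_hook):
    first_data = x + 1            # produce first data
    # defined fusing point: call out to code implemented outside the library
    second_data = fusing_hook(first_data)
    memory.append(second_data)    # resume and write second data to memory
    return second_data

# second function implemented outside the library (e.g. an activation)
def user_activation(v):
    return max(v, 0)

result = first_function(-3, user_activation)
assert result == 0 and memory == [0]
```

The sketch shows only the control flow of a callback: execution of the library function pauses at the defined point, control passes to code outside the library, and the library function then resumes to completion.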
In regard to claim 13 (Previously Presented)
GUO and Kerr do not explicitly disclose:
to link the one or more layers of the first representation of the machine learning model to one or more fusing points of the plurality of fusing points of the library, the circuitry of the processor is further configured to:
link the first function within the library from a first layer of the first representation of the machine learning model;
and link a first fusing point within the first function to a second layer of the first representation of the machine learning model.
However, Shafiq discloses:
- to link the one or more layers of the first representation of the machine learning model to one or more fusing points of the plurality of fusing points of the library, the circuitry of the processor is further configured to:
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- link the first function within the library from a first layer of the first representation of the machine learning model;
in [0008]:
each fusion pattern in the list of fusion patterns is associated with a condition for generating a fused operator.
in [0014]:
each fused operator comprising at least two operators of the plurality of operators which can be fused,
in [0008]:
the condition relates to at least one of: a memory allocation requirement associated with the fused operator; a size of a feature map input to a layer of the neural network; and a size of a filter of a layer of the neural network,
in [0044]:
Also illustrated in FIG. 2 is a fused kernel library 206. The fused kernel library 206 (also referred to as a fused operator kernel library) contains the supported underlying fused kernels corresponding to each fused operator for the target hardware platform. Also illustrated in FIG. 2 is a kernel library 208, which contains the kernels corresponding to each individual operator,
in [0032]:
The operations are performed by respective operators. The computation graph generally represents a neural network, that is, a plurality of computation operations to be performed by the neural network. The computation graph 202 includes various nodes, labeled op1, op2, op3, op4, op5, op6. Each of these nodes is associated with an operator of the neural network,
in [0067]:
the dataflow of computations can include the dataflow of computations performed by those nodes which the fused operator represents.
- and link a first fusing point within the first function to a second layer of the first representation of the machine learning model.
in [0008]:
each fusion pattern in the list of fusion patterns is associated with a condition for generating a fused operator.
in [0014]:
each fused operator comprising at least two operators of the plurality of operators which can be fused,
in [0008]:
the condition relates to at least one of: a memory allocation requirement associated with the fused operator; a size of a feature map input to a layer of the neural network; and a size of a filter of a layer of the neural network,
in [0044]:
Also illustrated in FIG. 2 is a fused kernel library 206. The fused kernel library 206 (also referred to as a fused operator kernel library) contains the supported underlying fused kernels corresponding to each fused operator for the target hardware platform. Also illustrated in FIG. 2 is a kernel library 208, which contains the kernels corresponding to each individual operator,
in [0032]:
The operations are performed by respective operators. The computation graph generally represents a neural network, that is, a plurality of computation operations to be performed by the neural network. The computation graph 202 includes various nodes, labeled op1, op2, op3, op4, op5, op6. Each of these nodes is associated with an operator of the neural network,
in [0067]:
the dataflow of computations can include the dataflow of computations performed by those nodes which the fused operator represents.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr and Shafiq.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Shafiq teaches processor circuitry.
One of ordinary skill would have been motivated to combine GUO, Kerr and Shafiq because the combination can provide performance improvements resulting from a plurality of fused operations (Shafiq [0055]).
In regard to claim 19 (Currently Amended)
GUO does not explicitly disclose:
- the circuitry of the second processor is configured to:
However, Kerr discloses:
- the circuitry of the second processor is configured to:
In [0058]:
GUO and Kerr do not explicitly disclose:
- receive a first representation of a machine learning model;
- generate a second representation of the machine learning model
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
- and cause the second representation of the machine learning model to be executed on a target computing device.
However, Shafiq discloses:
- receive a first representation of a machine learning model;
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- generate a second representation of the machine learning model
in [0027]:
The application may be a software application using a neural network model. The neural network model may be a model which is adapted via machine learning during a training phase and then deployed during an inference phase.
in [0028]:
It should be understood that fusions may be performed according to an optimization pass performed by the compiler. This pass is performed in order to process a neural network model as described in the machine learning framework,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b),
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d).
in [0034]:
the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
(BRI: the first and second operators correspond to the first and second representations)
- on linking one or more layers of the first representation of the machine learning model to one or more fusing points of a plurality of fusing points of the library;
in [0009]:
In yet another aspect, the neural network includes a convolution layer and the above-mentioned condition specifies a constraint on at least one of: a shape of a kernel of the convolution layer; a size of the kernel of convolution layer; and a data type of an execution kernel associated with the fused operator,
in [0034]:
A simple illustrative mathematical example of an operator fusion is as follows. A first operator receives two inputs (in context of a CNN, a and b are typically tensors) a and b and produces an output corresponding to a first mathematical function f(a,b). For example, the function can be f(a,b)=conv(a,b).
in [0034]:
A second operator receives two inputs c and d (where either c or d is output of the first function) and produces an output corresponding to a second mathematical function g(c,d). For example, the function can be g(c,d)=Batchnorm(c,d). Then, the first and second operators can be fused into a fused operator which receives a, b and d, applies the first operator to a and b, and applies the second operator to the output of the first operator and the input d. Thus, the output of the fused operator is g(f(a,b),d), which, in the given example equals Batchnorm(Conv(a,b),d).
In [0024]:
As opposed to current machine learning frameworks, where fusion passes as well as the fusion pattern are part of the compiler, the present invention described herein provides an approach that decouples the fusion patterns from the fusion passes and the compiler. This enables the compiler to be independent of any updates or changes to the target platform's supported operator fusions. The compiler source code does not need to be rebuilt with every change in the target platform. As used herein, the terms “target platform,” “hardware platform” and “hardware execution device” are used interchangeably. A platform can refer to an item of electronic hardware which can execute compiled code. The platform can additionally include associated software, firmware, or a combination thereof.
- and cause the second representation of the machine learning model to be executed on a target computing device.
in [0036]:
The pattern file 204 indicates a list of fusion patterns associated with a target hardware execution device, also referred to as a target hardware platform. The fusion patterns may represent sets of operators that can be performed together in a particular way by the hardware execution device as a unitary operation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr and Shafiq.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Shafiq teaches processor circuitry.
One of ordinary skill would have been motivated to combine GUO, Kerr and Shafiq because the combination can provide performance improvements resulting from a plurality of fused operations (Shafiq [0055]).
Claims 7 and 14 are rejected under 35 U.S.C. 103 as unpatentable over Zhibin GUO et al. (hereinafter GUO) US 2020/0210821 A1,
in view of Kerr et al. (hereinafter Kerr) US 2021/0103433 A1,
and further in view of Song et al. (hereinafter Song) US 2019/0108444 A1.
In regard to claim 7 (Previously Presented)
GUO and Kerr do not explicitly disclose:
- first function is a matrix multiplication operation, and wherein the second layer is an activation layer
However, Song discloses:
- first function is a matrix multiplication operation, and wherein the second layer is an activation layer
in [0055]:
the feature domain X ⊂ ℝ^d, the matrix of n samples can be defined as X = [x_1^T, . . ., x_n^T]. A function k: X×X → ℝ defines a valid kernel if it gives rise to a positive definite kernel matrix K satisfying Mercer's condition. In this case, k also defines an implicit mapping φ to the RKHS H_k and an inner product ⟨·, ·⟩ in H_k, such that k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_{H_k},
in [0148] :
In a number of embodiments, method 900 additionally can include a block 950 of performing classification on the combined representation. In many embodiments, block 950 of performing classification on the combined representation can include using a softmax activation function on the combined representation. In several embodiments, the softmax activation function can be similar or identical to softmax layer 150 and/or softmax layer 340.
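As an illustrative aside (a generic textbook formulation, not code drawn from Song), the softmax activation function quoted in [0148] maps a vector of scores to a probability distribution and can be sketched as:

```python
# Illustrative softmax activation (the function referenced in Song [0148]),
# computed in a numerically stable way over a list of scores.
import math

def softmax(scores):
    m = max(scores)                       # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9      # outputs form a probability distribution
assert probs[0] == max(probs)            # largest score gets the largest probability
```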
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr, and Song.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches processor circuitry and executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Song teaches matrix multiplication and an activation layer.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Song to provide improved performance (Song [0124]).
In regard to claim 14 (Previously Presented)
GUO and Kerr do not explicitly disclose:
- first function is a matrix multiplication operation, and wherein the second layer is an activation layer
However, Song discloses:
- first function is a matrix multiplication operation, and wherein the second layer is an activation layer
in [0055]:
the feature domain X ⊂ ℝ^d, the matrix of n samples can be defined as X = [x_1^T, . . ., x_n^T]. A function k: X×X → ℝ defines a valid kernel if it gives rise to a positive definite kernel matrix K satisfying Mercer's condition. In this case, k also defines an implicit mapping φ to the RKHS H_k and an inner product ⟨·, ·⟩ in H_k, such that k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_{H_k},
in [0148]:
In a number of embodiments, method 900 additionally can include a block 950 of performing classification on the combined representation. In many embodiments, block 950 of performing classification on the combined representation can include using a softmax activation function on the combined representation. In several embodiments, the softmax activation function can be similar or identical to softmax layer 150 and/or softmax layer 340.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr, and Song.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches processor circuitry and executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Song teaches matrix multiplication and an activation layer.
One of ordinary skill would have been motivated to combine GUO, Kerr, and Song to provide improved performance (Song [0124]).
Claim 20 is rejected under 35 U.S.C. 103 as unpatentable over Zhibin GUO et al. (hereinafter GUO) US 2020/0210821 A1,
in view of Kerr et al. (hereinafter Kerr) US 2021/0103433 A1,
in view of Shafiq et al. (hereinafter Shafiq) US 2021/0182036 A1,
and further in view of Song et al. (hereinafter Song) US 2019/0108444 A1.
In regard to claim 20 (Previously Presented)
GUO, Kerr and Shafiq do not explicitly disclose:
- the first layer is a convolution layer, and wherein the second layer is an activation layer
However, Song discloses:
- the first layer is a convolution layer, and wherein the second layer is an activation layer
in [0148]:
In a number of embodiments, method 900 additionally can include a block 950 of performing classification on the combined representation. In many embodiments, block 950 of performing classification on the combined representation can include using a softmax activation function on the combined representation. In several embodiments, the softmax activation function can be similar or identical to softmax layer 150 and/or softmax layer 340.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine GUO, Kerr, Shafiq and Song.
GUO teaches fusing points in the library, with the library comprising multiple functions.
Kerr teaches processor circuitry and executing a function call at a defined fusing point within the first function to invoke a second function implemented outside of the library.
Shafiq teaches processor circuitry.
Song teaches matrix multiplication and an activation layer.
One of ordinary skill would have been motivated to combine GUO, Kerr, Shafiq and Song to provide improved performance (Song [0124]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the
examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571)272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on phone (571-272-3768). The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be
obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TIRUMALE K RAMESH/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121