DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 12, 14-21, 23-30, and 32-34 are presented for examination. Claims 1-11, 13, 22, and 31 are canceled.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 12, 21, 30, 32, 34 are rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Lueh (US 20200394041 A1).
As to claim 12, Khorasani teaches inserting register acquire instructions and register release instructions for an inter-block register pool into a first section of application code (compiler 510 inserts acquire and release instructions into executable code 515 at appropriate locations so as to acquire and release extended register sets in processor 520 for the various executing threads, para[0040], ln 1-6/ When a given thread needs more than the base set of registers to execute a given phase of program code, the given thread executes an acquire instruction to acquire an extended set of registers from a shared resource pool. The extended set of registers is then available for the exclusive use of the given thread for the given phase of program code. When the given thread no longer needs additional registers, the given thread executes a release instruction to release the extended set of registers back into the shared register pool for other threads to use, para[0022], ln 9-21).
Li teaches launching execution of the first section of application code with a register allocation based on register utilization of a second section of the application code (performing rematerialization operation(s) [section] on program source code to reduce a register pressure, para[0003], ln 1-3/ performing rematerialization operations to balance cross-block register pressure prior to instruction scheduling. Such cross-block rematerialization may optimize the basic blocks of the source code prior to instruction scheduling such that instruction scheduling may be performed more efficiently, para[0010], ln 3-10/ perform rematerialization operations to balance cross-block register pressure prior to instruction scheduling. Such cross-block rematerialization may optimize the basic blocks of the source code prior to instruction scheduling such that instruction scheduling may be performed more efficiently (i.e., to produce fewer spills to memory) that increases shader performance. Rematerialization operations can decrease register pressure by increasing the amount of computations performed while reducing a number of registers being used in a particular basic block of code, para[0008]/ The compiler 106 may be configured to perform the rematerialization operations prior to instruction scheduling in order to provide basic blocks having reduced register pressure that allow for scheduling results to have reduced memory latency (e.g., due to fewer spills to memory). In this way, the compiled code may be processed by the GPUs 102 more efficiently. Rematerialization may be tightly integrated with register allocation that is performed after instruction scheduling, where rematerialization is used as an alternative to spilling registers to memory.
In scenarios where rematerialization is only performed at register allocation, instruction scheduling that is performed prior to register allocation may have inaccurate cross-block register pressures as well as inaccurate intra-block register pressures. By performing rematerialization before instruction scheduling (and also optionally after instruction scheduling), the input code provided for instruction scheduling may have balanced cross-block register pressure that helps the instruction scheduling process generate scheduling results with low register pressure. Moreover, performing cross-block rematerialization prior to instruction scheduling to help lower register pressure may have a lower processing cost as compared to performing global instruction scheduling, para[0019] to para[0020]/ The register allocation module 206 may be configured to assign target variables of the code onto the registers 103 of the GPUs 102. For example, the register allocation module 206 may be configured to perform register allocation over a basic block (i.e., local register allocation), over a whole function/procedure (i.e., global register allocation), or across function boundaries. The compiler 106 may be configured to generate target-dependent assembly code, performing register allocation in the process via the register allocation module 206. Further, the compiler 106 may be configured to compile the assembly code into machine code and output the machine code 110 to the plurality of GPUs 102 for processing, para[0031]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani with Li to incorporate the above feature because this performs a rematerialization operation on at least one of the one or more candidate instructions to reduce the register pressure at the boundary to be less than the target register pressure.
Damron teaches a second section of the software application that employs register spill and refill instructions for launching execution of the application code with a register allocation based on register utilization of a second section (At block 502, the compiler 304 performs instruction selection on the code of a program [application code] and the flow proceeds to block 503……….. At block 513, the compiler 304 inserts a spill instruction for each of the selected candidate virtual registers immediately after each definition or redefinition of the respective candidate virtual register in the code ……… At block 516, the compiler 304 performs a register allocation phase that assigns which virtual registers will be stored in which specific physical register of the target machine for each point in the code. The register allocation phase may also spill and reload one or more virtual registers if register pressure remains excessive for any point in the code. The flow then proceeds to block 517, para[0058], ln 5-8/ para[0065], ln 1-12/ para[0066]/ FIG. 8B illustrates the allocation of the virtual registers to the physical registers 704 to 709 during the execution of the code that may be set by the processing unit performing register allocation on the code 600J. In the first part of basic block A, virtual integer register I1 is stored in Integer Register 1 704, virtual integer register I2 is stored in Integer Register 2 705, and virtual integer register I3 is stored in Integer Register 3 706. During basic block A, when virtual integer register I3 is no longer live it is no longer stored in Integer Register 3 706, para[0082], ln 1-10/ At block 507, the compiler 304 discards all virtual registers that are referenced in the basic block where the register pressure is excessive. Such virtual registers would require insertion of spill or reload instructions in high pressure areas of the program, so the compiler 304 avoids this issue by not selecting them as candidates, para[0061], ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani and Li with Damron to incorporate the above feature because this reduces the effectiveness of early instruction schedulers to focus on minimizing stalls in execution.
Lueh teaches executing the second section of the application code that employs register spill and refill instructions on condition that the register acquire instructions fail (However, a common thread is not guaranteed acquisition of a shared register and may stall due to shared register starvation [fail], para[0209], ln 5-8/ Thus, LT represents the average number of shared registers that common threads can acquire before resource contention can occur. In one embodiment, the LT value is computed as LT=floor(Rem_Regs/(8−c)), where (8−c) is #common threads when all critical slots are full, para[0211], ln 3-10/ However, Scenario (2) requires cooperation with compiler 1401. When register pressure at barrier exceeds LT, it is possible that all critical threads reach the barrier and stop execution. However only some, or possibly none, of the common threads will be able to climb register pressure level at the barrier due to shared register starvation [fail]. According to one embodiment, compiler 1401 may insert spills to move data from shared registers to memory and release those shared registers in order to mitigate this problem. In such an embodiment, compiler 1401 spills (or releases) a sufficient number of shared registers that register pressure drops to under LT. After the barrier instruction, compiler 1401 fills data back from memory to the shared registers that trigger an acquire operation. Accordingly, a barrier instruction is brought under LT register pressure level, making scenario (2) equivalent to trivial scenario (1), para[0226]/ when acquisition of a shared register may stall due to shared register starvation, the register acquire instruction fails, as described above).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Lueh to incorporate the above feature because this enables the compute resources to access the pooled resources as if they were local.
As to claims 21, 30, they are rejected for the same reason as to claim 12 above.
As to claim 32, it is rejected for the same reason as to claim 12 above. In addition, Lueh teaches configuring the application code with a back-off loop that re-attempts the register acquire instructions on condition that the inter-block register pool is empty or fails to satisfy a configured threshold level (At decision block 1735, a determination is made as to whether the number of registers that have already been acquired by the thread (T) (e.g., Reg_cnt[T]) exceeds LT. If not, the thread is a common thread and the Com_Regs counter is checked to determine whether there are shared registers available in the common pool (e.g., Com_Regs>0), decision block 1740. If there are shared registers available in the pool, the request is fulfilled and the count of the Com_Regs counter is decremented, processing block 1745. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented. At processing block 1790, the acquired register is returned. This marks a successful acquire operation and hardware continues with execution of instruction. However upon a determination at decision block 1740 that Com_Regs=0, the common pool of shared registers is empty. At this point, the current thread is stalled and control is transferred to processing block 1748 to wait to repeat the request. If at decision block 1735, a determination is made Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned, para[0218] to para[0220]/ a determination at decision block 1765 that a free critical slot is available (e.g., the CT_Cnt counter indicates that less than the maximum allowed critical threads has been reached) …… upon a determination that there is a critical slot available, the CT_Cnt counter is incremented to indicate a critical thread slot has been filled, processing block 1770 ….. After bookkeeping is performed at processing block 1770, the process continues with processing block 1750 as discussed above, para[0221], ln 3-9/ para[0222], ln 19-23/ Fig. 17).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Lueh to incorporate the above feature because this enables the compute resources to access the pooled resources as if they were local.
As to claim 34, it is rejected for the same reason as to claim 12 above. In addition, Lueh teaches configuring some of the register acquire instructions and register release instructions in the application code to borrow registers from and return registers to an intra-block register pool exclusively for threads in a same thread block; and configuring some of the register acquire instructions and register release instructions in the application code to borrow from and return registers to the inter-block register pool exclusively for thread blocks belonging to the inter-block register pool (dedicated registers 1512 are permanently available to a hardware thread, while shared registers 1514 are acquired and released on demand. In a further embodiment, a kernel that requires more registers than available in its dedicated quota acquires registers from shared registers 1514 (or register pool), which is shared among all threads in execution unit 1413. In yet a further embodiment, an acquired register in the register pool is explicitly released when the register is no longer needed. In such an embodiment, the register is marked for release by software (e.g., a kernel) via a specific instruction in order to free, and return, the register to the register pool, para[0197], ln 1-12/ According to one embodiment, a thread may be either in a common state or a critical state. In such an embodiment, a critical thread is guaranteed acquisition of one or more shared registers and never stalls on shared register starvation (e.g., acquire operation always succeeds for a critical thread). However, a common thread is not guaranteed acquisition of a shared register and may stall due to shared register starvation, para[0209], ln 1-9/ If at decision block 1735, a determination is made Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned. In one embodiment, a thread may be upgraded to a critical state, if it isn't already a critical thread. A thread may be upgraded upon a determination at decision block 1765 that a free critical slot is available (e.g., the CT_Cnt counter indicates that less than the maximum allowed critical threads has been reached). If not, control is forwarded to processing block 1740 where the process proceeds as discussed above for common threads. However upon a determination that there is a critical slot available, the CT_Cnt counter is incremented to indicate a critical thread slot has been filled, processing block 1770, para[0220] to para[0221]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Lueh to incorporate the above feature because this enables the compute resources to access the pooled resources as if they were local.
Claim(s) 14, 15, 23, 24, 25 are rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Craik (US 20210303373 A1).
As to claim 14, Craik teaches further comprising: identifying one or more high register utilization regions and one or more low register utilization regions of the application code (define a monent-monexit loop. The monent-monexit loop includes a loop header, a loop body, and a loop exit operatively and sequentially coupled, para[0005], ln 17-24/ When the loop body contains conditional control flow, it may be beneficial to hold the lock reservation on all paths through the loop or, equally, it may only be beneficial to hold the lock reservation on a subset of paths through the loop. Consider a loop with an error check, where it would normally be assumed that the error check would not fail at runtime, para[0056], ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Craik to incorporate the above feature because this employs multiple processing devices to perform processing tasks by facilitating the execution of multiple processing threads concurrently to more rapidly execute the instructions of a program.
As to claim 15, Craik teaches inserting the register acquire instructions and the register release instructions into at least one of the high register utilization regions but not into the low register utilization regions (para[0048], ln 9-15) for the same reason as to claim 14 above.
As to claim 24, Craik teaches a run-time profiler configured to identify high register utilization and low register utilization regions of the software application (para[0048], ln 9-15) for the same reason as to claim 14 above.
As to claims 23, 25, they are rejected for the same reasons as to claims 14, 15 above.
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Sehr (US 20110029820 A1).
As to claim 16, Sehr teaches the register allocation is determined by runtime profiling of the application code (the base address of the valid memory region may be obtained as a global variable (e.g., an absolute memory location) and/or using a system call to secure runtime environment 114. Moreover, the base address may be stored in one or more free registers (e.g., R8-R15) using register allocation techniques applied during compilation of native code module 118. In particular, the base address may be loaded into a register at each entry to a function and used during stores within the function. Native code module 118 and/or validator 112 may further specify the register to be used as the base address register. For example, calling conventions for different functions within native code module 118 may specify different reserved registers for storing the base address of the valid memory region, para[0065]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Sehr to incorporate the above feature because this maintains control flow integrity for the native code module and constrains store instructions in the native code module by bounding a valid memory region of the native code module with one or more guard regions.
Claim(s) 17, 26 are rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Jackson (GB 2518022 A).
As to claim 17, Jackson teaches configuring a minimum register target for the application code below which the application code is configured to not release registers (One example method would be to pass some sideband with the stack-growing instruction so that the logic later that releases/invalidates physical registers does not release/invalidate the physical register holding the stack pointer which is referenced in the prediction stack. In another example method, the logic which maintains the prediction stack (e.g. the stack pointer value prediction module 820 shown in FIG. 6) signals which registers are in use so that the releasing/invalidating logic does not release/invalidate them. Once the entries containing the particular register ID are flushed from the prediction stack, the physical registers can be invalidated/reused, etc. As physical registers that are referenced in the prediction stack are not invalidated, additional physical registers may be required, with the minimum number of physical registers corresponding to one more than the sum of the number of architectural registers, the maximum number of physical registers that can be referenced in the prediction stack (which is equal to N). Typically, however, a processor may have many more physical registers than this minimum. In various examples, low level control of the physical…, ln 5-27).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Jackson to incorporate the above feature because this provides improved computational performance by executing instructions in a sequence that is different from the order in the program.
As to claim 26, it is rejected for the same reason as to claim 17 above.
Claim(s) 18, 20, 27, 29 are rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Lueh (US 20200394041 A1).
As to claim 18, Lueh teaches configuring the application code with a back-off loop that re-attempts the register acquire instructions on condition that the inter-block register pool is empty or fails to satisfy a configured threshold level (At decision block 1735, a determination is made as to whether the number of registers that have already been acquired by the thread (T) (e.g., Reg_cnt[T]) exceeds LT. If not, the thread is a common thread and the Com_Regs counter is checked to determine whether there are shared registers available in the common pool (e.g., Com_Regs>0), decision block 1740. If there are shared registers available in the pool, the request is fulfilled and the count of the Com_Regs counter is decremented, processing block 1745. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented. At processing block 1790, the acquired register is returned. This marks a successful acquire operation and hardware continues with execution of instruction. However upon a determination at decision block 1740 that Com_Regs=0, the common pool of shared registers is empty. At this point, the current thread is stalled and control is transferred to processing block 1748 to wait to repeat the request. If at decision block 1735, a determination is made Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned, para[0218] to para[0220]/ Fig. 17/ If at decision block 1735, a determination is made Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned. In one embodiment, a thread may be upgraded to a critical state, if it isn't already a critical thread. A thread may be upgraded upon a determination at decision block 1765 that a free critical slot is available (e.g., the CT_Cnt counter indicates that less than the maximum allowed critical threads has been reached). If not, control is forwarded to processing block 1740 where the process proceeds as discussed above for common threads. However upon a determination that there is a critical slot available, the CT_Cnt counter is incremented to indicate a critical thread slot has been filled, processing block 1770, para[0220] to para[0221]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Lueh to incorporate the above feature because this enables the compute resources to access the pooled resources as if they were local.
As to claim 20, Lueh teaches configuring some of the register acquire instructions and register release instructions in the application code to borrow registers from and return registers to an intra-block register pool exclusively for threads in a same thread block; and configuring some of the register acquire instructions and register release instructions in the application code to borrow from and return registers to the inter-block register pool exclusively for thread blocks belonging to the inter-block register pool (para[0197], ln 1-15/ para[0224], ln 1-15) for the same reason as to claim 18 above.
As to claims 27, 29, they are rejected for the same reasons as to claims 18, 20 above.
Claim(s) 19, 28, 33 are rejected under 35 U.S.C. 103 as being unpatentable over Khorasani (US 20180275991 A1) in view of Li (US 20210149673 A1), further in view of Damron (US 20110138372 A1), and further in view of Hari (US 20190102180 A1).
As to claim 19, Hari teaches configuring the application code with a plurality of slow execution versions that implement among them a spectrum of the register spill instructions and the register refill instructions (the second-type integrity verifier 400 can result in significant execution slowdown for workloads in which the register file is a critical resource. This may either reduce the number of threads that run in parallel or increase the number of register spill/fill instructions that save/restore register content to/from local memory to limit the use of physical registers, para[0065], ln 3-14).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Hari to incorporate the above feature because this produces high overhead due to cross-block communication and synchronization overhead.
As to claim 28, it is rejected for the same reason as to claim 19 above.
As to claim 33, it is rejected for the same reason as to claim 12 above. In addition, Hari teaches configuring the application code with a plurality of slow execution versions that implement among them a spectrum of the register spill instructions and the register refill instructions (the second-type integrity verifier 400 can result in significant execution slowdown for workloads in which the register file is a critical resource. This may either reduce the number of threads that run in parallel or increase the number of register spill/fill instructions that save/restore register content to/from local memory to limit the use of physical registers, para[0065], ln 3-14).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Khorasani, Li, and Damron with Hari to incorporate the above feature because this produces high overhead due to cross-block communication and synchronization overhead.
Response to Arguments
A. Applicant's amendment filed on 07/09/2025 has been considered but the arguments are not persuasive:
Applicant argued in substance that:
(1) “Regarding Claims 12, 21, and 30, the combination of Khorasani, Li, and Damron does not provide an obvious teaching to launch execution of a first section of the application code that includes register acquire instructions and register release instructions using a register allocation based on register utilization of a second section of the application code that employs register spill and refill instructions, and only executing the second section of the application code that employs register spill and refill instructions on condition that the register acquire instructions fail.”
(2) “Regarding Claims 18, 27, and 32, Lueh does not describe configuring the application code with a back-off loop that re-attempts the register acquire instructions on condition that the inter-block register pool is empty or fails to satisfy a configured threshold level”.
(3) “None of Khorasani, Li, Damron, or Lueh describe different thread pools for intra-block and inter-block allocation and release, nor do any of these references suggest exclusive configuration of the acquire and release instructions in the code to specific ones of these different register pools. The Examiner may therefore appreciate that the combination of Khorasani, Li, Damron and Lueh does not provide an obvious teaching of Claims 20, 29, and 34.”
B. Examiner respectfully disagrees with Applicant's remarks:
As to point (1), Khorasani teaches (compiler 510 inserts acquire and release instructions into executable code 515 at appropriate locations so as to acquire and release extended register sets in processor 520 for the various executing threads, para[0040], ln 1-6/ When a given thread needs more than the base set of registers to execute a given phase of program code, the given thread executes an acquire instruction to acquire an extended set of registers from a shared resource pool. The extended set of registers is then available for the exclusive use of the given thread for the given phase of program code. When the given thread no longer needs additional registers, the given thread executes a release instruction to release the extended set of registers back into the shared register pool for other threads to use, para[0022], ln 9-21).
Li teaches performing rematerialization operation(s)[section] on program source code to reduce a register pressure , para[0003], ln 1-3/performing rematerialization operations to balance cross-block register pressure prior to instruction scheduling. Such cross-block rematerialization may optimize the basic blocks of the source code prior to instruction scheduling such that instruction scheduling may be performed more efficiently, para[0010], ln 3-10/ perform rematerialization operations to balance cross-block register pressure prior to instruction scheduling. Such cross-block rematerialization may optimize the basic blocks of the source code prior to instruction scheduling such that instruction scheduling may be performed more efficiently (i.e., to produce fewer spills to memory) that increases shader performance. Rematerialization operations can decrease register pressure by increasing the amount of computations performed while reduce a number of registers being used in a particular basic block of code., par[008]/ The compiler 106 may be configured to perform the rematerialization operations prior to instruction scheduling in order to provide basic blocks having reduced register pressure that allow for scheduling results to have reduced memory latency (e.g., due to fewer spills to memory). In this way, the compiled code may be processed by the GPUs 102 more efficiently. Rematerialization may be tightly integrated with register allocation that is performed after instruction scheduling, where rematerialization is used as an alternative to spilling registers to memory. In scenarios where rematerialization is only performed at register allocation, instruction scheduling that is performed prior to register allocation may have inaccurate cross-block register pressures as well as inaccurate intra-block register pressures. 
By performing rematerialization before instruction scheduling (and also optionally after instruction scheduling), the input code provided for instruction scheduling may have balanced cross-block register pressure that helps the instruction scheduling process generate scheduling results with low register pressure. Moreover, performing cross-block rematerialization prior to instruction scheduling to help lower register pressure may have a lower processing cost as compared to performing global instruction scheduling, para[0019] to para[0020]/ The register allocation module 206 may be configured to assign target variables of the code onto the registers 103 of the GPUs 102. For example, the register allocation module 206 may be configured to perform register allocation over a basic block (i.e., local register allocation), over a whole function/procedure (i.e., global register allocation), or across function boundaries. The compiler 106 may be configured to generate target-dependent assembly code, performing register allocation in the process via the register allocation module 206. Further, the compiler 106 may be configured to compile the assembly code into machine code and output the machine code 110 to the plurality of GPUs 102 for processing, para[0031]).
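For illustration only, the cross-block rematerialization described in the Li passages can be sketched on a hypothetical mini-IR. This is the editor's sketch under stated assumptions (the tuple IR, function names, and example values are invented, not from the reference): a cheaply recomputable value defined in one basic block and used in a later one is cloned into the using block, so it no longer occupies a register across the intervening blocks.

```python
# Hypothetical mini-IR: each basic block is a list of (dest, op, operands).
def rematerialize(blocks, var, def_block, use_block):
    """Clone the (cheap) defining instruction of `var` into the using block,
    so the value need not stay live in a register across the blocks in
    between -- reducing cross-block register pressure before scheduling.
    Removing the original definition assumes the cloned use was its only
    remaining use (dead-code elimination would otherwise handle it)."""
    defn = next(ins for ins in blocks[def_block] if ins[0] == var)
    blocks[use_block].insert(0, defn)   # recompute locally in the using block
    blocks[def_block].remove(defn)      # original def no longer needed
    return blocks

blocks = [
    [("t", "add", ("base", 4))],   # block 0 defines t = base + 4
    [("x", "mul", ("y", "z"))],    # block 1: unrelated, yet t stays live here
    [("p", "load", ("t",))],       # block 2 uses t
]
rematerialize(blocks, "t", def_block=0, use_block=2)
# t is now recomputed at the top of block 2; it is no longer live through block 1.
```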
Damron teaches (At block 502, the compiler 304 performs instruction selection on the code of a program[application code] and the flow proceeds to block 503……….. At block 513, the compiler 304 inserts a spill instruction for each of the selected candidate virtual registers immediately after each definition or redefinition of the respective candidate virtual register in the code ……… At block 516, the compiler 304 performs a register allocation phase that assigns which virtual registers will be stored in which specific physical register of the target machine for each point in the code. The register allocation phase may also spill and reload one or more virtual registers if register pressure remains excessive for any point in the code. The flow then proceeds to block 517., para[0058], ln 5-8/ para[0065], ln 1-12/ para[0066]/ FIG. 8B illustrates the allocation of the virtual registers to the physical registers 704 to 709 during the execution of the code that may be set by the processing unit performing register allocation on the code 600J. In the first part of basic block A, virtual integer register I1 is stored in Integer Register 1 704, virtual integer register I2 is stored in Integer Register 2 705, and virtual integer register I3 is stored in Integer Register 3 706. During basic block A, when virtual integer register I3 is no longer live it is no longer stored in Integer Register 3 706, para[0082], ln 1-10/ At block 507, the compiler 304 discards all virtual registers that are referenced in the basic block where the register pressure is excessive. Such virtual registers would require insertion of spill or reload instructions in high pressure areas of the program, so the compiler 304 avoids this issue by not selecting them as candidates, para[0061], ln 1-10).
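For illustration only, the spill-insertion step quoted from Damron (block 513: a spill immediately after each definition or redefinition of a candidate virtual register) can be sketched as a simple pass over a hypothetical instruction list. Names and the tuple format are the editor's assumptions, not the reference's code:

```python
def insert_spills(code, candidates):
    """Sketch of the block-513 step: after every definition or redefinition
    of a selected candidate virtual register, insert a spill instruction
    that stores the register's value to memory.
    `code` is a list of (opcode, dest, srcs) tuples (hypothetical IR)."""
    out = []
    for op, dest, srcs in code:
        out.append((op, dest, srcs))
        if dest in candidates:
            out.append(("spill", dest, ()))   # store dest to its memory slot
    return out

code = [("add", "I1", ("a", "b")),
        ("mul", "I2", ("I1", "c")),
        ("add", "I1", ("I1", "d"))]   # redefinition of I1
spilled = insert_spills(code, candidates={"I1"})
# A spill now follows each of I1's two definitions; I2 is untouched.
```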
Lueh teaches (However, a common thread is not guaranteed acquisition of a shared register and may stall due to shared register starvation[fail], para[0209], ln 5-8/ Thus, LT represents the average number of shared registers that common threads can acquire before resource contention can occur. In one embodiment, the LT value is computed as LT=floor(Rem_Regs/(8−c)), where (8−c) is #common threads when all critical slots are full, para[0211], ln 3-10/ However, Scenario (2) requires cooperation with compiler 1401. When register pressure at barrier exceeds LT, it is possible that all critical threads reach the barrier and stop execution. However only some, or possibly none, of the common threads will be able to climb register pressure level at the barrier due to shared register starvation[fail]. According to one embodiment, compiler 1401 may insert spills to move data from shared registers to memory and release those shared registers in order to mitigate this problem. In such an embodiment, compiler 1401 spills (or releases) a sufficient number of shared registers that register pressure drops to under LT. After the barrier instruction, compiler 1401 fills data back from memory to the shared registers that trigger an acquire operation. Accordingly, a barrier instruction is brought under LT register pressure level, making scenario (2) equivalent to trivial scenario (1), para[0226]/ i.e., when acquisition of a shared register may stall due to shared register starvation, the register acquire fails, as described above).
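For illustration only, the LT formula quoted above (LT=floor(Rem_Regs/(8−c))) and the barrier-spill mitigation can be sketched numerically. The function names and example values are the editor's assumptions; only the formula and the "spill until pressure drops under LT" rule come from the quoted passages:

```python
import math

def lt_threshold(rem_regs, c):
    """LT as quoted from para[0211]: the average number of shared registers
    a common thread can acquire before contention, where (8 - c) is the
    number of common threads when all c critical slots are full."""
    return math.floor(rem_regs / (8 - c))

def spills_needed_at_barrier(pressure, rem_regs, c):
    """Sketch of the para[0226] mitigation: if register pressure at a
    barrier exceeds LT, the compiler spills enough shared registers to
    bring pressure under LT, refilling them after the barrier."""
    lt = lt_threshold(rem_regs, c)
    return max(0, pressure - lt)

# e.g. 24 remaining shared registers and 2 filled critical slots give
# LT = floor(24 / 6) = 4, so pressure 7 at a barrier needs 3 spills.
```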
As to the point(2), Lueh teaches (At decision block 1735, a determination is made as to whether the number of registers that have already been acquired by the thread (T) (e.g., Reg_cnt[T]) exceeds LT. If not, the thread is a common thread and the Com_Regs counter is checked to determine whether there are shared registers available in the common pool (e.g., Com_Regs>0), decision block 1740. If there are shared registers available in the pool, the request is fulfilled and the count of the Com_Regs counter is decremented, processing block 1745. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented. At processing block 1790, the acquired register is returned. This marks a successful acquire operation and hardware continues with execution of instruction. However upon a determination at decision block 1740 that Com_Regs=0, the common pool of shared registers is empty. At this point, the current thread is stalled and control is transferred to processing block 1748 to wait to repeat the request. If at decision block 1735, a determination is made that Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned, para[0218] to para[0220]/ Fig. 17/ If at decision block 1735, a determination is made that Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760.
At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned. In one embodiment, a thread may be upgraded to a critical state, if it isn't already a critical thread. A thread may be upgraded upon a determination at decision block 1765 that a free critical slot is available (e.g., the CT_Cnt counter indicates that less than the maximum allowed critical threads has been reached). If not, control is forwarded to processing block 1740 where the process proceeds as discussed above for common threads. However upon a determination that there is a critical slot available, the CT_Cnt counter is incremented to indicate a critical thread slot has been filled, processing block 1770, para[0220] to para[0221]/ Fig. 17).
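For illustration only, the Fig. 17 acquire flow quoted above can be sketched as a single function. The counter names (Reg_Cnt, LT, Com_Regs, CT_Cnt) are taken from the quoted passages; the dictionary-based state, the MAX_CT name, and the return values are the editor's assumptions, and the stall-and-retry loop (block 1748) is reduced to returning "stall":

```python
def try_acquire(thread, state):
    """Sketch of the quoted Fig. 17 acquire flow: common threads draw from
    the common pool and may stall; a thread over LT either acquires as a
    critical thread, upgrades into a free critical slot, or falls back to
    the common pool."""
    if state["Reg_Cnt"][thread] <= state["LT"]:        # block 1735: does not exceed LT -> common
        if state["Com_Regs"] > 0:                      # block 1740: pool non-empty?
            state["Com_Regs"] -= 1                     # block 1745: grant from common pool
            state["Reg_Cnt"][thread] += 1              # block 1750
            return "acquired"                          # block 1790
        return "stall"                                 # block 1748: wait and repeat the request
    if thread in state["Critical"]:                    # block 1755: already critical
        state["Reg_Cnt"][thread] += 1                  # blocks 1760/1750: guaranteed acquire
        return "acquired"
    if state["CT_Cnt"] < state["MAX_CT"]:              # block 1765: free critical slot?
        state["CT_Cnt"] += 1                           # block 1770: fill the slot
        state["Critical"].add(thread)                  # upgrade to critical state
        state["Reg_Cnt"][thread] += 1
        return "acquired"
    if state["Com_Regs"] > 0:                          # no slot: proceed as a common thread
        state["Com_Regs"] -= 1
        state["Reg_Cnt"][thread] += 1
        return "acquired"
    return "stall"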
As to the point(3), Lueh teaches (dedicated registers 1512 are permanently available to a hardware thread, while shared registers 1514 are acquired and released on demand. In a further embodiment, a kernel that requires more registers than available in its dedicated quota acquires registers from shared registers 1514 (or register pool), which is shared among all threads in execution unit 1413. In yet a further embodiment, an acquired register in the register pool is explicitly released when the register is no longer needed. In such an embodiment, the register is marked for release by software (e.g., a kernel) via a specific instruction in order to free, and return, the register to the register pool, para[0197], ln 1-12/According to one embodiment, a thread may be either in a common state or a critical state. In such an embodiment, a critical thread is guaranteed acquisition of one or more shared registers and never stalls on shared register starvation (e.g., acquire operation always succeeds for a critical thread). However, a common thread is not guaranteed acquisition of a shared register and may stall due to shared register starvation, para[0209], ln 1-9/ If at decision block 1735, a determination is made that Reg_Cnt[T] exceeds LT, a determination is made as to whether the thread is a critical thread in decision block 1755. If the thread is already a critical thread, the thread is allowed to acquire the registers at processing block 1760. At processing block 1750, the count of total registers owned by thread Reg_Cnt[T] is incremented and the acquired register addresses are stored in mapping table 1508. At processing block 1790, the acquired register is returned. In one embodiment, a thread may be upgraded to a critical state, if it isn't already a critical thread.
A thread may be upgraded upon a determination at decision block 1765 that a free critical slot is available (e.g., the CT_Cnt counter indicates that less than the maximum allowed critical threads has been reached). If not, control is forwarded to processing block 1740 where the process proceeds as discussed above for common threads. However upon a determination that there is a critical slot available, the CT_Cnt counter is incremented to indicate a critical thread slot has been filled, processing block 1770, para[0220] to para[0221]).
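For illustration only, the explicit release described in para[0197] (software marks a shared register for release via a specific instruction, returning it to the pool) can be sketched to complement the acquire flow. This is the editor's sketch; in particular, vacating the critical slot once a thread's count drops to LT is an assumption, since the quoted text does not state when a critical slot is freed:

```python
def release(thread, state):
    """Sketch of an explicit release: return one shared register to the
    common pool. Freeing the thread's critical slot when its register
    count falls to LT is an ASSUMPTION, not stated in the quoted text.
    `state` holds Com_Regs, LT, CT_Cnt, Reg_Cnt (dict), Critical (set)."""
    state["Reg_Cnt"][thread] -= 1
    state["Com_Regs"] += 1                     # register goes back to the pool
    if thread in state["Critical"] and state["Reg_Cnt"][thread] <= state["LT"]:
        state["Critical"].discard(thread)      # assumed: downgrade to common
        state["CT_Cnt"] -= 1                   # assumed: critical slot vacated
```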
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Conclusion
US 6243668 B1 teaches forcing a reload of the user-visible register values from the backing store to the register stack and returning to the low level domain to perform a lookup operation in a translation lookaside buffer. It also calls the medium level domain from the low level domain to perform a lookup operation in an address map if the lookup operation in the translation lookaside buffer fails. It further calls the medium level domain if a translated block performs a system call. It switches from the medium level domain to the high level domain only if a translated block for the current instruction to execute does not exist or a system call is performed that needs to access a base instruction set register state. It also returns from the high level domain to the low level domain if the translated block exists in memory for the current instruction to execute.
US 20110138372 A1 teaches the compiler 304 may perform one or more instruction scheduling phases on the code that move the spill and/or load instructions in order to optimize the placement of the spill and/or load instructions in the code. For example, if the compiler 304 inserted the spill and/or load instructions in a frequently executed basic block in the code but the spill and/or load instructions can be moved to a less frequently executed basic block without causing errors, the compiler 304 may perform the one or more scheduling phases on the code to move the spill and/or load instructions to the less frequently executed basic block. By way of another example, if the compiler 304 inserted the spill and/or load instructions at a location within a basic block in the code that will cause delay (such as where a reload instruction for a virtual register is inserted directly preceding a reference to that virtual register) but the spill and/or load instructions can be moved to another location within the basic block without causing errors (such as moving a reload instruction for a virtual register from directly preceding a reference to that virtual register to …).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LECHI TRUONG whose telephone number is (571)272-3767. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Young Kevin, can be reached on (571)270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LECHI TRUONG/ Primary Examiner, Art Unit 2194