Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Detailed Action
Claims 1-15 and 21-25 are currently pending.
Response to Amendment
This action is in response to the Amendment filed on 9/26/2025. The amendment has been entered. Claims 1-5, 7-11 and 13-15 have been amended, claims 16-20 are cancelled and claims 21-25 have been newly added. Claims 1-15 and 21-25 are pending, with claims 1, 9 and 21 being independent in the instant application.
Response to Arguments
Applicant's Arguments/Remarks filed on 9/26/2025, pages 9-12, regarding the 35 U.S.C. 103 rejections have been fully considered and are found persuasive in view of the amended claims and the Arguments/Remarks presented by the Applicant. However, new grounds of rejection are necessitated by Applicant's claim amendments. Therefore, the rejections under 35 U.S.C. 103 are amended in this Office action. (See the analysis below under Claim Rejections - 35 USC § 103.)
Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. The entire reference is considered to provide disclosure relating to the claimed invention. The claims & only the claims form the metes & bounds of the invention. Office personnel are to give the claims their broadest reasonable interpretation in light of the supporting disclosure. Unclaimed limitations appearing in the specification are not read into the claim. Prior art was referenced using terminology familiar to one of ordinary skill in the art. Such an approach is broad in concept and can be either explicit or implicit in meaning. Examiner's Notes are provided with the cited references to assist the applicant to better understand how the examiner interprets the applied prior art. Such comments are entirely consistent with the intent & spirit of compact prosecution.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Claim 15 does not fall within at least one of the four categories of patent eligible subject matter because the claim includes “computer-readable storage medium having program instructions stored thereon”. A review of Applicant’s Specification, para. [0071] provides “In the example of FIG. 7, data processing system 700 includes memory 704. Memory 704 can include computer-readable media in the form of volatile memory, such as random-access memory (RAM) ...”. As the list is non-limiting and the language does not explicitly claim a “non-transitory computer-readable storage medium”, under broadest reasonable interpretation in view of the Specification, the “computer-readable storage medium” can include a signal that stores the instructions and is accessible by a processor. As such, MPEP 2106.03(II) states: “A claim whose BRI covers both statutory and non-statutory embodiments embraces subject matter that is not eligible for patent protection and therefore is directed to non-statutory subject matter.” “For example, the BRI of computer-readable storage medium can encompass non-statutory transitory forms of signal transmission, such as a propagating electrical or electromagnetic signal per se. See In re Nuijten, 500 F.3d 1346, 84 USPQ2d 1495 (Fed. Cir. 2007). When the BRI encompasses transitory forms of signal transmission, a rejection under 35 U.S.C. 101 as failing to claim statutory subject matter would be appropriate. Thus, a claim to a “computer-readable storage medium” that can be a compact disc or a carrier wave covers a non-statutory embodiment and therefore should be rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. See, e.g., Mentor Graphics v. EVE-USA, Inc., 851 F.3d at 1294-95, 112 USPQ2d at 1134 (claims to a "machine-readable medium" were non-statutory, because their scope encompassed both statutory random-access memory and non-statutory carrier waves).”
Examiner respectfully suggests that amending the claims to more explicitly disclaim transitory media, such as by reciting a “non-transitory computer-readable storage medium”, would overcome this rejection. Examiner notes that in view of the 101 rejection for signal per se, the “computer-readable storage medium” is not construed as structure.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
Claims 1, 2, 4-6, 8-12, 14, and 21-24 are rejected under 35 U.S.C. 103 as being unpatentable over Desai et al. (Pub. No. US2003/0167460A1), in view of Fleming et al. (Pub. No. US20190004945A1) and further in view of Yonezawa et al. (Patent No. US6513146B1) and the dissertation “High Level Power Estimation and Reduction Techniques for Power Aware Hardware Design” by Sumit Ahuja (hereinafter Ahuja, dissertation published on May 12, 2010).
Regarding claim 1, Desai teaches a method, comprising: generating a plurality of snapshots for a pipeline of a processor core, (Desai disclosed in page 1 para [0010]: “A fourth form of the invention is a method comprising a determination of a plurality of relative power estimates of a design of a microprocessor,”. In page 2-3 para [0047]: “Vector or Single Instruction/Multiple Data (“SIMD”) processors perform several operations/computations per instruction cycle. … In the preferred embodiment of the invention, all instructions are executed in a single clock cycle, thereby increasing overall processing throughput. Note that other embodiments of the invention may employ pipelining of instruction cycles in order to increase clock rates, … These computations occur in parallel (e.g., in the same instruction or clock cycle) on data vectors that consist of several data elements each. In SIMD processors, the same operation is typically performed on each of the data elements per instruction cycle.”
Since the computations in the above disclosure occur in parallel, i.e., in the same instruction or clock cycle, the same operation is performed on each of the data elements per instruction cycle. This reads on the claim limitation “generating a plurality of snapshots for a pipeline of a processor core” (because each given clock cycle is reasonably interpreted as corresponding to a “snapshot” of the pipeline, so that a plurality of snapshots is generated after the whole instruction is executed over several clock cycles)).
wherein Desai teaches the pipeline includes three or more stages through which instructions of a computer program flow, (Desai disclosed in page 4 para [0056]: “Every instruction can be broken into micro-operations that make up the overall operation. Such micro-operations typically include an instruction memory fetch (access), instruction decode and dispatch (control), data operand fetch (memory or register file access), a sequence of RISC-like operations (that can be implemented in a single instruction cycle), and data result write-back (memory or register file access).”).
wherein Desai teaches the pipeline executes different parts of the instructions in parallel in different ones of the stages, (Desai disclosed in page 4 para [0054]: “Some examples of such compound vector or SIMD instructions include vector add-subtract instruction, which simultaneously computes the addition and subtraction of two data vectors on a per-element basis, as shown in FIG. 5. … All of these compound vector or SIMD instructions are made up of two or more RISC-like vector operations, and increase the useful computation done per instruction cycle, thereby increasing the processing throughput. Further, compound SIMD instructions may be made up of other compound SIMD operations, such as for example, the vector add-subtract instruction includes a vector add-subtract operation. These compound vector or SIMD instructions also simultaneously lower the energy required to implement those computations, because they incur less of the traditional overhead (e.g., instruction fetching, decoding, register file reading and write-back) of vector processor designs, as further described below.”).
Desai teaches for each snapshot, a contribution of each instruction included in the pipeline to the total power consumption based on a stage in which the instruction is located (Desai disclosed in page 8 para [0091]: “In one embodiment, the relative power consumption estimates are obtained by breaking down typical microprocessor operations to the micro-operation level (e.g., memory/register file reads/writes, add/subtract operations, multiply operations, logical MUX operations, etc.,) and associating a relative energy value (i.e., energy consumption value) to each micro-operation. The class of each micro-operation as well as a precision of each micro-operation (especially for parallel processors) determines its associated power consumption, since the operational complexity of the micro-operation is proportional to the number of logical transitions associated with the micro-operation, …”. This disclosure reads the limitation “each estimate specifies, for each instruction included in the pipeline, a contribution of the instruction to the power consumption.”.
It has been disclosed in page 5 para [0061]: “In this embodiment, the compound SIMD instruction can minimize the energy consumption of the addition and subtraction operations by reducing the number of micro-operations, such as register file reads. For example, a vector add instruction and a vector subtraction instruction would require a total of four register file reads while the compound SIMD instruction requires two register file reads.” This disclosure reads the limitation “a contribution of the instruction to the power consumption based on a stage (e.g., register file reading) in which the instruction is located.” Further, in page 5-6 para [0064]: “FIG. 8 illustrates an operational diagram of a Vector Average compound instruction of the present invention. … The vector average compound instruction is a compound SIMD instruction that can be viewed as combining the RISC-type operations of two vector additions, and vector arithmetic shifting. Further, this compound SIMD instruction increases the processing throughput because several sequential RISC steps are combined into one instruction cycle.” This disclosure reads the limitation “a stage in which each other instruction in the pipeline is located”, since vector additions, and vector arithmetic shifting operations are performed within one instruction cycle).
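For illustration only, the micro-operation accounting described in the quoted passages of Desai (paras [0091] and [0061]) can be sketched as follows. The relative energy values, the micro-operation names, and the assumption that the compound instruction performs two write-backs are hypothetical and are not taken from Desai; only the register file read counts (four versus two) come from the quoted text.

```python
# Hypothetical relative energy per micro-operation (arbitrary units).
# These numbers are illustrative assumptions, not values from Desai.
RELATIVE_ENERGY = {
    "instr_fetch": 4.0,
    "decode": 2.0,
    "reg_read": 3.0,
    "add": 1.0,
    "sub": 1.0,
    "reg_write": 3.0,
}


def instruction_energy(micro_ops):
    """Sum the relative energy of every micro-operation in one instruction."""
    return sum(RELATIVE_ENERGY[op] for op in micro_ops)


# Separate vector add and vector subtract instructions: each one fetches,
# decodes, reads two operands, computes, and writes back.
vector_add = ["instr_fetch", "decode", "reg_read", "reg_read", "add", "reg_write"]
vector_sub = ["instr_fetch", "decode", "reg_read", "reg_read", "sub", "reg_write"]

# Compound add-subtract: one fetch and decode, two register file reads,
# both computations, and (assumed here) two write-backs.
vector_add_sub = [
    "instr_fetch", "decode", "reg_read", "reg_read",
    "add", "sub", "reg_write", "reg_write",
]

separate = instruction_energy(vector_add) + instruction_energy(vector_sub)
compound = instruction_energy(vector_add_sub)
separate_reads = (vector_add + vector_sub).count("reg_read")
compound_reads = vector_add_sub.count("reg_read")
```

Under these assumed values the compound instruction needs half the register file reads and a lower relative energy total than the two separate instructions, which is the effect the quoted passage describes.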
However, Desai does not explicitly teach the limitations “each snapshot specifies an instruction included in each stage of the pipeline and a sequence of instructions for the pipeline for a clock cycle; and a stage in which each other instruction in the pipeline is located, and wherein different snapshots specify different sequences”.
wherein Fleming teaches each snapshot specifies an instruction included in each stage of the pipeline and a sequence of instructions for the pipeline for a clock cycle; (Fleming disclosed in page 9 para [0106]: “FIG. 9 shows a detailed block diagram of one such PE: the integer PE. … Each cycle, the scheduler may select an instruction for execution based on the availability of the input and output buffers …”. In para [0117]: “Although the integer PE is implemented as a single-cycle pipeline, other pipelining choices may be utilized.” In page 7 para [0091]: “Certain embodiments include a fabric with new processing elements to support sequential concepts like program ordered memory accesses … Certain embodiments herein include a network implementation that supports single-cycle latency communications, e.g., utilizing (e.g., small) PEs ...”
In page 13 para [0143]: “FIGS. 11C to 11J illustrate support for backup and replay using epochs in the cache/memory subsystem … One backup approach which is feasible is to take snapshots periodically that represent a moment in the execution of the graph and that can be backed up to. … Embodiments of the invention include mechanisms performed in the cache hierarchy to support execution with snapshots and the ability to support backup to the most recent snapshot.” In page 15 para [0162]: “FIG. 14 illustrates a snapshot 1400 of an in-flight, pipelined extraction according to embodiments of the disclosure … In some use cases of extraction, such as checkpointing, latency may not be a concern … In these cases, extraction may be orchestrated in a pipelined fashion.”
The disclosures above of taking snapshots periodically that represent a moment in the execution, and of a snapshot of an in-flight, pipelined extraction, correspond to the claim limitation “each snapshot specifies an instruction included in each stage of the pipeline”. Further, the disclosure that the integer PE (processing element) is implemented as a single-cycle pipeline, and that processing elements support sequential concepts like program ordered memory accesses, reads on the claim limitation “a sequence of instructions for the pipeline for a clock cycle”).
and Fleming teaches a stage in which each other instruction in the pipeline is located, and wherein different snapshots specify different sequences; (Fleming disclosed in page 5 para [0084-0085]: “Certain embodiments herein explicitly decouple the operand input and result output such that memory operators are naturally pipelined and have the potential to produce many simultaneous outstanding requests, e.g., making them exceptionally well suited to the latency and bandwidth characteristics of a memory subsystem. Embodiments of a CSA provide basic memory operations such as load, which takes an address channel and populates a response channel with the values corresponding to the addresses, and a store. Embodiments of a CSA may also provide more advanced operations such as in-memory atomics and consistency operators … FIG. 5 illustrates a program source (e.g., C code) 500 according to embodiments of the disclosure. According to the memory semantics of the C programming language, memory copy (memcpy) should be serialized … Since compilers are to generate statically correct code, they are usually forced to serialize memory accesses. Typically, compilers targeting sequential von Neumann architectures use instruction ordering as a natural means of enforcing program order.”).
Desai and Fleming are analogous art because both relate to estimating power consumption in a design architecture. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Desai and Fleming before him or her, to modify Desai's estimation of the power consumption contributed by an instruction in a pipeline stage to include Fleming's snapshot-based estimates of power consumption for a computer program executing over clock cycles. The suggestion/motivation for doing so is provided by Fleming: “Parallelism is explicit in dataflow graphs and embodiments of the CSA architecture spend no or minimal energy to extract it. In embodiments where the memory is multi-ported and distributed, a CSA may sustain many more outstanding memory requests and utilize more bandwidth than a core. For example, in the case of an integer multiply, a CSA may consume no more than 25% more energy than the underlying multiplication circuit. Relative to one embodiment of a core, an integer operation in that CSA fabric consumes less than 1/30th of the energy per integer operation” (Fleming disclosed in page 7-8 para [0099]). Therefore, it would have been obvious to combine Fleming with Desai to obtain the invention as specified in the instant claim(s).
Neither Desai nor Fleming explicitly teaches the limitations “performing a gate-level simulation of a circuit design for the processor core executing the computer program resulting in pipeline power data specifying, on a per clock cycle basis, estimates of total power consumption of the processor core each including an estimate of power consumption for the pipeline of the processor core summed with an estimate of power consumption for non-pipeline circuitry of the processor core including a direct memory access circuit of the processor core; generating an instruction-based power model executable by computer hardware by correlating the plurality of snapshots with the estimates of total power consumption, and generating non-pipeline state data indicating a state of the non-pipeline circuitry of the processor core for each clock cycle, wherein the states of the non-pipeline circuitry of the processor core are correlated with the snapshots and have a defined contribution to the estimates of total power consumption as specified by the instruction-based power model”.
Yonezawa teaches estimates of total power consumption of the processor core each including an estimate of power consumption for the pipeline of the processor core summed with an estimate of power consumption for non-pipeline circuitry of the processor core including a direct memory access circuit of the processor core; (Yonezawa disclosed in col. 23 lines 20-66: “As is shown in FIG. 22(a), the power analysis system functioning as the instruction set simulator includes a test pattern generator 51 for generating a test pattern for power analysis, a power consumption estimator 52 … The power of a memory is estimated with respect to each of a write operation and a read operation. Thus, a power value of data transition in each register, a power value of each instruction and a power value of memory transfer are obtained. FIGS. 23(a) and 23(b) are diagrams for illustrating a method of analyzing power consumption of a specific instruction of a description in a given program. … As a characteristic of a CMOS device, power consumption of a register is caused by transition of a data (1-to-0 transition or 0-to-1 transition). … Then, the power consumption estimator 52 calculates power P in accordance with a formula below ...”).
Yonezawa teaches generating an instruction-based power model executable by computer hardware by correlating the plurality of snapshots with the estimates of total power consumption, (Yonezawa disclosed in col. 14 lines 28-56: “In this embodiment, a method (an apparatus) employed for S/W and H/W partitioning by using, as an index, power estimation based on an operation description of each module used in design of an LSI … When there is an operation description (such as the C language), power consumption is generally estimated by executing simulation. In contrast, in this embodiment, the operation description is subjected to a syntax analysis without conducting simulation, so as to estimate power consumption of each operation or function by calculating power consumption of modules fragmented by a given processing unit or by obtaining the power consumption of modules from a database. Thus, the automatic S/W and H/W partitioning is aided for attaining power consumption meeting a design index. … In general, power consumption P is calculated by the following formula: P = C · f · V² · α. In this formula, … f indicates an operation frequency, and as the operation frequency is larger, the power consumption P increases.”
In col. 15 lines 1-38: “Power consumption is conventionally estimated through simulation, … Specifically, the number of repeating a process may be sometimes varied depending upon a numerical value determined by an operation conducted in executing a program. … First, with respect to lowering of the operation frequency, power consumption can be reduced, … In order to determine whether or not a parallel operation can be employed in a function, dependence between processes in the function is checked. When the processes are independent of each other, the parallel operation can be employed.”
As discussed above, lowering the operation frequency f in the formula for power consumption P reduces the power consumption, and power consumption is estimated through simulation based on the operations conducted in executing a program. This corresponds to “generating an instruction-based power model executable by computer hardware”. Further, where a parallel operation can be employed, the circuit is implemented by H/W and the operation frequency of the circuit can be lowered through the H/W implementation. This means that the power consumption can be estimated by correlating the plurality of snapshots, i.e., the given clock cycles (each of which is reasonably interpreted as corresponding to a “snapshot” of the pipeline)).
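For illustration only, the power formula quoted from Yonezawa above can be evaluated as in the following minimal sketch. The quoted passage defines only f (operation frequency); reading C as a load capacitance, V as a supply voltage, and α as a switching-activity factor is an assumption made here, as are the function name and all numeric values.

```python
def dynamic_power(capacitance, frequency, voltage, activity):
    """Evaluate P = C * f * V^2 * alpha.

    Only `frequency` is defined in the quoted passage; the meaning of the
    other three parameters is an assumption for this sketch.
    """
    return capacitance * frequency * voltage ** 2 * activity


# Hypothetical operating points: the same circuit at 200 MHz and at 100 MHz.
base_power = dynamic_power(1e-9, 200e6, 1.2, 0.1)
halved_power = dynamic_power(1e-9, 100e6, 1.2, 0.1)
```

Because P is linear in f, halving the operation frequency halves the computed power when the other terms are held fixed, consistent with the quoted statement that power consumption increases as the operation frequency becomes larger.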
and Yonezawa teaches generating non-pipeline state data indicating a state of the non-pipeline circuitry of the processor core for each clock cycle, (Yonezawa disclosed in col. 32 lines 6-17: “FIG. 44 is a diagram for illustrating a method of automatically generating the interface peripheral circuit operation description by using a database and the H/W local memory region information. The database stores memory read/write control circuit information regarding data transfer of the memory and sequence information regarding data transfer control between the processor and the local memory. The interface peripheral circuit operation description corresponding to the interface part H/W is automatically generated by using the database storing the information necessary for generating the interface peripheral circuit operation description and using the extracted memory region information.”).
wherein Yonezawa teaches the states of the non-pipeline circuitry of the processor core are correlated with the snapshots and have a defined contribution to the estimates of total power consumption as specified by the instruction-based power model. (Yonezawa disclosed in col. 30 lines 28-45: “FIG. 37 is a block diagram for showing the structures of a processor and a H/W part generated in this example. The processor 40 includes an instruction memory 41 and a data memory 42, and the instruction memory 41 includes a S/W part 43. … On the other hand, the H/W part 44 including an input data storing memory 45, a H/W controlling register 46 and an operation result storing memory 47 is generated by an operation synthesis tool or the like on the basis of the H/W implemented operation description. In this example, an interface between S/W and H/W required in the S/W and H/W partitioning conducted in designing a system can be automatically generated. Accordingly, with the processing quantity (number of clock cycles) and power consumption reduced by the S/W and H/W partitioning, …”).
Ahuja teaches performing a gate-level simulation of a circuit design for the processor core executing the computer program resulting in pipeline power data specifying, on a per clock cycle basis, (Ahuja disclosed in page 31-32 section 3.2: “PowerTheater is an RTL/gate-level power estimation tool, which provides good accuracy for RTL power estimation with respect to the corresponding gate-level and silicon implementation. … PowerTheater (PT) accepts design description in Verilog, VHDL or mixed verilog and vhdl. Other input required for average power analysis is value change dump in vcd or fsdb format dumped from RTL simulation. … Once the activity of input-output ports and internal signal (if available) is provided, PT performs the probabilistic power estimation by propagating the activity of the missing signals from the activity provided for inputs and outputs.” Further, in page 46 section 4.1: “a particular state of the FSMD model corresponds to the specific part of the implementation model as shown in Fig. 4.1 … Let’s consider an FSMD M with state set S for a test run for T clock cycles, α_s is the total number of times design visits state s during the test. Let τ_si be number of signal toggles during the i-th time the design visits state s, then total number of signal toggles in state s can be represented as … While the design is in state s during the test, energy spent due to toggles in state s, E_s = k · Γ_s · C_s · V², … The total dynamic energy spent in design is assumed to be the sum of energy spent in each state …”.
In the above disclosure, PowerTheater, “an RTL/gate-level power estimation tool”, is used to estimate power consumption from gate-level simulation of a circuit design (since PowerTheater (PT) accepts a design description in Verilog, VHDL or mixed Verilog and VHDL). Page 7 section 1.3 (heading ‘Power Estimation using Statistical Power Models’) discusses developing a suitable power model for designs that are modeled as finite state machines with datapath (FSMD). The disclosure above, “FSMD M with state set S for a test run for T clock cycles; Let τ_si be number of signal toggles during the i-th time the design visits state s, then total number of signal toggles in state s represented in equation (4.1) and energy spent due to toggles in state s is E_s …”, teaches the limitation “processor core executing the computer program resulting in pipeline power data specifying, on a per clock cycle basis”).
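For illustration only, the per-state energy accounting in the quoted section 4.1 of Ahuja can be sketched as follows. The quoted passage elides equation (4.1); taking the total toggle count of a state as the sum of the toggles over all of its visits is the natural reading of the surrounding text but is an assumption here, as are the interpretation of C_s as a per-state capacitance, k as a constant, and every name and numeric value in the code.

```python
def state_toggles(toggles_per_visit):
    """Total toggles in one state: the sum of toggles over each visit.

    This stands in for the elided equation (4.1) and is an assumed reading.
    """
    return sum(toggles_per_visit)


def state_energy(k, toggles_per_visit, capacitance, voltage):
    """E_s = k * Gamma_s * C_s * V^2 for a single FSMD state."""
    return k * state_toggles(toggles_per_visit) * capacitance * voltage ** 2


def total_dynamic_energy(states, k, voltage):
    """Total dynamic energy as the sum of the energy spent in each state."""
    return sum(
        state_energy(k, s["toggles"], s["capacitance"], voltage)
        for s in states.values()
    )


# Hypothetical test run: IDLE is visited twice and EXEC three times.
states = {
    "IDLE": {"toggles": [2, 3], "capacitance": 1.0},
    "EXEC": {"toggles": [10, 12, 8], "capacitance": 2.0},
}
idle_energy = state_energy(0.5, states["IDLE"]["toggles"], 1.0, 1.0)
exec_energy = state_energy(0.5, states["EXEC"]["toggles"], 2.0, 1.0)
total_energy = total_dynamic_energy(states, k=0.5, voltage=1.0)
```

The number of entries in each toggle list plays the role of the visit count for that state, so the model attributes energy to a state in proportion to the switching activity observed while the design occupies it.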
Desai, Fleming, Yonezawa and Ahuja are analogous art because all of them relate to estimating power consumption in a design architecture. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Desai, Fleming, Yonezawa and Ahuja before him or her, to modify Fleming's per-clock-cycle estimation of power consumption to include Ahuja's estimation of power consumption by performing a gate-level simulation of the circuit using the trained computer program. The suggestion/motivation for doing so is provided by Ahuja: “In a power-aware methodology, the power estimation must be accurate, and at the highest possible level of abstraction for faster convergence of the selection process. In this chapter, we show how to endow the HLS itself with the ability to generate RTL with the power-saving features that are normally inserted during gate-level synthesis. This allows realistic power exploration at the RTL, resulting in faster convergence of the micro-architecture selection, without compromising the quality of the generated hardware co-processor.” (Ahuja disclosed in page 82 at 2nd para). Therefore, it would have been obvious to combine Ahuja with Desai, Fleming, and Yonezawa to obtain the invention as specified in the instant claim(s).
Regarding claim 2, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1, wherein Fleming teaches a selected snapshot corresponding to the selected clock cycle. (Fleming disclosed in page 19 para [0181]: “The CSA is a novel computer architecture with the potential to provide enormous performance and energy advantages relative to roadmap processors … In address computation, and especially strided address computation, … less than two bits of input toggle per computation in average for a stride calculation, reducing energy by 50% over a random toggle distribution … the CSA achieves approximately 3x energy efficiency over a core while delivering an 8x performance gain. The parallelism gains achieved by embodiments of a CSA may result in reduced program run times, yielding a proportionate, substantial reduction in leakage energy. … Since embodiments of a CSA are capable of exercising every floating point PE in the fabric at every cycle, it serves as a reasonable upper bound for energy and power consumption, …”.
The disclosure “embodiments of a CSA are capable of exercising every floating point PE in the fabric at every cycle, it serves as a reasonable upper bound for energy and power consumption” corresponds to the limitation “power consumption based on the snapshot for a selected clock cycle” (since each clock cycle at which a floating point PE is exercised in the above disclosure corresponds to a “snapshot”)).
Yonezawa teaches each estimate of total power consumption is specific to a selected clock cycle (Yonezawa disclosed in col. 23 lines 20-66: “As is shown in FIG. 22(a), the power analysis system functioning as the instruction set simulator includes a test pattern generator 51 for generating a test pattern for power analysis, a power consumption estimator 52 … The power of a memory is estimated with respect to each of a write operation and a read operation. Thus, a power value of data transition in each register, a power value of each instruction and a power value of memory transfer are obtained. FIGS. 23(a) and 23(b) are diagrams for illustrating a method of analyzing power consumption of a specific instruction of a description in a given program. … As a characteristic of a CMOS device, power consumption of a register is caused by transition of a data (1-to-0 transition or 0-to-1 transition). … Then, the power consumption estimator 52 calculates power P in accordance with a formula below ...”).
Regarding claim 4, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1, wherein Fleming teaches the plurality of snapshots are generated by compiling the computer program (Fleming disclosed in page 8 para [0102]: “Certain embodiments herein provide for a CSA (e.g., spatial fabric) that is easily programmed (e.g., by a compiler), power efficient, and highly parallel … From a programmability perspective, certain embodiments of the network provide flow controlled channels, e.g., which correspond to the control-dataflow graph (CDFG) model of execution used in compilers. … Certain network embodiments offer both high bandwidth and low latency. Certain network embodiments (e.g., static, circuit switching) provides a latency of 0 to 1 cycle (e.g., depending on the transmission distance)”).
Regarding claim 5, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1, wherein Yonezawa teaches the estimates of total power consumption include a power estimation component corresponding to states of the non-pipeline circuitry of the processor core. (Yonezawa disclosed in col. 23 lines 20-66: “As is shown in FIG. 22(a), the power analysis system functioning as the instruction set simulator includes a test pattern generator 51 for generating a test pattern for power analysis, a power consumption estimator 52 … The power of a memory is estimated with respect to each of a write operation and a read operation. Thus, a power value of data transition in each register, a power value of each instruction and a power value of memory transfer are obtained. FIGS. 23(a) and 23(b) are diagrams for illustrating a method of analyzing power consumption of a specific instruction of a description in a given program. … As a characteristic of a CMOS device, power consumption of a register is caused by transition of a data (1-to-0 transition or 0-to-1 transition). … Then, the power consumption estimator 52 calculates power P in accordance with a formula below ...”).
Regarding claim 6, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1, wherein Yonezawa teaches the non-pipeline state data indicates occurrences of reads from a memory and writes to the memory. (Yonezawa disclosed in col. 23 lines 20-35: “As is shown in FIG. 22(a), the power analysis system functioning as the instruction set simulator includes a test pattern generator 51 for generating a test pattern for power analysis, a power consumption estimator 52 … In the test pattern generator 51, a sufficiently large number of programs are generated so as not to cause an error. With respect to a data line of a register, a test pattern set where respective bits are successively inverted is used for estimation. The power of a memory is estimated with respect to each of a write operation and a read operation.”).
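For illustration only, the kind of estimate described in the quoted Yonezawa passage (register power driven by 1-to-0 and 0-to-1 data transitions, memory power estimated separately for read and write operations) can be sketched as follows. This is a minimal sketch and not Yonezawa's formula, which is elided in the quotation above. The linear cost model, the function names, and the coefficients `e_toggle`, `e_read` and `e_write` are all hypothetical assumptions.

```python
# Illustrative sketch only: a linear cost model is ASSUMED here. The formula
# actually used by Yonezawa's power consumption estimator 52 is elided in the
# quoted passage and is not reproduced.

def bit_transitions(prev: int, curr: int) -> int:
    """Count the register bits that flip (0-to-1 or 1-to-0) between two values."""
    return bin(prev ^ curr).count("1")


def estimate_energy(register_trace, mem_ops, e_toggle, e_read, e_write):
    """Sum hypothetical per-event costs over a register trace and a memory-op trace.

    register_trace: successive integer values held by one register
    mem_ops: sequence of the strings "read" or "write"
    e_toggle, e_read, e_write: hypothetical cost per bit flip, read and write
    """
    toggles = sum(
        bit_transitions(a, b) for a, b in zip(register_trace, register_trace[1:])
    )
    reads = sum(1 for op in mem_ops if op == "read")
    writes = sum(1 for op in mem_ops if op == "write")
    return toggles * e_toggle + reads * e_read + writes * e_write


if __name__ == "__main__":
    trace = [0b0000, 0b1111, 0b1110]  # 4 bit flips, then 1 bit flip
    ops = ["read", "write", "read"]   # 2 reads, 1 write
    print(estimate_energy(trace, ops, e_toggle=1.0, e_read=5.0, e_write=7.0))
```

Under these assumed coefficients, the read and write counts play the role of the "occurrences of reads from a memory and writes to the memory" recited in claim 6, while the toggle count reflects the register data transitions.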
Regarding claim 8, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 7, wherein Ahuja teaches the one or more estimates of stall power consumption are determined from the gate-level simulation. (Ahuja disclosed in page 83-84 section 7.2: “Timing, power or area can be controlled at architecture level by inserting directives in the C code or by applying intelligent scheduling and resource allocation schemes … Also, clock-gating based power reduction sometimes leads to very high fidelity, it is very important to generate clock-gated RTL for these co-processors. One of the case studies presented shows that up to 70% of power reduction in design can be achieved with clock-gating alone. … We introduce pragmas that can be inserted in behavioral code to enable clock-gating of registers during HLS. One should note that clock-gating can be enabled at different granularity from very coarse grain to fine grain such as register bank of 32bit to a single bit register, etc. … In the high-level C description of the design, an architect can control the register power consumption by performing the gating for global/static variables. Similarly a case might arise to do the clock-gating of a register of a particular function … Design automation solutions, where HLS can be guided for register power reduction, will greatly help in providing early power-aware realistic design trade-offs.”
The disclosure above “up to 70% of power reduction in design can be achieved with clock-gating” corresponds to the claim element “estimates of stall power consumption are determined from a gate-level power simulation”).
Regarding claim 9, the same ground of rejection is made as discussed for claim 1; therefore claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja for substantially similar rationale. In addition, claim 9 recites the following limitations:
Desai teaches a system, comprising: a processor configured to initiate operations generating a plurality of snapshots for a pipeline of a processor core, (Desai disclosed in page 1 para [0010]: “A fourth form of the invention is a method comprising a determination of a plurality of relative power estimates of a design of a microprocessor,”. In pages 2-3 para [0047]: “Vector or Single Instruction/Multiple Data (“SIMD”) processors perform several operations/computations per instruction cycle. … In the preferred embodiment of the invention, all instructions are executed in a single clock cycle, thereby increasing overall processing throughput. Note that other embodiments of the invention may employ pipelining of instruction cycles in order to increase clock rates, … These computations occur in parallel (e.g., in the same instruction or clock cycle) on data vectors that consist of several data elements each. In SIMD processors, the same operation is typically performed on each of the data elements per instruction cycle.” It has been discussed in page 1 para [0001] that the present invention relates to vector and Single Instruction/Multiple Data (“SIMD”) processor instruction sets dedicated to facilitate a required throughput of communication algorithms. This disclosure relates to the claim limitation “a system, comprising a processor configured to initiate operations”.
Since the computations in the above disclosure occur in parallel, i.e., in the same instruction or clock cycle, the same operation performed on each of the data elements per instruction cycle reads on the claim limitation “generating a plurality of snapshots for a pipeline of a processor core” (because each given clock cycle is reasonably interpreted as corresponding to a “snapshot” of the pipeline; therefore, a plurality of snapshots is generated when the whole instruction is executed over several clock cycles)).
Regarding claims 10-12 and 14, Desai, Fleming, Yonezawa and Ahuja teach the system of claim 9. Claims 10-12 and 14 incorporate the rejections of claims 2, 4, 6 and 8, respectively; therefore claims 10-12 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja as discussed above for substantially similar rationale.
Regarding claim 21, the same ground of rejection is made as discussed for claims 1 and 9; therefore claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja for substantially similar rationale. In addition, claim 21 recites the following limitations:
Fleming teaches a computer program product comprising: one or more computer-readable storage mediums having program instructions stored thereon, wherein the program instructions are executable by computer hardware to cause the computer hardware to execute operations (Fleming disclosed in page 30 para [0291]: “a non-transitory machine readable medium that stores code that when executed by a machine causes the machine to perform a method comprising any method disclosed herein.”).
Regarding claims 22-24, Desai, Fleming, Yonezawa and Ahuja teach the computer program product of claim 21. Claims 22-24 incorporate the rejections of claims 2, 4 and 6, respectively; therefore claims 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja as discussed above for substantially similar rationale.
Claims 7, 13 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja and further in view of Roy et al. (Pub. No. US2007/0136720A1).
Regarding claim 7, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1; however, these references do not teach the claim limitation “the generating an instruction-based power model includes using one or more estimates of stall power consumption of the processor core for one or more selected cycles for which a stall condition is detected”.
Roy teaches the generating an instruction-based power model includes using one or more estimates of stall power consumption of the processor core for one or more selected cycles for which a stall condition is detected. (Roy disclosed in page 4 para [0035]: “The steps 310 and 312 pertain to determining the estimated energy usage of the program code P. At step 310, a stall energy information is collected from the VLIW DSP core V. The stall energy refers to the energy consumed due to stalls of the VLIW DSP core. The stall energy consumption occurs due to, for example, when the VLIW DSP core waits for a response from a memory subsystem of the VLIW DSP. This can occur, during different stall types, for example, cache misses and contention of the memory sub-system. On collecting the stall energy information, the stall energy “Estall” is determined. For determining the stall energy, let the VLIW DSP core V have q different types of stalls due to the memory sub-system. Let the energy per cycle of the stall type j be Esj. Let the number of cycles due to the stall type j while executing the program code P be cj. Then the stall energy is given by … In an embodiment of the present invention, the energy E of the program code is used to estimate the power ‘P’ of the program code.”
The disclosure above, “stall energy consumption occurs due to, for example, when the VLIW DSP core waits for a response from a memory subsystem of the VLIW DSP”, corresponds to the claim element “stall power consumption of the processor core”. Further, the disclosure “determining the estimated energy usage of the program code P” corresponds to the claim limitation “generating an instruction-based power model”. Roy determines the stall energy in Eq. (9) using “the energy per cycle of the stall type j”, Esj, together with the number of cycles cj due to stall type j; this corresponds to stall power consumption (Esj) for stall type j being generated for each cycle, i.e., for the selected cycles (cj) in which a stall condition is detected).
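For illustration, the stall energy computation described in the quoted Roy passage can be sketched as below. The quotation elides the equation itself, so the sum-of-products form used here is an assumption inferred only from the quoted definitions of Esj (energy per cycle of stall type j) and cj (number of cycles due to stall type j). The function names and the conversion from energy to average power are likewise hypothetical.

```python
# Illustrative sketch only. The sum-of-products form is ASSUMED from the quoted
# definitions of Esj and cj; Roy's actual equation is elided in the quotation.

def stall_energy(energy_per_cycle, stall_cycles):
    """Accumulate the stall energy over q stall types.

    energy_per_cycle: Esj values, one per stall type j
    stall_cycles: cj values, cycles attributed to stall type j
    """
    if len(energy_per_cycle) != len(stall_cycles):
        raise ValueError("one Esj value and one cj value are needed per stall type")
    return sum(e * c for e, c in zip(energy_per_cycle, stall_cycles))


def average_power(total_energy, total_cycles, clock_period_s):
    """Hypothetical conversion: energy divided by elapsed time (cycles * period)."""
    return total_energy / (total_cycles * clock_period_s)


if __name__ == "__main__":
    # Two hypothetical stall types, e.g. cache miss and memory contention.
    e_stall = stall_energy([2.0, 3.0], [10, 4])
    print(e_stall)
    print(average_power(e_stall, total_cycles=16, clock_period_s=1.0))
```

In this sketch, each cj counts only the cycles in which a stall of type j was detected, which is the sense in which the estimate is tied to selected cycles.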
Therefore, Desai, Fleming, Yonezawa, Ahuja and Roy are analogous art because they are related to estimating power consumption in design architecture. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Desai, Fleming, Yonezawa, Ahuja and Roy before him or her, to modify the per-clock-cycle estimation of power consumption of Fleming and Yonezawa to include the stall power consumption estimation for the processor core of Roy. The suggestion/motivation for doing so would have been provided by Roy because “In an embodiment of the present invention, the estimated energy usage of the program code is computed by software that interacts with an instruction set simulator (ISS) of the VLIW-DSP core. The software estimates the energy and power consumed by the program code, at an instruction level. Once the energy and power consumption of the program code are estimated, the program code can be modified to reduce the energy and power consumption.” (Roy disclosed in page 2 para [0013]). Therefore, it would have been obvious to combine Roy with Desai, Fleming, Yonezawa and Ahuja to obtain the invention as specified in the instant claim(s).
Regarding claims 13 and 25, Desai, Fleming, Yonezawa and Ahuja teach the system of claim 9 and the computer program product of claim 21, respectively. Claims 13 and 25 incorporate the rejection of claim 7 because they have substantially similar claim language as claim 7; therefore claims 13 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa, Ahuja and Roy as discussed above for substantially similar rationale.
Claims 3 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa and Ahuja and further in view of Xilinx reference guide (hereinafter Xilinx_Ref, November 15, 2012).
Regarding claim 3, Desai, Fleming, Yonezawa and Ahuja teach the method of claim 1, wherein Xilinx_Ref teaches the non-pipeline circuitry of the processor core includes a switch of the processor core. (The Examiner construes the claim element “switch” as “a memory-mapped switch or a stream switch” in light of para [0034] of the Specification of the current application.
Xilinx_Ref disclosed in pages 35-36 heading ‘AXI4-Stream Interconnect Core Features’: “The AXI4-Stream Interconnect IP contains the following features: … Core switch: … Full slave-side arbitrated crossbar switch … The AXI4-Stream Interconnect core consists of the SI, the MI, and the functional units that include the AXI channel pathways between them. … At the center is the switch that arbitrates and routes traffic between the various devices connected to the SI and MI. The AXI4-Stream Interconnect core also includes other functional units located between the switch and each of the SI and MI interfaces that optionally perform various conversion and storage functions.” It has been discussed in page 7 heading ‘Combining AXI4-Stream and Memory Mapped Protocols’ that a system can be built by combining AXI4-Stream and AXI memory mapped IP together. Often a DMA engine can be used to move streams in and out of memory).
Therefore, Desai, Fleming, Yonezawa, Ahuja and Xilinx_Ref are analogous art because they are related to system architecture that impacts the area and performance of the system. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Desai, Fleming, Yonezawa, Ahuja and Xilinx_Ref before him or her, to modify the generation of non-pipeline state data indicating the state of the non-pipeline circuitry of Yonezawa to include the non-pipeline circuitry of the processor core comprising switches of Xilinx_Ref. The suggestion/motivation for doing so would have been provided by Xilinx_Ref because “The AXI DMA engine provides high performance direct memory access between system memory and AXI4-Stream type target peripherals. The AXI DMA provides Scatter Gather (SG) capabilities, allowing the CPU to offload transfer control and execution to hardware automation. The AXI DMA as well as the SG engines are built around the AXI DataMover helper core (shared sub-block) that is the fundamental bridging element between AXI4-Stream and AXI4 memory mapped buses.” (Xilinx_Ref disclosed in page 43 heading ‘AXI4 DMA Summary’). Therefore, it would have been obvious to combine Xilinx_Ref with Desai, Fleming, Yonezawa and Ahuja to obtain the invention as specified in the instant claim(s).
Regarding claim 15, Desai, Fleming, Yonezawa and Ahuja teach the system of claim 9. Claim 15 incorporates the rejection of claim 3 because claim 15 has substantially similar claim language as claim 3; therefore claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Desai, Fleming, Yonezawa, Ahuja and Xilinx_Ref as discussed above for substantially similar rationale.
Conclusion
8. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The NPL research article “A Precise High-Level Power Consumption Model for Embedded Systems Software” by Mostafa E. A. Ibrahim et al. presents an approach for estimating the power consumption of a VLIW DSP while running a software application. The work aims to precisely estimate the power consumption of the core processor while running a software algorithm at an early stage in the design process. The commercial off-the-shelf VLIW DSP C6416T from Texas Instruments is utilized as the targeted platform. For the L1 program cache power consumption sub-model, different scenarios that arbitrarily vary the program cache miss rate δ are prepared with the aid of the profiler of the C6416T device cycle-accurate simulator. Figure 15 shows the effect of varying the program cache miss rate on the current drawn by the core processor; the best fit for the measured values in Figure 15 is obtained with an R² value of 0.9889. The inter-instruction effects as well as the pipeline stall effects are investigated in the article's proposed model. The validity and precision of the model are demonstrated by estimating the power consumption of many typical algorithms applied in signal and image processing. The proposed model makes it possible to identify the processor functional units that dominantly contribute to the power consumption. Two specific architectural features of the C6416T, namely the Software Pipelined Loop (SPLoop) and Single Instruction Multiple Data (SIMD), are evaluated from the perspective of energy and power consumption.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NUPUR DEBNATH whose telephone number is (571) 272-8161. The examiner can normally be reached M-F 8:00 am - 4:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Renee D Chavez can be reached on (571)270-1104. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NUPUR DEBNATH/Examiner, Art Unit 2186
/RENEE D CHAVEZ/Supervisory Patent Examiner, Art Unit 2186