Prosecution Insights
Last updated: April 19, 2026
Application No. 17/148,676

METHOD AND APPARATUS WITH ACCELERATOR PROCESSING

Non-Final OA (§103, §112)
Filed: Jan 14, 2021
Examiner: YI, HYUNGJUN B
Art Unit: 2146
Tech Center: 2100 — Computer Architecture & Software
Assignee: Seoul National University R&DB Foundation
OA Round: 5 (Non-Final)
Grant Probability: 18% (At Risk)
OA Rounds: 5-6
To Grant: 4y 7m
With Interview: 49%

Examiner Intelligence

Career Allow Rate: 18% (3 granted / 17 resolved; -37.4% vs TC avg). Grants only 18% of cases.
Interview Lift: +31.7% across resolved cases with interview (strong +32% lift).
Typical Timeline: 4y 7m average prosecution; 39 applications currently pending.
Career History: 56 total applications across all art units.
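
As a quick sanity check, the headline figures above are reproducible from the counts shown. A minimal Python sketch follows, assuming the interview lift is an additive percentage-point adjustment to the baseline grant probability (the tool's exact model is not disclosed):

```python
# Minimal sketch reproducing the dashboard arithmetic above.
# Assumption (not stated by the tool): the interview lift is an additive
# percentage-point adjustment to the baseline grant probability.

granted, resolved = 3, 17
career_allow_rate = granted / resolved        # 0.176... -> displayed as 18%

baseline_grant_probability = 0.18             # headline "Grant Probability"
interview_lift = 0.317                        # "+31.7% Interview Lift"

with_interview = baseline_grant_probability + interview_lift
print(f"Career allow rate: {career_allow_rate:.1%}")   # 17.6%
print(f"With interview:    {with_interview:.1%}")      # 49.7% -> displayed as 49%
```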

Statute-Specific Performance

§101: 26.3% (-13.7% vs TC avg)
§103: 53.9% (+13.9% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 4.7% (-35.3% vs TC avg)
Tech Center averages are estimates • Based on career data from 17 resolved cases
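
The per-statute deltas appear to be simple differences against a single estimated Tech Center baseline. A short sketch of that arithmetic, assuming delta = examiner rate minus TC average in percentage points (the ~40% baseline is inferred, not stated):

```python
# Minimal sketch, assuming each "vs TC avg" delta is the simple difference
# (examiner rate minus estimated Tech Center average) in percentage points.

examiner_rates = {"§101": 26.3, "§103": 53.9, "§102": 12.9, "§112": 4.7}
deltas_vs_tc   = {"§101": -13.7, "§103": 13.9, "§102": -27.1, "§112": -35.3}

for statute, rate in examiner_rates.items():
    implied_tc_avg = rate - deltas_vs_tc[statute]
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_avg:.1f}%")

# Every implied TC average works out to ~40.0%, i.e. the deltas appear to be
# computed against a single ~40% Tech Center estimate for these statutes.
```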

Office Action

§103 §112
DETAILED ACTION

This action is responsive to the claims filed on 09/17/2025. Claims 1-6, 8-15 and 17-23 are pending for examination.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 09/17/2025 has been entered.

Response to Amendments

Applicant’s remarks accompanying the amendments to the claims primarily restate the amended claim language and identify supporting passages in the specification. Aside from conclusory statements that the cited art “fails to teach” the amended features, applicant does not present a substantive traversal that identifies any specific error in the prior art mappings or in the Examiner’s reasoning. As set forth above, the previous prior art combination is interpreted as continuing to teach or render obvious the limitations of the amended claims. Accordingly, applicant’s remarks have been fully considered but are not persuasive, and the rejections under 35 U.S.C. § 103 are maintained.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 1 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as failing to set forth the subject matter which the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the applicant regards as the invention. Claim 1 recites the limitation "and a third access cost level of the L2 memory". There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors.
In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. Claims 1-6, 9, 19, 20-21 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Joydeep et al. (US 20220066931), hereafter referred to as Joydeep, in view of Sinclair et al. (Komuravelli, R., Sinclair, M. D., Alsop, J., Huzaifa, M., Kotsifakou, M., Srivastava, P., ... & Adve, V. S. (2015). Stash: Have your scratchpad and cache it too. ACM SIGARCH Computer Architecture News, 43(3S), 707-719.), hereafter referred to as Sinclair, and in further view of Kannan et al. (US 2014/0149632 A1), hereafter referred to as Kannan. Regarding claim 1, Joydeep teaches the following limitations: An accelerator comprising clusters of processing elements, the processing elements configured to perform an operation associated with an instruction received from a host processor, each cluster comprising a respective L1 memory shared exclusively by the processing elements thereof, wherein each processing element comprises a respective LO memory used exclusively by the corresponding processing element; (Joydeep, paragraph 48, “A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect”, it is shown that a processing element (GPU) is used to perform operations received from a host processor Joydeep, paragraph 122, “The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 456 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one of the L2 and L3 caches are shared by two adjacent cores.”, each core (processing element) has a private L1 cache and that higher level caches (e.g., L2, L3) are shared by subsets of cores, which corresponds to clusters of processing elements with shared L1/L2/L3 memories. A lowest-level per-core memory (registers/local memory) is interpreted as the claimed L0 memory used exclusively by the corresponding processing element.)); and wherein according to the access cost levels, a first of the sub-cores prefetches a first portion of the data into the L2 memory from an external memory, (Joydeep, paragraph 79, “Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, shows that data used by the GPU flows from system memory (external memory) into the on-chip cache hierarchy including the L2 cache and then to the L1 caches of the graphics multiprocessors. 
Joydeep, paragraph 291, “Storing the thread state within the execution unit 1900 enables the rapid pre-emption of threads when those threads become blocked or idle. The instruction fetch/prefetch unit 1903 can fetch instructions from an instruction cache of higher-level execution logic (e.g., instruction cache 1806 as in FIG. 18A). The instruction fetch/prefetch unit 1903 can also issue prefetch requests for instructions to be loaded into the instruction cache based on an analysis of currently executing threads.”, explicitly discloses a dedicated fetch/prefetch unit that issues prefetch requests to move information into a cache ahead of demand. Taken together, paragraph 79 provides the external-memory–to–L2–to–L1 data path, and paragraph 291 provides the prefetch mechanism operating along a cache hierarchy. It would have been understood by a person of ordinary skill in the art that the same fetch/prefetch unit (or analogous sub-core associated with the L2 level) prefetches a first portion of the information from system memory into the L2 cache before it is needed, so that subsequent accesses hit in L2. Accordingly, Joydeep is interpreted as teaching that, according to the hierarchy of access costs, a first sub-core (fetch/prefetch unit) prefetches a first portion of the data from an external memory into the L2 memory.) a second of sub-cores prefetches a second portion of the data from the L2 memory into one of the L1 memories, (Joydeep, paragraph 79, “Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, shows that the same texture data may reside in L2 and is then supplied to the L1 cache of the graphics multiprocessor—i.e., movement of data from L2 into an L1 cache associated with a particular processing cluster. Joydeep, paragraph 291, “Storing the thread state within the execution unit 1900 enables the rapid pre-emption of threads when those threads become blocked or idle. The instruction fetch/prefetch unit 1903 can fetch instructions from an instruction cache of higher-level execution logic (e.g., instruction cache 1806 as in FIG. 18A). The instruction fetch/prefetch unit 1903 can also issue prefetch requests for instructions to be loaded into the instruction cache based on an analysis of currently executing threads.”, This disclosure makes clear that Joydeep’s architecture includes a hardware prefetch unit that proactively issues prefetch requests to move information from a higher-level cache into a nearer cache before the core demands it. A person of ordinary skill in the art would understand that the same prefetch mechanism is used to move information between adjacent cache levels in the hierarchy, including from L2 into the L1 caches of the graphics multiprocessors shown in paragraph 79. Thus, Joydeep is reasonably interpreted as teaching that a second sub-core, implemented as prefetch logic associated with an L1 cache, prefetches a second portion of the data from the shared L2 cache into one of the L1 memories ahead of demand.) 
) Sinclair, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep fails to teach: the processing elements comprising respective sub-cores respectively corresponding to LO memories, (Sinclair, Page 8, section 5, paragraph 1, “The system is composed of multiple CPU and GPU cores, which are connected via an interconnection network. Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores. The stash is located at the same level as the GPU L1 caches and both the cache and stash write their data to the backing L2 cache bank.”, sub-cores (compute units) are stored to corresponding hierarchical memories (stash). Each GPU compute unit (CU) has a local memory organization including L1 and scratchpad on a per-CU basis, and that each node also has a bank of L2. The GPU CUs, together with their directly attached SRAM partitions (including scratchpads), correspond to “sub-cores” respectively associated with local private memories. It is thus reasonably interpreted that these sub-cores correspond to L0-type per-processing-element memories in the claim.) all of the sub-cores configured to prefetch data for the operation, different from instruction, associated with the operation (Sinclair, page 4, col. 1, paragraph 2, “The first time a load occurs to a newly mapped stash address, it implicitly copies the data from the mapped global space to the stash (analogous to a cache miss). Subsequent loads for that address immediately return the data from the stash (analogous to a cache hit, but with the energy benefits of direct addressing).”, loads data, not instructions, from global memory. Page 4, col. 1, paragraph 2, “Similarly, no explicit stores are needed to write back the stash data to its mapped global location. Thus, the stash enjoys all the benefits of direct addressing of a scratchpad on hits (which occur on all but the first access), without the overhead incurred by the additional loads and stores that scratchpads require for explicit data movement.”, it is shown that, upon loads, the stash hardware implicitly moves data (not instructions) between global memory and the per-CU local stash. Since stash operations are performed per GPU CU, and the stash is tightly associated with each CU’s local memory organization, it is reasonably interpreted that the sub-cores are configured to prefetch data (distinct from instructions) used by the operation executed on those sub-cores.) and a third of the sub-cores prefetches a third portion of the data from the one of the L1 memories into the LO memory of its processing element. (Sinclair, page 2, col. 2, Figure 1(a), “First, the hardware must explicitly copy fieldX into the scratchpad. To achieve this, the application issues an explicit load of the corresponding global address to the L1 cache (event 1)… Next, the hardware sends the data value from the L1 to the core’s register (event 4). Finally, the application issues an explicit store instruction to write the value from the register into the corresponding scratchpad address (event 5). 
At this point, the scratchpad has a copy of fieldX and the application can finally access the data (events 6 and 7).”, The scratchpad in Sinclair is a private, per-core local memory and thus corresponds to the claimed L0 memory used exclusively by its processing element, while the cache supplying the data corresponds to the claimed L1 memory. In Sinclair’s sequence, events 1–5 are performed before the core’s later accesses in events 6 and 7 specifically so that those later accesses hit in the scratchpad instead of going back to L1. This explicit copying of data from L1 into the private scratchpad in advance of subsequent use is a software-controlled prefetch of that data into the L0 memory. Accordingly, Sinclair is reasonably interpreted as teaching that a third sub-core associated with the processing element prefetches a third portion of data from one of the L1 memories into the L0 memory of its processing element, as claimed.) It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep with the teachings disclosed Sinclair (i.e., sub-cores corresponding to hierarchical memories for prefetching operational data using stash memory). A motivation for the combination is to provide an architecture that eliminates conflict misses and a fixed access latency (no overhead of searching or comparing tags), allowing for guaranteed hit rate. (Sinclair, page 2, section 1.2, paragraph 1, “Direct addressing also eliminates the pathologies of conflict misses and has a fixed access latency (100% hit rate). Scratchpads also provide compact storage since the software only brings useful data into the scratchpad.”). Kannan, in the same field of hierarchical memory processing, teaches the following limitation which the above fails to teach: prefetch data… based on a first access costs levels of the LO memory memories, a second access cost level of L1 memories, and a third access cost level of the L2 memory, wherein the third access cost level is higher than the second access cost level, and the second access cost level is higher than the first access cost level (Kannan, paragraph 7, “Many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement. These prefetch units are typically tuned to be more aggressive as their proximity to the core decreases, such that the lower-level cache prefetch units run significantly” paragraph 13, “In some embodiments, the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.”, prefetching is explicitly performed to be more aggressive at levels with higher access costs (further away from the core), thus prefetching is based by the different access-cost latencies of a hierarchical memory. Thus, it is shown that prefetch units exist at each cache level and that their prefetch aggressiveness is tuned according to the cache level and its distance from the core. With higher-level caches having higher latency (higher access “cost”) and lower-level caches having lower latency, prefetch behavior is explicitly based on differing access characteristics of levels of the hierarchy. 
This corresponds to the claimed notion of different “access cost levels” for L0, L1, and L2, and that the cost increases with hierarchy level.) It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep and Sinclair with the teachings disclosed by Kannan (i.e., prefetching based on differing access costs). A motivation for the combination is to provide an architecture that reduces the delay of lower level prefetching hits. (Kannan, paragraph 14, “Therefore, before the lower level prefetch units reach a page boundary, the upper level prefetch unit may preemptively generate a prefetch request to retrieve a translation for the next virtual page. Subsequently, after the physical address for the next virtual page has been obtained, the address translation is conveyed to the lower level prefetch unit(s). When the lower level prefetch unit reaches the page boundary, the prefetch units will have the physical address of the next page and may continue prefetching without delay.”).

Regarding claim 2, Joydeep, Sinclair, and Kannan teach the limitations of claim 1. Joydeep further teaches: wherein the sub-cores are further configured to: perform the prefetching based on a data access portion for the operation in the instruction (Joydeep, paragraph 294, “The graphics processor execution units as described herein may natively support instructions in a 128-bit instruction format 2010”, the graphics processor execution unit supports the 128-bit instruction format 2010. It is noted that the instruction prefetch unit 1837/1903 resides on the execution unit as shown in figure 19. Paragraph 298, “The 128-bit instruction format 2010 may also include an access/address mode field 2026, which specifies an address mode and/or an access mode for the instruction. The access mode may be used to define a data access alignment for the instruction… where the byte alignment of the access mode determines the access alignment of the instruction operands.”, the instruction prefetch unit 1837/1903 prefetches instructions in the 128-bit format, which uses access mode field 2026 to specify a data access alignment for the instruction. Therefore, the instruction prefetch unit 1837/1903 prefetches based on a data access portion for the operation in the instruction.).

Regarding claim 3, Joydeep, Sinclair, and Kannan teach the limitations of claim 1. Kannan further teaches: wherein the sub-cores are further configured to perform the prefetching independent of the processing elements (Kannan, paragraph 7, states that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.” Paragraph 13 further explains that “the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.” These passages describe prefetch units as hardware structures associated with the caches that autonomously monitor access patterns and generate prefetch requests at each cache level.
Because these prefetch units operate based on cache-level behavior and prefetch streams, rather than as part of the cores’ instruction execution, a person of ordinary skill in the art would understand that the prefetch units (sub-cores) perform prefetching operations independent of the processing elements’ execution pipelines. Thus, Kannan teaches the claimed sub-cores performing the prefetching independent of the processing elements.) Regarding claim 4, Joydeep, Sinclair, and Kannan teaches the limitations of claim 1. Joydeep further teaches: wherein the processing elements are further configured to: perform the operation associated with the instruction (Joydeep, paragraph 79, states that “in graphics and computing applications, a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.” This teaches that the processing clusters (processing elements) perform their operations using data read from hierarchical memories such as L1 and L2 caches.) Kannan further teaches: using the data prefetched to the hierarchical memories by the sub-cores (Kannan, paragraph 7, explains that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement,” and paragraph 13 clarifies that “the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases,” with “the lower level prefetch units” being “further ahead of the upper level prefetch units in the prefetch stream.” These passages show that dedicated prefetch units at each cache level proactively prefetch data (cache lines) into the corresponding caches ahead of demand. Interpreting these prefetch units as the claimed sub-cores associated with the L2, L1, and L0 memories, Kannan teaches that the data used by the processing elements in Joydeep’s hierarchy is brought into the hierarchical memories by prefetch operations of the sub-cores, satisfying the requirement that the processing elements perform the operation using data prefetched to the hierarchical memories by the sub-cores.) Regarding claim 5, Joydeep, Sinclair, and Kannan teaches the limitations of claim 1. Joydeep further teaches: wherein the sub-cores are further configured to: cooperatively prefetch the data associated with the operation based on a structure of the hierarchical memories (Kannan, paragraph 7, states that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.” Paragraph 13 further explains that “in some embodiments, the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.” These passages describe a cache hierarchy with multiple prefetch units distributed across different cache levels (e.g., L2, L1, and closer private caches). 
Each prefetch unit issues prefetch requests at its own level, and the prefetch streams are coordinated so that deeper prefetch units run further ahead and supply data to the upper-level prefetch units and the demand stream. Thus, the prefetch units (sub-cores) at the various levels cooperate to move data through the hierarchy and their behavior is explicitly determined by the structure of the hierarchical memories (position of each cache level and its latency). Interpreting each prefetch unit as a claimed sub-core at the L2, L1, and L0 levels, Kannan teaches sub-cores that cooperatively prefetch the data associated with an operation based on the hierarchical memory structure, as recited.) Regarding claim 6, Joydeep, Sinclair, and Kannan teaches the limitations of claim 1. Joydeep further teaches: wherein the first portion of the data is shared by all of the processing elements, the second portion of the data is shared by only the processing elements in the cluster corresponding to the one of the memories, and the third portion is used by only the processing element of the LO memory storing the third portion (Joydeep, paragraph 122, “The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 456 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one of the L2 and L3 caches are shared by two adjacent cores.”, it is shown that higher-level caches (such as L2/L3) are shared among multiple cores, while lower-level caches (L1) are private to individual cores. Interpreting the L2/L3 as the level at which a first portion of data is shared among multiple processing elements (cores), and each L1 as shared only within the cluster of processing elements associated with that cache, the remaining private per-core local memory (L0) naturally holds data used only by a single processing element. Accordingly, the sharing pattern of first/second/third portions of data across L2, cluster-shared L1, and private L0 is taught by the hierarchical sharing structure in Joydeep.) Regarding claim 9, Joydeep, Sinclair, and Kannan teaches the limitations of claim 1. Joydeep further teaches: The accelerator of claim 1, being comprised in a user terminal to which data to be recognized through a neural network corresponding to the instruction is input, or a server configured to receive the data to be recognized from the user terminal (Joydeep, paragraph 435, “User input devices, including alphanumeric and other keys, may be used to communicate information and command selections to graphics processor 3904. Another type of user input device is cursor control, such as a mouse, a trackball, a touchscreen, a touchpad, or cursor direction keys to communicate direction information and command selections to GPU and to control cursor movement on the display device.”, it is interpreted under broadest-reasonable interpretation (BRI) that a terminal is a device which a user enters data for a computer system Paragraph 438, “Therefore, the configuration of the computing device 3900 may vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances. 
Examples include (without limitation) a mobile device, a personal digital assistant…, a server, a server array or server farm, a web server, a network server, an Internet server…”, the computing device 3900 which contains user input devices may be comprised on a server. Therefore, the system is comprised in a server configured to receive data to be recognized from user terminal devices). Regarding claim 19, Joydeep teaches the following limitations: An accelerator system comprising: a host processor configured to transmit an instruction to an accelerator, the accelerator comprising: (Joydeep, paragraph 48, “A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect”, discloses that a GPU accelerator is coupled to host cores and accelerates operations delegated by the host via instructions/commands. The GPU’s processing elements are configured to perform operations associated with these instructions, while the host processor transmits those instructions to the accelerator.) and wherein a first of the sub-cores is configured to prefetch a first portion of data, from a memory external to the accelerator into the L2 memory based on a first access cost level thereof, (Joydeep, paragraph 79, “Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, shows that data used by the GPU flows from system memory (external memory) into the on-chip cache hierarchy including the L2 cache and then to the L1 caches of the graphics multiprocessors. Joydeep, paragraph 291, “Storing the thread state within the execution unit 1900 enables the rapid pre-emption of threads when those threads become blocked or idle. The instruction fetch/prefetch unit 1903 can fetch instructions from an instruction cache of higher-level execution logic (e.g., instruction cache 1806 as in FIG. 18A). The instruction fetch/prefetch unit 1903 can also issue prefetch requests for instructions to be loaded into the instruction cache based on an analysis of currently executing threads.”, explicitly discloses a dedicated fetch/prefetch unit that issues prefetch requests to move information into a cache ahead of demand. Taken together, paragraph 79 provides the external-memory–to–L2–to–L1 data path, and paragraph 291 provides the prefetch mechanism operating along a cache hierarchy. It would have been understood by a person of ordinary skill in the art that the same fetch/prefetch unit (or analogous sub-core associated with the L2 level) prefetches a first portion of the information from system memory into the L2 cache before it is needed, so that subsequent accesses hit in L2. Accordingly, Joydeep is interpreted as teaching that, according to the hierarchy of access costs, a first sub-core (fetch/prefetch unit) prefetches a first portion of the data from an external memory into the L2 memory.) 
a second of the sub-cores in a first of the clusters is configured to prefetch a second portion of the data from the L2 memory into the L1 memory of the first of the clusters, (Joydeep, paragraph 79, “Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, shows that the same texture data may reside in L2 and is then supplied to the L1 cache of the graphics multiprocessor—i.e., movement of data from L2 into an L1 cache associated with a particular processing cluster. Joydeep, paragraph 291, “Storing the thread state within the execution unit 1900 enables the rapid pre-emption of threads when those threads become blocked or idle. The instruction fetch/prefetch unit 1903 can fetch instructions from an instruction cache of higher-level execution logic (e.g., instruction cache 1806 as in FIG. 18A). The instruction fetch/prefetch unit 1903 can also issue prefetch requests for instructions to be loaded into the instruction cache based on an analysis of currently executing threads.”, This disclosure makes clear that Joydeep’s architecture includes a hardware prefetch unit that proactively issues prefetch requests to move information from a higher-level cache into a nearer cache before the core demands it. A person of ordinary skill in the art would understand that the same prefetch mechanism is used to move information between adjacent cache levels in the hierarchy, including from L2 into the L1 caches of the graphics multiprocessors shown in paragraph 79. Thus, Joydeep is reasonably interpreted as teaching that a second sub-core, implemented as prefetch logic associated with an L1 cache, prefetches a second portion of the data from the shared L2 cache into one of the L1 memories ahead of demand.) ) Sinclair, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep fails to teach: clusters of processing elements configured to perform an operation associated with the instruction, each cluster comprising a respective L1 memory shared by the processing elements of its cluster, each processing element comprising a respective LO memory used only by its processing element, the accelerator further comprising an L2 memory shared by the clusters, and the processing elements comprising respective sub-cores; (Sinclair, page 8, section 5.1, “The system is composed of multiple CPU and GPU cores, which are connected via an interconnection network. Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores. The stash is located at the same level as the GPU L1 caches and both the cache and stash write their data to the backing L2 cache bank.”, it is shown that the accelerator (GPU) is composed of multiple CUs (clusters) each associated with local memory (L1 and scratchpad) and that L2 banks are shared among the nodes. The scratchpad is a per-CU private memory analogous to an L0 memory used only by that CU’s processing elements. 
Thus Sinclair teaches clusters of processing elements with cluster-shared L1 memories, per-processing-element/private local memories (L0-like scratchpad), and a shared L2, with each processing element having associated sub-core logic responsible for its local memory behavior.) and a third of the sub-cores is configured to prefetch a third portion of the data from the L1 memory of the first of the clusters into the LO memory of the processing element comprising the third sub-core (Sinclair, page 2, col. 2 Figure 1(a), “First, the hardware must explicitly copy fieldX into the scratchpad. To achieve this, the application issues an explicit load of the corresponding global address to the L1 cache (event 1)… Next, the hardware sends the data value from the L1 to the core’s register (event 4). Finally, the application issues an explicit store instruction to write the value from the register into the corresponding scratchpad address (event 5). At this point, the scratchpad has a copy of fieldX and the application can finally access the data (events 6 and 7).”, The scratchpad in Sinclair is a private, per-core local memory and thus corresponds to the claimed L0 memory used exclusively by its processing element, while the cache supplying the data corresponds to the claimed L1 memory. In Sinclair’s sequence, events 1–5 are performed before the core’s later accesses in events 6 and 7 specifically so that those later accesses hit in the scratchpad instead of going back to L1. This explicit copying of data from L1 into the private scratchpad in advance of subsequent use is a software-controlled prefetch of that data into the L0 memory. Accordingly, Sinclair is reasonably interpreted as teaching that a third sub-core associated with the processing element prefetches a third portion of data from one of the L1 memories into the L0 memory of its processing element, as claimed.) It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep with the teachings disclosed Sinclair (i.e., sub-cores corresponding to hierarchical memories for prefetching operational data using stash memory). A motivation for the combination is to provide an architecture that eliminates conflict misses and a fixed access latency (no overhead of searching or comparing tags), allowing for guaranteed hit rate. (Sinclair, page 2, section 1.2, paragraph 1, “Direct addressing also eliminates the pathologies of conflict misses and has a fixed access latency (100% hit rate). Scratchpads also provide compact storage since the software only brings useful data into the scratchpad.”). Kannan, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep and Sinclair fails to teach: wherein the first access cost level is higher than the second access cost level, and the second access cost level is higher than the third access costs level (Kannan, paragraph 7, “Many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement. These prefetch units are typically tuned to be more aggressive as their proximity to the core decreases, such that the lower-level cache prefetch units run significantly” paragraph 13, “In some embodiments, the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. 
As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.”, it is shown that data residing in L1 is explicitly copied under hardware/software control into a per-core scratchpad, which is a private per-processing-element memory. Interpreting the per-core scratchpad as the L0 memory and the control logic moving the data as a “third sub-core,” a third portion of the data is prefetched from L1 of a cluster into the private L0 of a processing element.) It would have been obvious to a person of ordinary skill in the art to have incorporated the teachings disclosed by Joydeep and Sinclair with the teachings disclosed Kannan (i.e., prefetching based on differing access costs). A motivation for the combination is to provide an architecture that reduces the delay of lower level prefetching hits. (Kannan, paragraph 14, “Therefore, before the lower level prefetch units reach a page boundary, the upper level prefetch unit may preemptively generate a prefetch request to retrieve a translation for the next virtual page. Subsequently, after the physical address for the next virtual page has been obtained, the address translation is conveyed to the lower level prefetch unit(s). When the lower level prefetch unit reaches the page boundary, the prefetch units will have the physical address of the next page and may continue prefetching without delay.”). Regarding claim 20, Joydeep, Sinclair, and Kannan teaches the system of claim 19, Kannan further teaches: The accelerator system of claim 19, wherein the first, second, and third sub-cores are further configured to perform their operation independent of the processing elements. (Kannan, paragraph 7, states that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.” Paragraph 13 further explains that “the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.” These passages describe prefetch units as hardware structures associated with the caches that autonomously monitor access patterns and generate prefetch requests at each cache level. Because these prefetch units operate based on cache-level behavior and prefetch streams, rather than as part of the cores’ instruction execution, a person of ordinary skill in the art would understand that the prefetch units (sub-cores) perform prefetching operations independent of the processing elements’ execution pipelines. Mapping the prefetch units at the different cache levels to the first, second, and third sub-cores of claim 19, Kannan therefore teaches that the first, second, and third sub-cores are configured to perform their prefetching operation independent of the processing elements as recited.) Regarding claim 21, Joydeep, Sinclair, and Kannan teaches the system of claim 19, Joydeep further teaches: The accelerator of claim 1, wherein the L0 memories are provided in a number corresponding to the number of the processing elements included in the accelerator, and wherein the L1 memories are provided in a number corresponding to the number of the clusters of the processing elements. (Sinclair, page 8, section 5.1, “Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. 
For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores.”, it is shown that each GPU CU (cluster of processing elements) has its own local L1 cache and scratchpad, and that each CU (node) has its own L2 bank. Because each GPU CU has exactly one L1 and one scratchpad block per local processing unit organization, it is reasonably interpreted that there is one L1 memory per cluster, and one private local memory (scratchpad/L0) per processing element. Thus the number of L0 memories corresponds to the number of processing elements, and the number of L1 memories corresponds to the number of clusters, as claimed.) Regarding claim 23, claim 23 is directed to a method of operating an accelerator that performs the functions of the accelerator product of claim 21. Therefore, the rejection made to claim 21 is applied to claim 23. Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Joydeep, in view of Sinclair and Kannan, as applied to claims 1-6, 9, 18-19, 20-21 and 23 above, and in further view of Aghaei et al. (Aghaei, Babak & Zamanzadeh, Negin. (2016). Evaluation of Cache Coherence Protocols in terms of Power and Latency in Multiprocessors.), hereinafter referred to as Aghaei. Regarding claim 8, Joydeep, Sinclair, and Kannan teaches the limitations of claim 6. Aghaei, in the same field of hierarchical memory implementation, teaches the following limitation which Joydeep, Sinclair, and Kannan fails to teach: wherein the access cost levels correspond to the number of processing elements sharing a corresponding one of the hierarchical memories increases (Aghaei, abstract, “The shared memory multiprocessors suffer with significant problem of accessing shared resources in a shared memory it will result in longer latencies. Consequently, the performance of the system will get affected. With the object of solving the problem of increased access latency due to large number of processors with shared memory, Cache is being used.”, it is well known that the number of processing elements using a shared memory will increase the access cost for that memory. Page 5, section 4.2, “This protocol models two-level cache hierarchy. The L1 cache is private to a core, while the L2 cache is shared among the cores.”, a cache coherency protocol is responsible for reducing access latency due to the increasing number of processing elements. The cache coherency protocol implementation disclosed in section 4.2 uses hierarchical memory.). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep, Sinclair, and Kannan with the teachings disclosed Aghaei (i.e., hierarchical memory access cost increases as the number of processing elements increases). A motivation for the combination is to provide a method for reducing access latency for shared memory resources by utilizing a cache coherence protocol (Aghaei, page 8, “Average latacy of network with different cache coherence protocols is depicted in figur 2. This comparative figure demonstare MOESI_CMP_token protocol has maximum average network lateny. Wherease, MESI_Two_Level and MOESI_CMP_directory protocols have minimum latency.”, Fig. 2 on page 8 shows the effect of different cache coherency protocols on access latency. Cache coherency protocols reduce the overall latency (access cost) and power consumption). 
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Joydeep in view of Sinclair and Kannan, as applied to claims 1-6, 9, 18-19, 20-21 and 23 above, and in further view of Hu et al. (US11599798), hereinafter referred to as Hu.

Regarding claim 10, Joydeep, Sinclair, and Kannan teach the limitations of claim 1. Hu, in the same field of neural network implementation, teaches the following limitation which Joydeep, Sinclair, and Kannan fail to teach: wherein the prefetching performed by the sub-cores are performed by cooperation of the sub-cores, the cooperation based on usage information of hardware resources of the accelerator (Hu, col. 8, lines 7-16, “Next, moDNN computes a schedule for data offloading and prefetching together with convolution process selection. The scheduling goal is to minimize the finish time of the TDFG, maintaining that the memory usage never exceeds the memory budget. Instead of dynamically scheduling the training process, moDNN produces a static schedule for the given DNN and GPU platform. Since the TDFG structure and dependency relations do not change, a static schedule is sufficient to ensure efficient memory usage.”, a cooperation of prefetching by the sub-cores is given through a schedule. The schedule determines prefetching based on memory usage (usage information of hardware resources) so that memory usage never exceeds a predetermined budget). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep, Sinclair, and Kannan with the teachings disclosed by Hu (i.e., cooperative prefetching based on hardware usage). A motivation for the combination is to provide reduced memory usage when prefetching (Hu, paragraph 22, “However, in moDNN, we have designed new heuristics to judiciously schedule data transfers and select convolution operations such that both memory usage and performance are optimized. We also adopt the idea of batch partitioning to cooperate with data transfer scheduling to further reduce memory usage without affecting the accuracy.”).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Joydeep in view of Sinclair, Kannan, and Hu as applied to claim 10 above, and in further view of Sivak et al. (US20200401456), hereinafter referred to as Sivak.

Regarding claim 11, Joydeep, Sinclair, Kannan and Hu teach the limitations of claim 10. Sivak, in the same field of hardware resource management, teaches the following limitation which Joydeep, Sinclair, Kannan and Hu fail to teach: wherein the usage information of the hardware resources includes usage information of an operation resource based on the processing elements, and usage information of one of the memories (Sivak, paragraph 18, “Computing device 100 may monitor one or more of the memory usage, processor usage, input/output operation usage (e.g., number of writes and reads from storage media), storage bandwidth usage (amount of data written/read), network bandwidth usage, and usage of any other appropriate resource for each VM 113 and store this resource usage information in a time series database 125.”, the usage information monitored in Sivak comprises memory usage, processor usage, and operation usage. Paragraph 40, “In some embodiments, the cache memory is shared among various components of the processor 102. In some embodiments, the processor 102 also uses an external cache (e.g., a Level-3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among processor cores 107 using known cache coherency techniques.”, furthermore, memory in Sivak is arranged in a hierarchical format). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep, Sinclair, Kannan, and Hu with the teachings disclosed in Sivak (i.e., usage information based on processing elements and memory). A motivation for the combination is to provide a predictive resource model to accurately predict variations in the usage of a hardware resource (Sivak, paragraph 18, “Each predictive resource usage model may indicate and model predicted variations in the usage of a particular resource by VM 113 over a second time period.”, a predicted resource model is derived from usage information).

Claims 12-15, 18 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Joydeep in view of Gaur et al. (US 20190079877 A1), hereafter referred to as Gaur, and in further view of Kannan and Sinclair.

Regarding claim 12, Joydeep teaches the following limitations: receiving an instruction for performing an operation from a host processor (Joydeep, paragraph 48, “A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect”, it is shown that a processing element (GPU) is used to perform operations received from a host processor); reading, by the accelerator, from an external memory external to the accelerator, data targeted for the operation associated with the instruction (Joydeep, paragraph 79, “Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, data needed for graphics operations is fetched from system memory and other memories into the L2 and L1 caches of the GPU. System memory is external to the GPU accelerator, so this quote teaches the accelerator reading data from an external memory external to the accelerator for use in operations associated with instructions.); and performing, by the processing elements, the operation associated with the instruction based on the data, (Joydeep, paragraph 79, “In graphics and computing applications, a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed.”, processing clusters perform operations from data read from hierarchical memories (L1, L2 caches, etc.).) Gaur teaches the following limitation which Joydeep fails to teach: wherein a first portion of the data is prefetched by a first of the sub-cores respectively from the external memory into the L2 memory based on a first access cost level thereof, a second portion of the data is prefetched by a second of the sub-cores in a first of the clusters from the L2 memory into the L1 memory of the first of the clusters, and a third portion of data is prefetched by a third of the sub-cores from the L1 memory of the first of the clusters into the L0 memory of the processing element comprising the third sub-core (Gaur, paragraph 81, states that processor 1000 has “a plurality of core units 1010a-1010n” and that “each such core may be coupled to a shared cache memory 1015 … part of a cache memory hierarchy providing a prefetch-aware replacement policy,” showing cores, a shared last-level cache, and a prefetch-aware hierarchy fed from external/main memory. Paragraph 46 explains that metadata in MLC 210 includes “a prefetch field 224 … to indicate whether the data of the cache line was brought into MLC 210 in response to a prefetch request or a demand request,” i.e., explicit prefetching of data into a lower cache level. Paragraph 144 further teaches associating different priority indicators with cache lines “having the prefetch-based data” based on “access information” for different usage categories. Interpreting the shared last-level cache 1015 as an L2 memory, MLC 210 as a cluster-visible L1 memory, and per-core private caches as L0 memories, these passages show prefetch-aware controllers at each level (sub-cores) that prefetch portions of data along the path external memory → L2 → L1 → L0, with the prefetch behavior at each level governed by access information (i.e., the effective access-cost characteristics) of that level, as recited.) It would have been obvious to a person of ordinary skill in the art to have incorporated the teachings disclosed by Joydeep with the teachings disclosed by Gaur (i.e., sub-cores corresponding to hierarchical memories for prefetching operational data). A motivation for the combination is to provide a method for further reducing memory latency by introducing both a memory hierarchical structure and prefetching (Gaur, paragraph 2, “One way to mitigate high memory latency is to use a cache memory hierarchy within a processor. Another technique to hide memory latency is to prefetch cache lines into this cache hierarchy.”). Kannan, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep and Gaur fail to teach: wherein the first access cost level is higher than the second access cost level, and the second access cost level is higher than the third access costs level (Kannan, paragraph 7, “Many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.
These prefetch units are typically tuned to be more aggressive as their proximity to the core decreases, such that the lower-level cache prefetch units run significantly” paragraph 13, “In some embodiments, the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.”, it is shown that prefetch units at different cache levels are tuned differently according to their position in the hierarchy and thus the relative latency (access cost) of each level. Lower-level prefetch units (closer to core, lower cost) behave differently from higher-level units (farther from core, higher cost). This corresponds to defining first/second/third access cost levels and to the ordering where access costs increase with the level of the hierarchical memory.) It would have been obvious to a person of ordinary skill in the art to have incorporated the teachings disclosed by Joydeep and Guar with the teachings disclosed Kannan (i.e., prefetching based on differing access costs). A motivation for the combination is to provide an architecture that reduces the delay of lower level prefetching hits. (Kannan, paragraph 14, “Therefore, before the lower level prefetch units reach a page boundary, the upper level prefetch unit may preemptively generate a prefetch request to retrieve a translation for the next virtual page. Subsequently, after the physical address for the next virtual page has been obtained, the address translation is conveyed to the lower level prefetch unit(s). When the lower level prefetch unit reaches the page boundary, the prefetch units will have the physical address of the next page and may continue prefetching without delay.”). Sinclair, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep, Guar, and Kannan fails to teach: the receiving by an accelerator comprising clusters of processing elements, each cluster comprising a respective L1 memory shared by the processing elements of its cluster, each processing element respectively comprising a sub-core and an LO memory used only by its processing element, the accelerator further comprising an L2 memory shared by the clusters, and the processing elements comprising respective sub-cores; (Sinclair, page 7, section 5.1, “The system is composed of multiple CPU and GPU cores, which are connected via an interconnection network. Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores. The stash is located at the same level as the GPU L1 caches and both the cache and stash write their data to the backing L2 cache bank.”, the accelerator includes multiple GPU CUs (clusters), each with its own local SRAM that is partitioned into L1 and scratchpad (per-cluster shared L1 and per-CU scratchpad). The L2 cache bank is shared across nodes. The GPU CUs plus their local scratchpads act as sub-cores with respective private memories (L0), while L1 caches within the GPU CU act as cluster-shared memories, and L2 banks are shared across clusters. 
Sinclair, in the same field of hierarchical memory prefetching, teaches the following limitation which Joydeep, Gaur, and Kannan fail to teach: the receiving by an accelerator comprising clusters of processing elements, each cluster comprising a respective L1 memory shared by the processing elements of its cluster, each processing element respectively comprising a sub-core and an L0 memory used only by its processing element, the accelerator further comprising an L2 memory shared by the clusters, and the processing elements comprising respective sub-cores; (Sinclair, page 7, section 5.1, “The system is composed of multiple CPU and GPU cores, which are connected via an interconnection network. Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores. The stash is located at the same level as the GPU L1 caches and both the cache and stash write their data to the backing L2 cache bank.”; the accelerator includes multiple GPU CUs (clusters), each with its own local SRAM that is partitioned into L1 and scratchpad (per-cluster shared L1 and per-CU scratchpad). The L2 cache bank is shared across nodes. The GPU CUs plus their local scratchpads act as sub-cores with respective private memories (L0), while L1 caches within the GPU CU act as cluster-shared memories, and L2 banks are shared across clusters. Thus Sinclair teaches an accelerator comprising clusters of processing elements, each cluster having a shared L1, each processing element having a private L0, and an L2 shared by multiple clusters, with respective sub-cores.)

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep, Gaur, and Kannan with the teachings disclosed by Sinclair (i.e., sub-cores corresponding to hierarchical memories for prefetching operational data using stash memory). A motivation for the combination is to provide an architecture that eliminates conflict misses and provides a fixed access latency (no overhead of searching or comparing tags), allowing for a guaranteed hit rate (Sinclair, page 2, section 1.2, paragraph 1, “Direct addressing also eliminates the pathologies of conflict misses and has a fixed access latency (100% hit rate). Scratchpads also provide compact storage since the software only brings useful data into the scratchpad.”).
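For illustration only, the sketch below encodes the hierarchy this mapping reads onto Sinclair: an L2 shared by all clusters, one L1 shared inside each cluster, and one private L0 per processing element. It also makes explicit the counting relation discussed later for claim 22 (one L1 per cluster, one L0 per processing element). The dataclasses, sizes, and counts are hypothetical, not the applicant's design or Sinclair's parameters.

# Illustrative sketch only; hypothetical structures and sizes, not from the record.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProcessingElement:
    pe_id: int
    l0_kib: int = 16          # private L0, used only by this element (assumed size)

@dataclass
class Cluster:
    cluster_id: int
    l1_kib: int = 128         # L1 shared by this cluster's elements (assumed size)
    elements: List[ProcessingElement] = field(default_factory=list)

@dataclass
class Accelerator:
    l2_kib: int = 2048        # L2 shared by all clusters (assumed size)
    clusters: List[Cluster] = field(default_factory=list)

def build(num_clusters=4, pes_per_cluster=8):
    acc = Accelerator()
    pe_id = 0
    for c in range(num_clusters):
        cluster = Cluster(cluster_id=c)
        for _ in range(pes_per_cluster):
            cluster.elements.append(ProcessingElement(pe_id))
            pe_id += 1
        acc.clusters.append(cluster)
    return acc

acc = build()
num_l1 = len(acc.clusters)                             # one L1 per cluster
num_l0 = sum(len(c.elements) for c in acc.clusters)    # one L0 per processing element
print(num_l1, num_l0)   # with the assumed counts: 4 L1 memories, 32 L0 memories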
Regarding claim 13, Joydeep, Sinclair, and Kannan teach the limitations of claim 12. Kannan further teaches: wherein the first, second, and third sub-cores are further configured to perform their respective prefetching (Kannan, paragraph 7, states that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.” Paragraph 13 further explains that “the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.” These passages describe prefetch units as hardware structures associated with the caches that autonomously monitor access patterns and generate prefetch requests at each cache level. Because these prefetch units operate based on cache-level behavior and prefetch streams, rather than as part of the cores’ instruction execution, a person of ordinary skill in the art would understand that the prefetch units (sub-cores) perform prefetching operations independent of the processing elements’ execution pipelines. Mapping the prefetch units at the different cache levels to the first, second, and third sub-cores of claim 19, Kannan therefore teaches that the first, second, and third sub-cores are configured to perform their prefetching operation independent of the processing elements, as recited.)

Regarding claim 14, Joydeep, Gaur, Sinclair, and Kannan teach the limitations of claim 12. Kannan further teaches: wherein the first, second, and third sub-cores are further configured to cooperatively perform their respective prefetching (Kannan, paragraph 7, states that “many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement.” Paragraph 13 further explains that “in some embodiments, the prefetch units may be more aggressive in prefetching as their proximity to the processor core(s) decreases. As such, the lower level prefetch units may be further ahead of the upper level prefetch units in the prefetch stream.” These passages describe a cache hierarchy with multiple prefetch units distributed across different cache levels (e.g., L2, L1, and closer private caches). Each prefetch unit issues prefetch requests at its own level, and the prefetch streams are coordinated so that deeper prefetch units run further ahead and supply data to the upper-level prefetch units and the demand stream. Thus, the prefetch units (sub-cores) at the various levels cooperate to move data through the hierarchy, and their behavior is explicitly determined by the structure of the hierarchical memories (the position of each cache level and its latency). Interpreting each prefetch unit as a claimed sub-core at the L2, L1, and L0 levels, Kannan teaches sub-cores that cooperatively prefetch the data associated with an operation based on the hierarchical memory structure, as recited.)

Regarding claim 15, Joydeep, Gaur, Sinclair, and Kannan teach the limitations of claim 12. Joydeep further teaches: The method of claim 12, wherein first portion of the data is shared by all of the processing elements from the L2 memory and the second portion of the data is shared from the L1 memory of the first cluster by all of the processing elements in the first cluster (Joydeep, paragraph 122, “The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 456 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one of the L2 and L3 caches are shared by two adjacent cores.”; it is shown that higher-level caches (L2/L3) are shared across sets of cores, while L1 caches are private to the cores or subsets. Interpreting L2 as the level where a first portion of data is shared across all processing elements, and L1 of a given cluster as shared only among the processing elements in that cluster, the recited method step of sharing first and second portions via L2 and cluster-L1 follows directly from Joydeep’s hierarchical sharing behavior.)
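For illustration only, the small helper below shows the sharing scope this mapping reads onto the hierarchy: a portion staged in L2 is visible to every processing element, while a portion staged in a cluster's L1 is visible only to the elements of that cluster. The function and topology are hypothetical examples, not Joydeep's disclosure or the claimed method.

# Illustrative sketch only; hypothetical helper, not from the cited art.
def visible_elements(level, cluster_id, topology):
    """topology: {cluster_id: [pe_id, ...]}; returns the PEs that can read the portion."""
    if level == "L2":
        return sorted(pe for pes in topology.values() for pe in pes)
    if level == "L1":
        return sorted(topology[cluster_id])
    raise ValueError("L0 portions are private to a single processing element")

topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(visible_elements("L2", cluster_id=None, topology=topology))  # all eight PEs
print(visible_elements("L1", cluster_id=0, topology=topology))     # only cluster 0's PEs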
Regarding claim 18, Joydeep, Sinclair, and Kannan teach the limitations of claim 12. Joydeep further teaches: A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, configure the one or more processors to perform the method of claim 12 (Joydeep, paragraph 373, “Systems and methods may be implemented to manage the above features or that may include aspects of the above features. Non-transitory machine readable media may store instructions that cause processors and/or microcontrollers to provide the above mentioned features.”; the features disclosed in Joydeep may be implemented as a non-transitory machine-readable medium that stores instructions to be executed by processors).

Regarding claim 22, Joydeep, Gaur, Sinclair, and Kannan teach the system of claim 19. Sinclair further teaches: The accelerator of claim 12, wherein the L0 memories are provided in a number corresponding to the number of the processing elements included in the accelerator, and wherein the L1 memories are provided in a number corresponding to the number of the clusters of the processing elements (Sinclair, page 8, section 5.1, “Each GPU Compute Unit (CU), which is analogous to an NVIDIA SM, has a separate node on the network. All CPU and GPU cores have an attached block of SRAM. For CPU cores, this is an L1 cache, while for GPU cores, it is divided into an L1 cache and a scratchpad. Each node also has a bank of the L2 cache, which is shared by all CPU and GPU cores.”; it is shown that each GPU CU (cluster of processing elements) has its own local L1 cache and scratchpad, and that each CU (node) has its own L2 bank. Because each GPU CU has exactly one L1 cache and the scratchpad is organized per local processing unit, it is reasonably interpreted that there is one L1 memory per cluster and one private local memory (scratchpad/L0) per processing element. Thus the number of L0 memories corresponds to the number of processing elements, and the number of L1 memories corresponds to the number of clusters, as claimed.)

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Joydeep in view of Gaur, Kannan, Sinclair, and Aghaei.

Regarding claim 17, Joydeep, Gaur, Kannan, and Sinclair teach the limitations of claim 15. Aghaei, in the same field of hierarchical memory implementation, teaches the following limitation which Joydeep, Gaur, Kannan, and Sinclair fail to teach: wherein the access cost levels correspond to the number of processing elements sharing a corresponding one of the hierarchical memories increases (Aghaei, abstract, “The shared memory multiprocessors suffer with significant problem of accessing shared resources in a shared memory it will result in longer latencies. Consequently, the performance of the system will get affected. With the object of solving the problem of increased access latency due to large number of processors with shared memory, Cache is being used.”; it is well known that the number of processing elements using a shared memory will increase the access cost for that memory. Page 5, section 4.2, “This protocol models two-level cache hierarchy. The L1 cache is private to a core, while the L2 cache is shared among the cores.”; a cache coherency protocol is responsible for reducing access latency due to the increasing number of processing elements. The cache coherency protocol implementation disclosed in section 4.2 uses hierarchical memory.).

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings disclosed by Joydeep, Gaur, Kannan, and Sinclair with the teachings disclosed by Aghaei (i.e., hierarchical memory access cost increases as the number of processing elements increases). A motivation for the combination is to provide a method for reducing access latency for shared memory resources by utilizing a cache coherence protocol (Aghaei, page 8, “Average latacy of network with different cache coherence protocols is depicted in figur 2. This comparative figure demonstare MOESI_CMP_token protocol has maximum average network lateny. Wherease, MESI_Two_Level and MOESI_CMP_directory protocols have minimum latency.”; Fig. 2 on page 8 shows the effect of different cache coherency protocols on access latency. Cache coherency protocols reduce the overall latency (access cost) and power consumption).
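For illustration only, the toy model below approximates the point drawn from Aghaei, namely that the more processing elements share a memory, the higher its effective access cost. The linear contention penalty, base latencies, and sharer counts are assumptions made for the example; they are not Aghaei's analysis or data from the application.

# Illustrative sketch only; a toy contention model with assumed parameters.
def effective_access_cost(base_cycles, num_sharers, contention_cycles_per_sharer=2):
    """Base latency plus an assumed linear queuing penalty per additional sharer."""
    return base_cycles + contention_cycles_per_sharer * max(0, num_sharers - 1)

levels = [("L0", 4, 1), ("L1", 20, 8), ("L2", 80, 32)]   # (name, base cycles, sharers), assumed
for name, base, sharers in levels:
    print(name, effective_access_cost(base, sharers), "cycles")
# Under these assumptions the memory shared by the most elements (L2) has the
# highest access cost level and the private L0 the lowest, matching the ordering
# the rejection attributes to the claim.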
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

1. Lee et al. (Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2011). Prefetch-aware shared resource management for multi-core systems. Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 141–154. DOI: 10.1145/2024723.2000081) - Explores how integrating prefetching into shared resource management on multi-core systems can improve workload fairness and mitigate performance bottlenecks caused by memory interference.

2. Orosa et al. (Orosa, L., Azevedo, R., & Mutlu, O. (2018). AVPP: Address-first value-next predictor with value prefetching for improving the efficiency of load value prediction. ACM Transactions on Architecture and Code Optimization, 15(4), 49. DOI: 10.1145/3239567) - An approach to value prediction that emphasizes address prediction before value prediction, enabling processors to better speculate on memory usage with minimal hardware complexity.

3. Panda et al. (Panda, R., Eckert, Y., Jayasena, N., Kayiran, O., Boyer, M., & John, L. K. (2016). Prefetching techniques for near-memory throughput processors. Proceedings of the 30th International Conference on Supercomputing (ICS), 48–59. DOI: 10.1145/2925426.2926282) - This study examines memory-side prefetching strategies for GPU-based processing-in-memory systems, focusing on improving row buffer utilization and reducing cache misses through pattern-based data fetching.

4. Shakerinava et al. (Bakhshalipour, M., Shakerinava, M., Lotfi-Kamran, P., & Sarbazi-Azad, H. (2019). Bingo spatial data prefetcher. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 399–411. DOI: 10.1109/HPCA.2019.00053) - Introduces a prefetching mechanism which uses both short- and long-term memory access patterns to predict spatial data needs, achieving high prediction accuracy and reduced cache misses in large-scale applications.

5. Shrinivas et al. (US 20110145502 A1) - Meta-data based data prefetching.

6. Touma et al. (Touma, R., Queralt, A., & Cortes, T. (2020). CAPre: Code-Analysis based Prefetching for Persistent Object Stores. arXiv preprint arXiv:2005.11259.)

7. Kougkas et al. (Kougkas, A., Devarajan, H., & Sun, X. H. (2020). I/O acceleration via multi-tiered data buffering and prefetching. Journal of Computer Science and Technology, 35(1), 92–120.)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HYUNGJUN B YI whose telephone number is (703)756-4799. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached on (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.B.Y./
Examiner, Art Unit 2124

/USMAAN SAEED/
Supervisory Patent Examiner, Art Unit 2146

Prosecution Timeline

Jan 14, 2021
Application Filed
Feb 29, 2024
Non-Final Rejection — §103, §112
May 30, 2024
Examiner Interview Summary
May 30, 2024
Applicant Interview (Telephonic)
Jun 05, 2024
Response Filed
Aug 13, 2024
Final Rejection — §103, §112
Nov 07, 2024
Request for Continued Examination
Nov 13, 2024
Response after Non-Final Action
Dec 12, 2024
Non-Final Rejection — §103, §112
Mar 14, 2025
Applicant Interview (Telephonic)
Mar 17, 2025
Examiner Interview Summary
Apr 03, 2025
Response Filed
Jun 11, 2025
Final Rejection — §103, §112
Aug 18, 2025
Applicant Interview (Telephonic)
Aug 18, 2025
Examiner Interview Summary
Sep 17, 2025
Request for Continued Examination
Oct 01, 2025
Response after Non-Final Action
Dec 19, 2025
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12536429
INTELLIGENTLY MODIFYING DIGITAL CALENDARS UTILIZING A GRAPH NEURAL NETWORK AND REINFORCEMENT LEARNING
2y 5m to grant Granted Jan 27, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

5-6
Expected OA Rounds
18%
Grant Probability
49%
With Interview (+31.7%)
4y 7m
Median Time to Grant
High
PTA Risk
Based on 17 resolved cases by this examiner. Grant probability derived from career allow rate.
