Prosecution Insights
Last updated: April 19, 2026
Application No. 17/969,397

Optimization of Scratchpad Memory Allocation for Heterogeneous Devices Using A Cooperative Compiler Framework

Final Rejection — §103, §112
Filed: Oct 19, 2022
Examiner: LIN, HSING CHUN
Art Unit: 2195
Tech Center: 2100 — Computer Architecture & Software
Assignee: MediaTek Inc.
OA Round: 2 (Final)
Grant Probability: 59% (Moderate)
OA Rounds: 3-4
To Grant: 3y 4m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 59% (64 granted / 108 resolved; +4.3% vs TC avg)
Interview Lift: +79.8% among resolved cases with an interview (a strong, roughly +80% lift)
Avg Prosecution: 3y 4m typical timeline (37 applications currently pending)
Career History: 145 total applications across all art units

Statute-Specific Performance

§101: 17.1% (-22.9% vs TC avg)
§103: 35.8% (-4.2% vs TC avg)
§102: 6.5% (-33.5% vs TC avg)
§112: 34.0% (-6.0% vs TC avg)
Comparisons are against an estimated Tech Center average; based on career data from 108 resolved cases.

Office Action

§103 §112
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending in this application.

Response to Arguments

Applicant's arguments regarding the rejections of claims 1-20 under 35 U.S.C. 112(b) have been fully considered and are persuasive. The rejections have been withdrawn. However, new 35 U.S.C. 112(b) rejections are applied to claims 1-20 based on the amendments. Applicant's arguments regarding the 35 U.S.C. 101 rejections of claims 1-20 have been fully considered and are persuasive. Applicant's arguments regarding the 35 U.S.C. 103 rejections of claims 1-20 have been fully considered but they are moot in light of the references being applied in the current rejection.

Drawings

The drawings are objected to because the drawings fail to comply with 37 CFR 1.84(q), which is reproduced here:

(q) Lead lines. Lead lines are those lines between the reference characters and the details referred to. Such lines may be straight or curved and should be as short as possible. They must originate in the immediate proximity of the reference character and extend to the feature indicated. Lead lines must not cross each other. Lead lines are required for each reference character except for those which indicate the surface or cross section on which they are placed. Such a reference character must be underlined to make it clear that a lead line has not been left out by mistake. Lead lines must be executed in the same way as lines in the drawing. See paragraph (l) of this section.

Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as "amended." If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 112

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

As per claims 1 and 11 (line numbers refer to claim 1):

Lines 5-6 recite "receiving, by a global optimization manager during compilation of the subgraphs, an indication from each compiler that a corresponding compilation state is read-ready," but this is not supported by the specification. The specification recites in [0032] "after a compiler generates an I/O map, the compiler suspends the compilation process and reports to global optimization manager 600 that the compilation state is ready for data format consistency check". Therefore, the compiler does not indicate to a global optimization manager that a corresponding compilation state is read-ready, but indicates that the compilation state is ready for a data format consistency check.

Lines 13-14 recite "allocating, by the global optimization manager, the SPM to the subgraphs when compilation states of all of the compilers are read-ready," but this is not supported by the specification. Paragraph [0033] recites "When a compiler has both the tensor record and the access record ready, it suspends the compilation process and reports to global optimization manager 600 that the compilation state is ready for SPM allocation. After global optimization manager 600 reads the compilation states from all compilers that have their tensor records and access records ready, it computes the SPM allocation". Therefore, the specification supports that, among the compilers, those that have their tensor and access records ready are ready for SPM allocation. The specification does not support that SPM allocation is contingent upon all compilers being in a read-ready state.

Claims 2-10 and 12-20 are dependent claims of claims 1 and 11, and fail to resolve the deficiencies of claims 1 and 11, so they are rejected for the same reasons.

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

As per claims 1 and 11 (line numbers refer to claim 1):

Line 7 recites "the compiler" but it is unclear which compiler this refers to, since there are a plurality of compilers. Line 10 recites "the compilation by each compiler" and it is unclear if this refers to "compilation of the subgraphs". Line 10 recites "the corresponding compilation state" but it is unclear what this refers to, since each compiler has a corresponding compilation state and there are a plurality of compilers. Lines 14 and 18 recite "the compilers" but it is unclear if this refers to the plurality of compilers. Line 15 recites "the records that are unified across different subgraphs" and line 7 recites "records of tensors in a subgraph"; therefore, it is unclear what "the records" refer to, since "records" in line 7 only refer to records in a single subgraph.

Claims 2-10 and 12-20 are dependent claims of claims 1 and 11, and fail to resolve the deficiencies of claims 1 and 11, so they are rejected for the same reasons.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9 and 11-19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture, hereinafter Zhang), in view of Zheng et al. (US 11809849 B1, hereinafter Zheng), in view of Brady et al. (US 20190391796 A1, hereinafter Brady), and further in view of Pal et al. (OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators, hereinafter Pal). Zhang and Pal were cited in a prior office action.

As per claim 1, Zhang teaches the invention substantially as claimed, including a method comprising: dispatching subgraphs of a neural network model to a plurality of compilers, each compiler to compile for a different one of the heterogeneous devices (B. DNN Compilation paragraph 1 There has been recent work on optimizing DNN performance through DL compilers, which emit optimized code that runs the model efficiently on a target hardware; Abstract we present a DNN inference engine, called DUET, that explores potential concurrent execution opportunities on heterogeneous CPU-GPU architecture for DNN inference; Section IV paragraph 1 divides the DNN computation graph into multiple subgraphs that still allow DL compiler to apply graph-level optimizations; Section IV.A. paragraph 4 A schedule S on a coupled CPU-GPU architecture includes a mapping for each subgraph to either CPU or GPU; Section IV.D. paragraph 1 Once the scheduling decision has been made, DUET instantiates an executor to run the decided schedule, as shown in Figure 9. The executor spawns two child processes to run compiled subgraphs concurrently on CPU and GPU); wherein the corresponding compilation state includes records of tensors in a subgraph compiled by the compiler (Section IV.B.
paragraph 2 For a given subgraph, the profiler builds a micro-benchmark by treating that subgraph as a standalone DNN model and going through the DL compilation pipeline, including generating the target-dependent code through the back-end. The profiler then runs each micro-benchmark on both CPU and GPU for several runs and records the information; Section II.B. paragraph 2 data flow graphs, in which each node represents a tensor operator, and each edge denotes the data dependency between operators; B. DNN Compilation paragraphs 1-2 Fig. 1 shows a typical processing flow of these DL compilers, which consists of five layers: 1) front-end, 2) intermediate representation (IR), 3) graph-level optimization, 4) low-level optimization, and 5) back-end. The front-end transforms high-level DSL of DNNs into compiler-specific IRs. These IRs are usually in the form of data flow graphs, in which each node represents a tensor operator, and each edge denotes the data dependency between operators; Section IV paragraph 1 divides the DNN computation graph into multiple subgraphs that still allow DL compiler to apply graph-level optimizations). Zhang fails to teach a method for allocating scratchpad memory (SPM) to heterogeneous devices for neural network computing: receiving, by a global optimization manager during compilation of the subgraphs, an indication from each compiler that a corresponding compilation state is read-ready; suspending the compilation by each compiler when the corresponding compilation state is read-ready; allocating, by the global optimization manager, the SPM to the subgraphs when compilation states of all of the compilers are read-ready, wherein allocation of the SPM is according to the records that are unified across different subgraphs; and resuming the compilation of the subgraphs with the allocated SPM incorporated into the compilation states of the compilers. However, Zheng teaches receiving, by a global optimization manager during compilation of the subgraphs, an indication from each compiler that a corresponding compilation state is read-ready; suspending the compilation by each compiler when the corresponding compilation state is read-ready (Col. 23 lines 24-34 The compiler can perform the array contraction operation as part of the compilation operation to generate executable instructions for the neural network hardware accelerator. For example, the compiler can perform the array contraction operation after generating program 544 of FIG. 5B. As part of the array contraction operation, the compiler can parse the loop representations in the program and identify a tensor indexed by the induction variables of the loops that has no loop-carried dependency, and then change the indexing of the tensor in the loops such that the elements of the tensor are mapped to a single memory address; Col. 31 lines 48-52 The compiler can determine the modulo operators for the indexing of tensors TO and T1 based on the degree of parallelism supported by the neural network accelerator as well as available memory space; Col. 17 lines 47-49 load into the acceleration engine 312 or cause the acceleration engine 312 to load the compiled code 344; Col. 7 lines 12-28 the compiler can reduce some or all of the initial modulo operators based on whether the total memory footprint by the tensors exceeds the available memory space. 
Specifically, the compiler can determine the live interval of each tensor for which an initial modulo operator is assigned, as well as the size of memory used by the tensor during the live interval. Tensors having overlapping live intervals can indicate that the memory needs to store the tensors simultaneously, whereas tensors that do not have overlapping live intervals need not be stored simultaneously. The compiler can determine the total memory footprint by the tensors based on identifying tensors having overlapping live intervals, as well as their memory footprints. If the total memory footprint of the tensors with the initial modulo operators is below the available memory space, the compiler can stop the global modulo allocation operation.). It would have been obvious to one having ordinary skill in the art before the effective filling date of the claimed invention to have combined Zhang with the teachings of Zheng to improve resource utilization (see Zheng Col. 4 lines 1-21 If a compiler does not take into account the computation resources and memory space assigned to the execution of the neural network operator when scheduling the parallel execution of the different iterations of a neural network operator, the compiler may generate instructions that may either underutilize or overutilize the computation and memory resources, which may lead to inefficient execution of the neural network operator or may affect other operations being performed by the neural network hardware accelerator. Examples described herein provide methods, systems, and other techniques to improve the scheduling of repetitive operations of a neural network operator. The compiler can determine a number of iterations of the operations to be included in a batch, where operations within a batch can be executed in parallel and can access different memory addresses, while different batches are executed sequentially. Moreover, the compiler can determine an address mapping scheme in which the different batches of operations reuse the same set of memory addresses, to reduce the total memory footprint by the neural network operator.). Zhang and Zheng fail to teach a method for allocating scratchpad memory (SPM) to heterogeneous devices for neural network computing: allocating, by the global optimization manager, the SPM to the subgraphs when compilation states of all of the compilers are read-ready, wherein allocation of the SPM is according to the records that are unified across different subgraphs; and resuming the compilation of the subgraphs with the allocated SPM incorporated into the compilation states of the compilers. However, Brady teaches allocating, by the global optimization manager, the SPM to the subgraphs when compilation states of all of the compilers are read-ready ([0083] For instance, target descriptors may be accepted and consumed by example compilers and the compiler may use the information within the target descriptor to flexibly tune the compilation process to the specific hardware architecture of potentially any one of multiple different devices. For instance, the target descriptor may specify which computations resources of a device are comparable performing which types of neural network operations (e.g., specifying that a convolution can be executed on either a SHAVE processor or a hardware accelerator). 
Example target descriptors may further specify the parameters of the operation (e.g., kernel size) that the particular computation resource can support (e.g., specifying that a particular hardware accelerator is limited to kernel sizes of 11×11). These resources are described in a Target Descriptor JSON file which is an input to the compilation; [0080] An example operator model may also define fields for populating attributes determined (through one or more compilation passes) for each of the tensors. For instance, such tensor attribute fields may include fields to store attribute information such as the name of a corresponding memory allocator used to allocate memory for storage of the tensor on the target, the data type of the tensor, flows of the tensor, shape of the tensor, ordering for storage of the tensor, etc. This information may be utilized in other compilation passes (e.g., memory allocation passes) to reserve an appropriate amount of memory to store the tensor, among other example information. For instance, early compilation passes may be utilized to determine attributes of the operations and tensors (using the operator model of the intermediate representation). With this information, additional compilation passes may be performing (using the operator model and/or control model of the IR) to determine which operations are to be performed by which compute resources and in what order. With the assignment of compute resources and operation order set, together with the collection of tensor attribute information through preceding compilation passes, memory allocation passes may be performed (using a data model of the IR) to determine how best to allocate memory to enable fast and efficient use of the tensors to thereby optimize performance of the operations of the neural network by the particular target hardware; [0082] The particular example of FIG. 14 illustrates allocation of memory within the scratchpad memory for a particular buffer (e.g., Buffer 2). Attributes of a particular one of the tensors 1415 (e.g., as described in the operator and/or data models of the intermediate representation) may be consulted to determine, first, which of the available memory resources would be most appropriate for use in storing the tensor. In this example, a particular tensor may be determined (e.g., through one or more compilation passes) to be used in a convolution operation by a subsequent operation performed by the same or nearby compute resource, and may thus be assigned to be stored in scratchpad memory (if available). One or more compilation passes may further utilize models of the intermediate representation to determine attributes of the tensor (e.g., its block size, padding used in the tensor, stride applied in the operation, whether the tensor (e.g., its constituent component matrices 1415a-c) should be stored in contiguous memory to optimize performance, among other example information. Determining this information can allow a size (e.g., 1420) of a buffer to be determined, which would be sufficient to store the tensor. Compilation passes may determine similar information for each of the tensors in the data model, and memory allocator objects (e.g., 1405, 1410) may extract this information and define buffers to identify the amount of memory to “reserve” or allocate for storage of each of the tensors during execution of the neural network. 
Memory allocation compilation passes may further act to affirmatively define address ranges in the target's memory where each buffer is to be implemented, and this information may be defined within the binary executable passed to and used by the target machine learning device.). It would have been obvious to one having ordinary skill in the art before the effective filling date of the claimed invention to have combined Zhang and Zheng with the teachings of Brady to promote efficiency (see Brady [0080] With the assignment of compute resources and operation order set, together with the collection of tensor attribute information through preceding compilation passes, memory allocation passes may be performed (using a data model of the IR) to determine how best to allocate memory to enable fast and efficient use of the tensors to thereby optimize performance of the operations of the neural network by the particular target hardware.). Zhang, Zheng, and Brady fail to teach a method for allocating scratchpad memory (SPM) to heterogeneous devices for neural network computing: wherein allocation of the SPM is according to the records that are unified across different subgraphs; and resuming the compilation of the subgraphs with the allocated SPM incorporated into the compilation states of the compilers. However, Pal teaches a method for allocating scratchpad memory (SPM) to heterogeneous devices for neural network computing: wherein allocation of the SPM is according to the records that are unified across different subgraphs (Figs. 5, 6; Abstract paragraph 2 propose OnSRAM, a novel SPM management framework integrated with the compiler runtime of a DL accelerator…We integrate OnSRAM with TensorFlow and analyze it on multiple accelerator configurations; Section 3.1 paragraph 1 Figure 2 shows the parameterized architectural template that we use to describe AI accelerators in OnSRAM…. The architectural template has broad coverage and can describe a range of AI accelerators such as [10, 19, 63, 94, 97], among others; Section 1 paragraph 1 One optimization that has received relatively little attention is the effective management of the on-chip SPM across the nodes or layers of a DNN; Section 5.1.1 paragraph 6 A new entry corresponding to dsout is added to the data structure table and linked from the history buffer. If the history buffer yielded a matching data structure (dssim), then the predicted liveness, FoM and isPinCand of dssim are copied to dsout; Section 5.1.1 paragraph 3 a binary field (isPinCand) denoting if we should attempt to pin a data structure with similar characteristics in SPM; Section 5.1 paragraph 1 As the criticalities of data structures are unknown during offload, we propose to predict the data structure's characteristics based on the history of patterns observed. This is motivated by the observation that although operations arrive onebyone, most DNNs have recurring patterns of operations. e.g., in ResNet [38], the basic recurring pattern is a residual block which comprises of multiple convolutions, batch normalizations, ReLUs, and additions. 
Thus, we propose to learn the reuse, liveness, and criticality of data structures for the first few times the pattern repeats, and then use it for efficient SPM management during the remainder of execution; Section 4.1 paragraph 1 activation-bound operations, or nodes whose execution time would improve if either the input/output activation is pinned in SPM (e.g., most convolutions and activation functions); Section 3.2 paragraph 1 Pinning Activations. In OnSRAM, we explore the optimization where the activations produced by a node, if capacity permits, can be pinned in the on-chip SPM as opposed to a write-back to external memory. Clearly, this reduces the bandwidth requirement and potentially boosts performance for this node, as well as future nodes that use this activation as their input); resuming the compilation of the subgraphs with the allocated SPM incorporated into the compilation states of the compilers (pg. 86:9 paragraph 3 Overall, OnSRAM-Static produces an SPM management plan for the accelerator’s runtime, and a preferred node execution sequence for the DL framework’s graph scheduler. For a given DNN, OnSRAM-Static is invoked only during graph compilation; pg. 86:7 4 SPM Management with Static DNN Graphs We propose OnSRAM-Static to manage the on-chip SPM of AI accelerators under the static execution model, wherein the DNN is represented as a graph. OnSRAM-Static receives an optimized graph of the DNN from the DL framework (e.g., TensorFlow’s ProtoBuf graph), and the key hardware parameters and constraints of the target AI accelerator (SPM capacity, external memory bandwidth, etc.) as its inputs. It derives (i) a graph execution plan that maximizes data reuse in the SPM, and (ii) an SPM management plan for each data structure as a tuple comprising of: — Pinned Location (PinLoc): Whether the data structure is pinned in the SPM or in external memory. — Start timestep (StartTS): A logical execution timestep at which the data structure gets allocated in the SPM. — End timestep (EndTS): A logical execution timestep at which the data structure is discarded from the SPM. Since OnSRAM-Static lies in the critical path of the compiler runtime, we use the application insights to design a heuristic-based approach that is low-cost and yet achieves most of the benefits from on-chip SPM management.). It would have been obvious to one having ordinary skill in the art before the effective filling date of the claimed invention to have combined Zhang, Zheng, and Brady with the teachings of Pal to reduce energy consumption (see Pal Section 7.1.4 The main source of energy reduction is the ∼50× difference in per-bit access energy between the off-chip DRAM and on-chip SRAM. This is leveraged by OnSRAM through reduced off-chip accesses using intelligent SPM management.). As per claim 2, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Pal teaches wherein allocating the SPM further comprises: performing global optimization of SPM allocation based on the compilation states of the plurality of compilers (Figs. 5 and 6; Section 3.2 paragraph 2 The focus of OnSRAM is primarily to optimize the on-chip reuse of activations; Section 5.1.1 paragraph 6 A new entry corresponding to dsout is added to the data structure table and linked from the history buffer. 
If the history buffer yielded a matching data structure (dssim), then the predicted liveness, FoM and isPinCand of dssim are copied to dsout; Section 5.1.1 paragraph 3 a binary field (isPinCand) denoting if we should attempt to pin a data structure with similar characteristics in SPM; Section 5.1 paragraph 1 As the criticalities of data structures are unknown during offload, we propose to predict the data structure's characteristics based on the history of patterns observed. This is motivated by the observation that although operations arrive onebyone, most DNNs have recurring patterns of operations. e.g., in ResNet [38], the basic recurring pattern is a residual block which comprises of multiple convolutions, batch normalizations, ReLUs, and additions. Thus, we propose to learn the reuse, liveness, and criticality of data structures for the first few times the pattern repeats, and then use it for efficient SPM management during the remainder of execution; Section 3.2 paragraph 1 Pinning Activations. In OnSRAM, we explore the optimization where the activations produced by a node, if capacity permits, can be pinned in the on-chip SPM as opposed to a write-back to external memory. Clearly, this reduces the bandwidth requirement and potentially boosts performance for this node, as well as future nodes that use this activation as their input; Section 8 paragraph 5 a memory reuse algorithm implemented as a static compiler pass, similar to the register allocation problem in traditional compilers’ backend passes; abstract paragraphs 1-2 inter-node on-chip scratchpad memory (SPM) management in Deep Learning (DL) accelerators, whose significance is bolstered by the recent trends in complex network topologies and the emergence of eager execution in DL frameworks.We characterize and show that there exists up to a 5.2× performance gap in DL inference to be bridged using SPM management and propose OnSRAM, a novel SPM management framework integrated with the compiler runtime of a DL accelerator.). As per claim 3, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Zhang teaches wherein each compiler is target device specific and is operative to compile a corresponding subgraph of the neural network model into a subcommand to run on a heterogeneous device (B. DNN Compilation paragraph 1 There has been recent work on optimizing DNN performance through DL compilers, which emit optimized code that runs the model efficiently on a target hardware; Abstract we present a DNN inference engine, called DUET, that explores potential concurrent execution opportunities on heterogeneous CPU-GPU architecture for DNN inference; Section IV paragraph 1 divides the DNN computation graph into multiple subgraphs that still allow DL compiler to apply graph-level optimizations; Section IV.A. paragraph 4 A schedule S on a coupled CPU-GPU architecture includes a mapping for each subgraph to either CPU or GPU; Section IV.D. paragraph 1 Once the scheduling decision has been made, DUET instantiates an executor to run the decided schedule, as shown in Figure 9. The executor spawns two child processes to run compiled subgraphs concurrently on CPU and GPU). As per claim 4, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Zhang teaches wherein each compilation state includes a tensor record that indicates attributes of tensors in a corresponding subgraph (B. DNN Compilation paragraphs 1-2 Fig. 
1 shows a typical processing flow of these DL compilers, which consists of five layers: 1) front-end, 2) intermediate representation (IR), 3) graph-level optimization, 4) low-level optimization, and 5) back-end. The front-end transforms high-level DSL of DNNs into compiler-specific IRs. These IRs are usually in the form of data flow graphs, in which each node represents a tensor operator, and each edge denotes the data dependency between operators; Section IV paragraph 1 divides the DNN computation graph into multiple subgraphs that still allow DL compiler to apply graph-level optimizations). As per claim 5, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Zhang teaches wherein each compilation state includes an access record that identifies an input tensor and an output tensor of a neural network operation in a corresponding subgraph (Fig. 7; Section IV.A paragraph 2 A phased schedule executes a DAG in a sequence of phases S1,S2,S3,…St,…, where each phase St represents a non-overlapping subset of nodes; Section IV.A paragraph 1 DNN inference computations is often transformed into compiler-specific IRs, in the form of directed acyclic graphs (DAG). For a given DAG G, each node vi∈G is an operator (e.g., matmul, softmax) in the DNN, and each edge (vi,vj)∈G establishes a dependency between the output of operator vi and the input of operator vj. A valid execution schedule of the DAG determines an execution order of its nodes that satisfies all the dependencies; B. DNN Compilation paragraphs 1-2 Fig. 1 shows a typical processing flow of these DL compilers, which consists of five layers: 1) front-end, 2) intermediate representation (IR), 3) graph-level optimization, 4) low-level optimization, and 5) back-end. The front-end transforms high-level DSL of DNNs into compiler-specific IRs. These IRs are usually in the form of data flow graphs, in which each node represents a tensor operator, and each edge denotes the data dependency between operators; Section IV paragraph 1 divides the DNN computation graph into multiple subgraphs that still allow DL compiler to apply graph-level optimizations; Section IV.A paragraph 5 When doing partitioning, we note that there are cases where multiple nodes consume the same input, i.e., a shared node in the DAG. We handle this situation by creating replicated placeholders in different branches but let them all point to the same input stream.). As per claim 6, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Pal teaches wherein unifying the records comprises: unifying tensor IDs that identify a same object into a unified tensor ID; unifying tensor records into a unified tensor record based on the unified tensor ID; and unifying access records into a unified access record based on the unified tensor ID (Figs. 5, 6; Section 5.1.1 paragraph 6 A new entry corresponding to dsout is added to the data structure table and linked from the history buffer. If the history buffer yielded a matching data structure (dssim), then the predicted liveness, FoM and isPinCand of dssim are copied to dsout; Section 5.1.1 paragraph 3 a binary field (isPinCand) denoting if we should attempt to pin a data structure with similar characteristics in SPM; A data structure is a tensor.). As per claim 7, Zhang, Zheng, Brady, and Pal teach the method of claim 6. Pal teaches wherein the unified access record indicates lifetime information of each tensor in the unified tensor record, and allocating the SPM is based on, at least in part, the lifetime information (Fig. 
5; Section 4.2 paragraph 1 We define liveness as the number of timesteps between the creation and last use of a data structure; Section 5.2 paragraph 1 Thus, we propose to learn the reuse, liveness, and criticality of data structures for the first few times the pattern repeats, and then use it for efficient SPM management during the remainder of execution; Section 1 paragraph 9 Given the on-chip SPM capacity and external memory bandwidth, it analyses the liveness, reuse and computational significance of each data structure and determines which ones can lucratively be held on-chip to maximize the overall performance; Section 4.2 paragraph 7 it is determined whether the SPM has enough capacity to hold the data structure for all the timesteps that it is alive). As per claim 8, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Pal teaches further comprising: writing back results of SPM allocation to the compilation states of the plurality of compilers for the compilers to resume compiling (Fig. 5; Section 5.1.1 paragraph 6 A new entry corresponding to dsout is added to the data structure table and linked from the history buffer. If the history buffer yielded a matching data structure (dssim), then the predicted liveness, FoM and isPinCand of dssim are copied to dsout; Section 5.1.1 paragraph 3 a binary field (isPinCand) denoting if we should attempt to pin a data structure with similar characteristics in SPM; Section 1 paragraph 9 we propose OnSRAM, a compiler extension that integrates within DL frameworks to manage the on-chip memory of AI accelerators). As per claim 9, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Zhang teaches wherein the compilation states include respective I/O maps that identify input and output tensors (Fig. 7; Section IV.A paragraph 2 A phased schedule executes a DAG in a sequence of phases S1,S2,S3,…St,…, where each phase St represents a non-overlapping subset of nodes; Section IV.A paragraph 1 DNN inference computations is often transformed into compiler-specific IRs, in the form of directed acyclic graphs (DAG). For a given DAG G, each node vi∈G is an operator (e.g., matmul, softmax) in the DNN, and each edge (vi,vj)∈G establishes a dependency between the output of operator vi and the input of operator vj. A valid execution schedule of the DAG determines an execution order of its nodes that satisfies all the dependencies; B. DNN Compilation paragraphs 1-2 Fig. 1 shows a typical processing flow of these DL compilers, which consists of five layers: 1) front-end, 2) intermediate representation (IR), 3) graph-level optimization, 4) low-level optimization, and 5) back-end. The front-end transforms high-level DSL of DNNs into compiler-specific IRs. These IRs are usually in the form of data flow graphs, in which each node represents a tensor operator, and each edge denotes the data dependency between operators; Section IV.A paragraph 5 When doing partitioning, we note that there are cases where multiple nodes consume the same input, i.e., a shared node in the DAG. We handle this situation by creating replicated placeholders in different branches but let them all point to the same input stream.). Additionally, Zheng teaches input and output data formats (Col. 16 lines 3-10 Compilers, in general, are software programs that translate program code written in a human-readable language into a format (e.g., machine instructions) that can be read and processed by an integrated circuit device. In the example of FIG. 
3, the acceleration engine 312 is a neural network accelerator and the compiler 330 is for compiling a neural network description into instructions to be executed on the acceleration engine 312). As per claim 11, it is a system claim of claim 1, so it is rejected for similar reasons. Additionally, Zhang teaches processing hardware including the heterogeneous devices; and memory to store instructions, when executed by the processing hardware, cause the processing hardware to perform operations of a plurality of compilers and a global optimization manager, the processing hardware operative to (Section VI.A paragraph 1 Our evaluation is conducted on a server with a 2.10 GHz Intel(R) Xeon(R) Gold 6152 CPU processor and an NVidia TITAN V GPU, connected through PCIe V3.0 interconnect. The server has 128GB RAM, running 64-bit Linux Ubuntu 16.04; Section II.B paragraph 1 There has been recent work on optimizing DNN performance through DL compilers; Section VI.B paragraph 1 DUET offers much higher performance improvements than DL frameworks, because it combines heterogeneous execution with DL compiler optimizations; Section I. paragraph 6 We make the case for heterogeneity- and compiler-aware DNN inference and present DUET, an engine design for concurrent execution of DNN computation on heterogeneous hardware). As per claims 12-19, they are system claims of claims 2-9, so they are rejected for similar reasons. Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang, Zheng, Brady, and Pal, as applied to claims 1 and 11 above, in view of Kovvuri et al. (US 20190286973 A1 hereinafter Kovvuri). Kovvuri was cited in a prior office action. As per claim 10, Zhang, Zheng, Brady, and Pal teach the method of claim 1. Pal teaches receiving the compilation states from the compilers for SPM allocation, wherein the compilation states include a new compilation state for the new subgraph (Figs. 5, 6; Section 5.1.1 paragraphs 5-6 Consider the node scheduled at timestep 7 with operation opA and output dsout. When OnSRAM-Eager receives the node, the predictor searches the history buffer for opA, and constructs a liveness sequence for data structures in the history buffer from the table (Step 1). We define a window size W (W < B) and grab the W most-recent entries from the sequence. We search the remaining (B-W) entries to see if any sub-sequence matches the W most-recent entries (Step 2). If so, we take the data structure next in the sub-sequence and deem it computationally similar (dssim) to the node's output dsout (Step 3). This approach works best for sequences composed of recurring operator blocks. Since the typical use case is to define the block once and uses it repeatedly, the liveness pattern is likely to repeat across recurring blocks. A new entry corresponding to dsout is added to the data structure table and linked from the history buffer. If the history buffer yielded a matching data structure (dssim), then the predicted liveness, FoM and isPinCand of dssim are copied to dsout; Section 5.1.1 paragraph 3 a binary field (isPinCand) denoting if we should attempt to pin a data structure with similar characteristics in SPM; Section 1 paragraph 9 we propose OnSRAM, a compiler extension that integrates within DL frameworks to manage the on-chip memory of AI accelerators). 
Zhang, Zheng, Brady, and Pal fail to teach further comprising: detecting different data formats between an input and an output of two adjacent subgraphs in the neural network model; inserting a new subgraph between the two adjacent subgraphs to perform data format conversion. However, Kovvuri teaches further comprising: detecting different data formats between an input and an output of two adjacent subgraphs in the neural network model; inserting a new subgraph between the two adjacent subgraphs to perform data format conversion (Fig. 5A; [0112] A sixth code portion 1010 includes an example interface that can be used to define or disable quantization specifications for converting numeric formats between the neural network model 310 and subgraph 320; [0090] For example, during initial training, both the neural network model 310 and the subgraph 320 can be executed on a general-purpose CPU that performs computations with relatively high precision (e.g., using the single or double formats of IEEE 754-1985 with 24 or 53 bits for the significand, respectively). During a fine-tuning of the training and during inferencing of the neural network model 310 and the subgraph 320, the subgraph 320 can be accelerated on a neural network accelerator using lower precision computations (e.g., using a format with eight or fewer bits for the significand); [0146] The processor can be configured to perform computations at a higher precision than the neural network accelerator. The neural network model can be specified using source code of a machine learning native framework. For example, the marker node is located in implementing code of an application programming interface defining the identified subgraph. The marker node can pass values unchanged between the identified subgraph and the neural network model during an initial training mode implemented on the machine learning native framework executing on the processor. The marker node can reduce a precision of values passed as inputs to the identified subgraph during a fine-tune training mode implemented on the machine learning native framework executing on the processor. The marker node can reduce a precision of values output from the identified subgraph during a fine-tune training mode implemented on the machine learning native framework executing on the processor. The marker node can include metadata specifying a format for communicating values between the accelerated version of the subgraph and the neural network model executing on the processor in communication with the neural network accelerator.). It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Zhang, Zheng, Brady, and Pal with the teachings of Kovvuri to reduce latency (see Kovvuri [0086] The neural network accelerator 450 is used to accelerate evaluation and/or training of neural network subgraphs, typically with increased speed and reduced latency).

As per claim 20, it is a system claim of claim 10, so it is rejected for similar reasons.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HSING CHUN LIN, whose telephone number is (571) 272-8522. The examiner can normally be reached Mon - Fri 9AM-5PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Aimee Li, can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/H.L./
Examiner, Art Unit 2195

/Aimee Li/
Supervisory Patent Examiner, Art Unit 2195
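The claim 1 dispute above turns on a specific coordination pattern: each per-device compiler builds a compilation state (tensor records and access records), reports that the state is read-ready and suspends, and a global optimization manager allocates the shared scratchpad memory (SPM) from records unified across subgraphs before letting compilation resume. The following minimal Python sketch is an editor's illustration of that flow only; every class, field, and method name is hypothetical, and the greedy liveness-based placement stands in for whatever allocation scheme the application actually describes.

```python
# Editor's illustrative sketch only -- not code from the application or the cited references.
# All names are hypothetical; "read-ready", unified records, and lifetime-based SPM
# allocation are paraphrased from the claim language quoted in the Office Action above.
from dataclasses import dataclass, field

@dataclass
class TensorRecord:
    tensor_id: str   # unified ID: the same underlying tensor shares one ID across subgraphs
    size: int        # bytes the tensor needs in scratchpad memory (SPM)

@dataclass
class AccessRecord:
    tensor_id: str
    first_use: int   # logical timestep of the first access (start of the live interval)
    last_use: int    # logical timestep of the last access (end of the live interval)

@dataclass
class CompilationState:
    subgraph: str
    tensors: list
    accesses: list
    read_ready: bool = False                          # set by the compiler before it suspends
    spm_offsets: dict = field(default_factory=dict)   # filled in by the manager on allocation

class GlobalOptimizationManager:
    """Waits until every per-device compiler reports a read-ready compilation state,
    then allocates SPM from tensor/access records unified across subgraphs."""

    def __init__(self, spm_size: int):
        self.spm_size = spm_size
        self.states = []

    def report(self, state: CompilationState) -> None:
        state.read_ready = True          # the compiler suspends after this call
        self.states.append(state)

    def all_ready(self, expected_compilers: int) -> bool:
        return len(self.states) == expected_compilers and all(s.read_ready for s in self.states)

    def allocate(self) -> dict:
        # 1) Unify tensor and access records across subgraphs by tensor ID.
        sizes, intervals = {}, {}
        for state in self.states:
            for t in state.tensors:
                sizes[t.tensor_id] = max(sizes.get(t.tensor_id, 0), t.size)
            for a in state.accesses:
                lo, hi = intervals.get(a.tensor_id, (a.first_use, a.last_use))
                intervals[a.tensor_id] = (min(lo, a.first_use), max(hi, a.last_use))

        # 2) Greedy placement: tensors with overlapping live intervals must not share SPM bytes.
        placements = {}                                 # tensor_id -> (offset, size)
        for tid in sorted(intervals, key=lambda t: intervals[t][0]):
            lo, hi = intervals[tid]
            busy = sorted(
                (off, off + sz) for other, (off, sz) in placements.items()
                if not (intervals[other][1] < lo or hi < intervals[other][0]))
            offset = 0
            for start, end in busy:                     # first-fit scan over occupied ranges
                if offset + sizes[tid] <= start:
                    break
                offset = max(offset, end)
            if offset + sizes[tid] <= self.spm_size:
                placements[tid] = (offset, sizes[tid])  # tensors that do not fit stay off-chip

        # 3) Write the allocation back so each compiler can resume with SPM offsets in place.
        for state in self.states:
            state.spm_offsets = {t.tensor_id: placements[t.tensor_id][0]
                                 for t in state.tensors if t.tensor_id in placements}
        return placements
```

In this sketch a driver would call report() once per compiler, confirm all_ready(num_compilers), run allocate(), and then signal the suspended compilers to resume with the returned SPM offsets folded into their compilation states.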
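Claims 10 and 20, as characterized in the rejection above, add detecting a data-format mismatch between two adjacent subgraphs and inserting a new conversion subgraph between them. Below is a minimal, editor-authored sketch of that step under the assumption that formats can be compared as simple strings; the names are hypothetical and not taken from the application or the cited art.

```python
# Editor's illustration of the claim 10 feature as described in the rejection above.
# Names and the string-based format representation are assumptions for the sketch.
from dataclasses import dataclass

@dataclass
class Subgraph:
    name: str
    input_format: str    # e.g. "NCHW/int8"
    output_format: str   # e.g. "NHWC/fp16"

def insert_format_converters(pipeline: list) -> list:
    """For each pair of adjacent subgraphs whose data formats disagree, splice in a new
    subgraph that performs the data-format conversion between them."""
    patched = []
    for producer, consumer in zip(pipeline, pipeline[1:]):
        patched.append(producer)
        if producer.output_format != consumer.input_format:
            patched.append(Subgraph(
                name=f"convert_{producer.name}_to_{consumer.name}",
                input_format=producer.output_format,
                output_format=consumer.input_format))
    if pipeline:
        patched.append(pipeline[-1])
    return patched
```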

Prosecution Timeline

Oct 19, 2022
Application Filed
Jun 14, 2025
Non-Final Rejection — §103, §112
Sep 18, 2025
Response Filed
Jan 06, 2026
Final Rejection — §103, §112 (current)

Precedent Cases

Applications involving similar technology granted by this same examiner

Patent 12554523
REDUCING DEPLOYMENT TIME FOR CONTAINER CLONES IN COMPUTING ENVIRONMENTS
2y 5m to grant Granted Feb 17, 2026
Patent 12547458
PLATFORM FRAMEWORK ORCHESTRATION AND DISCOVERY
2y 5m to grant Granted Feb 10, 2026
Patent 12468573
ADAPTIVE RESOURCE PROVISIONING FOR A MULTI-TENANT DISTRIBUTED EVENT DATA STORE
2y 5m to grant Granted Nov 11, 2025
Patent 12461785
GRAPHIC-BLOCKCHAIN-ORIENTATED SHARDING STORAGE APPARATUS AND METHOD THEREOF
2y 5m to grant Granted Nov 04, 2025
Patent 12443425
ISOLATED ACCELERATOR MANAGEMENT INTERMEDIARIES FOR VIRTUALIZATION HOSTS
2y 5m to grant Granted Oct 14, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 59%
With Interview: 99% (+79.8% lift)
Median Time to Grant: 3y 4m
PTA Risk: Moderate
Based on 108 resolved cases by this examiner. Grant probability derived from career allow rate.
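The 99% "with interview" figure appears to combine the 59% career allow rate with the +79.8% interview lift. One plausible way such a capped adjustment could be computed is sketched below; the formula and the 99% cap are assumptions for illustration, not the dashboard's documented methodology.

```python
# Editor's illustration only; the actual projection model is not documented on this page.
base_rate = 64 / 108           # career allow rate, about 59%
interview_lift = 0.798         # relative lift observed in resolved cases with an interview
with_interview = min(base_rate * (1 + interview_lift), 0.99)   # assumed 99% cap
print(f"{base_rate:.0%} base, {with_interview:.0%} with interview")   # 59% base, 99% with interview
```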
