DETAILED ACTION
This Office Action is in response to claims filed on
Claims 1-30 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 11/14/2025 have been fully considered but they are not persuasive.
Claims 23-26 are rejected under 35 U.S.C. 112(a) as allegedly failing to comply with the written description requirement. That is, it is alleged that claims 23-26 contain subject matter which “was not described in the specification…” Additionally, claims 23-26 and claims 11-30 are also rejected under 35 U.S.C. 112(b) as allegedly being indefinite for failing to particularly point out and distinctly claim the subject matter, invoking an interpretation under 35 U.S.C. 112(f), which the inventor or a joint inventor regards as the invention. Applicant respectfully disagrees with the 112 rejections for the reasons stated herein.
Regarding claims 23-26, the following table matches each means-plus-function element of claims 23-26 to structures that may perform the means-plus-function limitations recited in the claim elements.
With regard to point (a), Examiner respectfully disagrees. As set forth in MPEP 2181(II)(A), “the proper test for meeting the definiteness requirement is that the corresponding structure (or materials or acts) of a means- (or step-) plus-function limitation must be disclosed in the specification itself in a way that one skilled in the art will understand what structure (or material or acts) will perform the recited function.” Further, MPEP 2181(II)(B) asserts “in cases involving a special purpose computer-implemented means-plus-function limitation, the Federal Circuit has consistently required that the structure be more than simply a general purpose computer or microprocessor and that the specification must disclose an algorithm for performing the claimed function. See, e.g., Noah Systems Inc. v. Intuit Inc., 675 F.3d 1302, 1312, 102 USPQ2d 1410, 1417 (Fed. Cir. 2012); Aristocrat, 521 F.3d at 1333, 86 USPQ2d at 1239.” (emphasis added). Accordingly, support for a limitation invoking 35 U.S.C. 112(f) requires a positively disclosed corresponding structure or a general-purpose computer or processor programmed with a disclosed algorithm set forth in the specification.
First, Applicant identifies “a data producer” as the corresponding structure for the claimed function of “means for executing a write request with local coordination in a first batch of a plurality of batches in each of a plurality of slices” recited in claim 23. Applicant cites paragraphs [0057] and [0064] as the purported support for the corresponding structure performing the means-plus-function limitation, respectively describing “In one example, local coordination executed by a data producer may execute a write request sequence as follows” and “In block 440, a write request is executed with local coordination in the first batch of the plurality of batches in the each slice of the plurality of slices. In one example, the write request is executed by a data producer of a plurality of data producers.” However, these disclosures do not sufficiently define the structure of “a data producer” in a manner that would enable one of ordinary skill in the art to identify the particular structure performing the recited function. Paragraph [0027] of the instant specification discloses that “the data producer is a processing unit with software algorithm configured to generate data. In one example, the data producer executes a write request (i.e., a request to write data into a memory unit)”. This disclosure is consistent with an interpretation under a computer-implemented means-plus-function limitation necessitating a “computer” + “algorithm”. However, the Examiner finds that the specification does not positively recite any algorithm for performing the function attributed to the data producer. While the specification broadly describes a processing unit executing software to generate data or perform a write request, it fails to disclose a specific algorithm that would inform one of ordinary skill in the art how the particular steps of “executing a write request with local coordination” are carried out by the data producer.
The cited paragraphs [0057] and [0064] merely reiterate that a data producer performs the claimed function and disclose a sequence of events, without further describing any processing logic or control flow governing execution of the claimed function. As such, reciting that “a data producer” performs the claimed means-plus-function is an inadequate disclosure of structure.
Similarly, Applicant identifies “a data consumer” as the corresponding structure for the claimed function of “means for executing a read request with the local coordination in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed” recited in claim 23. Applicant cites paragraphs [0058] and [0067] as the purported support for the corresponding structure performing the means-plus-function limitation. However, for the same reasons discussed above, merely reciting that “a data consumer” performs the claimed means-plus-function is inadequate to disclose corresponding structure, as the term fails to disclose a specific algorithm that would inform one of ordinary skill in the art how the recited function is carried out by the data consumer.
Further, Applicant identifies “each slice of a plurality of slices” as the corresponding structure for the claimed functions of “means for monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices” and “means for monitoring read request in the first batch of the plurality of batches in the each slice of the plurality of slices” recited in claim 23. Applicant cites paragraphs [0059]-[0060] and [0066] as the purported support for the corresponding structure performing the means-plus-function limitations. However, for the same reasons discussed above, merely reciting that “each slice” performs the claimed means-plus-function is inadequate to disclose corresponding structure, as the term fails to disclose a meaningful component or a specific algorithm that would inform one of ordinary skill in the art how the recited function is carried out by the slice. In particular, paragraph [0066] of the instant specification merely discusses when monitoring is performed, but fails to disclose how monitoring is performed by the slice. Moreover, paragraphs [0059]-[0060] of the instant specification fail to disclose the performance of monitoring entirely. As such, reciting that “each slice of the plurality of slices” performs the claimed means-plus-function is an inadequate disclosure of structure.
Moreover, Applicant identifies “each slice of a plurality of slices” as the corresponding structure for the claimed function of “means for setting a local event tag” in claim 23. Applicant cites paragraphs [0052]-[0056] and [0065] as the purported support for the corresponding structure performing the means-plus-function limitation. However, for the same reasons discussed above, merely reciting that “each slice” performs the claimed means-plus-function is inadequate to disclose corresponding structure, as the term fails to disclose a meaningful component or a specific algorithm that would inform one of ordinary skill in the art how the recited function is carried out by the slice. In particular, paragraph [0065] of the instant specification merely recites that the setting of a local event tag is performed. Additionally, paragraphs [0052]-[0056] fail to set forth the particular structure or algorithm that describes how the “setting [of] a local event tag” is performed. As such, reciting that “each slice of the plurality of slices” performs the claimed means-plus-function is an inadequate disclosure of structure.
Accordingly, the cited paragraphs relied upon by Applicant are each presented in an exemplary format and do not disclose a particular structure or a corresponding algorithm that performs the claimed functions. As such, for the reasons outlined above, the specification fails to provide the structure required under 35 U.S.C. 112(f), and the rejections under 35 U.S.C. 112(a) and 35 U.S.C. 112(b) are maintained.
Argument has not been found persuasive.
Regarding the 35 U.S.C. 112(b) rejection of claims 11-30, Applicant respectfully submits the following.
“Executing a write request with local coordination…” and “executing a read request with local coordination…” are disclosed in the present application as follows:
“In one example, each transition of each local status for each local output sync signal occurs asynchronously. That is each slice may operate with local coordination. That is each slice may complete execution of a batch within its slice and operate with local coordination, rather than global coordination…” Present application, paragraph [0052].
“In one example, local coordination, rather than global coordination, is feasible when processing applications (e.g., graphical processing) perform mostly local (e.g., in pixel) processing with local memory access, rather than global memory access. For example, the data producers and data consumers in a given slice may be independent of data producers and data consumers in other slices. That is, batch execution for other slices which reduces timing performance overhead.” Present application, paragraph [0054].
“In one example, the data producers and data consumers in each slice require local coordination and data coherency only within that slice…” Present application, paragraph [0055].
“In one example, local coordination may insert a local event tag in the sequence of operations to ensure that all data producers have written data into a slice memory, buffer memory or main memory prior to any data consumer commence a read operation of the sequence of operations.” Present application, paragraph [0056].
That is, local coordination refers to data coherency within one slice independent of the status of other slices. Local coordination may be used when an application is confined to processing and data storage within a local slice, without need for any coordination with other slices. That is, each slice operates asynchronously with respect to other slices, but with local synchronicity within the slice. Local synchronicity (i.e., local coordination) may use a local event tag to ensure data coherency within the slice to coordinate write operations with read operations.
“Setting a local event tag in the first batch…” is disclosed in the present application as follows:
“In one example, local coordination may insert a local event tag in the sequence of operations to ensure that all data producers have written data into a slice memory, buffer memory or main memory prior to any data consumer commence a read operation on the sequence of operations.
In one example, local coordination executed by a data producer may execute a write request sequence as follows:
[write request in batch0 -> local event tag -> read request in batch1]
In one example, local coordination executed by a data consumer may execute a read request sequence as follows:
[read request in batch0 -> local event tag -> read request in batch1]
In one example, when a local coordination node (e.g., a slice memory) receives the local event tag, it may push back all subsequent requests from a same path until the following conditions are met:
All previous requests in the same path are completed
The local event tag is received in other paths within the same slice and requests in those other paths are completed.” Present application, paragraphs [0056]-[0059]. Emphasis added.
That is, the local event tag is a data object inserted in a sequence of operations for local coordination for each slice in a plurality of slices. One usage of the local event tag is data coherency, for example, for ensuring that write operations are completed before read operations are started. That is, the local event tag may be used as a local synchronization or coordination signal to maintain data consistency.
With regard to points (b) and (c), Examiner respectfully disagrees. The rejection of the limitations of “executing a write request with local coordination in a first batch…”, “executing a read request with local coordination in a first batch…”, and “setting a local event tag in the first batch…” under 35 U.S.C. 112(b) is maintained because Applicant’s arguments rely on exemplary embodiments described in the specification, which do not define or limit the claim language or provide boundaries for its scope.
Argument has not been found persuasive.
Applicant’s arguments with respect to claims 1, 11, 23, and 27 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
The application includes one or more claim limitations that use the word “means for,” which invokes interpretation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “means for executing a write request with local coordination in a first batch of a plurality of batches in each slice of a plurality of slices,” “means for setting a local event tag in the first batch of the plurality of batches in the each slice of the plurality of slices,” “means for monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices,” “means for executing a read request with the local coordination in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed,” and “means for monitoring the read request in the first batch of the plurality of batches in the each slice of the plurality of slices” in claim 23; “wherein the means for executing the write request is configured to execute asynchronously with respect to other slices of the plurality of slices and wherein the means for executing the read request is configured to execute asynchronously with respect to other slices of the plurality of slices” in claim 24; “wherein the means for monitoring the read request is configured to perform monitoring until one or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 25; and “wherein the means for monitoring the read request is configured to perform monitoring until one or more for the following occur:
a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 26. Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
A review of the specification finds no explicit support for the structures of the limitations. Examiner notes that for computer-implemented technologies, structural support may be derived from a “computer” + “algorithm,” see MPEP § 2181. Examiner does not find support for specific structures nor general computers that are specially programmed by algorithms in the specification corresponding to the limitations above which invoke 35 U.S.C. 112(f). Further, dependent claims 24, 25, and 26 are understood as merely providing additional functional language without introducing distinct structural limitations. For purposes of compact prosecution and applying art, Examiner will interpret the limitations as a generic computer/processor performing instructions for carrying out the claimed functionality.
If applicant does not intend to have this limitation interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation recites sufficient structure to perform the claimed function so as to avoid it being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 23-26 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter invoking an interpretation under 35 U.S.C. 112(f) which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim limitations of “means for executing a write request with local coordination in a first batch of a plurality of batches in each slice of a plurality of slices,” “means for setting a local event tag in the first batch of the plurality of batches in the each slice of the plurality of slices,” “means for monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices,” “means for executing a read request with the local coordination in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed,” and “means for monitoring the read request in the first batch of the plurality of batches in the each slice of the plurality of slices” in claim 23; “wherein the means for executing the write request is configured to execute asynchronously with respect to other slices of the plurality of slices and wherein the means for executing the read request is configured to execute asynchronously with respect to other slices of the plurality of slices” in claim 24; “wherein the means for monitoring the read request is configured to perform monitoring until one or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 25; and “wherein the means for monitoring the read request is configured to perform monitoring until one or more for the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 26 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function.
A review of the specification finds no explicit support for the structure of the limitation. Examiner does not find support for a specific structure nor a general computer that is specially programmed by an algorithm in the specification corresponding to the limitations above which invoke 35 U.S.C. 112(f). See MPEP § 2181(II)(B). Further, dependent claims 24, 25, and 26 recite additional functional language, but do not further clarify or identify the structure of the limitations above which invoke 35 U.S.C. 112(f). For the purpose of compact prosecution and applying art, Examiner will interpret the limitations as a generic computer/processor for performing instructions for carrying out the claimed functionality. See claim interpretation under 35 U.S.C. 112(f) above and rejection under 35 U.S.C. 112(b) below. Therefore, claims 23-26 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement.
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 23-26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter invoking an interpretation under 35 U.S.C. 112(f) which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim limitations of “means for executing a write request with local coordination in a first batch of a plurality of batches in each of a plurality of slices,” “means for setting a local event tag in the first batch of plurality of batches in the each slice of the plurality of slices,” “means for monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices,” “means for executing a read request with the local coordination in the first batch of the plurality of batches in the each slice of plurality of slices after monitoring the write request is completed,” and “means for monitoring the read request in the first batch of the plurality of batches in the each slice of the plurality of slices” in claim 23; “wherein the means for executing the write request is configured to execute asynchronously with respect to other slices of the plurality of slices and wherein the means for executing the read request is configured to execute asynchronously with respect to other slices of the plurality of slices” in claim 24; “wherein the means for monitoring the read request is configured to perform monitoring until one or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 25; and “wherein the means for monitoring the read request is configured to perform monitoring until one or more for the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed” in claim 26 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. A review of the specification finds no explicit support for the structure of these limitations. Examiner notes that for computer-implemented technologies, structural support may be derived from a “computer” + “algorithm,” see MPEP § 2181. Examiner does not find support for a specific structure nor a general computer that is specially programmed by an algorithm in the specification corresponding to the limitations above which invoke 35 U.S.C. 112(f). Further, dependent claims 24, 25, and 26 recite additional functional language, but do not further clarify or identify the structure of the limitations above which invoke 35 U.S.C. 112(f). Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:
(a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph;
(b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either:
(a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
Claims 11-30 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The following claim language is unclear:
Claims 11, 23, and 27 recite “executing a write request with local coordination in a first batch of a plurality of batches in each slice of a plurality of slices” and “executing a read request with local coordination in the first batch of the plurality of batches in the each slice of the plurality of slices.” It is unclear from the context of the claims what implementation of “local coordination” is intended. More particularly, it is unclear what specifically is being “coordinated.” Does coordination occur as scheduling the execution of the different operation requests on a single slice? Does coordination occur as communication between two or more slices? As such, one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Further, it is unclear how “local” is to be interpreted. Does local coordination occur within each slice? Does local coordination occur among the slices? Is “local” meant to be understood as spatial proximity (e.g., coordination occurs only among a cluster of adjacent slices) or as workload-defined locality (e.g., coordination occurs among a set of slices best suited to a scheduling principle)? For purposes of examination, the Examiner will reasonably interpret the limitation as a slice engaging in read and write operations in which data in a write buffer is read into a processing element and then written into a read buffer, with coordination arising from the control mechanisms indicating when read and write operations occur within each slice.
Further, claims 11, 23, and 27 recite “setting a local event tag in the first batch of the plurality of batches in the each slice of the plurality of slices.” The context of the claims does not make clear the specific structure or implementation of a “local event tag.” The Examiner will reasonably interpret the limitation as a communication medium (e.g., a packet, interrupt, message) merely indicating a state of execution of a slice associated with a data batch.
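For illustration only, the Examiner's interpretation of a “local event tag” may be sketched as follows; the names and fields below are hypothetical and are not drawn from the claims or the cited references:

```python
from dataclasses import dataclass

# Hypothetical sketch: a "local event tag" interpreted as a small message
# indicating a slice's state of execution for a given data batch.
@dataclass
class LocalEventTag:
    slice_id: int   # which slice produced the event
    batch_id: int   # which batch of the plurality of batches
    state: str      # e.g., "write_done" or "read_done"

tag = LocalEventTag(slice_id=0, batch_id=1, state="write_done")
print(tag.state)  # write_done
```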
Claims 12-22 are dependent on claim 11 and fail to cure the deficiencies set forth above for claim 11. Therefore, they are additionally rejected under the same rationale discussed above.
Claims 24-26 are dependent on claim 23 and fail to cure the deficiencies set forth above for claim 23. Therefore, they are additionally rejected under the same rationale discussed above.
Claims 28-30 are dependent on claim 27 and fail to cure the deficiencies set forth above for claim 27. Therefore, they are additionally rejected under the same rationale discussed above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. Patent No. US 11,093,276 B2 (hereinafter Zhou) in view of Chen et al. Patent No. US 11,709,664 B2 (hereinafter Chen), and further in view of Guo et al. Pub. No. US 2023/0289242 A1 (hereinafter Guo).
With regard to claim 1, Zhou teaches an apparatus comprising:
a plurality of slices, wherein each slice of the plurality of slices is configured for distributed information processing (Col. 2, Chip communication system 102 can include a global manager 1022 and a plurality of cores 1024. Global manager 1022 can include at least one task manager to coordinate with one or more cores 1024); and
However, Zhou does not explicitly teach a plurality of dedicated databuses coupled to slices configured for coordination of distributed information processing.
Chen teaches a plurality of dedicated databuses, wherein each slice of the plurality of slices is coupled to one of the plurality of dedicated databuses (FIG. 13a, Processor array network coupled to databuses in a grid; Col. 23, lines 4-6, In this example, the array of configurable units 1330 includes a plurality of types of configurable units, which are configured with the anti-congestion logic 232; Col. 23, lines 42-51, The array level network includes links interconnecting configurable units in the array. The links in the array level network include one or more and, in this case, three kinds of physical buses: a chunk-level vector bus (e.g., 128 bits of data), a word-level scalar bus (e.g., 32 bits of data), and a multiple bit-level control bus. For instance, interconnect 1321 between switch units 1311 and 1312 includes a vector interconnect with a vector bus width of 128 bits, a scalar bus interconnect with scalar bus width of 32 bits, and a control interconnect) and each slice of the plurality of slices is configured to coordinate locally the distributed information processing (Col. 9, Coarse-grained reconfigurable architectures (CGRAs) comprise distributed compute and memory components in a programmable interconnect fabric. Applications 102 are executed on CGRAs in a distributed fashion by programming the individual compute and memory components to asynchronously receive, process, and send data and control information) ….
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Chen with the teachings of Zhou in order to provide an apparatus that teaches a plurality of databuses coupled to computing slices configured for local coordination of distributed processing. The motivation for applying Chen's teachings with those of Zhou is to provide an apparatus that allows for efficient propagation of control events throughout an array of computing slices through the plurality of associated databuses, such that the system can manage the rate of execution of pipeline stages and prevent buffer overflows and processing bottlenecks (Chen, Col. 3). Zhou and Chen are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Chen with Zhou to teach the claimed invention in order to provide a system which enables efficient control signaling and communication among an array of computing slices through associated databuses.
However, Zhou and Chen do not explicitly teach coordinate distributed information processing locally using an asynchronous local output sync signal.
Guo teaches coordinate locally the distributed information processing ([0158], Threads from CTAs on remote SMs and from CTAs on the local can all arrive and wait at the same transaction barrier in the local SM’s shared memory. In some embodiments, waiting may be limited to local threads) using an asynchronous local output sync signal ([0112], For example, in some embodiments, the synchronization unit 700 may provide hardware acceleration to arrive-wait barriers by, for example, caching the arrive-wait barriers in the barrier cache, providing the datapath circuit for processing arrive() and wait() on the arrive-wait barriers, and by providing for the threads waiting on the arrive-wait barrier to utilize the try-wait barrier. In an example embodiment, the producer will perform a sequence wait(barrier0)=>data store=>fence()=>arrive(barrier1), while the consumer will perform the sequence arrive(barrier0)=>other operations=>wait(barrier1)=>consume stored data, in which at least the wait() may use the try-wait buffer thereby reducing the polling of the barrier1 is reduced)
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Guo with the teachings of Zhou and Chen in order to provide an apparatus that teaches local coordination of processing devices using a local asynchronous output synchronization signal. The motivation for applying Guo's teachings with those of Zhou and Chen is to provide an apparatus that allows for asynchronous compute, wherein device utilization is increased by scheduling tasks out of order while maintaining data coherency through the implementation of barriers for operations with dependencies, improving system throughput across computing nodes (Guo, [0021]). Zhou, Chen, and Guo are analogous art directed towards synchronization techniques. Therefore, it would have been obvious for one of ordinary skill in the art to combine Guo with Zhou and Chen to teach the claimed invention in order to provide local asynchronous synchronization techniques within distributed, grid computing arrangements.
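For illustration only, the arrive-wait barrier sequence quoted from Guo [0112] (producer: wait(barrier0) => data store => fence() => arrive(barrier1); consumer: arrive(barrier0) => wait(barrier1) => consume stored data) may be sketched as follows; using Python events to stand in for hardware barriers is a hypothetical simplification, not Guo's disclosed implementation:

```python
import threading

# Hypothetical sketch: two events stand in for Guo's split arrive/wait
# barriers; arrive() maps to Event.set() and wait() to Event.wait().
data = []
barrier0 = threading.Event()  # consumer arrives here; producer waits
barrier1 = threading.Event()  # producer arrives here; consumer waits
result = []

def producer():
    barrier0.wait()            # wait(barrier0): consumer slot is free
    data.append("payload")     # data store
    barrier1.set()             # fence + arrive(barrier1): data is visible

def consumer():
    barrier0.set()             # arrive(barrier0): announce readiness
    barrier1.wait()            # wait(barrier1): block until data stored
    result.append(data.pop())  # consume stored data

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(result)  # ['payload']
```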
With regard to claim 2, Zhou teaches the apparatus of claim 1, wherein the each slice of the plurality of slices includes a memory unit (Col. 4, lines 22-25, CGRA 200 can includes cores 210-217 and buffers 220-227. Each of cores 210-217 can be connected to its own memory buffer (e.g., buffers 220-227)).
With regard to claim 3, Zhou teaches the apparatus of claim 2, wherein the each slice of plurality of slices is a processing unit (Col. 2, lines 57-62, Cores 1024 can include one or more processing elements that each include single instruction, multiple data (SIMD) architecture including one or more processing units configured to perform one or more operations).
With regard to claim 4, Zhou teaches the apparatus of claim 2, further comprising a plurality of current workload batches (Col. 5, lines 4-9, Buffer controller 402 can generate instructions for performing a plurality of buffer transactions on at some of buffers 410-417. For example, on-chip system 100 can receive a computation task (e.g., deep learning), and buffer controller 402 can generate instructions based on the received task).
With regard to claim 5, Zhou teaches the apparatus of claim 4, further comprising a plurality of current workload batches is stored within the memory unit the each slice of the plurality of slices (Col. 5, lines 9-11, These instructions can be assigned to data managers corresponding to the at least some of buffers 410-417 for execution; Col. 5, lines 37-40, As shown in FIG. 4, system 400 includes buffers 410-417 and each buffer can further include a plurality of data units indicated by blocks in the buffer).
With regard to claim 9, Zhou reasonably teaches the plurality of cores executing buffer transactions that include read and write operations (Col. 5). However, Zhou does not explicitly teach the apparatus comprising data producers configured to execute write requests.
Chen teaches the apparatus of claim 1, further comprising a plurality of data producers (Col. 12, lines 45-48, The buffer classification logic 212 is configured to classify the stage buffers as producers and consumers on a stage-by-stage basis by classifying those stage buffers that provide input data to a particular stage as the producers), wherein each of the plurality of data producers is configured to execute a write request (Col. 12, lines 51-57, The control connections creation logic 222 is configured to create control connection between the stage buffers by extending the control connections from the consumers in the particular stage. The control connections extend from a particular consumer to one or more corresponding producers that write data into the particular consumer) in each of the plurality of slices (Col. 14, lines 16-20, One skilled in the art will appreciate that the data processing pipeline can comprise a plurality of producers, a plurality of compute nodes, and a plurality of consumers such that that a compute node can receive input from multiple producers and can provide output to multiple consumers).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Chen with the teachings of Zhou and Guo in order to provide an apparatus that teaches a plurality of data producers configured to perform write requests in each of the plurality of slices. The motivation for applying Chen's teachings with those of Zhou and Guo is to provide an apparatus that allows for classification of stage buffers such that the appropriate control connections between stage buffers can be applied for the particular stage. For example, write buffer designations control when output data can be written by associated data producers to maintain consumer data saturation (Chen, Col. 7). Zhou, Guo, and Chen are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Chen with Zhou and Guo to teach the claimed invention in order to provide the appropriate stage buffer classification which controls data transmission between compute slices.
With regard to claim 10, Zhou reasonably teaches the plurality of cores executing buffer transactions that include read and write operations (Col. 5). However, Zhou does not explicitly teach the apparatus comprising data consumers configured to execute read requests.
Chen teaches the apparatus of claim 1, further comprising a plurality of data consumers (Col. 12, lines 45-50, The buffer classification logic 212 is configured to classify the stage buffers as producers and consumers on a stage-by-stage basis by … classifying those stage buffers that store output data from the particular stage as the consumers), wherein each of the plurality of data consumers is configured to execute a read request (Col. 12, lines 61-67, The anti-congestion logic 232 is configured to configure each of the producers with a ready-to-read credit counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer) in each of the plurality of slices (Col. 14, lines 16-20, One skilled in the art will appreciate that the data processing pipeline can comprise a plurality of producers, a plurality of compute nodes, and a plurality of consumers such that that a compute node can receive input from multiple producers and can provide output to multiple consumers).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Chen with the teachings of Zhou and Guo in order to provide an apparatus that teaches a plurality of data consumers configured to perform read requests in each of the plurality of slices. The motivation for applying Chen's teachings with those of Zhou and Guo is to provide an apparatus that, analogous to data producers, allows for classification of stage buffers such that the appropriate control connections between stage buffers can be applied for the particular stage. For example, read buffer designations control the amount of input data needed to fully saturate an associated data consumer and signal buffer availability (Chen, Col. 7). Zhou, Guo, and Chen are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Chen with Zhou and Guo to teach the claimed invention in order to provide appropriate stage buffer classification which controls data transmission between compute slices.
Claims 6, 7, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou in view of Chen and Guo as applied to claims 2 and 5 above, and further in view of Vembu et al. Pub. No. US 2023/0039853 A1 (hereinafter Vembu).
With regard to claim 6, Vembu teaches the apparatus of claim 5, wherein each of plurality of current workload batches is different from another of the plurality of current workload batches ([0159], Instead, each tile executes a subset of the submitted workloads. In one embodiment a given tile can execute a specific subset of the workload (Examiner notes: A workload is divided and distributed into different batches) based on an identifier provided to a hardware context associated with the tile; [0153], As shown in FIG. 16A, the graphics processing system 1600 includes an application and/or graphics driver (app/driver 1601) that can send workloads 1604A-1604D to one or more engine block tiles 1605A-1605D, which can be similar to or variants of the engine block tiles 1524A-1524N of FIG. 15. The workloads 1604A-1604D can be part of the same workload and/or separate workloads).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Vembu with the teachings of Zhou, Chen, and Guo in order to provide an apparatus that teaches a plurality of workload batches different from each other. The motivation for applying Vembu's teachings with those of Zhou, Chen, and Guo is to provide an apparatus that allows for the use of known techniques of multi-tasking in multi-core systems such that different workloads can be executed simultaneously, leading to improved efficiency of the system. Zhou, Chen, Guo, and Vembu are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Vembu with Zhou, Chen, and Guo to teach the claimed invention in order to provide efficient resource allocation by assigning different workload batches to different processors.
With regard to claim 7, Vembu teaches the apparatus of claim 5, wherein at least two of the plurality of current workload batches are the same ([0159], In one embodiment, to enable cross-tile workloads, the same batch buffer containing the superset of work items to be performed is submitted to each tile that is to be included within tile work-group. All commands are submitted to all tiles (Examiner notes: A single workload is distributed as identical batches) that are to execute the commands, even if a given tile is not intended to execute all submitted workloads; [0153], As shown in FIG. 16A, the graphics processing system 1600 includes an application and/or graphics driver (app/driver 1601) that can send workloads 1604A-1604D to one or more engine block tiles 1605A-1605D, which can be similar to or variants of the engine block tiles 1524A-1524N of FIG. 15. The workloads 1604A-1604D can be part of the same workload and/or separate workloads).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Vembu with the teachings of Zhou, Chen, and Guo in order to provide an apparatus that teaches that a plurality of workload batches can be associated with the same workload. The motivation for applying Vembu's teachings with those of Zhou, Chen, and Guo is to provide an apparatus that allows for the use of known techniques of parallel execution in order to speed up the execution of a workload by increasing the processing throughput of the plurality of identical workloads. Zhou, Chen, Guo, and Vembu are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Vembu with Zhou, Chen, and Guo to teach the claimed invention in order to provide parallel execution of identical workloads.
With regard to claim 8, Vembu teaches the apparatus of claim 2, further comprising a plurality of external memory units configured to store a plurality of future workload batches (FIG. 16B, Workload 1605 held in Batch Buffer Submitter 1618; [0152], FIG. 16B shows an example of a system graphics interface 1602; [0157], The doorbell 1603 is one of the multiple doorbell interface through which a workload 1605 can be submitted, where the workload 1604 can be anyone one of workloads 1604A-1604D of FIG. 16A … In one embodiment the work request is provided in the form of a buffer of batched commands (e.g., a batch buffer). The batch buffer can be processed via the batch buffer submitter 1618 (Examiner notes: Batch Buffer Submitter is an intermediary memory device to store workloads for scheduling). In one embodiment, the system/device address translator 1616 to translate system addresses to device local address for the engine block tile. The commands of the batch buffer can then be submitted to the associated engine block tile), wherein each of the plurality of slices is associated with one of the plurality of future workload batches (FIG. 16A, Plurality of Engine Block Tiles 1605A-D coupled to System Graphics Interface 1602A-D holding Workloads 1604A-D; [0152], FIG. 16A shows an overview of the graphics processing system 1600, according to an embodiment; [0153], As shown in FIG. 16A, the graphics processing system 1600 includes an application and/or graphics driver (app/driver 1601) that can send workloads 1604A-1604D to one or more engine block tiles 1605A-1605D).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Vembu with the teachings of Zhou, Chen, and Guo in order to provide an apparatus that teaches memory units configured to queue workloads for execution. The motivation for applying Vembu's teachings with those of Zhou, Chen, and Guo is to provide an apparatus that allows for pre-processing of future workload batches during the execution of a current batch, thereby enabling efficient scheduling and execution upon batch dispatching (Vembu, [0157]). Zhou, Chen, Guo, and Vembu are analogous art directed towards task management and arrangements of distributed systems. Therefore, it would have been obvious for one of ordinary skill in the art to combine Vembu with Zhou, Chen, and Guo to teach the claimed invention in order to provide workload queuing for associated computing slices.
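For illustration only, the queuing of future workload batches for associated slices described above may be sketched as follows; the tile names and function names are hypothetical and are not drawn from Vembu:

```python
from collections import deque

# Hypothetical sketch: per-slice queues holding future workload batches,
# analogous to batch buffers submitted to engine block tiles.
slices = {"tile_a": deque(), "tile_b": deque()}

def submit(tile, batch):
    """Queue a future workload batch for a slice (batch buffer submission)."""
    slices[tile].append(batch)

def dispatch(tile):
    """Pop the next queued batch for execution, if any."""
    return slices[tile].popleft() if slices[tile] else None

submit("tile_a", "batch0")
submit("tile_a", "batch1")
print(dispatch("tile_a"))  # batch0
print(dispatch("tile_b"))  # None
```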
Claims 11-30 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. Patent No. US 11,709,664 B2 (hereinafter Chen) in view of Chandra et al. Patent No. US 5,978,936 (hereinafter Chandra), and further in view of Guo et al. Pub. No. US 2023/0289242 A1 (hereinafter Guo).
With regard to claim 11, Chen teaches a method for implementing slice (Col. 14, lines 6-10, A data processing pipeline/operation comprises at least a producer, a compute node, and a consumer; Col. 14, lines 30-35, In the context of this application, a producer can be referred to as an upstream buffer or upstream memory node/unit, a compute node can be referred to as an intermediary compute node/unit or an intermediary processing node/unit, and a consumer can be referred to as a downstream buffer or downstream memory node/unit) coordination, the method comprising (Col. 7, lines 1-6 and lines 19-22, A computer-implemented method is described that includes accessing a dataflow graph with compute nodes that asynchronously transmit data along data connections. The data flow graph includes a loop nest in which loops are arranged in a hierarchy of levels, such that a loop at a second level is within a loop at a first level. The method includes … controlling data transmission between the compute nodes along the data connections by using control connections to control writing of the data by the producers into the consumers):
executing a write request with local coordination (Col. 15, lines 10-12 and lines 22-28, The write credit counter is configured to decrement when the producer begins writing the buffer data unit into the consumer … The write credit counter is configured to increment when the producer receives from the consumer a write done token. The write done token indicates to the producer that the writing of the buffer data unit into the consumer has completed. The producer resumes writing data into the consumer when the producer receives the write done token from the consumer. The producer stops writing data into the consumer when the write credit counter has zero write credits (Examiner notes: Where write request coordination is managed by the credit counters of the associated slice)) in a first batch of a plurality of batches in each slice of a plurality of slices (Col. 15, lines 48-49, At a first timestep, the producer begins writing a first buffer unit data into the consumer…);
setting a local event tag (Col. 3, lines 56-58, The compiler is further configured to configure each of the producers (Examiner notes: each of the slices) with a ready-to-read credit counter; Col. 4, lines 16-18, In some embodiments, the compiler is further configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 15, lines 36-38, As discussed above, the ready-to-read credit counter is initialized with three read credits and the write credit counter is initialized with two write credits) in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 15, lines 49-52, In response, the ready-to-read counter is decremented by one (read credit = 2) and the write credit counter is also decremented by one (write credit = 1));
monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 15, lines 53-64, At a second timestep, the producer begins writing a second buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit=1) and the write credit counter is further decremented by one (write credit = 0). The write credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a third timestep, the writing of the first buffer unit data into the consumer is complete. In response, the consumer sends a first write done token to the producer. In response, the write credit counter is incremented by one and therefore reactivated (write credit = 1)) (Examiner notes: Where slice request execution is monitored by the credits available);
executing a read request with the local coordination (Col. 15, lines 7-11 and lines 12-22, The ready-to-read counter is configured to decrement when the producer begins writing a buffer data unit into the consumer. The size of the buffer data unit is s (e.g., 16 bytes, 64 bytes, 512 bytes) … The ready-to-read credit counter is configured to increment when the producer receives from the consumer a read ready token. The read ready token indicates to the producer that the consumer has freed a buffer data unit and is ready to receive an additional buffer data unit. The producer stops writing data into the consumer when the ready-to-read credit counter has zero read credits. The producer resumes writing data into the consumer when the producer receives read ready token from the consumer) in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed (Col. 16, lines 31-40, At the fifth timestep, the producer begins writing a third buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit = 0) and the write credit counter is decremented by one (write credit = 1). The ready-to-read credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a sixth timestep, the writing of the third buffer unit into the consumer is complete. In response, the consumer sends a third write done token to the producer); and
monitoring the read request in the first batch of the plurality of batches in each slice of the plurality of slices (Col. 16, lines 43-51, At a seventh timestep, a downstream consumer reads a buffer unit data from the consumer, i.e., the consumer is an upstream producer from the perspective of the downstream consumer and write the buffer unit data into the downstream consumer. This frees up space in the consumer equaling to a buffer unit data and therefore the consumer is ready to receive an additional buffer unit data from the producer. In response, the consumer sends a read ready token to the producer).
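For illustration only, the credit-counter coordination quoted from Chen (Col. 15) may be sketched as follows; the class and method names are hypothetical, but the counter behavior tracks the timesteps cited above (ready-to-read credits initialized to three, write credits to two):

```python
# Hypothetical sketch of Chen's anti-congestion credit counters: the
# producer holds a ready-to-read counter (init 3) and a write credit
# counter (init 2); writing stops when either reaches zero, and tokens
# from the consumer replenish the credits.
class Producer:
    def __init__(self, read_credits=3, write_credits=2):
        self.read_credits = read_credits    # ready-to-read credit counter
        self.write_credits = write_credits  # write credit counter

    def can_write(self):
        # the producer stops writing when either counter has zero credits
        return self.read_credits > 0 and self.write_credits > 0

    def begin_write(self):
        assert self.can_write(), "writing stopped: a counter expired"
        self.read_credits -= 1
        self.write_credits -= 1

    def on_write_done(self):  # consumer sends a write done token
        self.write_credits += 1

    def on_read_ready(self):  # consumer sends a read ready token
        self.read_credits += 1

p = Producer()
p.begin_write()       # first timestep:  read=2, write=1
p.begin_write()       # second timestep: read=1, write=0 (writing stopped)
print(p.can_write())  # False
p.on_write_done()     # third timestep: write done token arrives
print(p.can_write())  # True
```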
Chen reasonably teaches executing requests with local coordination in a slice (Col. 15). However, Chen does not explicitly teach a plurality of batches in each slice of the plurality of slices. Further, Chen reasonably teaches monitoring of requests through counters (Col. 15); however, Chen does not explicitly teach local event tags set in the batches of the plurality of batches.
Chandra teaches coordination in a first batch of a plurality of batches (Col. 1, lines 46-50, According to the present invention, the foregoing and other objects are attained by providing corresponding sets of test instructions (Examiner notes: a batch of instructions) for a number of nodes in a computer network. The test instructions are partitioned into test modules; Col. 1, lines 64-67, The test modules have an ordered sequence. That is, there is a first module, a second module, a third module, and so on (Examiner notes: a workload partitioned into a plurality of ordered batches), in each set of modules on each one of the nodes) in each slice of a plurality of slices (Col. 1, lines 51-56, When a node completes processing of one of its test modules, the node stores a test result corresponding to the test module. Meanwhile, another node processing its set of the test modules, stores test results for each one of the modules when it completes. Thus the two nodes process the test modules asynchronously (Examiner notes: the slice apparatus of Chen); Col. 2, lines 1-2, The nodes process the corresponding test modules in the same ordered sequence (Examiner notes: each node receives the same copy of the first batch));
setting a local event tag in the first batch of the plurality of batches in each of the plurality of slices (Col. 4, lines 54-67, If the comparison at 160 of node Nx and node Ny test module processing status indicates that node Nx has processed at least as many test modules as node Ny, then, at 178, node Nx sends the result of the highest order test module processed by node Nx to node Ny. Then, at 180 node Nx waits for node Ny to process additional test modules until node Ny catches up with node Nx. Once node Ny has processed a test module of the same order completed by node Nx, then at 185, node Nx gets the result of that test module from node Ny. Then, at 190, the results of the highest order test modules processed by both node Nx and node Ny are compared).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Chandra with the teachings of Chen in order to provide a method that teaches execution of requests with local coordination and the setting of local event tags in a distributed manner by means of batch processing of a workload across computing slices. The motivation for applying Chandra's teachings with those of Chen is to provide a method that allows for the known benefits of concurrent execution coupled with proper execution management of operations distributed across multi-processing hardware, such that coordinating execution through events can enable processing remedies, which is advantageous for ensuring proper functioning of a node to be confirmed with certainty prior to an operation (Chandra, Col. 2). Chen and Chandra are analogous art directed towards workload management in distributed environments. Therefore, it would have been obvious for one of ordinary skill in the art to combine Chandra with Chen to teach the claimed invention in order to provide a method that utilizes distributed computing with proper operation management across the distributed system.
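For illustration only, Chandra's comparison of results at the highest-order test module processed by both nodes (Col. 4) may be sketched as follows; the function name and the example result values are hypothetical:

```python
def check_lockstep(results_x, results_y):
    """Compare the results of the highest-order test module processed by
    both nodes, after the faster node waits for the slower one to catch up
    (per Chandra's ordered-sequence processing)."""
    k = min(len(results_x), len(results_y)) - 1
    return k >= 0 and results_x[k] == results_y[k]

# Node Nx has processed three modules, node Ny only two; the comparison
# occurs at the highest order completed by both (the second module).
print(check_lockstep([0xA1, 0xB2, 0xC3], [0xA1, 0xB2]))  # True
```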
However, Chen and Chandra do not explicitly teach coordinate distributed information processing locally using an asynchronous local output sync signal.
Guo teaches using an asynchronous local output sync signal ([0112], For example, in some embodiments, the synchronization unit 700 may provide hardware acceleration to arrive-wait barriers by, for example, caching the arrive-wait barriers in the barrier cache, providing the datapath circuit for processing arrive() and wait() on the arrive-wait barriers, and by providing for the threads waiting on the arrive-wait barrier to utilize the try-wait barrier. In an example embodiment, the producer will perform a sequence wait(barrier0)=>data store=>fence()=>arrive(barrier1), while the consumer will perform the sequence arrive(barrier0)=>other operations=>wait(barrier1)=>consume stored data, in which at least the wait() may use the try-wait buffer thereby reducing the polling of the barrier1 is reduced) which is substantially similar to claim 1 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the apparatus of claim 1 is being substantially recited again for the method of claim 11.
With regard to claim 12, Chen teaches the method of claim 11, wherein the write request in the each slice of the plurality of slices is executed asynchronously (Col. 13, lines 20-23, The anti-congestion logic 232 is configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers)
However, Chen does not explicitly teach that the asynchronous execution is with respect to other slices of the plurality of slices.
Chandra teaches with respect to other slices of the plurality of slices (Col. 3, lines 51-56, However, by partitioning signatures analysis programs into test modules producing a signature for each of the modules, as provided in the present invention, and coordinating the processing of the test modules concurrently but in an asynchronous mode among a number of nodes).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Chandra with the teachings of Chen and Guo in order to provide a method that teaches asynchronous execution of an operation for a slice independent of the executions of the other slices in the system. The motivation for applying the Chandra teaching with the Chen and Guo teachings is to provide a method that allows for the known benefits of continuous processing and throughput of data in a distributed system, thereby reducing the idleness of computing nodes and mitigating system stall due to failure of any single computing node (Chandra, Col. 3 - Col. 4). Chen, Guo, and Chandra are analogous art directed towards workload management in distributed environments. Therefore, it would have been obvious for one of ordinary skill in the art to combine Chandra with Chen and Guo to teach the claimed invention in order to provide asynchronous execution of operations of all computing slices in a system.
With regard to claim 13, Chen teaches the method of claim 12, further comprising the each slice of the plurality of slices supplying a local output coordination signal to indicate completion of the execution of a batch in the each slice of the plurality of slices (Col. 2, lines 54-63, In CGRAs and other processing systems that comprise a plurality of processing units that participate in a data processing operation, part of the data processing operation to be executed in one processing unit may need to be synchronized with other parts being executed in processing units distributed across the system. For example, several parts of the data processing operation may need to complete before a next part can safely begin. Thus, techniques for distributed control signals among elements of the processing system are required; Col. 15, lines 36-38, The read ready tokens and the write done tokens are pulse signals that emanate from the consumer and terminate at the producer).
With regard to claim 14, Chen teaches the method of claim 13, wherein monitoring the write request is performed until all previous write requests in a path are completed (FIG. 4, Dataflow graph 400 illustrating the path for General Matrix Multiplication (GeMM) operations where GeMM 0, 1, and 2 (402, 412, 422) perform write request to GeMM 4 (407) until completion; Col. 17, lines 27-29, FIG. 4 is an example of a dataflow graph 400 with compute nodes that asynchronously transmit data along data connections; Col. 17, lines 41-59, In the outer loop 410 each of the first three matrix multiplication nodes 402, 412, 422 receives a respective input (e.g., a respective tensor), executes a general matrix multiply (GeMM) operation on the respective input using a respective set of weights, and produces a respective output. The outputs from the first three matrix multiplication nodes 402, 412, 422 are piecewise processed by the innermost loop 409 over multiple iterations).
With regard to claim 15, Chen teaches the method of claim 13, wherein monitoring the write request is performed until a path has a verified receipt of the local event tag (Col. 13, lines 28-31, The write done token indicates to the particular producer that the writing of the buffer data unit into the corresponding consumer has completed; Col. 16, lines 6-10, The write done token ensures that at most K samples are processed through the compute node(s) between the producer and the consumer without receiving an acknowledgement from the consumer, where K is the initialization value of the write credit counter).
With regard to claim 16, Chen teaches the method of claim 13, wherein monitoring the write request is performed until the write request in the first batch of the plurality of batches is completed (Col. 13, lines 20-25 and lines 31-33, The anti-congestion logic 232 is configured to configure each of the producers with a write credit counter that is initialized with one or more write credits. The write credit counter is configured to decrement when the particular producer begins writing the buffer data unit into the corresponding consumer along the data connection … The particular producer stops writing data into the corresponding consumer when the write credit counter has zero write credits; Col. 16, lines 14-20, A zero write credit means that the consumer is still collecting the results from a previous sample and processing of the previous sample is not yet finished. The producer sends the next sample to the consumer only when the consumer finishes writing the result of the previous sample to a downstream consumer and is ready to start processing the next sample).
With regard to claim 17, Chen teaches the method of claim 13, wherein monitoring the write request is performed until two or more of the following occur: a) all previous write requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the write request in the first batch of the plurality of batches is completed (Col. 20, lines 34-45, At action 1062, the method includes (i) configuring each of the producers with a write credit counter, (ii) initializing the write credit counter with one or more write credits, (iii) decrementing the write credit counter when the particular producer begins writing the buffer data unit into the corresponding consumer along the data connection (Examiner notes: Condition C, monitoring disabled after K batches), and (iv) incrementing the write credit counter when the particular producer receives from the corresponding consumer a write done token along the control connection. The write done token indicates to the particular producer that the writing of the buffer data unit into the corresponding consumer has completed (Examiner notes: Condition B, receipt of local event completion tag)).
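The write-credit behavior quoted from Chen (Col. 20, actions (i)-(iv)) may be illustrated by the following sketch. The class name, the initialization value K, and the method names are illustrative assumptions of this discussion, not code from the Chen reference:

```python
class WriteCreditCounter:
    """Sketch of the quoted write-credit flow control: the counter is
    initialized with K credits, decremented when a write begins, and
    incremented when a write done token arrives from the consumer."""
    def __init__(self, k=2):
        self.credits = k  # (i)-(ii): configure and initialize with K credits

    def begin_write(self):
        # (iii): decrement when the producer begins writing a buffer data unit;
        # the producer stops writing when the counter has zero write credits
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_write_done_token(self):
        # (iv): increment when the write done token arrives from the consumer
        self.credits += 1

wcc = WriteCreditCounter(k=2)
assert wcc.begin_write() and wcc.begin_write()  # two writes in flight
stalled = not wcc.begin_write()                 # zero credits: writing stops
wcc.on_write_done_token()                       # consumer acknowledges a write
resumed = wcc.begin_write()                     # producer may resume writing
print(stalled, resumed)  # -> True True
```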
With regard to claim 18, Chen teaches the method of claim 11, wherein the read request in the each slice of the plurality of slices is executed asynchronously (Col. 12, lines 61-67, The anti-congestion logic 232 is configured to configure each of the producers with a ready-to-read credit counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers).
However, Chen does not explicitly teach that the asynchronous execution is with respect to other slices of the plurality of slices.
Chandra teaches with respect to other slices of the plurality of slices (Col. 3, lines 51-56, However, by partitioning signatures analysis programs into test modules producing a signature for each of the modules, as provided in the present invention, and coordinating the processing of the test modules concurrently but in an asynchronous mode among a number of nodes) which is substantially similar to claim 12 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the limitation of claim 12 is being substantially recited again as a limitation for claim 18.
With regard to claim 19, Chen teaches the method of claim 18, wherein monitoring the read request is performed until all previous read requests in a path are completed (FIG. 4, Dataflow graph 400 illustrating the path for General Matrix Multiplication (GeMM) operations where GeMM 4 (407) performs read requests to aggregate results for the write request to GeMM 5 (408) until completion; Col. 17, lines 27-29, FIG. 4 is an example of a dataflow graph 400 with compute nodes that asynchronously transmit data along data connections; Col. 17, lines 54-59, The outputs from the multiple iterations are combined (e.g., concatenated) to generate an input for the matrix multiplication node 408).
With regard to claim 20, Chen teaches the method of claim 18, wherein monitoring the read request is performed until a path has a verified receipt of the local event tag (Col. 13, lines 6-9, The read ready token indicates to the particular producer that the corresponding consumer has freed a buffer data unit and is ready to receive an additional buffer data unit; Col. 16, lines 54-55, The consumer sends the read ready token to the producer when the consumer is ready to receive a new sample).
With regard to claim 21, Chen teaches the method of claim 18, wherein monitoring the read request is performed until the read request in the first batch of the plurality of batches is completed (Col. 12-Col. 13, lines 61-67 and lines 1-3 and lines 9-11, The anti-congestion logic 232 is configured to configure each of the producers with a ready-to-read counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer. The ready-to-read credit counter is configured to decrement when the particular producer begins writing buffer data unit into the corresponding consumer along a data connection … The particular producer stops writing data into the corresponding consumer when the ready-to-read credit counter has zero read credits).
With regard to claim 22, Chen teaches the method of claim 18, wherein monitoring the read request is performed until two or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed (Col. 20, lines 19-32, At action 1052, the method further includes (i) configuring each of the producers with a ready-to-read credit counter, (ii) initializing the ready-to-read credit counter of a particular producer with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer, (iii) decrementing the ready-to-read credit counter when the particular producer begins writing a buffer data unit into the corresponding consumer along a data connection, and (iv) incrementing the ready-to-read credit counter when the particular producer receives from the corresponding consumer a read ready token along a control connection. The read ready token indicates to the particular producer that the corresponding consumer has freed a buffer data unit and is ready to receive an additional buffer data unit (Examiner notes: Condition B, receipt of local event completion tag and Condition C, awaiting saturation of read buffer after processing first data batch)).
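Similarly, the ready-to-read credit scheme quoted from Chen (Col. 20, lines 19-32) can be sketched as follows; the buffer_depth value and the token-handler names are illustrative assumptions only, not code from the reference:

```python
class ReadyToReadCounter:
    """Sketch of the quoted ready-to-read credit scheme: the counter is
    initialized with as many read credits as the consumer's buffer depth,
    decremented when the producer begins writing, and incremented when a
    read ready token indicates the consumer freed a buffer data unit."""
    def __init__(self, buffer_depth=3):
        self.credits = buffer_depth  # (i)-(ii): initialize to buffer depth

    def begin_write(self):
        # (iii): decrement when the producer begins writing into the consumer;
        # the producer stops writing when the counter has zero read credits
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_read_ready_token(self):
        # (iv): increment when the consumer frees a buffer data unit
        self.credits += 1

rtr = ReadyToReadCounter(buffer_depth=3)
writes = [rtr.begin_write() for _ in range(4)]  # fourth write must stall
rtr.on_read_ready_token()                       # downstream read frees a slot
print(writes, rtr.begin_write())  # -> [True, True, True, False] True
```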
With regard to claim 23, Chen teaches an apparatus for implementing slice (Col. 14, lines 6-10, A data processing pipeline/operation comprises at least a producer, a compute node, and a consumer; Col. 14, lines 30-35, In the context of this application, a producer can be referred to as an upstream buffer or upstream memory node/unit, a compute node can be referred to as an intermediary compute node/unit or an intermediary processing node/unit, and a consumer can be referred to as a downstream buffer or downstream memory node/unit) coordination, the apparatus comprising (Col. 3, lines 21-24, A technology is described which enables efficient control signaling among processing units of a data processing system, including among reconfigurable processing units of a coarse-grained reconfigurable array processor):
means for executing a write request with local coordination (Col. 15, lines 10-12 and lines 22-28, The write credit counter is configured to decrement when the producer begins writing the buffer data unit into the consumer … The write credit counter is configured to increment when the producer receives from the consumer a write done token. The write done token indicates to the producer that the writing of the buffer data unit into the consumer has completed. The producer resumes writing data into the consumer when the producer receives the write done token from the consumer. The producer stops writing data into the consumer when the write credit counter has zero write credits (Examiner notes: Where write request coordination is managed by the credit counters of the associated slice)) in a first batch of a plurality of batches in each slice of a plurality of slices (Col. 15, lines 48-49, At a first timestep, the producer begins writing a first buffer unit data into the consumer) …;
means for setting a local event tracker (Col. 3, lines 56-58, The compiler is further configured to configure each of the producers (Examiner notes: each of the slices) with a ready-to-read credit counter; Col. 4, lines 16-18, In some embodiments, the compiler is further configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 15, lines 36-38, As discussed above, the ready-to-read credit counter is initialized with three read credits and the write credit counter is initialized with two write credits) in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 15, lines 49-52, In response, the ready-to-read counter is decremented by one (read credit = 2) and the write credit counter is also decremented by one (write credit = 1));
means for monitoring the write request in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 4, lines 16-18, In some embodiments, the compiler is further configured to configure each of the producers with a write credit counter that is initialized with one or more credits; Col. 15, lines 53-64, At a second timestep, the producer begins writing a second buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit=1) and the write credit counter is further decremented by one (write credit = 0). The write credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a third timestep, the writing of the first buffer unit data into the consumer is complete. In response, the consumer sends a first write done token to the producer. In response, the write credit counter is incremented by one and therefore reactivated (write credit = 1) (Examiner notes: Where slice request execution is monitored by the credits available));
means for executing a read request with the local coordination (Col. 3, lines 56-61, The compiler is further configured to configure each of the producers with a ready-to-read credit counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer; Col. 15, lines 7-11 and lines 12-22, The ready-to-read counter is configured to decrement when the producer begins writing a buffer data unit into the consumer. The size of the buffer data unit is s (e.g., 16 bytes, 64 bytes, 512 bytes) … The ready-to-read credit counter is configured to increment when the producer receives from the consumer a read ready token. The read ready token indicates to the producer that the consumer has freed a buffer data unit and is ready to receive an additional buffer data unit. The producer stops writing data into the consumer when the ready-to-read credit counter has zero read credits. The producer resumes writing data into the consumer when the producer receives read ready token from the consumer) in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed (Col. 16, lines 31-40, At the fifth timestep, the producer begins writing a third buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit = 0) and the write credit counter is decremented by one (write credit = 1). The ready-to-read credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a sixth timestep, the writing of the third buffer unit consumer is complete. In response, the consumer sends a third write done token to the producer); and
means for monitoring the read request in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 16, lines 43-51, At a seventh timestep, a downstream consumer reads a buffer unit data from the consumer, i.e., the consumer is an upstream producer from the perspective of the downstream consumer and write the buffer unit data into the downstream consumer. This frees up space in the consumer equaling to a buffer unit data and therefore the consumer is ready to receive an additional buffer unit data from the producer. In response, the consumer sends a read ready token to the producer).
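The interaction of the two counters in Chen's timestep walkthrough (Col. 15-16) can be restated in a brief sketch. The initial values of three read credits and two write credits are taken from the quoted passage; the function and variable names are illustrative assumptions of this discussion, not the reference's code:

```python
def producer_can_write(read_credits, write_credits):
    """Per the quoted walkthrough, the producer stalls when either
    counter reaches zero; this is a toy restatement, not Chen's code."""
    return read_credits > 0 and write_credits > 0

# Per Col. 15, lines 36-38: read credits start at 3, write credits at 2.
rc, wc = 3, 2
log = []
for _ in range(2):                       # timesteps 1-2: two writes begin
    assert producer_can_write(rc, wc)
    rc, wc = rc - 1, wc - 1
log.append(producer_can_write(rc, wc))   # write credit = 0: writing stopped
wc += 1                                  # timestep 3: first write done token
log.append(producer_can_write(rc, wc))   # reactivated (write credit = 1)
print(log)  # -> [False, True]
```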
Chen reasonably teaches executing requests with local coordination in a slice (Col. 15). However, Chen does not explicitly teach a plurality of batches in each slice of the plurality of slices. Further, Chen reasonably teaches monitoring of requests through counters (Col. 15), however, does not explicitly teach local event tags set in the batches of the plurality of batches.
Chandra teaches coordination in a first batch of a plurality of batches (Col. 1, lines 46-50, According to the present invention, the foregoing and other objects are attained by providing corresponding sets of test instructions (Examiner notes: a batch of instructions) for a number of nodes in a computer network. The test instructions are partitioned into test modules; Col. 1, lines 64-67, The test modules have an ordered sequence. That is, there is a first module, a second module, a third module, and so on (Examiner notes: a workload partitioned into a plurality of ordered batches), in each set of modules on each one of the nodes) in each slice of a plurality of slices (Col. 1, lines 51-56, When a node completes processing of one of its test modules, the node stores a test result corresponding to the test module. Meanwhile, another node processing its set of the test modules, stores test results for each one of the modules when it completes. Thus the two nodes process the test modules asynchronously (Examiner notes: the slice apparatus of Chen); Col. 2, lines 1-2, The nodes process the corresponding test modules in the same ordered sequence (Examiner notes: each node receives the same copy of the first batch));
setting a local event tag in the first batch of the plurality of batches in each of the plurality of slices (Col. 4, lines 54-67, If the comparison at 160 of node Nx and node Ny test module processing status indicates that node Nx has processed at least as many test modules as node Ny, then, at 178, node Nx sends the result of the highest order test module processed by node Nx to node Ny. Then, at 180 node Nx waits for node Ny to process additional test modules until node Ny catches up with node Nx. Once node Ny has processed a test module of the same order completed by node Nx, then, at 185, node Nx gets the result of that test module from node Ny. Then, at 190, the results of the highest order test modules processed by both node Nx and node Ny are compared) which is substantially similar to claim 11 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 11 is being substantially recited again as limitations for the apparatus of claim 23.
However, Chen and Chandra do not explicitly teach coordinating distributed information processing locally using an asynchronous local output sync signal.
Guo teaches using an asynchronous local output sync signal ([0112], For example, in some embodiments, the synchronization unit 700 may provide hardware acceleration to arrive-wait barriers by, for example, caching the arrive-wait barriers in the barrier cache, providing the datapath circuit for processing arrive() and wait() on the arrive-wait barriers, and by providing for the threads waiting on the arrive-wait barrier to utilize the try-wait barrier. In an example embodiment, the producer will perform a sequence wait(barrier0)=>data store=>fence()=>arrive(barrier1), while the consumer will perform the sequence arrive(barrier0)=>other operations=>wait(barrier1)=>consume stored data, in which at least the wait() may use the try-wait buffer thereby reducing the polling of barrier1) which is substantially similar to claim 1 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 11 is being substantially recited again for the apparatus of claim 23.
With regard to claim 24, Chen teaches the apparatus of claim 23, wherein the means for executing the write request is configured to execute asynchronously with respect to other slices of the plurality of slices (Col. 13, lines 20-23, The anti-congestion logic 232 is configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers), and wherein the means for executing the read request is configured to execute asynchronously (Col. 12, lines 61-67, The anti-congestion logic 232 is configured to configure each of the producers with a ready-to-read credit counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers).
However, Chen does not explicitly teach that the asynchronous execution is with respect to other slices of the plurality of slices.
Chandra teaches with respect to other slices of the plurality of slices (Col. 3, lines 51-56, However, by partitioning signatures analysis programs into test modules producing a signature for each of the modules, as provided in the present invention, and coordinating the processing of the test modules concurrently but in an asynchronous mode among a number of nodes) which is substantially similar to claim 12 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 12 is being substantially recited again as limitations for the apparatus of claim 24.
With regard to claim 25, Chen teaches the apparatus of claim 24, wherein the means for monitoring the write request is configured to perform monitoring until one or more of the following occur: a) all previous write requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the write request in the first batch of the plurality of batches is completed (Col. 20, lines 34-45, At action 1062, the method includes (i) configuring each of the producers with a write credit counter, (ii) initializing the write credit counter with one or more write credits, (iii) decrementing the write credit counter when the particular producer begins writing the buffer data unit into the corresponding consumer along the data connection, and (iv) incrementing the write credit counter when the particular producer receives from the corresponding consumer a write done token along the control connection. The write done token indicates to the particular producer that the writing of the buffer data unit into the corresponding consumer has completed (Examiner notes: Condition B, receipt of local event completion tag and Condition C, completion of first data batch)).
With regard to claim 26, Chen teaches the apparatus of claim 24, wherein the means for monitoring the read request is configured to perform monitoring until one or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed (Col. 20, lines 19-32, At action 1052, the method further includes (i) configuring each of the producers with a ready-to-read credit counter, (ii) initializing the ready-to-read credit counter of a particular producer with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer, (iii) decrementing the ready-to-read credit counter when the particular producer begins writing a buffer data unit into the corresponding consumer along a data connection, and (iv) incrementing the ready-to-read credit counter when the particular producer receives from the corresponding consumer a read ready token along a control connection. The read ready token indicates to the particular producer that the corresponding consumer has freed a buffer data unit and is ready to receive an additional buffer data unit (Examiner notes: Condition B, receipt of local event completion tag and Condition C, awaiting saturation of read buffer after processing first data batch)).
With regard to claim 27, Chen teaches a non-transitory computer-readable medium storing computer executable code, operable on a device comprising at least one processor and at least one memory coupled to the at least one processor, wherein the at least one processor is configured to implement slice (Col. 14, lines 6-10, A data processing pipeline/operation comprises at least a producer, a compute node, and a consumer; Col. 14, lines 30-35, In the context of this application, a producer can be referred to as an upstream buffer or upstream memory node/unit, a compute node can be referred to as an intermediary compute node/unit or an intermediary processing node/unit, and a consumer can be referred to as a downstream buffer or downstream memory node/unit) coordination, the computer executable code comprising (Col. 20, lines 46-54, Other implementations of the method described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the methods described above. Yet another implementation of the method described in this section includes a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the methods described above):
instructions for causing a computer to execute a write request with local coordination (Col. 15, lines 10-12 and lines 22-28, The write credit counter is configured to decrement when the producer begins writing the buffer data unit into the consumer … The write credit counter is configured to increment when the producer receives from the consumer a write done token. The write done token indicates to the producer that the writing of the buffer data unit into the consumer has completed. The producer resumes writing data into the consumer when the producer receives the write done token from the consumer. The producer stops writing data into the consumer when the write credit counter has zero write credits) in a first batch of a plurality of batches in each slice of a plurality of slices (Col. 15, lines 48-49, At a first timestep, the producer begins writing a first buffer unit data into the consumer);
instructions for causing the computer to set a local event tracker (Col. 3, lines 56-58, The compiler is further configured to configure each of the producers (Examiner notes: each of the slices) with a ready-to-read credit counter; Col. 4, lines 16-18, In some embodiments, the compiler is further configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 15, lines 36-38, As discussed above, the ready-to-read credit counter is initialized with three read credits and the write credit counter is initialized with two write credits) in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 15, lines 49-52, In response, the ready-to-read counter is decremented by one (read credit = 2) and the write credit counter is also decremented by one (write credit = 1) …);
instructions for causing the computer to monitor the write request in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 15, lines 53-64, At a second timestep, the producer begins writing a second buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit=1) and the write credit counter is further decremented by one (write credit = 0). The write credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a third timestep, the writing of the first buffer unit data into the consumer is complete. In response, the consumer sends a first write done token to the producer. In response, the write credit counter is incremented by one and therefore reactivated (write credit = 1) (Examiner notes: Where slice request execution is monitored by the credits available));
instructions for causing the computer to execute a read request with the local coordination (Col. 15, lines 7-11 and lines 12-22, The ready-to-read counter is configured to decrement when the producer begins writing a buffer data unit into the consumer. The size of the buffer data unit is s (e.g., 16 bytes, 64 bytes, 512 bytes) … The ready-to-read credit counter is configured to increment when the producer receives from the consumer a read ready token. The read ready token indicates to the producer that the consumer has freed a buffer data unit and is ready to receive an additional buffer data unit. The producer stops writing data into the consumer when the ready-to-read credit counter has zero read credits. The producer resumes writing data into the consumer when the producer receives read ready token from the consumer) in the first batch of the plurality of batches in the each slice of the plurality of slices after monitoring the write request is completed (Col. 16, lines 31-40, At the fifth timestep, the producer begins writing a third buffer unit data into the consumer. In response, the ready-to-read credit counter is further decremented by one (read credit = 0) and the write credit counter is decremented by one (write credit = 1). The ready-to-read credit counter expires and therefore the producer stops writing data into the consumer (writing stopped). At a sixth timestep, the writing of the third buffer unit consumer is complete. In response, the consumer sends a third write done token to the producer); and
instructions for causing the computer to monitor the read request in the first batch of the plurality of batches in the each slice of the plurality of slices (Col. 16, lines 43-51, At a seventh timestep, a downstream consumer reads a buffer unit data from the consumer, i.e., the consumer is an upstream producer from the perspective of the downstream consumer and write the buffer unit data into the downstream consumer. This frees up space in the consumer equaling to a buffer unit data and therefore the consumer is ready to receive an additional buffer unit data from the producer. In response, the consumer sends a read ready token to the producer).
Chen reasonably teaches executing requests with local coordination in a slice (Col. 15). However, Chen does not explicitly teach a plurality of batches in each slice of the plurality of slices. Further, while Chen reasonably teaches monitoring of requests through counters (Col. 15), Chen does not explicitly teach local event tags set in the batches of the plurality of batches.
Chandra teaches coordination in a first batch of a plurality of batches (Col. 1, lines 46-50, According to the present invention, the foregoing and other objects are attained by providing corresponding sets of test instructions (Examiner notes: a batch of instructions) for a number of nodes in a computer network. The test instructions are partitioned into test modules; Col. 1, lines 64-67, The test modules have an ordered sequence. That is, there is a first module, a second module, a third module, and so on (Examiner notes: a workload partitioned into a plurality of ordered batches), in each set of modules on each one of the nodes) in each slice of a plurality of slices (Col. 1, lines 51-56, When a node completes processing of one of its test modules, the node stores a test result corresponding to the test module. Meanwhile, another node, processing its set of the test modules, stores test results for each one of the modules when it completes. Thus the two nodes process the test modules asynchronously (Examiner notes: the slice apparatus of Chen); Col. 2, lines 1-2, The nodes process the corresponding test modules in the same ordered sequence (Examiner notes: each node receives the same copy of the first batch));
setting a local event tag in the first batch of the plurality of batches in each of the plurality of slices (Col. 4, lines 54-67, If the comparison at 160 of node Nx and node Ny test module processing status indicates that node Nx has processed at least as many test modules as node Ny, then, at 178, node Nx sends the result of the highest order test module processed by node Nx to node Ny. Then, at 180, node Nx waits for node Ny to process additional test modules until node Ny catches up with node Nx. Once node Ny has processed a test module of the same order completed by node Nx, then, at 185, node Nx gets the result of that test module from node Ny. Then, at 190, the results of the highest order test modules processed by both node Nx and node Ny are compared), which is substantially similar to claim 11 and therefore rejected with similar rationale.
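Examiner notes: the catch-up-and-compare sequence of Chandra quoted above may be illustrated by the following sketch; the function and variable names are the editor's, not Chandra's:

```python
def compare_highest_results(nx_results, ny_results):
    """Illustrative sketch of Chandra's comparison step: node Nx has
    processed at least as many ordered test modules as node Ny; Nx waits
    for Ny to catch up, then the results of the highest-order module
    processed by both nodes are compared.

    Each list holds per-module results in the ordered sequence (first
    module, second module, and so on).
    """
    # Node Ny "catches up" once it has processed this many modules.
    common = min(len(nx_results), len(ny_results))
    if common == 0:
        return None  # no test module has yet been processed by both nodes
    order = common - 1  # highest-order module processed by both nodes
    return nx_results[order] == ny_results[order]
```

A matching result at the common highest-order module corresponds to agreement between the two asynchronously processing nodes; a mismatch signals a divergent test outcome.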
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 11 is being substantially recited again as limitations for the non-transitory computer-readable medium of claim 27.
However, Chen and Chandra do not explicitly teach coordinate distributed information processing locally using an asynchronous local output sync signal.
Guo teaches using an asynchronous local output sync signal ([0112], For example, in some embodiments, the synchronization unit 700 may provide hardware acceleration to arrive-wait barriers by, for example, caching the arrive-wait barriers in the barrier cache, providing the datapath circuit for processing arrive() and wait() on the arrive-wait barriers, and by providing for the threads waiting on the arrive-wait barrier to utilize the try-wait barrier. In an example embodiment, the producer will perform a sequence wait(barrier0)=>data store=>fence()=>arrive(barrier1), while the consumer will perform the sequence arrive(barrier0)=>other operations=>wait(barrier1)=>consume stored data, in which at least the wait() may use the try-wait barrier, thereby reducing the polling of barrier1), which is substantially similar to claim 1 and therefore rejected with similar rationale.
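Examiner notes: the producer/consumer barrier sequence of Guo quoted above may be illustrated by the following software sketch. This is not Guo's hardware synchronization unit; the class is an editor's stand-in for an arrive-wait barrier, and Python's memory model stands in for the fence():

```python
import threading

class ArriveWaitBarrier:
    """Illustrative sketch of a split (arrive-wait) barrier: arrive()
    signals the barrier and wait() blocks until it has been signaled."""

    def __init__(self):
        self._event = threading.Event()

    def arrive(self):
        self._event.set()

    def wait(self):
        self._event.wait()

barrier0 = ArriveWaitBarrier()
barrier1 = ArriveWaitBarrier()
stored = []

def producer():
    # wait(barrier0) => data store => fence() => arrive(barrier1)
    barrier0.wait()
    stored.append("data")  # data store; a memory fence would follow in hardware
    barrier1.arrive()

def consumer(out):
    # arrive(barrier0) => other operations => wait(barrier1) => consume
    barrier0.arrive()
    barrier1.wait()
    out.append(stored[0])  # consume stored data
```

Running the two sequences on separate threads, the consumer's arrive(barrier0) releases the producer, whose arrive(barrier1) in turn releases the consumer to read the stored data.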
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 11 is being substantially recited again for the non-transitory computer-readable medium of claim 27.
With regard to claim 28, Chen teaches the non-transitory computer-readable medium of claim 27, further comprising instructions for causing the computer to execute the write request asynchronously with respect to other slices of the plurality of slices (Col. 13, lines 20-23, The anti-congestion logic 232 is configured to configure each of the producers with a write credit counter that is initialized with one or more write credits; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers) and to execute the read request asynchronously (Col. 12, lines 61-67, The anti-congestion logic 232 is configured to configure each of the producers with a ready-to-read credit counter, such that the ready-to-read credit counter of a particular producer is initialized with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer; Col. 14, lines 35-40, Additionally, the producers, the compute nodes, and the consumers operate asynchronously and therefore use the anti-congestion flow control described herein to handle backpressure and avoid processing bottlenecks and buffer overflows between the producers and the consumers).
However, Chen does not explicitly teach that the asynchronous execution is with respect to other slices of the plurality of slices.
Chandra teaches with respect to other slices of the plurality of slices (Col. 3, lines 51-56, However, by partitioning signatures analysis programs into test modules producing a signature for each of the modules, as provided in the present invention, and coordinating the processing of the test modules concurrently but in an asynchronous mode among a number of nodes), which is substantially similar to claim 12 and therefore rejected with similar rationale.
Examiner notes: It would be obvious for one of ordinary skill in the art to recognize that the method of claim 12 is being substantially recited again as limitations for the non-transitory computer-readable medium of claim 28.
With regard to claim 29, Chen teaches the non-transitory computer-readable medium of claim 28, further comprising instructions for causing the computer to monitor the write request until one or more of the following occur: a) all previous write requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the write request in the first batch of the plurality of batches is completed (Col. 20, lines 34-45, At action 1062, the method includes (i) configuring each of the producers with a write credit counter, (ii) initializing the write credit counter with one or more write credits, (iii) decrementing the write credit counter when the particular producer begins writing the buffer data unit into the corresponding consumer along the data connection, and (iv) incrementing the write credit counter when the particular producer receives from the corresponding consumer a write done token along the control connection. The write done token indicates to the particular producer that the writing of the buffer data unit into the corresponding consumer has completed (Examiner notes: Condition B, receipt of local event completion tag and Condition C completion of first data batch).
With regard to claim 30, Chen teaches the non-transitory computer-readable medium of claim 28, further comprising instructions for causing the computer to monitor the read request until one or more of the following occur: a) all previous read requests in a path are completed; b) the path has a verified receipt of the local event tag; c) the read request in the first batch of the plurality of batches is completed (Col. 20, lines 19-32, At action 1052, the method further includes (i) configuring each of the producers with a ready-to-read credit counter, (ii) initializing the ready-to-read credit counter of a particular producer with as many read credits as a buffer depth of a corresponding consumer that reads data from the particular producer, (iii) decrementing the ready-to-read credit counter when the particular producer begins writing a buffer data unit into the corresponding consumer along a data connection, and (iv) incrementing the ready-to-read credit counter when the particular producer receives from the corresponding consumer a read ready token along a control connection. The read ready token indicates to the particular producer that the corresponding consumer has freed a buffer data unit and is ready to receive an additional buffer data unit (Examiner notes: Condition B, receipt of local event completion tag and Condition C, awaiting saturation of read buffer after processing first data batch).
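Examiner notes: the write credit counter and write done token of Chen, cited above for claim 29, may be summarized by the following illustrative sketch; the class and method names are the editor's, not drawn from the reference:

```python
class WriteCreditProducer:
    """Illustrative sketch of Chen's write credit counter: initialized with
    one or more write credits, decremented when the producer begins writing
    a buffer data unit, and incremented when a write done token indicates
    the write into the consumer has completed."""

    def __init__(self, write_credits=1):
        self.write_credits = write_credits  # initialized with one or more credits
        self.pending = 0                    # writes awaiting a write done token

    def begin_write(self):
        if self.write_credits == 0:
            return False                    # must monitor until a write completes
        self.write_credits -= 1
        self.pending += 1
        return True

    def on_write_done_token(self):
        # Write done token: the write into the consumer has completed.
        self.pending -= 1
        self.write_credits += 1
```

With a single initial write credit, a second write cannot begin until the write done token for the first write arrives, mirroring the monitoring conditions (b) and (c) recited in claim 29.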
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IVAN A CASTANEDA whose telephone number is (571)272-0465. The examiner can normally be reached Monday-Friday 9:30AM-5:30PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/I.A.C./Examiner, Art Unit 2195
/Aimee Li/Supervisory Patent Examiner, Art Unit 2195