DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Style
In this action unitalicized bold is used for claim language, while italicized bold is used for emphasis.
Applicant Reply
“The claims may be amended by canceling particular claims, by presenting new claims, or by rewriting particular claims as indicated in 37 CFR 1.121(c). The requirements of 37 CFR 1.111(b) must be complied with by pointing out the specific distinctions believed to render the claims patentable over the references in presenting arguments in support of new claims and amendments. . . . The prompt development of a clear issue requires that the replies of the applicant meet the objections to and rejections of the claims. Applicant should also specifically point out the support for any amendments made to the disclosure. See MPEP § 2163.06. . . . An amendment which does not comply with the provisions of 37 CFR 1.121(b), (c), (d), and (h) may be held not fully responsive. See MPEP § 714.” MPEP § 714.02. Generic statements or listing of numerous paragraphs do not “specifically point out the support for” claim amendments. “With respect to newly added or amended claims, applicant should show support in the original disclosure for the new or amended claims. See, e.g., Hyatt v. Dudas, 492 F.3d 1365, 1370, n.4, 83 USPQ2d 1373, 1376, n.4 (Fed. Cir. 2007) (citing MPEP § 2163.04 which provides that a ‘simple statement such as ‘applicant has not pointed out where the new (or amended) claim is supported, nor does there appear to be a written description of the claim limitation ‘___’ in the application as filed’ may be sufficient where the claim is a new or amended claim, the support for the limitation is not apparent, and applicant has not pointed out where the limitation is supported.’)” MPEP § 2163(II)(A).
Election/Restrictions
The claims are not restricted because there was no undue examination burden. However, it is noted that the Specification goes into significant detail with respect to DMA and various machine learning schemes. The claims are directed to specific ways of carrying out transpose operations made possible by using a specific type of transpose buffer. This invention has been constructively elected. See MPEP § 818.02(a). Claims 5 and 6, reciting DMA and an accelerator, were not restricted because they did not result in an undue examination burden. Details of accelerators, machine learning algorithms, DMA, or other technologies that are not directly related to transpose buffers may be independent or distinct from this invention. Amendments designed to include aspects which are independent or distinct from the elected invention will be evaluated for shift. See MPEP § 819.
Claim Rejections - 35 USC § 101
The claims read on various mathematical operations. But as a whole, the claims are directed to a technique that saves power when carrying out transpose operations of the type commonly used in machine learning. This is a technical solution to a technical problem.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for pre-AIA applications, the applicant) regards as the invention.
Generally, separately listed claim elements are construed as distinct components, all claim terms must be given weight, there is presumed to be a difference in meaning and scope when different words or phrases are used in separate claims, and repeated and consistent descriptions in the specification indicate the proper scope of a claimed term. “[C]laims must ‘conform to the invention as set forth in the remainder of the specification and the terms and phrases used in the claims must find clear support or antecedent basis in the description so that the meaning of the terms in the claims may be ascertainable by reference to the description.’ 37 C.F.R. § 1.75(d)(1).” Phillips v. AWH Corp., 415 F.3d 1303, 1316 (Fed. Cir. 2005) (as cited in MPEP § 2111). Therefore, use of two different terms in the claims that both rely on the description of a single structure in the Specification may render at least one term indefinite because there is no way to determine which term should be construed in view of the description of the single structure.
Claim 1 recites “A computing system, comprising one or more source memories, one or more destination memories, a transpose buffer, and a hardware component that is configured to:” carry out various operations. It is not clear whether the “computing system” or the “hardware component” is “configured to” carry out the operations in the body of the claim.
Claim 7 recites “A One or more computer-readable non-transitory storage media embodying software that is operable when executed to, by a computing system comprising one or more source memories, one or more destination memories, and a transpose buffer:” for carrying out the operations recited in the body of the claim. It is not clear whether the claim recites “A” or “One or more” storage media. One of these terms should be deleted so that the number of storage media is clear.
Claims 5, 11, and 17 substantially recite “wherein the hardware component is a direct memory access.” It is not clear what, if any, operations must be carried out using DMA. As an example, claim 5 further limits the “hardware component” of claim 1 to being “a direct memory access.” First, this language purports to limit a “hardware component” to a set of operations or a feature (and not to a DMA controller or hardware module). This inconsistency leaves it unclear whether the “hardware component” must be a DMA controller, or whether the hardware component can be generic hardware in a system otherwise implementing DMA. If the claims are interpreted as reciting the operation of “direct memory access,” it is not clear whether this operation must relate to the data being transferred via the transpose buffer in the independent claims. Since it is not clear whether “hardware component [that is] a direct memory access” requires a DMA controller or whether the language requires any specific DMA operations related to the memory operations in the body of the claim, the claim language is indefinite. Further, independent claims 7 and 13 do not recite a “hardware component.” It is therefore not clear which hardware in the independent claims provides antecedent support for “the hardware component” of claims 11 and 17.
Claim 13 recites “A method comprising, by a computing system comprising one or more source memories, one or more destination memories, and a transpose buffer: . . .” It is not clear whether the “computing system” must carry out the operations in the body of the claim. The claim recites a method “comprising, by a computing system” without any language indicating that the method is carried out by the “computing system.”
All dependent claims are rejected as containing the limitations of the claims from which they depend.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6-10, 12-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ruiz (Efficient low-power register array with transposed access mode) and AAPA in the Background of the Specification.
1. A computing system, comprising one or more source memories, one or more destination memories, a transpose buffer, and a hardware component that is configured to: (“Fig. 4 shows the proposed architecture of a low-powered transpose register array. Here, the size 4x4 is used as an example, but it can be easily extended to a generic MxM size. This circuit is made up of 16 registers each one of an N-bit word based on clock-gated DFFs, a control unit which selects one row (or column) and a selecting output circuit based on OR trees and multiplexers.” Ruiz P. 465. Ruiz teaches inputting data to the transpose buffer and outputting data from the transpose buffer. This implies a source for the input data and destination for the output data, but does not expressly teach either one. While one of ordinary skill in the art would understand data as being received from and sent to a memory based on context and a basic understanding of computer technology, this is not an express teaching.
The background section explains “Contemporary machine-learning (ML) models may require computing tensor transposes frequently. To compute a tensor transpose, a computing device that is not equipped with an enhanced tensor transpose solution may load a tensor from a source memory to an intermediate memory buffer in a row-major order (or column-major order) and write the tensor to a target memory in a column-major order (or row-major order).” Spec. ¶3. The Specification describes this aspect in reference to “[c]ontemporary machine-learning models” that are “not equipped with an enhanced tensor transpose solution.” Based on this characterization of loading a tensor from a source memory to a buffer used to transpose the tensor before storing it in a target memory, Examiner finds that this technique is applicant admitted prior art.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teaching of the AAPA with Ruiz because accessing data in memory is helpful to computing tensors, and computing tensors is helpful to training and using machine learning models.)
at each iteration i among N iterations of a first loop, wherein the transpose buffer has N rows and N columns: (“In real-time processing environments, the transpose operation is performed by a transpose memory which supports a continuous data flow where the processed data are stored in rows (columns) and read in columns (rows).” Ruiz P. 463. See Ruiz Figs. 1-2 showing a transpose memory (buffer) having rows and columns.)
read first data corresponding to row i of a first tensor from a first source memory; (“When rc=1 and si=1, all registers {Ri,j, j=0–3} belonging to the ith row are selected. The four inputs INPj are written in parallel in those registers and, at the same time, their four outputs Qi,j are read out through the outputs OUTj.” Ruiz p. 465. Performing this operation on a tensor is AAPA, as explained above. The motivation to combine addresses this aspect of the combination.)
read second data from column i of the transpose buffer; (“If rc=0 and sj=1, the registers {Ri,j, i=0–3} belonging to the jth column are now selected for writing the input INPi and reading out the output Qi,j through the output OUTi.” Ruiz p. 466.)
write the first data to column i of the transpose buffer; and (“When rc=1 and si=1, all registers {Ri,j, j=0–3} belonging to the ith row are selected. The four inputs INPj are written in parallel in those registers and, at the same time, their four outputs Qi,j are read out through the outputs OUTj.” Ruiz p. 465.)
cause the second data to be written to row i of a second tensor at a first destination memory; and (“If rc=0 and sj=1, the registers {Ri,j, i=0–3} belonging to the jth column are now selected for writing the input INPi and reading out the output Qi,j through the output OUTi.” Ruiz p. 466. “In a real-time transpose memory, simultaneous read and write operations are performed to carry out matrix transposition. To achieve this, input data are read out of the register column-wise if the previous intermediate data were written into the register row-wise, and vice versa.” Ruiz P. 464.)
at each iteration j among N iterations of a second loop: (“In a real-time transpose memory, simultaneous read and write operations are performed to carry out matrix transposition. To achieve this, input data are read out of the register column-wise if the previous intermediate data were written into the register row-wise, and vice versa.” Ruiz p. 464. Note here, that Ruiz is teaching a continuous process. See Ruiz P. 463 (“In real-time processing environments, the transpose operation is performed by a transpose memory which supports a continuous data flow where the processed data are stored in rows (columns) and read in columns (rows).”))
read third data corresponding to row j of a third tensor from a second source memory; read fourth data from row j of the transpose buffer; write the third data to row j of the transpose buffer; and cause the fourth data to be written to row j of a fourth tensor at a second destination memory, (This reads on reading a “third data” from a second source memory and writing that data to a row of the transpose buffer, as well as reading a “fourth data” from the same row in the buffer and writing that data to a second memory. Ruiz teaches a continuous process of reading data vectors from each row (or column) in a transpose buffer while simultaneously writing another vector to the given row (or column). The technique then performs the same operations on each column (or row) of the transpose buffer, thereby transposing a matrix (comprising the vectors). Ruiz teaches: “In a real-time transpose memory, simultaneous read and write operations are performed to carry out matrix transposition. To achieve this, input data are read out of the register column-wise if the previous intermediate data were written into the register row-wise, and vice versa.” Ruiz p. 464. “When rc=1 and si=1, all registers {Ri,j, j=0–3} belonging to the ith row are selected. The four inputs INPj are written in parallel in those registers and, at the same time, their four outputs Qi,j are read out through the outputs OUTj.” Ruiz p. 465. Note here, that Ruiz is teaching a continuous process. See Ruiz P. 463 (“In real-time processing environments, the transpose operation is performed by a transpose memory which supports a continuous data flow where the processed data are stored in rows (columns) and read in columns (rows).”))
wherein the fourth tensor is a transposed tensor of the first tensor. (“In a real-time transpose memory, simultaneous read and write operations are performed to carry out matrix transposition. To achieve this, input data are read out of the register column-wise if the previous intermediate data were written into the register row-wise, and vice versa.” Ruiz p. 464.)
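For illustration only, the two loops recited in claim 1 can be sketched as a small software model. This sketch is not taken from Ruiz or from the Specification. The function names (`first_loop`, `second_loop`, `transpose`), the list-of-lists representation, and the zero-initialized buffer are assumptions made here to show why the fourth tensor comes out as the transpose of the first tensor.

```python
from typing import List

Matrix = List[List[int]]


def first_loop(source: Matrix, buf: Matrix) -> Matrix:
    """First loop: rows of `source` go into buffer columns; buffer columns drain out."""
    n = len(buf)
    destination: Matrix = []
    for i in range(n):
        first_data = list(source[i])                  # read row i of the source tensor
        second_data = [buf[r][i] for r in range(n)]   # read column i of the buffer
        for r in range(n):
            buf[r][i] = first_data[r]                 # write first data to column i
        destination.append(second_data)               # row i of the destination tensor
    return destination


def second_loop(source: Matrix, buf: Matrix) -> Matrix:
    """Second loop: rows of `source` go into buffer rows; buffer rows drain out."""
    n = len(buf)
    destination: Matrix = []
    for j in range(n):
        third_data = list(source[j])                  # read row j of the source tensor
        fourth_data = list(buf[j])                    # read row j of the buffer
        buf[j] = third_data                           # write third data to row j
        destination.append(fourth_data)               # row j of the destination tensor
    return destination


def transpose(m: Matrix) -> Matrix:
    """Reference transpose used only to check the model."""
    return [list(col) for col in zip(*m)]


N = 3
first_tensor = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
third_tensor = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
buffer = [[0] * N for _ in range(N)]                  # assumed initial contents

second_tensor = first_loop(first_tensor, buffer)      # drains the initial zeros
fourth_tensor = second_loop(third_tensor, buffer)     # drains the first tensor, transposed
```

In this model the buffer holds the third tensor after the second loop, so a further pass of `first_loop` would emit the transpose of the third tensor. This alternation between column-wise and row-wise access is what the cited Ruiz passages describe as a continuous data flow.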
2. The computing system of Claim 1, wherein the transpose buffer is a Delay Flip-Flop (D Flip-Flop) memory. (“Each register . . . is made up of a DFF bank interconnected via 2:1 multiplexers forming 4 shift-registers of length 4 either in the horizontal direction (columns) when the selection signal r/c=0 or in the vertical direction (rows) when r/c=1. This signal changes every 4 clock cycles and controls the direction of shifting in the registers.” Ruiz P. 464.)
3. The computing system of Claim 1, wherein reading the second data from column i of the transpose buffer and writing the first data to column i of the transpose buffer occur simultaneously. (Ruiz teaches “Fig. 2(a) shows the schematic of the conventional implementation [7,13–15,18] of a 4x4 transpose register array, distributed in a matrix order. . . . Indeed, Fig. 2(b) and (c) graphically shows the horizontal (vertical) shifting equivalent to a reading/writing process in rows (columns) carried out over the sixteen previous data stored in columns (rows). This means that the transpose register has a parallel input/output structure and the data are transposed on the fly supporting a continuous data flow with the smallest possible size and minimal latency (4 clock cycles).” Ruiz p. 464. “This means that during each clock cycle only one row (column) is selected to perform a simultaneous read/write operation, while the rest of the registers are disabled” Ruiz P. 465. Note also that the latency of 4 clock cycles for transposing the matrix using a 4 x 4 transpose register array implies 1 clock cycle per row (or column), with each cycle performing both a read and a write operation.)
4. The computing system of Claim 1, wherein reading the fourth data from row j of the transpose buffer and writing the third data to row j of the transpose buffer occur simultaneously. (See rejection of claim 3.)
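As a rough behavioral model of the simultaneous read/write cited for claims 3 and 4, the sketch below treats each loop pass as one clock cycle in which one row (or column) is both read and overwritten. It is an assumption-laden illustration, not Ruiz's circuit: the generator name `stream_transpose`, the `row_mode` flag standing in for the rc signal, and the zero-filled flush vectors are all hypothetical.

```python
from typing import Iterable, Iterator, List


def stream_transpose(vectors: Iterable[List[int]], n: int) -> Iterator[List[int]]:
    """One vector in and one vector out per modeled clock cycle."""
    regs = [[0] * n for _ in range(n)]           # n x n register array, assumed zeroed
    for cycle, vec in enumerate(vectors):
        sel = cycle % n                          # the single row/column enabled this cycle
        row_mode = (cycle // n) % 2 == 0         # access direction flips every n cycles
        if row_mode:
            out = list(regs[sel])                # read row `sel` ...
            regs[sel] = list(vec)                # ... and overwrite it in the same cycle
        else:
            out = [regs[r][sel] for r in range(n)]   # read column `sel` ...
            for r in range(n):
                regs[r][sel] = vec[r]                # ... and overwrite it in the same cycle
        yield out


n = 4
matrix_a = [[r * n + c for c in range(n)] for r in range(n)]
matrix_b = [[100 + r * n + c for c in range(n)] for r in range(n)]
flush = [[0] * n for _ in range(n)]              # pushes the last matrix out

outputs = list(stream_transpose(matrix_a + matrix_b + flush, n))
transposed_a = outputs[n:2 * n]                  # appears n cycles after matrix_a entered
transposed_b = outputs[2 * n:3 * n]
```

Under these assumptions each transposed matrix emerges n cycles after it entered, which is consistent with the 4-cycle latency Ruiz reports for a 4 x 4 array.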
6. The computing system of Claim 1, wherein the computing system is a machine-learning accelerator. (The AAPA explains “Recently, computer processing devices specifically designed to accelerate ML calculations have been introduced. Those computer processing devices may be referred to as ML accelerators. . . . [E]xisting ML accelerators focus on using high compute parallelism along with an optimized data orchestration throughout the memory hierarchy to speed up the processing of convolutional layers or self-attention layers.” Spec. ¶2.
With respect to this limitation, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teaching of the AAPA with Ruiz because accelerators may improve the performance of machine learning models (i.e., may reduce the time to use and/or train the models).)
For rejections of claims 7-10 and 12, see rejections of claims 1-4 and 6.
For rejections of claims 13-16 and 18, see rejections of claims 1-4 and 6.
Claims 5, 11, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ruiz, AAPA in the Background of the Specification, and Glines (Strided DMA for Multidimensional Array Copy and Transpose, 2022).
5. The computing system of Claim 1, wherein the hardware component is a direct memory access. (The previously cited art does not address DMA.
Glines teaches “We emulate a multidimensional array direct memory access (DMA) copy and transpose engine using a GPU kernel. This DMA can more effectively prefetch and write-combine non-contiguous multidimensional array data, reducing latency and improving bandwidth. We propose a reconfigurable DMA engine that supports multiple strides and discuss how it can offload multidimensional array copy and transpose. Further, this DMA engine can use the stride information to better inform policies of higher level memory hierarchies to maximize bandwidth.” Glines Abstract. “CPU strided, GPU transpose: CPU copies data into a contiguous buffer for DMA offload, but does not transpose data. Data is transposed on GPU. GPUDMA emulation: GPU kernel ‘pulls’ the data instead of the CPU pushing the data. This emulates a strided DMA engine because the CPU is free to orchestrate other actions.” Glines P. 381. Note that this teaches using DMA to access data from a memory for transpose on a GPU.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teaching of Glines with Ruiz and the AAPA because using DMA when carrying out transposes of multidimensional arrays (e.g. tensors) can reduce latency and improve bandwidth, as stated in the Glines Abstract.)
For rejections of claims 11 and 17, see rejection of claim 5.
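To illustrate what a strided copy and transpose of the kind Glines discusses might look like, the sketch below gathers elements from a flat memory using a base address, a shape, and per-dimension strides. This is a hypothetical software analogy only. The function `strided_gather` and its descriptor layout are assumptions made here and are not taken from Glines.

```python
from typing import List, Sequence, Tuple


def strided_gather(
    mem: Sequence[int],
    base: int,
    shape: Tuple[int, int],
    strides: Tuple[int, int],
) -> List[int]:
    """Walk a 2-D descriptor over flat memory and return elements in traversal order."""
    rows, cols = shape
    row_stride, col_stride = strides
    return [
        mem[base + r * row_stride + c * col_stride]
        for r in range(rows)
        for c in range(cols)
    ]


# Flat memory holding a 2 x 3 row-major array: [[0, 1, 2], [3, 4, 5]].
memory = list(range(6))

# Plain copy: step 3 elements per row and 1 element per column.
plain_copy = strided_gather(memory, 0, (2, 3), (3, 1))

# Transpose: same memory, swapped shape and swapped strides.
transposed_copy = strided_gather(memory, 0, (3, 2), (1, 3))
```

Here `transposed_copy` is the 3 x 2 row-major layout of the transposed array, produced by changing only the descriptor and not the source data.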
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL M KNIGHT whose telephone number is (571) 272-8646. The examiner can normally be reached Monday - Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PAUL M KNIGHT/
Examiner, Art Unit 2148