DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 6, 8-12, 15, and 17-18 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Yu et al. (US 2022/0261249 A1, hereinafter Yu).
Regarding claims 1 and 10, taking claim 10 as exemplary:
Yu teaches:
“A data processing circuit based on convolution computation, comprising: a plurality of memories, used to store a code; and a processor, coupled to the memories and configured to load and execute the code to:” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.” In paragraph [0283]: “The deep learning processor 320 controls the operation of the electronic device 30, and may also be referred to as a CPU. The memory 330 may include a read-only memory (ROM) and a random access memory (RAM), and provide an instruction and data to the deep learning processor 320. 
During specific application, components of the electronic device 30 are coupled together by using a bus system 340. In addition to a data bus, the bus system 340 may further include a power bus, a control bus, a status signal bus, and the like. However, for ease of clear description, all types of buses are marked as the bus system 340 in the figure.” And in paragraph [0284]: “The method disclosed in the foregoing embodiment of the present invention may be applied to the deep learning processor 320, or implemented by the deep learning processor 320. The deep learning processor 320 may be an integrated circuit chip, having a capability of processing a signal. In an implementation process, steps in the foregoing methods can be implemented by using a hardware integrated logical circuit in the deep learning processor 320, or by using instructions in a form of software” And in paragraph [0059]: “Sliding window-based operations (for example, depthwise convolution (DepthwiseCov), maximum pooling (MaxPool), average pooling (AvgPool), and upsampling) support repeated reading of continuous data. Based on depthwise convolution, it is supported that weight data is read sequentially and repeatedly.”)
“according to a size of a storage space of a first address of a first memory among the memories, store first partial data in input data into the first address of the first memory, wherein a size of the first partial data is not greater than the size of the storage space of the first address;” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.” In paragraph [0093]: “two tensors are read into a computing unit such as an ALU. 
After all multiplication and addition operations in the window are performed, an obtained operation result is written back by an AGU_W for writing back data.” In paragraph [0056]: “Specifically, in the address access of the AGU, a format of an access object is [N, H, W, C], H, [N, H, W, C] implements sequential access of a tensor with a size [1, H′, W′, C.sub.VEP]. C.sub.VEP represents a sub-tensor obtained after cutting an inputted tensor in a C dimension in a slice. A parallelism of C.sub.VEP in the C dimension is consistent with that of an ALU in a single instruction multi data (SIMD) processor. H′ and W′ are less than or equal to H and W of the inputted tensor respectively. Values of H′ and W′ depend on a capacity of a memory that can be directly accessed by the computing unit.”)
“and according to a size of a storage space of a second address of a second memory among the memories, store second partial data in the input data into the second address of the second memory, wherein a size of the second partial data is not greater than the size of the storage space of the second address,” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.” In paragraph [0093]: “two tensors are read into a computing unit such as an ALU. 
After all multiplication and addition operations in the window are performed, an obtained operation result is written back by an AGU_W for writing back data.” In paragraph [0056]: “Specifically, in the address access of the AGU, a format of an access object is [N, H, W, C], H, [N, H, W, C] implements sequential access of a tensor with a size [1, H′, W′, C.sub.VEP]. C.sub.VEP represents a sub-tensor obtained after cutting an inputted tensor in a C dimension in a slice. A parallelism of C.sub.VEP in the C dimension is consistent with that of an ALU in a single instruction multi data (SIMD) processor. H′ and W′ are less than or equal to H and W of the inputted tensor respectively. Values of H′ and W′ depend on a capacity of a memory that can be directly accessed by the computing unit.”)
“coordinates of the first partial data stored at the first address in two-dimensional coordinates of the input data of any channel are different from coordinates of the second partial data stored at the second address,” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.” In paragraph [0093]: “two tensors are read into a computing unit such as an ALU.
After all multiplication and addition operations in the window are performed, an obtained operation result is written back by an AGU_W for writing back data.” In paragraph [0056]: “Specifically, in the address access of the AGU, a format of an access object is [N, H, W, C], H, [N, H, W, C] implements sequential access of a tensor with a size [1, H′, W′, C.sub.VEP]. C.sub.VEP represents a sub-tensor obtained after cutting an inputted tensor in a C dimension in a slice. A parallelism of C.sub.VEP in the C dimension is consistent with that of an ALU in a single instruction multi data (SIMD) processor. H′ and W′ are less than or equal to H and W of the inputted tensor respectively. Values of H′ and W′ depend on a capacity of a memory that can be directly accessed by the computing unit.”)
“and the first address stores elements of a plurality of channels with same coordinates in the input data.” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.” In paragraph [0093]: “two tensors are read into a computing unit such as an ALU. After all multiplication and addition operations in the window are performed, an obtained operation result is written back by an AGU_W for writing back data.” In paragraph [0056]: “Specifically, in the address access of the AGU, a format of an access object is [N, H, W, C], H, [N, H, W, C] implements sequential access of a tensor with a size [1, H′, W′, C.sub.VEP]. 
C.sub.VEP represents a sub-tensor obtained after cutting an inputted tensor in a C dimension in a slice. A parallelism of C.sub.VEP in the C dimension is consistent with that of an ALU in a single instruction multi data (SIMD) processor. H′ and W′ are less than or equal to H and W of the inputted tensor respectively. Values of H′ and W′ depend on a capacity of a memory that can be directly accessed by the computing unit.”)
Regarding claims 2 and 11, taking claim 11 as exemplary:
Yu shows the method and circuit of claims 1 and 10 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: compare a channel number of the input data with the size of the storage space of the first address; and according to a comparison result between the channel number and the size of the storage space of the first address, determine an element number of at least one element of the input data comprised in the first partial data.” (Paragraph [0167]: “a read and write manner in which each piece of data uses a different fetch address is described. In vector read, each piece of data corresponds to one channel and uses one AGU. For example, a VEP is 128. In this case, 128 AGUs need to be used, and each AGU uses a separate set of configuration parameters. In conclusion, AGUs located on different channels can output different addresses and read or write corresponding data.” And in paragraph [0168]: “For ease of understanding, for example, a VEP is 128. FIG. 9 is a schematic diagram of an embodiment of using different fetch addresses for data in embodiments of this application. As shown in FIG. 9, E1 to E8 all denote channels. After the first target address of the first target data is obtained in an AGU_R0, the first target address can be sent to a data buffer through the channel E1, and the first target data corresponding to the first target address can be read from the data buffer through the channel E1 based on the first target address. Then, the first target data is sent to an ALU through the channel E2. Similarly, after the second target address of the second target data is obtained in an AGU_R0, the second target address can be sent to a data buffer through the channel E3, and the second target data corresponding to the second target address can be read from the data buffer through the channel E3 based on the second target address. Then, the second target data is sent to an ALU through the channel E4. 
After third target address of third target data is obtained in an AGU_R0, the third target address can be sent to a data buffer through the channel E5, and the third target data corresponding to the third target address can be read from the data buffer through the channel E5 based on the third target address. Then, the third target data is sent to an ALU through the channel E6. By analogy, 128 pieces of data can be read at the same time, and each data is read and written through different channels. It can be understood that the example in FIG. 9 is only used to understand this solution, and in practical application, a specific method for reading and writing data is flexibly determined according to an actual case.” And in paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. 
When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.”)
Regarding claims 3 and 12, taking claim 12 as exemplary:
Yu shows the method and circuit of claims 2 and 11 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: determine that the comparison result is that the channel number is not greater than the size of the storage space of the first address, and further determine that a product of the channel number and the element number is not greater than the size of the storage space of the first address.” (Paragraph [0167]: “a read and write manner in which each piece of data uses a different fetch address is described. In vector read, each piece of data corresponds to one channel and uses one AGU. For example, a VEP is 128. In this case, 128 AGUs need to be used, and each AGU uses a separate set of configuration parameters. In conclusion, AGUs located on different channels can output different addresses and read or write corresponding data.” And in paragraph [0168]: “For ease of understanding, for example, a VEP is 128. FIG. 9 is a schematic diagram of an embodiment of using different fetch addresses for data in embodiments of this application. As shown in FIG. 9, E1 to E8 all denote channels. After the first target address of the first target data is obtained in an AGU_R0, the first target address can be sent to a data buffer through the channel E1, and the first target data corresponding to the first target address can be read from the data buffer through the channel E1 based on the first target address. Then, the first target data is sent to an ALU through the channel E2. Similarly, after the second target address of the second target data is obtained in an AGU_R0, the second target address can be sent to a data buffer through the channel E3, and the second target data corresponding to the second target address can be read from the data buffer through the channel E3 based on the second target address. Then, the second target data is sent to an ALU through the channel E4. 
After third target address of third target data is obtained in an AGU_R0, the third target address can be sent to a data buffer through the channel E5, and the third target data corresponding to the third target address can be read from the data buffer through the channel E5 based on the third target address. Then, the third target data is sent to an ALU through the channel E6. By analogy, 128 pieces of data can be read at the same time, and each data is read and written through different channels. It can be understood that the example in FIG. 9 is only used to understand this solution, and in practical application, a specific method for reading and writing data is flexibly determined according to an actual case.” And in paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. 
When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.”)
Regarding claims 6 and 15, taking claim 15 as exemplary:
Yu shows the method and circuit of claims 1 and 10 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: read the input data from one of the memories according to location information, wherein the location information comprises a size of the input data and coordinates of at least one element in the input data.” (Paragraph [0052]: “After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed. Based on the above characteristic, a vector processing unit in a deep learning processor can be optimized according to the characteristic of tensor computing. When each operation is a predictable determinate operation, a bandwidth of the memory access interface can be used to the maximum extent, and all memory access operations are pipelined so that a set of vector data for computing reaches an operation unit in each clock cycle, to reduce memory losses...Therefore, design of an address generation unit (AGU) of a chip in a deep learning processor can be simplified according to the tensor memory access characteristics, thereby improving the memory access efficiency.”)
Regarding claims 8 and 17, taking claim 17 as exemplary:
Yu shows the method and circuit of claims 6 and 15 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: read a first convolution kernel group among a plurality of convolution kernels according to a size of a sum register, wherein a number of the convolution kernels in the first convolution kernel group is the same as the size of the sum register; and temporarily store a first convolution computation result of the input data and the first convolution kernel group into the sum register through first input first output.” (Paragraph [0052]: “A tensor operation is defined as follows: for an operator (or a function), a data access address is usually regularized and can be computed. A pooling operation is used as an example. Pooling is an operation centered on a loop. After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed.” And in paragraph [0053]: “For ease of understanding, refer to FIG. 2. FIG. 2 is a schematic diagram of an embodiment of a tensor computing process according to embodiments of this application. As shown in FIG. 2, each slice in a tensor corresponds to multiple kernel windows, and a kernel window may further include at least one piece of data. B1 and B2 indicate different kernel windows. Data indicated by B11 to B12 belong to the kernel window B1, and data indicated by B21 to B22 belong to the kernel window B2. 
The quantity of data between B11 and B12 and the quantity of data between B21 and B22 in FIG. 2 are not limited in this application. The kernel window is slid at a unit of stride on the feature map of each depth. Each time the kernel window is slid, data in the kernel window is sequentially read from an on-chip memory, and data read in multiple kernel windows forms continuous data streams.” And in paragraph [0055]: “A deep learning processor may compute and supply data. Data supply is to transfer to-be-computed data to a computing unit during computing. Because a memory usually uses a multi-level architecture, data supply usually includes three levels of transfer: an off-chip-memory to an on-chip-memory, the on-chip-memory to an on-chip-near-alu-buffer or the on-chip-memory to an on-chip-near-alu-register file, and the on-chip-near-alu-buffer or the on-chip-near-alu-register file to an ALU. Transfer from an off-chip-memory to an on-chip-memory and from the on-chip-memory to an on-chip-near-alu-buffer or an on-chip-near-alu-register file is mainly performed in a data preparation stage. Transfer from the on-chip-near-alu-buffer or the on-chip-near-alu-register file to the ALU is a data read stage of computing. An AGU provided in this application is configured to solve the problem of the data read stage from the on-chip-near-alu-buffer or the on-chip-near-alu-register file to the ALU, or solve the problem of the data read stage from the on-chip-memory to the ALU. An address generation method provided in this application can be parameterized for tensor access in the AGU, so that one set of parameters can support multiple access modes, to improve versatility of tensor access. In addition, data is sequentially read on an inputted tensor, thereby improving data access efficiency.”)
Regarding claims 9 and 18, taking claim 18 as exemplary:
Yu shows the method and circuit of claims 8 and 17 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: judge that a size of one of the convolution kernels is less than a computation amount of convolution computation; and repeatedly provide the input data for the convolution kernels to perform convolution computation.” (Paragraph [0052]: “A tensor operation is defined as follows: for an operator (or a function), a data access address is usually regularized and can be computed. A pooling operation is used as an example. Pooling is an operation centered on a loop. After being compiled by a central processing unit (CPU) or a graphics processing unit (GPU), an executed instruction stream includes many repeated instruction segments. A kernel window, a slice, and a batch are used as repetition cycles of a quantity of repeated instructions. An instruction execution sequence is determined. Then, a data arrangement format (for example, batch, height, width, and channels (NHWC)) is determined. Finally, a data access address is determined. That is, data in a kernel window is preferentially accessed, and then the kernel window is moved by a fixed stride to obtain a next kernel window. Therefore, a data access address is regularized and can be computed.” And in paragraph [0053]: “For ease of understanding, refer to FIG. 2. FIG. 2 is a schematic diagram of an embodiment of a tensor computing process according to embodiments of this application. As shown in FIG. 2, each slice in a tensor corresponds to multiple kernel windows, and a kernel window may further include at least one piece of data. B1 and B2 indicate different kernel windows. Data indicated by B11 to B12 belong to the kernel window B1, and data indicated by B21 to B22 belong to the kernel window B2. The quantity of data between B11 and B12 and the quantity of data between B21 and B22 in FIG. 2 are not limited in this application. The kernel window is slid at a unit of stride on the feature map of each depth. 
Each time the kernel window is slid, data in the kernel window is sequentially read from an on-chip memory, and data read in multiple kernel windows forms continuous data streams.” And in paragraph [0059]: “Sliding window-based operations (for example, depthwise convolution (DepthwiseCov), maximum pooling (MaxPool), average pooling (AvgPool), and upsampling) support repeated reading of continuous data. Based on depthwise convolution, it is supported that weight data is read sequentially and repeatedly.”)
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 4-5, 7, 13-14, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Mills et al., (US 2021/0319290 A1, hereinafter Mills).
Regarding claims 4 and 13, taking claim 13 as exemplary:
Yu shows the circuit and method of claims 2 and 11 as claimed and specified above.
But Yu does not appear to explicitly recite “wherein the processor is further configured to: determine that the comparison result is that the channel number is greater than the size of the storage space of the first address, and further determine that the element number comprised in the first partial data is one.”
However, Mills shows “wherein the processor is further configured to: determine that the comparison result is that the channel number is greater than the size of the storage space of the first address, and further determine that the element number comprised in the first partial data is one.” (Paragraph [0115]: “An elementwise operation on data may be performed by planar engine 340 in more than one operating cycle due to the size of the input data exceeding a work unit. For example, in a first operating cycle, planar engine 340 only fetches first value 732 of width vector 730. Tensor 742 is used in an elementwise operation to combine with values in the first channel of input data 710 (stored as first plane 712) in the first operating cycle. In the second operating cycle, planar engine 340 only fetches second value 734. In planar engine 340, second value 734 is expanded in both width and height dimensions to generate another 5×5×1 tensor 744 for use in another elementwise operation that combines tensor 744 with the second channel of input data 710 (stored as the second plane 714) in the second operating cycle. The process is repeated for the third, fourth, and fifth operating cycles to complete the elementwise operation on the input data 710. To complete operations on a source dataset that have three or more dimensions (e.g., 5 dimensions), planar engine 340 may go through the work units (e.g., 5×5 work units) within a channel one work unit at a time, move to another channel until other channels are completed, and loop to another dimension (e.g., depth) until processing across other dimensions are also completed. Planar engine 340 processes a work unit of 5×5 width and height of a single channel in an operating cycle. To complete the elementwise operation of input data 710, which has 5 channels, 5 operating cycles are used.”)
Yu and Mills are analogous in the arts because both Yu and Mills describe the use of data processing for model data using channels and dimensional data.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Yu and Mills before him or her, to modify the teachings of Yu to include the teachings of Mills in order to increase the flexibility of input data handling, allowing greater amounts of input data to be processed through expanded dimensions of tensor data (see Mills paragraph [0115]).
Regarding claims 5 and 14, taking claim 14 as exemplary:
Yu and Mills teach the method and circuit of claims 4 and 13 as claimed and specified above.
And Yu shows “wherein the processor is further configured to: according to a size of a storage space of a third address of the first memory, store third partial data in the input data into the third address of the first memory, wherein a size of the third partial data is not greater than the size of the storage space of the third address.” (Paragraph [0167]: “a read and write manner in which each piece of data uses a different fetch address is described. In vector read, each piece of data corresponds to one channel and uses one AGU. For example, a VEP is 128. In this case, 128 AGUs need to be used, and each AGU uses a separate set of configuration parameters. In conclusion, AGUs located on different channels can output different addresses and read or write corresponding data.” And in paragraph [0168]: “For ease of understanding, for example, a VEP is 128. FIG. 9 is a schematic diagram of an embodiment of using different fetch addresses for data in embodiments of this application. As shown in FIG. 9, E1 to E8 all denote channels. After the first target address of the first target data is obtained in an AGU_R0, the first target address can be sent to a data buffer through the channel E1, and the first target data corresponding to the first target address can be read from the data buffer through the channel E1 based on the first target address. Then, the first target data is sent to an ALU through the channel E2. Similarly, after the second target address of the second target data is obtained in an AGU_R0, the second target address can be sent to a data buffer through the channel E3, and the second target data corresponding to the second target address can be read from the data buffer through the channel E3 based on the second target address. Then, the second target data is sent to an ALU through the channel E4. 
After third target address of third target data is obtained in an AGU_R0, the third target address can be sent to a data buffer through the channel E5, and the third target data corresponding to the third target address can be read from the data buffer through the channel E5 based on the third target address. Then, the third target data is sent to an ALU through the channel E6. By analogy, 128 pieces of data can be read at the same time, and each data is read and written through different channels. It can be understood that the example in FIG. 9 is only used to understand this solution, and in practical application, a specific method for reading and writing data is flexibly determined according to an actual case.”)
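For illustration only, the per-channel address generation Yu describes in paragraphs [0167]-[0168] (each vector lane's AGU holding its own configuration and emitting an independent fetch address each cycle) can be sketched abstractly as follows. The function name and parameters are hypothetical and are not Yu's; this is a minimal sketch of the general scheme.

```python
def agu_addresses(base_addrs, strides, step):
    """Illustrative sketch (cf. Yu [0167]-[0168]): each channel's AGU
    uses a separate set of configuration parameters (base address and
    stride here), so AGUs on different channels output different
    addresses and their data can be read or written in the same cycle."""
    return [base + stride * step for base, stride in zip(base_addrs, strides)]
```

With 128 lanes (Yu's VEP example), 128 such per-lane addresses would be produced per step, allowing 128 pieces of data to be read at the same time through different channels.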
Regarding claims 7 and 16, taking claim 16 as exemplary:
Yu shows the circuit and method of claims 6 and 15 as claimed and specified above.
But Yu does not appear to explicitly recite “wherein the processor is further configured to: in response to a coordinate of one of the at least one element being located outside the size of the input data, determine that a value of the element is one of the input data according to a padding mode.”
However, Mills shows “wherein the processor is further configured to: in response to a coordinate of one of the at least one element being located outside the size of the input data, determine that a value of the element is one of the input data according to a padding mode.” (Paragraph [0115]: “Input data is typically split into smaller pieces of data for parallel processing at multiple neural engines 314 or neural engines 314 and planar engine 340. A set of data used for a convolution operation may be referred to as a convolution group, which can be split into multiple smaller units. The hierarchy of smaller units (segments) may be convolution groups, slices, tiles, work units, output channel groups, input channels (Cin), sub-Cins for input stride, etc.” And in paragraph [0109]: “Since strides in other dimensions are multiples of the access granularity, the width dimension, in one embodiment, is rounded up to the granularity size by paddings (e.g., adding zeros to fill unoccupied memory locations). Put differently, some dimensions of data stored in buffer 334 may be rounded up to the access granularity while other dimensions might have more flexibility. In one embodiment, the width dimension is padded to next granularity multiple. As such, a channel vector 720, which has width equal to one, is inefficient to be stored because a padding of 14 or 15 bytes is needed. Instead of storing the vector 720 as a channel vector, in one embodiment, buffer 334 may store the vector 720 as a width vector 730.”)
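For illustration only, the padding behavior Mills describes in paragraph [0109] (rounding the width dimension up to the next multiple of the access granularity by filling unoccupied locations with zeros) can be sketched as follows. The function name and the granularity value used in the test are hypothetical; this is a minimal sketch of the general technique, not Mills's implementation.

```python
import numpy as np

def pad_to_granularity(vector, granularity):
    """Illustrative sketch (cf. Mills [0109]): round the width (last)
    dimension up to the next multiple of the access granularity by
    appending zeros to fill the unoccupied memory locations."""
    width = vector.shape[-1]
    padded_width = -(-width // granularity) * granularity  # ceiling division
    pad = padded_width - width
    # Pad only the last (width) dimension; other dimensions are untouched.
    return np.pad(vector, [(0, 0)] * (vector.ndim - 1) + [(0, pad)])
```

As in Mills's example, a vector of width one padded to a 16-byte granularity would require 14 or 15 bytes of padding, which is why Mills stores such data as a width vector instead.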
Yu and Mills are analogous in the arts because both Yu and Mills describe the use of data processing for model data using channels and dimensional data.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Yu and Mills before him or her, to modify the teachings of Yu to include the teachings of Mills in order to increase the flexibility and efficiency of input data processing by padding data so that it can be processed at different access granularities (see Mills paragraph [0109]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kong et al. (US 2022/0012587 A1) teaches, in paragraph [0087], the use of memory coordinates, addresses, and storage space through convolution step sizes applied with a kernel and based on storage sections.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHANE D WOOLWINE whose telephone number is (571) 272-4138. The examiner can normally be reached M-F, 9:30 AM-6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MIRANDA HUANG can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
SHANE D. WOOLWINE
Primary Examiner
Art Unit 2124
/SHANE D WOOLWINE/Primary Examiner, Art Unit 2124