Prosecution Insights
Last updated: April 19, 2026
Application No. 18/344,091

BINARY CONVOLUTION INSTRUCTIONS FOR BINARY NEURAL NETWORK COMPUTATIONS

Final Rejection: §101, §103, §112
Filed: Jun 29, 2023
Examiner: VICARY, KEITH E
Art Unit: 2183
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Texas Instruments Incorporated
OA Round: 4 (Final)

Grant Probability: 58% (Moderate)
OA Rounds: 5-6
To Grant: 3y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 58% (grants 58% of resolved cases; 393 granted / 683 resolved; +2.5% vs TC avg)
Interview Lift: +41.2% among resolved cases with interview (strong)
Typical Timeline: 3y 8m avg prosecution; 41 currently pending
Career History: 724 total applications across all art units

Statute-Specific Performance

§101: 8.7% (-31.3% vs TC avg)
§103: 34.0% (-6.0% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 37.6% (-2.4% vs TC avg)
Tech Center averages are estimates • Based on career data from 683 resolved cases
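For context, the headline figures above are internally consistent and can be reproduced from the raw counts shown on the page (a quick arithmetic sketch, not part of the prosecution record):

```python
# Reproduce the dashboard's headline allow rate from its raw counts.
granted, resolved = 393, 683
allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")  # 57.5%, shown rounded as 58%

# The per-statute deltas are relative to an estimated Tech Center average,
# so e.g. a -31.3% delta on the 8.7% §101 figure implies a ~40.0% TC estimate.
tc_101_estimate = 8.7 + 31.3
print(f"Implied §101 TC average: {tc_101_estimate:.1f}%")
```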

Office Action

§101 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-2, 4-6, 8-10, 12-14, 16-17, 23-25, 28-30, and 33-35 are pending in this office action and presented for examination. Claims 1, 5-6, 9, 13-14, 17, 23-25, 28-30, and 33-35 are newly amended, and claims 7 and 15 are newly cancelled by the response received January 28, 2026.

Claim Rejections - 35 USC § 112

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 25, 30, and 35 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim 25 recites the limitation “The hardware processing device of claim 1, wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit” in lines 1-4. However, the original disclosure does not appear to provide support for this limitation. For example, the original disclosure (e.g., paragraph [0064]) does not appear to provide support for the first XNOR circuit including both N/2 individual XNOR gates and a further element(s) beyond the N/2 individual XNOR gates, which is a scenario encompassed by the claim language in view of the open-ended “includes” language.

Claim 30 recites the limitation “The hardware apparatus of claim 9, wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit” in lines 1-4. However, the original disclosure does not appear to provide support for this limitation. For example, the original disclosure (e.g., paragraph [0064]) does not appear to provide support for the first XNOR circuit including both N/2 individual XNOR gates and a further element(s) beyond the N/2 individual XNOR gates, which is a scenario encompassed by the claim language in view of the open-ended “includes” language.

Claim 35 recites the limitation “The hardware computing apparatus of claim 17, wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit” in lines 1-5.
However, the original disclosure does not appear to provide support for this limitation. For example, the original disclosure (e.g., paragraph [0064]) does not appear to provide support for the first XNOR circuit including both N/2 individual XNOR gates and a further element(s) beyond the N/2 individual XNOR gates, which is a scenario encompassed by the claim language in view of the open-ended “includes” language.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2, 4-6, 8-10, 12-14, 16-17, 23, 25, 28, 30, 33, and 35 is/are rejected under 35 U.S.C. 103 as being unpatentable over Aggarwal et al. (Aggarwal) (US 20220413853 A1) in view of Nealis et al. (Nealis) (US 20180307950) in view of Herrero Abellanas et al. (Herrero Abellanas) (US 20150178246 A1).

Consider claim 1, Aggarwal discloses a hardware processing device comprising: a set of input registers ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory); convolution circuitry (FIG. 6, execution circuit 610) including a first channel and a second channel both coupled to the set of input registers ([0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)), and wherein the first channel includes a first accumulator circuit and wherein the second channel includes a second accumulator circuit ([0068], lines 2-4, adder/accumulator circuitry 618 are split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)); a set of destination registers coupled to the first accumulator circuit and the second accumulator circuit (FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); a decoder ([0045], lines 10-13, the decoder circuit 206 (e.g., decoder circuit) decodes the instruction into a decoded instruction (e.g., one or more micro-instructions or micro-operations). The decoded instruction is then sent for execution) coupled to the convolution circuitry (FIG. 2, execution circuit 212; FIG. 6, execution circuit 610); and instruction fetch circuitry coupled to the decoder ([0045], lines 8-9, the instruction (e.g., macro-instruction) is fetched from storage 202 and sent to decoder circuit 206; [0205], lines 4-5, instruction fetch unit) and configured to fetch a convolution instruction ([0049], line 2, packed data convolution instruction) from memory ([0045], lines 2-4, storage 202 that includes one or more packed data convolution with shift control and/or width control instructions), wherein the convolution instruction specifies a set of input data, a set of weight data, and the set of destination registers ([0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); wherein the decoder is configured to cause the set of input data and the set of weight data to be provided to the convolution circuitry (FIG. 6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0049], line 28, retrieve data associated with the identified source operands); wherein the convolution circuitry is configured to: perform a convolution operation ([0049], lines 30-31, execute the decoded instruction according to the opcode; [0049], line 2, packed data convolution instruction), via the first channel and the second channel ([0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)), on the set of input data and the set of weight data (FIG. 6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0029], lines 9-10, vector of values and a vector of filter weights); and wherein the set of input registers includes a first input register, coupled to the first channel, configured to store a first portion of the set of input data, further wherein the set of input registers includes a second input register, coupled to the second channel, configured to store a second portion of the set of input data ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide input data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel).

However, Aggarwal does not disclose that the convolution is a binary convolution (and therefore does not disclose that the aforementioned convolution circuitry is “binary” convolution circuitry; the aforementioned convolution instruction is a “binary” convolution instruction; and the aforementioned convolution operation is a “binary” convolution operation).
Consequently, Aggarwal also does not disclose the first channel includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and the aforementioned first accumulator circuit coupled to the first counter circuit, and wherein the second channel includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and the aforementioned second accumulator circuit coupled to the second counter circuit, the first input register coupled to the first XNOR circuit, the second input register coupled to the second XNOR circuit. Aggarwal also does not disclose the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row.

On the other hand, Nealis discloses binary convolution ([0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation), and a first channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and a first accumulator circuit coupled to the first counter circuit ([0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), and wherein a second channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and a second accumulator circuit coupled to the second counter circuit (Nealis, [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), a first input register coupled to the first XNOR circuit, a second input register coupled to the second XNOR circuit (FIG. 2D, register file 258 coupled to GPGPU Cores 262; [0074], lines 1-6, The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234; [0075], lines 1-4, The GPGPU cores 262 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234; [0208], lines 16-21, in one embodiment the fused XNOR and population count unit 2104 is included within all processing elements with the GPGPU. In one embodiment, only a subset of the processing elements within the GPGPU include the fused XNOR and population count unit 2104; Examiner notes that each register of the register file that may provide input data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel).

Nealis’ teaching enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity (Nealis, [0207], lines 6-9). Therefore, it would have been obvious, for one of ordinary skill in the art before the effective filing date of the claimed invention, with the Aggarwal and Nealis references in front of them, to combine the aforementioned references to result in the claimed subject matter (including binary convolution circuitry, a binary convolution instruction, and a binary convolution operation), as such a combination enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity.

Alternatively, this modification merely entails combining prior art elements (for example, an instruction performing convolution as taught by Aggarwal, and convolution being binary convolution in particular as taught by Nealis) according to known methods (for example, Examiner submits that implementing processor operations via instructions specifying operands, an instruction decoder, and so forth, was well-known before the effective filing date) to yield predictable results (the claimed subject matter, which includes binary convolution circuitry, a binary convolution instruction, and a binary convolution operation).

However, the combination thus far does not entail the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row.

On the other hand, Herrero Abellanas discloses convolution, wherein a first portion of a set of input data is a first column or row, and a second portion of a set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row ([0043], lines 1-4, in the next iteration, the imaginary input window of convolution circuit 314 may shift to the next row of the input image, in order to apply to convolution filter to the pixels I5, . . . , I8). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Herrero Abellanas with the combination of Aggarwal and Nealis in order to apply the power of convolution to matrix data in particular, such as image data (Herrero Abellanas, [0002]).
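For orientation only, the XNOR-and-population-count dot product that the cited Nealis passages describe can be sketched in a few lines of Python (a minimal illustration under the usual encoding in which bit 1 stands for +1 and bit 0 for -1; this is not code from any cited reference):

```python
# Illustrative sketch: dot product of two binarized vectors packed into
# integers. XNOR marks positions where the bits agree; popcount tallies
# them; the dot product is (matches) - (mismatches).

def binary_dot(a: int, b: int, n: int) -> int:
    """Dot product of two n-bit binarized operands a and b."""
    mask = (1 << n) - 1
    xnor = ~(a ^ b) & mask           # 1 wherever the bits agree
    matches = bin(xnor).count("1")   # population count
    return 2 * matches - n           # matches - (n - matches)

# a = 0b1011 encodes (+1, -1, +1, +1); b = 0b1001 encodes (+1, -1, -1, +1).
# The vectors agree in 3 of 4 positions, so the dot product is 2*3 - 4 = 2.
print(binary_dot(0b1011, 0b1001, 4))  # → 2
```

A hardware channel performing this per lane, then accumulating the counts, is the structure the rejection maps onto Nealis' XNOR Unit, Population Count Unit, and Accumulator Register.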
Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal and Nealis as cited above, including the prior art element of convolution and registers storing input data, and the prior art element of Herrero Abellanas of convolution of rows of a matrix) according to known methods (Examiner submits that convolution on matrix data, and registers storing matrix data such as row data, for example, was known) to yield predictable results (the combination of Aggarwal and Nealis, wherein the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143. Examiner further submits that it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to store a row in a register, given that a register is higher speed relative to other forms of memory such as main memory.

Consider claim 2, the overall combination entails the hardware processing device of claim 1 (see above), wherein to cause the set of input data and the set of weight data to be provided to the binary convolution circuitry (Aggarwal, FIG. 6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0049], line 28, retrieve data associated with the identified source operands), the decoder is configured to cause register locations identified by the binary convolution instruction to be provided to the binary convolution circuitry, wherein the register locations comprise locations for an input data register of the set of input registers, a weight data register, and a destination register of the set of destination registers (Aggarwal, [0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory).
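The register-location mechanism discussed for claim 2 can be sketched abstractly as follows (field names, widths, and the register-file model are hypothetical illustrations, not the encoding of any cited reference or of the claims):

```python
# Illustrative sketch: an instruction whose fields identify an input
# register, a weight register, and a destination register, and a decoder
# step that resolves those indices against a register file before the
# XNOR + popcount + accumulate datapath runs.

from dataclasses import dataclass

@dataclass
class BinConvInsn:
    src_in: int   # index of the input-data register (hypothetical field)
    src_wt: int   # index of the weight-data register (hypothetical field)
    dst: int      # index of the destination register (hypothetical field)

def decode_and_execute(insn: BinConvInsn, regs: list, n: int = 8) -> None:
    a, w = regs[insn.src_in], regs[insn.src_wt]   # operand fetch by index
    mask = (1 << n) - 1
    pop = bin(~(a ^ w) & mask).count("1")         # XNOR, then popcount
    regs[insn.dst] += pop                         # accumulate into dest

regs = [0b10101010, 0b10101011, 0, 0]
decode_and_execute(BinConvInsn(src_in=0, src_wt=1, dst=2), regs)
print(regs[2])  # bits agree in 7 of 8 positions → 7
```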
Consider claim 4, the overall combination entails the hardware processing device of claim 1 (see above), wherein the first XNOR circuit is configured to calculate a bit-wise XNOR of a first portion of the set of input data and a first portion of the set of weight data, and wherein the second XNOR circuit is configured to calculate a bit-wise XNOR of a second portion of the set of input data and the first portion of the set of weight data, wherein the first counter circuit is configured to perform a counting operation on a result of the first XNOR circuit and the second counter circuit is configured to perform a counting operation on a result of the second XNOR circuit, and wherein the first accumulator circuit is configured to add an output of the first counter circuit to a first destination register of the set of destination registers and the second accumulator circuit is configured to add an output of the second counter circuit to a second destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814).

Consider claim 5, the overall combination entails the hardware processing device of claim 1 (see above), wherein the first XNOR circuit is configured to output a first result, and wherein the first counter circuit is configured to perform a counting operation on the first result and to output a second result, and wherein the first accumulator circuit is configured to add the second result to a first destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814).

Consider claim 6, the overall combination entails the hardware processing device of claim 5 (see above), wherein the second XNOR circuit is configured to output a third result, wherein the second counter circuit is configured to perform a counting operation on the third result and to output a fourth result, and wherein the second accumulator circuit is configured to add the fourth result to a second destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814).

Consider claim 8, the combination thus far entails the hardware processing device of claim 1 (see above), and Nealis further explicitly discloses a set of input data comprises sensor data associated with a machine learning model ([0135], lines 1-10, a machine learning algorithm is an algorithm that can learn based on a set of data. Embodiments of machine learning algorithms can be designed to model high-level abstractions within a data set. For example, image recognition algorithms can be used to determine which of several categories to which a given input belong; regression algorithms can output a numerical value given an input; and pattern recognition algorithms can be used to generate translated text or perform text to speech and/or speech recognition), a set of weight data comprises weight values of the machine learning model ([0207], line 3, weight), and output values produced by a layer of the machine learning model ([0136], lines 1-21, an exemplary type of machine learning algorithm is a neural network. There are many types of neural networks; a simple type of neural network is a feedforward network. A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer.
The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the aforementioned further teachings of Nealis with the previously explained combination of Aggarwal, Nealis, and Herrero Abellanas (which, as cited above, entails a set of input data, a set of weight data, and a set of destination registers), in order to, for example, implement machine learning, which is useful in fields such as image recognition, regression, and pattern recognition (Nealis, [0135], lines 1-10).

Consider claim 9, Aggarwal discloses a hardware apparatus comprising: a set of input registers ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory); a memory device configured to store program instructions, wherein the program instructions include a convolution instruction ([0045], lines 2-4, storage 202 that includes one or more packed data convolution with shift control and/or width control instructions 204); convolution circuitry configured to perform a convolution (FIG. 6, execution circuit 610) and including a first channel and a second channel both coupled to the set of input registers ([0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)), wherein the first channel includes a first accumulator circuit and wherein the second channel includes a second accumulator circuit ([0068], lines 2-4, adder/accumulator circuitry 618 are split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)); a set of destination registers coupled to the first accumulator circuit and the second accumulator circuit (FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); a decoder ([0045], lines 10-13, the decoder circuit 206 (e.g., decoder circuit) decodes the instruction into a decoded instruction (e.g., one or more micro-instructions or micro-operations). The decoded instruction is then sent for execution) coupled to the convolution circuitry (FIG. 2, execution circuit 212; FIG. 6, execution circuit 610); and instruction fetch circuitry coupled to the memory device and the decoder ([0045], lines 8-9, the instruction (e.g., macro-instruction) is fetched from storage 202 and sent to decoder circuit 206; [0205], lines 4-5, instruction fetch unit) and configured to fetch the convolution instruction ([0049], line 2, packed data convolution instruction) from the memory device ([0045], lines 2-4, storage 202 that includes one or more packed data convolution with shift control and/or width control instructions), wherein the convolution instruction specifies a set of input data, a set of weight data, and the set of destination registers ([0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); wherein the decoder is configured to cause the set of input data and the set of weight data to be provided to the convolution circuitry (FIG. 6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0049], line 28, retrieve data associated with the identified source operands); wherein the convolution circuitry is configured to: perform a convolution operation ([0049], lines 30-31, execute the decoded instruction according to the opcode; [0049], line 2, packed data convolution instruction), via the first channel and the second channel ([0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)); and wherein the set of input registers includes a first input register, coupled to the first channel, configured to store a first portion of the set of input data, further wherein the set of input registers includes a second input register, coupled to the second channel, configured to store a second portion of the set of input data ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide input data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel).
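The two-channel arrangement at issue here, in which adjacent portions of the input feed parallel XNOR/counter/accumulator paths against a shared weight row, can be illustrated as below (a sketch under stated assumptions; the width, data values, and function names are illustrative, not the claimed hardware or any reference's design):

```python
# Illustrative sketch: two parallel "channels" apply the same binarized
# weight row to two adjacent rows of a binarized input. Each channel
# does XNOR + popcount and accumulates into its own destination.

N = 8  # channel width in bits (illustrative)

def xnor_popcount(a, b, n=N):
    """Count the bit positions where a and b agree over n bits."""
    mask = (1 << n) - 1
    return bin(~(a ^ b) & mask).count("1")

def two_channel_step(row0, row1, weights, acc):
    """Channel 0 consumes row i; channel 1 consumes adjacent row i+1."""
    acc[0] += xnor_popcount(row0, weights)   # first counter -> first accumulator
    acc[1] += xnor_popcount(row1, weights)   # second counter -> second accumulator

acc = [0, 0]                                 # two destination registers
rows = [0b10110100, 0b10110110, 0b00001111]  # binarized input rows
w = 0b10110100                               # binarized filter row
for i in range(len(rows) - 1):
    two_channel_step(rows[i], rows[i + 1], w, acc)
print(acc)  # → [15, 9]
```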
However, Aggarwal does not disclose that the convolution is a binary convolution (and therefore does not disclose that the aforementioned convolution circuitry is “binary” convolution circuitry; the aforementioned convolution instruction is a “binary” convolution instruction; and the aforementioned convolution operation is a “binary” convolution operation). Consequently, Aggarwal also does not disclose the first channel includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and the aforementioned first accumulator circuit coupled to the first counter circuit, and wherein the second channel includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and the aforementioned second accumulator circuit coupled to the second counter circuit, the first input register coupled to the first XNOR circuit, the second input register coupled to the second XNOR circuit. Aggarwal also does not disclose the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row. On the other hand, Nealis discloses binary convolution ([0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation), and a first channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. 
In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and a first accumulator circuit coupled to the first counter circuit ([0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), and wherein a second channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and a second accumulator circuit coupled to the second counter circuit (Nealis, [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), a first input register coupled to the first XNOR circuit, a second input register coupled to the second XNOR circuit (FIG. 2D, register file 258 coupled to GPGPU Cores 262; [0074], lines 1-6, The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. 
The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234; [0075], lines 1-4, The GPGPU cores 262 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234; [0208], lines 16-21, in one embodiment the fused XNOR and population count unit 2104 is included within all processing elements with the GPGPU. In one embodiment, only a subset of the processing elements within the GPGPU include the fused XNOR and population count unit 2104; Examiner notes that each register of the register file that may provide input data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel). Nealis’ teaching enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity (Nealis, [0207], lines 6-9). Therefore, it would have been obvious, for one of ordinary skill in the art before the effective filing date of the claimed invention, with the Aggarwal and Nealis references in front of them, to combine the aforementioned references to result in the claimed subject matter (including binary convolution circuitry, a binary convolution instruction, and a binary convolution operation), as such a combination enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity. 
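The XNOR-and-population-count dot product cited from Nealis [0207] can be sketched as follows. This is a minimal illustration under the usual binary-network encoding assumption (bit 0 represents −1, bit 1 represents +1); the register width `N` and function names are illustrative, not drawn from the reference.

```python
# Minimal sketch of a binary dot product via XNOR + population count,
# assuming the {-1,+1} <-> {0,1} encoding common to binary neural nets.
# Names and the width N are illustrative assumptions.

N = 8  # register width in bits (assumed)

def binary_dot(features: int, weights: int, n: int = N) -> int:
    """Dot product of two {-1,+1} vectors packed as n-bit values."""
    xnor = ~(features ^ weights) & ((1 << n) - 1)  # bitwise XNOR, masked to n bits
    popcount = bin(xnor).count("1")                # count matching bit positions
    return 2 * popcount - n                        # matches minus mismatches

# identical vectors give +n; complementary vectors give -n
same = binary_dot(0b10110010, 0b10110010)
opposite = binary_dot(0b10110010, 0b01001101)
```

Because XNOR outputs 1 exactly where the two bits agree, the population count gives the number of +1 products, and `2*popcount - n` recovers the signed dot product without any multiplies.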
Alternatively, this modification merely entails combining prior art elements (for example, an instruction performing convolution as taught by Aggarwal, and convolution being binary convolution in particular as taught by Nealis) according to known methods (for example, Examiner submits that implementing processor operations via instructions specifying operands, an instruction decoder, and so forth, was well-known before the effective filing date) to yield predictable results (the claimed subject matter, which includes binary convolution circuitry, a binary convolution instruction, and a binary convolution operation). However, the combination thus far does not entail the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row. On the other hand, Herrero Abellanas discloses convolution, wherein a first portion of a set of input data is a first column or row, and a second portion of a set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row ([0043], lines 1-4, in the next iteration, the imaginary input window of convolution circuit 314 may shift to the next row of the input image, in order to apply the convolution filter to the pixels I₅, . . . , I₈). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Herrero Abellanas with the combination of Aggarwal and Nealis in order to apply the power of convolution to matrix data in particular, such as image data (Herrero Abellanas, [0002]). 
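The row-wise window behavior cited from Herrero Abellanas [0043] can be sketched briefly. This is an assumption-laden illustration of a filter applied to one row per iteration, with the window shifting to the adjacent row; the function name and data are hypothetical.

```python
# Hedged sketch of the cited sliding-window behavior: on each iteration the
# input window shifts to the next (adjacent) row of the input image and the
# same filter is applied. Names and values are illustrative assumptions.

def row_convolve(image: list[list[int]], kernel: list[int]) -> list[int]:
    """Apply a 1-D filter to each row of a matrix, one adjacent row per iteration."""
    out = []
    for row in image:  # the window shifts down one row per iteration
        out.append(sum(p * w for p, w in zip(row, kernel)))
    return out

image = [[1, 2, 3, 4],   # first row/portion of the input data
         [5, 6, 7, 8]]   # adjacent second row (cf. pixels I5 ... I8)
out = row_convolve(image, [1, 0, 0, 1])
```

The first and second rows here play the roles of the claimed "first portion" and adjacent "second portion" of the input data.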
Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal and Nealis as cited above, including the prior art element of convolution and registers storing input data, and the prior art element of Herrero Abellanas of convolution of rows of a matrix) according to known methods (Examiner submits that convolution on matrix data, and registers storing matrix data such as row data, for example, was known) to yield predictable results (the combination of Aggarwal and Nealis, wherein the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143. Examiner further submits that it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to store a row in a register, given that a register is higher speed relative to other forms of memory such as main memory. Consider claim 10, the overall combination entails the hardware apparatus of claim 9 (see above), wherein to cause the set of input data and the set of weight data to be provided to the binary convolution circuitry (Aggarwal, FIG. 
6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0049], line 28, retrieve data associated with the identified source operands), the decoder is configured to cause register locations identified by the binary convolution instruction to be provided to the binary convolution circuitry (Aggarwal, [0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory). Consider claim 12, the overall combination entails the hardware apparatus of claim 9 (see above), wherein the first XNOR circuit is configured to calculate a bit-wise XNOR of a first portion of the set of input data and a first portion of the set of weight data, and wherein the second XNOR circuit is configured to calculate a bit-wise XNOR of a second portion of the set of input data and the first portion of the set of weight data, wherein the first counter circuit is configured to perform a counting operation on a result of the first XNOR circuit and the second counter circuit is configured to perform a counting operation on a result of the second XNOR circuit, and wherein the first accumulator circuit is configured to add an output of the first counter circuit to a first destination register of the set of destination registers and the second accumulator circuit is configured to add an output of the second counter circuit to a second destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes 
therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814). Consider claim 13, the overall combination entails the hardware apparatus of claim 9 (see above), wherein the first XNOR circuit is configured to output a first result, and wherein the first counter circuit is configured to perform a counting operation on the first result and to output a second result, and wherein the first accumulator circuit is configured to add the second result to a first destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 
6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814). Consider claim 14, the overall combination entails the hardware apparatus of claim 13 (see above), wherein the second XNOR circuit is configured to output a third result, wherein the second counter circuit is configured to perform a counting operation on the third result and to output a fourth result, and wherein the second accumulator circuit is configured to add the fourth result to a second destination register of the set of destination registers (Aggarwal, [0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); FIG. 6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory); Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; Nealis, [0196], lines 11-17, for an N-bit register, N binary weights may be stored. 
In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814). Consider claim 16, the combination thus far entails the hardware apparatus of claim 9 (see above), and Nealis further explicitly discloses a set of input data comprises sensor data associated with a machine learning model ([0135], lines 1-10, a machine learning algorithm is an algorithm that can learn based on a set of data. Embodiments of machine learning algorithms can be designed to model high-level abstractions within a data set. For example, image recognition algorithms can be used to determine which of several categories to which a given input belong; regression algorithms can output a numerical value given an input; and pattern recognition algorithms can be used to generate translated text or perform text to speech and/or speech recognition), a set of weight data comprises weight values of the machine learning model ([0207], line 3, weight), and output values produced by a layer of the machine learning model ([0136], lines 1-21, an exemplary type of machine learning algorithm is a neural network. There are many types of neural networks; a simple type of neural network is a feedforward network. A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer that are separated by at least one hidden layer. 
The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the aforementioned further teachings of Nealis with the previously explained combination of Aggarwal, Nealis, and Herrero Abellanas (which, as cited above, entails a set of input data, a set of weight data, and a set of destination registers), in order to, for example, implement machine learning, which is useful in fields such as image recognition, regression, and pattern recognition (Nealis, [0135], lines 1-10). 
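The two-channel XNOR, counter, and accumulator pipeline mapped to claims 12-14 above can be sketched end to end. This is an illustrative model, not the implementation of any cited reference; the width, register values, and function names are assumptions.

```python
# Illustrative two-channel pipeline per the claim 12-14 mapping: each channel
# XNORs its input portion against the shared weight portion, counts ones in
# the result, and accumulates the count into its destination register.
# All names, widths, and values are assumptions for illustration.

N = 8
MASK = (1 << N) - 1

def channel(input_bits: int, weight_bits: int, dest: int) -> int:
    """One channel: XNOR circuit -> counter circuit -> accumulator circuit."""
    xnor = ~(input_bits ^ weight_bits) & MASK   # bit-wise XNOR of input and weights
    count = bin(xnor).count("1")                # counter: quantity of ones
    return dest + count                         # accumulator adds to destination

weights = 0b11001100                            # first portion of the weight data
dest1 = channel(0b11001100, weights, dest=0)    # first channel, first input portion
dest2 = channel(0b00110011, weights, dest=0)    # second channel, second input portion
```

Note that both channels consume the same weight portion while each receives its own input portion, matching the claim 12 language that both XNOR circuits operate against "the first portion of the set of weight data."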
Consider claim 17, Aggarwal discloses a hardware computing apparatus comprising: a set of input registers ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory); convolution circuitry including a first channel and a second channel both coupled to the set of input registers ([0049], line 2, packed data convolution instruction; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)), wherein the first channel includes a first accumulator circuit and wherein the second channel includes a second accumulator circuit ([0068], lines 2-4, adder/accumulator circuitry 618 are split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data)); a set of destination registers coupled to the first accumulator circuit and the second accumulator circuit (FIG. 
6, destination 606; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); one or more computer readable storage media; program instructions stored on the one or more computer readable storage media; wherein the program instructions include a convolution instruction ([0045], lines 2-4, storage 202 that includes one or more packed data convolution with shift control and/or width control instructions 204) that specifies a set of input data, a set of weight data, and the set of destination registers ([0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0047], lines 2-3, write back results of an instruction to a destination (e.g., write them to a register(s) and/or memory)); a decoder ([0045], lines 10-13, the decoder circuit 206 (e.g., decoder circuit) decodes the instruction into a decoded instruction (e.g., one or more micro-instructions or micro-operations). The decoded instruction is then sent for execution) coupled to the convolution circuitry (FIG. 2, execution circuit 212; FIG. 6, execution circuit 610); and instruction fetch circuitry coupled to the decoder ([0045], lines 8-9, the instruction (e.g., macro-instruction) is fetched from storage 202 and sent to decoder circuit 206; [0205], lines 4-5, instruction fetch unit) and configured to fetch the convolution instruction ([0049], line 2, packed data convolution instruction); wherein the decoder is configured to cause the set of input data and the set of weight data to be provided to the convolution circuitry (FIG. 
6, which shows the data in Source 1 602 and Source 2 604 being provided to execution circuit 610; [0049], line 28, retrieve data associated with the identified source operands); and wherein the set of input registers includes a first input register, coupled to the first channel, configured to store a first portion of the set of input data, further wherein the set of input registers includes a second input register, coupled to the second channel, configured to store a second portion of the set of input data ([0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide input data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel). However, Aggarwal does not disclose that the convolution is a binary convolution (and therefore does not disclose that the aforementioned convolution circuitry is “binary” convolution circuitry, and the aforementioned convolution instruction is a “binary” convolution instruction). 
Consequently, Aggarwal also does not disclose the first channel includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and the aforementioned first accumulator circuit coupled to the first counter circuit, and wherein the second channel includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and the aforementioned second accumulator circuit coupled to the second counter circuit, the first input register coupled to the first XNOR circuit, the second input register coupled to the second XNOR circuit. Aggarwal also does not disclose the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row. On the other hand, Nealis discloses binary convolution ([0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation), and a first channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. 
In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a first exclusive-nor (XNOR) circuit, a first counter circuit coupled to the first XNOR circuit, and a first accumulator circuit coupled to the first counter circuit ([0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), and wherein a second channel ([0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width) includes a second XNOR circuit, a second counter circuit coupled to the second XNOR circuit, and a second accumulator circuit coupled to the second counter circuit (Nealis, [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; FIG. 21, XNOR Unit 2103, Population Count Unit 2105, M-bit Adder 1812, Accumulator Register 1814), a first input register coupled to the first XNOR circuit, a second input register coupled to the second XNOR circuit (FIG. 2D, register file 258 coupled to GPGPU Cores 262; [0074], lines 1-6, The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. 
The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234; [0075], lines 1-4, The GPGPU cores 262 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234; [0208], lines 16-21, in one embodiment the fused XNOR and population count unit 2104 is included within all processing elements with the GPGPU. In one embodiment, only a subset of the processing elements within the GPGPU include the fused XNOR and population count unit 2104; Examiner notes that each register of the register file that may provide input data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel). Nealis’ teaching enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity (Nealis, [0207], lines 6-9). Therefore, it would have been obvious, for one of ordinary skill in the art before the effective filing date of the claimed invention, with the Aggarwal and Nealis references in front of them, to combine the aforementioned references to result in the claimed subject matter (including binary convolution circuitry and a binary convolution instruction), as such a combination enables achieving inference accuracy similar to higher precision networks with significant reduction in memory storage and bandwidth requirements and computational complexity. 
Alternatively, this modification merely entails combining prior art elements (for example, an instruction performing convolution as taught by Aggarwal, and convolution being binary convolution in particular as taught by Nealis) according to known methods (for example, Examiner submits that implementing processor operations via instructions specifying operands, an instruction decoder, and so forth, was well-known before the effective filing date) to yield predictable results (the claimed subject matter, which includes binary convolution circuitry and a binary convolution instruction). However, the combination thus far does not entail the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row. On the other hand, Herrero Abellanas discloses convolution, wherein a first portion of a set of input data is a first column or row, and a second portion of a set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row ([0043], lines 1-4, in the next iteration, the imaginary input window of convolution circuit 314 may shift to the next row of the input image, in order to apply the convolution filter to the pixels I₅, . . . , I₈). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Herrero Abellanas with the combination of Aggarwal and Nealis in order to apply the power of convolution to matrix data in particular, such as image data (Herrero Abellanas, [0002]). 
Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal and Nealis as cited above, including the prior art element of convolution and registers storing input data, and the prior art element of Herrero Abellanas of convolution of rows of a matrix) according to known methods (Examiner submits that convolution on matrix data, and registers storing matrix data such as row data, for example, was known) to yield predictable results (the combination of Aggarwal and Nealis, wherein the first portion of the set of input data is a first column or row, and the second portion of the set of input data is a second column or row, wherein the first column or row is adjacent to the second column or row), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143. Examiner further submits that it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to store a row in a register, given that a register is higher speed relative to other forms of memory such as main memory. 
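The register layout the combination arrives at (adjacent rows of a binary input matrix, each stored in its own input register) can be sketched as follows. The packing function and data here are hypothetical, introduced only to illustrate the row-per-register arrangement.

```python
# Hypothetical sketch of the combination's register layout: each adjacent row
# of a binary input matrix is packed into its own input register, so the first
# and second channels receive adjacent rows. Names and data are illustrative.

def pack_row(bits: list[int]) -> int:
    """Pack a row of binary values into a single register-sized integer."""
    value = 0
    for i, b in enumerate(bits):
        value |= (b & 1) << i   # bit i of the register holds element i of the row
    return value

matrix = [[1, 0, 1, 1],   # first row  -> first input register
          [0, 1, 1, 0]]   # adjacent second row -> second input register
reg1, reg2 = (pack_row(r) for r in matrix)
```

Keeping each row resident in a register rather than main memory reflects the speed rationale the Examiner advances above.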
Consider claim 23, the overall combination entails the hardware processing device of claim 1 (see above), further comprising a set of weight registers configured to store the set of weight data, wherein a first register of the set of weight registers is coupled to the first XNOR circuit and to the second XNOR circuit (Aggarwal, [0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide weight data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel; Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; Examiner notes that each register of the register file that may provide weight data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel).

Consider claim 25, the overall combination entails the hardware processing device of claim 1 (see above), wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit (Aggarwal, [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Nealis, [0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; in other words, registers each comprising a vector of binary values, wherein an XNOR is performed between a binary value of a first register and a corresponding binary value of a second register, entails N individual XNOR gates, wherein N is equal to a width of the registers).

Consider claim 28, the overall combination entails the hardware apparatus of claim 9 (see above), further comprising a set of weight registers configured to store the set of weight data, wherein a first register of the set of weight registers is coupled to the first XNOR circuit and to the second XNOR circuit (Aggarwal, [0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide weight data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel; Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; Examiner notes that each register of the register file that may provide weight data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel).

Consider claim 30, the overall combination entails the hardware apparatus of claim 9 (see above), wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit (Aggarwal, [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Nealis, [0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; in other words, registers each comprising a vector of binary values, wherein an XNOR is performed between a binary value of a first register and a corresponding binary value of a second register, entails N individual XNOR gates, wherein N is equal to a width of the registers).

Consider claim 33, the overall combination entails the hardware computing apparatus of claim 17 (see above), further comprising a set of weight registers configured to store the set of weight data, wherein a first register of the set of weight registers is coupled to the first XNOR circuit and to the second XNOR circuit (Aggarwal, [0049], lines 8-10, single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination; [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Examiner notes that each vector register of the register file that may provide weight data for the packed data convolution instruction is coupled to execution circuitry of the first channel and execution circuitry of the second channel, and further notes that each element-width register of the overall vector register is coupled to corresponding execution circuitry of a corresponding channel; Nealis, [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; Examiner notes that each register of the register file that may provide weight data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and further notes that each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel).

Consider claim 35, the overall combination entails the hardware computing apparatus of claim 17 (see above), wherein the first XNOR circuit includes N/2 individual XNOR gates, wherein N is an even integer equal to a width of the first input register, and wherein the first counter circuit is configured to count a quantity of ones or a quantity of zeros in an output of the first XNOR circuit (Aggarwal, [0029], lines 9-10, vector of values and a vector of filter weights; [0140], lines 1-4, Register index field 744—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory; [0050], lines 6-8, split into multiple lanes therein (e.g., to perform their operations in parallel on a subset of the data); Nealis, [0207], lines 2-3, fully binary neural networks can be implemented in which both weight and feature data is stored as binary values; [0207], lines 10-11, specifically, the dot product for binary neural nets can be performed via an XNOR and population count operation; [0196], lines 5-8, multiple parallel fused binary multiply-accumulate operations 1706A-1706N can be performed in parallel within the compute units of the general-purpose processor; [0196], lines 11-17, for an N-bit register, N binary weights may be stored. In one embodiment the compute units of the GPGPU provide support for a vector binary multiply-accumulate instruction in which the N binary bits within the N-bit register can be multiplied by N number of N-bit features 1704A-1704N, which can be stored in a vector register of NxN width; in other words, registers each comprising a vector of binary values, wherein an XNOR is performed between a binary value of a first register and a corresponding binary value of a second register, entails N individual XNOR gates, wherein N is equal to a width of the registers).

Claims 24, 29, and 34 are rejected under 35 U.S.C. 103 as being unpatentable over Aggarwal, Nealis, and Herrero Abellanas as applied to claims 1, 9, and 17 above, and further in view of Roy et al. (Roy) (US 20210150313 A1).

Consider claim 24, the combination thus far entails the hardware processing device of claim 1, but does not entail a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit. On the other hand, Roy discloses a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit ([0057], lines 1-8, FIG. 3B is a flow diagram illustrating a method for performing the MAC operation on the binary data belongs to the BNN_B data type. At step B301, the method includes performing bitwise XNOR operation between the W vector and the IFM vector. At step B302, the method includes detecting the popcount in response to performing bitwise XNOR operation to each pair of bits. At step B303, the method includes performing a left shift to the popcount; also note, for example, paragraph [0062], lines 25-27, left shift the output from the step 605, equivalent to multiply by 2 operation). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Roy with the combination of Aggarwal, Nealis, and Herrero Abellanas, in order to support performing a MAC operation on the binary data belonging to the BNN_B data type. Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal, Nealis, and Herrero Abellanas as cited above, including an output of a popcount, and the prior art element of Roy of performing a left shift) according to known methods (the teaching of Roy reflects that performing a left shift is known, as was performing a left shift if a multiplication by two is desired) to yield predictable results (the output of a popcount being left shifted, such that a multiplication by two is performed), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143.

Consider claim 29, the combination thus far entails the hardware apparatus of claim 9, but does not entail a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit. On the other hand, Roy discloses a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit ([0057], lines 1-8, FIG. 3B is a flow diagram illustrating a method for performing the MAC operation on the binary data belongs to the BNN_B data type. At step B301, the method includes performing bitwise XNOR operation between the W vector and the IFM vector. At step B302, the method includes detecting the popcount in response to performing bitwise XNOR operation to each pair of bits. At step B303, the method includes performing a left shift to the popcount; also note, for example, paragraph [0062], lines 25-27, left shift the output from the step 605, equivalent to multiply by 2 operation). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Roy with the combination of Aggarwal, Nealis, and Herrero Abellanas, in order to support performing a MAC operation on the binary data belonging to the BNN_B data type. Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal, Nealis, and Herrero Abellanas as cited above, including an output of a popcount, and the prior art element of Roy of performing a left shift) according to known methods (the teaching of Roy reflects that performing a left shift is known, as was performing a left shift if a multiplication by two is desired) to yield predictable results (the output of a popcount being left shifted, such that a multiplication by two is performed), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143.

Consider claim 34, the combination thus far entails the hardware computing apparatus of claim 17, but does not entail a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit. On the other hand, Roy discloses a first shift circuit, coupled between the first counter circuit and the first accumulator circuit, and a second shift circuit, coupled between the second counter circuit and the second accumulator circuit ([0057], lines 1-8, FIG. 3B is a flow diagram illustrating a method for performing the MAC operation on the binary data belongs to the BNN_B data type. At step B301, the method includes performing bitwise XNOR operation between the W vector and the IFM vector. At step B302, the method includes detecting the popcount in response to performing bitwise XNOR operation to each pair of bits. At step B303, the method includes performing a left shift to the popcount; also note, for example, paragraph [0062], lines 25-27, left shift the output from the step 605, equivalent to multiply by 2 operation). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Roy with the combination of Aggarwal, Nealis, and Herrero Abellanas, in order to support performing a MAC operation on the binary data belonging to the BNN_B data type. Alternatively, this modification merely entails combining prior art elements (the prior art elements of Aggarwal, Nealis, and Herrero Abellanas as cited above, including an output of a popcount, and the prior art element of Roy of performing a left shift) according to known methods (the teaching of Roy reflects that performing a left shift is known, as was performing a left shift if a multiplication by two is desired) to yield predictable results (the output of a popcount being left shifted, such that a multiplication by two is performed), which is an example of a rationale that may support a conclusion of obviousness as per MPEP 2143.

Response to Arguments

Applicant on page 10 argues: "Claims 9-10, 12-16, and 28-30 are objected to because of informalities. Applicant has amended claim 9, as suggested by the Office Action. Accordingly, Applicant respectfully requests that the objection be withdrawn." In view of the aforementioned amendment, the previously presented objections to the claims are withdrawn.

Applicant on page 10 argues: "Claims 25, 30, and 35 are rejected under 35 U.S.C. § 112(a) as failing to comply with the written description requirement. Applicant does not concede that claims 25, 30, and 35 failed to comply with the written description requirement. However, Applicant has amended those claims to change 'comprises' to 'includes'. It is believed that this amendment renders the rejection moot. Accordingly, Applicant respectfully requests withdrawal of the 35 U.S.C. § 112(a) rejection of claims 25, 30, and 35." However, Examiner submits that "includes" is likewise open-ended language; see MPEP 2111.03.

Applicant across pages 10-11 argues: "Claims 5-7, 13-15, 17, 23-25, 28-30, and 33-35 are rejected under 35 U.S.C. § 112(b) as being indefinite. Applicant has made the following amendments: Claims 5, 6, 13, 14, 17 have been amended to fix antecedent basis issues. Applicant has canceled claims 7, 15 without prejudice. Applicant has amended claim 17 to omit the term, 'from memory.' Claims 23-25, 28-30, and 33-35 have been amended to depend from claims that are not canceled. Claims 25, 30, 35 have been amended to recite that N 'is an even integer.' Support may be found at least at ¶ 0064 of the Specification. Accordingly, Applicant respectfully requests withdrawal of the 35 U.S.C. § 112(b) rejection of claims 5-7, 13-15, 17, 23-25, 28-30, and 33-35." In view of the aforementioned amendments, the previously presented indefiniteness rejections are withdrawn.

Applicant on page 11 argues: "Claims 17 and 33-35 are rejected under 35 U.S.C. § 101 as being directed to non-statutory subject matter. Specifically, the Office Action points to language in ¶ 0093 of the Specification, alleging that the subject matter of claims 17 and 33-35 may be interpreted as software per se. Applicant has amended claims 17 and 33-35 to recite a 'hardware computing apparatus,' thereby excluding software per se. Accordingly, Applicant respectfully requests withdrawal of the Section 101 rejection." In view of the aforementioned amendments, the previously presented rejections under 35 U.S.C. § 101 are withdrawn.

Applicant across pages 11-12 argues: "The Office Action cites Nealis at Figure 21 (XNOR unit 2103) to teach an XNOR circuit. Office Action, 14. However, Nealis does not disclose how multiple XNOR units 2103 would have been coupled to multiple input registers, much less that 'the set of input registers includes a first input register, coupled to the first XNOR circuit, configured to store a first column or row of the set of input data, further wherein the set of input registers includes a second input register, coupled to the second XNOR circuit, configured to store a second column or row of the set of input data, and wherein the first column or row is adjacent to the second column or row.'" Examiner submits that, in accordance with the standard behavior of a vector register file comprising vector registers providing operands to vector execution units, each vector register of the vector register file of Nealis that may provide input data for the vector binary multiply-accumulate instruction is coupled to the first XNOR circuit and the second XNOR circuit, and each element-width register of the overall vector register is coupled to a corresponding XNOR circuit of a corresponding channel. While Applicant's particular implementation across FIGS. 5-7 may differ from the aforementioned standard behavior of Nealis (and Aggarwal, which likewise entails the standard behavior of a vector register file comprising vector registers providing operands to vector execution units), Examiner submits that the aforementioned standard behavior can nevertheless meet the relevant limitations of the claims under the broadest reasonable interpretation. Examiner further notes that, while Herrero Abellanas has been cited to explicitly teach column and row data for the purposes of compact prosecution, data (or a subset thereof) in a vector register (e.g., a vector register of Nealis or Aggarwal) can be reasonably considered a row of data, as well as multiple columns of data.

Applicant on page 12 argues: "The Office Action further cites to Herrero Abellanas at Figures 6A-6D. Office Action, 15-16. Specifically, the Office Action cites the portion of Herrero Abellanas to show relationships between columns or rows of sets of input data. However, Herrero Abellanas does not render obvious the amended feature of claim 1 at least because Herrero Abellanas does not address XNOR circuits nor how XNOR circuits may be coupled to input registers. For at least this reason, no combination of Nealis and Herrero Abellanas renders amended claim 1 obvious." However, Examiner is relying upon Nealis to disclose XNOR circuits. In addition, as noted above, Examiner submits that the standard behavior of a vector register file comprising vector registers providing operands to vector execution units (as embodied by Nealis or Aggarwal), in the context of the overall prior art combination (which entails Nealis to teach XNOR circuits in a vector execution unit to perform a vector binary multiply-accumulate instruction), renders obvious the amended feature of claim 1.

Applicant on page 13 argues: "The other reference Aggarwal does not address XNOR circuits at all. Therefore, the combination of Aggarwal, Nealis, and Herrero Abellanas does not render amended claim 1 obvious." While Aggarwal does not address XNOR circuits, Examiner generally notes, as mentioned above, that Aggarwal embodies the standard behavior of a vector register file comprising vector registers providing operands to vector execution units, which remains relevant even after the packed data convolution of Aggarwal is modified, in view of Nealis, to be binary convolution.

Applicant on page 13 argues: "The other independent claims 9 and 17 are amended similarly and are patentable over the combination of references for the same or similar reasons as those given above with respect to claim 1. The various dependent claims are patentable at least due to their dependence on respective base claims 1, 9, and 17. Accordingly, Applicant respectfully requests withdrawal of the obviousness rejections of all pending claims." Examiner's responses to arguments above with respect to claim 1 are likewise applicable to the arguments directed to independent claims 9 and 17, as well as the dependent claims.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEITH E VICARY, whose telephone number is (571) 270-1314. The examiner can normally be reached Monday to Friday, 9:00 AM to 5:00 PM.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jyoti Mehta, can be reached at (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEITH E VICARY/
Primary Examiner, Art Unit 2183
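For context on the technique at issue in these rejections: a binarized dot product replaces multiply-accumulate with XNOR and popcount. Encoding ±1 values as single bits (bit 1 for +1, bit 0 for −1), the dot product of two N-bit vectors equals 2·popcount(XNOR(a, b)) − N, which is why Roy's cited left shift (a multiply-by-2) appears between the popcount and the accumulate. The sketch below is illustrative only; the function names and the bit encoding are assumptions for demonstration, not taken from the cited references.

```python
def binary_dot(a: int, b: int, n: int) -> int:
    """Dot product of two n-bit {-1, +1} vectors packed into integers.

    Bit encoding assumed here: bit value 1 -> +1, bit value 0 -> -1.
    """
    mask = (1 << n) - 1
    xnor = ~(a ^ b) & mask            # per-bit XNOR: 1 where the bits match
    pop = bin(xnor).count("1")        # popcount of the XNOR result
    return (pop << 1) - n             # left shift = multiply by 2, then subtract n


def reference_dot(a: int, b: int, n: int) -> int:
    """Naive check: expand each bit to +/-1 and multiply-accumulate."""
    def pm1(x: int, i: int) -> int:
        return 1 if (x >> i) & 1 else -1
    return sum(pm1(a, i) * pm1(b, i) for i in range(n))


# The XNOR/popcount/shift path matches the explicit +/-1 arithmetic.
for a, b in [(0b1011, 0b0011), (0b1111, 0b0000), (0b1010, 0b1010)]:
    assert binary_dot(a, b, 4) == reference_dot(a, b, 4)
```

The shift-by-one step stands in for the "left shift to the popcount" of Roy's step B303; the subtraction of n is the bias correction that recovers a signed dot product from an unsigned match count.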

Prosecution Timeline

Jun 29, 2023: Application Filed
Sep 28, 2024: Non-Final Rejection (§101, §103, §112)
Jan 16, 2025: Response Filed
Jun 14, 2025: Final Rejection (§101, §103, §112)
Sep 05, 2025: Request for Continued Examination
Sep 19, 2025: Response after Non-Final Action
Nov 03, 2025: Non-Final Rejection (§101, §103, §112)
Jan 28, 2026: Response Filed
Feb 23, 2026: Final Rejection (§101, §103, §112) (current)

Precedent Cases

Applications granted by the same examiner in similar technology

Patent 12602349: HANDLING DYNAMIC TENSOR LENGTHS IN A RECONFIGURABLE PROCESSOR THAT INCLUDES MULTIPLE MEMORY UNITS (granted Apr 14, 2026; 2y 5m to grant)
Patent 12572360: Cache Preload Operations Using Streaming Engine (granted Mar 10, 2026; 2y 5m to grant)
Patent 12554507: SYSTEMS AND METHODS FOR PROCESSING FORMATTED DATA IN COMPUTATIONAL STORAGE (granted Feb 17, 2026; 2y 5m to grant)
Patent 12554494: APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS TO REQUEST A HISTORY RESET OF A PROCESSOR CORE (granted Feb 17, 2026; 2y 5m to grant)
Patent 12547401: Load Instruction Fusion (granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 58%
With Interview: 99% (+41.2%)
Median Time to Grant: 3y 8m
PTA Risk: High
Based on 683 resolved cases by this examiner. Grant probability derived from career allow rate.
