Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending. This Office Action is responsive to the amendment filed on 09/19/2025, which has been entered into the above application.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claims 1 and 11 recite the limitation “perform one or more model operations using the received portions before all portions of the respective block have been received, the operations requiring the entire block of data to complete”. Paragraph [0055] of the specification states that “the matrix operations can be performed using intermediate data without the reliance on other portions of the respective blocks of lookup data”. This is consistent with the first half of the limitation, which recites “perform one or more model operations using the received portions before all portions of the respective block have been received”, yet appears inconsistent with the second half, which recites “the operations requiring the entire block of data to complete”. As such, it is unclear whether “the operations” refers to all of the “one or more model operations”, which collectively require the entire block of data to complete, or to “model operations” in general, each of which cannot be completed without the entire block of data. For the purposes of examination, this limitation will be interpreted as “perform one or more model operations using the received portions before all portions of the respective block have been received, the model operations as a whole requiring the entire block of data to complete”.
Claims 1 and 11 recite the limitation “continue to generate and communicate additional portions”. It is unclear if this continuation is in relation to the “one or more portions of a respective block of data” that is generated by the “one or more data producing nodes”, or if it is referring to completely unrelated portions of data unassociated with the respective block of data. For the purposes of examination, this limitation will be interpreted as “continue to generate and communicate additional portions of the respective block of data”.
Claims 2-10 and 12-20 are similarly rejected for their dependency on an indefinite independent claim.
Claims 2, 6, 10, 12, 16 and 20 recite the limitation “the model”. There is insufficient antecedent basis for this limitation. At best, the independent claims recite “performing model operations”, but never define a model.
Claims 5 and 15 recite the limitation “the respective blocks of data”. There is insufficient antecedent basis for this limitation. At best, the independent claims recite “a respective block of data”, but never suggest multiple blocks of data.
Claims 7 and 17 recite the limitation “combine lookup data received in the portions”. It is unclear if “the portions” is referring to the “one or more portions of a respective block of data” or the “additional portions” in the independent claims. For the purposes of examination, this limitation will be interpreted as “combine lookup data received in the one or more portions”.
Claims 8 and 18 recite the limitation “each portion of the block”. It is unclear if “the block” is referring to the “respective block of data” in the independent claims. For the purposes of examination, this limitation will be interpreted as “each portion of the respective block of data”.
Claims 9 and 19 recite the limitation “the blocks of data”. There is insufficient antecedent basis for this limitation. At best, the independent claims recite “a respective block of data”, but never suggest multiple blocks of data.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (an abstract idea) without significantly more.
Independent Claims
Step 1 – Claim 1 is drawn to a device, and claim 11 is drawn to a method. Therefore, each of these claims falls under one of the four categories of statutory subject matter (process/method, machine/product/apparatus, manufacture, or composition of matter).
Step 2A Prong 1 – Claims 1 and 11 are directed to a judicially recognized exception of an abstract idea without significantly more. Claims 1 and 11 recite:
perform one or more model operations using the received portions – This limitation recites merely applying (see MPEP § 2106.05(f)) the abstract idea of a mathematical calculation (see MPEP § 2106.04(a)(2), section I, C). In Paragraph [0044] of the specification, it states “operations for the top multilayer perceptron in data consuming devices can be performed using portions of the lookup data independently of other portions of the lookup data. A significant part of the operations for the top multilayer perceptron are the numerous matrix multiplication operations (or fused multiply adds, etc.) that are computed to generate inputs to activation functions in a deep neural network (DNN) of the top multilayer perceptron (e.g., rectified linear units, etc.). That is, matrix multiplication operations that are performed for multiplying weight values by inputs to activation functions for intermediate nodes in the DNN for the top multilayer perceptron.” The broadest reasonable interpretation (BRI) in light of the specification would support that “performing one or more model operations” would encompass matrix multiplication and fall under the mathematical concepts grouping.
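For illustration only, the kind of computation described in Paragraph [0044] — a matrix multiplication whose product feeds a rectified linear unit — can be sketched in a few lines of NumPy. All names and shapes here are hypothetical and are not a characterization of Applicant's actual implementation:

```python
import numpy as np

# Minimal sketch of one layer of a top multilayer perceptron: multiply a
# weight matrix by an input vector, then apply a ReLU activation function.
# Shapes and values are hypothetical, for illustration only.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 8))   # weight matrix for one layer
inputs = rng.standard_normal(8)         # inputs to the activation functions

pre_activation = weights @ inputs            # the matrix multiplication at issue
outputs = np.maximum(pre_activation, 0.0)    # rectified linear unit (ReLU)
print(outputs)
```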
before all portions of the respective block have been received, while the data producing node continue to generate and communicate additional portions – This limitation recites the abstract idea of a mental process, or a concept that can be performed in the human mind, including observation, evaluation, judgment, or opinion (see MPEP § 2106.04(a)(2), subsection III). In Paragraph [0030] of the specification, it states “The data producing nodes therefore perform the remote data communication of the given portion of the lookup data and the generation of a next portion of the lookup data at least partially in parallel (i.e., at substantially the same time).” BRI in light of the specification would support that “performing operations before all portions of data have been received, while the producing node continues to generate and communicate additional portions” would encompass a mental process of collaborative multi-tasking, performed with or without the assistance of pen and paper. It can be likened to the non-technical human activity of collaborative multi-tasking where, for example, one person is preparing (producing) and subsequently handing (communicating) ingredients to another person who is cooking (performing operations) a dish, such that the final product is not finished until all ingredients (data) are received and processed, but both people are simultaneously performing their individual tasks.
Step 2A Prong 2 – The following additional limitations recited do not integrate the abstract idea into a practical application:
one or more data producing nodes, each comprising circuitry; a data consuming node comprising circuitry – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It recites a generic computer or generic computer components that merely act as a tool on which the method operates.
generate one or more portions of a respective block of data – This limitation recites the insignificant extra-solution activity of mere data gathering (see MPEP § 2106.05(g)) and thus, fails to integrate the exception into a practical application.
communicate the one or more portions via a communication interface – This limitation recites the insignificant extra-solution activity of mere data output (see MPEP § 2106.05(g)) and thus, fails to integrate the exception into a practical application.
receive at least some of the one or more portions – This limitation recites the insignificant extra-solution activity of mere data gathering (see MPEP § 2106.05(g)) and thus, fails to integrate the exception into a practical application.
the operations requiring the entire block of data to complete – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It represents a mere acknowledgment that multi-step operations or strings of operations require all inputs before completing and thus, fails to integrate the exception into a practical application.
Step 2B – The additional elements identified in Step 2A Prong 2, viewed individually or as an ordered combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself.
one or more data producing nodes, each comprising circuitry; a data consuming node comprising circuitry – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
generate one or more portions of a respective block of data – This limitation recites the well-understood, routine, conventional activity of storing and retrieving information in memory (see MPEP § 2106.05(d)) and thus, fails to provide significantly more to the judicial exception.
communicate the one or more portions via a communication interface – This limitation recites the well-understood, routine, conventional activity of receiving or transmitting data over a network (see MPEP § 2106.05(d)) and thus, fails to provide significantly more to the judicial exception.
receive at least some of the one or more portions – This limitation recites the well-understood, routine, conventional activity of receiving or transmitting data over a network (see MPEP § 2106.05(d)) and thus, fails to provide significantly more to the judicial exception.
before all portions of the respective block have been received, while the data producing node continue to generate and communicate additional portions – This limitation merely recites the idea of concurrently performing communication and computation but fails to recite details of how the concurrency is accomplished. Reciting the idea of a solution or outcome without detailing how the result is accomplished is equivalent to saying "apply it" to the judicial exception (see MPEP § 2106.05(f)) and thus, fails to provide significantly more to the judicial exception.
the operations requiring the entire block of data to complete – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It represents a mere acknowledgment that multi-step operations or strings of operations require all inputs before completing and thus, fails to provide significantly more to the judicial exception.
As such, claims 1 and 11 are not patent eligible.
Dependent Claims
Claims 2-10 and 12-20 merely narrow the previously cited abstract idea limitations. For the reasons described above with respect to independent claims 1 and 11, these judicial exceptions are not meaningfully integrated into a practical application and do not amount to significantly more than the abstract idea itself. The claims recite limitations similar to those described for the independent claims above and do not provide anything more than abstract ideas achievable in the human mind or through mathematical computation. Therefore, claims 2-10 and 12-20 also recite abstract ideas that are not integrated into a practical application and do not amount to significantly more than the judicial exception, and they are rejected under 35 U.S.C. § 101.
Step 1 – Claims 2-10 are drawn to a device, and claims 12-20 are drawn to a method. Therefore, each of these claims falls under one of the four categories of statutory subject matter (process/method, machine/product/apparatus, manufacture, or composition of matter).
Step 2A Prong 1 – These claims are directed to a judicially recognized exception of an abstract idea without significantly more.
Claims 2 and 12:
wherein the data consuming node is configured to perform the operations for the model using the received portions concurrently with the one or more producing nodes generating and/or communicating additional portions of the respective block of data – This limitation recites the abstract idea of a mathematical calculation (see MPEP § 2106.04(a)(2), section I, C). In Paragraph [0044] of the specification, it states “operations for the top multilayer perceptron in data consuming devices can be performed using portions of the lookup data independently of other portions of the lookup data. A significant part of the operations for the top multilayer perceptron are the numerous matrix multiplication operations (or fused multiply adds, etc.) that are computed to generate inputs to activation functions in a deep neural network (DNN) of the top multilayer perceptron (e.g., rectified linear units, etc.). That is, matrix multiplication operations that are performed for multiplying weight values by inputs to activation functions for intermediate nodes in the DNN for the top multilayer perceptron.” BRI in light of the specification would support that “performing operations” would encompass matrix multiplication and fall under the mathematical concepts grouping. Additionally, this limitation recites the abstract idea of a mental process, or a concept that can be performed in the human mind, including observation, evaluation, judgment, or opinion (see MPEP § 2106.04(a)(2), subsection III). In Paragraph [0030] of the specification, it states “The data producing nodes therefore perform the remote data communication of the given portion of the lookup data and the generation of a next portion of the lookup data at least partially in parallel (i.e., at substantially the same time).” BRI in light of the specification would support that “performing the operations for the model using the received portions concurrently with the one or more producing nodes generating and/or communicating additional portions” would encompass a mental process of independently completing tasks, performed with or without the assistance of pen and paper. It can be likened to the non-technical human activity of collaborative multi-tasking where, for example, one person is cutting (producing) and subsequently handing (communicating) flowers to another person who is arranging (performing operations) a bouquet, such that the final product is not finished until all flowers (data) are received and arranged, but both people are simultaneously performing their individual tasks.
Claims 3 and 13:
each data producing node is configured to dynamically allocate computational resources for generating one or more portions of its respective block of data – This limitation recites the abstract idea of a mental process, or a concept that can be performed in the human mind, including observation, evaluation, judgment, or opinion (see MPEP § 2106.04(a)(2), subsection III). In Paragraph [0034] of the specification, it states “the above described allocation of the computational resources includes allocating computational resources from among a pool of available computational resources both for acquiring and communicating portions of lookup data and for performing the operations of the model. This may include respective allocated computational resources acquiring and communicating portions of lookup data and performing the operations of the model substantially in parallel (i.e., partially or wholly at the same time).” BRI in light of the specification would support that “dynamically allocating computational resources” would encompass a mental process of assigning resources based on need, performed with or without the assistance of pen and paper.
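For illustration only, the kind of need-based assignment described in Paragraph [0034] can be likened to the following minimal Python sketch. The pool contents and load figures are hypothetical and do not characterize Applicant's implementation:

```python
# Hypothetical sketch: "dynamically allocating computational resources" as
# selecting the least-loaded resource from a pool of available resources.
pool = {"resource-A": 0.7, "resource-B": 0.2, "resource-C": 0.5}  # current load
chosen = min(pool, key=pool.get)  # allocate the resource with the most capacity
print(chosen)  # resource-B
```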
Claims 5 and 15:
wherein a number of portions in each block of data is set based on one or more properties of the respective blocks of data, the data consuming node, or the data producing nodes – This limitation recites the abstract idea of a mathematical relationship (see MPEP § 2106.04(a)(2), subsection I, A). In Paragraph [0033] of the specification, it states “given an overall block of lookup data that is to be communicated to other nodes, the block can be divided into a specified number of portions R (where R = 12, 20, or another number). In some of these embodiments, the specified number of portions is set based on a consideration of: (1) the balance between communicating smaller portions of the lookup data to enable relatively high levels of resource utilization for both embedding table lookups and model operations and (2) an amount of communication overhead associated with communicating the portions of the lookup data.” BRI in light of the specification would support that “set[ting] a number of portions based on one or more properties” would encompass a mathematical relationship between the number of portions, the levels of resource utilization, and the amount of communication overhead, thereby falling under the mathematical concepts grouping.
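For illustration only, the division of a block into a specified number of portions R, as quoted above from Paragraph [0033], might be sketched as follows. Only the figure R = 12 comes from the specification; the slicing scheme itself is hypothetical:

```python
# Illustrative sketch: divide a block of lookup data into R roughly equal
# portions. The choice R = 12 mirrors the example in Paragraph [0033].
def split_block(block, num_portions):
    """Split a list of lookup rows into num_portions roughly equal portions."""
    size, rem = divmod(len(block), num_portions)
    portions, start = [], 0
    for i in range(num_portions):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        portions.append(block[start:end])
        start = end
    return portions

block = list(range(100))           # stand-in for a block of lookup data
portions = split_block(block, 12)  # R = 12 per the specification's example
print([len(p) for p in portions])  # portion sizes sum to the block size
```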
Claims 7 and 17:
wherein the data consuming node is configured to combine lookup data received in the portions with results from a bottom multilayer perceptron (MLP) to generate inputs for a top multilayer perceptron for a deep learning recommendation model (DLRM) – This limitation recites the abstract idea of a mathematical calculation (see MPEP § 2106.04(a)(2), section I, C). In Paragraph [0001] of the specification, it states “The outputs of bottom multilayer perceptron 102 and embedding table lookups 106 are combined in interaction 110 to form intermediate values (e.g., by concatenating outputs from each of bottom multilayer perceptron 102 and embedding table lookups 106).” It similarly states in Paragraph [0027], “Each node next combines the outputs from bottom multilayer perceptron 102 and that node's lookup data from embedding table lookups 106 in interaction 110 to generate corresponding intermediate values (e.g., combined vectors or other values).” BRI in light of the specification would support that “combin[ing] lookup data… with results from a bottom multilayer perceptron” would encompass concatenating outputs into combined vectors and fall under the mathematical concepts grouping.
Claims 8 and 18:
the operations include matrix multiplication performed independently on data contained in each portion of the block – This limitation recites the abstract idea of a mathematical calculation (see MPEP § 2106.04(a)(2), section I, C). In Paragraph [0044] of the specification, it states “operations for the top multilayer perceptron in data consuming devices can be performed using portions of the lookup data independently of other portions of the lookup data. A significant part of the operations for the top multilayer perceptron are the numerous matrix multiplication operations (or fused multiply adds, etc.) that are computed to generate inputs to activation functions in a deep neural network (DNN) of the top multilayer perceptron (e.g., rectified linear units, etc.). That is, matrix multiplication operations that are performed for multiplying weight values by inputs to activation functions for intermediate nodes in the DNN for the top multilayer perceptron.” BRI in light of the specification would support that “matrix multiplication” would encompass a mathematical computation and fall under the mathematical concepts grouping.
Step 2A Prong 2 – These limitations do not recite any additional elements which integrate the abstract idea into a practical application.
Claims 4 and 14:
wherein the data producing node and/or the data consuming node are configured to allocate computational resources including workgroups in a graphics processing unit (GPU) for performing respective operations – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It recites a generic computer or generic computer components that merely act as a tool on which the method operates.
Claims 6 and 16:
the model is a deep learning recommendation model (DLRM), and each data producing node stores a subset of embedding tables for the model in a local memory of that node – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to a DLRM and embedding tables in local memory and thus, fails to integrate the exception into a practical application.
Claims 8 and 18:
each portion being usable for the matrix multiplication without requiring other portions of the respective block of data – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It represents a mere acknowledgment that matrix multiplication can be completed piecemeal and thus, fails to integrate the exception into a practical application.
Claims 9 and 19:
the blocks of data include model data communicated in an all-to-all exchange from the producing nodes to the consuming node – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to all-to-all communication and thus, fails to integrate the exception into a practical application.
Claims 10 and 20:
the operations include training the model using training data communicated from the one or more data producing nodes to the data consuming node in an all-reduce communication – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to a model trained with data received through all-reduce communication and thus, fails to integrate the exception into a practical application.
Step 2B – These limitations, as a whole, do not amount to significantly more than the judicial exception.
Claims 4 and 14:
wherein the data producing node and/or the data consuming node are configured to allocate computational resources including workgroups in a graphics processing unit (GPU) for performing respective operations – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Claims 6 and 16:
the model is a deep learning recommendation model (DLRM), and each data producing node stores a subset of embedding tables for the model in a local memory of that node – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to a DLRM and embedding tables in local memory and thus, fails to provide significantly more to the judicial exception.
Claims 8 and 18:
each portion being usable for the matrix multiplication without requiring other portions of the respective block of data – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It represents a mere acknowledgment that matrix multiplication can be completed piecemeal and thus, fails to provide significantly more to the judicial exception.
Claims 9 and 19:
the blocks of data include model data communicated in an all-to-all exchange from the producing nodes to the consuming node – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to all-to-all communication and thus, fails to provide significantly more to the judicial exception.
Claims 10 and 20:
the operations include training the model using training data communicated from the one or more data producing nodes to the data consuming node in an all-reduce communication – This limitation amounts to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). It merely limits the use of the abstract idea to a model trained with data received through all-reduce communication and thus, fails to provide significantly more to the judicial exception.
As such, claims 2-10 and 12-20 are not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 5-12, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Naumov et al. (“Deep Learning Recommendation Model for Personalization and Recommendation Systems”, published 05/31/2019), hereinafter Naumov, in view of Georganas et al. (“Communication Avoiding and Overlapping for Numerical Linear Algebra”, published 02/25/2013), hereinafter Georganas. Naumov was cited in the previous Office Action.
Regarding Claim 1, Naumov teaches An electronic device, comprising:
one or more data producing nodes, each comprising circuitry (Naumov: “The size of the embeddings makes it prohibitive to use data parallelism since it requires replicating large embeddings on every device. In many cases, this memory constraint necessitates the distribution of the model across multiple devices to be able satisfy memory capacity requirements.” [Section 3. Parallelism]) configured to:
generate one or more portions of a respective block of data (Naumov: “Recall that DLRM accepts continuous and categorical features as inputs. The former can be modeled by generating a vector of random numbers using either a uniform or normal (Gaussian) distributions with the numpy.random package rand or randn calls with default parameters. Then a mini-batch of inputs can be obtained by generating a matrix where each row corresponds to an element in the mini-batch.” [Section 4.1 Random]); and
communicate the one or more portions via a communication interface (Naumov: “Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices.” [Section 3. Parallelism]); and
a data consuming node comprising circuitry (Naumov: “Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication [12]. At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2.” [Section 3. Parallelism]) configured to:
receive at least some of the one or more portions (Naumov: “we have implemented [model parallelism] by explicitly mapping the embedding operators (nn.EmbeddingBag for PyTorch, SparseLengthSum for Caffe2) to different devices. Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices.” [Section 3. Parallelism]); and
perform one or more model operations using the received portions (Naumov: “The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in [24].” [Section 1. Introduction]; “We will compute second-order interaction of different features explicitly, following the intuition for handling sparse data provided in FMs (4), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.” [Section 2.2 DLRM Architecture]).
However, Naumov fails to expressly disclose performing operations before all portions of the respective block have been received, the operations requiring the entire block of data to complete, while the data producing nodes continue to generate and communicate additional portions.
In the same field of endeavor, Georganas teaches perform operations before all portions of the respective block have been received, the operations requiring the entire block of data to complete, while the data producing nodes continue to generate and communicate additional portions (Georganas: “The main part of the 2D and 2.5D algorithm consists of a loop where each thread takes part in two broadcasts (related to the grid row and column, respectively) and performs one BLAS matrix product (dgemm) per iteration. We develop new versions of both the 2D and 2.5D algorithms that overlap communication and computation by starting the broadcast of the next iteration while the dgemm of the current iteration is being performed.” [Section IV. SUMMA]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated performing operations before all portions of the respective block have been received, the operations requiring the entire block of data to complete, while the data producing nodes continue to generate and communicate additional portions, as taught by Georganas, into the system of Naumov, because both of these systems are directed towards parallelizing data communication to prevent memory bottlenecks. Making this combination and overlapping model operations with data communication such that computation begins before all the data is received, as taught by Georganas, would allow the system of Naumov to “optim[ize] communication volume” and “substantially reduce communication and thus improve performance”, as “communication overlap changes neither volume nor the number of messages, but lowers the per-message cost” (Georganas: [Section 1. Introduction]).
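For illustration only, the communication/computation overlap that Georganas is cited for can be sketched as a bounded producer/consumer pipeline in Python. All names are hypothetical, and this is a sketch of the general technique, not the implementation of either reference:

```python
import queue
import threading

# Minimal sketch, under assumed names, of the overlap Georganas describes:
# the next portion is communicated (here, produced into a queue by a
# background thread) while the operation on the current portion runs.
def producer(q, num_portions):
    for i in range(num_portions):
        q.put(f"portion-{i}")  # stands in for generating/communicating data
    q.put(None)                # sentinel: no more portions

def consumer(q):
    results = []
    while (portion := q.get()) is not None:
        results.append(portion.upper())  # stands in for a model operation
    return results                       # "complete" only after all portions

q = queue.Queue(maxsize=2)               # bounded buffer forces pipelining
t = threading.Thread(target=producer, args=(q, 5))
t.start()
print(consumer(q))                        # consumes while producer still runs
t.join()
```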
Regarding Claim 2, Naumov and Georganas teach the system of Claim 1, wherein the data consuming node is configured to perform the operations for the model using the received portions concurrently with the one or more producing nodes generating and/or communicating additional portions of the respective block of data (Georganas: “The main part of the 2D and 2.5D algorithm consists of a loop where each thread takes part in two broadcasts (related to the grid row and column, respectively) and performs one BLAS matrix product (dgemm) per iteration. We develop new versions of both the 2D and 2.5D algorithms that overlap communication and computation by starting the broadcast of the next iteration while the dgemm of the current iteration is being performed.” [Section IV. SUMMA]).
Regarding Claim 5, Naumov and Georganas teach the system of Claim 1, wherein a number of portions in each block of data is set based on one or more properties of the respective blocks of data, the data consuming node, or the data producing nodes (Georganas: “First of all, a problem set-up is defined by the matrix dimension n and the available number of processors p. The algorithms that leverage a 2D block cyclic distribution for the matrix partitioning have an extra parameter r that defines how many times we apply this 2D block cyclic distribution on the input matrix. Finally, the 2.5D algorithms have a parameter c that specifies the replication factor. For Cholesky and triangular solve, a 2D block cyclic distribution is applied on each replicated layer since we have implemented a two level blocking as it is described in Sections VI and VII. So, given the parameters n, p, r, c we can find out the block size that each BLAS call operates on and we can specify the exact number of processors and words involved in each communication operation.” [Section VIII. Methodology for Constructing Performance Models]).
Regarding Claim 6, Naumov and Georganas teach the system of Claim 1, wherein the model is a deep learning recommendation model (DLRM), and each data producing node stores a subset of embedding tables for the model in a local memory of that node (Naumov: “we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks.” [Abstract]; “At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2.” [Section 3. Parallelism]).
Regarding Claim 7, Naumov and Georganas teach the system of Claim 1, wherein the data consuming node is configured to combine lookup data received in the portions with results from a bottom multilayer perceptron to generate inputs for a top multilayer perceptron of a deep learning recommendation model (DLRM) (Naumov: “we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks.” [Abstract]; “At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2.” [Section 3. Parallelism]; “To handle the continuous features, the continuous features will be transformed by an MLP (which we call the bottom or dense MLP) which will yield a dense representation of the same length as the embedding vectors.” [Section 2.2. DLRM Architecture]; “This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.” [Section 2.2. DLRM Architecture]).
Regarding Claim 8, Naumov and Georganas teach the system of Claim 1, wherein the operations include matrix multiplication performed independently on data contained in each portion of the block, each portion being usable for the matrix multiplication without requiring other portions of the respective block of data (Georganas: “We study classical matrix multiplication, which computes all n^3 multiplications, though the reduced-complexity Strassen’s algorithm [7] can also minimize communication and outperform the classical algorithm in a high-performance setting [8]. Since each of the n^3 multiplications can be done independently, parallelization is very simple to load balance and schedule. Performing computation by blocks reduces the amount of data that must be moved between processors, as well as in a memory hierarchy. 2D algorithms distribute matrices in blocks among processes and communicate on a 2D grid of p processors. SUMMA performs communication with row and column broadcasts on this 2D grid.” [Section II.A. Matrix multiplication]).
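For illustration only, the per-portion independence underlying this mapping can be verified with a short NumPy sketch (hypothetical shapes): the product computed portion-by-portion matches the product computed on the whole block at once.

```python
import numpy as np

# Sketch (hypothetical shapes) of the property Georganas is cited for: the
# matrix product on each received portion of rows can be computed without
# the other portions, and the per-portion results assembled afterward.
rng = np.random.default_rng(1)
weights = rng.standard_normal((16, 4))
block = rng.standard_normal((12, 16))    # full block of lookup data
portions = np.array_split(block, 3)      # arrives as three portions

partial = [p @ weights for p in portions]       # independent per-portion matmul
assembled = np.vstack(partial)
assert np.allclose(assembled, block @ weights)  # same result as whole block
```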
Regarding Claim 9, Naumov and Georganas teach the system of Claim 1, wherein the blocks of data include model data communicated in an all-to-all exchange from the producing nodes to the consuming node (Naumov: “Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication [12]. At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2.” [Section 3. Parallelism]).
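For illustration only, the personalized all-to-all exchange quoted above can be simulated with nested lists (hypothetical labels): node i splits its result along the mini-batch dimension, and slice j from every producing node ends up at node j.

```python
# Sketch of the personalized all-to-all ("butterfly shuffle") that Naumov
# describes: node i slices its lookup result and sends slice j to node j.
nodes = 3
produced = [[f"n{i}->s{j}" for j in range(nodes)] for i in range(nodes)]
# After the exchange, node j holds slice j from every producing node.
received = [[produced[i][j] for i in range(nodes)] for j in range(nodes)]
print(received[0])  # ['n0->s0', 'n1->s0', 'n2->s0']
```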
Regarding Claim 10, Naumov and Georganas teach the system of Claim 1, wherein the operations include training the model using training data communicated from the one or more data producing nodes to the data consuming node in an all-reduce communication (Naumov: “We note that for the data parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce and applied to the replicated parameters on each device [12] in a synchronous fashion, ensuring the updated parameters on each device are consistent before every iteration.” [Section 3. Parallelism]; “In PyTorch, data parallelism is enabled through the nn.DistributedDataParallel and nn.DataParallel modules that replicate the model on each device and insert allreduce with the necessary dependencies. In Caffe2, we manually insert allreduce before the gradient update.” [Section 3. Parallelism]).
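For illustration only, the allreduce behavior quoted above can be sketched with plain arrays standing in for a collective-communication library such as torch.distributed; the gradient values are hypothetical.

```python
import numpy as np

# Illustration of the allreduce Naumov describes: each node's local gradient
# contribution is summed, and every node receives the same total before the
# synchronous parameter update.
local_gradients = [np.array([0.1, 0.2]),   # node 0
                   np.array([0.3, 0.4]),   # node 1
                   np.array([0.5, 0.6])]   # node 2

reduced = np.sum(local_gradients, axis=0)  # the "reduce" step (sum)
after_allreduce = [reduced.copy() for _ in local_gradients]  # "all": broadcast
print(after_allreduce[0])  # every node now holds [0.9, 1.2]
```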
Regarding Claims 11-12 and 15-20, they are method claims that correspond to the system of Claims 1-2 and 5-10. Therefore, they are rejected for the same reasons as Claims 1-2 and 5-10 above.
Claims 3-4 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Naumov in view of Georganas, as applied to Claims 1 and 11 above, and further in view of Zhao et al. (US 20190312772 A1, filed 04/04/2018), hereinafter Zhao. Zhao was cited in the previous Office Action.
Regarding Claim 3, Naumov and Georganas teach the system of Claim 1. However, they fail to expressly disclose wherein each data producing node is configured to dynamically allocate computational resources for generating one or more portions of its respective block of data.
In the same field of endeavor, Zhao teaches wherein each data producing node is configured to dynamically allocate computational resources for generating one or more portions of its respective block of data (Zhao: “systems and methods for dynamically scheduling and provisioning computing resources in a heterogeneous server cluster are configured to maintain information regarding the hardware connection topology of server nodes within a heterogeneous cluster, as well as current bandwidth usage information regarding intra-node and inter-node communication links of the server nodes, and utilize such information to provision computing devices (e.g., GPUs) in a way that optimizes communication bus and networking resources (mitigates or eliminates waste of network resources), and which optimally utilizes bidirectional connection topologies, in a balanced manner, to mitigate communication bottlenecks between computing resources.” [0014]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated wherein each data producing node is configured to dynamically allocate computational resources for generating one or more portions of its respective block of data, as taught by Zhao, into the system of Naumov and Georganas, because both of these systems are directed towards data parallelism in a communications network for executing data processing jobs. Making this combination and dynamically allocating the computation and network resources in each node, as taught by Zhao, would allow the system of Naumov and Georganas to “mitigate communication bottlenecks between computing resources” (Zhao: [0014]) and “optimize performance and resource usage” (Zhao: [0061]).
Regarding Claim 4, Naumov and Georganas teach the system of Claim 1. However, they fail to expressly disclose wherein at least one of the data producing node and the data consuming node is configured to allocate computational resources including workgroups for performing respective operations.
In the same field of endeavor, Zhao teaches wherein at least one of the data producing node and the data consuming node is configured to allocate computational resources including workgroups for performing respective operations (Zhao: “based on the resource demands of the service request, the control server node can determine a set of all qualified GPU devices across the server cluster which match the resource demands, and which are free for allocation.” [0070]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated wherein at least one of the data producing node and the data consuming node is configured to allocate computational resources including workgroups for performing respective operations, as taught by Zhao, into the system of Naumov and Georganas, because both systems are directed towards data parallelism in a communications network for executing data processing jobs. Making this combination and allocating GPU computational resources to perform operations, as taught by Zhao, would allow the system of Naumov and Georganas to make use of the “high-throughput, accelerated processing of compute kernels for workloads (e.g., vector-based computations, matrix-based computations, etc.) that exhibit data-parallelism” that GPUs provide (Zhao: [0002]).
Regarding Claims 13-14, they are method claims that correspond to the system of Claims 3-4. Therefore, they are rejected for the same reasons as Claims 3-4 above.
Response to Arguments
The Examiner acknowledges the Applicant’s amendments to Claims 1-20.
Applicant’s arguments, filed 09/19/2025, regarding the objections to the drawings have been fully considered and are persuasive. The objections have been withdrawn.
Applicant’s arguments, filed 09/19/2025, traversing the rejection of Claims 1-20 under 35 U.S.C. § 101 have been fully considered and are not persuasive.
Applicant alleges, on Pages 10-11 of the Remarks, that the amended claims recite patent-eligible subject matter because they are directed to a specific improvement in the functioning of distributed computing systems, not to an abstract idea. The limitations of the independent claims recite a concrete machine architecture, implemented with circuitry and a communication interface, that modifies how computation and communication are performed in a distributed environment. Specifically, the amended claims recite the ability of the consuming node to begin performing model operations prior to receiving all required data, thereby overlapping computation with communication and reducing latency. This represents a technical solution to a technical problem by eliminating bottlenecks through pipelining partial portions of data, a technique not taught in the prior art. Courts have consistently held that such improvements to computer performance constitute patent-eligible subject matter, such as in Enfish, which disclosed a self-referential table that improved database performance; DDR Holdings, whose claimed solution improved the functioning of computer networks; and McRO, which claimed rules that improved automated animation. As such, the amended independent claims are not “directed to” an abstract idea under Step 2A of the USPTO framework. Even if characterized as involving mathematical operations, the claims integrate those operations into a practical application that improves distributed computing.
Examiner respectfully disagrees.
As per MPEP § 2106.05(a), when evaluating whether a claim represents a technical solution to a technical problem, “it is important to note, the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements.” The purported improvement of “overlapping computation and reducing latency”, as noted in the Remarks, comes from “the ability of the consuming node to begin performing model operations prior to receiving all required data”. As noted above in the Step 2A Prong 1 section of the 101 analysis, the concept of multi-tasking, or more specifically of overlapping the performance of intermediate operations with the receipt of the resources necessary to perform those operations, as described in the specification, is a concept that can be performed in the human mind and has non-technical equivalents that provide a similar benefit of reducing bottlenecks. Even if there were no non-technical equivalent to the purported solution, the limitations are recited at a high level of generality that provides no restriction on how the overlapping is accomplished, nor any description of the mechanism for accomplishing it, such that the limitation preempts any and all applications of the concept of overlapping computation. As such, the court decisions cited by Applicant are mischaracterized as finding claims patent-eligible simply for alleging an improvement to computer performance. In actuality, the claims in Enfish were found patent eligible because the technical improvement was reflected in the specific mechanism by which the invention stored and retrieved data in memory, in combination with the specific data structure recited in the claims. DDR Holdings was found eligible because it provided an unconventional modification of Internet hyperlink protocol to dynamically produce a hybrid webpage. And McRO was found eligible because of the particular rules recited in the claim that enabled the automation of specific animation tasks. In contrast, the amended independent claims merely claim the idea of a solution, attributing the improvement to the abstract idea of overlapping tasks, and the additional elements merely serve as insignificant extra-solution activity or limit the use of the abstract idea to a particular field of use or technological environment; they thus do not constitute patent-eligible subject matter. As such, Examiner affirms that a prima facie case of patent ineligibility has been established. Dependent claims 2-10 and 12-20 are similarly ineligible for their dependency on an ineligible independent claim as well as for their own deficiencies outlined in the 35 U.S.C. § 101 rejection above.
Applicant’s arguments, filed 09/19/2025, regarding the rejection of Claims 1-20 under 35 U.S.C. § 103 have been fully considered and are found moot in light of the new grounds of rejection (see rejection above).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MEGAN E HWANG whose telephone number is (703)756-1377. The examiner can normally be reached Monday-Thursday 10:00-7:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch can be reached at (571) 272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.E.H./Examiner, Art Unit 2143
/JENNIFER N WELCH/Supervisory Patent Examiner, Art Unit 2143