Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are presented for examination in this application, 17/752,044, filed 2022-05-24, which has an effective filing date of 2019-07-15 based on provisional application 62/874,462.
The Examiner cites particular sections in the references as applied to the claims below for the convenience of the applicant(s). Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant(s) fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
Information Disclosure Statement
Acknowledgement is made of the information disclosure statements filed on 2022-10-19, 2023-09-07, 2025-02-18, 2025-04-22, 2025-05-13, and 2025-08-25.
Drawings
The drawings submitted on 2022-05-24 have been considered and accepted.
Double Patenting
The nonstatutory double patenting rejection is based on judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Examiner notes that claims 1, 5-8, 12-15, 19 and 20 are rejected on the ground of nonstatutory double patenting, as indicated below.
Claims 1, 5-8, 12-15, 19 and 20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 5-8, 12-15, 19 and 20 of U.S. Patent No. US11436019B2. Although the claims at issue are not identical, they are not patentably distinct from each other because claims 1, 5-8, 12-15, 19 and 20 of the instant application are broader than, and are therefore anticipated by, the claims of the issued patent. See the comparison below:
Instant Application
U.S. Patent No. US11436019B2
1. A system, comprising: a parameter server communicatively connected to a target device, the parameter server comprises:
a transmitter configured to transmit a portion of an artificial intelligence (AI) model to the target device,
the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model,
and contemporaneously, with a set of microbatches of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, a weight updater is configured to perform a reduction of parameters for a second subportion of the transmitted portion of the Al model, and the transmitter is further configured to send weights for a third subportion of the transmitted portion of the Al model to the target device.
5. The system of claim 1 wherein a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
6. The system of claim 1,
wherein the parameter server further comprises a precision formatter configured to: convert weights for a fourth subportion of the transmitted portion of the Al model to a first precision format prior to sending the weights to the target device; convert gradients received from the target device to a second precision format; and update the weights using the converted gradients.
7. The system of claim 1, wherein the transmitter is further configured to transmit another portion of the Al model to another target device; and the weight updater is further configured to receive gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
8. A method implemented in a parameter server, comprising:
transmitting a portion of a stored artificial intelligence (AI) model to a target device the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model
and contemporaneously, with a set of microbatches of a determined size,
of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing a reduction of parameters for a second subportion of the transmitted portion of the Al model and sending weights for a third subportion of the transmitted portion of the Al model to the target device.
12. The method of claim 8, wherein a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
13. The method of claim 8, further comprising: converting weights for a fourth subportion of the transmitted portion of the Al model to a first precision format prior to sending the weights to the target device; converting gradients received from the target device to a second precision format; and updating the weights using the converted gradients.
14. The method of claim 8, further comprising: transmitting another portion of the Al model to another target device; and receiving gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
15. A computer program product comprising a computer-readable storage device having computer program logic recorded thereon that when executed by a processor-based computer system causes the processor-based system to perform a method, the method comprising:
transmitting a portion of a stored artificial intelligence (AI) model to a target device, the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model;
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing at least one of a reduction of parameters for a second subportion of the transmitted portion of the Al model or sending weights for a third subportion of the transmitted portion of the Al model to the target device.
19. The computer program product of claim 15, wherein a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
20. The computer program product of claim 15, wherein the method further comprises: transmitting another portion of the Al model to another target device; and receiving gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
Claims of U.S. Patent No. US11436019B2:
1. A system, comprising: a parameter server communicatively connected to a target device, the parameter server comprises:
a data manager configured to store a master copy of an artificial intelligence (AI) model
a transmitter configured to transmit a portion of an artificial intelligence (AI) model to the target device,
the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model,
a batch manager configured to determine a microbatch size suitable for the target device;
and contemporaneously, with a set of microbatches of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, a weight updater is configured to perform a reduction of parameters for a second subportion of the transmitted portion of the Al model, and the transmitter is further configured to send weights for a third subportion of the transmitted portion of the Al model to the target device.
5. The system of claim 2, wherein the microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
6. A system, comprising: a parameter server communicatively connected to a target device, the parameter server comprises: a data manager configured to store a master copy of an artificial intelligence (Al) model; a transmitter configured to transmit a portion of the Al model to the target device; a batch manager configured to determine a microbatch size suitable for the target device; and contemporaneously, with a set of microbatches of a training dataset being executed at the target device on a first subportion of the transmitted portion of the Al model to generate gradients, a weight updater is configured to perform reduction of parameters for a second subportion of the transmitted portion of the Al model, and the transmitter is further configured to send weights for a third subportion of the transmitted portion of the Al model to the target device,
wherein the parameter server further comprises a precision formatter configured to: convert weights for a fourth subportion of the transmitted portion of the Al model to a first precision format prior to sending the weights to the target device; convert gradients received from the target device to a second precision format; and update the weights using the converted gradients.
7. The system of claim 1, wherein the transmitter is further configured to transmit another portion of the Al model to another target device; and the weight updater is further configured to receive gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
8. A method implemented in a parameter server, comprising: storing a master copy of an artificial intelligence (AI) model; transmitting a portion of the AI model to a target device the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model
determining a microbatch size suitable for the target device
and contemporaneously, with a set of microbatches
of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing a reduction of parameters for a second subportion of the transmitted portion of the Al model and sending weights for a third subportion of the transmitted portion of the Al model to the target device.
12. The method of claim 9, wherein the microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
13. A method implemented in a parameter server, comprising: storing a master copy of an artificial intelligence (Al) model; transmitting a portion of the Al model to a target device; determining a microbatch size suitable for the target device; contemporaneously, with a set of microbatches of a training dataset being executed at the target device on a first subportion of the transmitted portion of the Al model to generate gradients, performing reduction of parameters for a second subportion of the transmitted portion of the Al model and sending weights for a third subportion of the transmitted portion of the Al model to the target device;
converting weights for a fourth subportion of the transmitted portion of the Al model to a first precision format prior to sending the weights to the target device; converting gradients received from the target device to a second precision format; and updating the weights using the converted gradients.
14. The method of claim 8, further comprising: transmitting another portion of the Al model to another target device; and receiving gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
15. A computer program product comprising a computer-readable storage device having computer program logic recorded thereon that when executed by a processor-based computer system causes the processor-based system to perform a method, the method comprising: storing a master copy of an artificial intelligence (AI) model at a parameter server;
transmitting a portion of the Al model to a target device, the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model, and a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model; determining a microbatch size suitable for the target device; and contemporaneously, with a set of microbatches of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing reduction of parameters for a second subportion of the transmitted portion of the Al model and sending weights for a third subportion of the transmitted portion of the Al model to the target device.
19. The computer program product of claim 15, wherein the microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
20. The computer program product of claim 15, wherein the method further comprises: transmitting another portion of the Al model to another target device; and receiving gradients from the another target device to perform reduction of parameters for the another portion of the Al model.
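The comparison above can be read as a containment check: the instant claim is treated as broader because every limitation it recites also appears in the corresponding patent claim, which recites additional limitations. The following is a minimal illustrative sketch of that reading for claim 1 only. The limitation labels are abbreviated paraphrases introduced here for illustration (they are not claim language), and the helper names are hypothetical. An actual nonstatutory double patenting analysis turns on claim construction, not on set membership.

```python
# Hypothetical, abbreviated labels for the limitations of instant claim 1.
INSTANT_CLAIM_1 = frozenset({
    "parameter server connected to target device",
    "transmitter sends portion of AI model",
    "on-chip memory smaller than AI model",
    "portion size based on on-chip memory and layer sizes",
    "contemporaneous reduction and weight sending",
})

# Patent claim 1 recites the same limitations plus two more
# (the data manager and the batch manager).
PATENT_CLAIM_1 = INSTANT_CLAIM_1 | frozenset({
    "data manager stores master copy of AI model",
    "batch manager determines microbatch size",
})


def is_broader(examined: frozenset, reference: frozenset) -> bool:
    """Return True when the examined claim's limitations are a proper
    subset of the reference claim's limitations."""
    return examined < reference


def extra_limitations(examined: frozenset, reference: frozenset) -> list:
    """List the limitations recited only in the reference claim."""
    return sorted(reference - examined)


if __name__ == "__main__":
    print(is_broader(INSTANT_CLAIM_1, PATENT_CLAIM_1))
    for limitation in extra_limitations(INSTANT_CLAIM_1, PATENT_CLAIM_1):
        print("only in patent claim 1:", limitation)
```

Under this simplified model, the two limitations found only in the patent claim are the ones that make the patent claim narrower than the instant claim.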
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 as being unpatentable because the claimed invention in these claims is directed to an abstract idea without significantly more. The analysis of the claims will follow the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50-57 (January 7, 2019) (“2019 PEG”).
Regarding claim 1:
Step 1 – Is the claim directed to a process, machine, manufacture, or a composition of matter?
Yes, the claim is directed to a system.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or a natural phenomenon?
Yes, the claim recites abstract ideas:
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, a weight updater is configured to perform a reduction of parameters for a second subportion of the transmitted portion of the Al model — this limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the claim recites additional elements that do not integrate the judicial exception into a practical application:
a transmitter configured to transmit a portion of a stored artificial intelligence (AI) model to a target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, the transmitter is further configured to send weights for a third subportion of the transmitted portion of the Al model to the target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
Step 2B – Does the claim recite additional elements that amount to significantly more than the abstract idea itself?
No, there are no additional elements that amount to significantly more than the judicial exception. Any additional elements that were determined to be insignificant extra-solution activities in step 2A prong 2 are further evaluated in step 2B on whether they are well-understood, routine, and conventional activities. The “a transmitter configured to transmit a portion of a stored artificial intelligence (AI) model to a target device” and “contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the AI model stored in the on-chip memory to generate gradients, the transmitter is further configured to send weights for a third subportion of the transmitted portion of the AI model to the target device” limitations were found to be insignificant extra-solution activities in claim 1. These limitations are recited at a high level of generality and amount to transmitting data over a network, which is a well-understood, routine, and conventional activity (see MPEP 2106.05(d) II.). Thus, the claim is not patent eligible.
Regarding claim 2:
Claim 2 recites a data parallelism process using machine learning at a high level of generality, which amounts to applying the judicial exception on a computer (see MPEP 2106.05(f)). Claims 9 and 16 are analogous.
Regarding claim 3:
Claim 3 recites a data parallelism process using machine learning at a high level of generality, which amounts to applying the judicial exception on a computer (see MPEP 2106.05(f)). Claims 10 and 17 are analogous.
Regarding claim 4:
Claim 4 recites receiving gradients from a target device, which amounts to mere data gathering, a pre-solution activity that is an insignificant extra-solution activity (see MPEP 2106.05(g)(3)) as recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754). Claim 4 also recites gradients being generated by executing microbatches, which is an abstract idea (a calculation; see MPEP 2106.04(a)(2) I. C.), generating an average of gradients, which is an abstract idea (a calculation; see MPEP 2106.04(a)(2) I. C.), and updating an AI model, which amounts to a machine learning process at a high level of generality and thus to applying the judicial exception on a computer (see MPEP 2106.05(f)). Claims 11 and 18 are analogous.
Regarding claim 5:
Claim 5 recites a machine learning process (allowing for a batch, or subset of data, size to be configurable) recited at a high level of generality, which amounts to mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Claims 12 and 19 are analogous.
Regarding claim 6:
Claim 6 recites machine learning processes (converting weights and updating weights), at a high level of generality, which amounts to mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Claim 13 is analogous.
Regarding claim 7:
Claim 7 recites transmitting a portion of an AI model, as well as receiving gradients from a device, which amount to mere data gathering, a pre-solution activity that is an insignificant extra-solution activity (see MPEP 2106.05(g)(3)) as recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754). Furthermore, gathering data to be manipulated and used within a system is a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)). Claims 14 and 20 are analogous.
Regarding claim 8:
Step 1 – Is the claim directed to a process, machine, manufacture, or a composition of matter?
Yes, the claim is directed to a method.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or a natural phenomenon?
Yes, the claim recites abstract ideas:
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing a reduction of parameters for a second subportion of the transmitted portion of the Al model — this limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the claim recites additional elements that do not integrate the judicial exception into a practical application:
transmitting a portion of a stored artificial intelligence (AI) model to a target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, sending weights for a third subportion of the transmitted portion of the Al model to the target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
Step 2B – Does the claim recite additional elements that amount to significantly more than the abstract idea itself?
No, there are no additional elements that amount to significantly more than the judicial exception. Any additional elements that were determined to be insignificant extra-solution activities in step 2A prong 2 are further evaluated in step 2B on whether they are well-understood, routine, and conventional activities. The “transmitting a portion of a stored artificial intelligence (AI) model to a target device” and “sending weights for a third subportion of the transmitted portion of the AI model to the target device” limitations were found to be insignificant extra-solution activities in claim 8. These limitations are recited at a high level of generality and amount to transmitting data over a network, which is a well-understood, routine, and conventional activity (see MPEP 2106.05(d) II.). Thus, the claim is not patent eligible.
Regarding claim 15:
Step 1 – Is the claim directed to a process, machine, manufacture, or a composition of matter?
Yes, the claim is directed to a manufacture (a computer program product).
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or a natural phenomenon?
Yes, the claim recites abstract ideas:
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing at least one of a reduction of parameters for a second subportion of the transmitted portion of the Al model or sending weights for a third subportion of the transmitted portion of the Al model to the target device — this limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the claim recites additional elements that do not integrate the judicial exception into a practical application:
a computer program product comprising a computer-readable storage device having computer program logic recorded thereon that when executed by a processor-based computer system causes the processor-based system to perform a method — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
transmitting a portion of a stored artificial intelligence (AI) model to a target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model — this limitation amounts to mere instructions to apply an exception, as the use of a computer or other machinery in its ordinary capacity amounts to invoking computer components merely as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the Al model stored in the on-chip memory to generate gradients, performing at least one of a reduction of parameters for a second subportion of the transmitted portion of the AI model or sending weights for a third subportion of the transmitted portion of the Al model to the target device — this limitation is directed to mere data gathering and outputting which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)).
Step 2B – Does the claim recite additional elements that amount to significantly more than the abstract idea itself?
No, there are no additional elements that amount to significantly more than the judicial exception. Any additional elements that were determined to be insignificant extra-solution activities in step 2A prong 2 are further evaluated in step 2B on whether they are well-understood, routine, and conventional activities. The “transmitting a portion of a stored artificial intelligence (AI) model to a target device” and “sending weights for a third subportion of the transmitted portion of the AI model to the target device” limitations were found to be insignificant extra-solution activities in claim 15. These limitations are recited at a high level of generality and amount to transmitting data over a network, which is a well-understood, routine, and conventional activity (see MPEP 2106.05(d) II.). Thus, the claim is not patent eligible.
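For reference, the limitation “a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model,” analyzed for claims 1, 8, and 15 above, can be pictured with the following minimal sketch. It is one possible reading only, assuming that a portion is a run of consecutive layers whose combined size fits in the on-chip memory. The function name, the greedy grouping strategy, and the byte values are hypothetical and are not taken from the application or the cited references.

```python
from typing import List


def plan_portions(layer_sizes: List[int], on_chip_bytes: int) -> List[List[int]]:
    """Greedily group consecutive layer indices into portions whose total
    size does not exceed the on-chip memory size (hypothetical strategy)."""
    portions: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(layer_sizes):
        if size > on_chip_bytes:
            # A single layer larger than the memory cannot be placed
            # under this simplified scheme.
            raise ValueError(f"layer {index} does not fit in on-chip memory")
        if current and used + size > on_chip_bytes:
            # Close the current portion and start a new one.
            portions.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        portions.append(current)
    return portions


if __name__ == "__main__":
    # Five layers with illustrative sizes and an 8-unit on-chip memory.
    print(plan_portions([4, 3, 5, 2, 6], 8))
```

In this sketch the portion boundaries depend on both inputs named in the limitation: changing either the memory size or any layer size can change which layers are grouped together.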
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 6-11, 13-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sridharan et al. (US20190205745A1, hereinafter referred to as Sridharan) in view of Ambrose et al. (US20170344882A1, hereinafter referred to as Ambrose).
Regarding claim 1:
Sridharan teaches a system, comprising: a parameter server communicatively connected to a target device (see [0047]: “FIG. 1 is a block diagram of a processing system 100, according to an embodiment. In various embodiments the system 100 includes one or more processors 102 and one or more graphics processors 108, and may be a single processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the system 100 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.”. Also see [0223]: “The communications framework can then create a second group 2230 of worker nodes can includes additional sets of worker nodes 2236A-2236B having low mutual latency. The specific number of groups that are created can vary based on the topology of the network. One or more parameter servers 2220 can then be instantiated. The parameter servers can then be configured to enable efficient inter-group communication between the various groups 2210, 2230.”.)
the parameter server comprises: a transmitter configured to transmit a portion of an artificial intelligence (AI) model to the target device (see [0222]: “Described herein is a communication system that makes use of a topology-aware algorithm for flexible node grouping. In one embodiment, a distributed training system can be constructed in a manner that is sensitive to the existing network topology of the worker nodes, such that local nodes can be assembled into compute groups based on network topology. Nodes within a compute group communicate with each other using operations such as all-reduce, while distant nodes are bridged via synchronization operations performed with a parameter server.”.)
and
contemporaneously, with a set of microbatches of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the AI model stored in the on-chip memory to generate gradients (see [0202]: “As shown in FIG. 20A, data parallelism can be implemented in which input data 2002 is split along a mini-batch dimension and the same model is replicated across the nodes. The mini-batch is split across several compute nodes, with each node responsible for computing gradients with respect to all model parameters using a subset of the samples in the mini-batch. Forward propagation is performed independently on each node. In one embodiment only one communication is performed during the backward pass to calculate an average for the gradients with respect to learnable parameters.”.),
a weight updater is configured to perform a reduction of parameters for a second subportion of the transmitted portion of the AI model (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.) and
the transmitter is further configured to send weights for a third subportion of the transmitted portion of the AI model to the target device (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.)
[(Examiner’s note: A person having ordinary skill in the art, applying the broadest reasonable interpretation in light of the specification, could take “set of microbatches” to mean at least one minibatch. In the instant application, the specification at [0040] reads: “A group of microbatches forms a minibatch, which is the term for the number of samples per update (for training) or the number served in every inference cycle (for inference).”, which further gives reason to believe that here “group” and “set” could be taken to be synonymous.)]
Sridharan does not explicitly teach the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model.
Ambrose, however, teaches, in analogous art, the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.) and
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sridharan and Ambrose before him or her, to modify the system of claim 1 to include attributes of having the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model in order to allow for optimal scheduling schemes (see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC.”.).
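For illustration only, the limitation “a size of the portion being based at least on the on-chip memory size and a size of one or more layers” can be pictured as grouping consecutive layers so that each group fits within a fixed memory budget. The sketch below is an assumption made for explanatory purposes: the function name partition_layers, the greedy grouping rule, and the byte counts are hypothetical and are not taken from Sridharan, Ambrose, or the instant specification.

```python
from typing import List


def partition_layers(layer_sizes: List[int], on_chip_bytes: int) -> List[List[int]]:
    """Greedily group consecutive layer indices so each group fits on chip.

    layer_sizes   -- hypothetical per-layer memory footprints, in bytes
    on_chip_bytes -- hypothetical on-chip memory capacity, in bytes
    """
    portions: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, size in enumerate(layer_sizes):
        if size > on_chip_bytes:
            # A single layer larger than the budget cannot be placed by this sketch.
            raise ValueError(f"layer {idx} does not fit in on-chip memory")
        if used + size > on_chip_bytes:
            # Close the current portion and start a new one with this layer.
            portions.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        portions.append(current)
    return portions


# Five layers, 10-byte budget: the portion size follows from both inputs.
example = partition_layers([4, 4, 6, 2, 8], 10)
```

In this example the full model (24 bytes) exceeds the 10-byte budget, so the layers are split into the portions [[0, 1], [2, 3], [4]].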
Regarding claim 2:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan further teaches wherein the first subportion of the transmitted portion of the AI model comprises a current layer of the AI model (see [0204]: “As shown in FIG. 20C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 2002, weight data 2004, and/or activation data 2006 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3).”) and
the second subportion of the transmitted portion of the AI model comprises a prior layer of the AI model (see [0205]: “FIG. 20D illustrates the transfer of partial activation data 2006A-2006B for a given layer of a neural network (Layer N-1) to a successive layer of the neural network (Layer N). Via multiple nodes (Node 0, Node 1), a set of partial activations 2006A-2006B is generated by based on the application of a mathematical operation (e.g., convolution) to the input data 2002A-2002B and weight data 2004A-2004B. For example, in one embodiment a reduce_scatter operation 2010 is used which performs a reduce operation on the partial activations 2006A-2006B of layer N-1 from the multiple nodes and scatters the result to the multiple nodes as activations for use in Layer N of the neural network.”).
Regarding claims 9 and 16:
Claims 9 and 16 recite analogous limitations to claim 2 and are therefore rejected on the same grounds.
Regarding claim 3:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan further teaches wherein the first subportion of the transmitted portion of the AI model comprises a current layer of the AI model (see [0205]: “FIG. 20D illustrates the transfer of partial activation data 2006A-2006B for a given layer of a neural network (Layer N-1) to a successive layer of the neural network (Layer N). Via multiple nodes (Node 0, Node 1), a set of partial activations 2006A-2006B is generated by based on the application of a mathematical operation (e.g., convolution) to the input data 2002A-2002B and weight data 2004A-2004B. For example, in one embodiment a reduce_scatter operation 2010 is used which performs a reduce operation on the partial activations 2006A-2006B of layer N-1 from the multiple nodes and scatters the result to the multiple nodes as activations for use in Layer N of the neural network.”.) and
the third subportion of the transmitted portion of the AI model comprises a next layer of the AI model (see [0205]: “FIG. 20D illustrates the transfer of partial activation data 2006A-2006B for a given layer of a neural network (Layer N-1) to a successive layer of the neural network (Layer N). Via multiple nodes (Node 0, Node 1), a set of partial activations 2006A-2006B is generated by based on the application of a mathematical operation (e.g., convolution) to the input data 2002A-2002B and weight data 2004A-2004B. For example, in one embodiment a reduce_scatter operation 2010 is used which performs a reduce operation on the partial activations 2006A-2006B of layer N-1 from the multiple nodes and scatters the result to the multiple nodes as activations for use in Layer N of the neural network.”.).
Regarding claims 10 and 17:
Claims 10 and 17 recite analogous limitations to claim 3 and are therefore rejected on the same grounds.
Regarding claim 4:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan further teaches wherein the weight updater is configured to: perform reduction of parameters by: receiving gradients from the target device, the gradients being generated by the target device executing the set of microbatches of the training dataset on the second subportion at the target device (see [0202]: “As shown in FIG. 20A, data parallelism can be implemented in which input data 2002 is split along a mini-batch dimension and the same model is replicated across the nodes. The mini-batch is split across several compute nodes, with each node responsible for computing gradients with respect to all model parameters using a subset of the samples in the mini-batch.”.)
generating an average of the received gradients (see [0202]: “Forward propagation is performed independently on each node. In one embodiment only one communication is performed during the backward pass to calculate an average for the gradients with respect to learnable parameters.”.)
update the AI model with the average of the received gradients (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass. In one embodiment, distributed weight update can be enabled in which a reduce_scatter is used calculate an average for gradients before stochastic gradient descent is performed and an allgather operation is used after stochastic gradient descent to synchronize weights across nodes.”.)
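For illustration only, the gradient averaging and weight update described in the cited passage of Sridharan at [0202] can be sketched as follows. The function names, the use of plain Python lists in place of a collective communication library, and the learning-rate parameter lr are assumptions made for this sketch and do not appear in either reference.

```python
from typing import List


def average_gradients(per_node_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the gradient vectors received from each node."""
    if not per_node_grads:
        raise ValueError("need gradients from at least one node")
    n = len(per_node_grads)
    # zip(*...) pairs up the k-th gradient entry from every node.
    return [sum(vals) / n for vals in zip(*per_node_grads)]


def sgd_update(weights: List[float], avg_grad: List[float], lr: float) -> List[float]:
    """One plain stochastic-gradient-descent step using the averaged gradient."""
    return [w - lr * g for w, g in zip(weights, avg_grad)]


# Two hypothetical nodes, each reporting gradients for two parameters.
avg = average_gradients([[1.0, 2.0], [3.0, 4.0]])
new_weights = sgd_update([1.0, 1.0], avg, lr=0.5)
```

Here avg is [2.0, 3.0] and new_weights is [0.0, -0.5]; in a distributed setting the averaging step would typically be carried out by a collective operation rather than on a single host.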
Regarding claims 11 and 18:
Claims 11 and 18 recite analogous limitations to claim 4 and are therefore rejected on the same grounds.
Regarding claim 6:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan further teaches convert weights for a fourth subportion of the transmitted portion of the AI model to a first precision format prior to sending the weights to the target device (see [0202]: “As shown in FIG. 20A, data parallelism can be implemented in which input data 2002 is split along a mini-batch dimension and the same model is replicated across the nodes. The mini-batch is split across several compute nodes, with each node responsible for computing gradients with respect to all model parameters using a subset of the samples in the mini-batch.”.)
convert gradients received from the target device to a second precision format (see [0202]: “Forward propagation is performed independently on each node. In one embodiment only one communication is performed during the backward pass to calculate an average for the gradients with respect to learnable parameters.”.)
update the weights using the converted gradients (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass. In one embodiment, distributed weight update can be enabled in which a reduce_scatter is used calculate an average for gradients before stochastic gradient descent is performed and an allgather operation is used after stochastic gradient descent to synchronize weights across nodes.”.)
Regarding claim 13:
Claim 13 recites analogous limitations to claim 6 and is therefore rejected on the same grounds.
Regarding claim 7:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan further teaches wherein the transmitter is further configured to transmit another portion of the AI model to another target device (see [0204]: “As shown in FIG. 20C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 2002, weight data 2004, and/or activation data 2006 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3).”) and
the weight updater is further configured to receive gradients from the another target device to perform reduction of parameters for the another portion of the AI model (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.)
Regarding claims 14 and 20:
Claims 14 and 20 recite analogous limitations to claim 7 and are therefore rejected on the same grounds.
Regarding claim 8:
Sridharan teaches a method implemented in a parameter server, comprising: (see [0223]: “The communications framework can then create a second group 2230 of worker nodes can includes additional sets of worker nodes 2236A-2236B having low mutual latency. The specific number of groups that are created can vary based on the topology of the network. One or more parameter servers 2220 can then be instantiated. The parameter servers can then be configured to enable efficient inter-group communication between the various groups 2210, 2230.”.)
transmitting a portion of an artificial intelligence (AI) model to the target device (see [0222]: “Described herein is a communication system that makes use of a topology-aware algorithm for flexible node grouping. In one embodiment, a distributed training system can be constructed in a manner that is sensitive to the existing network topology of the worker nodes, such that local nodes can be assembled into compute groups based on network topology. Nodes within a compute group communicate with each other using operations such as all-reduce, while distant nodes are bridged via synchronization operations performed with a parameter server.”.)
and
contemporaneously, with a set of microbatches, of a determined size, of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the AI model stored in the on-chip memory to generate gradients (see [0202]: “As shown in FIG. 20A, data parallelism can be implemented in which input data 2002 is split along a mini-batch dimension and the same model is replicated across the nodes. The mini-batch is split across several compute nodes, with each node responsible for computing gradients with respect to all model parameters using a subset of the samples in the mini-batch. Forward propagation is performed independently on each node. In one embodiment only one communication is performed during the backward pass to calculate an average for the gradients with respect to learnable parameters.”.),
performing a reduction of parameters for a second subportion of the transmitted portion of the AI model (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.) and
sending weights for a third subportion of the transmitted portion of the AI model to the target device (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.)
[(Examiner’s note: A person having ordinary skill in the art, applying the broadest reasonable interpretation in light of the specification, could take “set of microbatches” to mean at least one minibatch. In the instant application, the specification at [0040] reads: “A group of microbatches forms a minibatch, which is the term for the number of samples per update (for training) or the number served in every inference cycle (for inference).”, which further gives reason to believe that here “group” and “set” could be taken to be synonymous.)]
Sridharan does not explicitly teach the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model.
Ambrose, however, teaches, in analogous art, the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.) and
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sridharan and Ambrose before him or her, to modify the method of claim 8 to include attributes of having the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model in order to allow for optimal scheduling schemes (see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC.”.).
Regarding claim 15:
Sridharan teaches a computer program product comprising a computer-readable storage device having computer program logic recorded thereon that when executed by a processor-based computer system causes the processor-based system to perform a method, the method comprising: (see [0047]: “FIG. 1 is a block diagram of a processing system 100, according to an embodiment. In various embodiments the system 100 includes one or more processors 102 and one or more graphics processors 108, and may be a single processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the system 100 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.”. Also see [0223]: “The communications framework can then create a second group 2230 of worker nodes can includes additional sets of worker nodes 2236A-2236B having low mutual latency. The specific number of groups that are created can vary based on the topology of the network. One or more parameter servers 2220 can then be instantiated. The parameter servers can then be configured to enable efficient inter-group communication between the various groups 2210, 2230.”.)
transmitting a portion of an artificial intelligence (AI) model to the target device (see [0222]: “Described herein is a communication system that makes use of a topology-aware algorithm for flexible node grouping. In one embodiment, a distributed training system can be constructed in a manner that is sensitive to the existing network topology of the worker nodes, such that local nodes can be assembled into compute groups based on network topology. Nodes within a compute group communicate with each other using operations such as all-reduce, while distant nodes are bridged via synchronization operations performed with a parameter server.”.)
and
contemporaneously, with a set of microbatches of a training dataset being executed by the integrated circuit chip at the target device on a first subportion of the transmitted portion of the AI model stored in the on-chip memory to generate gradients (see [0202]: “As shown in FIG. 20A, data parallelism can be implemented in which input data 2002 is split along a mini-batch dimension and the same model is replicated across the nodes. The mini-batch is split across several compute nodes, with each node responsible for computing gradients with respect to all model parameters using a subset of the samples in the mini-batch. Forward propagation is performed independently on each node. In one embodiment only one communication is performed during the backward pass to calculate an average for the gradients with respect to learnable parameters.”.),
performing a reduction of parameters for a second subportion of the transmitted portion of the AI model (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.) or
sending weights for a third subportion of the transmitted portion of the AI model to the target device (see [0202]: “An allreduce operation 2005 is used to update the weights of each layer for the next forward pass”. Also see [0206]: “During back propagation 2028, distributed stochastic gradient descent is performed to generate updated weight data. An initial Allreduce operation 2012 is performed for Layer N and a set of Allreduce operations 2011A, 2011B, 2011N are performed to update the weights of each layer for the next forward pass.”.)
[(Examiner’s note: A person having ordinary skill in the art, applying the broadest reasonable interpretation in light of the specification, could take “set of microbatches” to mean at least one minibatch. In the instant application, the specification at [0040] reads: “A group of microbatches forms a minibatch, which is the term for the number of samples per update (for training) or the number served in every inference cycle (for inference).”, which further gives reason to believe that here “group” and “set” could be taken to be synonymous.)]
Sridharan does not explicitly teach the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the Al model.
Ambrose, however, teaches in analogous teach the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the Al model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.) and
a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model (see [0005]: “Graphical Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms.”. Also see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC. In order to reduce the number of simulations but still be able to find the best selection of scheduling schemes to layers of the CNN algorithm, an estimation method is required in order to accurately predict the design costs, such as the required local memory size, the external memory accesses and the execution time of the CNN algorithm 1703.”.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sridharan and Ambrose before him or her, to modify the computer program product of claim 15 to include attributes of having the target device comprising an integrated circuit chip having an on-chip memory of a size less than an entirety of the AI model or a size of the portion being based at least on the on-chip memory size and a size of one or more layers of the AI model in order to allow for optimal scheduling schemes (see [0046]: “The best scheduling scheme 1722 for the particular CNN algorithm layer 1704 is defined as the scheduling scheme which satisfies given criteria with minimal design costs to execute on the SoC 1714 for the layer 17104 in question. The criteria and design costs are for the entire SoC. For example, the best scheduling schemes can be ones which require a local shared memory (such as 1710) whose size is within a given constraint while consuming minimal accesses to the external memory 1709 thus resulting in a smaller execution time for the entire SoC.”.).
Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Sridharan et al. (US20190205745A1 hereinafter referred to as Sridharan) in view of Ambrose et al. (US20170344882A1 hereinafter referred to as Ambrose) in further view of Oyama et al. (“Accelerating Deep Learning Frameworks with Micro-Batches” hereinafter referred to as Oyama).
Regarding claim 5:
Sridharan in view of Ambrose teaches the system of claim 1.
Sridharan in view of Ambrose does not explicitly teach wherein a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server.
Oyama, however, teaches, in analogous art, wherein a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server (see section III B, page 5: “The goal of the WR policy is to minimize T(B), the total execution time with mini-batch size of B using Dynamic Programming (DP), where T(b) is defined as follows:
[Equation image (media_image1.png): definition of T(b)]
where Tμ(b) is the fastest execution time of one convolution kernel with a micro-batch size of b, within the workspace constraint.” )
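For illustration only, and not as part of the cited reference or of the record, the quoted dynamic-programming objective can be sketched in Python. The recurrence used here is an assumption, because the equation defining T(b) is not legible in this text: a mini-batch of size b is either run as a single micro-batch in time Tμ(b), or split into two smaller parts that are each solved the same way. The function name `best_split` and the timing table are hypothetical, and the times are arbitrary integer units, not measurements.

```python
from functools import lru_cache


def best_split(B, t_micro):
    """Return (total_time, micro_batch_sizes) for a mini-batch of size B.

    t_micro maps a micro-batch size b to its time T_mu(b). Sizes missing
    from the table are treated as not runnable as a single micro-batch
    (for example, because they exceed the workspace constraint).
    Assumed recurrence: T(b) = min(T_mu(b), min over b' of T(b') + T(b - b')).
    """

    @lru_cache(maxsize=None)
    def T(b):
        # Option 1: run the whole batch of size b as one micro-batch.
        best = (t_micro[b], (b,)) if b in t_micro else (float("inf"), ())
        # Option 2: split into two parts and solve each recursively.
        for first in range(1, b // 2 + 1):
            left, right = T(first), T(b - first)
            candidate = left[0] + right[0]
            if candidate < best[0]:
                best = (candidate, left[1] + right[1])
        return best

    total, parts = T(B)
    return total, list(parts)


# Hypothetical timings in arbitrary integer units.
timings = {1: 10, 2: 16, 4: 35, 8: 90}
total, parts = best_split(8, timings)
print(total, parts)
```

With this hypothetical table, four micro-batches of size 2 are cheaper in total than one micro-batch of size 8, which is the kind of trade-off the quoted passage describes.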
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sridharan, Ambrose, and Oyama before him or her, to modify the system of claim 1 to include the attribute that a microbatch size is configurable based on a rate of executing the set of microbatches at the target device and a rate of communication between the target device and the parameter server in order to minimize total execution time (see section III B, page 5: “The goal of the WR policy is to minimize T(B), the total execution time with mini-batch size of B”.).
Regarding claims 12 and 19:
Claims 12 and 19 recite analogous limitations to claim 5 and are therefore rejected on the same grounds.
Pertinent Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
US20190073590A1 — Wu et al. — discloses system-on-chip architecture using data parallelism and mini-batches
“Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server” — Awan — discloses system-on-chip as well as server-on-chip architecture in the context of data parallelism
“PipeDream: Fast and Efficient Pipeline Parallel DNN” — Harlap et al. — discloses data parallelism and the use of minibatches
“Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis” — Ben-Nun et al.(1) — discloses single-machine parallelism architecture in the context of data parallelism
“A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning” — Ben-Nun et al.(2) — discloses the use of micro-batches in the context of data parallelism
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Andrew A Bracero whose telephone number is 571-270-0592. The examiner can normally be reached Monday - Friday 9:00 a.m. - 5:00 p.m. E.T.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached Monday – Friday 9:00 a.m. – 5:00 p.m. E.T. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ANDREW BRACERO/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126