Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
Acknowledgment is made of the Information Disclosure Statements dated 10/13/2025 and 11/21/2025. All of the cited references have been considered.
Response to Arguments
Applicant’s arguments on pages 11-13 with respect to the rejection of claims 1-20 under 35 U.S.C. 103 have been fully considered but are moot because the arguments do not apply to the combination of references used in the current rejection. The newly cited Feldman reference has been incorporated below to teach the newly presented limitations.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7 and 10-20 are rejected under 35 U.S.C. 103 as being unpatentable over Rafferty et al. (US20230168932A1), hereinafter Rafferty, in view of Feldman et al. (US20220237506A1), hereinafter Feldman.
Claim 1 is rejected over Rafferty and Feldman.
Regarding claim 1, Rafferty teaches a method of managing execution of a first inference model hosted by data processing systems, the method comprising: (“Described herein are methods, systems, computer readable media, etc. for training a machine learning model to monitor and/or predict usage of computing services and inference models to scale up or down computing resources for use by those computing services and/or inference models.”; [0023])
obtaining an inference frequency capability of the first inference model, the inference frequency capability indicating a rate of execution of the first inference model; (“an inference model is configured to receive a request and make an inference based on the request, and that inference is returned as a response to the request.”; [0037] and “These correlations may indicate, for example, a number of queries or requests an inference model (first inference model) is predicted to receive in a future time window.”; [0038]; Note: The number of queries or requests is the inference frequency capability.)
making a first determination regarding whether the inference frequency capability of the first inference model meets an inference frequency requirement of a downstream consumer during a future period of time; (“In particular, by using a trained machine learning model to monitor usage of computing services and/or resources (e.g., a first usage) to determine and predict when usage of an inference model (e.g., a second usage) will scale up or down, the system can predictively increase or decrease computing resources allocated to and/or used by an inference model.”; [0023])
in an instance of the first determination in which the inference frequency capability of the first inference model does not meet the inference frequency requirement of the downstream consumer:
obtaining an execution plan for the first inference model based on the inference frequency requirement of the downstream consumer; and (“In particular, by using a trained machine learning model to monitor usage of computing services and/or resources (e.g., a first usage) to determine and predict when usage of an inference model (e.g., a second usage) will scale up or down, the system can predictively increase or decrease computing resources allocated to and/or used by an inference model.”; [0023]; and “At an operation 262, a command is transmitted, in real-time, in response to the determination that the current usage data is indicative of the at least one first spike in the first usage of the at least one first computing service, where the command is transmitted to increase an amount of computing resources available for an execution of the inference model.”; [0040]; Note: See Figure 2 262 to see the execution plan where computing resources are increased for execution of the inference model.)
prior to the future period of time, modifying a deployment of the first inference model to the data processing systems based on the execution plan. (“FIG. 2 is a flowchart illustrating a process for training a machine learning model and using that machine learning model to monitor and/or predict usage of computing services and inference models to scale up computing resources for use by those computing services and/or inference models in accordance with one or more embodiments of the present disclosure.”; [0009]; and “FIG. 3 is a flowchart illustrating a process for monitoring and predicting usage of computing services and inference models to scale down computing resources for use by those computing services and/or inference models in accordance with one or more embodiments of the present disclosure.”; [0010]; Note: Scaling computing resources for the execution of the inference model is modifying the deployment of the first inference model.)
Rafferty does not appear to explicitly teach by executing at least one action selected from a group of actions consisting of:
deploying additional instances of the first inference model to the data processing systems,
deploying instances of a third inference model to the data processing systems where the third inference model is a different and separately trained inference model from the first inference model, and
terminating one or more existing instances of the first inference model that are hosted on the data processing systems.
However, Feldman teaches by executing at least one action selected from a group of actions consisting of:
deploying additional instances of the first inference model to the data processing systems, (“The routing manager 164 makes decisions to load, rebalance, delete, distribute, and replicate machine-learning models in the serving containers 128-152, based on the following information. The data model's hierarchy level (2) in the service discovery system 162 provides information about which serving containers are expected to host specific machine-learning models and which serving containers actually host the specified machine-learning models.”; [0031])
terminating one or more existing instances of the first inference model that are hosted on the data processing systems. (“Each of the serving containers 128-152 will keep its own list of actual machine-learning models and if this list does not match the list of expected machine-learning models that a serving container receives, the serving container will load or delete any machine-learning models from the serving container's local cache as needed, and then update its own list of actual machine-learning models accordingly.”; [0031])
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the scaling of resources for inference models of Rafferty with the replication of machine learning models of Feldman to effectively rebalance models in serving containers (Feldman, [0031]). Rafferty and Feldman are analogous art because they both concern scaling the deployment of models.
Claim 2 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 2, Rafferty teaches wherein the inference frequency capability of the first inference model is based on historical data indicating the rate of execution of the first inference model during a previous period of time or an analysis of the topology of the first inference model. (“FIG. 2 is a flowchart illustrating a process 250 for training a machine learning model and using that machine learning model to monitor and/or predict usage of computing services and inference models to scale up computing resources for use by those computing services and/or inference models in accordance with one or more embodiments of the present disclosure. At an operation 252, historical usage data associated with a plurality of computing services (e.g., first usage) provided by a plurality of distributed servers is received. The computing services may be one or more of a website provider service, an advertisement provider service, an in-store traffic monitoring service, a transaction or purchase tracking service, or a game console service. In such examples, usage patterns of those services may be determined by the machine learning model/algorithm to be indicative of usage spikes or decreases in the usage of a given inference model(s)”; [0036])
Claim 3 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 3, Rafferty teaches obtaining data anticipating an event impacting execution of the first inference model; and (“At an operation 256, a machine learning model is trained based on the historical usage data to determine a correlation between a first usage of at least one first computing service of the plurality of computing services and a second usage of the inference model. That correlation may indicate that at least one first spike in the first usage of the at least one first computing service precedes at least one second spike in the second usage of the inference model.”; [0038]; Note: The first spike in the first usage of the computing service will impact execution of the inference model.)
obtaining the inference frequency requirement of the downstream consumer during the future period of time based on the data anticipating the event impacting the execution of the first inference model. (“In various embodiments, multiple correlations between usage data of the computing services and inference model may be made, where each correlation is representative of a different prediction for what usage of an inference model will look like based on usage of one or more computer services. In various embodiments, if usage data for multiple inference models is input, the machine learning model may also learn correlations between usage of inference models, so that scaling of resources for a first inference model may also be based on usage data of one or more second inference models.”; [0038])
Claim 4 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 4, Rafferty teaches historical data indicating occurrences of events requiring a change in the inference frequency capability of the first inference model; (“At an operation 256, a machine learning model is trained based on the historical usage data to determine a correlation between a first usage of at least one first computing service of the plurality of computing services and a second usage of the inference model. That correlation may indicate that at least one first spike in the first usage of the at least one first computing service precedes at least one second spike in the second usage of the inference model.”; [0038]; Note: The first spike in the first usage of the computing service will impact execution of the inference model.)
current operational data of the data processing systems; and (“the method further includes receiving, by the one or more processors, in real-time, current usage data associated with the at least one first computing service of the plurality of computing services. The method further includes determining, by the one or more processors based on the current usage data and the correlation, in real-time, that the current usage data is indicative of the at least one first spike in the first usage of the at least one first computing service that precedes the at least one second spike in the second usage of the inference model.”; [0004])
a transmission from the downstream consumer indicating a change in operation of the downstream consumer. (“At an operation 262, a command is transmitted, in real-time, in response to the determination that the current usage data is indicative of the at least one first spike in the first usage of the at least one first computing service, where the command is transmitted to increase an amount of computing resources available for an execution of the inference model.”; [0040]; Note: See Figure 2 262 to see the execution plan.)
Claim 5 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 5, Rafferty teaches feeding the data anticipating the event impacting the execution of the first inference model into a second inference model, the second inference model being trained to predict the inference frequency requirement of the downstream consumer during the future period of time. (“As also described herein, an inference model is configured to receive a request and make an inference based on the request, and that inference is returned as a response to the request. The inference model may be, for example, one or more of a credit checking service, a credit limit estimation service, a line of credit approval s . . .”; [0037] and “At an operation 256, a machine learning model is trained based on the historical usage data (second inference model) to determine a correlation between a first usage of at least one first computing service of the plurality of computing services and a second usage of the inference model. That correlation may indicate that at least one first spike in the first usage of the at least one first computing service precedes at least one second spike in the second usage of the inference model. In various embodiments, multiple correlations between usage data of the computing services and inference model may be made, where each correlation is representative of a different prediction for what usage of an inference model will look like based on usage of one or more computer services. In various embodiments, if usage data for multiple inference models is input, the machine learning model may also learn correlations between usage of inference models, so that scaling of resources for a first inference model may also be based on usage data of one or more second inference models. These correlations may indicate, for example, a number of queries or requests an inference model is predicted to receive in a future time window.”; [0038]; Note: The number of queries or requests is the inference frequency requirement.)
Claim 6 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 6, Rafferty teaches wherein the execution plan indicates a change in the deployment of the first inference model to meet the inference frequency requirement of the downstream consumer during the future period of time. (“At an operation 256, a machine learning model is trained based on the historical usage data (second inference model) to determine a correlation between a first usage of at least one first computing service of the plurality of computing services and a second usage of the inference model. That correlation may indicate that at least one first spike in the first usage of the at least one first computing service precedes at least one second spike in the second usage of the inference model. In various embodiments, multiple correlations between usage data of the computing services and inference model may be made, where each correlation is representative of a different prediction for what usage of an inference model will look like based on usage of one or more computer services. In various embodiments, if usage data for multiple inference models is input, the machine learning model may also learn correlations between usage of inference models, so that scaling of resources for a first inference model may also be based on usage data of one or more second inference models. These correlations may indicate, for example, a number of queries or requests an inference model is predicted to receive in a future time window.”; [0038]; and “The method further includes transmitting, by the one or more processors in response to the determination that the current usage data is indicative of the at least one first spike in the first usage of the at least one first computing service, in real-time, at least one command to increase an amount of computing resources available for an execution of the inference model.”; [0004]; and Note: The number of queries or requests is the inference frequency requirement.)
Claim 7 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 7, Rafferty teaches obtaining a quantity of instances of the first inference model required to meet the inference frequency requirement of the downstream consumer during the future period of time based on characteristics of the first inference model; (“if usage data for multiple inference models is input, the machine learning model may also learn correlations between usage of inference models, so that scaling of resources for a first inference model may also be based on usage data of one or more second inference models. These correlations may indicate, for example, a number of queries or requests an inference model is predicted to receive in a future time window.”; [0038])
making a second determination that the data processing systems have sufficient computing resource capacity to execute the quantity of instances of the first inference model; and (“In particular, by using a trained machine learning model to monitor usage of computing services and/or resources (e.g., a first usage) to determine and predict when usage of an inference model (e.g., a second usage) will scale up or down, the system can predictively increase or decrease computing resources allocated to and/or used by an inference model.”; [0023])
based on the second determination:
generating the execution plan specifying which of the data processing systems are to host each of the quantity of the instances of the first inference model. (“At an operation 262, a command is transmitted, in real-time, in response to the determination that the current usage data is indicative of the at least one first spike in the first usage of the at least one first computing service, where the command is transmitted to increase an amount of computing resources available for an execution of the inference model.”; [0040]; Note: See Figure 2 262 to see the execution plan.)
Claim 10 is rejected over Rafferty and Feldman with the incorporation of claim 1.
Regarding claim 10, Rafferty teaches a non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for managing execution of a first inference model hosted by data processing systems, the operations comprising: (“the present disclosure provides an exemplary technically improved non-transitory computer readable medium having instructions stored thereon that, upon execution by a computing device, cause the computing device to perform operations including receiving resource scheduling instructions configured to, upon execution by the computing device, increase an amount of computing resources available for an execution of an inference model.”; [0006])
The remainder of claim 10 is claim 1 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 1 stated above.
Dependent claim 11 is claim 2 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 2 stated above. For the rejection of the limitations specifically pertaining to the non-transitory machine-readable medium of claim 10, see the rejection of claim 10 above.
Dependent claim 12 is claim 3 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 3 stated above. For the rejection of the limitations specifically pertaining to the non-transitory machine-readable medium of claim 10, see the rejection of claim 10 above.
Dependent claim 13 is claim 4 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 4 stated above. For the rejection of the limitations specifically pertaining to the non-transitory machine-readable medium of claim 10, see the rejection of claim 10 above.
Dependent claim 14 is claim 5 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 5 stated above. For the rejection of the limitations specifically pertaining to the non-transitory machine-readable medium of claim 10, see the rejection of claim 10 above.
Dependent claim 15 is claim 6 in the form of a non-transitory machine-readable medium and is rejected for the same reasons as claim 6 stated above. For the rejection of the limitations specifically pertaining to the non-transitory machine-readable medium of claim 10, see the rejection of claim 10 above.
Claim 16 is rejected over Rafferty and Feldman.
Regarding claim 16, Rafferty teaches a data processing system, comprising: a processor; and a memory coupled to the processor to store instructions, which when executed by the processor, cause the processor to perform operations for managing execution of a first inference model hosted by data processing systems, the operations comprising: (“the present disclosure provides an exemplary technically improved computer-based system that includes at least the following components of a memory and at least one processor coupled to the memory. The processor is configured to receive historical usage data associated with a plurality of computing services provided by a plurality of distributed servers. The processor is further configured to receive historical usage data associated with an inference model associated with at least one of the plurality of computing services.”; [0005])
The remainder of claim 16 is claim 1 in the form of a data processing system and is rejected for the same reasons as claim 1 stated above.
Dependent claim 17 is claim 2 in the form of a data processing system and is rejected for the same reasons as claim 2 stated above. For the rejection of the limitations specifically pertaining to the data processing system of claim 16, see the rejection of claim 16 above.
Dependent claim 18 is claim 3 in the form of a data processing system and is rejected for the same reasons as claim 3 stated above. For the rejection of the limitations specifically pertaining to the data processing system of claim 16, see the rejection of claim 16 above.
Dependent claim 19 is claim 4 in the form of a data processing system and is rejected for the same reasons as claim 4 stated above. For the rejection of the limitations specifically pertaining to the data processing system of claim 16, see the rejection of claim 16 above.
Dependent claim 20 is claim 5 in the form of a data processing system and is rejected for the same reasons as claim 5 stated above. For the rejection of the limitations specifically pertaining to the data processing system of claim 16, see the rejection of claim 16 above.
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Rafferty and Feldman in view of Yang et al. (US20220269835A1), hereinafter Yang.
Claim 8 is rejected over Rafferty, Feldman and Yang with the incorporation of claim 1.
Regarding claim 8, Rafferty teaches obtaining a quantity of instances of the first inference model required to meet the inference frequency requirement of the downstream consumer during the future period of time based on characteristics of the first inference model; (“if usage data for multiple inference models is input, the machine learning model may also learn correlations between usage of inference models, so that scaling of resources for a first inference model may also be based on usage data of one or more second inference models. These correlations may indicate, for example, a number of queries or requests an inference model is predicted to receive in a future time window.”; [0038])
making a second determination that the data processing systems do not have sufficient computing resource capacity to execute the quantity of instances of the first inference model; and (“In particular, by using a trained machine learning model to monitor usage of computing services and/or resources (e.g., a first usage) to determine and predict when usage of an inference model (e.g., a second usage) will scale up or down, the system can predictively increase or decrease computing resources allocated to and/or used by an inference model.”; [0023] and “FIG. 3 is a flowchart illustrating a process for monitoring and predicting usage of computing services and inference models to scale down computing resources for use by those computing services and/or inference models in accordance with one or more embodiments of the present disclosure.”; [0010])
Rafferty does not teach based on the second determination:
obtaining a quantity of instances of the third inference model to be deployed to the data processing systems based on the inference frequency requirement of the downstream consumer during the future period of time; and
generating the execution plan specifying which of the data processing systems are to host each of the quantity of the instances of the third inference model.
However, Yang teaches based on the second determination:
obtaining a quantity of instances of the third inference model to be deployed to the data processing systems based on the inference frequency requirement of the downstream consumer during the future period of time; and (“FIG. 4 shows an embodiment of a machine learning model predicting a model performance. As a general matter, the resource prediction twin 420 may output a prediction of the resources (e.g., hardware, platform, cloud instances, and the like) required for running the machine learning model. Examples of items that the resource prediction twin 420 may output include (i) trade-off between estimated cost and runtime for a different number of GPUs and memory, (ii) cost and performance options for a certain job. Additionally or alternatively, the resource prediction twin may output the prediction of a resource based on logs, scripts, and collected data 430, or hardware data 440. In the case of 430, the resource prediction twin 420 may predict metrics. For example, the resource prediction twin 420 may predict how long each would take to complete a job if the user 410 wants to run the job on a GPU versus CPU. Conversely, in the case of 440, the resource prediction twin 420 may predict parameters. For example, the resource prediction twin may predict a model configuration that may satisfy target values under the constraints given a certain resource (e.g., memory less than 1 GB) limitation on the system.”; [0057])
generating the execution plan specifying which of the data processing systems are to host each of the quantity of the instances of the third inference model. (“The processor may be configured to select a deployable machine learning model having the evaluation score that meets a predetermined criterion from among the candidate machine learning models, virtually execute the deployable machine learning model on each of candidate hardware platforms according to the constraints, and generate an assessment report of the virtual performance metrics set of the deployable machine learning model executed on each of the candidate hardware platforms. The processor may be configured to select the suggested hardware platform meeting the predetermined criterion (hosting) from among the candidate hardware platforms, the suggested hardware platform probabilistically satisfying the targeted objective under the constraints when combined with the deployable machine learning model for execution.”; [0004])
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the scaling of resources for inference models of Rafferty with the deployment constraints of Yang to satisfy competing goals without wasting time in running an actual machine learning model (Yang, [0058]). Rafferty and Yang are analogous art because they both concern scaling the deployment of models.
Claim 9 is rejected over Rafferty, Feldman and Yang with the incorporation of claim 1.
Regarding claim 9, Rafferty does not teach obtaining the third inference model, the third inference model being a lower complexity inference model than the first inference model and the data processing systems having capacity to host a sufficient quantity of instances of the third inference model to meet the inference frequency requirement of the downstream consumer during the future period of time; and
obtaining an inference frequency capability of the third inference model while hosted by the data processing systems.
However, Yang teaches obtaining the third inference model, the third inference model being a lower complexity inference model than the first inference model and the data processing systems having capacity to host a sufficient quantity of instances of the third inference model to meet the inference frequency requirement of the downstream consumer during the future period of time; and (“For example, the constraint adoption stage 320 may deal with how to re-build the machine learning model that was created at the model creation stage 310 with specific constraints (e.g., 50 hours given to execute the machine learning model on a specific hardware platform, machine learning model size less than 1 GB, $500 budget for a machine learning model training). There may be one or more constraints both in a training model (e.g., hardware constraints, a memory size, a CPU computing power, training data, and/or resource) and a production model (e.g., performance constraints, data constraints, and/or runtime environment constraints).”; [0054])
obtaining an inference frequency capability of the third inference model while hosted by the data processing systems. (“For example, the resource prediction twin 420 may predict how long each would take to complete a job if the user 410 wants to run the job on a GPU versus CPU.”; [0057]; Note: The performance metrics including how long each would take to complete a job are part of the inference frequency requirement. The constraints on the rebuilt third model scale it to a lower complexity inference model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the scaling of resources for inference models of Rafferty with the deployment constraints of Yang to satisfy competing goals without wasting time in running an actual machine learning model (Yang, [0058]). Rafferty and Yang are analogous art because they both concern scaling the deployment of models.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID H TRAN whose telephone number is (703)756-1525. The examiner can normally be reached M-F 9:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID H TRAN/
Examiner, Art Unit 2147

/VIKER A LAMARDO/
Supervisory Patent Examiner, Art Unit 2147