Last updated: May 29, 2026
Application No. 18/518,927
DYNAMIC ENDPOINT MANAGEMENT FOR HETEROGENEOUS MACHINE LEARNING MODELS

Non-Final OA §102§103
Filed
Nov 24, 2023
Examiner
EWALD, JOHN ROBERT DAKITA
Art Unit
2199
Tech Center
2100 — Computer Architecture & Software
Assignee
Amazon Technologies, Inc.
OA Round
1 (Non-Final)
Interview Optional

— +50.0% interview lift. Examiner has a relatively high allowance rate (77%); +50.0% interview lift. A written response may suffice.
Based on 22 resolved cases, 2023–2026
Examiner Intelligence

EWALD, JOHN ROBERT DAKITA View full profile →
Grants 77% — above average
Career Allowance Rate
17 granted / 22 resolved
+22.3% vs TC avg
Strong +50% interview lift
Without
With
+50.0%
Interview Lift
resolved cases with interview
Typical timeline
3y 3m
Avg Prosecution
13 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
2.4%
-37.6% vs TC avg
§103
90.4%
+50.4% vs TC avg
§102
2.4%
-37.6% vs TC avg
§112
4.8%
-35.2% vs TC avg
Black line = Tech Center average estimate • Based on career data from 22 resolved cases
Office Action

§102 §103
DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claims 1-20 are pending in this application. 

Information Disclosure Statement
The IDS’s filed on 1/28/2025, 2/25/2025, and 4/09/2025 have been considered. 

Claim Objections
Claim 1 is objected to because of the following informalities: The claim recites a limitation of "responsive to detection of the event to rebalance or the event to scale, make a placement decision that selects a computing resource from the plurality of computing resources host one of the plurality of machine learning models based..." Examiner believes the claim should recite "responsive to detection of the event to rebalance or the event to scale, make a placement decision that selects a computing resource from the plurality of computing resources to host one of the plurality of machine learning models based...”  Appropriate correction is required.



Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 5-6, 10, 13-15, 18, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Feldman et al. (US Pub. No. 2022/0382601 A1 hereinafter Feldman).
As per claim 5, Feldman teaches a method, comprising: detecting, at a machine learning service, a placement event for a machine learning model associated with a managed network endpoint (¶ [0027], “The ML serving infrastructure 100 can be implemented via any other type of distributed computer network environment in which a set of servers control the storage and distribution of resources and services for different client users.” ¶ [0053], “Where balancing, loading, or rebalancing is needed a bin packing algorithm can be utilized using multiple criteria to distribute ML models into different containers. These criteria can include resources such as memory, processing cycles, demand, and similar metrics. For example, each ML model size/capacity can be (S1, S2, . . . , Sn) and each container can have (C1, C2, . . . , Cn) capacity. As a result, the routing manager can employ a multi-dimensional bin packing solution. Given a list of models with different memory, processing, demand, and similar requirements, the routing manager needs to place them into a finite number of containers, each with certain memory, processing, and similar capacity, to minimize the number of containers in total.”), wherein the managed network endpoint provides access to a plurality of different machine learning models, including the machine learning model, via requests to invoke specified ones of the plurality of different machine learning models received from one or more clients of the machine learning service (¶ [0024], “The ML serving infrastructure 100 receives requests from tenants via a machine-learning service (MLS) gateway 101 or a similar interface. The MLS gateway 101 or similar interface receives a request from a tenant application and identifies a version or instance of a ML model associated with the request. The MLS gateway 101 or similar interface identifies model information associated with ML models corresponding to a cluster of available serving containers associated with the version of the ML model. The MLS gateway 101 uses the model information to select a serving container from the cluster of available serving containers…If the ML model is loaded in the serving container, the system executes, in the serving container (e.g., 105A-C), the ML model (e.g., the scoring models 133) on behalf of the request. The ML serving infrastructure 100 responds to the request based on executing the appropriate ML model on behalf of the request.”); selecting, by the machine learning service, a computing resource from a plurality of computing resources associated with the managed network endpoint to host the machine learning model based, at least in part, on a determination that the computing resource satisfies a resource requirement for the machine learning model (¶ [0041], “Similarly, rebalancing takes into consideration that some ML models are supported by a specific set of containers, thus, moving an ML model between containers must be consistent with this support. In each case, the routing manager tracks an expected ML model to container allotment that is based on an initial allotment of the ML models to the containers, as well as an actual ML model to container allotment that reflects the current allotment.” See also para. 0053.); placing, by the machine learning service, the machine learning model at the selected computing resource to complete a response to the placement event (¶ [0069], “When a new serving container joins the pool or rebalancing of the ML models are happening, there can be a need to move some models around from one container to another. As a result, the routing manager supports loading the model in the new container and un-loading it from the old one so that there is not any unavailability of the model. In one example implementation, the routing manager is moving the Mi model from Cj container to C-k container.”).

As per claim 6, Feldman teaches the method of claim 5. Feldman also teaches wherein the placement event is detected in response to a rebalance event to rebalance is detected to rebalance the plurality of different machine learning models amongst the plurality of computing resources, and wherein the one machine learning model is moved from another one of the plurality of computing resources based on performance metrics of the selected computing resource or the other one computing resource (¶ [0053], “Where balancing, loading, or rebalancing is needed a bin packing algorithm can be utilized using multiple criteria to distribute ML models into different containers. These criteria can include resources such as memory, processing cycles, demand, and similar metrics. For example, each ML model size/capacity can be (S1, S2, . . . , Sn) and each container can have (C1, C2, . . . , Cn) capacity. As a result, the routing manager can employ a multi-dimensional bin packing solution. Given a list of models with different memory, processing, demand, and similar requirements, the routing manager needs to place them into a finite number of containers, each with certain memory, processing, and similar capacity, to minimize the number of containers in total.” ¶ [0069], “When a new serving container joins the pool or rebalancing of the ML models are happening, there can be a need to move some models around from one container to another. As a result, the routing manager supports loading the model in the new container and un-loading it from the old one so that there is not any unavailability of the model. In one example implementation, the routing manager is moving the Mi model from Cj container to C-k container.”).

As per claim 10, Feldman teaches the method of claim 5. Feldman also teaches wherein placement event is detected in response to a scaling event to scale up from no replicas of the machine learning model to at least one replica of the machine learning model (¶ [0024], “The MLS gateway 101 or similar interface identifies model information associated with ML models corresponding to a cluster of available serving containers associated with the version of the ML model. The MLS gateway 101 uses the model information to select a serving container from the cluster of available serving containers. If the ML model is not loaded in the serving container, the ML serving infrastructure 100 loads the ML model in the serving container.” ¶ [0026], “If a copy of the specific ML model needed to service the incoming request is not already loaded in a serving container 115, then an existing or new serving container loads the required ML model. When a copy of the specific ML model is verified to be loaded in the serving container, then the specific ML model executes the requested service or function, as specified in the received request, in the serving container.”).
As per claim 13, Feldman teaches the method of claim 5. Feldman also teaches wherein the placement event for the machine learning model is to add a replica of the machine learning model amongst the plurality of computing resources (¶ [0033]-[0034], “The routing manager 175 103 makes decisions to load, rebalance, delete, distribute, and replicate ML models in the serving containers 115. These decisions can be based on the information provided to the routing service 103 and routing manager 175 by the serving containers 115 and other elements of the ML serving infrastructure 100…If the serving container list does not match the list of expected ML models that a serving container receives, the serving container can load or delete any ML models as needed, and then update its list of executing ML models accordingly. The routing manager 175 can monitor and maintain each serving container's list of actual ML models to determine where to route requests. The routing manager 175 can analyze the model information about each ML model to decide whether to replicate frequently used ML models to additional serving containers to prevent overloading the serving containers which are hosting the frequently used ML models.” See also para. 0043, 0045, and 0056-0057.).

As per claim 14, it is a non-transitory computer-readable media claim comprising similar limitations to claim 5, so it is rejected for similar reasons. 

As per claim 15, it is a non-transitory computer-readable media claim comprising similar limitations to claim 6, so it is rejected for similar reasons. 

As per claim 18, it is a non-transitory computer-readable media claim comprising similar limitations to claim 10, so it is rejected for similar reasons. 

As per claim 20, it is a non-transitory computer-readable media claim comprising similar limitations to claim 13, so it is rejected for similar reasons. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-4 are rejected under 35 U.S.C. 103 as being unpatentable over Feldman in view of Stefani et al. (US Patent No. 11,126,927 B2 hereinafter Stefani).
As per claim 1, Feldman teaches a system, comprising: a plurality of computing devices, respectively comprising at least one processor and a memory, that implement a machine learning service (¶ [0015], “The following description describes implementations for a method and process for managing a distribution of machine learning (ML) models in an ML serving infrastructure. The implementations for the method and process of managing a distribution of ML models in the ML serving infrastructure introduces a routing manager to the ML serving infrastructure. The routing manager manages the available set of containers for the ML serving infrastructure and the allotment of ML models across these containers.” ¶ [0080], “FIG. 7A includes hardware 720 comprising a set of one or more processor(s) 722, a set of one or more network interfaces 724 (wireless and/or wired), and machine-readable media 726 having stored therein software 728 (which includes instructions executable by the set of one or more processor(s) 722). The machine-readable media 726 may include non-transitory and/or transitory machine-readable media. Each of the previously described clients and the routing manager may be implemented in one or more electronic devices 700. In one implementation: 1) each of the clients is implemented in a separate one of the electronic devices 700 (e.g., in end user devices where the software 728 represents the software to implement clients to interface directly and/or indirectly with the routing manager (e.g., software 728 represents a web browser, a native client, a portal, a command-line interface, and/or an application programming interface (API) based upon protocols such as Simple Object Access Protocol (SOAP), Representational State Transfer (REST), etc.)); 2) the Routing manager is implemented in a separate set of one or more of the electronic devices 700 (e.g., a set of one or more server devices where the software 728 represents the software to implement the routing manager); and 3) in operation, the electronic devices implementing the clients and the routing manager would be communicatively coupled (e.g., by a network) and would establish between them (or through one or more other layers and/or or other services) connections for submitting requests to the ML models where routing is assisted by the routing manager and returning responses to the clients.” See also para. 0075.), wherein the machine learning service is configured to: host a managed network endpoint, wherein the managed network endpoint provides access to a plurality of different machine learning models hosted at one or more of a plurality of computing resources associated with the managed network endpoint (¶ [0015], “The following description describes implementations for a method and process for managing a distribution of machine learning (ML) models in an ML serving infrastructure. The implementations for the method and process of managing a distribution of ML models in the ML serving infrastructure introduces a routing manager to the ML serving infrastructure. The routing manager manages the available set of containers for the ML serving infrastructure and the allotment of ML models across these containers.” ¶ [0020]-[0021], “FIG. 1 is a diagram of one example implementation of an ML serving infrastructure that supports a multi-tenant system. The machine-learning serving infrastructure 100 includes a machine-learning service (MLS) gateway 101, routing service 103, routing manager 175, service discovery and configuration system 111, set of serving containers 115, and data stores, along with other supporting infrastructure. A serving container 115 can be an isolated execution environment that is enabled by an underlying operating system, and which executes the main functionality of a program such as an ML model. A serving container 115 can host any number of ML models for any number of tenants…An ML serving infrastructure 100 can host any number of serving containers 115 or clusters of serving containers. Different clusters can host different versions or types of ML models.”), including the machine learning model, via requests to invoke specified ones of the plurality of different machine learning models received from one or more clients of the machine learning service (¶ [0024], “The ML serving infrastructure 100 receives requests from tenants via a machine-learning service (MLS) gateway 101 or a similar interface. The MLS gateway 101 or similar interface receives a request from a tenant application and identifies a version or instance of a ML model associated with the request. The MLS gateway 101 or similar interface identifies model information associated with ML models corresponding to a cluster of available serving containers associated with the version of the ML model. The MLS gateway 101 uses the model information to select a serving container from the cluster of available serving containers…If the ML model is loaded in the serving container, the system executes, in the serving container (e.g., 105A-C), the ML model (e.g., the scoring models 133) on behalf of the request. The ML serving infrastructure 100 responds to the request based on executing the appropriate ML model on behalf of the request.”); monitor the managed network endpoint (¶ [0034], “The routing manager 175 can analyze the model information about each ML model to decide whether to replicate frequently used ML models to additional serving containers to prevent overloading the serving containers which are hosting the frequently used ML models. The routing manager 175 can use the data model information of the service discovery and configuration system 111 to manage lists of available ML models and available serving containers.” ¶ [0056], “The routing manager performs loading, rebalancing, deleting, and replicating the ML models and the other components support the routing manager 175 in terms of supplying the routing manager 175 with all the information it needs to make decisions. The service discovery and configuration system 111 and routing manager 175 maintain a hierarchy to give a full picture of what ML models are initially assigned to specific service containers and what ML models are actually currently assigned. This information can be maintained as a list of expected models as a model mapping hierarchy and a list actual models…The routing manager can periodically compare and where the lists do not match the routing manager 175 will load/delete an ML model from the local cache of a serving container and update the actual model list at the serving container accordingly.”) for: an event to rebalance the plurality of different machine learning models amongst the plurality of computing resources (¶ [0039]-[0041], “In this example implementation, the balancing process can be initiated in response to receiving or detecting an update of the container and/or ML model metrics including metrics for container resource usage, ML model demands, serviced requests, and similar metrics collected per container and/or per ML model (Block 201). The collected metrics can be processed to determine the metrics per ML model and/or container to determine recent or current container resource usage, serviced requests, ML model demands, and similar metrics per container and/or ML model (Block 203). The process can then determine whether there is an imbalance in resource usage across containers, a stressed container, or similar issue with the current distribution (Block 205)…If an imbalance between containers or ML models if found or a container or ML model is stressed, then the process initiates a rebalancing of the ML models relative to the assigned containers to decrease the imbalance or to relieve stress on an ML model or container (Block 207)… Imbalance parameters can be configurable by administrators, users of the ML models, or similar entities. In the case of imbalances, the ML model demands can be within the container resources for the current ML model assignments, but there can be instances of ML models or containers that have high usage or demands within the set of available ML models and containers while other ML models and containers have low usage or demands. Where there is a delta or difference between the high and low usage ML models and containers that exceeds a defined threshold or where there is a delta between the high and low usage ML models and containers relative to an average of the usage of ML models and containers, then an imbalance can be identified…Rebalancing can also be an alteration of the routing of the requests to ML models and containers. Rebalancing takes into consideration that the received requests are specific to an ML model, thus, rerouting of such requests must be to another instance of the same ML model.”); responsive to the detection of the event to rebalance or the event to scale, make a placement decision that selects a computing resource from the plurality of computing resources host one of the plurality of machine learning models (¶ [0069], “When a new serving container joins the pool or rebalancing of the ML models are happening, there can be a need to move some models around from one container to another. As a result, the routing manager supports loading the model in the new container and un-loading it from the old one so that there is not any unavailability of the model. In one example implementation, the routing manager is moving the Mi model from Cj container to C-k container.”) based, at least in part, on a determination that the computing resource satisfies a resource requirement for the machine learning model (¶ [0041], “Similarly, rebalancing takes into consideration that some ML models are supported by a specific set of containers, thus, moving an ML model between containers must be consistent with this support. In each case, the routing manager tracks an expected ML model to container allotment that is based on an initial allotment of the ML models to the containers, as well as an actual ML model to container allotment that reflects the current allotment.” See also para. 0053.); and place the machine learning model at the selected computing resource (¶ [0069], “When a new serving container joins the pool or rebalancing of the ML models are happening, there can be a need to move some models around from one container to another. As a result, the routing manager supports loading the model in the new container and un-loading it from the old one so that there is not any unavailability of the model. In one example implementation, the routing manager is moving the Mi model from Cj container to C-k container.”).

Although Feldman teaches a generic scaling of machine learning models, Feldman fails to teach monitoring the machine learning models for an event to scale the plurality of computing resources or the plurality of machine learning models.
Accordingly, Stefani teaches monitor for an event to scale the plurality of computing resources or the plurality of different machine learning models (Col. 4 & 5, lines 50-67 & 1-13, “For example, in some embodiments the auto-scaling system 106 includes an auto-scaling monitor 108 that can trigger an auto-scaling—e.g., an addition and/or removal of model instances from a fleet 116 by an auto-scaling engine 114—based on monitoring (or obtaining) operational metric values 110 associated with operating conditions of the fleet. The auto-scaling monitor 108 may obtain these metric values by direct observation/querying of the fleet, interacting with a logging service, receiving report data from the fleet, etc. Exemplary metric values 110 that can be utilized as part of auto-scaling hosted machine learning models are shown in FIG. 3. In this figure, a variety of operational metric values 110 are shown that can be monitored and potentially be used to determine whether the current fleet 116 of model instances 118A-118N serving a model 120 is over- or under-provisioned and thus, whether to add or remove capacity from the fleet.” See also Col. 5 & 6, lines 28-67 & 1-10.).
Feldman and Stefani are considered to be analogous to the claimed invention because they are in the same field of resource allocation and load-balancing for machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the load-balancing system of Feldman with the scaling functionality of Stefani to arrive at the claimed invention. The motivation to modify Feldman with the teachings of Stefani is that scaling functionality ensures scalability of machine learning models even in unpredictable environments while also avoiding wasting resources (See Stefani – Col. 3, lines 6-10.).

As per claim 2, Feldman and Stefani teach the system of claim 1. Feldman teaches wherein the event to rebalance is detected, and wherein the one machine learning model is moved from another one of the plurality of computing resources based on performance metrics of the selected computing resource or the other one computing resource (¶ [0053], “Where balancing, loading, or rebalancing is needed a bin packing algorithm can be utilized using multiple criteria to distribute ML models into different containers. These criteria can include resources such as memory, processing cycles, demand, and similar metrics. For example, each ML model size/capacity can be (S1, S2, . . . , Sn) and each container can have (C1, C2, . . . , Cn) capacity. As a result, the routing manager can employ a multi-dimensional bin packing solution. Given a list of models with different memory, processing, demand, and similar requirements, the routing manager needs to place them into a finite number of containers, each with certain memory, processing, and similar capacity, to minimize the number of containers in total.”).

As per claim 3, Feldman and Stefani teach the system of claim 1. Feldman teaches wherein the event to scale is detected and wherein the event to scale increases or decreases the number of computing resources (¶ [0049], “One use case is where a new node (i.e., service container) in the cluster is added. When a new node is added, the serving container will update the service discovery and configuration system 111 (e.g., data structures like the container state). The routing manager 175 will be notified of the change (e.g., via a monitor of the data structures). In this use case, based on an output of the bin packing algorithm, the routing manager can kick off a rebalancing.” See also para. 0062.). Stefani also teaches wherein the event to scale is detected and wherein the event to scale increases or decreases the number of computing resources (Col. 3, lines 11-20, “As illustrated, an auto-scaling system 106 (e.g., software executed by one or more computing devices of a provider network 102) can “auto-scale” the resources of a fleet 116 of model instances 118A-118N that host a machine learning model 120 to dynamically match the amount of resources to host the model 120 with the demands put on the model, without degrading the performance of the model.”).

As per claim 4, Feldman and Stefani teach the system of claim 1. Feldman teaches increases or decreases the number of at least one replica of the plurality of different machine learning models (¶ [0033]-[0034], “The routing manager 175 103 makes decisions to load, rebalance, delete, distribute, and replicate ML models in the serving containers 115. These decisions can be based on the information provided to the routing service 103 and routing manager 175 by the serving containers 115 and other elements of the ML serving infrastructure 100…If the serving container list does not match the list of expected ML models that a serving container receives, the serving container can load or delete any ML models as needed, and then update its list of executing ML models accordingly. The routing manager 175 can monitor and maintain each serving container's list of actual ML models to determine where to route requests. The routing manager 175 can analyze the model information about each ML model to decide whether to replicate frequently used ML models to additional serving containers to prevent overloading the serving containers which are hosting the frequently used ML models.” See also para. 0043, 0045, and 0056-0057.). Stefani also teaches wherein the event to scale is detected and wherein the event to scale increases or decreases the number of at least one replica of the plurality different machine learning models (Col. 4 & 5, lines 50-67 & 1-13, “For example, in some embodiments the auto-scaling system 106 includes an auto-scaling monitor 108 that can trigger an auto-scaling—e.g., an addition and/or removal of model instances from a fleet 116 by an auto-scaling engine 114—based on monitoring (or obtaining) operational metric values 110 associated with operating conditions of the fleet. The auto-scaling monitor 108 may obtain these metric values by direct observation/querying of the fleet, interacting with a logging service, receiving report data from the fleet, etc. Exemplary metric values 110 that can be utilized as part of auto-scaling hosted machine learning models are shown in FIG. 3. In this figure, a variety of operational metric values 110 are shown that can be monitored and potentially be used to determine whether the current fleet 116 of model instances 118A-118N serving a model 120 is over- or under-provisioned and thus, whether to add or remove capacity from the fleet.”).

Claim(s) 7-9, 11, 16-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Feldman as applied to claims 5 and 14 above, and further in view of Stefani.
As per claim 7, Feldman teaches the method of claim 5. 

Although Feldman teaches a generic scaling of machine learning models, Feldman fails to teach monitoring the machine learning models for an event to scale the plurality of computing resources or the plurality of machine learning models.
Accordingly, Stefani teaches wherein placement event is detected in response to a scaling event to increase the plurality of computing resources associated with the managed network endpoint (Col. 3, lines 11-20, “As illustrated, an auto-scaling system 106 (e.g., software executed by one or more computing devices of a provider network 102) can “auto-scale” the resources of a fleet 116 of model instances 118A-118N that host a machine learning model 120 to dynamically match the amount of resources to host the model 120 with the demands put on the model, without degrading the performance of the model.”).
Feldman and Stefani are considered to be analogous to the claimed invention because they are in the same field of resource allocation and load-balancing for machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the load-balancing system of Feldman with the scaling functionality of Stefani to arrive at the claimed invention. The motivation to modify Feldman with the teachings of Stefani is that scaling functionality ensures scalability of machine learning models even in unpredictable environments while also avoiding wasting resources (See Stefani – Col. 3, lines 6-10.).

As per claim 8, Feldman teaches the method of claim 5. Feldman teaches an interface of the machine learning service (¶ [0080], “FIG. 7A includes hardware 720 comprising a set of one or more processor(s) 722, a set of one or more network interfaces 724 (wireless and/or wired), and machine-readable media 726 having stored therein software 728 (which includes instructions executable by the set of one or more processor(s) 722). The machine-readable media 726 may include non-transitory and/or transitory machine-readable media. Each of the previously described clients and the routing manager may be implemented in one or more electronic devices 700. In one implementation: 1) each of the clients is implemented in a separate one of the electronic devices 700 (e.g., in end user devices where the software 728 represents the software to implement clients to interface directly and/or indirectly with the routing manager…”).

Feldman fails to teach being able to specify a scaling policy for the machine learning service via the interface.
	However, Stefani teaches wherein placement event is detected in response to a scaling event according to a scaling policy specified via an interface of the machine learning service (Col. 5 & 6, lines 50-67 & 1-10, “For example, turning ahead to FIG. 4, a user interface 402A providing a metric value chart 404 and a user interface 402B for configuring the auto-scaling of a hosted machine learning model are illustrated. One or both of these user interfaces 402A-402B may be provided by a console 122 of the provider network 102, which could comprise a web server that powers a web application or O/S specific application used by a user 132. In some embodiments, the console 122 provides a user interface 402B allowing the user to enable or disable “reactive” auto-scaling (e.g., via a user interface input element such as a checkbox, button, etc.). The user interface 402B may also provide functionality enabling the user to specify one or more metric conditions 450. As illustrated, two types of metric conditions 450 are utilized—ones that cause a “scaling up” of additional fleet resources, and ones that cause a “scaling down” of fleet resources.”).
Feldman and Stefani are considered to be analogous to the claimed invention because they are in the same field of resource allocation and load-balancing for machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the  interface of Feldman with the interface for interacting with the machine learning service of Stefani to arrive at the claimed invention. This substitution would have yielded predictable results and been reasonable under MPEP § 2143 as both references manage environments for hosting machine learning models. 

	As per claim 9, Feldman and Stefani teach the method of claim 8. Stefani teaches wherein the scaling policy specifies the one machine learning model (Col. 5, lines 40-49, “In some embodiments, API calls may be made by the auto-scaling monitor 108 and/or predictive auto-scaling analysis engine 112 (described later herein), and in some embodiments similar API calls may be made by clients 126A-126 via API 124 (e.g., to directly manage a size of a fleet of model instances 118). As an example, an API call with a name such as “Update Machine Learning Model Capacity” could be used, which could include arguments such as a unique identifier of the model and/or fleet, a number of machines to be added or removed, etc.”).

	As per claim 11, Feldman teaches the method of claim 5. Feldman teaches an interface of the machine learning service (¶ [0080], “FIG. 7A includes hardware 720 comprising a set of one or more processor(s) 722, a set of one or more network interfaces 724 (wireless and/or wired), and machine-readable media 726 having stored therein software 728 (which includes instructions executable by the set of one or more processor(s) 722). The machine-readable media 726 may include non-transitory and/or transitory machine-readable media. Each of the previously described clients and the routing manager may be implemented in one or more electronic devices 700. In one implementation: 1) each of the clients is implemented in a separate one of the electronic devices 700 (e.g., in end user devices where the software 728 represents the software to implement clients to interface directly and/or indirectly with the routing manager…”).

	Feldman fails to teach being able to specify resource requirements for the machine learning Servia via the interface. 
	However, Stefani teaches wherein the resource requirement is specified via an interface of the machine learning service (Col. 5 & 6, lines 62-67 & 1-27, “The user interface 402B may also provide functionality enabling the user to specify one or more metric conditions 450. As illustrated, two types of metric conditions 450 are utilized—ones that cause a “scaling up” of additional fleet resources, and ones that cause a “scaling down” of fleet resources…As shown in FIG. 4, two metric conditions are satisfied that indicate a user's desire for additional resources (e.g., model instances 118A-118N) to be added to the fleet. A first metric condition indicates that when latency per request is greater than two-hundred (200) milliseconds for two periods of time (e.g., where each period of time can be defined differently in different environments, where a period size may be defined by the time between metric collection), the fleet is to be scaled up. A second metric condition indicates that when a CPU utilization is ever detected as being greater than ninety-percent (90%), the fleet is to be scaled up.” See also Col. 5, lines 14-27.).
	Feldman and Stefani are considered to be analogous to the claimed invention because they are in the same field of resource allocation and load-balancing for machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the interface of the machine learning service of Feldman with the ability to specify resource requirements as taught in Stefani to arrive at the claimed invention. The motivation to modify Feldman with the teachings of Stefani is that allowing users to specify resource requirements allows the machine learning service to scale up or down without wasting resources or incurring unnecessary costs (See Stefani – Col. 2 & 3, lines 63-67 & 1-10).
	As per claim 16, Feldman teaches the non-transitory computer-readable media of claim 14. 

	Although Feldman teaches a generic scaling down of machine learning models, Feldman fails to explicitly teach a scaling event that decreases the plurality of computing resources associated with the machine learning service.
	Accordingly, Stefani teaches wherein placement event is detected in response to a scaling event to decrease the plurality of computing resources associated with the managed network endpoint (Col. 4 & 5, lines 50-67 & 1-13, “For example, in some embodiments the auto-scaling system 106 includes an auto-scaling monitor 108 that can trigger an auto-scaling—e.g., an addition and/or removal of model instances from a fleet 116 by an auto-scaling engine 114—based on monitoring (or obtaining) operational metric values 110 associated with operating conditions of the fleet. The auto-scaling monitor 108 may obtain these metric values by direct observation/querying of the fleet, interacting with a logging service, receiving report data from the fleet, etc. Exemplary metric values 110 that can be utilized as part of auto-scaling hosted machine learning models are shown in FIG. 3. In this figure, a variety of operational metric values 110 are shown that can be monitored and potentially be used to determine whether the current fleet 116 of model instances 118A-118N serving a model 120 is over- or under-provisioned and thus, whether to add or remove capacity from the fleet.” See also Col. 9, lines 21-33.).
	Feldman and Stefani are considered to be analogous to the claimed invention because they are in the same field of resource allocation and load-balancing for machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the machine learning service of Feldman with the ability to scale down as taught by Stefani to arrive at the claimed invention. The motivation to modify Feldman with the teachings of Stefani is that including the ability to scale down resources for the machine learning service avoids any scenarios in which resources are over-provisioned resulting in wasted resources and increased computing costs (See Stefani – Col. 2 & 3, lines 63-67 & 1-10).

	As per claim 17, it is a non-transitory computer-readable media claim comprising similar limitations to claim 8, so it is rejected for similar reasons. 

	As per claim 19, it is a non-transitory computer-readable media claim comprising similar limitations to claim 11, so it is rejected for similar reasons. 

Claim(s) 12 is rejected under 35 U.S.C. 103 as being unpatentable over Feldman as applied to claim 5 above, and further in view of Yang et al. (US Patent No. 12,159,167 B2 hereinafter Yang *cited in IDS*).
As per claim 12, Feldman teaches an interface of the machine learning service (¶ [0080], “FIG. 7A includes hardware 720 comprising a set of one or more processor(s) 722, a set of one or more network interfaces 724 (wireless and/or wired), and machine-readable media 726 having stored therein software 728 (which includes instructions executable by the set of one or more processor(s) 722). The machine-readable media 726 may include non-transitory and/or transitory machine-readable media. Each of the previously described clients and the routing manager may be implemented in one or more electronic devices 700. In one implementation: 1) each of the clients is implemented in a separate one of the electronic devices 700 (e.g., in end user devices where the software 728 represents the software to implement clients to interface directly and/or indirectly with the routing manager…”).

Feldman fails to explicitly teach creating the network endpoint for the machine learning service in response to a request received via the interface.
However, Yang teaches wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different machine learning models to the managed network endpoint, received via an interface of the machine learning service (Col. 11, lines 21-36, “Inference gateway 104 may include one or more interfaces to facilitate the exchange of inference requests from inference source 102, such as a user interface (e.g., a request from a client to invoke a model including new data to determine an inference, etc.) configured to facilitate a user to enter an inference request.” Col. 18, lines 8-23, “In some non-limiting embodiments or aspects, inference gateway 104, before routing an inference request to the selected computing platform, may invoke a deployment of a model for the selected computing platform (e.g., create an executable network from a network object associated with one or more single processor-bound models, etc.). Inference gateway 104 may communicate a deployment request to model operations & storage 124 and/or model deployment server 106, which activates a software function to deploy a model on model deployment server 106 (e.g., a processor-based inference engine, etc.).”).
Feldman and Yang are considered to be analogous to the claimed invention because they are in the same field of resource allocation and management of machine learning models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute to interface of Feldman with the interface for interacting with the machine learning service of Yang to arrive at the claimed invention. This substitution would have yielded predictable results and been reasonable under MPEP § 2143 as both references handle to routing of user requests to various hosted machine learning models. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHN ROBERT DAKITA EWALD whose telephone number is (703)756-1845. The examiner can normally be reached Monday-Friday: 9:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock can be reached at (571)272-3759. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/J.D.E./Examiner, Art Unit 2199                                                                                                                                                                                                        
/LEWIS A BULLOCK  JR/Supervisory Patent Examiner, Art Unit 2199
Read full office action
Prosecution Timeline

Nov 24, 2023
Application Filed
Feb 11, 2026
Non-Final Rejection mailed — §102, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/575,696
Patent 12619459
EXTENDING PARALLEL SOFTWARE THREADS
4y 3m to grant Granted May 05, 2026
18/096,237
Patent 12602267
DYNAMIC APPLICATION PROGRAMMING INTERFACE MODIFICATION TO ADDRESS HARDWARE DEPRECIATION
3y 3m to grant Granted Apr 14, 2026
17/943,437
Patent 12572377
TRANSMITTING INTERRUPTS FROM A VIRTUAL MACHINE (VM) TO A DESTINATION PROCESSING UNIT WITHOUT TRIGGERING A VM EXIT
3y 5m to grant Granted Mar 10, 2026
17/970,317
Patent 12547465
METHOD AND SYSTEM FOR VIRTUAL DESKTOP SERVICE MANAGER PLACEMENT BASED ON END-USER EXPERIENCE
3y 3m to grant Granted Feb 10, 2026
17/744,933
Patent 12536041
SYSTEM AND METHOD FOR DETERMINING MEMORY RESOURCE CONFIGURATION FOR NETWORK NODES TO OPERATE IN A DISTRIBUTED COMPUTING NETWORK
3y 8m to grant Granted Jan 27, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+50.0%)
3y 3m (~9m remaining)
Median Time to Grant
Low
PTA Risk
Based on 22 resolved cases by this examiner. Grant probability derived from career allowance rate.