Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This action is in response to the amendments filed 10 November 2025. Claims 1, 11, and 15 are amended. Claims 1-20 are pending and have been examined.
Response to Arguments
Applicant's arguments, see pages 15-17, filed 10 November 2025, with respect to the rejections of Claims 1-20 under 35 U.S.C. 101 have been fully considered and are persuasive. The rejections of Claims 1-20 under 35 U.S.C. 101 have been withdrawn.
APPLICANT'S ARGUMENT: Applicant argues (page 15, paragraph 5) that "the amended claims are directed to a specific technological improvement in distributed machine-learning systems, particularly those involving embedded wireless inference devices. ... This language makes clear that the claimed method is not merely evaluating data or executing an abstract mathematical concept but instead implements a concrete communication-control mechanism in a distributed computing architecture."
EXAMINER'S RESPONSE: The rejections of Claims 1-20 under 35 U.S.C. 101 have been withdrawn in light of arguments and/or amendments.
APPLICANT'S ARGUMENT: Applicant argues (page 16, paragraph 1) that "the amended claims recite a remote embedded inference device that performs local inference, monitors statistical drift conditions based on confidence score distributions, and transmits data for retraining only upon detecting drift. ... This selective-transmission control is not aspirational or field-of-use language; rather, it is expressed as a positive, required operational step that governs when communication occurs."
EXAMINER'S RESPONSE: The rejections of Claims 1-20 under 35 U.S.C. 101 have been withdrawn in light of arguments and/or amendments.
APPLICANT'S ARGUMENT: Applicant argues (page 16, paragraph 2) that "the present claims use a defined drift-detection process based on confidence score distributions to trigger communication behavior. ... The Federal Circuit has confirmed that systems that monitor distributed devices and manage communications based on detected conditions provide a patent-eligible technological improvement."
EXAMINER'S RESPONSE: The rejections of Claims 1-20 under 35 U.S.C. 101 have been withdrawn in light of arguments and/or amendments.
APPLICANT'S ARGUMENT: Applicant argues (page 17, paragraph 1) that "Here, drift detection directly controls whether the inference node communicates over a wireless link, integrating the purported abstract idea into a real-world device-level operation that improves computational efficiency and conserves limited wireless resources. This selective transmission is not merely post-solution activity or data gathering. Rather, it is the operative control logic that governs when and whether communication over a constrained wireless channel occurs, addressing a concrete technical problem in edge-based and resource-constrained distributed computing systems."
EXAMINER'S RESPONSE: The rejections of Claims 1-20 under 35 U.S.C. 101 have been withdrawn in light of arguments and/or amendments.
Applicant's arguments, see pages 17-20, filed 10 November 2025, with respect to the rejection of Claims 1-20 under 35 U.S.C. 103 have been fully considered but they are not persuasive.
APPLICANT'S ARGUMENT: Applicant argues (page 20, paragraph 3) that "Calmon, Harvill, Badawy, Vasseur, Johnson, Wang, Vijayaraghavan and Stripelis, either individually or in combination, fail to disclose or suggest at least the aforementioned features recited in amended independent claims 1, 11 and 15. ¶ Calmon discloses a method for training and updating machine learning models at a central node using data from edge nodes. ... The method of Calmon, however, does not include the combination of features now recited in the amended independent claims."
EXAMINER'S RESPONSE: Examiner notes that Applicant's arguments pertain to newly recited limitations.
Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. Applicant's arguments do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections.
The newly recited features of amended independent Claims 1, 11, and 15 are found to be obvious in view of further teachings of both Calmon and Harvill. In response to applicant's arguments against the references individually, Examiner notes that one cannot show non-obviousness by attacking references individually where the rejections are based on combinations of references.
Dependent claims are rejected as indicated in the 35 U.S.C. 103 rejections below.
Claim Rejections - 35 USC § 101
The rejections of Claims 1-20 under 35 U.S.C. 101 for abstract idea are withdrawn in light of arguments and/or amendments.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 5, 9-11, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill").
Regarding Claim 1, Calmon teaches:
a method (Calmon, [0016]: "embodiments described herein relate to methods ...for training and updating models (e.g., machine learning (ML) models) at a central node (e.g., in the cloud) using data and other information from edge nodes" and [0017]: "in one or more embodiments, a need for efficient management and deployment of such ML models arises. In one or more embodiments, efficient management implies, beyond model training and deployment, keeping the ML model coherent with the statistic distribution of input data of all edge nodes") comprising:
training, by a training node, a machine learning model based on a training data set (Calmon, [0041]: "In Step 200, a model coordinator trains and validates an ML model using a historical data set. ... As an example, the historical data set may be a large amount of telemetry data from different storage arrays that can be used to predict the performance of some aspect of the storage arrays," where Calmon's model coordinator and historical data set correspond to the instant training node and training data set), a deployment of the machine learning model in a system being managed, the system including the training node and an inference node (Calmon, [0017]: "in one or more embodiments, a need for efficient management and deployment of such ML models arises" and [0020]: "One or more embodiments described herein provide an asynchronous ML model management framework that encompasses ML model training at a central node and ML model execution at any number of edge nodes"), the inference node being located remotely from the training node (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include a datacenter network, a wide area network, a local area network, a wireless network, a cellular phone network .... A network may be located at a single physical location, or be distributed at any number of physical sites," where Calmon's edge node and model coordinator correspond to the instant inference node and training node, respectively), the training data set being obtained from the inference node (Calmon, [0049]: "the model coordinator begins receiving batch data from the edge nodes... [B]atch data is sets of data used as inputs for a ML model being executed at the edge nodes. ... [T]he model coordinator may obtain the stored batch data from the shared communication layer"), the inference node generating raw data (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data (e.g., telemetry data, feature data, image data, etc.) that is related in any way to the operation of the edge device. As an example, a storage array edge device may include functionality to obtain feature data related to data storage," where Calmon's edge node corresponds to the instant inference node) including generated measurements or observations (Calmon, [0034]: "a storage array edge device may include functionality to obtain feature data related to data storage, such as read response time, write response time, number and/or type of disks (e.g., solid state, spinning disks, etc.), model number(s), number of storage engines, cache read/writes and/or hits/misses, size of reads/writes in megabytes, etc."), the training data being a subset of the raw data (Calmon, [0024]: "if an edge node determines that the ML model the edge node is using is marked as drifted, the edge node begins a batch collection mode. ... [W]hen in batch collection mode, triggered by a model being marked as outdated, an edge node begins transmitting batch data to the central node," where data from Calmon's batch collection mode is a subset of raw edge data);
generating, by the training node, a first set of confidence scores including confidence scores associated with an output of the machine learning model (Calmon, [0042]: "as part of training the ML model, the model coordinator also calculates a confidence value for the trained ML model.... the first stage of calculating a confidence value relates to the collection of confidence levels in the results (e.g., inferences) over the training data set") when a first validation data set is inputted to the machine learning model (Calmon, [0021]: "during training and validation of the ML model, the central node determines a confidence value for the model that is a measure of the confidence that the results of the ML model are correct" and [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation"), the first validation data set being obtained from the inference node (Calmon, [0041]: "In Step 200, a model coordinator trains and validates an ML model using a historical data set. ... As an example, the historical data set may be a large amount of telemetry data"), the first validation data set being not used for training (Calmon, [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation," where Calmon's separated from the training portion corresponds to the instant not used for training) and being used to determine stability of the machine learning model (Calmon, [0044]: "In one or more embodiments, the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model. In one or more embodiments, the confidence threshold represents an aggregate confidence of the model on the results (e.g., inferences) produced based on the training dataset that, if confidence values at the edge nodes fall below, indicates that the ML model has drifted at such edge nodes," where Calmon's aggregate confidence with respect to model drift corresponds to the instant stability), the confidence scores being used to detect an occurrence of drift at the inference node (Calmon, [0022]: "the edge nodes access the shared communication layer to obtain the trained ML model based on the fresh indication associated with the ML model, and also obtain the confidence threshold for the ML model. ... [A]s an edge node uses the ML model, the edge node performs an analysis to derive a confidence value for the model for results produced by the ML model based on the data of the edge node. In one or more embodiments, the edge node compares the confidence value to the confidence threshold obtained from the central node. In one or more embodiments, if the confidence value for the ML model at an edge node falls below the confidence threshold, then drift has occurred for the ML model"), the drift referring to a situation where features of the first validation data being used by the machine learning model change over time (Calmon, [0019]: "ML model performance at the edge may be monitored by determining whether drift has occurred. In one or more embodiments, drift of an ML model is when the results (e.g., predictions, classifications, etc.) become increasingly less accurate, unstable, erroneous, etc.," where Calmon's increasingly corresponds to the instant over time);
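For illustration only, the threshold derivation quoted from Calmon [0044] (an aggregate confidence value μ over the training data set used to derive a confidence threshold t) can be sketched as follows. The cited passages do not give Calmon's actual formula; the mean-minus-one-standard-deviation rule and the sample confidence values below are assumptions made purely to clarify the mapping.

```python
import statistics

def derive_confidence_threshold(training_confidences, k=1.0):
    """Derive a confidence threshold t from per-inference confidence
    levels collected over the training data set (Calmon's mu).

    The exact derivation is not specified in the cited passages; a
    mean-minus-k-standard-deviations rule is assumed here purely for
    illustration.
    """
    mu = statistics.mean(training_confidences)       # aggregate confidence value (mu)
    sigma = statistics.pstdev(training_confidences)  # spread of the confidence levels
    return mu - k * sigma                            # threshold t; edge values below t indicate drift

# Hypothetical validation-stage confidence levels at the training node:
t = derive_confidence_threshold([0.91, 0.88, 0.95, 0.90, 0.86])
```

Under this assumed rule, edge nodes would later compare their locally computed confidence value against the transmitted t.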
transmitting, by the training node to the inference node, the first set of confidence scores and a representation of the machine learning model; receiving, by the inference node, the first set of confidence scores and the representation of the machine learning model; (Calmon, [0045]: "the model coordinator provides the trained ML model and the confidence threshold to a shared communication layer" where [0044]: "the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model" and [0056]: "In Step 240, an edge node obtains a trained ML model from a shared communication layer, along with an associated confidence threshold");
generating, by the inference node: inferences (Calmon, [0057]: "the edge node executes the trained ML model using data available to the edge node.... In one or more embodiments, the trained ML model may be any trained ML model used for any purpose ( e.g., inference)") by inputting a test data set obtained by the inference node into the machine learning model (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)," where Calmon's perform drift detection using edge data corresponds to the instant inputting test data set), the test data set being used to determine quality of the machine learning model at the inference node (Calmon, [0069]: "the edge nodes each obtain the trained ML model and the confidence threshold from the shared communication layer, and begin executing the trained ML model. While executing the trained ML model, the edge nodes perform drift detection by calculating a confidence value for the trained ML model, and comparing the confidence value with the confidence threshold"), the test data set including the raw data (Calmon, [0057]: "the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)"); and
a second set of confidence scores including confidence scores associated with the inferences (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, an edge node calculates a confidence value for the trained ML model using techniques similar to that described with respect to the model coordinator in Step 200, above," where Calmon's edge confidence value corresponds to the instant second set of confidence scores);
optimizing communication overhead and limiting data transfer between the training node and the inference node (Calmon, [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application") by determining, by the inference node, whether the first set of confidence scores and the second set of confidence scores are similar (Calmon, [0019]: "In one or more embodiments, ML model performance at the edge may be monitored by determining whether drift has occurred. ... Therefore, it may be desirable to detect drift at the edge nodes, and have the detection of drift communicated to the central node" and [0057]: "In one or more embodiments, the edge node performs drift detection by comparing a confidence value calculated for the trained ML model with the confidence threshold calculated by the model coordinator and obtained from the shared communication layer");
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are similar (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "NO" path) being indicative that the deployed machine learning model on the inference node is effective for a current data set (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection" when preceded by "NO" path from step 244) without need to transfer new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 244 followed by "NO" path, where "YES" path is followed by step 250 "Provide batch data to model coordinator"), and
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are not similar being indicative that the drift has occurred at the inference node (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "YES" path) and the deployed machine learning model on the inference node is ineffective (Calmon, Fig. 2B, step 248, "Model marked as drifted?" followed by "YES" path) with the need to transfer the new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 250, "Provide batch data to model coordinator"); and
in response to determining that the first set of confidence scores and the second set of confidence scores are not similar (Calmon, [0059]: "In Step 246, based on the determination in Step 244 that drift has occurred for the ML model, the edge node sends a drift signal to the model coordinator"):
transmitting, by the inference node to the training node, at least part of the test data set for training an updated machine learning model (Calmon, [0061]: "In Step 250, based on a determination in Step 248 that the trained ML model being executed is associated with a drifted indication, the edge node begins sending batch data to the central node," where [0049]: "In one or more embodiments, batch data is sets of data used as inputs for a ML model being executed at the edge nodes"), the inference node transmitting the at least part of the test data set only upon detecting drift (Calmon, Fig. 2B, step 244, "Drift detected?" and step 250, "Provide batch data to model coordinator") based on the first set of confidence scores and the second set of confidence scores (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240," where Calmon's confidence threshold and edge confidence value correspond to the instant first and second set of confidence scores, respectively), and refraining from transmitting the test data set in the absence of drift (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "NO" branch to repeat step 242, "Execute trained ML model and perform drift detection") so as to reduce wireless communication overhead between the inference node and the training node (Calmon, [0019]: "relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application. ... [D]etecting drift of ML models being executed on edge devices by a central node that trains and distributes the model would incur significant overhead due to the data at the edge nodes being sent (e.g., via a network) to the central node. Therefore, it may be desirable to detect drift at the edge nodes" where [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include ... a wireless network");
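For illustration only, the drift-gated transmission behavior mapped above (Calmon, Fig. 2B, steps 242-250: execute the model locally, compare the locally computed confidence value against the coordinator's threshold, and provide batch data only when drift is detected) can be sketched as follows. The function names, model interface, and values are hypothetical; Calmon discloses no source code.

```python
def edge_node_step(model, threshold, local_batch, transmit):
    """One iteration of the edge-node loop described in Calmon Fig. 2B:
    run on-device inference (step 242), detect drift by comparing the
    edge confidence value to the threshold (step 244), and transmit
    batch data only on drift (step 250), otherwise refrain."""
    results = [model(x) for x in local_batch]               # on-device inference
    confidences = [conf for (_pred, conf) in results]
    confidence_value = sum(confidences) / len(confidences)  # edge confidence value
    if confidence_value < threshold:                        # drift detected (step 244)
        transmit(local_batch)                               # provide batch data (step 250)
        return True                                         # drift signal (step 246)
    return False                                            # no drift: refrain from transmitting

# Hypothetical stand-in model returning (prediction, confidence):
sent = []
drifted = edge_node_step(lambda x: ("label", 0.40), threshold=0.87,
                         local_batch=[1, 2, 3], transmit=sent.extend)
stable = edge_node_step(lambda x: ("label", 0.95), threshold=0.87,
                        local_batch=[4, 5, 6], transmit=sent.extend)
```

In the drifted call the batch is handed to `transmit`; in the stable call nothing is sent, which is the selective-transmission behavior at issue.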
receiving, by the training node, the at least part of the test data set (Calmon, Fig. 2A, block 206, "Drift detected?" followed by block 210, "Receive batch data from edge node(s)," where [0049]: "In one or more embodiments, batch data is sets of data used as inputs for a ML model being executed at the edge nodes"); and
in response to receiving the at least part of the test data set, adding, by the training node, the at least part of the test data set to the training data set (Calmon, [0051]: "In Step 214, the model coordinator trains a new ML model (or re-trains the ML model) using an updated historical data set. In one or more embodiments, regardless of the method for the batch collection, when the model coordinator assesses that a representative set of batch data has been collected, the model coordinator proceeds to produce a new training data set, which may be referred to as an updated historical data set," where Calmon's updated historical data set corresponds to the instant training data set) and generating, by the training node, a request to train a new machine learning model (Calmon, Fig. 2A, step 214, "Train ML model using updated historical data set to obtain updated trained ML model," where [0035]: "In one or more embodiments, the model coordinator (100) is a computing device (described above)" and [0029]: "Examples of computing devices include, but are not limited to, a server (e.g., a blade-server in a blade-server chassis, a rack server in a rack, etc.) .... In one or more embodiments, any or all of the aforementioned examples may be combined to create a system of such devices, which may collectively be referred to as a computing device," where Calmon's model coordinator embodiment as a system of server devices reasonably suggests a request to train a new model),
wherein the inference node includes an ... device configured to receive the representation of the machine learning model (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more physical interfaces (e.g., network ports, storage ports)") via a wireless communication interface (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). A network may refer to an entire network or any portion thereof (e.g., a logical portion of the devices within a topology of devices). A network may include ... a wireless network ... or any other suitable network that facilitates the exchange of information from one part of the network to another" and [0015]: "'operatively connected' may refer to any direct (e.g., wired directly between two devices or components) or indirect (e.g., wired and/or wireless connections between any number of devices or components connecting the operatively connected devices)") and to perform on-device inference using the received machine learning model (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection") ... and wherein the ... device includes sensing components (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data (e.g., ... image data, etc.) that is related in any way to the operation of the edge device," where Calmon's image data generation reasonably suggests a camera sensor) and processing resources (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more processors (e.g., components that include integrated circuitry) (not shown), memory (e.g., random access memory (RAM)) (not shown), input and output device(s) (not shown), non-volatile storage hardware ..., one or more physical interfaces") configured to execute the machine learning model on data generated by the ... device (Calmon, [0022]: "the edge nodes then begin executing the ML model based on data generated by, obtained by, or otherwise available to the respective edge nodes"), thereby enabling local inference without transmitting all raw sensor data to the training node (Calmon, [0018]: "In one or more embodiments, to realize such a ML edge-to-cloud management system, it is important to note that, while model training could be performed on both edge nodes and central nodes, ML model execution will often be performed at the edge (e.g., due to latency constraints of time-sensitive applications)" and [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application").
Calmon teaches a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model.
Calmon does not explicitly teach an embedded device configured to receive the representation of the machine learning model ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
However, Harvill teaches:
an embedded device configured to receive the representation of the machine learning model (Harvill, [0099]: "At least one hardware processor 504 is coupled to I/O subsystem 502 for processing information and instructions. Hardware processor 504 may include ... a special-purpose microprocessor such as an embedded system") ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. [T]he model is pruned of the training part ... the values and variables are collected into a single file ... and performance can be further improved through quantization.... [T]he trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format" and [0094]: "The deliver 229 step makes the trained model accessible for use. ... [T]he model can be downloaded for use in an application on device") and wherein the embedded device includes sensing components (Harvill, [0104]: "At least one input device 514 is coupled to I/O subsystem 502 for communicating signals, data, command selections or gestures to processor 504. Examples of input devices 514 include ... various types of sensors such as force sensors, motion sensors, heat sensors, accelerometers, gyroscopes, and inertial measurement unit (IMU) sensors") and processing resources configured to execute the machine learning model on data generated by the embedded device (Harvill, [0094]: "The deliver 229 step makes the trained model accessible for use. In one embodiment the model is deployed in a cloud service that receives an image for inference and results are sent back to the client. In another embodiment, the model can be downloaded for use in an application on device" and [0100]: "Computer system 500 includes one or more units of memory 506, such as a main memory, which is coupled to I/O subsystem 502 for electronically digitally storing data and instructions to be executed by processor 504. Memory 506 may include volatile memory such as various forms of random-access memory (RAM) or other dynamic storage device").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Calmon regarding a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model with those of Harvill regarding an embedded device configured to receive the representation of the machine learning model, wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
The motivation to do so would be to facilitate use of the model optimized for the target device (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. In one embodiment the model is pruned of the training part to reduce the model size. To optimize the model, all the values and variables are collected into a single file, the graph is frozen through constant folding and collapsing the nodes, and performance can be further improved through quantization of to lower precision values such as 16-bit float or 8-bit integers. In another embodiment the trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format").
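For illustration only, the quantization optimization Harvill [0093] mentions (lowering model values to 8-bit integers as part of making the trained model ready for a target device) can be sketched generically as follows. This is the general symmetric-quantization technique, not the CoreML or TFLite implementation, and the weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization of the kind Harvill [0093] lists as
    a productionize-step optimization: map the largest weight magnitude
    to 127 and store each weight as a small integer plus a shared scale."""
    scale = max(abs(w) for w in weights) / 127.0  # one scale factor per tensor
    q = [round(w / scale) for w in weights]       # 8-bit integer representation
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for execution on device."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.27, 0.02])  # hypothetical weights
approx = dequantize(q, scale)
```

The integer representation roughly quarters the storage of 32-bit floats, which is the kind of size reduction that motivates converting models for embedded targets.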
Regarding Claim 5, the rejection of Claim 1 is incorporated. The Calmon/Harvill combination teaches:
determining, by the training node, whether the machine learning model has destabilised and re-stabilised after training the machine learning model (Calmon, [0016]: "one or more embodiments related to training an ML model at a central node, distributing the model to edge nodes, receiving indications from the edge nodes that model drift has occurred, obtaining new data from the edge nodes based on drift being detected, retraining the ML model, and re-distributing an updated model to the edge nodes," where [0019]: "drift of an ML model is when the results (e.g., predictions, classifications, etc.) become increasingly less accurate, unstable, erroneous, etc."); and
in response to determining that the machine learning model has destabilised and re-stabilised: transmitting, by the training node to the inference node, the first set of confidence scores and the representation of the machine learning model (Calmon, [0045]: "the model coordinator provides the trained ML model and the confidence threshold to a shared communication layer" where [0044]: "the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model" and [0056]: "In Step 240, an edge node obtains a trained ML model from a shared communication layer, along with an associated confidence threshold").
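For illustration, the confidence mechanism quoted above from Calmon [0042]-[0045] (aggregating per-inference confidence levels over the validation data into a confidence value μ, deriving a threshold t from it, and providing both with the trained model) can be sketched as follows. This is a minimal sketch: the function name and the μ minus k·σ derivation rule are illustrative assumptions, since Calmon does not specify the exact formula relating μ and t.

```python
import statistics

def derive_confidence_threshold(validation_confidences, k=2.0):
    """Aggregate per-inference confidence levels (e.g., max softmax
    scores collected over the validation set) into a confidence value
    mu, then derive a drift threshold t below which edge-side
    confidence indicates drift. The mu - k*sigma rule is an assumed
    instantiation, not Calmon's stated formula."""
    mu = statistics.mean(validation_confidences)
    sigma = statistics.pstdev(validation_confidences)
    t = mu - k * sigma
    return mu, t

# Hypothetical validation confidences; mu and t would be published to
# the shared communication layer alongside the trained model.
mu, t = derive_confidence_threshold([0.91, 0.88, 0.95, 0.90, 0.86])
```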
Regarding Claim 9, the rejection of Claim 1 is incorporated. The Calmon/Harvill combination teaches:
training, by the training node, the machine learning model based on a training data set (Calmon, [0025]: "after marking a ML model with a drifted indication in the shared communication layer, the central node begins receiving the aforementioned batch data from the various edge nodes as they determine that the model is marked as drifted. ... the central node retrains and validates the ML model using the new training data") including the at least part of the test data set in response to receiving the at least part of the test data set ([0049]: "batch data is sets of data used as inputs for a ML model being executed at the edge nodes" and [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)").
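The central-node behavior cited for Claim 9 (Calmon [0025], [0049]: folding edge batch data into the historical data set and retraining once drift is marked) can be sketched as below; `train_fn` is a hypothetical training routine supplied by the caller, not an API from Calmon.

```python
def retrain_on_drift(train_fn, historical_data, edge_batches):
    """Sketch of Calmon [0025]/[0049]: once edge nodes mark the model
    as drifted and send batch data, the central node folds that data
    into its historical data set and retrains on the updated set."""
    updated_historical = historical_data + edge_batches
    updated_model = train_fn(updated_historical)
    return updated_model, updated_historical
```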
Regarding Claim 10, the rejection of Claim 1 is incorporated. The Calmon/Harvill combination teaches:
adding, by the training node, a subset of the at least part of the test data set to a second validation data set (Calmon, Fig. 2, step 214, "Train ML model using updated historical data set to obtain updated trained ML model," where Calmon's updated historical data set corresponds to the instant second validation data, given that Calmon's training data is split between training and validation sets, as in [0025]: "the central node retrains and validates the ML model using the new training data, and calculates a new confidence value and corresponding confidence threshold" where [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation").
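The validation-set separation cited from Calmon [0041] (some of the historical data set "separated from the training portion to be used in validation") can be sketched as follows; the shuffle and the 80/20 ratio are illustrative assumptions.

```python
import random

def split_historical(data, validation_fraction=0.2, seed=0):
    """Per Calmon [0041], separate part of the historical data set
    from the training portion for use in validation. The shuffle and
    the default 80/20 split are assumptions for illustration."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    # Return (training portion, validation portion).
    return shuffled[n_val:], shuffled[:n_val]
```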
Regarding Claim 11, Calmon teaches:
a method of operating an inference node (Calmon, [0016]: "embodiments described herein relate to methods ...for training and updating models (e.g., machine learning (ML) models) at a central node (e.g., in the cloud) using data and other information from edge nodes" and [0057]: "the edge node executes the trained ML model using data available to the edge node"), the method comprising:
receiving a first set of confidence scores generated in a training node and a representation of a machine learning model (Calmon, [0045]: "the model coordinator provides the trained ML model and the confidence threshold to a shared communication layer" and [0044]: "the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model" and [0056]: "In Step 240, an edge node obtains a trained ML model from a shared communication layer, along with an associated confidence threshold," where Calmon's model coordinator corresponds to the instant training node), a deployment of the machine learning model in a distributed system being managed, the system including the training node and the inference node (Calmon, [0017]: "in one or more embodiments, a need for efficient management and deployment of such ML models arises" and [0020]: "One or more embodiments described herein provide an asynchronous ML model management framework that encompasses ML model training at a central node and ML model execution at any number of edge nodes," where Calmon's framework operates over a distributed system, as in: [0016]: "embodiments described herein relate to methods, systems, and non-transitory computer readable mediums storing instructions for training and updating models (e.g., machine learning (ML) models) at a central node (e.g., in the cloud) using data and other information from edge nodes") the inference node being located remotely from the training node (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include a datacenter network, a wide area network, a local area network, a wireless network, a cellular phone network .... 
A network may be located at a single physical location, or be distributed at any number of physical sites," where Calmon's edge node and model coordinator correspond to the instant inference node and training node, respectively), the first set of confidence scores including: confidence scores associated with an output of the machine learning model (Calmon, [0042]: "as part of training the ML model, the model coordinator also calculates a confidence value for the trained ML model.... the first stage of calculating a confidence value relates to the collection of confidence levels in the results (e.g., inferences) over the training data set") when a first validation data set is inputted to the machine learning model (Calmon, [0021]: "during training and validation of the ML model, the central node determines a confidence value for the model that is a measure of the confidence that the results of the ML model are correct" and [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation"), the first validation data set being obtained from the inference node (Calmon, [0041]: "In Step 200, a model coordinator trains and validates an ML model using a historical data set. ... 
As an example, the historical data set may be a large amount of telemetry data"), the first validation data set being not used for training (Calmon, [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation," where Calmon's separated from the training portion corresponds to the instant not used for training) and being used to determine stability of the machine learning model (Calmon, [0044]: "In one or more embodiments, the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model. In one or more embodiments, the confidence threshold represents an aggregate confidence of the model on the results (e.g., inferences) produced based on the training dataset that, if confidence values at the edge nodes fall below, indicates that the ML model has drifted at such edge nodes," where Calmon's aggregate confidence with respect to model drift corresponds to the instant stability), the confidence scores being used to detect an occurrence of drift at the inference node (Calmon, [0022]: "the edge nodes access the shared communication layer to obtain the trained ML model based on the fresh indication associated with the ML model, and also obtain the confidence threshold for the ML model. ... [A]s an edge node uses the ML model, the edge node performs an analysis to derive a confidence value for the model for results produced by the ML model based on the data of the edge node. In one or more embodiments, the edge node compares the confidence value to the confidence threshold obtained from the central node. In one or more embodiments, if the confidence value for the ML model at an edge node falls below the confidence threshold, then drift has occurred for the ML model"), the drift referring to a situation where features of the first validation data being used by the machine learning model change over time (Calmon, [0019]: "ML model performance at the edge may be monitored by determining whether drift has occurred. In one or more embodiments, drift of an ML model is when the results (e.g., predictions, classifications, etc.) become increasingly less accurate, unstable, erroneous, etc.," where Calmon's increasingly corresponds to the instant over time), the inference node generating raw data (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data ( e.g., telemetry data, feature data, image data, etc.) 
that is related in any way to the operation of the edge device. As an example, a storage array edge device may include functionality to obtain feature data related to data storage," where Calmon's edge node corresponds to the instant inference node) including generated measurements or observations (Calmon, [0034]: "a storage array edge device may include functionality to obtain feature data related to data storage, such as read response time, write response time, number and/or type of disks ( e.g., solid state, spinning disks, etc.), model number(s), number of storage engines, cache read/writes and/or hits/misses, size of reads/ writes in megabytes, etc.");
generating: inferences (Calmon, [0057]: "the edge node executes the trained ML model using data available to the edge node.... In one or more embodiments, the trained ML model may be any trained ML model used for any purpose ( e.g., inference)") by inputting a test data set obtained by the inference node into the machine learning model (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)," where Calmon's perform drift detection using edge data corresponds to the instant inputting test data set), the test data set being used to determine quality of the machine learning model at the inference node (Calmon, [0069]: "the edge nodes each obtain the trained ML model and the confidence threshold from the shared communication layer, and begin executing the trained ML model. While executing the trained ML model, the edge nodes perform drift detection by calculating a confidence value for the trained ML model, and comparing the confidence value with the confidence threshold"), the test data set including the raw data (Calmon, [0057]: "the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)"); and
a second set of confidence scores including confidence scores associated with the inferences (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, an edge node calculates a confidence value for the trained ML model using techniques similar to that described with respect to the model coordinator in Step 200, above," where Calmon's edge confidence value corresponds to the instant second set of confidence scores);
optimizing communication overhead and limiting data transfer between the training node and the inference node (Calmon, [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application") by determining whether the first set of confidence scores and the second set of confidence scores are similar (Calmon, [0019]: "In one or more embodiments, ML model performance at the edge may be monitored by determining whether drift has occurred. ... Therefore, it may be desirable to detect drift at the edge nodes, and have the detection of drift communicated to the central node" and [0057]: "In one or more embodiments, the edge node performs drift detection by comparing a confidence value calculated for the trained ML model with the confidence threshold calculated by the model coordinator and obtained from the shared communication layer");
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are similar (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "NO" path) being indicative that the deployed machine learning model on the inference node is effective for a current data set (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection" when preceded by "NO" path from step 244) without need to transfer new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 244 followed by "NO" path, where "YES" path is followed by step 250 "Provide batch data to model coordinator"), and
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are not similar being indicative that the drift has occurred at the inference node (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "YES" path) and the deployed machine learning model on the inference node is ineffective (Calmon, Fig. 2B, step 248, "Model marked as drifted?" followed by "YES" path) with the need to transfer the new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 250, "Provide batch data to model coordinator"); and
in response to determining that the first set of confidence scores and the second set of confidence scores are not similar (Calmon, [0059]: "In Step 246, based on the determination in Step 244 that drift has occurred for the ML model, the edge node sends a drift signal to the model coordinator"):
transmitting, by the inference node to the training node, at least part of the test data set for training an updated machine learning model (Calmon, [0061]: "In Step 250, based on a determination in Step 248 that the trained ML model being executed is associated with a drifted indication, the edge node begins sending batch data to the central node," where [0049]: "In one or more embodiments, batch data is sets of data used as inputs for a ML model being executed at the edge nodes"), the inference node transmitting the at least part of the test data set only upon detecting drift (Calmon, Fig. 2B, step 244, "Drift detected?" and step 250, "Provide batch data to model coordinator") based on the first set of confidence scores and the second set of confidence scores (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240," where Calmon's confidence threshold and edge confidence value correspond to the instant first and second set of confidence scores, respectively), and refraining from transmitting the test data set in the absence of drift (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "NO" branch to repeat step 242, "Execute trained ML model and perform drift detection") so as to reduce wireless communication overhead between the inference node and the training node (Calmon, [0019]: "relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application. ... [D]etecting drift of ML models being executed on edge devices by a central node that trains and distributes the model would incur significant overhead due to the data at the edge nodes being sent (e.g., via a network) to the central node. 
Therefore, it may be desirable to detect drift at the edge nodes" where [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include ... a wireless network");
wherein the inference node includes an ... device configured to receive the representation of the machine learning model (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more physical interfaces (e.g., network ports, storage ports)") via a wireless communication interface (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). A network may refer to an entire network or any portion thereof (e.g., a logical portion of the devices within a topology of devices). A network may include ... a wireless network ... or any other suitable network that facilitates the exchange of information from one part of the network to another" and [0015]: "'operatively connected' may refer to any direct ( e.g., wired directly between two devices or components) or indirect (e.g., wired and/or wireless connections between any number of devices or components connecting the operatively connected devices)") and to perform on-device inference using the received machine learning model (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection") ... and wherein the ... device includes sensing components (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data ( e.g., ... image data, etc.) that is related in any way to the operation of the edge device," where Calmon's image data generation reasonably suggests a camera sensor) and processing resources (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more processors (e.g. 
components that include integrated circuitry) (not shown), memory (e.g., random access memory (RAM)) (not shown), input and output device(s) (not shown), non-volatile storage hardware ..., one or more physical interfaces") configured to execute the machine learning model on data generated by the ... device (Calmon, [0022]: "the edge nodes then begin executing the ML model based on data generated by, obtained by, or otherwise available to the respective edge nodes"), thereby enabling local inference without transmitting all raw sensor data to the training node (Calmon, [0018]: "In one or more embodiments, to realize such a ML edge-to-cloud management system, it is important to note that, while model training could be performed on both edge nodes and central nodes, ML model execution will often be performed at the edge (e.g., due to latency constraints of time-sensitive applications)" and [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application").
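The edge-side loop cited throughout the Claim 11 mapping (Calmon, Fig. 2B, steps 242-250: execute the model locally, derive an edge confidence value, compare it to the threshold received from the training node, and transmit batch data only when drift is detected) can be sketched as follows. The function and parameter names are illustrative assumptions; `model_fn` is a hypothetical callable returning an (inference, confidence) pair per sample.

```python
def edge_drift_step(model_fn, local_data, confidence_threshold):
    """Sketch of one pass of Calmon Fig. 2B, steps 242-250: run the
    deployed model on locally generated data, compute an edge-side
    confidence value, and decide whether to transmit batch data."""
    results = [model_fn(sample) for sample in local_data]
    confidences = [conf for _, conf in results]
    edge_confidence = sum(confidences) / len(confidences)
    # Drift is detected when edge confidence falls below the threshold
    # derived by the training node (Calmon [0058]).
    drift_detected = edge_confidence < confidence_threshold
    # Selective transmission: data leaves the device only upon drift,
    # limiting wireless communication overhead (Calmon [0019]).
    batch_to_send = list(local_data) if drift_detected else []
    return drift_detected, batch_to_send
```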
Calmon teaches a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model.
Calmon does not explicitly teach an embedded device configured to receive the representation of the machine learning model ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
However, Harvill teaches:
an embedded device configured to receive the representation of the machine learning model (Harvill, [0099]: "At least one hardware processor 504 is coupled to I/O subsystem 502 for processing information and instructions. Hardware processor 504 may include ... a special-purpose microprocessor such as an embedded system") ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. [T]he model is pruned of the training part ... the values and variables are collected into a single file ... and performance can be further improved through quantization.... [T]he trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format" and [0094]: "The deliver 229 step makes the trained model accessible for use. ... [T]he model can be downloaded for use in an application on device") and wherein the embedded device includes sensing components (Harvill, [0104]: "At least one input device 514 is coupled to I/O subsystem 502 for communicating signals, data, command selections or gestures to processor 504. Examples of input devices 514 include ... various types of sensors such as force sensors, motion sensors, heat sensors, accelerometers, gyroscopes, and inertial measurement unit (IMU) sensors") and processing resources configured to execute the machine learning model on data generated by the embedded device (Harvill, [0094]: "The deliver 229 step makes the trained model accessible for use. In one embodiment the model is deployed in a cloud service that receives and image for inference and results are sent back to the client. 
In another embodiment, the model can be downloaded for use in an application on device" and [0100]: "Computer system 500 includes one or more units of memory 506, such as a main memory, which is coupled to I/O subsystem 502 for electronically digitally storing data and instructions to be executed by processor 504. Memory 506 may include volatile memory such as various forms of random-access memory (RAM) or other dynamic storage device").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Calmon regarding a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model with those of Harvill regarding an embedded device configured to receive the representation of the machine learning model, wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
The motivation to do so would be to facilitate use of the model optimized for the target device (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. In one embodiment the model is pruned of the training part to reduce the model size. To optimize the model, all the values and variables are collected into a single file, the graph is frozen through constant folding and collapsing the nodes, and performance can be further improved through quantization of to lower precision values such as 16-bit float or 8-bit integers. In another embodiment the trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format").
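The quantization Harvill names in the productionize step ([0093], lowering precision to 8-bit integers before conversion to an embedded format such as TFLite) can be illustrated with a from-scratch affine-quantization sketch; in practice a model converter performs this internally, and the function names below are illustrative.

```python
def quantize_uint8(weights):
    """Affine quantization of float weights to 8-bit integers, one of
    the optimizations named in Harvill's productionize step ([0093]).
    A from-scratch sketch; a real converter (e.g., to the TFLite
    format Harvill mentions) does this during model conversion."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255.0 or 1.0  # avoid a zero scale
    zero_point = round(-w_min / scale)
    quantized = [max(0, min(255, round(w / scale) + zero_point))
                 for w in weights]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from the 8-bit values."""
    return [(q - zero_point) * scale for q in quantized]
```

Each recovered weight differs from the original by at most one quantization step (the scale), which is the size reduction versus accuracy trade-off the motivation statement refers to.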
Regarding Claim 15, Calmon teaches:
a method of operating a training node (Calmon, [0016]: "embodiments described herein relate to methods ...for training and updating models (e.g., machine learning (ML) models) at a central node (e.g., in the cloud) using data and other information from edge nodes"), the method comprising:
training a machine learning model based on a training data set (Calmon, [0041]: "In Step 200, a model coordinator trains and validates an ML model using a historical data set. ... As an example, the historical data set may be a large amount of telemetry data from different storage arrays that can be used to predict the performance of some aspect of the storage arrays," where Calmon's model coordinator and historical data set correspond to the instant training node and training data set), a deployment of the machine learning model in a system being managed, the system including the training node and an inference node (Calmon, [0017]: "in one or more embodiments, a need for efficient management and deployment of such ML models arises" and [0020]: "One or more embodiments described herein provide an asynchronous ML model management framework that encompasses ML model training at a central node and ML model execution at any number of edge nodes") the inference node being located remotely from the training node (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include a datacenter network, a wide area network, a local area network, a wireless network, a cellular phone network .... A network may be located at a single physical location, or be distributed at any number of physical sites," where Calmon's edge node and model coordinator correspond to the instant inference node and training node, respectively), the training data set being obtained from the inference node (Calmon, [0049]: "the model coordinator begins receiving batch data from the edge nodes... [B]atch data is sets of data used as inputs for a ML model being executed at the edge nodes. ... 
[T]he model coordinator may obtain the stored batch data from the shared communication layer"), the inference node generating raw data (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data ( e.g., telemetry data, feature data, image data, etc.) that is related in any way to the operation of the edge device. As an example, a storage array edge device may include functionality to obtain feature data related to data storage," where Calmon's edge node corresponds to the instant inference node) including generated measurements or observations (Calmon, [0034]: "a storage array edge device may include functionality to obtain feature data related to data storage, such as read response time, write response time, number and/or type of disks ( e.g., solid state, spinning disks, etc.), model number(s), number of storage engines, cache read/writes and/or hits/misses, size of reads/ writes in megabytes, etc."), the training data being a subset of the raw data (Calmon, [0024]: "if an edge node determines that the ML model the edge node is using is marked as drifted, the edge node begins a batch collection mode. ... [W]hen in batch collection mode, triggered by a model being marked as outdated, an edge node begins transmitting batch data to the central node," where data from Calmon's batch collection mode is a subset of raw edge data);
generating a first set of confidence scores including confidence scores associated with an output of the machine learning model (Calmon, [0042]: "as part of training the ML model, the model coordinator also calculates a confidence value for the trained ML model.... the first stage of calculating a confidence value relates to the collection of confidence levels in the results (e.g., inferences) over the training data set") when a first validation data set is inputted to the machine learning model (Calmon, [0021]: "during training and validation of the ML model, the central node determines a confidence value for the model that is a measure of the confidence that the results of the ML model are correct" and [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation"), the first validation data set being obtained from the inference node (Calmon, [0041]: "In Step 200, a model coordinator trains and validates an ML model using a historical data set. ... As an example, the historical data set may be a large amount of telemetry data"), the first validation data set being not used for training (Calmon, [0041]: "training an ML model also includes validating the training to determine how well the ML model is performing relative to the historical data set, some of which may have been separated from the training portion to be used in validation," where Calmon's separated from the training portion corresponds to the instant not used for training) and being used to determine stability of the machine learning model (Calmon, [0044]: "In one or more embodiments, the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model. In one or more embodiments, the confidence threshold represents an aggregate confidence of the model on the results (e.g., inferences) produced based on the training dataset that, if confidence values at the edge nodes fall below, indicates that the ML model has drifted at such edge nodes," where Calmon's aggregate confidence with respect to model drift corresponds to the instant stability), the confidence scores being used to detect an occurrence of drift at the inference node (Calmon, [0022]: "the edge nodes access the shared communication layer to obtain the trained ML model based on the fresh indication associated with the ML model, and also obtain the confidence threshold for the ML model. ... [A]s an edge node uses the ML model, the edge node performs an analysis to derive a confidence value for the model for results produced by the ML model based on the data of the edge node. In one or more embodiments, the edge node compares the confidence value to the confidence threshold obtained from the central node. In one or more embodiments, if the confidence value for the ML model at an edge node falls below the confidence threshold, then drift has occurred for the ML model"), the drift referring to a situation where features of the first validation data being used by the machine learning model change over time (Calmon, [0019]: "ML model performance at the edge may be monitored by determining whether drift has occurred. In one or more embodiments, drift of an ML model is when the results (e.g., predictions, classifications, etc.) become increasingly less accurate, unstable, erroneous, etc.," where Calmon's increasingly corresponds to the instant over time);
transmitting, to an inference node, the first set of confidence scores and a representation of the machine learning model (Calmon, [0045]: "the model coordinator provides the trained ML model and the confidence threshold to a shared communication layer" where [0044]: "the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model" and [0056]: "In Step 240, an edge node obtains a trained ML model from a shared communication layer, along with an associated confidence threshold"), wherein the inference node optimizes communication overhead and limits data transfer between the training node and the inference node (Calmon, [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application") by determining whether the first set of confidence scores and a second set of confidence scores generated by the inference node are similar (Calmon, [0019]: "In one or more embodiments, ML model performance at the edge may be monitored by determining whether drift has occurred. ... Therefore, it may be desirable to detect drift at the edge nodes, and have the detection of drift communicated to the central node" and [0057]: "In one or more embodiments, the edge node performs drift detection by comparing a confidence value calculated for the trained ML model with the confidence threshold calculated by the model coordinator and obtained from the shared communication layer")
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are similar (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "NO" path) being indicative that the deployed machine learning model on the inference node is effective for a current data set (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection" when preceded by "NO" path from step 244) without need to transfer new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 244 followed by "NO" path, where "YES" path is followed by step 250 "Provide batch data to model coordinator"), and
the determining (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240") that the first set of confidence scores and the second set of confidence scores are not similar being indicative that the drift has occurred at the inference node (Calmon, Fig. 2B, step 244, "Drift detected?" followed by "YES" path) and the deployed machine learning model on the inference node is ineffective (Calmon, Fig. 2B, step 248, "Model marked as drifted?" followed by "YES" path) with the need to transfer the new data to the training node to retrain the machine learning model (Calmon, Fig. 2B, step 250, "Provide batch data to model coordinator"); and
in response to determining, by the inference node, that the first set of confidence scores and the second set of confidence scores are not similar (Calmon, [0059]: "In Step 246, based on the determination in Step 244 that drift has occurred for the ML model, the edge node sends a drift signal to the model coordinator"), the second set of confidence scores including confidence scores associated with inferences (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, an edge node calculates a confidence value for the trained ML model using techniques similar to that described with respect to the model coordinator in Step 200, above," where Calmon's edge confidence value corresponds to the instant second set of confidence scores), the inferences being generated (Calmon, [0057]: "the edge node executes the trained ML model using data available to the edge node.... In one or more embodiments, the trained ML model may be any trained ML model used for any purpose ( e.g., inference)") by inputting a test data set obtained by the inference node into the machine learning model (Calmon, [0057]: "In Step 242, the edge node executes the trained ML model using data available to the edge node, and performs drift detection. ... In one or more embodiments, the data used to execute the trained ML model may be any data relevant to the edge node ( e.g., telemetry data, user data, etc.)," where Calmon's perform drift detection using edge data corresponds to the instant inputting test data set), the test data set being used to determine quality of the machine learning model at the inference node (Calmon, [0069]: "the edge nodes each obtain the trained ML model and the confidence threshold from the shared communication layer, and begin executing the trained ML model. 
While executing the trained ML model, the edge nodes perform drift detection by calculating a confidence value for the trained ML model, and comparing the confidence value with the confidence threshold"), the test data set including the raw data (Calmon, [0057]: "the data used to execute the trained ML model may be any data relevant to the edge node (e.g., telemetry data, user data, etc.)"); and
receiving at least part of test data set transmitted from the inference node (Calmon, Fig. 2A, block 206, "Drift detected?" followed by block 210, "Receive batch data from edge node(s)," where [0049]: "In one or more embodiments, batch data is sets of data used as inputs for a ML model being executed at the edge nodes"), the at least part of the test data set being for training an updated machine learning model (Calmon, Fig. 2A, step 214, "Train ML model using updated historical data set to obtain updated trained ML model"), the inference node transmitting the at least part of the test data set only upon detecting drift (Calmon, Fig. 2B, step 244, "Drift detected?" and step 250, "Provide batch data to model coordinator") based on the first set of confidence scores and the second set of confidence scores (Calmon, [0058]: "In Step 244, the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240," where Calmon's confidence threshold and edge confidence value correspond to the instant first and second set of confidence scores, respectively), and refraining from transmitting the test data set in the absence of drift (Calmon, Fig. 2B, Step 244, "Drift detected?" followed by "NO" branch to repeat step 242, "Execute trained ML model and perform drift detection") so as to reduce wireless communication overhead between the inference node and the training node (Calmon, [0019]: "relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application. ... 
[D]etecting drift of ML models being executed on edge devices by a central node that trains and distributes the model would incur significant overhead due to the data at the edge nodes being sent (e.g., via a network) to the central node. Therefore, it may be desirable to detect drift at the edge nodes" where [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). ... A network may include ... a wireless network"); and
in response to receiving the at least part of the test data set, adding the at least part of the test data set to the training data set (Calmon, [0051]: "In Step 214, the model coordinator trains a new ML model (or re-trains the ML model) using an updated historical data set. In one or more embodiments, regardless of the method for the batch collection, when the model coordinator assesses that a representative set of batch data has been collected, the model coordinator proceeds to produce a new training data set, which may be referred to as an updated historical data set," where Calmon's updated historical data set corresponds to the instant training data set), and generating, by the training node, a request to train a new machine learning model (Calmon, Fig. 2A, step 214, "Train ML model using updated historical data set to obtain updated trained ML model," where [0035]: "In one or more embodiments, the model coordinator (100) is a computing device (described above)" and [0029]: "Examples of computing devices include, but are not limited to, a server (e.g., a blade-server in a blade-server chassis, a rack server in a rack, etc.) .... In one or more embodiments, any or all of the aforementioned examples may be combined to create a system of such devices, which may collectively be referred to as a computing device," where Calmon's model coordinator embodiment as a system of server devices reasonably suggests a request to train a new model),
wherein the inference node includes an ... device configured to receive the representation of the machine learning model (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more physical interfaces (e.g., network ports, storage ports)") via a wireless communication interface (Calmon, [0036]: "the edge nodes (102, 104) and the model coordinator (100) are operatively connected via, at least in part, a network (not shown). A network may refer to an entire network or any portion thereof (e.g., a logical portion of the devices within a topology of devices). A network may include ... a wireless network ... or any other suitable network that facilitates the exchange of information from one part of the network to another" and [0015]: "'operatively connected' may refer to any direct ( e.g., wired directly between two devices or components) or indirect (e.g., wired and/or wireless connections between any number of devices or components connecting the operatively connected devices)") and to perform on-device inference using the received machine learning model (Calmon, Fig. 2B, step 242, "Execute trained ML model and perform drift detection") ... and wherein the ... device includes sensing components (Calmon, [0034]: "an edge node (102, 104) includes functionality to generate or otherwise obtain any amount or type of data ( e.g., ... image data, etc.) that is related in any way to the operation of the edge device," where Calmon's image data generation reasonably suggests a camera sensor) and processing resources (Calmon, [0027]: "the edge nodes (102, 104) may be computing devices" and [0028]: "a computing device is any device ... capable of electronically processing instructions and may include ... one or more processors (e.g. 
components that include integrated circuitry) (not shown), memory (e.g., random access memory (RAM)) (not shown), input and output device(s) (not shown), non-volatile storage hardware ..., one or more physical interfaces") configured to execute the machine learning model on data generated by the ... device (Calmon, [0022]: "the edge nodes then begin executing the ML model based on data generated by, obtained by, or otherwise available to the respective edge nodes"), thereby enabling local inference without transmitting all raw sensor data to the training node (Calmon, [0018]: "In one or more embodiments, to realize such a ML edge-to-cloud management system, it is important to note that, while model training could be performed on both edge nodes and central nodes, ML model execution will often be performed at the edge (e.g., due to latency constraints of time-sensitive applications)" and [0019]: "an efficient model management should take into account ML model performance at the edge nodes so that the ML model can be adjusted, rather than relying solely on central nodes to perform such tasks, which may, for example, incur significant data exchange, thereby increasing the networking costs of the application").
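For the reader's convenience only, the edge-side drift-detection and selective-transmission mechanism mapped above (Calmon, [0057]-[0059]; Fig. 2B, steps 242-250) may be sketched as follows. All names, the toy model, and the threshold value are hypothetical illustrations and do not represent Calmon's disclosed implementation or the claimed method.

```python
def edge_node_step(model, batch, confidence_threshold):
    """Sketch of the edge-side drift check: run local inference, aggregate
    per-sample top-class confidences into a single confidence value, and
    transmit batch data to the coordinator only when drift is detected."""
    confidences = [max(probs) for probs in model(batch)]    # top-class confidence per sample
    confidence_value = sum(confidences) / len(confidences)  # aggregate mean confidence
    if confidence_value < confidence_threshold:             # drift detected ("YES" path)
        return {"drift_signal": True, "batch_data": batch}  # provide batch data for retraining
    return {"drift_signal": False, "batch_data": None}      # "NO" path: no data transfer

# Toy stand-in model emitting fixed class probabilities, for illustration only.
toy_model = lambda samples: [[0.9, 0.1] for _ in samples]
result = edge_node_step(toy_model, [0, 1, 2, 3], confidence_threshold=0.8)
# mean confidence 0.9 is not below 0.8, so no drift signal and no transmission
```

The sketch isolates the operative point of the mapping: data leaves the inference node only on the drift branch, which is the basis for the reduced-communication-overhead limitation.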
Calmon teaches a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model.
Calmon does not explicitly teach an embedded device configured to receive the representation of the machine learning model ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
However, Harvill teaches:
an embedded device configured to receive the representation of the machine learning model (Harvill, [0099]: "At least one hardware processor 504 is coupled to I/O subsystem 502 for processing information and instructions. Hardware processor 504 may include ... a special-purpose microprocessor such as an embedded system") ... wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. [T]he model is pruned of the training part ... the values and variables are collected into a single file ... and performance can be further improved through quantization.... [T]he trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format" and [0094]: "The deliver 229 step makes the trained model accessible for use. ... [T]he model can be downloaded for use in an application on device") and wherein the embedded device includes sensing components (Harvill, [0104]: "At least one input device 514 is coupled to I/O subsystem 502 for communicating signals, data, command selections or gestures to processor 504. Examples of input devices 514 include ... various types of sensors such as force sensors, motion sensors, heat sensors, accelerometers, gyroscopes, and inertial measurement unit (IMU) sensors") and processing resources configured to execute the machine learning model on data generated by the embedded device (Harvill, [0094]: "The deliver 229 step makes the trained model accessible for use. In one embodiment the model is deployed in a cloud service that receives an image for inference and results are sent back to the client. 
In another embodiment, the model can be downloaded for use in an application on device" and [0100]: "Computer system 500 includes one or more units of memory 506, such as a main memory, which is coupled to I/O subsystem 502 for electronically digitally storing data and instructions to be executed by processor 504. Memory 506 may include volatile memory such as various forms of random-access memory (RAM) or other dynamic storage device").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Calmon regarding a method of training and deployment of a machine learning model to a system including a training node and an inference node, including a step of transmitting, by the training node to the inference node, a representation of the machine learning model, wherein the inference node includes a device configured to receive the representation of the machine learning model with those of Harvill regarding an embedded device configured to receive the representation of the machine learning model, wherein the training node is configured to convert the machine learning model into an embedded model format prior to transmission, and wherein the embedded device includes sensing components and processing resources configured to execute the machine learning model on data generated by the embedded device.
The motivation to do so would be to facilitate use of the model optimized for the target device (Harvill, [0093]: "The productionize 228 step makes the trained model ready for use. In one embodiment the model is pruned of the training part to reduce the model size. To optimize the model, all the values and variables are collected into a single file, the graph is frozen through constant folding and collapsing the nodes, and performance can be further improved through quantization to lower precision values such as 16-bit float or 8-bit integers. In another embodiment the trained model is converted to target execution platform as in mobile devices using the CoreML or TFLite format").
Regarding Claim 16, the rejection of Claim 15 is incorporated. The Calmon/Harvill combination teaches:
determining whether the machine learning model has destabilised and re-stabilised after training the machine learning model (Calmon, [0016]: "one or more embodiments related to training an ML model at a central node, distributing the model to edge nodes, receiving indications from the edge nodes that model drift has occurred, obtaining new data from the edge nodes based on drift being detected, retraining the ML model, and re-distributing an updated model to the edge nodes," where [0019]: "drift of an ML model is when the results (e.g., predictions, classifications, etc.) become increasingly less accurate, unstable, erroneous, etc."); and
the first set of confidence scores and the representation of the machine learning model are transmitted in response to determining that the machine learning model has destabilised and re-stabilised (Calmon, [0045]: "the model coordinator provides the trained ML model and the confidence threshold to a shared communication layer" where [0044]: "the confidence value (e.g., μ in the above example) is used to derive a confidence threshold (e.g., t) for the ML model" and [0056]: "In Step 240, an edge node obtains a trained ML model from a shared communication layer, along with an associated confidence threshold").
Claim 20 incorporates substantively all the limitations of Claim 9 and is rejected under the same rationale.
Claims 2, 3, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill"), in view of Badawy, et al. (US 11,295,241 B1, hereinafter "Badawy").
Regarding Claim 2, the rejection of Claim 1 is incorporated. The Calmon/Harvill combination teaches:
generating a ... distribution ... based on the first set of confidence scores (Calmon, [0042]: "as part of training the ML model, the model coordinator also calculates a confidence value for the trained ML model.... the resulting confidence γ of the inference (the class with higher probability) of a sample is obtained. In one or more embodiments, an aggregate statistic p of the confidence over the whole training dataset is updated accordingly. In a typical embodiment, this statistic may comprise the mean prediction confidence of all inferences"); and
generating by the inference node, a second ... distribution ... based on the second set of confidence scores (Calmon, [0057]: "In one or more embodiments, the edge node performs drift detection by comparing a confidence value calculated for the trained ML model with the confidence threshold calculated by the model coordinator and obtained from the shared communication layer. In one or more embodiments, an edge node calculates a confidence value for the trained ML model using techniques similar to that described with respect to the model coordinator in Step 200, above" and [0042]: "the resulting confidence γ of the inference (the class with higher probability) of a sample is obtained. In one or more embodiments, an aggregate statistic p of the confidence over the whole training dataset is updated accordingly. In a typical embodiment, this statistic may comprise the mean prediction confidence of all inferences");
wherein: determining, by the inference node, whether the first set of confidence scores and the second set of confidence scores are similar further includes: determining whether the first ... function and the second ... function are similar (Calmon, [0057]: "In one or more embodiments, the edge node performs drift detection by comparing a confidence value calculated for the trained ML model with the confidence threshold calculated by the model coordinator and obtained from the shared communication layer" and [0058]: "the edge node determines if drift is detected on that edge node. In one or more embodiments, drift is detected when the confidence value calculated for the trained ML model in Step 242 is less than the confidence threshold associated with the trained ML model obtained in Step 240").
Calmon does not teach, but Badawy teaches:
generating ... a first cumulative distribution function based on the first set of confidence scores; and generating ... a second cumulative distribution function based on the second set of confidence scores (Badawy, col. 17, lines 15-18: "The drift detection model 174 gives an option to set a drift confidence threshold below which drift is declared and a warning confidence threshold below which warning is issued.... Such a drift detection model 174 may be based on the Kolmogorov-Smirnov (KS) statistical test.... The Kolmogorov-Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution," where the instant first and second cumulative distribution functions correspond to Badawy's reference and empirical distributions, respectively) ... wherein:
determining ... whether the first set of confidence scores and the second set of confidence scores are similar further includes: determining whether the first cumulative distribution function and the second cumulative distribution function are similar (Badawy, col. 17, lines 40-43: "KS-test compares the distance of the empirical cumulative data distribution dist_{R,W}. A drift measure corresponding with a concept drift may be detected by KSWIN if: dist_{R,W} > √(−ln(α)/r)").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Calmon regarding generating distributions based on the first and second sets of confidence scores and determining whether they are similar with those of Badawy regarding calculating the cumulative distribution functions and determining whether they are similar.
The motivation to do so would be to enable tailoring drift detection to distributions involving most suitable confidence metrics (Badawy, col. 4, lines 4-18: "a drift detection model may be applied to the first and second dataset to determine the drift measure. In one embodiment, the drift detection model may be trained or otherwise determined based on the first dataset .... This training may, for example, including the determination of one or more metrics associated with the first dataset that may be used in the determination of drift relative to a second dataset. In this manner, the drift detection model can be tailored specifically to the associated machine learning model ( or models) trained on that same dataset ( or a portion thereof)").
Regarding Claim 3, the rejection of Claim 2 is incorporated. The Calmon/Harvill/Badawy combination has been shown to teach:
wherein determining whether the first cumulative distribution function and the second cumulative distribution function are similar includes: generating a measure of difference between the first cumulative distribution function and the second cumulative distribution function using a Kolmogorov-Smirnov (KS) test (Badawy, col. 17, lines 15-18: "Such a drift detection model 174 may be based on the Kolmogorov-Smirnov (KS) statistical test.... The Kolmogorov-Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution," where the instant first and second cumulative distribution functions correspond to Badawy's reference and empirical distributions, respectively); and
determining that the first cumulative distribution function and the second cumulative distribution function are not similar in response to determining that the measure of difference is greater than a first threshold (Badawy, col. 17, lines 15-18: "The drift detection model 174 gives an option to set a drift confidence threshold below which drift is declared and a warning confidence threshold below which warning is issued").
Claims 12-13 incorporate substantively all the limitations of Claims 2-3, respectively and are rejected under the same rationales.
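For illustration only, the two-sample Kolmogorov-Smirnov comparison of confidence-score distributions mapped above (Badawy, col. 17) may be sketched as follows. The function names, score values, and threshold are hypothetical and do not represent Badawy's disclosed implementation or the claimed method.

```python
def empirical_cdf(sorted_vals, x):
    """Fraction of values <= x: the empirical cumulative distribution function."""
    return sum(1 for v in sorted_vals if v <= x) / len(sorted_vals)

def ks_statistic(scores_a, scores_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance between
    the two empirical CDFs, evaluated at every observed score."""
    a, b = sorted(scores_a), sorted(scores_b)
    return max(abs(empirical_cdf(a, x) - empirical_cdf(b, x)) for x in a + b)

# First set (training-node validation confidences) vs. second set (inference-node
# confidences); the values and the 0.5 threshold are illustrative only.
first = [0.91, 0.88, 0.93, 0.90, 0.89]
second = [0.62, 0.58, 0.65, 0.60, 0.59]
distance = ks_statistic(first, second)
drift = distance > 0.5  # CDFs treated as not similar when distance exceeds threshold
```

Because every score in the second set falls below every score in the first, the two empirical CDFs separate completely and the KS distance reaches its maximum of 1.0, so drift is flagged.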
Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill"), in view of Badawy, et al. (US 11,295,241 B1, hereinafter "Badawy"), in further view of Vasseur, et al. (US 2016/0219071 A1, hereinafter "Vasseur").
Regarding Claim 4, the rejection of Claim 3 is incorporated. The Calmon/Harvill/Badawy combination does not teach, but Vasseur teaches:
wherein the first threshold equals a sum of: a previous measure of difference generated using the Kolmogorov-Smirnov (KS) test and a first predetermined value (Vasseur, [0063]: "In the simplest embodiment, models 502 may be merely time averages pushed at regular time intervals. In more sophisticated embodiments, models 502 may include ... whole distributions (e.g., histograms, mixtures of Gaussians, etc.) pushed to SCA [Supervisory Control Agent] 404, in response to DLA [Distributed Learning Agent] 402 detecting a statistically significant change. For example, this can be achieved by computing the Kolmogorov-Smirnov (KS) distance between the previous update and the current state, and trigger a push only if a threshold is met").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Badawy combination regarding determining that two distributions are sufficiently dissimilar based on a threshold with those of Vasseur regarding the first threshold being a sum of a previous measure of difference generated using the Kolmogorov-Smirnov (KS) test and a first predetermined value.
The motivation to do so would be to facilitate analysis using historical trend data (Badawy, col. 6, lines 13-19, "Such a drift prediction model can be used for predicting drift. In other words, a drift prediction machine learning model may be trained on data points output from the drift detection model to determine to predict when a 'drift zone' may actually be entered. When such drift can be predicted users may be made aware of such drift in advance").
Claim 14 incorporates substantively all the limitations of Claim 4 and is rejected under the same rationale.
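For illustration only, the claimed threshold construction addressed above (the sum of a previous KS measure of difference and a predetermined value, as mapped to Vasseur, [0063]) may be sketched as follows; all names and values are hypothetical.

```python
def adaptive_threshold(previous_ks_distance, predetermined_value):
    """First threshold per the mapping: the previous KS-test measure of
    difference plus a fixed predetermined value (a drift margin)."""
    return previous_ks_distance + predetermined_value

def not_similar(current_ks_distance, previous_ks_distance, predetermined_value=0.05):
    """Treat the distributions as not similar only when the current KS
    distance exceeds the previous distance by more than the margin."""
    return current_ks_distance > adaptive_threshold(previous_ks_distance,
                                                    predetermined_value)

flagged = not_similar(current_ks_distance=0.30, previous_ks_distance=0.20)
# 0.30 exceeds 0.20 + 0.05, so the distributions are treated as not similar
```

The design point is that the threshold tracks the previously observed distance, so only a statistically meaningful change beyond the fixed margin triggers a report, consistent with Vasseur's push-on-significant-change behavior.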
Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill"), in view of Johnson, et al. (US 2021/0304003 A1, hereinafter "Johnson"), in further view of Wang (US 2023/0030419 A1, hereinafter "Wang").
Regarding Claim 6, the rejection of Claim 5 is incorporated. The Calmon/Harvill combination teaches:
training, by the training node, the machine learning model further comprises: training the machine learning model for a plurality of training iterations (Calmon, [0041]: "Training a ML model may include any number of iterations, epochs, etc. without departing from the scope of embodiments described herein").
Calmon does not teach, but Johnson teaches:
wherein each iteration in the plurality of training iterations includes updating parameters of the machine learning model (Johnson, [0106]: "Given the one or more constraints, the hyperparameter tuner 450 is configured to identify a set of hyperparameters that influence each constraint, specify values for the identified hyperparameters, and iteratively tune the hyperparameters until the trained machine-learning model satisfies each of the one or more constraints as described below"); and ...
determining whether the ... loss ... for the plurality of training iterations is greater than a second threshold (Johnson, [0027]: "This cost or loss function is minimized as part of the training. Training techniques such as back propagation training techniques used with neural networks may be used that iteratively modify/manipulate the weights associated with inputs to perceptrons in the neural network with the goal to minimize the loss function associated with the output(s) provided by the neural network" and [0088]: "In such a case, there are two metrics: in-domain recall and out-of-domain recall that are used to evaluate a performance of the machine-learning model. In such a domain detecting model, it is often desired that the model performs well i.e., above a certain threshold level with respect to in-domain detections as compared to out-of domain detections. In this case, the weight-assigning unit 420 assigns a higher weight to the in-domain recall metric as compared to the out-of-domain recall metric. The plurality of weighted metrics are provided as a second input to the hyperparameter tuner 450").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Calmon regarding iteratively training the machine learning model with those of Johnson regarding iteratively updating model parameters.
The motivation to do so would be to ensure that the model performs optimally with respect to objective metrics (Johnson, [0121]: "At step 670, the process iteratively tunes the set of hyperparameters associated with the machine-learning model in order to optimize (e.g., obtain an optimal value of the objective function) the machine-learning model for the plurality of metrics").
The Calmon/Harvill/Johnson combination does not teach, but Wang teaches:
determining whether the machine learning model has destabilised further comprises: determining a mean absolute loss difference for the plurality of training iterations; and determining whether the mean absolute loss difference for the plurality of training iterations is greater than a second threshold (Wang, [0067]: "In step 130, a first loss function is calculated according to the recognition result and a labeling result of the image sample" and [0068]: "the first loss function may be implemented as the Mae loss (Mean Absolute loss)").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Johnson combination regarding updating model parameters during training iterations with those of Wang regarding use of mean absolute loss for training.
The motivation to do so would be to make training more robust against outliers in the training data (Wang, [0069]: "Mae loss is insensitive to outliers, thereby improving the performance of the machine learning model").
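The robustness rationale cited from Wang can be illustrated with a short sketch. The function names and sample values below are illustrative assumptions for this explanation only, not drawn from Wang's disclosure; the comparison against squared error is added solely to show why mean absolute error is less sensitive to outliers.

```python
# Illustrative (hypothetical) comparison of mean-absolute-error (MAE) loss,
# as cited from Wang [0068], against mean-squared-error loss, to show MAE's
# relative insensitivity to outliers in the training data.

def mae_loss(predictions, targets):
    """Mean absolute error between predictions and labels."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)

def mse_loss(predictions, targets):
    """Mean squared error, shown only for contrast."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

targets = [1.0, 1.0, 1.0, 1.0]
clean = [1.1, 0.9, 1.0, 1.0]
with_outlier = [1.1, 0.9, 1.0, 5.0]  # one gross outlier in the final sample

# The single outlier inflates the squared-error loss far more than the
# absolute-error loss, which is the property Wang's motivation relies on.
```

On these sample values the MAE grows from 0.05 to 1.05 when the outlier is introduced, while the MSE grows from 0.005 to 4.005, a much larger relative change.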
Claim 17 incorporates substantively all the limitations of Claim 6 and is rejected under the same rationale.
Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill"), in view of Johnson, et al. (US 2021/0304003 A1, hereinafter "Johnson"), in further view of Wang (US 2023/0030419 A1, hereinafter "Wang"), in further view of Vijayaraghavan, et al. (US 2022/0357929 A1, hereinafter "Vijayaraghavan"), in further view of Badawy, et al. (US 11,295,241 B1, hereinafter "Badawy").
Regarding Claim 7, the rejection of Claim 6 is incorporated. The Calmon/Harvill/Johnson/Wang combination does not teach, but Vijayaraghavan teaches:
calculating a training loss and a validation loss for each of the plurality of training iterations (Vijayaraghavan, [0065]: "In the example run of FIG. 6, training and validation loss decrease rapidly. In the prototype used to create result 600, 80% of the training data was used to test the neural estimation model, while the remaining 20% is used for testing purposes. The more epochs ran, the lower the mean error squared and mean absolute error became, indicating improvement in accuracy across each iteration of the model").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Johnson/Wang combination regarding determining the mean absolute loss difference for the plurality of training iterations with those of Vijayaraghavan regarding calculating a training loss and a validation loss for each training iteration.
The motivation to do so would be to facilitate calculating a confidence score for model predictions (Vijayaraghavan, [0064]: "The prediction output represents the metrics of project delivery details corresponding to the project requirements based on the relationship pattern on the given set of input parameters. In some embodiments, the estimation engine may provide a certainty score with the output (e.g., certainty score can be a probability that estimation is correct).").
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination does not teach, but Badawy teaches:
the mean absolute loss difference for the plurality of training iterations is determined based on an average of the training loss minus the validation loss for each of the plurality of training iterations (Badawy, col. 16, 43-52: "the drift detection model 174 may be based on the early drift detection method (EDDM).... This type of model keeps track of the average distance between two errors instead of only the error rate. To do this, drift detection model 174 may also track the running average distance and the running standard deviation, as well as the maximum distance and the maximum standard deviation").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination regarding determining the mean absolute loss difference for training iterations by calculating a training loss and a validation loss with those of Badawy regarding the mean absolute loss difference being based on an average of the training loss minus the validation loss.
The motivation to do so would be to improve learning rates across varying types of model instability (Badawy, col. 16, 43-52: "the drift detection model 174 may be based on the early drift detection method (EDDM). EDDM is an improvement over the traditional drift detection method as discussed. It aims to improve the detection rate of gradual drift in DDM models but also keep a better performance against abrupt concept drift").
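The claimed computation addressed above, a mean absolute loss difference taken over the per-iteration gap between training loss and validation loss, can be sketched as follows. The function names, the windowing, and the use of the second threshold as a simple comparison are assumptions made for illustration; they are not drawn from any cited reference.

```python
# Hypothetical sketch of the Claim 7 limitation: compute the mean absolute
# loss difference as an average of |training loss - validation loss| across
# the plurality of training iterations, then compare it to a second threshold.

def mean_abs_loss_difference(train_losses, val_losses):
    """Average absolute gap between training and validation loss per iteration."""
    diffs = [abs(t - v) for t, v in zip(train_losses, val_losses)]
    return sum(diffs) / len(diffs)

def destabilised(train_losses, val_losses, second_threshold):
    """Flag destabilisation when the mean gap exceeds the second threshold."""
    return mean_abs_loss_difference(train_losses, val_losses) > second_threshold
```

A widening gap between training and validation loss is a conventional overfitting or drift signal, which is consistent with the drift-detection framing quoted from Badawy.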
Claim 18 incorporates substantively all the limitations of Claim 7 and is rejected under the same rationale.
Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon, et al. (US 2023/0004854 A1, hereinafter "Calmon"), in view of Harvill, et al. (US 2020/0050965 A1, hereinafter "Harvill"), in view of Johnson, et al. (US 2021/0304003 A1, hereinafter "Johnson"), in further view of Wang (US 2023/0030419 A1, hereinafter "Wang"), in further view of Vijayaraghavan, et al. (US 2022/0357929 A1, hereinafter "Vijayaraghavan"), in further view of Badawy, et al. (US 11,295,241 B1, hereinafter "Badawy"), in further view of Stripelis, et al., "Accelerating Federated Learning in Heterogeneous Data and Computational Environments," hereinafter Stripelis.
Regarding Claim 8, the rejection of Claim 6 is incorporated. The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has been shown to teach:
the plurality of training iterations includes a first set of training iterations and a second set of training iterations (Vijayaraghavan, [0065]: "In the example run of FIG. 6, training and validation loss decrease rapidly. In the prototype used to create result 600, 80% of the training data was used to test the neural estimation model, while the remaining 20% is used for testing purposes. The more epochs ran, the lower the mean error squared and mean absolute error became, indicating improvement in accuracy across each iteration of the model"), as recited in the rejection of Claim 7.
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has been shown to teach:
determining the mean absolute loss difference for the plurality of training iterations includes: determining a mean absolute loss difference for the first set of training iterations (Wang, [0067]: "In step 130, a first loss function is calculated according to the recognition result and a labeling result of the image sample" and [0068]: "the first loss function may be implemented as the Mae loss (Mean Absolute loss)"), as recited in the rejection of Claim 6.
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has been shown to teach:
determining whether the mean absolute loss difference for the first set of training iterations is greater than the second threshold (Johnson, [0027]: "This cost or loss function is minimized as part of the training. Training techniques such as back propagation training techniques used with neural networks may be used that iteratively modify/manipulate the weights associated with inputs to perceptrons in the neural network with the goal to minimize the loss function associated with the output(s) provided by the neural network" and [0088]: "In such a case, there are two metrics: in-domain recall and out-of-domain recall that are used to evaluate a performance of the machine-learning model. In such a domain detecting model, it is often desired that the model performs well i.e., above a certain threshold level with respect to in-domain detections as compared to out-of domain detections. In this case, the weight-assigning unit 420 assigns a higher weight to the in-domain recall metric as compared to the out-of-domain recall metric. The plurality of weighted metrics are provided as a second input to the hyperparameter tuner 450"), as recited in the rejection of Claim 6.
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has been shown to teach:
determining whether the machine learning model has re-stabilised further includes: determining a mean absolute loss difference for the second set of training iterations (Wang, [0067]: "In step 130, a first loss function is calculated according to the recognition result and a labeling result of the image sample" and [0068]: "the first loss function may be implemented as the Mae loss (Mean Absolute loss)"), as recited in the rejection of Claim 6.
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has been shown to teach:
calculating a standard deviation of: the mean absolute loss difference for the first set of training iterations and the mean absolute loss difference for the second set of training iterations (Badawy, col. 16, 43-52: "the drift detection model 174 may be based on the early drift detection method (EDDM).... This type of model keeps track of the average distance between two errors instead of only the error rate. To do this, drift detection model 174 may also track the running average distance and the running standard deviation, as well as the maximum distance and the maximum standard deviation"), as recited in the rejection of Claim 7.
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination has not yet been shown to teach:
determining whether the standard deviation is less than a third threshold (Johnson, [0089]: "the specification set of a metric includes: (1) a training dataset, (2) a validation dataset, (3) a metric definition that defines a measure as to how well a model satisfies a target goal on datasets e.g., training and validation datasets, (4) a target score for the metric (i.e., a score for the metric that the model is expected to meet), (5) a penalty factor for the metric, and (6) a bonus factor for the metric" and [0097]: "2. The metric definition for stability is set to be a standard deviation in the machine-learning model's accuracy scores on the validation dataset").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination regarding iteratively training a machine learning model according to two training iterations with the further teachings of Johnson regarding determining whether the standard deviation of the loss difference is less than a third threshold.
The motivation to do so would be to enable a greater degree of configurability with respect to testing constraints (Johnson, [0089]: "Each of the metrics 440 is associated with a corresponding specification set.... It is appreciated that the specification set associated with a particular metric can be configured independently with respect to the specification sets associated with other metrics").
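The standard-deviation comparison recited above can be sketched briefly. The helper name and the treatment of the two windowed values are assumptions for illustration; the claim itself recites only a standard deviation over the two mean absolute loss differences compared against a third threshold.

```python
# Hypothetical sketch of the Claim 8 step: take the mean absolute loss
# difference computed for each of two sets of training iterations, compute
# the standard deviation of those two values, and compare it against a
# third threshold as a stability signal.

import statistics

def restabilisation_check(mald_first_set, mald_second_set, third_threshold):
    """True when the spread between the two windowed values is small."""
    spread = statistics.pstdev([mald_first_set, mald_second_set])
    return spread < third_threshold
```

For two values the population standard deviation reduces to half their absolute difference, so the check is equivalent to asking whether the two windows' loss behaviour has converged.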
The Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination does not teach, but Stripelis teaches:
in response to determining that the standard deviation is less than the third threshold: calculating a line of best fit between the ... loss ... for the first set of training iterations and the ... loss ... for the second set of training iterations (Stripelis, p. 5, V. Adaptive Asynchronous DVW: "Validation Loss Criterion. To ensure that each learner is making good progress, we keep track of the percentage difference of the loss (cross entropy loss in our experiments) between two consecutive epochs of the learner's local model on its local validation set, i.e., V_pct = 100 * (VLoss_i - VLoss_{i-1}) / VLoss_{i-1}" where Stripelis's equation fits a line between validation loss of a current iteration and a previous iteration); and
determining that the machine learning model has destabilised and restabilised recently if the gradient of the line of best fit is within a fourth threshold (Stripelis, p. 5, V. Adaptive Asynchronous DVW: "Since we do not want the learner to overfit its local dataset, the learner halts training and requests a community update when one of two conditions are met: (C1) V_pct >= 0, or (C2) V_pct < 0 and |V_pct| <= VC_Loss. Condition C1 indicates that the validation loss has increased or plateaued, and condition C2 captures the magnitude of the decrease. The term VC_Loss is a user-defined threshold that signals when the improvements are too small").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Calmon/Harvill/Johnson/Wang/Vijayaraghavan combination regarding determining whether a machine learning model has re-stabilised with those of Stripelis regarding calculating a line of best fit between the loss for the first and second sets of training iterations and using a threshold to determine that learning has re-stabilised.
The motivation to do so would be to avoid overfitting during learning by early stopping (Stripelis, p. 5, V. Adaptive Asynchronous DVW: "Since we do not want the learner to overfit its local dataset, the learner halts training ... when ... conditions are met.... These conditions can be seen as a form of early stopping").
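The line-of-best-fit step mapped to Stripelis above can be sketched as a simple least-squares fit over the loss values of both sets of training iterations, with the fitted slope compared to the fourth threshold. The function names and the choice of an ordinary least-squares slope are assumptions for illustration; Stripelis's own criterion, as quoted, uses the percentage loss difference between consecutive epochs rather than an explicit regression.

```python
# Hypothetical sketch of the Claim 8 line-of-best-fit limitation: fit a
# least-squares line to the losses of the first and second sets of training
# iterations (indexed by iteration number) and treat a near-zero slope as
# evidence that the model has destabilised and restabilised recently.

def best_fit_gradient(losses):
    """Least-squares slope of loss values against iteration index."""
    n = len(losses)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def restabilised_recently(first_set_losses, second_set_losses, fourth_threshold):
    """True when the fitted slope across both sets is within the fourth threshold."""
    gradient = best_fit_gradient(first_set_losses + second_set_losses)
    return abs(gradient) <= fourth_threshold
```

A flat fitted line (slope near zero) indicates the loss has plateaued across both windows, paralleling Stripelis's condition C1 that improvement has stalled.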
Claim 19 incorporates substantively all the limitations of Claim 8 and is rejected under the same rationale.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT N DAY whose telephone number is (703)756-1519. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/R.N.D./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122