DETAILED ACTION
This non-final office action is responsive to application 18/190,937, filed 27 March 2023.
Claims 1-20 are pending and under examination; of these, claims 1, 2, and 15 are independent.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Independent Claims 1, 2 and 15
Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, independent claim 1, under the broadest reasonable interpretation, recites the following limitations that are abstract ideas:
determining an effect of each value of the first feature input on the first prediction based on the approximated integrated gradient; (mental process)
and determining a cyber incident response based on the cause; (mental process)
The “determining an effect” step involves identifying the impact of each feature on a prediction which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining an effect of each value of a first feature input at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining an effect” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining a cyber incident response” step involves identifying a response based on the reasons why a label was predicted which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a cyber incident response at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining a cyber incident response” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
Therefore, the independent claim recites a judicial exception. Independent claims 2 and 15 recite similar limitations corresponding to claim 1; therefore, the same subject matter eligibility analysis is applied.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the judicial exception recited above is not integrated into a practical application. The claims recite the following additional elements, but these additional elements are not sufficient to integrate the judicial exception into a practical application:
one or more processors; (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea)
and a non-transitory, computer-readable medium comprising instructions recorded thereon that when executed by the one or more processors cause operations comprising: (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea)
receiving a first feature input corresponding to a dataset with an unknown label, wherein the first feature input comprises a plurality of values, and wherein the plurality of values indicates networking activity of a user; (MPEP § 2106.05(g) necessary data gathering and insignificant extra-solution activity to the judicial exception)
inputting the first feature input into an artificial intelligence model, wherein the artificial intelligence model is non-differentiable, (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
wherein the artificial intelligence model is trained to detect a known label based on a set of training data comprising labeled feature inputs corresponding to the known label, (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
and wherein the known label comprises a detected cyber incident; (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
receiving a first prediction from the artificial intelligence model, wherein the first prediction indicates whether the first feature input corresponds to the known label; (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
receiving a second prediction for the artificial intelligence model, wherein the second prediction indicates an approximated integrated gradient for the artificial intelligence model; (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
generating for display, on a user interface, a recommendation for a cause of the known label in the dataset based on the effect of each value of the first feature input on the first prediction; (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
and generating for display a second recommendation for executing the cyber incident response. (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
The “receiving a first feature input” step amounts to mere data gathering and is recited at a high level of generality, thus adding insignificant extra-solution activity to the judicial exception – see MPEP § 2106.05(g). Under MPEP § 2106.05(d), such additional elements have been found by the courts to not integrate a judicial exception into a practical application.
The “inputting”, “wherein the artificial intelligence model is trained”, “wherein the known label comprises a detected cyber incident”, “receiving a first prediction”, “receiving a second prediction”, “generating for display … a recommendation”, and “generating for display a second recommendation” steps are each recited at a high level of generality such that the limitations amount to no more than mere instructions to “apply” the judicial exception on a computer. They can also be viewed as nothing more than an attempt to generally link the use of the judicial exception to the technological environment of computers, see MPEP § 2106.05(f).
The remaining additional elements are recited at a high level of generality such that they amount to no more than mere instructions to “apply” an exception using a generic component. Adding the words “apply it” (or an equivalent) to the judicial exception, or merely using a computer as a tool to perform an abstract idea, does not integrate the exception into a practical application, see MPEP § 2106.05(f).
Therefore, the above limitations do not integrate the judicial exception into a practical application.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
No. The claims do not include additional elements that are sufficient for the claims to amount to significantly more than the judicial exception.
Regarding the “receiving a first feature input” step, this step adds insignificant extra-solution activity that is well-understood, routine, and conventional (WURC). Per MPEP § 2106.05(d)(II), “the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity: i. Receiving or transmitting data over a network, e.g., using the Internet to gather data.” The “receiving a first feature input” step therefore does not integrate the judicial exception into a practical application and does not amount to significantly more.
Regarding the “inputting”, “wherein the artificial intelligence model is trained”, “wherein the known label comprises a detected cyber incident”, “receiving a first prediction”, “receiving a second prediction”, “generating for display … a recommendation”, and “generating for display a second recommendation” steps and the remaining additional elements, the limitations are recited so generically that they amount to no more than mere instructions to “apply” the judicial exception on a computer using generic computer components. Mere instructions to apply a judicial exception cannot provide an inventive concept. See MPEP § 2106.05(f).
Therefore, independent claims 1, 2 and 15 are not patent eligible.
Dependent Claims 3-14 and 16-20
The remaining dependent claims under rejection do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or to amount to significantly more than the judicial exception.
Dependent claims 3 and 16 recite the following limitations:
Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, dependent claim 3, under the broadest reasonable interpretation, recites the following limitations that are abstract ideas:
labeling the test feature input with the known label; (mental process)
The “labeling” step involves identifying which label to assign a test feature input which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of labeling a test feature input at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “labeling” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
Therefore, dependent claim 3 recites a judicial exception. Dependent claim 16 recites similar limitations corresponding to claim 3; therefore, the same subject matter eligibility analysis is applied.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the judicial exception recited above is not integrated into a practical application. The claims recite the following additional elements, but these additional elements are not sufficient to integrate the judicial exception into a practical application:
receiving a test feature input, wherein the test feature input represents test values corresponding to datasets that correspond to the known label; (MPEP § 2106.05(g) necessary data gathering and insignificant extra-solution activity to the judicial exception)
and training the artificial intelligence model to detect the known label based on the test feature input. (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
The “receiving” step amounts to mere data gathering and is recited at a high level of generality, thus adding insignificant extra-solution activity to the judicial exception – see MPEP § 2106.05(g). Under MPEP § 2106.05(d), such additional elements have been found by the courts to not integrate a judicial exception into a practical application.
The “training” step is recited at a high level of generality such that the limitation amounts to no more than mere instructions to “apply” the judicial exception on a computer. It can also be viewed as nothing more than an attempt to generally link the use of the judicial exception to the technological environment of computers, see MPEP § 2106.05(f).
Therefore, the above limitations do not integrate the judicial exception into a practical application.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
No. The claims do not include additional elements that are sufficient for the claims to amount to significantly more than the judicial exception.
Regarding the “receiving” step, this step adds insignificant extra-solution activity that is well-understood, routine, and conventional (WURC). Per MPEP § 2106.05(d)(II), “the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity: i. Receiving or transmitting data over a network, e.g., using the Internet to gather data.” The “receiving” step therefore does not integrate the judicial exception into a practical application and does not amount to significantly more.
Regarding the “training” step, the limitation is recited so generically that it amounts to no more than mere instructions to “apply” the judicial exception on a computer using generic computer components. Mere instructions to apply a judicial exception cannot provide an inventive concept. See MPEP § 2106.05(f).
Therefore, dependent claims 3 and 16 are not patent eligible.
Dependent claim 4 recites the following limitations:
determining a numerical approximation of gradients and integrals for the artificial intelligence model; (mental process)
and determining the approximated integrated gradient based on the numerical approximation of gradients and integrals. (mental process)
The “determining a numerical approximation” step involves computing a numerical approximation to determine gradients and integrals which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a numerical approximation at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining the approximated integrated gradient” step involves computing an integrated gradient using a numerical approximation of gradients and integrals which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining an approximated integrated gradient at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III). The steps do not integrate the judicial exception into a practical application and do not amount to significantly more.
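To illustrate the conventional textbook construction underlying these limitations: an approximated integrated gradient is typically computed as a Riemann sum of gradients sampled along a straight path from a baseline to the input, with each gradient itself numerically approximated when the model is non-differentiable. A minimal sketch follows; the example model used below is hypothetical and not drawn from the application.

```python
def finite_diff_grad(f, x, h=1e-4):
    """Central-difference gradient of a black-box scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def integrated_gradients(f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of the integrated gradient of f
    along the straight-line path from baseline to x."""
    total = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps  # midpoint of the k-th subinterval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = finite_diff_grad(f, point)
        total = [t + gi for t, gi in zip(total, g)]
    # scale the averaged gradients by the displacement from the baseline
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]
```

For the hypothetical model f(v) = v₀² + 3·v₁ with a zero baseline and input [2, 1], the attributions come out approximately [4.0, 3.0], and their sum equals f(x) − f(baseline) (the completeness property of integrated gradients).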
Dependent claim 5 recites the following limitations:
approximating a derivative for the artificial intelligence model using finite differences by solving differential equations; (mental process and math)
and determining numerical approximations of gradients for the artificial intelligence model based on the derivative. (mental process and math)
The “approximating” step involves using finite differences to determine a derivative which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of approximating a derivative at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “approximating” step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining” step involves using a calculated derivative to approximate gradients which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining numerical approximations of gradients at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining” step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III). The steps do not integrate the judicial exception into a practical application and do not amount to significantly more.
Dependent claim 6 recites the following limitations:
receiving a predetermined step-size for a first application; (MPEP § 2106.05(g) necessary data gathering and insignificant extra-solution activity to the judicial exception)
and using the predetermined step-size for approximating the derivative. (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
The “receiving” limitation represents mere necessary data gathering and is recited at a high level of generality, thus adding insignificant extra-solution activity to the judicial exception - see MPEP § 2106.05(g). This extra-solution activity is well-understood, routine, and conventional (WURC). Per MPEP § 2106.05(d)(II), “the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity: i. Receiving or transmitting data over a network, e.g., using the Internet to gather data.” The limitation does not integrate the judicial exception into a practical application and does not amount to significantly more.
The “using” step is recited at a high level of generality such that the limitation amounts to no more than mere instructions to “apply” the judicial exception on a computer, or merely uses a computer as a tool to perform an abstract idea. It can also be viewed as nothing more than an attempt to generally link the use of the judicial exception to the technological environment of computers, see MPEP § 2106.05(f). The step does not integrate the judicial exception into a practical application and does not amount to significantly more.
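The finite-difference approximation with a predetermined step size recited in claims 5 and 6 is the standard numerical-differentiation technique; a minimal sketch (the function and step size below are hypothetical):

```python
def forward_difference(f, x, h):
    """One-sided finite difference: f'(x) ~ (f(x + h) - f(x)) / h, O(h) error."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    """Two-sided finite difference: f'(x) ~ (f(x + h) - f(x - h)) / (2h), O(h^2) error."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

A smaller step size reduces truncation error until floating-point cancellation dominates, which is why a predetermined, application-specific step size is conventionally chosen.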
Dependent claim 7 recites the following limitations:
approximating an integral for the artificial intelligence model by approximating a region under a graph of a function that defines the artificial intelligence model; (mental process)
and determining numerical approximations of integrals for the artificial intelligence model based on the integral. (mental process)
The “approximating” step involves calculating a region under a graph of a function to determine an integral which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of approximating an integral at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “approximating” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining” step involves approximating integrals for an AI model which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining numerical approximations at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III). The steps do not integrate the judicial exception into a practical application and do not amount to significantly more.
Dependent claim 8 recites the following limitations:
approximating an integral for the artificial intelligence model by approximating an integrand f (x) by a quadratic interpolant P(x) of a function that defines the artificial intelligence model; (math and mental process)
and determining numerical approximations of integrals for the artificial intelligence model based on the integral. (math and mental process)
The “approximating” step involves using an integrand and quadratic interpolant to determine an integral which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of approximating an integral at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “approximating” step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining” step involves using a calculated integral to approximate integrals for an AI model which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining numerical approximations of integrals at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining” step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III). The steps do not integrate the judicial exception into a practical application and do not amount to significantly more.
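Approximating the integrand f(x) by a quadratic interpolant P(x), as recited in claim 8, corresponds to the conventional Simpson's rule; a minimal sketch:

```python
def simpson(f, a, b, n=100):
    """Composite Simpson's rule: on each pair of subintervals, the integrand
    is replaced by the quadratic interpolant through three sample points.
    n must be even; it is bumped up by one if odd."""
    if n % 2:
        n += 1
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * h)
    return s * h / 3
```

Because the interpolant is quadratic, the rule is exact for polynomials up to degree three; for example, it recovers the integral of x³ over [0, 1] exactly even with n = 2.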
Dependent claim 9 recites the following limitations:
determining a SHAP (SHapley Additive exPlanations) value for each value of the first feature input.
The “determining” step involves calculating a SHAP value for each feature which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a SHAP value at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III). The step does not integrate the judicial exception into a practical application and does not amount to significantly more.
Dependent claim 10 recites the following limitations:
determining a respective contribution of each value to a difference between an actual prediction and a mean prediction; (math and mental process)
determining a respective SHAP value based on the respective contribution; (math and mental process)
and determining the effect of each value based on the respective contribution. (mental process)
The “determining a respective contribution” step involves using a difference between predictions to identify a respective contribution which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a respective contribution at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining a respective SHAP value” step involves using a calculated respective contribution to determine a SHAP value which represents a mathematical calculation and amounts to no more than evaluations, observations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a respective SHAP value at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the step is considered to be a mathematical concept, see MPEP § 2106.04(a)(2)(I), and mental processes, see MPEP § 2106.04(a)(2)(III).
The “determining the effect” step involves identifying the impact each value has on a prediction based on the respective contribution which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining the effect of each value at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III). The steps do not integrate the judicial exception into a practical application and do not amount to significantly more.
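The SHAP computation recited in claims 9 and 10 distributes the difference between the actual prediction and a mean (baseline) prediction across the input features as Shapley values. An exact, brute-force sketch for a black-box model follows (exponential in the number of features, so practical only for small inputs; the example models are hypothetical):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for a black-box model f: each feature's average
    marginal contribution over all subsets, with 'absent' features fixed at
    their baseline value."""
    n = len(x)

    def value(subset):
        point = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(point)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for s in combinations(others, size):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += w * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi
```

The resulting values satisfy the efficiency property that claim 10 relies on: the per-feature contributions sum to f(x) − f(baseline), the difference between the actual and the baseline prediction.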
Dependent claims 11-14 recite the following limitations:
Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, dependent claim 11, under the broadest reasonable interpretation, recites the following limitations that are abstract ideas:
determining a fraudulent transaction response based on the cause; (claim 11)
determining a cyber incident response based on the cause; (claim 12)
determining a response based on the cause; (claim 13)
determining an identity theft response based on the cause; (claim 14)
The “determining” step involves identifying a response based on a cause which amounts to no more than observations, evaluations, and judgments that can be performed in the human mind or with the use of a physical aid (e.g., pen and paper). The claim recites the step of determining a response at a high degree of generality, thus the step is not required to have any specific level of complexity that would preclude the step from being mental processes. Therefore, the “determining” step is considered to be mental processes, see MPEP § 2106.04(a)(2)(III).
Therefore, dependent claim 11 recites a judicial exception. Dependent claims 12-14 recite similar limitations corresponding to claim 11; therefore, the same subject matter eligibility analysis is applied.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, the judicial exception recited above is not integrated into a practical application. The claims recite the following additional elements, but these additional elements are not sufficient to integrate the judicial exception into a practical application:
wherein the known label comprises a detected fraudulent transaction, (claim 11) (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
wherein the known label comprises a detected cyber incident, (claim 12)
wherein the known label comprises a refusal of a credit application, (claim 13)
wherein the known label comprises a detected identity theft, (claim 14)
wherein the plurality of values indicates a transaction history of a user, (claim 11) (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
wherein the plurality of values indicates networking activity of a user, (claim 12)
wherein the plurality of values indicates a credit history of a user, (claim 13)
wherein the plurality of values indicates a user transaction history, (claim 14)
and generating for display a second recommendation for executing the fraudulent transaction response. (claim 11) (MPEP § 2106.05(f) mere instructions to implement an abstract idea on a computer, or generally links exception to a technological environment)
and generating for display a second recommendation for executing the cyber incident response. (claim 12)
and generating for display a second recommendation for executing the response. (claim 13)
and generating for display a second recommendation for executing the identity theft response. (claim 14)
The “wherein the known label comprises …”, “wherein the plurality of values indicates …”, and “generating” steps are each recited at a high level of generality such that the limitations amount to no more than mere instructions to “apply” the judicial exception on a computer. They can also be viewed as nothing more than an attempt to generally link the use of the judicial exception to the technological environment of computers, see MPEP § 2106.05(f).
Therefore, the above limitations do not integrate the judicial exception into a practical application.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
No. The claims do not include additional elements sufficient to amount to significantly more than the judicial exception.
With regard to the “wherein the known label comprises …”, “wherein the plurality of values indicates …”, and “generating” limitations, they are recited so generically that they amount to no more than mere instructions to “apply” the judicial exception on a computer using generic computer components. Mere instructions to apply a judicial exception cannot provide an inventive concept. See MPEP § 2106.05(f).
Therefore, dependent claims 11-14 are not patent eligible.
Dependent claim 17 recites similar limitations corresponding to claim 4, therefore the same subject matter eligibility analysis is applied.
Dependent claim 18 recites similar limitations corresponding to claim 5, therefore the same subject matter eligibility analysis is applied.
Dependent claim 19 recites similar limitations corresponding to claim 6, therefore the same subject matter eligibility analysis is applied.
Dependent claim 20 recites similar limitations corresponding to claim 7, therefore the same subject matter eligibility analysis is applied.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Merrill et al. (US 20190378210 A1), hereinafter Merrill, in view of Mishra et al. (US 20220121744 A1), hereinafter Mishra, further in view of Ables et al. (“Creating an Explainable Intrusion Detection System Using Self Organizing Maps”), hereinafter Ables.
With respect to claim 1, Merrill teaches:
A system for generating recommendations for causes of … labels that are generated by non-differentiable artificial intelligence models …, comprising (Merrill discloses “A model evaluation system and related methods are provided. In some embodiments, the model evaluation and explanation system (e.g., 120 of FIGS. 1A and 1B) uses a non-differentiable model decomposition module (e.g., 121) to explain models by reference by transforming SHAP attributions (which are model-based attributions) to reference-based attributions (that are computed with respect to a reference population of data sets)” [0029].
Merrill discloses explanation information (‘causes’) can be generated to explain how a score (‘label’) was generated for a credit applicant, “S271 can include functions generating explanation information for a test data point (e.g., a test data point representing a credit applicant). The explanation information can include score explanation information that provides information that can be used to explain how a score was generated for the test data point” [0199].
Merrill discloses “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained). In some embodiments, the test data point represents a credit applicant” [0084].
Merrill discloses adverse action information (‘recommendations’) can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].):
one or more processors; and a non-transitory, computer-readable medium comprising instructions recorded thereon that when executed by the one or more processors cause operations comprising (Merrill discloses “the processing unit includes one or more processors communicatively coupled to one or more of a RAM, ROM, and machine-readable storage medium; the one or more processors of the processing unit receive instructions stored by the one or more of a RAM, ROM, and machine-readable storage medium via a bus; and the one or more processors execute the received instructions” [0235].):
receiving a first feature input corresponding to a dataset … (Merrill discloses a test data point (‘first feature input’), “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained). In some embodiments, the test data point represents a credit applicant” [0084].),
wherein the first feature input comprises a plurality of values (Merrill discloses a test data point (‘first feature input’) is comprised of multiple features (‘values’), “for each feature of a test data point, generating a difference value, the difference value for the test data point relative to a corresponding reference data point,” [0030].),
inputting the first feature input into an artificial intelligence model, wherein the artificial intelligence model is non-differentiable (Merrill discloses “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained” [0084].),
wherein the artificial intelligence model is trained … based on a set of training data comprising labeled feature inputs … (Merrill discloses “S220 includes selecting a single reference data point that represents the reference population (e.g., a reference data point having feature values that represent average values across the reference population). In some embodiments, the single reference data point represents average feature values of applicants who were “barely approved” (e.g., the bottom 10% of an approved population), according to their credit score” [0085]. A reference data point includes average feature values of applicants that were barely approved (therefore these feature values are associated (“labeled”) with belonging to a “barely approved” class of applicants).
Merrill discloses “S210 includes: selecting a reference population from training data (e.g., of the modelling system 110)” [0122].
Merrill discloses “a model of modeling system 110 of FIGS. 1A and 1B) that includes at least one non-differentiable model and at least one differentiable model” [0031].
That the non-differentiable model is trained is implied by its ability to generate outputs for a test data point.),
receiving a first prediction from the artificial intelligence model (See [0084] describing how an output for a test data point is generated by a non-differentiable model.),
receiving a second prediction for the artificial intelligence model (Merrill discloses a decomposition of ensembles (‘second prediction’) can be generated by combining SHAP and Integrated gradients, “a novel method for the decomposition of ensembles (combinations) of tree and neural network models that combines the SHAP and Integrated Gradients methods to produce a new and useful result (decomposition of ensemble models) which is used to perform new and useful analysis that results in tangible outputs that support human decision-making, e.g.: feature importance, adverse action, and disparate impact analysis, as described herein” [0028].
Merrill discloses “Feature importance is the application wherein a feature's importance is quantified with respect to a model. A feature may have significant or marginal impact, and it could hurt or harm how a model will score” [0131].
Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].
Merrill discloses “both Shapley and Integrated Gradient methods enforce nullity, e.g., features that do not contribute to the score will receive attribution values of zero. Shapley enforces this since it is focused on marginal contributions, that is, if the marginal impact of a feature is zero, over the space of all coalitions, its Shapley value will be zero as well. Integrated gradients enforces this by performing computing partial derivatives along the path of integration. If a feature does not contribute to a score, the attribution vector along the path integration, for the specific feature, will be zero” [0168].
SHAP and integrated gradients can be used to calculate each feature’s contribution towards a model’s score (output) (therefore each calculated feature’s importance is a prediction).),
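For illustration only (the model, weights, and data points below are hypothetical, not taken from Merrill), the “sum of the effects” property of SHAP values quoted above can be sketched for the simple case of a linear model, where the exact Shapley attribution of each feature against a single reference point reduces to the weight times the feature's deviation from the reference, and the attributions sum to the score difference between the test and reference data points:

```python
# Illustrative sketch (hypothetical values, not from the cited references):
# for a linear model f(x) = w . x + b, the exact Shapley attribution of
# feature i against a single reference point x_ref is w_i * (x_i - x_ref_i),
# and the attributions sum to f(x) - f(x_ref) ("sum of effects").

def linear_model(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def shapley_linear(w, x, x_ref):
    return [wi * (xi - ri) for wi, xi, ri in zip(w, x, x_ref)]

w, b = [0.5, -2.0, 1.5], 0.1
x = [1.0, 0.2, 3.0]        # test data point
x_ref = [0.0, 0.0, 1.0]    # reference data point

phi = shapley_linear(w, x, x_ref)
# the per-feature effects add up to the score difference
assert abs(sum(phi) - (linear_model(w, b, x) - linear_model(w, b, x_ref))) < 1e-9
```

For non-linear models no such closed form exists, which is why SHAP averages marginal contributions over coalitions of features.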
wherein the second prediction indicates an approximated integrated gradient for the artificial intelligence model (A decomposition of ensembles (‘second prediction’) can be generated by combining SHAP and Integrated gradients (see [0028]).
Merrill discloses “integrated gradients processes, as described herein) often involves integration by quadrature that contains discretization errors. In some embodiments, Advanced and adaptive quadrature methods, e.g, higher-order interpolating schemas (advanced) and optimal grid refinement techniques (adaptive), are used to perform integration during S240. In some embodiments, advanced techniques may include one or more of: performing an integration using a Gauss-Kronrod formula; performing integration by using a Clenshaw-Curtis formula; performing integration by performing a Richardson Error Estimation process … These advanced numerical methods can be useful in implementing performant integrated gradient calculations, as performed by the differentiable model decomposition module. In some embodiments, the number of computations linearly scales with the number of pointwise quadrature evaluations; any reduction of the total number of points to approximate the integral under examination will result in a linear reduction in runtime as well” [0214-0215].);
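The quadrature behavior Merrill describes in [0214-0215] can be sketched with a toy integrand (hypothetical, not from the reference): the number of pointwise evaluations scales linearly with the grid size, and refining the grid, or switching to a higher-order rule, reduces the discretization error of the approximated integral:

```python
# Illustrative sketch (toy integrand, not from Merrill): approximating an
# integral by quadrature incurs discretization error; refining the grid
# or using a higher-order rule reduces it, at a cost that scales linearly
# with the number of pointwise evaluations.
import math

def g(t):                  # stand-in for the gradient along the path
    return math.exp(t)

exact = math.e - 1.0       # integral of exp(t) over [0, 1]

def left_riemann(f, m):
    return sum(f(k / m) for k in range(m)) / m

def trapezoid(f, m):       # a simple higher-order quadrature rule
    return (f(0.0) / 2 + sum(f(k / m) for k in range(1, m)) + f(1.0) / 2) / m

errors = [abs(left_riemann(g, m) - exact) for m in (10, 100, 1000)]
assert errors[0] > errors[1] > errors[2]           # grid refinement helps
assert abs(trapezoid(g, 100) - exact) < errors[1]  # higher order helps
```

The Gauss-Kronrod and Clenshaw-Curtis formulas Merrill names are more sophisticated rules in this same family, trading fewer evaluation points for the same accuracy.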
determining an effect of each value of the first feature input on the first prediction based on the approximated integrated gradient (Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].
Merrill discloses “differentiable model decomposition module (e.g, 122) uses integrated gradients, as described by Mukund Sundararajan, Ankur Taly, Qiqi Yan, “Axiomatic Attribution for Deep Networks”, arXiv:1703.01365, 2017, the contents of which are incorporated by reference herein. This process sums gradients at points along a straight-line path from a reference input (x) to an evaluation input (x), such that the contribution of a feature i is given by:
(xi − xi′) × Σ (k = 1 to m) ∂F(x′ + (k/m) × (x − x′)) / ∂xi × (1/m)
for a given m, wherein xi is the variable value of the input variable i in the evaluation input data set x, wherein xi′ is the input variable value of the input variable i in the reference input data set, wherein F is the model” [0046].
Merrill discloses “both Shapley and Integrated Gradient methods enforce nullity, e.g., features that do not contribute to the score will receive attribution values of zero. Shapley enforces this since it is focused on marginal contributions, that is, if the marginal impact of a feature is zero, over the space of all coalitions, its Shapley value will be zero as well. Integrated gradients enforces this by performing computing partial derivatives along the path of integration. If a feature does not contribute to a score, the attribution vector along the path integration, for the specific feature, will be zero” [0168].
Merrill discloses “the non-differentiable model decomposition module computes a score decomposition for a test data point relative to a reference data point for the non-differentiable model (as described herein), and the differentiable model decomposition module computes a score decomposition for the test data point relative to the reference data point for the differentiable model, and combines the decomposition of the non-differentiable model with the decomposition for the differentiable model by using an ensembling function of the ensemble model, to generate a decomposition for an ensemble model score for the test data point relative to the reference data point” [0031].
SHAP and integrated gradients are used to determine feature importance towards a model score/output. Both SHAP and integrated gradients determine feature importance according to the same test data point (‘first feature input’) and reference data point (training data). The results of SHAP and integrated gradients are combined into a single decomposition.);
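The Riemann-sum approximation of integrated gradients quoted from Merrill [0046] can be sketched as follows; the toy model, inputs, and reference point are hypothetical, and the sketch also exhibits the completeness and nullity properties discussed in [0168]:

```python
# Illustrative sketch of the Riemann-sum approximation of integrated
# gradients described in Merrill [0046] (after Sundararajan et al.);
# the toy model, inputs, and reference point here are hypothetical.
def model(x):                          # toy differentiable model F
    return x[0] * x[0] + 3.0 * x[1]    # x[2] is deliberately unused

def partial(f, x, i, h=1e-5):          # central-difference partial derivative
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

def integrated_gradients(f, x, x_ref, m=500):
    attrs = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(1, m + 1):      # points along the straight-line path
            pt = [r + (k / m) * (v - r) for v, r in zip(x, x_ref)]
            acc += partial(f, pt, i)
        attrs.append((x[i] - x_ref[i]) * acc / m)
    return attrs

x, x_ref = [2.0, 1.0, 5.0], [0.0, 0.0, 0.0]
attrs = integrated_gradients(model, x, x_ref)
# completeness: attributions sum (approximately) to F(x) - F(x_ref)
assert abs(sum(attrs) - (model(x) - model(x_ref))) < 0.05
# nullity: the feature that does not contribute receives zero attribution
assert abs(attrs[2]) < 1e-6
```

The residual in the completeness check is exactly the discretization error of the finite sum over m path points; a larger m tightens it.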
generating for display, on a user interface, a recommendation for a cause of the known label in the dataset based on the effect of each value of the first feature input on the first prediction (Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system uses a decomposition generated for a model score to generate adverse action information (e.g., at S271) (as described herein) and provide the generated adverse action information to the operator device 171” [0131-0132]. See Figure 2 depicting the decomposition at S250 is a combined decomposition using SHAP and integrated gradients.
Merrill discloses “the score explanation information is used to generate Adverse Action information” [0033].
Adverse action information (‘first recommendation’) can be generated from score explanation information (‘cause of the known label’). See [0018, 0033] describing how adverse action information explains why a credit application was denied and steps a consumer can take to improve.
Merrill discloses a test data point (‘first feature input’) is comprised of features of a denied credit applicant (‘known label in the dataset’), “the model evaluation system 120 evaluates a specific denied credit applicant. In this embodiment the specific denied credit applicant comprises the test set (test data point) (selected at S220)” [0153].
Merrill discloses “S272 can include identifying features having decomposition values (in the generated decompositions) above a threshold. In some embodiments, the method 200 includes providing the identified features to an operator device (e.g., 171) via a network. In some embodiments, the method 200 includes displaying the identified features on a display device of an operator device (e.g., 171). In other embodiments, the method 20 includes displaying natural language explanations generated based on the decomposition described above. In some embodiments the method 200 includes displaying the identified features and their decomposition in a form similar to the table presented in FIG. 3” [0190-0191].);
determining a … response based on the cause (Merrill discloses above adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve, see [0018].);
and generating for display a second recommendation for executing the … response (Merrill discloses adverse action information can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].
Merrill discloses adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve (‘second recommendation’).
See [0131-0132, 0190-0191] above describing how an operator device 171 can be used to display natural language explanations and adverse action information.).
However, Merrill does not teach a first feature input with an unknown label and detecting a known label for the first feature input, which is taught by Mishra:
receiving a first feature input corresponding to a dataset with an unknown label (Mishra discloses “hardware processor is configured to execute a software application suspected of being malware; monitor behavior of the software application at run-time; and acquire an input time sequence of data records based on a trace analysis of the software application, wherein the input time sequence comprises a plurality of features of the software application. The hardware processor is further configured to classify the software application as being a malicious software application based on the plurality of features of the software application; and output a ranking of a subset plurality of features by their respective contributions towards the classification of the software application as being malicious software” [Abstract].
Mishra discloses “A classic structure of RNN is shown in FIG. 4. In the figure, A represents the neural network architecture, where x0, x1, x2, . . . represents the time series inputs and his represent the outputs of hidden layers” [0039].
A time series (sequence) comprised of a plurality of features is input into an RNN to classify a software application (therefore a time series being input into an RNN for classification has an unknown label).),
wherein the first feature input comprises a plurality of values (See [Abstract, 0039] describing how a time series is comprised of a plurality of features.);
inputting the first feature input into an artificial intelligence model (See [0039] describing how a time series is input into an RNN model.),
and wherein the artificial intelligence model is trained to detect a known label based on a set of training data comprising labeled feature inputs corresponding to the known label (Mishra discloses “an exemplary machine learning model for the present disclosure should satisfy the following two properties: (1) Ability to accept time series type data as input; and (2) Ability to make decisions utilizing potential information concealed in consecutive adjacent inputs. Accordingly, exemplary embodiments utilize a Recurrent Neural Network (RNN) training to satisfy these properties, since RNN is powerful in handling sequential input data.” [0038].
Mishra discloses “To do so, both malicious and benign software are executed on an exemplary hardware platform, in which a total of 367 programs (including both malicious and benign ones) are executed. All the traced data were mixed up and further split into training (80%) and test (20%) sets after labeling. Total training epochs are 200 for every model and test accuracy was plotted every 10 epochs” [0058].
Mishra discloses “the RNN accepts sequential inputs. For each single input xi, RNN not only provides immediate response hi, but also stores the information of the current input by updating the architecture itself. On the right-side of the figure, information corresponding to the previous step will also be fed into the architecture to supply extra information by unrolling the RNN structure. For trace-data-based malware detection, each column of a trace table can be set as inputs, and the hidden state of a final stage, i.e. ht, can be set as the final output” [0039].
Mishra discloses “exemplary RNN-classifier that utilizes the structure outlined in FIG. 4. … After passing through RNN units, the outputs are fed into a fully connected layer to achieve dimension reduction. Finally, a Softmax layer takes the reduced outputs from the fully connected layer to produce classification labels” [0053].);
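The RNN-plus-softmax structure Mishra describes with respect to Fig. 4 can be sketched in miniature (all weights below are hypothetical constants, not a trained detector): a hidden state is updated once per time step of the input sequence, and the final hidden state is mapped through a fully connected layer and softmax to class probabilities:

```python
# Illustrative sketch of an RNN classifier of the kind Mishra describes
# (Fig. 4): hidden state updated per time step, final state fed through a
# fully connected layer and softmax. Weights are hypothetical constants.
import math

def rnn_step(h, x, w_x=0.8, w_h=0.5, b=0.1):
    # h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    return math.tanh(w_x * x + w_h * h + b)

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def classify(sequence):
    h = 0.0
    for x in sequence:               # consume the time series step by step
        h = rnn_step(h, x)
    logits = [1.5 * h, -1.5 * h]     # fully connected layer -> 2 classes
    return softmax(logits)           # e.g. [P(malicious), P(benign)]

probs = classify([0.2, 0.9, 0.4])    # hypothetical trace-derived features
assert abs(sum(probs) - 1.0) < 1e-9  # softmax outputs a distribution
```

Because the hidden state carries forward across steps, the classification can depend on information concealed in consecutive adjacent inputs, which is the property Mishra requires in [0038].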
receiving a first prediction from the artificial intelligence model (See [0053] describing how a classification label is obtained from an RNN-classifier.),
wherein the first prediction indicates whether the first feature input corresponds to the known label (Mishra discloses a classification result is either labeled as benign or malicious (‘known label’), “hardware-assisted malware detection provides transparency in malware detection by providing interpretable explanations for classification results of benign and malicious programs. In one embodiment, an exemplary system/method interprets the outputs of a machine learning model with a ranking of contribution factors, which explicitly provides a detailed feature importance map and explains the internal mechanism of each individual prediction” [0025].);
Mishra teaches that classifying inputs using an RNN trained with labeled training data is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the RNN of Mishra to classify unlabeled inputs. By classifying unlabeled inputs, a trained machine learning model can label incoming inputs based on what it has learned, thereby allowing incoming data to be classified without human supervision or manual labeling.
Furthermore, the combination of Merrill in view of Mishra does not teach detecting cyber incidents, which is taught by Ables:
generating … causes of computer security labels that are generated by … artificial intelligence models processing datasets built by monitoring network activity (Ables discloses “The results for the NSL-KDD dataset can be found in Figures 2a and 2b. The local explanation example shows that the most important features for its prediction were ‘Duration’, ‘Destination (dst) bytes’, and ‘Source (src) bytes’. The remaining features, ‘Service (srv) count’, ‘Count’, and ‘Destination (dst) host count’ are considered less significant because of their distance from the BMU” (P. 409, Sec. V-A, ¶2).);
Ables discloses “we can see the features with the largest impact on a prediction: duration, dst bytes, and src bytes. These features were the closest to the BMU, and they played a large role in computing the predicted value. Seeing the specific features that influence predictions provides insight about samples labeled as malicious or benign and can further help operators determine the reason of incorrect predictions” (P. 408, Sec. IV-C, ¶3).
Ables discloses “Self Organizing Maps (SOMs), sometimes referred to as Kohonen Maps [8], [38], Kohonen Self Organizing Maps [39], or Kohonen Networks [40], are a class of unsupervised machine learning algorithms. SOMs are comprised of a network of individual units, each of which has a feature vector of the same size as the dimension of training data” (P. 406, Sec. III-A, ¶1).
The NSL-KDD dataset is comprised of features collected from network activity (duration, destination/source bytes). Local explanations explain which features were most important in determining a label prediction (therefore local explanations are the causes of computer security labels).),
and wherein the plurality of values indicates networking activity of a user (Ables discloses “The results for the NSL-KDD dataset can be found in Figures 2a and 2b. The local explanation example shows that the most important features for its prediction were ‘Duration’, ‘Destination (dst) bytes’, and ‘Source (src) bytes’. The remaining features, ‘Service (srv) count’, ‘Count’, and ‘Destination (dst) host count’ are considered less significant because of their distance from the BMU” (P. 409, Sec. V-A, ¶2).);
and wherein the known label comprises a detected cyber incident (Ables discloses “Intrusion Detection Systems are generally utilized as part of a larger cybersecurity defense effort at an organization generally located in a Cyber-Security Operations Center (CSoC). These systems monitor networks and automate attack detection by comparing network activity to the signature of known attacks or by detecting behavior that is anomalous to benign network patterns [2]. Through these methods, a security analyst can use an IDS to detect improper use, unauthorized access, or the abuse of a network” (P. 404, Sec. I, ¶2).
Ables discloses “Figure 2a shows the local explanations for a prediction, where each feature on the y-axis has a value representing distance from its respective BMU value (See Section III). In this example, we can see the features with the largest impact on a prediction: duration, dst bytes, and src bytes. These features were the closest to the BMU, and they played a large role in computing the predicted value. Seeing the specific features that influence predictions provides insight about samples labeled as malicious or benign and can further help operators determine the reason of incorrect predictions” (P. 408, Sec. IV-C, ¶3).),
and determining a cyber incident response based on the cause (Ables discloses “Figure 2a shows the local explanations for a prediction, where each feature on the y-axis has a value representing distance from its respective BMU value (See Section III). In this example, we can see the features with the largest impact on a prediction: duration, dst bytes, and src bytes. These features were the closest to the BMU, and they played a large role in computing the predicted value. Seeing the specific features that influence predictions provides insight about samples labeled as malicious or benign and can further help operators determine the reason of incorrect predictions. These features can also be further investigated with feature value heat maps” (P. 408, Sec. IV-C, ¶3).
Ables discloses “the users need to be confident in the predictions or recommendations computed by an IDS. Understandable explanations allow users to perform their tasks correctly. The stakeholders of an IDS (e.g. CSoC operators, developers, and investors) are individuals who will be dependent on the performance of the system. CSoC operators will be performing defensive actions based on prediction and explanation results. Developers can use explanations to fortify the model in areas where it is weak. Investors may need explanations to help them in making budgeting decisions for their company” (P. 406, Sec. II-C, ¶1).);
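The Best Matching Unit (BMU) mechanism underlying Ables' local explanations can be sketched as follows (the unit weight vectors and sample below are hypothetical, not from the NSL-KDD dataset): the BMU is the SOM unit whose weight vector is nearest the input, and the explanation ranks features by their per-feature distance from the BMU, with closer features treated as having the larger impact:

```python
# Illustrative sketch (toy weight vectors, not from Ables): a SOM's Best
# Matching Unit (BMU) is the unit nearest the input, and a local
# explanation ranks features by their per-feature distance from the BMU
# (closer = larger impact on the prediction).
import math

features = ["duration", "dst_bytes", "src_bytes"]
units = {                             # hypothetical SOM unit weight vectors
    "u1": [0.1, 0.9, 0.3],
    "u2": [0.8, 0.2, 0.7],
}
sample = [0.12, 0.80, 0.45]           # hypothetical network-activity record

def find_bmu(sample, units):
    return min(units, key=lambda u: math.dist(units[u], sample))

bmu = find_bmu(sample, units)
# per-feature distance from the BMU; smaller distance = larger impact
dists = {f: abs(s - w) for f, s, w in zip(features, sample, units[bmu])}
ranking = sorted(features, key=dists.get)
print(bmu, ranking)   # features closest to the BMU rank as most important
```

This mirrors the Figure 2a example Ables describes, where duration, dst bytes, and src bytes sit closest to the BMU and are reported as most important.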
Ables teaches that an intrusion detection system generating explanations of which features influenced a sample being labeled as malicious or benign is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the intrusion detection system disclosed by Ables to classify network activity. By classifying network activity, network activity that is classified as malicious can be further investigated and preventative defensive actions can be taken to mitigate threats and prevent system compromises.
Claims 2-4, 9-10, 13 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra.
With respect to claim 2, Merrill teaches:
A method for generating recommendations for causes of labeling determinations that are generated by non-differentiable artificial intelligence models, comprising (Merrill discloses explanation information (‘causes’) can be generated to explain how a score (‘label’) was generated for a credit applicant, “S271 can include functions generating explanation information for a test data point (e.g., a test data point representing a credit applicant). The explanation information can include score explanation information that provides information that can be used to explain how a score was generated for the test data point” [0199].
Merrill discloses “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained). In some embodiments, the test data point represents a credit applicant” [0084].
Merrill discloses adverse action information (‘recommendations’) can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].):
receiving a first feature input corresponding to a dataset … (Merrill discloses a test data point (‘first feature input’), “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained). In some embodiments, the test data point represents a credit applicant” [0084].),
wherein the first feature input comprises a plurality of values (Merrill discloses a test data point (‘first feature input’) is comprised of multiple features (‘values’), “for each feature of a test data point, generating a difference value, the difference value for the test data point relative to a corresponding reference data point,” [0030].);
inputting the first feature input into an artificial intelligence model, wherein the artificial intelligence model is non-differentiable (Merrill discloses “selecting a single test data point (e.g., an input data set whose score/output generated by at least the non-differentiable model is to be explained” [0084].),
and wherein the artificial intelligence model is trained … based on a set of training data comprising labeled feature inputs … (Merrill discloses “S220 includes selecting a single reference data point that represents the reference population (e.g., a reference data point having feature values that represent average values across the reference population). In some embodiments, the single reference data point represents average feature values of applicants who were “barely approved” (e.g., the bottom 10% of an approved population), according to their credit score” [0085]. A reference data point includes average feature values of applicants that were barely approved (therefore these feature values are associated (“labeled”) with belonging to a “barely approved” class of applicants).
Merrill discloses “S210 includes: selecting a reference population from training data (e.g., of the modelling system 110)” [0122].
Merrill discloses “a model of modeling system 110 of FIGS. 1A and 1B) that includes at least one non-differentiable model and at least one differentiable model” [0031].
It is implied that the non-differentiable model has been trained, as evidenced by its ability to generate outputs for a test data point.);
receiving a first prediction from the artificial intelligence model (See [0084] describing how an output for a test data point is generated by a non-differentiable model.),
receiving a second prediction for the artificial intelligence model (Merrill discloses a decomposition of ensembles (‘second prediction’) can be generated by combining SHAP and Integrated gradients, “a novel method for the decomposition of ensembles (combinations) of tree and neural network models that combines the SHAP and Integrated Gradients methods to produce a new and useful result (decomposition of ensemble models) which is used to perform new and useful analysis that results in tangible outputs that support human decision-making, e.g.: feature importance, adverse action, and disparate impact analysis, as described herein” [0028].
Merrill discloses “Feature importance is the application wherein a feature's importance is quantified with respect to a model. A feature may have significant or marginal impact, and it could hurt or harm how a model will score” [0131].
Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].
Merrill discloses “both Shapley and Integrated Gradient methods enforce nullity, e.g., features that do not contribute to the score will receive attribution values of zero. Shapley enforces this since it is focused on marginal contributions, that is, if the marginal impact of a feature is zero, over the space of all coalitions, its Shapley value will be zero as well. Integrated gradients enforces this by performing computing partial derivatives along the path of integration. If a feature does not contribute to a score, the attribution vector along the path integration, for the specific feature, will be zero” [0168].
SHAP and integrated gradients can be used to calculate each feature’s contribution towards a model’s score (output) (therefore each calculated feature’s importance is a prediction).),
wherein the second prediction indicates an approximated integrated gradient for the artificial intelligence model (A decomposition of ensembles (‘second prediction’) can be generated by combining SHAP and Integrated gradients (see [0028]).
Merrill discloses “integrated gradients processes, as described herein) often involves integration by quadrature that contains discretization errors. In some embodiments, Advanced and adaptive quadrature methods, e.g, higher-order interpolating schemas (advanced) and optimal grid refinement techniques (adaptive), are used to perform integration during S240. In some embodiments, advanced techniques may include one or more of: performing an integration using a Gauss-Kronrod formula; performing integration by using a Clenshaw-Curtis formula; performing integration by performing a Richardson Error Estimation process … These advanced numerical methods can be useful in implementing performant integrated gradient calculations, as performed by the differentiable model decomposition module. In some embodiments, the number of computations linearly scales with the number of pointwise quadrature evaluations; any reduction of the total number of points to approximate the integral under examination will result in a linear reduction in runtime as well” [0214-0215].);
determining an effect of each value of the first feature input on the first prediction based on the approximated integrated gradient (Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].
Merrill discloses “differentiable model decomposition module (e.g, 122) uses integrated gradients, as described by Mukund Sundararajan, Ankur Taly, Qiqi Yan, “Axiomatic Attribution for Deep Networks”, arXiv:1703.01365, 2017, the contents of which are incorporated by reference herein. This process sums gradients at points along a straight-line path from a reference input (x) to an evaluation input (x), such that the contribution of a feature i is given by:
(x_i − x_i′) × Σ_{k=1}^{m} [∂F(x′ + (k/m) × (x − x′)) / ∂x_i] × (1/m)
for a given m, wherein xi is the variable value of the input variable i in the evaluation input data set x, wherein xi′ is the input variable value of the input variable i in the reference input data set, wherein F is the model” [0046].
Merrill discloses “both Shapley and Integrated Gradient methods enforce nullity, e.g., features that do not contribute to the score will receive attribution values of zero. Shapley enforces this since it is focused on marginal contributions, that is, if the marginal impact of a feature is zero, over the space of all coalitions, its Shapley value will be zero as well. Integrated gradients enforces this by performing computing partial derivatives along the path of integration. If a feature does not contribute to a score, the attribution vector along the path integration, for the specific feature, will be zero” [0168].
Merrill discloses “the non-differentiable model decomposition module computes a score decomposition for a test data point relative to a reference data point for the non-differentiable model (as described herein), and the differentiable model decomposition module computes a score decomposition for the test data point relative to the reference data point for the differentiable model, and combines the decomposition of the non-differentiable model with the decomposition for the differentiable model by using an ensembling function of the ensemble model, to generate a decomposition for an ensemble model score for the test data point relative to the reference data point” [0031].
SHAP and integrated gradients are used to determine feature importance towards a model score/output. Both SHAP and integrated gradients determine feature importance according to the same test data point (‘first feature input’) and reference data point (training data). The results of SHAP and integrated gradients are combined into a single decomposition.);
and generating for display, on a user interface, a first recommendation for a cause of the known label in the dataset based on the effect of each value of the first feature input on the first prediction (Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system uses a decomposition generated for a model score to generate adverse action information (e.g., at S271) (as described herein) and provide the generated adverse action information to the operator device 171” [0131-0132]. See Figure 2 depicting the decomposition at S250 is a combined decomposition using SHAP and integrated gradients.
Merrill discloses “the score explanation information is used to generate Adverse Action information” [0033].
Adverse action information (‘first recommendation’) can be generated from score explanation information (‘cause of the known label’). See [0018, 0033] describing how adverse action information explains why a credit application was denied and steps a consumer can take to improve.
Merrill discloses a test data point (‘first feature input’) is comprised of features of a denied credit applicant (‘known label in the dataset’), “the model evaluation system 120 evaluates a specific denied credit applicant. In this embodiment the specific denied credit applicant comprises the test set (test data point) (selected at S220)” [0153].
Merrill discloses “S272 can include identifying features having decomposition values (in the generated decompositions) above a threshold. In some embodiments, the method 200 includes providing the identified features to an operator device (e.g., 171) via a network. In some embodiments, the method 200 includes displaying the identified features on a display device of an operator device (e.g., 171). In other embodiments, the method 20 includes displaying natural language explanations generated based on the decomposition described above. In some embodiments the method 200 includes displaying the identified features and their decomposition in a form similar to the table presented in FIG. 3” [0190-0191].).
However, Merrill does not teach a first feature input with an unknown label and detecting a known label for the first feature input, which is taught by Mishra:
receiving a first feature input corresponding to a dataset with an unknown label (Mishra discloses “hardware processor is configured to execute a software application suspected of being malware; monitor behavior of the software application at run-time; and acquire an input time sequence of data records based on a trace analysis of the software application, wherein the input time sequence comprises a plurality of features of the software application. The hardware processor is further configured to classify the software application as being a malicious software application based on the plurality of features of the software application; and output a ranking of a subset plurality of features by their respective contributions towards the classification of the software application as being malicious software” [Abstract].
Mishra discloses “A classic structure of RNN is shown in FIG. 4. In the figure, A represents the neural network architecture, where x0, x1, x2, . . . represents the time series inputs and his represent the outputs of hidden layers” [0039].
A time series (sequence) comprising a plurality of features is input into an RNN to classify a software application (therefore a time series that is being input into an RNN for classification has an unknown label).),
wherein the first feature input comprises a plurality of values (See [Abstract, 0039] describing how a time series is comprised of a plurality of features.);
inputting the first feature input into an artificial intelligence model (See [0039] describing how a time series is input into an RNN model.),
and wherein the artificial intelligence model is trained to detect a known label based on a set of training data comprising labeled feature inputs corresponding to the known label (Mishra discloses “an exemplary machine learning model for the present disclosure should satisfy the following two properties: (1) Ability to accept time series type data as input; and (2) Ability to make decisions utilizing potential information concealed in consecutive adjacent inputs. Accordingly, exemplary embodiments utilize a Recurrent Neural Network (RNN) training to satisfy these properties, since RNN is powerful in handling sequential input data.” [0038].
Mishra discloses “To do so, both malicious and benign software are executed on an exemplary hardware platform, in which a total of 367 programs (including both malicious and benign ones) are executed. All the traced data were mixed up and further split into training (80%) and test (20%) sets after labeling. Total training epochs are 200 for every model and test accuracy was plotted every 10 epochs” [0058].
Mishra discloses “the RNN accepts sequential inputs. For each single input xi, RNN not only provides immediate response hi, but also stores the information of the current input by updating the architecture itself. On the right-side of the figure, information corresponding to the previous step will also be fed into the architecture to supply extra information by unrolling the RNN structure. For trace-data-based malware detection, each column of a trace table can be set as inputs, and the hidden state of a final stage, i.e. ht, can be set as the final output” [0039].
Mishra discloses “exemplary RNN-classifier that utilizes the structure outlined in FIG. 4. … After passing through RNN units, the outputs are fed into a fully connected layer to achieve dimension reduction. Finally, a Softmax layer takes the reduced outputs from the fully connected layer to produce classification labels” [0053].);
receiving a first prediction from the artificial intelligence model (See [0053] describing how a classification label is obtained from an RNN-classifier.),
wherein the first prediction indicates whether the first feature input corresponds to the known label (Mishra discloses a classification result is either labeled as benign or malicious (‘known label’), “hardware-assisted malware detection provides transparency in malware detection by providing interpretable explanations for classification results of benign and malicious programs. In one embodiment, an exemplary system/method interprets the outputs of a machine learning model with a ranking of contribution factors, which explicitly provides a detailed feature importance map and explains the internal mechanism of each individual prediction” [0025].);
Mishra teaches that classifying inputs using an RNN trained with labeled training data is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the RNN of Mishra to classify unlabeled inputs. By classifying unlabeled inputs, a trained machine learning model can label incoming inputs based on what it has learned, thereby allowing incoming data to be classified without human supervision or manual labeling.
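For illustration only, the sequential classification Mishra describes can be sketched as a single-unit recurrent classifier. The weights, input feature values, and class mapping below are hypothetical hand-set values chosen for illustration; this is not Mishra's trained RNN:

```python
import math

def rnn_classify(sequence, w_in, w_rec, w_out):
    """Single-unit recurrent classifier: consume a time series of feature
    values, carry a hidden state across time steps, and emit a 2-class
    softmax label (0 = 'benign', 1 = 'malicious')."""
    h = 0.0
    for x in sequence:
        # The hidden state mixes the current input with the previous state,
        # so the decision reflects information in consecutive adjacent inputs.
        h = math.tanh(w_in * x + w_rec * h)
    scores = [w * h for w in w_out]        # final hidden state -> class scores
    exps = [math.exp(s) for s in scores]   # softmax over the two scores
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Hand-set (untrained) weights; a trained RNN would learn these from labeled
# traces.  Positive inputs push the hidden state positive, which w_out maps
# to the 'malicious' class.
label, probs = rnn_classify([0.2, 0.9, 0.8], w_in=1.0, w_rec=0.5, w_out=[-2.0, 2.0])
```

The sketch mirrors the structure described at [0039] and [0053]: RNN units consume the sequence, and a softmax over the final hidden state produces a classification label.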
With respect to claim 3, the combination of Merrill in view of Mishra teaches:
The method of claim 2, further comprising: receiving a test feature input, wherein the test feature input represents test values corresponding to datasets that correspond to the known label (Mishra discloses traced data is split into a test set with labels (‘known labels’), “both malicious and benign software are executed on an exemplary hardware platform, in which a total of 367 programs (including both malicious and benign ones) are executed. All the traced data were mixed up and further split into training (80%) and test (20%) sets after labeling. Total training epochs are 200 for every model and test accuracy was plotted every 10 epochs. Accordingly, FIGS. 8A-8C compares the prediction accuracy of an exemplary hardware-assisted malware detection approach with PREEMPT RF and PREEMPT DT. As we can see, the exemplary method (referred to as “proposed” in the figures) provides the best malware detection accuracy” [0058-0059].);
labeling the test feature input with the known label (Mishra discloses test accuracy is plotted using test data, see [0058-0059]. To plot accuracy, a test data input must be labeled with a classification generated by a model. Accurate results indicate that a generated classification matches with a correct, known test set label (therefore test data inputs are labeled with correct, known labels).);
and training the artificial intelligence model to detect the known label based on the test feature input (Mishra discloses “exemplary embodiments utilize a Recurrent Neural Network (RNN) training to satisfy these properties, since RNN is powerful in handling sequential input data.” [0038].
Mishra discloses “All the traced data were mixed up and further split into training (80%) and test (20%) sets after labeling. Total training epochs are 200 for every model and test accuracy was plotted every 10 epochs” [0058].
Test accuracy is plotted during training epochs; therefore, training an RNN to classify input data occurs alongside determining whether test data is labeled correctly (detecting known labels) using the RNN.).
Mishra teaches that using test data to determine classification accuracy is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the technique disclosed by Mishra to test a machine learning model’s classification accuracy during training. By testing classification accuracy during training, machine learning engineers can retrain or fine-tune a model based on its classification performance, thereby developing an optimal, accurate model.
With respect to claim 4, the combination of Merrill in view of Mishra teaches the method of claim 2, wherein receiving the second prediction for the artificial intelligence model further comprises: determining a numerical approximation of gradients and integrals for the artificial intelligence model (Merrill discloses “Integration by quadrature, e.g., a discretized solution for computing the integral, is useful for numerical integration that is performed by a discrete processing unit, e.g., CPU and GPU. In some embodiments, the differentiable model decomposition module 122 computes the numerical integration procedure with a Riemman sum, as suggested by Sundararajan et al. 2017: … where m is the number of steps in the approximation of the integral” [0219-0220].
Merrill discloses an equation for computing integrated gradients (reproduced below) at [0219]. Integrated gradients can be computed by approximating an integral. An integral can be approximated by summing gradients over m steps.
IG_i^approx(x) ≈ (x_i − x_i′) × Σ_{k=1}^{m} [∂F(x′ + (k/m) × (x − x′)) / ∂x_i] × (1/m)
);
and determining the approximated integrated gradient based on the numerical approximation of gradients and integrals (See [0219-0220] disclosed above describing how integrated gradients can be computed by approximating an integral and summing gradients.).
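For illustration only, the Riemann-sum approximation of integrated gradients described at [0219-0220] can be sketched as follows. The toy model F and its closed-form gradient grad_F are hypothetical choices for illustration; this is not Merrill's implementation:

```python
def integrated_gradient(F, grad_F, x, x_ref, m=1000):
    """Approximate integrated gradients for model F.

    Sums gradients at m points along the straight-line path from the
    reference input x_ref to the evaluation input x (a Riemann sum),
    then scales each summed gradient by (x_i - x_i').
    Returns one attribution value per feature.
    """
    n = len(x)
    summed = [0.0] * n
    for k in range(1, m + 1):
        # Point on the straight-line path from x_ref to x.
        point = [x_ref[i] + (k / m) * (x[i] - x_ref[i]) for i in range(n)]
        g = grad_F(point)  # gradient of F at this path point
        for i in range(n):
            summed[i] += g[i] / m
    return [(x[i] - x_ref[i]) * summed[i] for i in range(n)]

# Toy differentiable model: F(x) = x0^2 + 3*x1, gradient known in closed form.
F = lambda x: x[0] ** 2 + 3 * x[1]
grad_F = lambda x: [2 * x[0], 3.0]

attr = integrated_gradient(F, grad_F, x=[1.0, 2.0], x_ref=[0.0, 0.0], m=1000)
# Completeness: attributions sum to F(x) - F(x_ref), up to discretization error.
```

Note how the number of gradient evaluations scales linearly with m, consistent with Merrill's observation at [0215] that reducing the number of quadrature points linearly reduces runtime.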
With respect to claim 9, the combination of Merrill in view of Mishra teaches: the method of claim 2,
wherein determining the effect of each value of the first feature input on the first prediction comprises determining a SHAP (SHapley Additive exPlanations) value for each value of the first feature input (Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].).
With respect to claim 10, the combination of Merrill in view of Mishra teaches:
the method of claim 2, wherein determining the effect of each value of the first feature input on the first prediction based on the approximated integrated gradient further comprises: determining a respective contribution of each value to a difference between an actual prediction and a mean prediction (Merrill discloses “M is the number of input features, N is the set of input features, S is the set features constructed from superset N. The function ƒ(hx(z′)) defines a manner to remove features so that an expected value of f(x) can be computed which is conditioned on the subset of a feature space xS. The missingingness is defined by z′, each zi′ variable represents a feature being observed (zi′=1) or unknown (zi′=0)” [0042].
Merrill discloses Equations 1 and 2 at [0042] (reproduced below) depicting equations for calculating SHAP values. The contribution of feature i is calculated by the difference between a prediction when feature i is known and an average model prediction.
ϕ_i = Σ_{S ⊆ N\{i}} [|S|! (M − |S| − 1)! / M!] × [f_x(S ∪ {i}) − f_x(S)]   (Equation 1)

g(z′) = ϕ_0 + Σ_{i=1}^{M} ϕ_i z_i′   (Equation 2)
);
determining a respective SHAP value based on the respective contribution (See Equation 1 above depicting how a SHAP value ϕi is calculated based on a feature i’s contribution to the difference between a prediction when feature i is known and an average model prediction.);
and determining the effect of each value based on the respective contribution (Merrill discloses “a Shapley value decomposition (e.g., generated by the non-differentiable model decomposition module) is a linear combination of feature attribution values ϕi (Shapley value). … Shapley value decompositions are SHAP (SHapley Additive exPlanation) values … SHAP values explain the output of a model ƒ as a sum of the effects ϕi of each feature being introduced into a conditional expectation” [0048].).
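For illustration only, the Shapley weighting described above can be sketched by exhaustive subset enumeration. The toy model, and the baseline substitution used to "remove" features outside S, are simplifying assumptions for illustration (Merrill describes removal via conditional expectations):

```python
from itertools import combinations
from math import factorial

def shap_values(f, x, baseline):
    """Exact Shapley values for model f over a small feature set.

    phi_i weights the marginal contribution f_x(S u {i}) - f_x(S) over
    all subsets S of the other features by |S|!(M-|S|-1)!/M!.  Features
    outside S are 'removed' by substituting baseline values (a
    simplification of the conditional expectation in Equation 1).
    """
    M = len(x)

    def f_masked(subset):
        # Evaluate f with features outside `subset` set to the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(M)]
        return f(z)

    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        total = 0.0
        for size in range(M):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(M - size - 1) / factorial(M)
                total += weight * (f_masked(set(S) | {i}) - f_masked(set(S)))
        phi.append(total)
    return phi

# Additive toy model: each feature's contribution is exactly recoverable.
f = lambda z: 2 * z[0] + 5 * z[1]
phi = shap_values(f, x=[1.0, 1.0], baseline=[0.0, 0.0])
# Local accuracy: phi sums to f(x) - f(baseline).
```

The sum of the ϕ_i values equals the difference between the model's prediction for x and its baseline prediction, mirroring the "sum of the effects" framing quoted from [0048].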
With respect to claim 13, the combination of Merrill in view of Mishra teaches:
the method of claim 2, wherein the known label comprises a refusal of a credit application (Merrill discloses a test data point is comprised of features of a denied credit applicant (‘known label in the dataset’), “the model evaluation system 120 evaluates a specific denied credit applicant. In this embodiment the specific denied credit applicant comprises the test set (test data point) (selected at S220)” [0153].),
wherein the plurality of values indicates a credit history of a user (Merrill discloses Figure 3 (reproduced below) depicting variables used in the decomposition of a model, along with their feature description and calculated importance.
[Figure 3 of Merrill, reproducing a table that lists each variable used in the model decomposition together with its feature description and calculated feature importance.]
),
and wherein the method further comprises: determining a response based on the cause (Merrill discloses adverse action information can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].
Merrill discloses adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve (‘second recommendation’).);
and generating for display a second recommendation for executing the response (Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system … provide the generated adverse action information to the operator device 171” [0131-0132]. See [0190-0191] describing how an operator device 171 can be used to display natural language explanations.).
With respect to claim 15, the rejection of claim 2 is incorporated. The difference in scope being:
A non-transitory, computer-readable medium comprising instructions that, when executed by one or more processors, cause operations comprising (Merrill discloses “the processing unit includes one or more processors communicatively coupled to one or more of a RAM, ROM, and machine-readable storage medium; the one or more processors of the processing unit receive instructions stored by the one or more of a RAM, ROM, and machine-readable storage medium via a bus; and the one or more processors execute the received instructions” [0235].).
With respect to claim 16, the claim recites similar limitations corresponding to claim 3, therefore, the same rationale of rejection is applicable.
With respect to claim 17, the claim recites similar limitations corresponding to claim 4, therefore, the same rationale of rejection is applicable.
Claims 5 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Baydin et al. (“Automatic differentiation in machine learning: a survey”), hereinafter Baydin.
With respect to claim 5, the combination of Merrill in view of Mishra teaches the method of claim 2, however, the combination does not teach approximating a derivative using finite differences, which is taught by Baydin:
wherein receiving the second prediction for the artificial intelligence model further comprises: approximating a derivative for the artificial intelligence model using finite differences by solving differential equations (Baydin discloses “Numerical differentiation is the finite difference approximation of derivatives using values of the original function evaluated at some sample points (Burden and Faires, 2001) (Figure 2, lower right). In its simplest form, it is based on the limit definition of a derivative. For example, for a multivariate function f : R^n → R, one can approximate the gradient ∇f = (∂f/∂x_1, …, ∂f/∂x_n) using

∂f(x)/∂x_i ≈ [f(x + h e_i) − f(x)] / h

where e_i is the i-th unit vector and h > 0 is a small step size. This has the advantage of being uncomplicated to implement” (P. 4, Sec. 2.1, ¶1).);
and determining numerical approximations of gradients for the artificial intelligence model based on the derivative (Baydin discloses Equation 1 on P. 4 (reproduced above) describing how a gradient ∇f can be approximated using numerical differentiation.).
Baydin teaches that approximating a gradient using finite differences is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the finite differences disclosed by Baydin to approximate a gradient. Because finite differences are uncomplicated to implement, the combination provides a more straightforward and interpretable way to approximate gradients.
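For illustration only, the forward-difference gradient approximation quoted from Baydin can be sketched as follows (the toy function f is a hypothetical choice; this is not Baydin's code):

```python
def finite_difference_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x:
    each partial derivative is (f(x + h*e_i) - f(x)) / h, where e_i
    is the i-th unit vector and h > 0 is a small step size."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        x_step = list(x)
        x_step[i] += h  # perturb along the i-th unit vector
        grad.append((f(x_step) - fx) / h)
    return grad

# Toy function with a known gradient: [2*x0, 3].
f = lambda x: x[0] ** 2 + 3 * x[1]
g = finite_difference_gradient(f, [1.0, 2.0])
```

Each partial derivative requires only two function evaluations, which is the uncomplicated implementation Baydin highlights.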
With respect to claim 18, the claim recites similar limitations corresponding to claim 5, therefore, the same rationale of rejection is applicable.
Claims 6 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Baydin and Kirsch et al. (“Efficient structural optimization using reanalysis and sensitivity reanalysis”), hereinafter Kirsch.
With respect to claim 6, the combination of Merrill in view of Mishra, further in view of Baydin teaches the method of claim 5, however, the combination does not teach approximating a derivative using a predetermined step-size, which is taught by Kirsch:
the method of claim 5, wherein approximating the derivative for the artificial intelligence model using finite differences by solving differential equations further comprises: receiving a predetermined step-size for a first application (Kirsch discloses “we use efficient finite-differences for purposes of illustration. Assuming forward-differences, the response derivatives are approximated from the displacements at the original design point X0 and at the perturbed point X0 + δX by

∂r/∂X ≈ [r(X0 + δX) − r(X0)] / δX

where δX is predetermined step-size” (P. 231, Sec. 2.1.1, ¶2).);
and using the predetermined step-size for approximating the derivative (Kirsch discloses Equation 2 on P. 231 (reproduced above) describing how derivatives are approximated using finite differences with a predetermined step-size δX.).
Kirsch teaches that approximating derivatives using finite differences with a predetermined step-size is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the predetermined step-size disclosed by Kirsch. By approximating derivatives with a predetermined step-size, an optimal step-size known to balance the trade-off between truncation error and round-off error can be used to consistently maximize the accuracy of a derivative approximation.
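For illustration only, the truncation/round-off trade-off that motivates choosing a predetermined step-size can be sketched with a scalar forward difference (hypothetical toy function and step-sizes):

```python
def forward_diff(f, x, h):
    """Forward-difference derivative with a predetermined step-size h."""
    return (f(x + h) - f(x)) / h

# Toy function with a known derivative: d/dx x^3 at x = 1 is 3.
f = lambda x: x ** 3
errors = {h: abs(forward_diff(f, 1.0, h) - 3.0) for h in (1e-1, 1e-8, 1e-13)}
# A too-large step suffers truncation error; a too-small step suffers
# floating-point round-off error; an intermediate step (here ~1e-8)
# balances the two and gives the smallest error.
```

This is why a predetermined, well-chosen step-size (Kirsch's δX) can consistently yield accurate derivative approximations.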
With respect to claim 19, the claim recites similar limitations corresponding to claim 6, therefore, the same rationale of rejection is applicable.
Claims 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Mishra et al. (US 20230281047 A1), hereinafter Mishra ‘047.
With respect to claim 7, the combination of Merrill in view of Mishra teaches the method of claim 2, however, the combination does not teach approximating an integral by approximating a region under a graph, which is taught by Mishra ‘047:
wherein receiving the second prediction for the artificial intelligence model further comprises: approximating an integral for the artificial intelligence model by approximating a region under a graph of a function that defines the artificial intelligence model (Mishra ‘047 discloses “the transformation of one or more IG computations can be used to generate an XAI model (XAI model 620) and/or can be utilized for feature attribution to provide explainable ML in accordance with one or more embodiments of the present disclosure. As described herein, the computation of IG is straightforward using Equation 3. In many circumstances, the output function F (e.g., as provided in Equation 3) is too complicated to have an analytically solvable integral. However, this challenge can be mitigated using two strategies. In the first strategy, numerical integration with polynomial interpolation can be applied to approximate the integral” [0085].
Mishra ‘047 discloses “Integrated Gradients (IG) serves as another technique to explain ML models. The equation to compute the IG attribution for an input record x and a baseline record x′ is as follows … where F : R^n → [0, 1] represents the ML model” [0045-0046]. See [0045] describing how integrated gradients can be computed using equation 3, where function F represents a machine learning model.
Mishra ‘047 discloses “The numerical integration is computed through the trapezoidal rule. Formally, the trapezoidal rule works by approximating the region under the graph of the function F(x) as a trapezoid and calculating its area to approach the definite integral, which is actually the result obtained by averaging the left and right Riemann sums. The interpolation improves approximation by partitioning the integration interval, applying the trapezoidal rule to each sub-interval, and summing the results” [0086].);
and determining numerical approximations of integrals for the artificial intelligence model based on the integral (Mishra ‘047 discloses “the output function F (e.g., as provided in Equation 3) is too complicated to have an analytically solvable integral. However, this challenge can be mitigated using two strategies. In the first strategy, numerical integration with polynomial interpolation can be applied to approximate the integral” [0085]. See [0086] describing how approximated areas are summed to approximate an integral.).
Mishra ‘047 teaches that approximating an integral by approximating a region under a graph is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the approximation technique disclosed by Mishra ‘047 to approximate an integral by approximating a region under a graph. By approximating an integral as a region under a graph, complicated functions that represent machine learning models can have solvable integrals, thereby enabling machine learning models to use integrals to generate predictions or provide feature attribution.
With respect to claim 20, the claim recites limitations similar to those of claim 7; therefore, the same rationale of rejection is applicable.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Holmes (“Numerical Integration,” Introduction to Scientific Computing and Data Analysis).
With respect to claim 8, the combination of Merrill in view of Mishra teaches the method of claim 2, wherein receiving the second prediction for the artificial intelligence model further comprises: … and determining numerical approximations of integrals for the artificial intelligence model based on the integral (Merrill discloses “Integration by quadrature, e.g., a discretized solution for computing the integral, is useful for numerical integration that is performed by a discrete processing unit, e.g., CPU and GPU. In some embodiments, the differentiable model decomposition module 122 computes the numerical integration procedure with a Riemman sum, as suggested by Sundararajan et al. 2017: … where m is the number of steps in the approximation of the integral” [0219-0220].
Merrill discloses an equation for computing integrated gradients (reproduced above) at [0219]. Integrated gradients can be computed by approximating an integral. An integral can be approximated by summing gradients over m steps for a machine learning model.).
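For illustration only, the m-step Riemann-sum approximation of integrated gradients quoted from Merrill [0219-0220] can be sketched as follows; the gradient function grad_F and the linear model in the usage example are hypothetical stand-ins, not the models of the cited references.

```python
def integrated_gradients(grad_F, x, baseline, m=50):
    """Approximate the IG attribution for each feature i as
    (x_i - x'_i) * (1/m) * sum over k of dF/dx_i evaluated at
    x' + (k/m)(x - x'), i.e. a right Riemann sum over m steps
    along the straight path from the baseline x' to the input x."""
    n = len(x)
    sums = [0.0] * n
    for k in range(1, m + 1):
        alpha = k / m
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad_F(point)                      # gradient of F at the interpolated point
        for i in range(n):
            sums[i] += g[i]
    return [(x[i] - baseline[i]) * sums[i] / m for i in range(n)]

# Sanity check with a linear stand-in model F(x) = w . x, whose gradient
# is constant; IG then recovers w_i * (x_i - x'_i) exactly.
w = [0.5, -2.0]
grad_F = lambda p: w
ig = integrated_gradients(grad_F, x=[1.0, 1.0], baseline=[0.0, 0.0])
```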
However, the combination does not teach approximating an integral by approximating an integrand f(x) by a quadratic interpolant P(x), which is taught by Holmes:
approximating an integral for the artificial intelligence model by approximating an integrand f(x) by a quadratic interpolant P(x) of a function that defines the artificial intelligence model (Holmes discloses “The objective of this chapter is to derive and then test methods that can be used to evaluate the definite integral ∫ₐᵇ f(x) dx” (P. 231, Sec. 6.1, ¶1).
Holmes discloses the integrand f(x) is approximated by using quadratic approximation, “The next step is to try a quadratic approximation for f(x), and for this we need three data points. One option is to use xi, xi+1, and some point within the subinterval. Another option is to pair up the subintervals and use an approximation over x1 ≤ x ≤ x3, another one over x3 ≤ x ≤ x5, etc. We will use the latter option although this will require n to be even. From (5.4), the quadratic that interpolates f(x) over the interval xi−1 ≤ x ≤ xi+1 is
[Image: media_image7.png — equation (5.4), the quadratic interpolant of f(x) over the interval xi−1 ≤ x ≤ xi+1]
An example of the resulting approximation is shown in Figure 6.7 in the case when n = 4. There are two quadratics in this case, one used for 0 ≤ x ≤ 1/2 and another for 1/2 ≤ x ≤ 1. If you look closely, you will notice that over each subinterval the quadratic is above the function on one half, and below the function on the other half. This is also what happened for the midpoint rule, as seen in Figure 6.3, and it will result in the integration rule being more accurate than might be expected” (P. 241, Sec. 6.3.2, ¶1-2).
Holmes discloses Figure 6.7 on P. 240 (reproduced below) depicting a piecewise quadratic approximation p2(x) used to approximate f(x).
[Image: media_image8.png — Holmes Figure 6.7, piecewise quadratic approximation p2(x) of f(x)]
);
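For illustration only, the piecewise quadratic interpolation described by Holmes, pairing sub-intervals and integrating the quadratic interpolant over each pair (the composite Simpson's rule, which requires the number of sub-intervals n to be even), can be sketched as follows; the integrand in the usage example is a hypothetical stand-in, not a function from the cited references.

```python
def simpson(f, a, b, n):
    """Approximate the integral of f over [a, b] by replacing the
    integrand with a quadratic interpolant P(x) on each pair of
    sub-intervals x_{i-1} <= x <= x_{i+1} and summing the results.
    n must be even so the sub-intervals can be paired."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # odd interior nodes (midpoints of each pair) weight 4; even nodes weight 2
        total += (4 if i % 2 else 2) * f(a + i * h)
    return h * total / 3.0

# Integrating a quadratic interpolant over each paired sub-interval is
# exact for polynomials up to degree three, so f(x) = x^3 on [0, 1]
# is recovered exactly even with only n = 4 sub-intervals.
approx = simpson(lambda x: x ** 3, 0.0, 1.0, 4)  # exact value is 1/4
```

This accuracy beyond the interpolant's degree mirrors Holmes's observation that the quadratic sits above the function on one half of each subinterval and below it on the other, so the errors partially cancel.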
Holmes teaches that approximating an integrand f(x) by using quadratic approximation is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the quadratic approximation technique disclosed by Holmes to approximate an integral by using quadratic approximation. By approximating an integral using quadratic approximation, the curvature of f(x) can be approximated more accurately, thereby resulting in a more precise approximation of the area under the curve for an integral.
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Gonzales et al. (US 20230245128 A1), hereinafter Gonzales.
With respect to claim 11, the combination of Merrill in view of Mishra teaches:
the method of claim 2, … and wherein the method further comprises: determining a … response based on the cause; and generating for display a second recommendation for executing the … response (Merrill discloses adverse action information can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].
Merrill discloses adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve (‘second recommendation’).
Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system … provide the generated adverse action information to the operator device 171” [0131-0132]. See [0190-0191] describing how an operator device 171 can be used to display natural language explanations.).
However, the combination does not teach detecting fraudulent transactions, which is taught by Gonzales:
wherein the known label comprises a detected fraudulent transaction (Gonzales discloses “the data harvesting detection system 106 can determine that the account number is involved in fraudulent activity (e.g., via the account number being flagged, location of transaction, amount of transaction, time of transaction)” [0061].),
wherein the plurality of values indicates a transaction history of a user (Gonzales discloses “the data harvesting detection system receives the network transaction requests, within a time period, having an account number (e.g., a credit card number) and a transaction amount and determines transaction request response codes for the requests” [0020].
Gonzales discloses “the client applications 114a-114n (via the client devices 112a-112n) can provide user data activity (e.g., network transaction requests) to the data harvesting detection system 106 (via the transaction facilitator network device(s) 110 to the server device(s) 102) to detect account number harvesting” [0044].),
and wherein the method further comprises: determining a fraudulent transaction response based on the cause (Gonzales discloses “As an example, the data harvesting detection system 106 can utilize transaction request response codes for fraud alert decline responses. For instance, a transaction request response code for a fraud alert decline response can include a code to indicate a response for (detected) fraudulent activity corresponding to the account number. In particular, the data harvesting detection system 106 can determine that the account number is involved in fraudulent activity (e.g., via the account number being flagged, location of transaction, amount of transaction, time of transaction). In some instances, the data harvesting detection system 106 can determine a transaction request response code for a fraud alert decline response that indicates irregular transaction activity for the account number (e.g., increased usage, usage with an irregular transaction facilitator, irregular time or location of transaction)” [0061].
Gonzales discloses “the term “transaction request response code” refers to a label, descriptor, or other text or numeric that describes or indicates a response to a network transaction request. In particular, a transaction request response code includes a label or descriptor for a verification or rejection performed on an account number that corresponds to a network transaction request” [0030].
A transaction request response code (‘fraudulent transaction response’) can be generated to explain which irregular transaction activity (‘cause’) indicated an account number as being involved in fraudulent activity.);
Gonzales teaches that a detection system that labels transactions as fraudulent and generates transaction request response codes is known in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the detection system disclosed by Gonzales to detect fraudulent transactions. By detecting fraudulent transactions, financial businesses and customers can protect financial assets, thus preventing financial losses and maintaining trust in financial institutions.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Ables.
With respect to claim 12, the combination of Merrill in view of Mishra teaches:
the method of claim 2, … and wherein the method further comprises: determining a … response based on the cause; and generating for display a second recommendation for executing the … response (Merrill discloses adverse action information can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].
Merrill discloses adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve (‘second recommendation’).
Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system … provide the generated adverse action information to the operator device 171” [0131-0132]. See [0190-0191] describing how an operator device 171 can be used to display natural language explanations.).
However, the combination does not teach detecting cyber incidents, which is taught by Ables:
wherein the known label comprises a detected cyber incident (Ables discloses “Intrusion Detection Systems are generally utilized as part of a larger cybersecurity defense effort at an organization generally located in a Cyber-Security Operations Center (CSoC). These systems monitor networks and automate attack detection by comparing network activity to the signature of known attacks or by detecting behavior that is anomalous to benign network patterns [2]. Through these methods, a security analyst can use an IDS to detect improper use, unauthorized access, or the abuse of a network” (P. 404, Sec. I, ¶2).
Ables discloses “Figure 2a shows the local explanations for a prediction, where each feature on the y-axis has a value representing distance from its respective BMU value (See Section III). In this example, we can see the features with the largest impact on a prediction: duration, dst bytes, and src bytes. These features were the closest to the BMU, and they played a large role in computing the predicted value. Seeing the specific features that influence predictions provides insight about samples labeled as malicious or benign and can further help operators determine the reason of incorrect predictions” (P. 408, Sec. IV-C, ¶3).),
wherein the plurality of values indicates networking activity of a user (Ables discloses “The results for the NSL-KDD dataset can be found in Figures 2a and 2b. The local explanation example shows that the most important features for its prediction were ‘Duration’, ‘Destination (dst) bytes’, and ‘Source (src) bytes’. The remaining features, ‘Service (srv) count’, ‘Count’, and ‘Destination (dst) host count’ are considered less significant because of their distance from the BMU” (P. 409, Sec. V-A, ¶2).),
and wherein the method further comprises: determining a cyber incident response based on the cause (Ables discloses “Figure 2a shows the local explanations for a prediction, where each feature on the y-axis has a value representing distance from its respective BMU value (See Section III). In this example, we can see the features with the largest impact on a prediction: duration, dst bytes, and src bytes. These features were the closest to the BMU, and they played a large role in computing the predicted value. Seeing the specific features that influence predictions provides insight about samples labeled as malicious or benign and can further help operators determine the reason of incorrect predictions. These features can also be further investigated with feature value heat maps” (P. 408, Sec. IV-C, ¶3).
Ables discloses “the users need to be confident in the predictions or recommendations computed by an IDS. Understandable explanations allow users to perform their tasks correctly. The stakeholders of an IDS (e.g. CSoC operators, developers, and investors) are individuals who will be dependent on the performance of the system. CSoC operators will be performing defensive actions based on prediction and explanation results. Developers can use explanations to fortify the model in areas where it is weak. Investors may need explanations to help them in making budgeting decisions for their company” (P. 406, Sec. II-C, ¶1).);
Ables teaches that an intrusion detection system that generates explanations of which features influenced a sample being labeled as malicious or benign is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the intrusion detection system disclosed by Ables to classify network activity. By classifying network activity, activity that is classified as malicious can be further investigated, and preventative defensive actions can be taken to mitigate threats and prevent system compromises.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Merrill in view of Mishra, further in view of Bento et al. (“TimeSHAP: Explaining Recurrent Models through Sequence Perturbations”), hereinafter Bento.
With respect to claim 14, the combination of Merrill in view of Mishra teaches:
the method of claim 2, … and wherein the method further comprises: determining a … response based on the cause; and generating for display a second recommendation for executing the … response (Merrill discloses adverse action information can be generated, “the model evaluation system evaluates and explains the model (or ensemble) by generating score explanation information for a specific score generated by the ensemble model for a particular input data set. In some embodiments, the score explanation information is used to generate Adverse Action information” [0033].
Merrill discloses “when generating a decision to deny a consumer credit application, lenders are required to provide to each consumer the reasons why the credit application was denied, in terms of factors the model actually used, that the consumer can take practical steps to improve. These adverse action reasons and notices …” [0018].
Merrill discloses adverse action information can be generated from score explanation information (‘cause’). The adverse action information consists of reasons why a credit application was denied (‘response’) and steps a consumer can take to improve (‘second recommendation’).
Merrill discloses “the model evaluation system uses a decomposition generated for a model score (e.g., at one or more of S230, S240, S250) to generate feature importance information and provide the generated feature importance information to the operator device 171 … the model evaluation system … provide the generated adverse action information to the operator device 171” [0131-0132]. See [0190-0191] describing how an operator device 171 can be used to display natural language explanations.).
However, the combination does not teach detecting identity theft, which is taught by Bento:
wherein the known label comprises a detected identity theft (Bento discloses “We use TimeSHAP to explain the predictions of a real-world bank account takeover fraud detection RNN model, and draw key insights from its explanations: i) the model identifies important features and events aligned with what fraud analysts consider cues for account takeover” (P. 2565, Abstract).),
wherein the plurality of values indicates a user transaction history (Bento discloses “To validate our method, we use it to explain predictions of a GRU based model on a real-world account takeover fraud detection task. Account takeover fraud is a form of identity theft where a fraudster gains access to a victim’s bank account, enabling them to place unauthorized transactions [33, 34]. … The data is tabular, consisting of approximately 20M instances. Each instance, dubbed from here on as an event, represents one of three behaviours, or event types: transaction, where a client performs a monetary transaction; login, representing a client login on the banking application or website; or enrollment, representing account settings behaviours, such as logging into a new device or changing the password” (P. 2569, Sec. 4, ¶1-2).
Bento discloses “Figure 4 shows a plot of the global feature-wise explanations. We observe that some features have predominantly positive contributions, meaning that the model routinely relies on these features to predict account takeover fraud. These include the Transaction (0.29) and Event (0.092) types, the clients’ age (0.090), and features related to the IP and location of the events (0.08 to 0.03)” (P. 2570-2571, Sec. 4.2, Last Paragraph).),
and wherein the method further comprises: determining an identity theft response based on the cause (Bento discloses “Regarding feature importances, the most relevant features are related to the transaction type, event type, the clients’ age, and the location. When inspecting the raw feature data, we observe that the client is in the elderly age range, which, as previously mentioned, may indicate a more susceptible demographic. When analyzing the location features Location feature A and Location feature D, we observe a discrepancy between the location of the enrollment, login and transactions from the account’s history. This discrepancy in physical location is highly suspicious and indicates that there was an enrollment on the account from a previously unused location” (P. 2571-2572, Sec. 4.3, Last Paragraph).
Bento discloses “Local explanations explain the model’s rationale regarding one specific instance. These explanations can be used in several use cases, for example, for bias auditing or model debugging. However, these explanations can mostly be used by end-users, the fraud analysts, to aid their decision-making tasks” (P. 2571, Sec. 4.3, ¶1).
Local explanations that explain which features contribute the most towards a model’s account takeover prediction (therefore local explanations are a cause), can be used for bias auditing, model debugging, and decision-making tasks (therefore these actions are an identity theft response).);
Bento teaches that using local explanations to explain which features contribute the most toward predicting identity theft is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the machine learning model explainability method of Merrill with the technique disclosed by Bento to use a machine learning model to predict identity theft. By using a machine learning model to predict identity theft, a model can automatically classify transactions as fraudulent, thereby enabling real-time fraud detection and preventing account takeover.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PEDRO J MORALES whose telephone number is (571)272-6106. The examiner can normally be reached 8:30 AM - 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MIRANDA M HUANG can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PEDRO J MORALES/Examiner, Art Unit 2124
/VINCENT GONZALES/Primary Examiner, Art Unit 2124