DETAILED ACTION
This Final communication is in response to Application No. 17/657,619 filed 3/31/2022 which claims priority from 63/260,909 filed 09/03/2021. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 10/13/2025 which provides amendments to claims 1, 15, and 20 has been entered. Claims 1-20 are pending. Applicant’s drawings submitted on 03/31/2022 have been approved.
Response to Arguments
Applicant’s arguments with respect to 35 U.S.C. § 101 filed 10/13/2025 have been fully considered but they are not persuasive.
Applicant argues that the claims are not directed to an abstract idea of a mental process (page 15 of remarks/arguments). The examiner respectfully disagrees. Claim 1 recites determining atomic steps, instrumenting an ML pipeline, constructing a feature provenance graph, and identifying discarded features. Under their broadest reasonable interpretation, these steps can be performed in the human mind or with pen and paper. A human could look at an ML pipeline and determine its atomic steps. A human could instrument an ML pipeline (i.e., inject extra lines of code; see paragraph [0039] of the specification) using pen and paper. A human could draw a feature provenance graph using pen and paper. Finally, a human could identify discarded features from the graph.
Applicant also argues that the claimed invention is a technical improvement (page 16). Applicant cites paragraph [0044], which states that the disclosure may “identify the one or more discarded features and may provide cleaner ML pipelines and data-frames”, and argues that this can lead to an increase in machine learning accuracy. According to MPEP 2106.05(a), the improvement cannot come from the judicial exception (abstract idea) alone. The additional elements of the claims relate to receiving the model, executing the model, and outputting logs for the executed pipeline, and do not themselves provide the asserted improvement.
Thus, the 101 rejection is maintained.
Applicant’s arguments with respect to 35 U.S.C. § 103 filed 10/13/2025 have been considered but they are not persuasive.
Applicant argues that Floratou fails to disclose or suggest the “atomized ML pipeline”. The examiner respectfully disagrees. The broadest reasonable interpretation of an atomized ML pipeline is an ML pipeline broken into discrete steps. Floratou discloses in [0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script”. The elements of the script can be seen as atomic steps, and the script is thus broken down into those elements to build the workflow model. This is essentially an atomized ML pipeline.
Applicant’s argument that neither Floratou nor Loyola discloses “generating a log of actions based on the execution of the instrumental ML pipeline” has been considered but is moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
101 Subject Matter Eligibility analysis
Step 1: Claims 1-20 are within the four statutory categories (a process, machine, manufacture, or composition of matter). Claims 1-14 describe a process, and claims 15-20 describe a machine.
With respect to claim 1:
Step 2A Prong 1: The claim recites an abstract idea enumerated in the 2019 PEG.
determining one or more atomic steps corresponding to the ML pipeline to determine an atomized ML pipeline, each atomic step corresponding to a unitary operation being one of assigning a variable, renaming a feature, deleting a feature; (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
instrumenting the atomized ML pipeline to determine an instrumented ML pipeline including one or more operations corresponding to the received ML project; (This is an abstract idea of a "Mental Process." The "instrumenting" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper to modify the code.)
constructing a feature provenance graph (FPG) based on the ML pipeline and the captured one or more data-frame snapshots, the data-frame snap shots relating to the generated log of actions; and (This is an abstract idea of a "Mental Process." The "constructing" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper to create a feature provenance graph.)
identifying one or more discarded features, from the plurality of features corresponding to the received ML project, based on the constructed FPG. (This is an abstract idea of a "Mental Process." The "identifying" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The identification could be made manually by an individual).
Step 2A Prong 2: The judicial exception is not integrated into a practical application.
Additional elements:
receiving an ML project including a data-frame and an ML pipeline, the ML pipeline includes a plurality of code statements associated with a plurality of features corresponding to an ML project; (this limitation amounts to adding insignificant extra-solution activity to the judicial exception).
executing the instrumented ML pipeline to capture one or more data-frame snapshots based on each of the one or more operations; (This amounts to no more than mere instructions to “apply” the exception using a generic computer component.)
generating a log of actions based on the execution of the instrumental ML pipeline; (this limitation amounts to adding insignificant extra-solution activity to the judicial exception).
Step 2B: the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
The additional elements “receiving an ML project…” and “generating a log…” add insignificant extra-solution activity to the judicial exception and cannot provide an inventive concept.
As explained above, the additional element “executing…” is recited at a generic level and represents a generic computer component used to apply the abstract idea. Mere instructions to apply an exception cannot provide an inventive concept.
When considered in combination, these additional elements represent mere instructions to apply an exception and insignificant extra-solution activity, which do not provide an inventive concept.
Therefore, claim 1 is ineligible.
With respect to claim 2:
Step 2A Prong 1: claim 2, which incorporates the rejection of claim 1, recites an additional abstract idea:
the determination of the one or more atomic steps corresponding to the ML pipeline is based on an application of a source-code lifting technique on the plurality of code statements. (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
Step 2A Prong 2: claim 2 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 2 does not recite any additional elements.
Therefore, claim 2 is ineligible.
With respect to claim 3:
Step 2A Prong 1: claim 3, which incorporates the rejection of claim 1, recites an additional abstract idea:
the instrumentation of the atomized ML pipeline is based on an application of a method call injection technique on the atomized ML pipeline; (This is an abstract idea of a "Mental Process." The "instrumenting" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper to modify the code.)
Step 2A Prong 2: claim 3 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 3 does not recite any additional elements.
Therefore, claim 3 is ineligible.
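For illustration only, the kind of code modification referred to as method call injection can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not drawn from the specification or from any cited reference: the hook name `_snapshot`, the helper `instrument`, and the two-line toy pipeline are all hypothetical. The sketch inserts a logging call after each simple assignment and then executes the modified code so that each call records a line number, a variable name, and the value assigned.

```python
import ast

def instrument(source, hook="_snapshot"):
    """Return an AST in which a call to `hook` follows each simple assignment.

    `hook` is a hypothetical logging function supplied by the caller.
    """
    tree = ast.parse(source)
    new_body = []
    for stmt in tree.body:
        new_body.append(stmt)
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            name = stmt.targets[0].id
            # Injected statement, e.g. _snapshot(1, 'x', x)
            injected = ast.parse(f"{hook}({stmt.lineno}, {name!r}, {name})").body[0]
            new_body.append(injected)
    tree.body = new_body
    ast.fix_missing_locations(tree)
    return tree

log = []  # log of actions produced by executing the instrumented code

def _snapshot(lineno, name, value):
    log.append((lineno, name, value))

pipeline = "x = 1\ny = x + 2\n"  # toy stand-in for an atomized pipeline
exec(compile(instrument(pipeline), "<pipeline>", "exec"), {"_snapshot": _snapshot})
print(log)
```

Each entry of `log` pairs a source line with the variable written at that line, which is the sort of per-operation record that a later analysis step could consume.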
With respect to claim 4:
Step 2A Prong 1: claim 4, which incorporates the rejection of claim 1, does not recite an additional abstract idea.
Step 2A Prong 2: The judicial exception is not integrated into a practical application.
each of the captured one or more data-frame snapshots comprises at least one of: a line number, an input and an output of each variable, and a set of feature names associated with a data-frame type; (this limitation merely limits the judicial exception to a particular field of use.)
Step 2B: the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
The additional element merely limits the judicial exception to a particular field of use and also cannot provide an inventive concept (MPEP 2106.05(h)).
Therefore, claim 4 is ineligible.
With respect to claim 5:
Step 2A Prong 1: claim 5, which incorporates the rejection of claim 1, recites additional abstract ideas:
initializing the feature provenance graph (FPG) as a directed acyclic graph including one node for each feature associated with the captured one or more data-frame snapshots; (This is an abstract idea of a "Mental Process." The "initializing" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
selecting a first operation of the one or more operations to analyze a data-frame snapshot of the captured one or more data-frame snapshots; (This is an abstract idea of a "Mental Process." The "selecting" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human. The selecting could be made manually by an individual.)
adding a layer of nodes in the FPG for each feature associated with the output information of the data-frame snapshot; and (This is an abstract idea of a "Mental Process." The "adding" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
adding a directed edge in the FPG from a first node associated with the input information to a second node associated with the output information, based on a correspondence between a first name of a first feature associated with the first node and a second name of a second feature associated with the second node; (This is an abstract idea of a "Mental Process." The "adding" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
Step 2A Prong 2: The judicial exception is not integrated into a practical application.
retrieving an abstract syntax tree (AST) associated with the atomized ML pipeline; (this limitation is a well-understood, routine, conventional activity.)
the data-frame snapshot corresponds to the selected first operation and includes input information and output information (this limitation merely limits the judicial exception to a particular field of use.)
Step 2B: the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As mentioned above the additional element of “retrieving...” is a well-understood, routine, conventional activity and does not provide an inventive concept.
The additional element, “the data-frame…”, merely limits the judicial exception to a particular field of use and also cannot provide an inventive concept (MPEP 2106.05(h)).
When considered in combination, these additional elements represent well-understood, routine, conventional activity and field of use, which do not provide an inventive concept.
Therefore, claim 5 is ineligible.
With respect to claim 6:
Step 2A Prong 1: claim 6, which incorporates the rejection of claim 5, recites additional abstract ideas:
identifying, based on the retrieved AST, one or more features and an operation name associated with the selected first operation; (This is an abstract idea of a "Mental Process." The "identifying" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The identification could be made manually by an individual).
determining whether the operation name corresponds to a delete operation; (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
adding a null successor in the FPG for each node of the layer of nodes associated with the input information of the data-frame snapshot, based on the determination that the operation name corresponds to the delete operation; (This is an abstract idea of a "Mental Process." The "adding" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
determining whether the operation name corresponds to a rename operation; (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
adding a directed edge in the FPG from a node associated with the input information of the data-frame snapshot to a corresponding re-named successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name corresponds to the rename operation; and (This is an abstract idea of a "Mental Process." The "adding" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
adding a directed edge in the FPG from each node of the layer of nodes associated with the input information of the data-frame snapshot to a corresponding successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name is other than the rename operation or the delete operation. (This is an abstract idea of a "Mental Process." The "adding" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
Step 2A Prong 2: claim 6 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 6 does not recite an additional element.
Therefore, claim 6 is ineligible.
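For illustration only, the edge-adding rules recited in claim 6 (null successor on delete, edge to the renamed node on rename, edge to the same-named node otherwise) can be sketched in Python as follows. The snapshot layout (keys `op`, `inputs`, `outputs`, `rename_map`) and the function name `build_fpg` are hypothetical; the sketch also assumes that, for a delete operation, only the input features missing from the output receive the null successor.

```python
from collections import defaultdict

def build_fpg(snapshots):
    """Build a layered feature provenance graph as an adjacency map.

    Nodes are (layer, feature_name) tuples; a None successor marks a
    deleted feature. The snapshot keys used here are assumptions.
    """
    edges = defaultdict(list)
    for layer, snap in enumerate(snapshots):
        outputs = set(snap["outputs"])
        rename_map = snap.get("rename_map", {})
        for name in snap["inputs"]:
            src = (layer, name)
            if snap["op"] == "delete" and name not in outputs:
                edges[src].append(None)                            # null successor
            elif snap["op"] == "rename" and name in rename_map:
                edges[src].append((layer + 1, rename_map[name]))   # renamed successor
            elif name in outputs:
                edges[src].append((layer + 1, name))               # unchanged feature
    return dict(edges)

snapshots = [
    {"op": "delete", "inputs": ["a", "b"], "outputs": ["a"]},
    {"op": "rename", "inputs": ["a"], "outputs": ["x"], "rename_map": {"a": "x"}},
]
fpg = build_fpg(snapshots)
print(fpg)
```

In this toy input, feature `b` ends in a null successor at the first operation, while feature `a` is carried forward and then linked to its renamed successor `x`.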
With respect to claim 7:
Step 2A Prong 1: claim 7, which incorporates the rejection of claim 1, recites additional abstract ideas:
selecting a node from a data-frame snapshot in the retrieved FPG; (This is an abstract idea of a "Mental Process." The "selecting" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The selecting could be made manually by an individual.)
performing a depth first search of the retrieved FPG from the selected node by marking each visited node in the retrieved FPG; and (This is an abstract idea of a "Mental Process." The "performing" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
determining one or more unmarked nodes in the retrieved FPG, based on the performed depth first search, wherein the one or more discarded features are identified based on the determined one or more unmarked nodes. (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
Step 2A Prong 2: The judicial exception is not integrated into a practical application.
retrieving the constructed feature provenance graph (FPG); (this limitation is a well-understood, routine, conventional activity.)
Step 2B: the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As mentioned above the additional element of “retrieving...” is a well-understood, routine, conventional activity and does not provide an inventive concept.
Therefore, claim 7 is ineligible.
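For illustration only, the mark-and-collect search recited in claim 7 can be sketched in Python as follows. The function name `unmarked_nodes`, the node names, and the toy graph are hypothetical. The graph is stored with reversed edges (each node maps to its predecessors), so that a search started from the final feature corresponds to a backward depth first search.

```python
def unmarked_nodes(edges, start):
    """Depth-first search from `start`; return the nodes that were never marked."""
    nodes = set(edges)
    for successors in edges.values():
        nodes.update(successors)
    marked = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)                      # mark each visited node
        stack.extend(edges.get(node, []))
    return nodes - marked

# Toy graph, edges reversed: each node lists the nodes it was derived from.
predecessors = {
    "model_input": ["age_scaled"],
    "age_scaled": ["age"],
    "id": [],                                 # never feeds the final feature
}
discarded = unmarked_nodes(predecessors, "model_input")
print(discarded)
```

Nodes left unmarked after the search are those with no path to the selected node, which in this toy graph is the single feature `id`.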
With respect to claim 8:
Step 2A Prong 1: claim 8, which incorporates the rejection of claim 7, recites an additional abstract idea:
the depth first search corresponds to at least one of: a forward depth first search or a backward depth first search (This is an abstract idea of a "Mental Process." The "depth first search" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper.)
Step 2A Prong 2: claim 8 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 8 does not recite an additional element.
Therefore, claim 8 is ineligible.
With respect to claim 9:
Step 2A Prong 1: claim 9, which incorporates the rejection of claim 1, recites additional abstract ideas:
constructing an abstract syntax tree (AST) of the atomized ML pipeline; (This is an abstract idea of a "Mental Process." The "constructing" step under its broadest reasonable interpretation, covers concepts that can be practically performed by a human using a pen and paper to create an abstract syntax tree.)
traversing the constructed AST of the atomized ML pipeline to identify a rename operation; (This is an abstract idea of a "Mental Process." The "traversing" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The traversing could be done manually by an individual.)
determining a parameter map associated with the identified rename operation; and (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
extracting a first mapping between the plurality of features corresponding to the received ML project and a set of first features associated with the ML pipeline, based on the determined parameter map, wherein (This is an abstract idea of a "Mental Process." The "extracting" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The extracting could be done manually by an individual.)
the one or more discarded features are identified further based on the extracted first mapping. (This is an abstract idea of a "Mental Process." The "identifying" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The identification could be made manually by an individual).
Step 2A Prong 2: claim 9 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 9 does not recite an additional element.
Therefore, claim 9 is ineligible.
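For illustration only, traversing an AST to find a rename operation and read its parameter map can be sketched with Python's standard `ast` module as follows. The function name `rename_maps` and the sample statement are hypothetical; the sketch assumes the rename is written as a method call named `rename` with a literal dictionary passed as the `columns` keyword, as in the pandas `DataFrame.rename` call style.

```python
import ast

SOURCE = "df = df.rename(columns={'cp': 'chest_pain', 'thal': 'thal_type'})"

def rename_maps(source):
    """Collect the literal `columns` mapping of every `.rename(...)` call."""
    maps = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "rename"):
            for kw in node.keywords:
                # Only literal dictionaries can be read without running the code.
                if kw.arg == "columns" and isinstance(kw.value, ast.Dict):
                    maps.append(ast.literal_eval(kw.value))
    return maps

print(rename_maps(SOURCE))
```

Each returned dictionary maps an original feature name to its new name, which is one way a mapping between original and renamed features could be extracted without executing the pipeline.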
With respect to claim 10:
Step 2A Prong 1: claim 10, which incorporates the rejection of claim 9, recites an additional abstract idea:
the set of first features correspond to one or more features that are renamed based on a user input; (This is an abstract idea of a "Mental Process." This step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind).
Step 2A Prong 2: claim 10 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 10 does not recite an additional element.
Therefore, claim 10 is ineligible.
With respect to claim 11:
Step 2A Prong 1: claim 11, which incorporates the rejection of claim 1, recites additional abstract ideas:
traversing through each of the one or more atomic steps corresponding to the ML pipeline to track an update of a name associated with each of a set of second features associated with the ML pipeline; and (This is an abstract idea of a "Mental Process." The "traversing" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The traversing could be done manually by an individual.)
extracting a second mapping between the plurality of features corresponding to the received ML project and the set of second features, wherein the one or more discarded features are identified further based on the extracted second mapping (This is an abstract idea of a "Mental Process." The "extracting" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The extracting could be done manually by an individual.)
Step 2A Prong 2: claim 11 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 11 does not recite an additional element.
Therefore, claim 11 is ineligible.
With respect to claim 12:
Step 2A Prong 1: claim 12, which incorporates the rejection of claim 11, recites an additional abstract idea:
the set of second features correspond to one or more features that are renamed based on a set of predefined rules (This is an abstract idea of a "Mental Process." This step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind).
Step 2A Prong 2: claim 12 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 12 does not recite an additional element.
Therefore, claim 12 is ineligible.
With respect to claim 13:
Step 2A Prong 1: claim 13, which incorporates the rejection of claim 11, recites an additional abstract idea:
the plurality of features corresponds to an initial data-frame of the captured one or more data-frame snapshots and the set of second features corresponds to a final data-frame of the captured one or more data-frame snapshots; (This is an abstract idea of a "Mental Process." This step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind).
Step 2A Prong 2: claim 13 does not recite any additional elements and thus cannot be integrated into a practical application.
Step 2B: claim 13 does not recite an additional element.
Therefore, claim 13 is ineligible.
With respect to claim 14:
Step 2A Prong 1: claim 14, which incorporates the rejection of claim 1, recites an additional abstract idea:
determining one or more relevant features, from the plurality of features corresponding to the received ML project, based on the identified one or more discarded features; and (This is an abstract idea of a "Mental Process." The "determining" step under its broadest reasonable interpretation, covers concepts that can be practically performed in the human mind. The determination could be made manually by an individual.)
Step 2A Prong 2: The judicial exception is not integrated into a practical application.
training a meta-learning model associated with the ML project based on the determined one or more relevant features (This amounts to no more than mere instructions to “apply” the exception using a generic computer component.)
Step 2B: the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As explained above, the additional element “training…” is recited at a generic level and represents a generic computer component used to apply the abstract idea. Mere instructions to apply an exception cannot provide an inventive concept.
Therefore, claim 14 is ineligible.
With respect to claim 15:
The claim recites similar limitations as corresponding to claim 1. Therefore, the same subject matter analysis that was utilized for claim 1, as described above, is equally applicable to claim 15. Therefore, claim 15 is ineligible.
With respect to claim 16:
The claim recites similar limitations as corresponding to claim 5. Therefore, the same subject matter analysis that was utilized for claim 5, as described above, is equally applicable to claim 16. Therefore, claim 16 is ineligible.
With respect to claim 17:
The claim recites similar limitations as corresponding to claim 6. Therefore, the same subject matter analysis that was utilized for claim 6, as described above, is equally applicable to claim 17. Therefore, claim 17 is ineligible.
With respect to claim 18:
The claim recites similar limitations as corresponding to claim 7. Therefore, the same subject matter analysis that was utilized for claim 7, as described above, is equally applicable to claim 18. Therefore, claim 18 is ineligible.
With respect to claim 19:
The claim recites similar limitations as corresponding to claim 14. Therefore, the same subject matter analysis that was utilized for claim 14, as described above, is equally applicable to claim 19. Therefore, claim 19 is ineligible.
With respect to claim 20:
The claim recites similar limitations as corresponding to claim 1. Therefore, the same subject matter analysis that was utilized for claim 1, as described above, is equally applicable to claim 20. Therefore, claim 20 is ineligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2 and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Floratou (US 2021/0216905 A1) in view of Chen (US 2022/0036246 A1).
Regarding claim 1, Floratou teaches:
receiving an ML project including a data-frame and an ML pipeline, the ML pipeline includes a plurality of code statements associated with a plurality of features corresponding to an ML project; ([0039] “A Data Source D can be a database table/view, a spreadsheet, or any other external files that may typically be used in Python scripts to access the input data” and [0040] “A common ML pipeline accesses data source D” and Fig. 6 Step 602 [0091] “At step 602, machine learning (ML) model code is received”).
determining one or more atomic steps corresponding to the ML pipeline to determine an atomized ML pipeline, each atomic step corresponding to a unitary operation being one of assigning a variable, renaming a feature, deleting a feature; ([0047] “Embodiments of provenance tracker 210 may determine a set of columns (or rows or cells) that were explicitly included in or excluded from the features/labels by using the annotated WIR and consulting knowledge base 220” and [0078] “Based on this knowledge, knowledge base 220 may be further configured to include information about such operations to assist provenance tracker 210 in executing a provenance tracking algorithm. For example, embodiments of knowledge base 220 are further configured to include a table consisting of two types of operations as follows: 1) operations from various Python libraries that exclude columns (e.g., drop and delete in the Pandas library) or explicitly select a subset of columns (e.g., iloc and ix), and 2) a few native Python operations such as Subscript, ExtSlice, Slice, Index, and Delete” where operations are similar to “atomic steps”. Also, the atomized ML pipeline is disclosed because the script is broken down into a workflow model: [0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script”).
instrumenting the atomized ML pipeline to determine an instrumented ML pipeline including one or more operations corresponding to the received ML project; ([0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script including imported libraries, input arguments, operations that change the state of the program, and the derived output variables. This model is captured in a workflow intermediate representation (“WIR”). A WIR may be understood in terms of variables, operations and provenance relationships (“PRs”).”).
executing the instrumented ML pipeline to capture one or more data-frame snapshots based on each of the one or more operations; ([0089] “As such, provenance tracker 210 may operate in two modes: for static analysis as described herein above, or for dynamic analysis where embodiments would return a new Python script including functional modifications (based on annotations as described above)”)
constructing a feature provenance graph (FPG) based on the ML pipeline and the captured one or more data-frame snapshots, the data-frame snap shots relating to the generated log of actions; and ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph” and [0052] “PRs are composed together to form a WIR G, which is a directed graph that represents the sequence and dependencies among the extracted PRs. The WIR is useful to answer queries such as: “Which variables were derived from other variables?”, “What type of libraries and modules were used?”, and “What operations were applied to each variable?”).
identifying one or more discarded features, from the plurality of features corresponding to the received ML project, based on the constructed FPG. ([0110] “In step 1002, for each PR in the directed graph, feature variables are determined comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
Floratou does not teach:
An automatic Machine Learning (ML) pipeline generation method, executed by a processor,
Generating a log of actions based on the execution of the instrumental ML pipeline;
However, Chen does:
An automatic Machine Learning (ML) pipeline generation method, executed by a processor, ([0002] “ In one or more embodiments described herein, systems, computer-implemented methods, apparatuses and/or computer program products that can automate the generation of machine learning pipelines based on one or more characteristics of time series data are described.”)
generating a log of actions based on the execution of the instrumented ML pipeline; ([0081] “Additionally, the task component 1002 can generate one or more reports that include the outputs of the machine learning task and/or explanations regarding one or more features of the automated machine learning process.”)
Floratou and Chen are considered analogous art to the claimed invention because they are in the same field of endeavor of editing source code. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the machine learning data tracking and workflow decomposition of Floratou with the automatic ML pipeline generation and log reporting of Chen. One would be motivated to do this to create accurate automatic ML pipelines.
Regarding claim 2, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
wherein the determination of the one or more atomic steps corresponding to the ML pipeline is based on an application of a source-code lifting technique on the plurality of code statements ([0056] “In an embodiment, derivation extractor 202 is configured to generate a complete WIR from Script 1 according to a three-step process. First, derivation extractor 202 parses Script 1 to obtain a corresponding abstract syntax tree (“AST”) representation of Script 1. Generally speaking, and as known in the art, an AST is a tree representation of the abstract syntactic structure of source code written in a programming language. For example, FIG. 4 depicts a partial example AST 400 corresponding to a portion of Script 1 (shown above), and which partially corresponds to WIR 300 of FIG. 3. More specifically, FIG. 4 depicts a fraction of an AST 400 generated from line 4 of Script 1 (i.e., generated from train_df=pd.read_csv(‘heart_disease.csv’)). AST 400 is a collection of nodes that are linked together based on the grammar of the Python language. Informally, by traversing AST 400 from left-to-right and top-to-bottom, one can visit the Python statements in the order presented in Script 1.”).
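For illustration only, and not as part of the cited disclosure, the parsing step quoted from [0056] can be sketched with Python's standard `ast` module. The sketch assumes the statement quoted above (line 4 of Script 1) is the only input; the variable names are hypothetical and are not taken from Floratou.

```python
import ast

# The statement quoted from Floratou [0056] (line 4 of Script 1).
source = "train_df = pd.read_csv('heart_disease.csv')"

tree = ast.parse(source)
assign = tree.body[0]                 # the ast.Assign node for the statement

# Hypothetical names for the parts of the statement recovered from the AST.
target_name = assign.targets[0].id    # variable being assigned
call = assign.value                   # the ast.Call on the right-hand side
caller_name = call.func.value.id      # object the method is called on
operation_name = call.func.attr       # method name
argument = call.args[0].value         # the string literal argument
```

Traversing the tree in this way yields the assigned variable, the caller, the operation, and the argument of the statement without executing it.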
Regarding claim 4, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
wherein each of the captured one or more data-frame snapshots comprises at least one of: a line number, an input and an output of each variable, and a set of feature names associated with a data-frame type ([0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script including imported libraries, input arguments, operations that change the state of the program, and the derived output variables.”).
Regarding claim 5, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
initializing the feature provenance graph (FPG) as a directed acyclic graph including one node for each feature associated with the captured one or more data-frame snapshots; ([0051] “A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I”).
retrieving an abstract syntax tree (AST) associated with the atomized ML pipeline; ([0056] “In an embodiment, derivation extractor 202 is configured to generate a complete WIR from Script 1 according to a three-step process. First, derivation extractor 202 parses Script 1 to obtain a corresponding abstract syntax tree (“AST”) representation of Script 1. Generally speaking, and as known in the art, an AST is a tree representation of the abstract syntactic structure of source code written in a programming language.”).
selecting a first operation of the one or more operations to analyze a data-frame snapshot of the captured one or more data-frame snapshots, wherein the data-frame snapshot corresponds to the selected first operation and includes input information and output information; ([0009] “In another aspect, embodiments are configured to generate PRs by traversing the AST starting at each root node of the AST and for each node of the AST that is not a literal or constant, determining an operation corresponding to the node, and recursively determining each one or more inputs, a caller and one or more outputs corresponding to the node, wherein the determined operation, one or more inputs, caller and one or more outputs together comprise a generated PR.”).
adding a layer of nodes in the FPG for each feature associated with the output information of the data-frame snapshot; and adding a directed edge in the FPG from a first node associated with the input information to a second node associated with the output information, based on a correspondence between a first name of a first feature associated with the first node and a second name of a second feature associated with the second node. ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I, (2) a caller edge (labeled as ‘caller_edge’) (c,p) if p is called by c, and (3) a set of output edges (labeled as ‘output_edge’), where there is an output edge (p,v) for each v∈O. For consistency, we create a temporary output variable for the operations that do not explicitly generate one.”).
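As a non-authoritative sketch of the structure quoted from [0051], a provenance relationship (I, c, p, O) can be expanded into the three kinds of labeled edges the passage describes. The class name `PR`, the function name `pr_to_edges`, and the example values are assumptions made for illustration and do not appear in Floratou.

```python
from typing import List, NamedTuple, Optional, Tuple

class PR(NamedTuple):
    """Hypothetical container for the quadruple (I, c, p, O)."""
    inputs: List[str]        # I: ordered input variables
    caller: Optional[str]    # c: optional caller object
    operation: str           # p: the operation
    outputs: List[str]       # O: ordered output variables

def pr_to_edges(pr: PR) -> List[Tuple[str, str, str]]:
    """Return (source, destination, label) edges for one PR."""
    # One input edge (v, p) for each v in I.
    edges = [(v, pr.operation, "input_edge") for v in pr.inputs]
    # One caller edge (c, p) if p is called by c.
    if pr.caller is not None:
        edges.append((pr.caller, pr.operation, "caller_edge"))
    # One output edge (p, v) for each v in O.
    edges.extend((pr.operation, v, "output_edge") for v in pr.outputs)
    return edges

# Illustrative PR for the statement train_df = pd.read_csv('heart_disease.csv').
example = PR(inputs=["heart_disease.csv"], caller="pd",
             operation="read_csv", outputs=["train_df"])
edges = pr_to_edges(example)
```

Composing the edge lists of successive PRs gives a directed graph of the kind [0052] refers to as the WIR.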
Regarding claim 6, Floratou in view of Chen teaches claim 5 as outlined above. Floratou further teaches:
identifying, based on the retrieved AST, one or more features and an operation name associated with the selected first operation; ([0058] “Returning to the description of derivation extractor 202, after generating an AST from a script, derivation extractor 202 then performs the second and third step of the WIR generation process by a) identifying the relationships between the nodes of the AST to generate PRs, and b) composing the generated PRs into the directed graph G (i.e., the WIR of the input script). In an embodiment, these two steps may be performed together with a recursive algorithm (owing to the recursive nature of AST node definitions).”).
determining whether the operation name corresponds to a delete operation; ([0077] “Generally speaking, provenance tracker 210 operates to identify the columns by examining the operations in annotated WIR 208 that are connected to variables that contain features and labels in their annotation set, and which may act to either select or exclude certain columns in a data source. That is, there are various operations that take features (or labels) as their caller/input, and may apply transformations, drop a set of columns from it, select a subset of rows upon satisfaction of a condition, copy it into another variable, and/or use it for visualization and the like.” Here excluding refers to the “delete operation”).
adding a null successor in the FPG for each node of the layer of nodes associated with the input information of the data-frame snapshot, based on the determination that the operation name corresponds to the delete operation; ([0051] “create a temporary output variable for the operations that do not explicitly generate one.”).
determining whether the operation name corresponds to a rename operation; ([0077] “Generally speaking, provenance tracker 210 operates to identify the columns by examining the operations in annotated WIR 208 that are connected to variables that contain features and labels in their annotation set, and which may act to either select or exclude certain columns in a data source. That is, there are various operations that take features (or labels) as their caller/input, and may apply transformations, drop a set of columns from it, select a subset of rows upon satisfaction of a condition, copy it into another variable, and/or use it for visualization and the like.”).
adding a directed edge in the FPG from a node associated with the input information of the data-frame snapshot to a corresponding re-named successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name corresponds to the rename operation; and adding a directed edge in the FPG from each node of the layer of nodes associated with the input information of the data-frame snapshot to a corresponding successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name is other than the rename operation or the delete operation. ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I, (2) a caller edge (labeled as ‘caller_edge’) (c,p) if p is called by c, and (3) a set of output edges (labeled as ‘output_edge’), where there is an output edge (p,v) for each v∈O. For consistency, we create a temporary output variable for the operations that do not explicitly generate one.”).
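To make the three branches recited in claim 6 concrete, the following is a minimal sketch of one possible reading of the limitations; it is neither Applicant's implementation nor Floratou's. It assumes each snapshot supplies input and output feature names, that nodes are hypothetical `(layer, name)` pairs, that a null successor is represented by the name `None`, and that only features absent from the output receive a null successor on a delete operation.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, Optional[str]]   # (layer index, feature name); None = null successor

def add_layer(fpg: Dict[Node, List[Node]], layer: int,
              inputs: List[str], outputs: List[str], op_name: str,
              rename_map: Optional[Dict[str, str]] = None) -> None:
    """Add one layer of successor nodes and edges for a single operation."""
    rename_map = rename_map or {}
    for name in inputs:
        src = (layer, name)
        fpg.setdefault(src, [])
        if op_name == "delete" and name not in outputs:
            dst = (layer + 1, None)                # deleted feature: null successor
        elif op_name == "rename" and name in rename_map:
            dst = (layer + 1, rename_map[name])    # renamed successor
        elif name in outputs:
            dst = (layer + 1, name)                # unchanged feature: same-name successor
        else:
            continue                               # no matching successor found
        fpg[src].append(dst)
        fpg.setdefault(dst, [])

# Hypothetical two-step pipeline: delete 'id', then rename 'sex' to 'gender'.
fpg: Dict[Node, List[Node]] = {}
add_layer(fpg, 0, ["age", "sex", "id"], ["age", "sex"], "delete")
add_layer(fpg, 1, ["age", "sex"], ["age", "gender"], "rename", {"sex": "gender"})
```

In this sketch the feature `id` ends at a null successor, while `sex` is connected to its renamed successor `gender`.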
Regarding claim 7, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
retrieving the constructed feature provenance graph (FPG); ([0132] “adding annotations to the PRs of the directed graph corresponding to the WIR”).
selecting a node from a data-frame snapshot in the retrieved FPG; ([0132] “beginning at each import process node of the directed graph”).
performing a depth first search of the retrieved FPG from the selected node by marking each visited node in the retrieved FPG; and ([0132] “performing a forward traversal of PRs in the directed graph, and for each PR encountered in the forward direction (forward PR): querying the KB for first input annotations corresponding to the one or more input variables of the forward PR”).
determining one or more unmarked nodes in the retrieved FPG, based on the performed depth first search, wherein the one or more discarded features are identified based on the determined one or more unmarked nodes ([0133] “determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
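The marking step recited in claim 7 can be illustrated as follows. This is a hedged sketch only: it assumes a backward depth first search starting from the features of the final data-frame, a hypothetical graph stored as predecessor lists, and hypothetical node labels of the form `layer:feature`. Neither the claim nor Floratou prescribes this representation.

```python
from typing import Dict, List, Set

def backward_dfs(predecessors: Dict[str, List[str]], start: str,
                 marked: Set[str]) -> None:
    """Iterative depth first search that marks every node reachable backwards."""
    stack = [start]
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(predecessors.get(node, []))

# Hypothetical FPG: 'id' is deleted at step 0, 'sex' is renamed at step 1.
predecessors = {
    "2:age": ["1:age"],
    "2:gender": ["1:sex"],
    "1:age": ["0:age"],
    "1:sex": ["0:sex"],
    "1:<null>": ["0:id"],
}
initial = ["0:age", "0:sex", "0:id"]   # features of the first snapshot
final = ["2:age", "2:gender"]          # features of the last snapshot

marked: Set[str] = set()
for node in final:
    backward_dfs(predecessors, node, marked)

# Initial features never marked by the search are treated as discarded.
discarded = [n.split(":", 1)[1] for n in initial if n not in marked]
```

Under these assumptions the only unmarked initial node is the one for `id`, so `id` is reported as the discarded feature.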
Regarding claim 8, Floratou in view of Chen teaches claim 7 as outlined above. Floratou further teaches:
wherein the depth first search corresponds to at least one of: a forward depth first search or a backward depth first search ([0132] “performing a forward traversal of PRs in the directed graph, and for each PR encountered in the forward direction … performing a backward traversal of PRs in the directed graph”).
Regarding claim 9, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
constructing an abstract syntax tree (AST) of the atomized ML pipeline; ([0008] “ building an abstract syntax tree (AST) based on the ML model code”).
traversing the constructed AST of the atomized ML pipeline to identify a rename operation; ([0102] “traversing the AST starting at each root node of the AST and for each node of the AST that is not a literal or constant, determining an operation corresponding to the node” Here an operation includes rename operation).
extracting a first mapping between the plurality of features corresponding to the received ML project and a set of first features associated with the ML pipeline, based on the determined parameter map, wherein ([0102] “recursively determining each one or more inputs, a caller and one or more outputs corresponding to the node, wherein the determined operation, one or more inputs, caller and one or more outputs together comprise a generated PR.”).
the one or more discarded features are identified further based on the extracted first mapping ([0110] “for each PR in the directed graph, feature variables are determined comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation”).
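For illustration, traversing an AST to identify a rename operation and extract a name mapping might look like the sketch below. It assumes the pipeline uses the pandas `DataFrame.rename(columns=...)` call with a literal dictionary; the helper name `find_renames` and the two-line script are hypothetical and are not drawn from either reference.

```python
import ast
from typing import Dict

# Hypothetical pipeline fragment; it is parsed only, never executed.
source = """
df = df.rename(columns={'sex': 'gender'})
df = df.drop(columns=['id'])
"""

def find_renames(code: str) -> Dict[str, str]:
    """Collect old-name to new-name pairs from .rename(columns={...}) calls."""
    mapping: Dict[str, str] = {}
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "rename"):
            for kw in node.keywords:
                if kw.arg == "columns":
                    # Assumes a literal dict; non-literal arguments would raise ValueError.
                    mapping.update(ast.literal_eval(kw.value))
    return mapping

renames = find_renames(source)
```

The resulting mapping relates original feature names to the renamed features, so a renamed feature is not mistaken for a discarded one.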
Regarding claim 10, Floratou in view of Chen teaches claim 9 as outlined above. Floratou further teaches:
wherein the set of first features correspond to one or more features that are renamed based on a user input ([0102] “traversing the AST starting at each root node of the AST and for each node of the AST that is not a literal or constant, determining an operation corresponding to the node” Here an operation includes rename operation).
Regarding claim 11, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
traversing through each of the one or more atomic steps corresponding to the ML pipeline to track an update of a name associated with each of a set of second features associated with the ML pipeline; and ([0132] “performing a backward traversal of PRs in the directed graph starting with the forward PR, and for each PR encountered in the backward direction (backward PR): querying the KB based on the first input annotations for second input annotations corresponding to the backward PR, and adding the second input annotations to the backward PR.” and [0133] “In an embodiment of the foregoing method, the identifying, based at least on the annotated WIR and the first ML API, data from at least one data source that is relied upon by the ML model code when training the ML model comprises: for each PR in the directed graph, determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
extracting a second mapping between the plurality of features corresponding to the received ML project and the set of second features, wherein the one or more discarded features are identified further based on the extracted second mapping ([0133] “determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation”).
Regarding claim 12, Floratou in view of Chen teaches claim 11 as outlined above. Floratou further teaches:
wherein the set of second features correspond to one or more features that are renamed based on a set of predefined rules ([0062] “For example, when processing the Assign node 402 of AST 400 of FIG. 4, the procedure identifies Assign.value as input, Assign as operation, and Assign.targets as output.” Assign operation is a renaming operation).
Regarding claim 13, Floratou in view of Chen teaches claim 11 as outlined above. Floratou further teaches:
wherein the plurality of features corresponds to an initial data-frame of the captured one or more data-frame snapshots and the set of second features corresponds to a final data-frame of the captured one or more data-frame snapshots ([0133] “beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable”).
Regarding claim 14, Floratou in view of Chen teaches claim 1 as outlined above. Floratou further teaches:
determining one or more relevant features, from the plurality of features corresponding to the received ML project, based on the identified one or more discarded features; and ([0133] “determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation”).
training a meta-learning model associated with the ML project based on the determined one or more relevant features ([0133] “training the ML model”).
Regarding claim 15, Floratou teaches:
receiving an ML project including a data-frame and an ML pipeline, the ML pipeline includes a plurality of code statements associated with a plurality of features corresponding to an ML project; ([0039] “A Data Source D can be a database table/view, a spreadsheet, or any other external files that may typically be used in Python scripts to access the input data” and [0040] “A common ML pipeline accesses data source D” and Fig. 6 Step 602 [0091] “At step 602, machine learning (ML) model code is received”).
determining one or more atomic steps corresponding to the ML pipeline to determine an atomized ML pipeline, each atomic step corresponding to a unitary operation being one of assigning a variable, renaming a feature, deleting a feature; ([0047] “Embodiments of provenance tracker 210 may determine a set of columns (or rows or cells) that were explicitly included in or excluded from the features/labels by using the annotated WIR and consulting knowledge base 220” and [0078] “Based on this knowledge, knowledge base 220 may be further configured to include information about such operations to assist provenance tracker 210 in executing a provenance tracking algorithm. For example, embodiments of knowledge base 220 are further configured to include a table consisting of two types of operations as follows: 1) operations from various Python libraries that exclude columns (e.g., drop and delete in the Pandas library) or explicitly select a subset of columns (e.g., iloc and ix), and 2) a few native Python operations such as Subscript, ExtSlice, Slice, Index, and Delete” where operations are similar to “atomic steps”. Also the atomized ML pipeline is disclosed due to the scripts being broken down into a workflow model [0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script”).
instrumenting the atomized ML pipeline to determine an instrumented ML pipeline including one or more operations corresponding to the received ML project; ([0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script including imported libraries, input arguments, operations that change the state of the program, and the derived output variables. This model is captured in a workflow intermediate representation (“WIR”). A WIR may be understood in terms of variables, operations and provenance relationships (“PRs”).”).
executing the instrumented ML pipeline to capture one or more data-frame snapshots based on each of the one or more operations; ([0089] “As such, provenance tracker 210 may operate in two modes: for static analysis as described herein above, or for dynamic analysis where embodiments would return a new Python script including functional modifications (based on annotations as described above)”)
constructing a feature provenance graph (FPG) based on the ML pipeline and the captured one or more data-frame snapshots, the data-frame snap shots relating to the generated log of actions; and ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph” and [0052] “PRs are composed together to form a WIR G, which is a directed graph that represents the sequence and dependencies among the extracted PRs. The WIR is useful to answer queries such as: “Which variables were derived from other variables?”, “What type of libraries and modules were used?”, and “What operations were applied to each variable?”).
identifying one or more discarded features, from the plurality of features corresponding to the received ML project, based on the constructed FPG. ([0110] “In step 1002, for each PR in the directed graph, feature variables are determined comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
Floratou does not teach:
One or more non-transitory computer-readable storage media configured to store instructions that, in response to being executed, cause an electronic device to perform an automatic Machine Learning (ML) pipeline generation method
generating a log of actions based on the execution of the instrumented ML pipeline;
However, Chen does:
One or more non-transitory computer-readable storage media configured to store instructions that, in response to being executed, cause an electronic device to perform an automatic Machine Learning (ML) pipeline generation method ([0002] “ In one or more embodiments described herein, systems, computer-implemented methods, apparatuses and/or computer program products that can automate the generation of machine learning pipelines based on one or more characteristics of time series data are described.”)
generating a log of actions based on the execution of the instrumented ML pipeline; ([0081] “Additionally, the task component 1002 can generate one or more reports that include the outputs of the machine learning task and/or explanations regarding one or more features of the automated machine learning process.”)
Floratou and Chen are considered analogous art to the claimed invention because they are in the same field of endeavor of editing source code. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the machine learning data tracking and workflow decomposition of Floratou with the automatic ML pipeline generation and log reporting of Chen. One would be motivated to do this to create accurate automatic ML pipelines.
Regarding claim 16, Floratou in view of Chen teaches claim 15 as outlined above. Floratou further teaches:
initializing the feature provenance graph (FPG) as a directed acyclic graph including one node for each feature associated with the captured one or more data-frame snapshots; ([0051] “A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I”).
retrieving an abstract syntax tree (AST) associated with the atomized ML pipeline; ([0056] “In an embodiment, derivation extractor 202 is configured to generate a complete WIR from Script 1 according to a three-step process. First, derivation extractor 202 parses Script 1 to obtain a corresponding abstract syntax tree (“AST”) representation of Script 1. Generally speaking, and as known in the art, an AST is a tree representation of the abstract syntactic structure of source code written in a programming language.”).
selecting a first operation of the one or more operations to analyze a data-frame snapshot of the captured one or more data-frame snapshots, wherein the data-frame snapshot corresponds to the selected first operation and includes input information and output information; ([0009] “In another aspect, embodiments are configured to generate PRs by traversing the AST starting at each root node of the AST and for each node of the AST that is not a literal or constant, determining an operation corresponding to the node, and recursively determining each one or more inputs, a caller and one or more outputs corresponding to the node, wherein the determined operation, one or more inputs, caller and one or more outputs together comprise a generated PR.”).
adding a layer of nodes in the FPG for each feature associated with the output information of the data-frame snapshot; and adding a directed edge in the FPG from a first node associated with the input information to a second node associated with the output information, based on a correspondence between a first name of a first feature associated with the first node and a second name of a second feature associated with the second node. ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I, (2) a caller edge (labeled as ‘caller_edge’) (c,p) if p is called by c, and (3) a set of output edges (labeled as ‘output_edge’), where there is an output edge (p,v) for each v∈O. For consistency, we create a temporary output variable for the operations that do not explicitly generate one.”).
Regarding claim 17, Floratou in view of Chen teaches claim 16 as outlined above. Floratou further teaches:
identifying, based on the retrieved AST, one or more features and an operation name associated with the selected first operation; ([0058] “Returning to the description of derivation extractor 202, after generating an AST from a script, derivation extractor 202 then performs the second and third step of the WIR generation process by a) identifying the relationships between the nodes of the AST to generate PRs, and b) composing the generated PRs into the directed graph G (i.e., the WIR of the input script). In an embodiment, these two steps may be performed together with a recursive algorithm (owing to the recursive nature of AST node definitions).”).
determining whether the operation name corresponds to a delete operation; ([0077] “Generally speaking, provenance tracker 210 operates to identify the columns by examining the operations in annotated WIR 208 that are connected to variables that contain features and labels in their annotation set, and which may act to either select or exclude certain columns in a data source. That is, there are various operations that take features (or labels) as their caller/input, and may apply transformations, drop a set of columns from it, select a subset of rows upon satisfaction of a condition, copy it into another variable, and/or use it for visualization and the like.” Here excluding refers to the “delete operation”).
adding a null successor in the FPG for each node of the layer of nodes associated with the input information of the data-frame snapshot, based on the determination that the operation name corresponds to the delete operation; ([0051] “create a temporary output variable for the operations that do not explicitly generate one.”).
determining whether the operation name corresponds to a rename operation; ([0077] “Generally speaking, provenance tracker 210 operates to identify the columns by examining the operations in annotated WIR 208 that are connected to variables that contain features and labels in their annotation set, and which may act to either select or exclude certain columns in a data source. That is, there are various operations that take features (or labels) as their caller/input, and may apply transformations, drop a set of columns from it, select a subset of rows upon satisfaction of a condition, copy it into another variable, and/or use it for visualization and the like.”).
adding a directed edge in the FPG from a node associated with the input information of the data-frame snapshot to a corresponding re-named successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name corresponds to the rename operation; and adding a directed edge in the FPG from each node of the layer of nodes associated with the input information of the data-frame snapshot to a corresponding successor node associated with the output information of the data-frame snapshot, based on the determination that the operation name is other than the rename operation or the delete operation. ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), where there is an input edge (v,p) for each v∈I, (2) a caller edge (labeled as ‘caller_edge’) (c,p) if p is called by c, and (3) a set of output edges (labeled as ‘output_edge’), where there is an output edge (p,v) for each v∈O. For consistency, we create a temporary output variable for the operations that do not explicitly generate one.”).
Regarding claim 18, Floratou in view of Chen teaches claim 15 as outlined above. Floratou further teaches:
retrieving the constructed feature provenance graph (FPG); ([0132] “adding annotations to the PRs of the directed graph corresponding to the WIR”).
selecting a node from a data-frame snapshot in the retrieved FPG; ([0132] “beginning at each import process node of the directed graph”).
performing a depth first search of the retrieved FPG from the selected node by marking each visited node in the retrieved FPG; and ([0132] “performing a forward traversal of PRs in the directed graph, and for each PR encountered in the forward direction (forward PR): querying the KB for first input annotations corresponding to the one or more input variables of the forward PR”).
determining one or more unmarked nodes in the retrieved FPG, based on the performed depth first search, wherein the one or more discarded features are identified based on the determined one or more unmarked nodes. ([0133] “determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
Regarding claim 19, Floratou in view of Chen teaches claim 15 as outlined above. Floratou further teaches:
determining one or more relevant features, from the plurality of features corresponding to the received ML project, based on the identified one or more discarded features; and ([0133] “determining feature variables comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation”).
training a meta-learning model associated with the ML project based on the determined one or more relevant features ([0133] “training the ML model”).
Regarding claim 20, Floratou teaches:
a memory storing instructions; and ([0136] “a queryable knowledge base (KB) stored in one or more memory devices”).
a processor, coupled to the memory, that executes the stored instructions to perform a process comprising: ([0142] “the computer-readable storage medium having program instructions recorded thereon that, when executed by at least one processor”)
receiving an ML project including a data-frame and an ML pipeline, the ML pipeline includes a plurality of code statements associated with a plurality of features corresponding to an ML project; ([0039] “A Data Source D can be a database table/view, a spreadsheet, or any other external files that may typically be used in Python scripts to access the input data” and [0040] “A common ML pipeline accesses data source D” and Fig. 6 Step 602 [0091] “At step 602, machine learning (ML) model code is received”).
determining one or more atomic steps corresponding to the ML pipeline to determine an atomized ML pipeline, each atomic step corresponding to a unitary operation being one of assigning a variable, renaming a feature, deleting a feature; ([0047] “Embodiments of provenance tracker 210 may determine a set of columns (or rows or cells) that were explicitly included in or excluded from the features/labels by using the annotated WIR and consulting knowledge base 220” and [0078] “Based on this knowledge, knowledge base 220 may be further configured to include information about such operations to assist provenance tracker 210 in executing a provenance tracking algorithm. For example, embodiments of knowledge base 220 are further configured to include a table consisting of two types of operations as follows: 1) operations from various Python libraries that exclude columns (e.g., drop and delete in the Pandas library) or explicitly select a subset of columns (e.g., iloc and ix), and 2) a few native Python operations such as Subscript, ExtSlice, Slice, Index, and Delete” where the operations are similar to “atomic steps”. Also, the atomized ML pipeline is disclosed because the scripts are broken down into a workflow model, [0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script”).
instrumenting the atomized ML pipeline to determine an instrumented ML pipeline including one or more operations corresponding to the received ML project; ([0048] “Derivation extractor 202 is configured to parse script 102 and by performing static analysis, build a workflow model which captures the dependencies among the elements of the script including imported libraries, input arguments, operations that change the state of the program, and the derived output variables. This model is captured in a workflow intermediate representation (“WIR”). A WIR may be understood in terms of variables, operations and provenance relationships (“PRs”).”).
executing the instrumented ML pipeline to capture one or more data-frame snapshots based on each of the one or more operations; ([0089] “As such, provenance tracker 210 may operate in two modes: for static analysis as described herein above, or for dynamic analysis where embodiments would return a new Python script including functional modifications (based on annotations as described above)”).
constructing a feature provenance graph (FPG) based on the ML pipeline and the captured one or more data-frame snapshots, the data-frame snapshots relating to the generated log of actions; and ([0051] “An invocation of an operation p (by an optional caller c) embodies a provenance relationship (PR). A PR is represented as a quadruple (I, c, p, O), where I is an ordered set of input variables, (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables that was derived from this process. A PR can be represented as a labeled directed graph” and [0052] “PRs are composed together to form a WIR G, which is a directed graph that represents the sequence and dependencies among the extracted PRs. The WIR is useful to answer queries such as: “Which variables were derived from other variables?”, “What type of libraries and modules were used?”, and “What operations were applied to each variable?”).
identifying one or more discarded features, from the plurality of features corresponding to the received ML project, based on the constructed FPG. ([0110] “In step 1002, for each PR in the directed graph, feature variables are determined comprising variables of the PR that are annotated as features and that correspond to a data selection or exclusion operation, and beginning with the PR of each determined feature variable, recursively traverse the directed graph backwards to identify the data from the at least one data source that corresponds to the feature variable.”).
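For illustration only, the provenance relationship quoted from Floratou [0051] (a quadruple (I, c, p, O)) and the backward traversal quoted from [0110] can be sketched as follows. The class name `PR`, the helper `trace_to_sources`, the treatment of the caller as an additional dependency, and the example variable names are all assumptions made for this sketch; they are not Floratou's implementation.

```python
from typing import NamedTuple, Optional


class PR(NamedTuple):
    """Hypothetical container for a provenance relationship (I, c, p, O)."""
    inputs: tuple            # I: ordered input variables
    caller: Optional[str]    # c: optional caller object
    operation: str           # p: the operation invoked
    outputs: tuple           # O: ordered output variables


def trace_to_sources(prs, variable):
    """Walk the PR graph backwards from `variable` to variables no PR produced."""
    producers = {out: pr for pr in prs for out in pr.outputs}
    sources, seen, stack = set(), set(), [variable]
    while stack:
        var = stack.pop()
        if var in seen:
            continue
        seen.add(var)
        pr = producers.get(var)
        if pr is None:
            sources.add(var)              # nothing derived it: treat as a data source
        else:
            stack.extend(pr.inputs)       # follow input variables backwards
            if pr.caller is not None:     # assumption: the caller is also a dependency
                stack.append(pr.caller)
    return sources


# Hypothetical pipeline: df = read_csv("data.csv"); X = df.drop(...)
# Literal arguments such as column names are omitted from the inputs.
prs = [
    PR(inputs=("data.csv",), caller=None, operation="read_csv", outputs=("df",)),
    PR(inputs=(), caller="df", operation="drop", outputs=("X",)),
]
origin = trace_to_sources(prs, "X")
```

Under these assumptions, tracing the feature variable `X` backwards through the `drop` and `read_csv` relationships ends at the data source `data.csv`.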
Floratou does not teach:
An automatic Machine Learning (ML) pipeline generation device
Generating a log of actions based on the execution of the instrumented ML pipeline;
However, Chen does:
An automatic Machine Learning (ML) pipeline generation device ([0002] “In one or more embodiments described herein, systems, computer-implemented methods, apparatuses and/or computer program products that can automate the generation of machine learning pipelines based on one or more characteristics of time series data are described”)
Generating a log of actions based on the execution of the instrumented ML pipeline; ([0081] “Additionally, the task component 1002 can generate one or more reports that include the outputs of the machine learning task and/or explanations regarding one or more features of the automated machine learning process.”)
Floratou and Chen are considered analogous art to the claimed invention because they are in the same field of endeavor of editing source code. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the machine learning data tracking and workflow decomposition of Floratou with the automatic ML pipeline generation and log reporting of Chen. One would be motivated to do this to create accurate automatic ML pipelines.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Floratou in view of Chen and Loyola (US 2020/0097383 A1).
Regarding claim 3, Floratou in view of Chen teaches claim 1 as outlined above. Neither Floratou nor Chen teaches:
wherein the instrumentation of the atomized ML pipeline is based on an application of a method call injection technique on the atomized ML pipeline.
However, Loyola does teach this ([0037] “Although bug report 402B uses natural language, some bug reports contain portions of source code, such as cases where the end user adds source code to show a specific feature or log associated with the fault. In order to account for the presence of both source code and natural language, some embodiments of feature learning process 406 may distinguish between the source code and the natural language, and tokenize accordingly.”)
Floratou, Chen and Loyola are considered analogous art to the claimed invention because they are in the same field of endeavor of editing source code. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the machine learning data tracking of Floratou with the code reporting of Loyola and combine it with the automated generation of ML pipelines of Chen. One would be motivated to do this to obtain accurate and efficient code reporting.
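For illustration only, one way a method call injection technique of the kind recited in claim 3 might operate is sketched below using Python's standard `ast` module: a call to a logging hook is inserted after each simple assignment, and executing the rewritten code records a snapshot per step. The hook name `_snapshot`, the class `SnapshotInjector`, and the toy pipeline are assumptions made for this sketch; it is not drawn from the application or from any cited reference.

```python
import ast
import copy


class SnapshotInjector(ast.NodeTransformer):
    """Insert `_snapshot('<name>', <name>)` after each top-level `name = ...`."""

    def visit_Module(self, node):
        new_body = []
        for stmt in node.body:
            new_body.append(stmt)
            if (isinstance(stmt, ast.Assign)
                    and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)):
                name = stmt.targets[0].id
                # Build the injected call statement from source text.
                hook = ast.parse(f"_snapshot({name!r}, {name})").body[0]
                new_body.append(hook)
        node.body = new_body
        return node


def instrument(source):
    """Return a code object for `source` with snapshot calls injected."""
    tree = SnapshotInjector().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<pipeline>", "exec")


log = []


def _snapshot(name, value):
    # Store a deep copy so later mutation does not alter the recorded state.
    log.append((name, copy.deepcopy(value)))


# Toy stand-in for an atomized pipeline (plain dicts instead of data-frames).
pipeline = """
row = {'age': 30, 'zip': '94105'}
features = {k: v for k, v in row.items() if k != 'zip'}
"""
exec(instrument(pipeline), {"_snapshot": _snapshot})
```

After execution, `log` holds one entry per assignment in the order the statements ran, which is the kind of per-operation record from which a snapshot-based graph could later be built.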
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL PATRICK GRUSZKA whose telephone number is (571)272-5259. The examiner can normally be reached M-F 9:00 AM - 6:00 PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL GRUSZKA/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121