DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant’s submission filed 2025-10-22 has been entered. Applicant’s amendments to the claims have overcome each and every objection and 112(b) rejection set forth in the previous office action. The status of claims is as follows:
Claims 1-3, 5-12, and 14-22 remain pending in the application.
Claims 1, 8, 10, 16-17, and 19 are amended.
Claims 4 and 13 are cancelled.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 103 have been fully considered but they are not persuasive and/or are moot in light of the new reference applied, as necessitated by the newly amended matter.
Applicant argues on Remarks Page 14 that Kuppa teaches purging data based on a “threshold privacy score” instead of a “threshold value for a confidence level”, and therefore does not teach the claimed limitation.
Examiner respectfully disagrees. Neither the claims nor the specification gives any details on the nature of the “confidence level”, and the broadest reasonable interpretation of “confidence level” includes a “privacy score”, as private information that is shared can be said to be shared “in confidence” or can be said to be “confidential”. In addition, as Examiner explained in the mapping, the “privacy score” may be considered a measure of confidence in the privacy of the data.
Applicant argues on Remarks Page 14 that “relying on six or more references inherently makes the instant invention not obvious” because the MPEP section that states that a large number of references does not weigh against obviousness relies on In re Gorman, which no longer applies because KSR v. Teleflex eliminated the need for express motivation to combine references that was required by In re Gorman.
Examiner respectfully disagrees. The KSR v. Teleflex decision resulted in additional motivations that can be used to combine references besides an explicit motivation in a reference. It did not eliminate every aspect of In re Gorman, and contains no ruling that the provision in In re Gorman that a large number of references does not weigh against obviousness was overturned.
Applicant argues on Remarks Page 13 that the previously applied combination of references does not teach data control language (DCL). Examiner agrees, but this is moot in light of the new reference applied to teach this matter. The new reference also replaces Oracle in Claims 8 and 17.
Claim Interpretation (Restated from previous office action)
Examiner notes that support for several of the new amendments requires rather significant reading into the Specification by one of ordinary skill in the art. While the gaps may not be seen as sufficient to warrant a 112(a) new matter rejection, they at least allow for broad interpretation of the claimed language for the purposes of applying prior art.
Claims 1, 10, and 19 recite the limitations “wherein a first batch of produced data has for the query results a confidence level lower than a threshold value and a second batch of the produced data has for the query results a confidence level higher than a threshold value; and purging the first batch of data from the new data in response to the confidence level being lower than the threshold value.”
However, the Specification never explicitly discloses any direct correlation between “purge” and “confidence level”. Support for this may be seen as a combination of [0037-0038] (“terminate when the confidence level is considered to be sufficiently large … Processing proceeds to grow operation S275, where database (DB) activities mod 316 grows the DB activities in a manner using the confidence level.”) and [0039] (“use generated data to purge the data (that is, purge and refresh means after data generation, a new data quality validation is performed. This may include the need to purge some exception data (for example, may increase with the model but may fail with a new, similar query check).”)
Examiner notes that one needs to make the assumption that the “new data quality validation” uses the same metric as the “confidence level”. One also needs to make the inference that to “purge … exception data” amounts to splitting data into two batches, one which meets the confidence threshold, and one which does not meet the confidence threshold, which is referred to as “exception data” – which apparently is meant to mean “exceptional data”, interpreted to mean that it lies outside the acceptable range (or as stated in the Specification, “may fail with a new, similar query check”). Thus, Examiner notes that there is no explicit recitation of “batches”, but merely that data falling below a given confidence threshold is purged.
Claim 21 recites the limitation “enriching the pre-processed data to replace join and/or constraint information that has been lost”. There is no support for this level of specificity in the Specification. The Specification [0035] states: “The cleaned and concealed data is analyzed to obtain a set of extra patterns. In this operation, the cleaned and concealed data is removed. The whole data picture may be incomplete, for example some join or constraint information may be lost. Thus, patterns are created to enrich the data.” This states a rationale as to why the data is enriched, but nothing about how the enrichment is done, or what the patterns are, and certainly not explicitly that the “patterns … to enrich the data” in some way “replace” lost joins or constraints. This claim may be interpreted as simply meaning that some processing is performed using the data in order to preserve the database constraints.
Claim 22 recites the limitation “joining the new data that remains after purging and the cleaned raw data”. There is no support for this level of specificity in the Specification. The Specification [0034] states: “The raw data is first cleaned … After the data has been cleaned, it is ‘concealed’ as the next part of operation S265.”
Examiner notes that there is no mention of “clean” and “conceal” occurring in any time frame relative to the “purging”, whether it be before or after. There is also no specific mention of a “join” in this process. This appears to merely be referring to removing and replacing sensitive data in the database. It is possible that [0051] (“grows discriminant data from clean and concealed sample data using a DB statistic distribution model”) refers to a blend of cleaned and concealed data with fully synthetic data (“growing” data from a base of “clean and concealed data”, resulting in a “partially synthetic” dataset with a mix of both partially synthetic (“concealed”) and fully synthetic records). Walters teaches both a “fully synthetic” and a “partially synthetic” dataset.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5, 8, 10-12, 14, 17, 19-20, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Walters et al. (US 2020/0012584 A1; hereinafter “Walters”) in view of Gilad et al. (“Synthesizing Linked Data Under Cardinality and Integrity Constraints”; hereinafter “Gilad”), further in view of Lao et al. (“FoCL: Feature-Oriented Continual Learning for Generative Models”; hereinafter “Lao”), further in view of Fan et al. (“Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration”; hereinafter “Fan”), further in view of Brass (“Part 7: Query Optimization”), further in view of Kuppa et al. (“Towards Improving Privacy of Synthetic DataSets”; hereinafter “Kuppa”), and further in view of Foster et al. (“Overview of SQL”; hereinafter “Foster”).
As per Claim 1, Walters teaches a computer-implemented method (CIM) comprising ([0008]: “Consistent with other disclosed embodiments, non-transitory computer readable storage media may store program instructions, which are executed by at least one processor device and perform any of the methods described herein.”)
receiving a set of raw data from a database and/or from a data stream ([0122]: “Streaming data source 1301 can be configured to retrieve new data elements from a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like.”)
pre-processing and concealing the raw data to obtain pre-processed data ([0182]: “In some embodiments, at step 1806, sensitive data may be tokenized, masking underlying data values and preserving confidentiality.”)
analyzing pre-processed raw data to obtain extra patterns, with the extra patterns being programmed and/or structured to enrich the pre-processed raw data in response to a whole data picture being incomplete ([0083]: “Disclosed embodiments address this technical problem by at least one of normalizing categorical data or replacing missing values with supra-normal values”. Here, data is analyzed to obtain extra patterns for each data type (distributions so that the norm can be calculated) and this is used to enrich missing data (fill with “supra-normal values”). This is also stated in [0086]: “For example, system 100 can be configured to assign missing values a first numerical value outside the predetermined range.” Thus here, the “extra pattern” is the “predetermined range” which is used to enrich data when it is incomplete, by adding numerical values outside the “extra pattern”/“predetermined range”)
creating discriminator data for use by a discriminator component of a generative adversarial network (GAN), with the discriminator data including sample data and database (DB) statistics ([0083]: “The discriminator can be configured to determine, when presented with either an actual data sample or a sample of synthetic data generated by the generator network, whether the sample was generated by the generator network or was a sample of actual data” and [0109]: “Consistent with disclosed embodiments, system 100 can be configured to generate an encoder model and decoder model using an adversarially learned inference model, as disclosed in ‘Adversarially Learned Inference’ by Vincent Dumoulin, et al. … Thus, the encoder and decoder can be trained to fool the discriminator. When appropriately trained, the joint distribution of code and sample for the encoder and decoder match.” Here, it is shown that the discriminator takes as input sample data, and also takes into account DB statistics, as it determines which distribution the sample was drawn from (actual or synthetic). The Dumoulin paper they cite confirms this on Page 2: “A discriminator is trained to discriminate joint samples of the data and the corresponding latent variable from the encoder (or approximate posterior) from joint samples from the decoder while in opposition, the encoder and the decoder are trained together to fool the discriminator.”)
and growing the discriminator data from the pre-processed data with a DB statistic distribution model (As shown above in [0083] and [0109], discriminator data is grown by drawing samples from actual and synthetic data, and performs its actions based on the sample distributions. [0122] discloses that the data can be from a DB: “Streaming data source 1301 can be configured to retrieve new data elements from a database.”)
building a generative model, based on DB model activities, for use by the GAN, [wherein the building comprises intermittently training the generative model as the database is actively ingesting data and/or while the data stream is streaming], wherein the generative model is trained together with the discriminator component as the generative model produces data that simulates the pre-processed data and the discriminator component classifies input as being the produced data or as being part of the pre-processed data
([0033]: “Using these models, the disclosed embodiments can produce fully synthetic datasets with similar structure and statistics as the original sensitive or non-sensitive datasets.” [0036]: “Dataset generator 103 can include one or more computing devices configured to generate data. Dataset generator 103 can be configured to provide data to computing resources 101, database 105, to another component of system 100 (e.g., interface 113), or another system (e.g., an APACHE KAFKA cluster or other publication service). Dataset generator 103 can be configured to receive data from database 105 or another component of system 100. Dataset generator 103 can be configured to receive data models from model storage 109 or another component of system 100. Dataset generator 103 can be configured to generate synthetic data.” [0083]: “The discriminator can be configured to determine, when presented with either an actual data sample or a sample of synthetic data generated by the generator network, whether the sample was generated by the generator network or was a sample of actual data.”)
using the generative model to produce new data ([0036], as shown above, discloses: “Dataset generator 103 can include one or more computing devices configured to generate data”)
and performing a reward operation to validate quality of the new data ([0088]: “FIG. 9 depicts a process 900 for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments. System 100 can be configured to use process 900 to generate synthetic data that is similar, but not too similar to the actual data, as the actual data can include sensitive personal information. For example, when the actual data includes social security numbers or account numbers, the synthetic data would preferably not simply recreate these numbers. Instead, system 100 would preferably create synthetic data that resembles the actual data, as described below, while reducing the likelihood of overlapping values. To address this technical problem, system 100 can be configured to determine a similarity metric value between the synthetic dataset and the normalized reference dataset, consistent with disclosed embodiments. System 100 can be configured to use the similarity metric value to update a loss function for training the generative adversarial network. In this manner, system 100 can be configured to determine a synthetic dataset differing in value from the normalized reference dataset at least a predetermined amount according to the similarity metric.”)
However, Walters does not teach wherein the creating comprises: categorically including a feature factor comprising cardinality; wherein the building comprises intermittently training the generative model as the database is actively ingesting data and/or while the data stream is streaming; wherein the reward operation comprises: receiving test queries for querying the new data, wherein the test queries include data control language (DCL); extracting query features from the test queries; normalizing the extracted query features; obtaining query results on the normalized query features, wherein the quality of the new data is indicated by a confidence level that the query results are successful compared to the raw data, wherein a first batch of the new data has for the query results a confidence level lower than a threshold value and a second batch of the new data has for the query results a confidence level higher than a threshold value; purging the first batch of data from the new data in response to the confidence level being lower than the threshold value.
Gilad teaches wherein the creating comprises: categorically including a feature factor comprising cardinality (Page 619 Top Right: “Two prominent challenges in this field are: (1) the generation of links between different tables, i.e., aligning foreign keys with primary keys based on Cardinality Constraints (CCs).” Page 620 Bottom Left: “We believe that the problem we focus on is a key building block for the general problem of synthesizing data consistent with CCs and ICs for all three use-cases mentioned above. In particular, we believe that one can use the wealth of existing literature to synthesize individual relations consistent with CCs without the key relationships and then use our technique to fill-in the foreign keys.”)
Gilad is analogous art because it is in the field of endeavor of generating synthetic database data. It would have been obvious before the effective filing date of the claimed invention to combine the addition of synthetic data to databases of Walters with the cardinality constraints of Gilad. One of ordinary skill in the art would have been motivated to do so in order to be able to also generate Foreign Key values in the database that correspond to the generated data (Gilad, Abstract: “The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key (𝐹𝐾) values, and two types of constraints: (1) CCs that apply to the join view and (2) ICs that apply to the table with missing 𝐹𝐾 values, our goal is to impute the missing 𝐹𝐾 values such that the constraints are satisfied.”)
However, the combination of Walters and Gilad does not explicitly teach wherein the building comprises intermittently training the generative model as the database is actively ingesting data and/or while the data stream is streaming. (Although, Examiner notes, Walters appears to suggest this in [0122]: “In some aspects, streaming data source 1301 can be configured to retrieve new data elements in real-time.”)
Lao teaches wherein the building comprises intermittently training the generative model as the database is actively ingesting data and/or while the data stream is streaming (Abstract: “In this paper, we propose a general framework in continual learning for generative models.” Top of Page 3: “Given a stream of generative tasks t ∈ [1, 2, …, T] arrived sequentially, each of which has its own designated dataset x(t), the goal of continual learning in generative models is to learn a unified model parameterized by θ, such that pθ(x(t)) = pdata(x(t)) for all t ∈ [1, 2, …, T].”)
Lao is analogous art because it is in the field of endeavor of generative models. It would have been obvious before the effective filing date of the claimed invention to combine the synthetic data generation to a database of Walters and Gilad with the continual learning of Lao in order to adapt to distributional changes in the incoming data (Lao, Abstract: “We show in our experiments that FoCL has faster adaptation to distributional changes in sequentially arriving tasks, and achieves the state-of-the-art performance for generative models in task incremental learning.”)
However, the combination of Walters, Gilad, and Lao does not teach wherein the reward operation comprises: receiving test queries for querying the new data, wherein the test queries include data control language (DCL); extracting query features from the test queries; normalizing the extracted query features; obtaining query results on the normalized query features, wherein the quality of the new data is indicated by a confidence level that the query results are successful compared to the raw data, wherein a first batch of the new data has for the query results a confidence level lower than a threshold value and a second batch of the new data has for the query results a confidence level higher than a threshold value; purging the first batch of data from the new data in response to the confidence level being lower than the threshold value.
Fan teaches wherein the reward operation comprises: receiving test queries for querying the new data; obtaining query results on the [normalized] query features, wherein the quality of the new data is indicated by a confidence level that the query results are successful compared to the raw data (Fan, Page 2 Bottom Left, discloses: “We study the problem of synthesizing a “fake” table T′ from the original T, with the objective of preserving data utility and protecting privacy.” Fan, End of Page 7, discloses: “Evaluation on data utility for AQP. We use the fake table T′ to answer a given workload of aggregation queries. We follow the query generation method in [36] to generate 1,000 queries with aggregate functions (i.e., count, avg and sum), selection conditions and groupings. We also run the same queries on the original table Ttrain. For each query, we measure the relative error e′ of the result obtained from T′ by comparing with that from Ttrain. Meanwhile, following the method in [54], we draw a fixed size random sample set (1% by default) from the original table, run the queries on this sample set, and obtain relative error e for each query. To eliminate randomness, we draw the random sample sets for 10 times and compute the averaged e for each query. Then, as mentioned in Section 2.1, we compute the relative error difference DiffAQP and average the difference for all queries in the workload, to measure the utility of T′ for AQP.”)
Fan is analogous art because it is in the field of endeavor of synthetic relational data generation. It would have been obvious before the effective filing date of the claimed invention to combine the synthetic data generation to a database of Walters, Gilad, and Lao with the test queries of Fan. One of ordinary skill in the art would have been motivated to do so in order to measure data utility (Fan Page 2 Bottom Left: “We study the problem of synthesizing a “fake” table T′ from the original T, with the objective of preserving data utility and protecting privacy” and Page 8 Top Right: “Then, as mentioned in Section 2.1, we compute the relative error difference DiffAQP and average the difference for all queries in the workload, to measure the utility.”)
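For illustration only, the kind of query-based utility check described in the Fan passage quoted above can be sketched as follows. This is a minimal sketch under stated assumptions and is not Fan's implementation: the row format, the three-query workload, and the function names are hypothetical, and only the per-query relative error e′ (synthetic result compared with the original result) is computed, not Fan's DiffAQP.

```python
# Hypothetical example data: an "original" table and a synthetic ("fake") table.
ORIGINAL = [
    {"region": "a", "amount": 10.0},
    {"region": "a", "amount": 20.0},
    {"region": "b", "amount": 30.0},
    {"region": "b", "amount": 40.0},
]
SYNTHETIC = [
    {"region": "a", "amount": 12.0},
    {"region": "a", "amount": 18.0},
    {"region": "b", "amount": 33.0},
    {"region": "b", "amount": 33.0},
]


def count_region_b(rows):
    """COUNT(*) WHERE region = 'b'."""
    return sum(1 for r in rows if r["region"] == "b")


def sum_amount(rows):
    """SUM(amount) over the whole table."""
    return sum(r["amount"] for r in rows)


def avg_amount_region_a(rows):
    """AVG(amount) WHERE region = 'a'."""
    values = [r["amount"] for r in rows if r["region"] == "a"]
    return sum(values) / len(values)


# Assumed workload of aggregate test queries (count, sum, avg).
WORKLOAD = [count_region_b, sum_amount, avg_amount_region_a]


def relative_error(estimate, truth):
    """Relative error of a synthetic-table result against the original result."""
    if truth == 0:
        # Assumption: an exact match on zero is no error; any miss is unbounded.
        return 0.0 if estimate == 0 else float("inf")
    return abs(estimate - truth) / abs(truth)


def mean_relative_error(queries, original_rows, synthetic_rows):
    """Average the per-query relative error over the workload."""
    errors = [relative_error(q(synthetic_rows), q(original_rows)) for q in queries]
    return sum(errors) / len(errors)


score = mean_relative_error(WORKLOAD, ORIGINAL, SYNTHETIC)
```

A lower score indicates that the synthetic table answers the workload more like the original table does; how such a score would map onto the claimed “confidence level” is a matter of claim interpretation and is not settled by this sketch.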
However, the combination of Walters, Gilad, Lao, and Fan does not teach extracting query features from the test queries; normalizing the extracted query features; obtaining query results on the normalized query features; wherein a first batch of the new data has for the query results a confidence level lower than a threshold value and a second batch of the new data has for the query results a confidence level higher than a threshold value; purging the first batch of data from the new data in response to the confidence level being lower than the threshold value; wherein the test queries include data control language (DCL).
Brass teaches extracting query features from the test queries; normalizing the extracted query features; obtaining query results on the normalized query features (Brass, slides 21-25, discusses “Query Normalization in Oracle”. Brass discloses extracting query features (“parsing the SQL query”), various ways of normalizing the extracted query features (“evaluate constant expressions,” “detects when the LIKE operator is really an equality”, “IN with a list of values is transformed into OR”, “The BETWEEN operator is also removed … replaced by”, “NOT is moved down to the atomic conditions”, “ANY, SOME, and ALL are removed”, “ANY with a subquery is normalized to EXISTS”, “compute certain implied conditions”, etc.), and obtaining query results on the normalized query features (“before QEPs are generated”, QEP is “query execution plan”)).
Brass is analogous art because it is in the field of endeavor of databases, which is reasonably pertinent to the problem faced by both the inventor and Walters and Fan. It would have been obvious before the effective filing date of the claimed invention to combine the test queries of Fan with the normalization of the queries of Brass. One of ordinary skill in the art would have been motivated to do so in order to optimize the query execution (Brass, slide 21: “SQL often allows equivalent formulations, and in this way the optimizer does not have to handle them all. Also, some of these transformations make specific optimizations applicable later.”)
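For illustration only, two of the normalizations quoted from Brass above (BETWEEN replaced by a pair of comparisons, and IN with a list of values transformed into OR) can be sketched as simple string rewrites. This is a toy sketch under stated assumptions, not Oracle's optimizer or Brass's code: the function names are hypothetical, and the regular expressions assume simple column names and literal values that contain no spaces, commas, or nested parentheses.

```python
import re

# col BETWEEN lo AND hi  ->  (col >= lo AND col <= hi)
_BETWEEN = re.compile(r"(\w+)\s+BETWEEN\s+(\S+)\s+AND\s+(\S+)", re.IGNORECASE)

# col IN (v1, v2, ...)  ->  (col = v1 OR col = v2 OR ...)
_IN_LIST = re.compile(r"(\w+)\s+IN\s*\(([^()]*)\)", re.IGNORECASE)


def normalize_between(sql):
    """Rewrite each BETWEEN predicate as two comparisons joined by AND."""
    return _BETWEEN.sub(r"(\1 >= \2 AND \1 <= \3)", sql)


def normalize_in_list(sql):
    """Rewrite each IN (value list) predicate as a disjunction of equalities."""
    def expand(match):
        column = match.group(1)
        values = [v.strip() for v in match.group(2).split(",")]
        return "(" + " OR ".join(f"{column} = {v}" for v in values) + ")"
    return _IN_LIST.sub(expand, sql)


def normalize(sql):
    """Apply both rewrites; BETWEEN first, then IN lists."""
    return normalize_in_list(normalize_between(sql))


normalized = normalize("SELECT * FROM t WHERE a BETWEEN 1 AND 5 AND b IN (2, 3)")
# normalized == "SELECT * FROM t WHERE (a >= 1 AND a <= 5) AND (b = 2 OR b = 3)"
```

The point of the rewrite is the one Brass gives: equivalent formulations collapse to one form, so later stages handle fewer cases. A real normalizer would operate on a parse tree rather than on raw text.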
However, the combination of Walters, Gilad, Lao, Fan, and Brass does not teach wherein a first batch of the new data has for the query results a confidence level lower than a threshold value and a second batch of the new data has for the query results a confidence level higher than a threshold value; purging the first batch of data from the new data in response to the confidence level being lower than the threshold value; wherein the test queries include data control language (DCL).
Kuppa teaches wherein a first batch of the new data has for the query results a confidence level lower than a threshold value and a second batch of the new data has for the query results a confidence level higher than a threshold value; purging the first batch of data from the new data in response to the confidence level being lower than the threshold value. (Kuppa, Page 110-111 Section 3: “To avoid MIA type attacks and unintentional leakage of sensitive attributes, we devise an instance level privacy score for generated synthetic data. As shown in previous studies, the main reasons for privacy compromises in generative models is mainly because of over fitting of training data that could lead to data memorization. We empirically measure memorization coefficient αm for each generated sample and use this measure as privacy score. When distributing synthetic data to external third parties, users can filter samples with high privacy score and reduce privacy risks … The advantage of this method is the model Φ is agnostic and works on data level. Users can discard the samples which cross a certain threshold privacy score to protect privacy of users and, data audit by compliance bodies can be performed without a need to know the model internals and training dynamics.” Here, Kuppa discloses a confidence score (“privacy score”, which is a confidence in the privacy of the generated synthetic data), and two batches (one batch on one side of the threshold, and one on the other), and purging the lower (“discard the samples”)).
Kuppa is analogous art because it is in the field of endeavor of synthetic data generation. It would have been obvious before the effective filing date of the claimed invention to combine the synthetic data generation of Walters, and the scoring of Fan, wherein Fan teaches “tradeoff between synthetic data utility and privacy” and using queries to measure “data utility”, with the purging of data below a confidence level threshold of Kuppa (wherein “confidence” could be confidence in “data utility” (a measure of similarity) as taught by Fan or privacy as taught by both Fan and Kuppa.) One of ordinary skill in the art would have been motivated to do so in order to improve the value of synthetic data, in a way that it can be applied after generation (“post hoc”), as noted by Kuppa on Page 115 Section 5.2: “Our work aims to give some direction towards improving synthetic data privacy by assigning instance-level privacy scores. A data auditor or service provider can leverage the scores to remove the data points prone to MI attacks. The advantage of our method is a model agnostic and it can be used in post-hoc fashion.”
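For illustration only, the thresholded purge recited in the claim can be sketched as a post-hoc partition of scored records into two batches. This is a minimal sketch under stated assumptions and is not taken from Kuppa or from the Specification: the record layout, the `confidence` field, and the function name are hypothetical, and keeping records whose score equals the threshold is an arbitrary choice, since the claim language does not address that case.

```python
def purge_below_threshold(records, threshold):
    """Split scored records into a kept batch and a purged batch.

    Records scoring below the threshold form the first (purged) batch;
    all others form the second (kept) batch. Ties are kept by assumption.
    """
    kept = [r for r in records if r["confidence"] >= threshold]
    purged = [r for r in records if r["confidence"] < threshold]
    return kept, purged


# Hypothetical generated records, each already carrying a per-record score.
generated = [
    {"id": 1, "confidence": 0.91},
    {"id": 2, "confidence": 0.42},
    {"id": 3, "confidence": 0.75},
    {"id": 4, "confidence": 0.10},
]

kept, purged = purge_below_threshold(generated, threshold=0.75)
kept_ids = [r["id"] for r in kept]      # [1, 3]
purged_ids = [r["id"] for r in purged]  # [2, 4]
```

Because the partition works only on the scored output, it can be applied after generation without access to the generative model.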
However, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa does not teach wherein the test queries include data control language (DCL).
Foster teaches wherein the test queries include data control language (DCL) (Foster, Page 171 Section 10.1: “Structure Query Language (SQL) is an example of a DSL — consisting of DDL, DCL, and DML as defined in chapter 2.”)
Foster is analogous art because it is in the field of endeavor of databases. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Walters, Gilad, Lao, Fan, Brass, and Kuppa with those of Foster. One of ordinary skill in the art would have been motivated to do so in order to take advantage of all the features of SQL, which is the most common language for performing database operations (Foster, Page 171: “The Structured Query Language (SQL) has become the universal language of choice for DBMS products … First developed by IBM in the 1970s, SQL is the universal language of databases”) and in order to manage commits, rollbacks, system privileges, and environmental settings (Foster Page 173: “The DCL statements fall into three categories: those that affect how DML operations take place (mainly COMMIT and ROLLBACK); those that relate to system privileges; and those that affect the environmental settings of the end user.”)
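For illustration only, the DDL/DCL/DML grouping quoted from Foster above can be sketched as a classifier keyed on a statement's leading keyword. This is a minimal sketch under stated assumptions: placing COMMIT and ROLLBACK under DCL follows the Foster passage quoted above, while GRANT and REVOKE (privileges) and the DDL and DML keyword sets are assumptions added here and are not drawn from the cited pages. Other texts group COMMIT and ROLLBACK separately as transaction control.

```python
# Keyword sets are illustrative assumptions, not an exhaustive SQL grammar.
DCL_KEYWORDS = {"COMMIT", "ROLLBACK", "GRANT", "REVOKE"}
DDL_KEYWORDS = {"CREATE", "ALTER", "DROP"}
DML_KEYWORDS = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def classify(statement):
    """Return 'DCL', 'DDL', 'DML', or 'UNKNOWN' from the leading keyword."""
    words = statement.strip().split()
    if not words:
        return "UNKNOWN"
    keyword = words[0].upper().rstrip(";")
    if keyword in DCL_KEYWORDS:
        return "DCL"
    if keyword in DDL_KEYWORDS:
        return "DDL"
    if keyword in DML_KEYWORDS:
        return "DML"
    return "UNKNOWN"


def includes_dcl(test_queries):
    """True if at least one test query is classified as DCL."""
    return any(classify(q) == "DCL" for q in test_queries)


workload = [
    "SELECT COUNT(*) FROM orders",
    "GRANT SELECT ON orders TO analyst",
    "COMMIT;",
]
has_dcl = includes_dcl(workload)
```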
As per Claim 2, the combination of Walters, Gilad, Lao, Fan, Brass, Kuppa, and Foster teaches the CIM of claim 1. Walters teaches wherein the pre-processing includes: cleaning the raw data to remove, from the pre-processed data, features that include sensitive information (Walters [0182]: “In some embodiments, at step 1806, sensitive data may be tokenized, masking underlying data values and preserving confidentiality.”)
As per Claim 3, the combination of Walters, Gilad, Lao, Fan, Brass, Kuppa, and Foster teaches the CIM of claim 2. Walters teaches wherein the pre-processing further includes: removing from the pre-processed data information for confidentiality, security, and/or audit reasons (Walters [0202]: “As described above with regard to at least FIGS. 5A and 5B, the disclosed systems and methods can enable identification and removal of sensitive data portions in a dataset.”)
As per Claim 5, the combination of Walters, Gilad, Lao, Fan, Brass, Kuppa, and Foster teaches the CIM of claim 1. Gilad teaches enriching the pre-processed data to replace join and/or constraint information that has been lost (Gilad, Abstract: “The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key (𝐹𝐾) values, and two types of constraints: (1) CCs that apply to the join view and (2) ICs that apply to the table with missing 𝐹𝐾 values, our goal is to impute the missing 𝐹𝐾 values such that the constraints are satisfied.”)
As per Claim 8, the combination of Walters, Gilad, Lao, Fan, Brass, Kuppa, and Foster teaches the CIM of claim 1. Foster teaches wherein the test queries further include at least one member selected from a group consisting of: DDL (data definition language), DML (data manipulation language), a statistical collection, and a query rewrite (Foster, Page 171 Section 10.1: “Structure Query Language (SQL) is an example of a DSL — consisting of DDL, DCL, and DML as defined in chapter 2.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Foster with Walters, Gilad, Lao, Fan, Brass, and Kuppa for at least the reasons recited in the rejection to Claim 1.
As per Claim 10, this is a computer program product claim corresponding to method Claim 1. The difference is that it recites a processor(s) set, a set of storage device(s), and computer code stored collectively in the set of storage device(s), with the computer code including data and instructions to cause the processor(s) set to perform at least the following operations. Walters teaches this in [0006]: “Consistent with the present embodiments, an automated system for optimizing a model is disclosed, the system comprising at least one processor and at least one non-transitory memory storing instructions.” Therefore, Claim 10 is rejected for similar reasons as Claim 1.
As per Claim 11, this is a computer program product claim corresponding to method Claim 2, and is rejected for similar reasons.
As per Claim 12, this is a computer program product claim corresponding to method Claim 3, and is rejected for similar reasons.
As per Claim 14, this is a computer program product claim corresponding to method Claim 5, and is rejected for similar reasons.
As per Claim 17, this is a computer program product claim corresponding to method Claim 8, and is rejected for similar reasons.
As per Claim 19, this is a computer system claim corresponding to method Claim 1. The difference is that it recites a processor(s) set, a set of storage device(s), and computer code stored collectively in the set of storage device(s), with the computer code including data and instructions to cause the processor(s) set to perform at least the following operations. Walters teaches this in [0006]: “Consistent with the present embodiments, an automated system for optimizing a model is disclosed, the system comprising at least one processor and at least one non-transitory memory storing instructions.” Therefore, Claim 19 is rejected for similar reasons as Claim 1.
As per Claim 20, this is a computer system claim corresponding to method Claim 2, and is rejected for similar reasons.
As per Claim 22, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa teaches the CIM of claim 2 as well as purging (see Kuppa in rejection to claim 1). Walters teaches further comprising joining: the new data that remains after purging and the cleaned raw data (Walters [0033], discloses: “In various embodiments, the disclosed systems can be used to tokenize the sensitive portions of a dataset (e.g., mailing addresses, social security numbers, email addresses, account numbers, demographic information, and the like). In some embodiments, the disclosed systems can be used to replace parts of sensitive portions of the dataset (e.g., preserve the first or last 3 digits of an account number, social security number, or the like; change a name to a first and last initial). In some aspects, the dataset can include one or more JSON (JavaScript Object Notation) or delimited files (e.g., comma-separated value, or CSV, files). In various embodiments, the disclosed systems can automatically detect sensitive portions of structured and unstructured datasets and automatically replace them with similar but synthetic values.” Also Walters Para [0053] states: “Dataset generator 103 can be configured to use the data model retrieved from model storage 109 to generate a synthetic dataset by replacing the sensitive data items with synthetic data items.” Examiner notes that Walters makes multiple references to cleaning sensitive data, and then replacing the cleaned parts with synthetic data, which amounts to joining new data with cleaned raw data. Walters also discloses performing both types of operations: in addition to the recited replacing of sensitive parts of data with synthetic data, Walters discloses producing completely new synthetic data, in [0033]: “Using these models, the disclosed embodiments can produce fully synthetic datasets with similar structure and statistics as the original sensitive or non-sensitive datasets. The disclosed embodiments also provide tools for desensitizing datasets and tokenizing sensitive values. In some embodiments, the disclosed systems can include a secure environment for training a model of sensitive data, and a non-secure environment for generating synthetic data with similar structure and statistics as the original sensitive data.” And [0181]: “At step 1806, similar to disclosures made in reference to process 900 (FIG. 9), one or more components of system 100 (e.g., dataset generator 103, model optimizer 107, computational resources 101, or the like) may generate a partially or fully synthetic dataset.” Thus, Walters can produce either “fully” (new data) or “partially” (new data and cleaned raw data) synthetic datasets.)
Claims 6 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa in view of Hu et al. (“The Quasi-Multinomial Synthesizer for Categorical Data”; hereinafter “Hu”), in view of Oracle (“Oracle Database SQL Tuning Guide”) and further in view of Lightstone et al. (“Automated design of multidimensional clustering tables for relational databases”; hereinafter “Lightstone”).
As per Claim 6, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa teaches the CIM of claim 1. Walters teaches wherein the creating of the discriminator data includes: inputting concealed sample data from the database and DB statistics (Walters, as shown in the rejection to Claim 1, discloses in [0182]: “In some embodiments, at step 1806, sensitive data may be tokenized, masking underlying data values and preserving confidentiality.” Walters further states in [0192]: “In some embodiments, the input dataset is a synthetic dataset that includes tokenized data. The synthetic dataset may be based on an actual dataset and tokenized to preserve the structure of an actual dataset.”)
setting corresponding frequency weight and total effective samples ([0099]: “In some embodiments, the similarity metric can depend on a frequency of duplicate elements in the synthetic dataset and the normalized reference dataset. In some aspects, system 100 can be configured to determine the number of duplicate elements in each of the synthetic dataset and the normalized reference dataset.”)
performing calculation for a goodness-of-fit measure ([0088]: “FIG. 9 depicts a process 900 for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments. System 100 can be configured to use process 900 to generate synthetic data that is similar, but not too similar to the actual data, as the actual data can include sensitive personal information. For example, when the actual data includes social security numbers or account numbers, the synthetic data would preferably not simply recreate these numbers. Instead, system 100 would preferably create synthetic data that resembles the actual data, as described below, while reducing the likelihood of overlapping values. To address this technical problem, system 100 can be configured to determine a similarity metric value between the synthetic dataset and the normalized reference dataset, consistent with disclosed embodiments. System 100 can be configured to use the similarity metric value to update a loss function for training the generative adversarial network. In this manner, system 100 can be configured to determine a synthetic dataset differing in value from the normalized reference dataset at least a predetermined amount according to the similarity metric.”)
However, the combination does not teach calculating a categorical distribution for a multinomial distribution; categorically including a database statistics table, column, multi-column, partition table, and feature factors including Low2key, High2key, Frequency, Histogram
Hu teaches calculating a categorical distribution for a multinomial distribution (Hu, Page 80 Section 3.1, discloses: “The DPMPM is a Bayesian version of latent class models. Consider a sample X consists of n records, and each record has p unordered categorical variables. The basic assumption of the DPMPM is that every record Xi = (Xi1, · · · ,Xip) belongs to one of F underlying unobserved/latent classes. Given the latent class assignment zi of record i, as in Eq.(10), each variable Xij independently follows a multinomial distribution … The QM-DPMPM synthesizer just replaces the multinomial draw in Eq. (15) of the DPMPM … the QM-DPMPM obviously reduces to the DPMPM when β = 0. The parameter β is subjectively selected to take the balance of utility and disclosure risk of the synthetic data.”)
Hu is analogous art because it is pertinent to the problem faced by the inventor and by Walters, which is synthetic data generation. It would have been obvious before the effective filing date of the claimed invention to combine the synthetic data generation of Walters with the multinomial distribution calculation of Hu. One of ordinary skill in the art would have been motivated to do so in order to balance the utility and disclosure risks of the created synthetic data (Hu, Abstract: “Characteristics of the Quasi-Multinomial distribution provide a tuning parameter, which allows a Quasi-Multinomial synthesizer to control the balance of the utility and the disclosure risks of synthetic data.”)
However, the combination does not teach categorically including a database statistics table, column, multi-column, partition table, and feature factors including Low2key, High2key, Frequency, Histogram
Oracle teaches categorically including a database statistics table, column, multi-column, partition table, and feature factors including [Low2key, High2key], Frequency, Histogram (Oracle, Page 14-1 to 14-3, discloses: “A column group is a set of columns that is treated as a unit. Essentially, a column group is a virtual column. By gathering statistics on a column group, the optimizer can more accurately determine the cardinality estimate when a query groups these columns together … Individual column statistics are useful for determining the selectivity of a single predicate in a WHERE clause … The diagram shows DBMS_STATS collecting statistics on each column individually and on the group … The following query of the DBA_TAB_COL_STATISTICS table shows information about statistics that have been gathered on the columns.” Oracle, Page 26-34 contains an entry for “Partition table” and in Page 13-14 states: “By default, each partition of a partition table is gathered sequentially.” Oracle Page 11-1 states: “For columns that contain data skew (a nonuniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.” Oracle Page 11-6 states: “In a frequency histogram, each distinct column value corresponds to a single bucket of the histogram. Because each value has its own dedicated bucket, some buckets may have many values, whereas others have few.”)
Oracle is analogous art because it is directed to databases and therefore reasonably pertinent to the problem faced by the inventor and Walters who also cites databases. It would have been obvious before the effective filing date of the claimed invention to combine the database recited by Walters with the database statistics gathering of Oracle. One of ordinary skill in the art would have been motivated to do so in order to maintain performance of the database (Oracle, Page 12-1: “The contents of tables and associated indexes change frequently, which can lead the optimizer to choose suboptimal execution plan for queries. To avoid potential performance issues, statistics must be kept current.”)
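For illustration only, the frequency histogram described in the Oracle passage quoted above (one bucket per distinct column value) can be sketched as follows. This is a minimal sketch and not Oracle's implementation; the function names `frequency_histogram` and `selectivity` and the sample column values are hypothetical.

```python
from collections import Counter


def frequency_histogram(column_values):
    """Build one bucket per distinct value; each bucket holds that value's row count."""
    return dict(Counter(column_values))


def selectivity(histogram, value):
    """Estimate the fraction of rows matching `column = value` from the histogram."""
    total_rows = sum(histogram.values())
    if total_rows == 0:
        return 0.0
    return histogram.get(value, 0) / total_rows


# Hypothetical skewed column: "CA" appears far more often than "TX".
hist = frequency_histogram(["CA", "NY", "CA", "TX", "CA", "NY"])
ca_selectivity = selectivity(hist, "CA")
```

Because every distinct value has its own bucket, a skewed value such as "CA" in this sample receives a proportionally larger estimate than a rare value, which is the property the quoted passage attributes to histograms on skewed columns.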
However, the combination does not teach Low2key, High2key.
Lightstone teaches Low2key, High2key. (Lightstone, Page 1175 Section 3.5, discloses: “For numeric types, coarsification begins by calculating the FUDG coarsification using the HIGH2KEY statistic (second largest column value) and LOW2KEY statistic (second smallest column value) to define the range of the dimensions, then defining an expression that divides that range into cells_max ranges (cells). If the base column has cardinality that is below the FUDG cardinality, then the base column defines the FUDG coarsification for that candidate dimension (i.e., this column’s FUDG coarsification is simply the base column itself and requires no coarsification). We define a mathematical function that divides the range between HIGH2KEY and LOW2KEY into a number of ranges, where the number of ranges is the same as the maximum number of cells possible in the table given the space constraint, as shown in Figure 3.”)
Lightstone is analogous art because it is directed to databases and therefore reasonably pertinent to the problem faced by the inventor and Walters who also cites databases. It would have been obvious before the effective filing date of the claimed invention to combine the database recited by Walters with the HIGH2KEY and LOW2KEY statistics of Lightstone. One of ordinary skill in the art would have been motivated to do so in order to collect statistics regarding the reasonable range of values, while discarding the extreme highest and lowest values in a column (Lightstone, Page 1175 Section 3.5: “HIGH2KEY and LOW2KEY are assumed to represent the reasonable range of values for the dimension.”)
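For illustration only, the range division quoted from Lightstone above (splitting the span between LOW2KEY and HIGH2KEY into cells_max ranges) can be sketched as follows. The exact expression of Lightstone's Figure 3 is not reproduced in this action, so the equal-width formula, the clamping of out-of-range values, and the names `second_extremes` and `coarsify` are assumptions.

```python
def second_extremes(values):
    """Return (LOW2KEY, HIGH2KEY): the second smallest and second largest distinct values."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return distinct[1], distinct[-2]


def coarsify(value, low2key, high2key, cells_max):
    """Map a value to one of `cells_max` equal-width cells between LOW2KEY and HIGH2KEY.

    Assumption: values outside the range are clamped into the first or last cell.
    """
    if high2key <= low2key or cells_max < 1:
        return 0
    width = (high2key - low2key) / cells_max
    index = int((value - low2key) // width)
    return max(0, min(cells_max - 1, index))


# Hypothetical column with one extreme outlier on each side.
column = [1, 10, 20, 30, 40, 1000]
low2key, high2key = second_extremes(column)
cells = [coarsify(v, low2key, high2key, cells_max=3) for v in column]
```

Using the second smallest and second largest values keeps the outliers 1 and 1000 from stretching the range, consistent with the quoted statement that these statistics represent the reasonable range of values for the dimension.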
As per Claim 15, this is a computer program product claim corresponding to method Claim 6, and is rejected for similar reasons.
Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa in view of Hao et al. (“Annealing Genetic GAN for Minority Oversampling”; hereinafter “Hao”).
As per Claim 7, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa teaches the CIM of claim 1. Walters teaches grow data with DB statistics and distribution (As shown above in Rejection to Claim 1 in [0083] and [0109], discriminator data is grown by drawing samples from actual and synthetic data, and performs its actions based on the sample distributions. [0122] discloses that the data can be from a DB: “Streaming data source 1301 can be configured to retrieve new data elements from a database.”)
DB statistics; DB activities and a database statistic refresh (Walters [0122] – [0125]: “Streaming data source 1301 can be configured to retrieve new data elements from a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like. In some aspects, streaming data source 1301 can be configured to retrieve new elements in response to a request from model optimizer 1303. In some aspects, streaming data source 1301 can be configured to retrieve new data elements in real-time … Model optimizer 1303 can be configured to evaluate performance criteria of a newly created synthetic data model. In some embodiments, the performance criteria can include a similarity metric (e.g., a statistical correlation score, data similarity score, or data quality score, as described herein). For example, model optimizer 1303 can be configured to compare the covariances or univariate distributions of a synthetic dataset generated by the new synthetic data model and a reference data stream dataset.” Here, Examiner notes that as data is updated in the database, statistics regarding distribution are calculated.)
However, the combination does not teach the generative model generates multiple attempts to avoid a local optimal solution from DB statistics; the discriminative model evaluates a global optimal solution from DB activities and a database statistic refresh
Hao teaches the generative model generates multiple attempts to avoid a local optimal solution from DB statistics; the discriminative model evaluates a global optimal solution from DB activities and a database statistic refresh (Recall above that Walters teaches DB activities and statistics. Hao, Page 2 discloses a GAN comprising a generator and a discriminator: “Recently, Generative Adversarial Networks (GANs) have shown some potentials to tackle class imbalance problems because theoretically they are able to reproduce the distributions of minority classes through adversarial learning. In the training process of GANs, the generator learns the mapping from a latent encoding space to the minority class distribution, and the discriminator needs to determine whether an input sample is actually drawn from the minority class or created by the generator.” Hao, Page 2 Figure 1 Caption, discloses: “(b) Our method integrates the simulated annealing genetic algorithm into the training of GANs. In doing so, GANs may update to a worse solution with a decreasing probability, which enables GANs to escape from the local optimum.” Hao, Bottom of Page 4, discloses: “In doing so, updating G with a decreasing probability in a worse direction enables AGGAN to asymptotically converge to the global optimum. Finally, after updating the individual, the environment (i.e., the discriminator) D is updated and the training loop of our AGGAN starts the next evolutionary iteration. As the training progresses, the data generated by G gradually close to the true distribution, which helps D to continuously improve classification accuracy.”)
Hao is analogous art because it is in the field of endeavor of GANs, which is similarly used by primary reference Walters. It would have been obvious before the effective filing date of the claimed invention to combine the GAN of Walters with the avoiding of local optima of Hao. One of ordinary skill in the art would have been motivated to do so in order to avoid mode collapse which is a known problem for GANs (Hao Page 2: “There are many successful applications of using GANs. However, GANs can easily get stuck at local optimum when they try to learn the distributions from scarce samples of the minority classes that is also known as mode collapse [3] as shown in Figure 1 (a). A more effective training strategy is highly in demand for GANs to avoid the trapping at the local optimum.”)
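For illustration only, the behavior quoted from Hao above (accepting a worse solution with a decreasing probability) can be sketched with the Metropolis acceptance rule commonly used in simulated annealing. Hao's exact acceptance rule is not quoted in this action, so the exponential form, the temperature parameter, and the names `acceptance_probability` and `accept` are assumptions.

```python
import math
import random


def acceptance_probability(delta, temperature):
    """Probability of accepting a candidate whose loss differs from the current loss by `delta`.

    Improvements (delta <= 0) are always accepted; worse candidates are accepted
    with probability exp(-delta / temperature), which shrinks as temperature falls.
    """
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)


def accept(candidate_loss, current_loss, temperature, rng=random):
    """Decide whether to move to the candidate solution."""
    probability = acceptance_probability(candidate_loss - current_loss, temperature)
    return rng.random() < probability


# The same worse step (delta = 1.0) becomes less likely to be accepted as the temperature cools.
hot = acceptance_probability(1.0, temperature=10.0)
cold = acceptance_probability(1.0, temperature=0.1)
```

Early in training, at high temperature, worse updates are accepted often, which permits escape from a local optimum; as the temperature decreases the search accepts almost only improvements.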
As per Claim 16, this is a computer program product claim corresponding to method Claim 7, and is rejected for similar reasons.
Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa in view of Cai et al. (“Query weighting for ranking model adaptation”; hereinafter “Cai”).
As per Claim 9, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa teaches the CIM of claim 1 as well as performance of the reward operation (see Walters in Rejection to Claim 1). Walters teaches wherein the performance of the reward operation further comprises refreshing the new data [with query rank weighting] (Walters [0122]: “In some aspects, streaming data source 1301 can be configured to retrieve new data elements in real-time” and [0054]: “In various embodiments, dataset generator 103 can be configured to provide the synthetic dataset to database 105 for storage. In such embodiments, computing resources 101 can be configured to subsequently retrieve the synthetic dataset from database 105 directly, or indirectly through model optimizer 107 or dataset generator 103.” The generated synthetic data that is refreshed should be similar to the original data as shown in [0066]: “generate synthetic data satisfying a similarity criterion, as described herein.” Criteria for refreshing data based on rank weighting are taught below by Cai.)
However, the combination does not teach refreshing the new data with query rank weighting.
Cai teaches each query feature of a query normalizes as vectors; refresh the data with query rank weighting. (Examiner notes that while Walters teaches a synthetic dataset that is similar to an original dataset, Cai also teaches a “source domain” and “target domain” in Page 112: “transfer ranking knowledge from the source domain with plenty of labeled data to the target domain.” Cai, Page 114 Top Left: “In this work, we present two simple but very effective approaches attempting to resolve the problem from distinct perspectives: (1) we compress each query into a query feature vector by aggregating all of its document instances, and then conduct query weighting on these query feature vectors” and Page 113: “Inspired by the principle of listwise approach, we hypothesize that the importance weighting for ranking model adaptation could be done better at query level rather than document level.”)
Cai is analogous art because it is in the field of endeavor of machine learning and information retrieval (Cai Page 112 Intro: “Learning to rank, which aims at ranking documents in terms of their relevance to user’s query, has been widely studied in machine learning and information retrieval communities.”) It would have been obvious before the effective filing date of the claimed invention to combine the database augmentation with synthetic data of Walters and the query rank weighting of Cai. One of ordinary skill in the art would have been motivated to do so in order to efficiently estimate query importance (Cai, Abstract: “This method can efficiently estimate query importance by compressing query data, but the potential risk is information loss resulted from the compression. The second measures the similarity between the source query and each target query, and then combines these fine-grained similarity values for its importance estimation. Adaptation experiments on LETOR3.0 data set demonstrate that query weighting significantly outperforms document instance weighting method.”)
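For illustration only, the two approaches quoted from Cai above (compressing each query into a feature vector by aggregating its document instances, and weighting a source query by its similarity to target queries) can be sketched as follows. Cai's actual aggregation and weighting formulas are not quoted in this action; the mean aggregation, the cosine similarity, the averaging of similarities, and all function names below are assumptions.

```python
import math


def query_feature_vector(doc_vectors):
    """Compress a query into one vector by averaging its document feature vectors (assumed aggregation)."""
    count = len(doc_vectors)
    dimension = len(doc_vectors[0])
    return [sum(vec[i] for vec in doc_vectors) / count for i in range(dimension)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_weight(source_query_vector, target_query_vectors):
    """Weight a source query by its mean similarity to the target queries (assumed combination)."""
    similarities = [cosine_similarity(source_query_vector, t) for t in target_query_vectors]
    return sum(similarities) / len(similarities)


# Hypothetical source query with two documents, compared against two target queries.
source_vector = query_feature_vector([[1.0, 0.0], [3.0, 2.0]])
weight = query_weight([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, a source query whose aggregated vector resembles the target queries receives a larger weight, which is the query-level (rather than document-level) weighting the quoted passages describe.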
As per Claim 18, this is a computer program product claim corresponding to method Claim 9, and is rejected for similar reasons.
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa in view of Li Wanxin (“Supporting Database Constraints in Synthetic Data Generation based on Generative Adversarial Networks”; hereinafter “Li”).
As per Claim 21, the combination of Walters, Gilad, Lao, Fan, Brass, and Kuppa teaches the CS of claim 19. However, the combination does not teach wherein the operations further comprise enriching the pre-processed data to replace join and/or constraint information that has been lost.
Li teaches wherein the operations further comprise enriching the pre-processed data to replace join and/or constraint information that has been lost (Li, Page 2875, discloses: “In our research, we focus on data synthesization for relational databases where the database constraints [1] of the original data must be imposed to the generated data … We offer solutions by designing extensions to Tabular Generative Adversarial Network (TGAN) [5] algorithm … In order to support database constraints, we extended TGAN as follows … In training, for a given set of constraints 𝒞 , we construct an additional penalty term ap used in the loss function by the following steps.” Examiner notes that here, Li extends synthesis of database data in order to maintain relational constraints.)
Li is analogous art because it is in the field of endeavor of relational data synthesis. It would have been obvious before the effective filing date of the claimed invention to combine the relational data synthesis of Walters and Fan with the enrichment of the data to replace constraint information that would have been lost of Li. One of ordinary skill in the art would have been motivated to do so in order to maintain the requirement of enforcing database constraints, as database constraints are often key to maintaining data integrity and performance (Li: “In our research, we focus on data synthesization for relational databases where the database constraints [1] of the original data must be imposed to the generated data.”)
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached on (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LEONARD A SIEGER/Examiner, Art Unit 2126