DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 21 November 2025 has been entered. Applicant amended claims 1, 11, and 19. Accordingly, claims 1-20 remain pending.
Response to Arguments
Regarding the 35 USC 101 Rejection:
Applicant’s arguments, filed 21 November 2025, with respect to the 35 USC 101 rejection have been fully considered and are persuasive. The 35 USC 101 rejection of 28 August 2025 has been withdrawn.
Regarding the 35 USC 103 Rejection:
Applicant’s arguments with respect to the independent claim(s) have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, at the time the application was filed, had possession of the claimed invention.
Regarding claims 1, 11, and 19 and the limitations of “training by a processor configured to access training data comprising the stored synthetic trends dataset from the storage device, a machine learning model using the training data”, the specification does not provide adequate written description for the claimed training such that a person of ordinary skill in the art would understand that the inventor possessed the claimed invention at the time of filing. The written description fails to provide specific, concrete embodiments that teach a practical way to implement the training of the machine learning model using the training data, including equations and algorithmic steps. The written description does not provide concrete, non-ambiguous examples with actual datasets illustrating the training of the machine learning model using the training dataset.
Claims 2-10 are rejected because said claims depend upon rejected claim 1.
Claims 12-18 are rejected because said claims depend upon rejected claim 11.
Claim 20 is rejected because said claim depends upon rejected claim 19.
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1, 11, and 19 recite the limitations of “training by a processor configured to access training data comprising the stored synthetic trends dataset from the storage device, a machine learning model using the training data”. However, the claims fail to describe and distinctly claim the parameters and how the parameters are chosen for the training. The claims do not explain or correlate the data to a mathematical relationship, nor provide formulas for the training. Thus, the limitations in the claims are indefinite. The examiner has interpreted the limitations as best understood.
Claims 2-10 are rejected because said claims depend upon rejected claim 1.
Claims 12-18 are rejected because said claims depend upon rejected claim 11.
Claim 20 is rejected because said claim depends upon rejected claim 19.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7, 9-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty et al., US 20190156061 (hereinafter Chakraborty), in view of Kwatra et al., US 20230154609 (hereinafter Kwatra), further in view of Lesh et al., US 20230010686 (hereinafter Lesh), further in view of Aptekar et al., US 11977550 (hereinafter Aptekar), and further in view of Pauly, US 20200035365 (hereinafter Pauly).
As to claim 1, Chakraborty teaches a method for generating a synthetic trends dataset (abstract discloses generating anonymized synthetic data by adding noise to a first set of data from one or more data stores. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 32, 52, 68, and 77-78 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data) for training a machine learning model (paragraphs 31, 60-62 and steps 608 and 612 reveal the embedding vector model is retrained using the output candidate for anonymization. The data record strings are sent through the word embedding vector model to obtain a candidate output for anonymization/synthetic data), the method comprising:
[Figure 10 of Chakraborty]
receiving a healthcare dataset from a database in a first segregated data environment of a federated data cleanroom, the healthcare dataset comprising personally identifiable information (PII) pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with one or more data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 23 and 68 disclose the datastore record can include information such as disease. Paragraphs 77-78 reveal the data can come from private databases such as health data from a hospital; therefore, the databases are federated. See also Figure 9, reference number 910 or Figure 10, reference number 1008);
receiving a supplementary dataset from a database in a second segregated data environment of the federated data cleanroom, the supplementary data comprising PII pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with one or more data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 32, 52, and 68 disclose the datastore record can include information such as education level, zip code, and/or salary. The data can also include the name of a person or SSN. Paragraphs 77-78 reveal the data can come from private databases; therefore, the databases are federated. See also Figure 9, reference number 912 or Figure 10, reference number 1010);
anonymizing the data stored in each database, wherein the anonymized data from each database is stored in the corresponding segregated database (paragraph 52 discloses anonymizing received data values by deleting, masking, changing, and/or encrypting the dataset such that the data are not discernible to a human reader. Paragraph 79 further discloses each of the private databases is anonymized to a corresponding anonymized database. See also Figure 9, reference numbers 904, 906, and 908);
generating, for each segregated data environment, a numerical representation of each data feature of a plurality of data features of the anonymized data stored in the corresponding segregated data environment, wherein the numerical representation comprises a first sequence of embedding vectors (paragraphs 30 and 53-54 disclose, for the plurality of real strings and real number values in the datastores of the anonymized datastore, identifying quasi-identifiers/data features and sensitive identifier values for both string data and numeral data; normalizing the data (if the data is a numerical dataset) and generating one or more tensor values for the string data and the normalized numerical data. Tensors are numerical representations of the data record tokens that are fed through an embedding vector model (for word data) or a normalization step (for numerical data). Vector embedding converts words/numbers into a number sequence), wherein each first sequence of embedding vectors is a compressed representation of the corresponding data feature (paragraphs 30-31 disclose the data tensors are representations of data in a different form than the data itself, such as a token representing the data. When text strings are fed through the word embedding vector model, the output is the data tensors, which may be represented as integers or other real numbers. The data tensors are mapped from a high-dimensionality space to a lower-dimensionality space; thus the data is compressed);
determining, for each segregated data environment, a second sequence of embedding vectors, wherein the second sequence of embedding vectors is a transformation of the corresponding first sequence of embedding vectors, the transformation comprising a reduction of information from the first sequence of embedding vectors (paragraph 56 discloses adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h1, h2, or h3), i.e., h = f(d). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300-dimensional vector represented as a table of token records (d1-d4) can be mapped to a 100-dimensional vector (h1-h3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm. An illustrative sketch of this encoder mapping and the associated 1/k re-identification bound is set forth after the claim mapping below);
modifying, upon determining a risk of disclosure is above a disclosure threshold, the plurality of data features for a corresponding segregated data environment, wherein the risk of disclosure is indicative of a likelihood that the PII of the data stored in a database of the corresponding segregated data environment is obtainable from the corresponding second sequence of embedding vectors (paragraph 47 further discloses utilizing the Mondrian algorithm to satisfy k-anonymity. Generalization replaces or changes quasi-identifier values with values that are less specific but semantically consistent (e.g., values that belong to the same ontological class). As a result, more records will have the same set of quasi-identifier values. The Mondrian algorithm recursively chooses the split attribute with the largest normalized range of values and (for continuous or ordinal attributes) partitions/modifies the data around the median value of the split attribute, which is repeated until there is no allowable split. A record in a k-anonymized dataset has a maximum probability 1/k of being re-identified, i.e., the disclosure risk);
generating the synthetic trends dataset (paragraph 48 discloses the output of the Mondrian algorithm is a set of equivalence classes, where each equivalence class is of size at least k to meet k-anonymity. The equivalence class members at this point can be codes (derived from the data vectors using the encoder). Accordingly, at the output of the anonymizer layer (e.g., the anonymizer of FIG. 1), the c1, c2, c3, and c4 are indistinguishable. They are all represented by noisy code C. In some embodiments, different equivalence classes have different centroids and thus different outputs. The output layer is the synthetic trends dataset that comprises the second sequence of embedding vectors/noise associated with the anonymized databases);
training, by a processor configured to access training data comprising the ….synthetic dataset, a machine learning model using the training data (Figure 1 and paragraphs 31 and 62 disclose the word embedding vector model(s) are retrained using the anonymized data. After the decoder turns the data into the anonymized data, the word embedding vector model(s) is retrained, which is described in more detail below. This is indicated by the arrow from the decoder to the word embedding vector model(s)); and
outputting the trained machine learning model configured to process input data indicative of healthcare information and consumer information related to the individual (Figures 1 and 6 and paragraphs 31, 58, and 62 reveal the retrained embedded vector model processes data record strings/input data. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 32, 52, 68, and 77-78 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data).
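By way of illustration only, and not as a characterization of Chakraborty’s actual implementation, the encoder mapping h = f(d) and the 1/k re-identification bound discussed above can be summarized in a minimal Python sketch; the 300-to-100 projection, the value of k, the threshold, and all identifiers below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# First sequence: four hypothetical 300-dimensional embedding vectors d1..d4
# (cf. Chakraborty, paragraphs 45-50).
first_sequence = rng.normal(size=(4, 300))

# Encoder h = f(d): here a fixed random projection from 300 to 100 dimensions
# stands in for the learned encoder; discarding dimensions effects the recited
# "reduction of information from the first sequence of embedding vectors".
W = rng.normal(size=(300, 100)) / np.sqrt(300)
second_sequence = first_sequence @ W  # shape (4, 100)

# k-anonymity bound (cf. paragraph 47): a record in a k-anonymized dataset has
# a maximum re-identification probability of 1/k.
k = 5
disclosure_risk = 1.0 / k
DISCLOSURE_THRESHOLD = 0.25  # hypothetical policy value
if disclosure_risk > DISCLOSURE_THRESHOLD:
    print("risk above threshold: modify data features and re-partition")
else:
    print(f"risk {disclosure_risk:.2f} within threshold;",
          "second sequence shape:", second_sequence.shape)
```

The random projection merely stands in for the learned encoder; any mapping that discards dimensions effects the recited reduction of information.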
Chakraborty does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset; and outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action.
Kwatra teaches outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action (paragraph 24 discloses collected data is anonymized. Paragraphs 35-40 reveal collecting the anonymized data, which is input to a machine learning module to determine activities or behavior that the user performs. The system/bot correlates a score value or recommendation that was identified by the trained machine learning component using the anonymized collected data and outliers of the collected data. The correlation/probability score is used to determine which recommended action best matches the user’s known activities/behavior, and a health score is used to determine which activities maximize improvement in health and/or minimize health risks. Paragraphs 32, 38, 60, and 67 disclose the collected data includes supplementary data such as health and/or fast food orders, and electronic health data such as medical records and health reports).
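For illustration only, the correlation-score matching attributed to Kwatra above can be sketched as follows; the candidate actions, the numeric scores, and the multiplicative weighting are hypothetical and are not Kwatra’s disclosed formula.

```python
# Hypothetical correlation scores between the user's observed activities and
# candidate recommended actions, paired with hypothetical health-gain scores
# (cf. Kwatra, paragraphs 35-40).
candidates = {
    "walk_30_minutes":  {"correlation": 0.82, "health_gain": 0.40},
    "reduce_fast_food": {"correlation": 0.65, "health_gain": 0.55},
    "earlier_bedtime":  {"correlation": 0.71, "health_gain": 0.30},
}

# Select the action that best matches known behavior while maximizing health
# improvement; the multiplicative weighting is an assumption for illustration.
best_action = max(candidates,
                  key=lambda a: candidates[a]["correlation"] * candidates[a]["health_gain"])
print("recommended action:", best_action)  # -> reduce_fast_food
```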
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends to include the steps of Kwatra’s teachings of correlating a scored value to recommended actions that best match the user’s activity/behavior, to provide personalized actionable alerts for a user to improve the user’s health (paragraph 16 of Kwatra).
The combination of Chakraborty in view of Kwatra does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Lesh teaches generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment (paragraphs 64-65 and Figure 5B disclose the medical records (from databases; see paragraph 48) for a patient are transformed and, for each record, a numerical vector/embedded vector is generated. The medical records are transformed into a concatenated vector using the embedding vectors and are inputted into a generative model (a shared data environment; per paragraph 48 the synthetic data is generated from a single source database or multiple source databases) to generate the synthetic dataset), wherein the synthetic trends dataset has [threshold metric difference] with the healthcare dataset and the supplementary dataset (paragraph 66 further discloses a discriminator can determine a distribution over the state of the sample based on the received input of the synthetic dataset and the authentic medical datasets (which can be the combined healthcare dataset and the dosage dataset), detecting whether the sample is synthetic. Paragraphs 80-83 also disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. For statistical validation, the generative system may utilize statistical measures over a population to demonstrate that the generated data has similar characteristics to the original/authentic data. For example, the system may utilize a distribution visualization called a violin plot to determine whether the synthetic data is comparable to the authentic data within an acceptable threshold. The system may also utilize other statistical validation such as a plot of a univariate correlation between each pair of variables in the original authentic electronic health record dataset and compare this against the same correlation plot in the synthetic electronic health record dataset).
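The distance-metric leakage check attributed to Lesh (paragraphs 80-83) can be sketched minimally as follows; the record shapes and the threshold value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
real_records = rng.normal(size=(5, 8))       # authentic records (hypothetical)
synthetic_records = rng.normal(size=(5, 8))  # generated records (hypothetical)

MIN_DISTANCE = 0.5  # hypothetical minimum allowed real-to-synthetic distance

# Pairwise Euclidean distances; any pair closer than the threshold is flagged
# for manual review as possible leakage of authentic data.
for i, r in enumerate(real_records):
    for j, s in enumerate(synthetic_records):
        distance = float(np.linalg.norm(r - s))
        if distance < MIN_DISTANCE:
            print(f"flag pair (real {i}, synthetic {j}): distance {distance:.3f}")
```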
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value, with Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, such that the system can provide quality synthetic datasets that maintain a threshold distance from the authentic datasets and can quickly update the synthetic dataset using the machine learning model, such that the gradient/distance in distribution between the synthetic dataset and the authentic dataset meets a threshold value (paragraph 95 of Lesh).
The combination of Chakraborty in view of Kwatra and Lesh does not teach storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Aptekar teaches storing the synthetic trends dataset in a storage device (column 3, lines 62-65 disclose outputting the synthetic dataset to a storage system), wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage (column 11, lines 11-40 disclose the synthetic dataset is generated based on combined sequences/subsequence patterns of feature vectors. The feature vectors are embedded in a lower-dimension space).
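As a rough illustration of why a lower-dimensional combined representation requires less storage, under assumed sizes (1,000 records; 300 raw features versus 100 embedded dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical sizes: the combined raw datasets versus the same records as
# combined lower-dimensional embedding sequences.
raw_combined = rng.normal(size=(1000, 300)).astype(np.float32)
embedded_combined = rng.normal(size=(1000, 100)).astype(np.float32)

# The lower-dimensional representation occupies proportionally less storage:
# 1,200,000 bytes versus 400,000 bytes here.
print(raw_combined.nbytes, "bytes vs", embedded_combined.nbytes, "bytes")
```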
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value and Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, with Aptekar’s teachings of storing the synthetic dataset, to improve data fidelity and data privacy by obfuscating the data in a way that preserves its statistical properties while rendering it unidentifiable to unauthorized users. This may allow researchers to more widely access and analyze such data at an earlier stage of clinical trials, which can accelerate the development of new drugs and medical treatments (column 2, lines 2-9 of Aptekar).
The combination of Chakraborty in view of Kwatra, Lesh, and Aptekar does not teach, but Pauly teaches, wherein the synthetic trends dataset has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset (paragraphs 102 and 147 disclose the encoded output data/synthetic data has a dimension that is lower than the medical data. The length of a vector representing medical data of the patient may be longer than the length of a vector that is an encoded output matrix encoded from the vector of medical data by an encoding module. Paragraph 4 reveals the data can include imaging, laboratory test measurements, and clinical and family history data. Family history can be a supplementary dataset. Paragraph 207 reveals the encoded data set has a higher degree of privacy than the raw dataset).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value, Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, and Aptekar’s teachings of storing the synthetic dataset, with Pauly’s teachings of the encoded data set having less PII, such that the data can be sent to remote computing environments that have a lower privacy standard without compromising the data (paragraph 207 of Pauly).
As to claim 2, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the transformation of the first sequence of embedding vectors comprises reducing a dimensionality of the first sequence of embedding vectors (Chakraborty: paragraph 56 discloses the transformation of the first embedded vector/sequence by adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h1, h2, or h3), i.e., h = f(d). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300-dimensional vector represented as a table of token records (d1-d4) can be mapped to a 100-dimensional vector (h1-h3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm).
As to claim 3, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the transformation of the first sequence of embedding vectors comprises a lossy compression of the first sequence of embedding vectors (Chakraborty: paragraph 56 discloses the transformation of the first embedded vector/sequence by adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h1, h2, or h3), i.e., h = f(d). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm).
As to claim 4, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the transformation of the first sequence of embedding vectors comprises adding noise to the first sequence of embedding vectors, wherein the noise comprises one or more sources of noise (Chakraborty: paragraph 56 discloses noise is added by the noise propagation module (source of noise) to the tensors, which are the first sequence of embedding vectors).
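A minimal sketch of adding noise from more than one source to a first sequence of embedding vectors follows; the noise distributions and scales are assumptions, not values from Chakraborty.

```python
import numpy as np

rng = np.random.default_rng(3)
first_sequence = rng.normal(size=(4, 100))  # hypothetical embedding vectors

# Two hypothetical noise sources combined and added to the vectors.
gaussian_noise = rng.normal(scale=0.10, size=first_sequence.shape)
laplacian_noise = rng.laplace(scale=0.05, size=first_sequence.shape)
noisy_sequence = first_sequence + gaussian_noise + laplacian_noise
```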
As to claim 5, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the generation of the first sequence of embedding vectors for each segregated data environment comprises a principal component analysis of the corresponding dataset (Chakraborty: paragraph 27 discloses analysis of the dataset via the embedding vector module to generate the tensor vectors).
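Because claim 5 recites a principal component analysis, a minimal PCA-based sketch of generating the first sequence of embedding vectors follows; the matrix sizes and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
anonymized = rng.normal(size=(200, 50))  # hypothetical anonymized feature matrix

# Project onto the top 10 principal components to form the first sequence of
# embedding vectors under the principal-component-analysis reading.
pca = PCA(n_components=10)
first_sequence = pca.fit_transform(anonymized)
print(first_sequence.shape)                 # (200, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```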
As to claim 6, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches further comprising generating a token from the PII of the healthcare dataset and a token from the PII of the supplementary dataset, wherein each token is operative to link the corresponding dataset to data stored outside of the federated data cleanroom (Chakraborty: paragraphs 30, 33, and 35 disclose the generated data tensors are representations of data in a different form, such as a token representing the dataset, and are linked/correspond to the real data values in the data records (health and supplementary, as mapped for claim 1 above)).
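One common way such linkage tokens are realized in practice is a salted one-way hash; the following sketch is offered purely as an illustration of the claimed token behavior. The salted-hash construction, the salt value, and all names are assumptions not taken from any cited reference.

```python
import hashlib

def pii_token(pii: str, salt: str = "cleanroom-salt") -> str:
    """Derive a linkage token from PII via a salted SHA-256 digest
    (illustrative assumption only)."""
    return hashlib.sha256((salt + pii).encode("utf-8")).hexdigest()[:16]

# The same individual yields the same token in both datasets, so records can
# be linked outside the cleanroom without exposing the raw PII itself.
healthcare_token = pii_token("Jane Doe|1980-01-01")
supplementary_token = pii_token("Jane Doe|1980-01-01")
assert healthcare_token == supplementary_token
```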
As to claim 7, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches further comprising:
determining a utility of the synthetic trends dataset, wherein the utility is indicative of a quality of the synthetic trends dataset with respect to a particular task (Kwatra: paragraphs 57 and 63 disclose determining a correlated score, wherein the correlated score is linked to a set of recommended actions based on the analyzed dataset. The correlated score and recommended action are the utility, which provides an indication of a quality of the dataset with respect to a particular task because the recommended action causes the one or more health parameters to fall within the threshold range. Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. See also paragraphs 94-95 regarding determining first and second gradients to generate synthetic data records);
determining that the utility of the synthetic trends dataset is below a utility threshold that represents a minimum required quality of insights generated based on analytics of the synthetic trends dataset (Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved);
modifying, based on the utility of the synthetic trends dataset being below the utility threshold, one or more of the determined data features to increase the utility of the synthetic trends dataset (Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. Paragraphs 94-95 disclose updating first and second gradients to meet a threshold loss distribution that contributes to the generation of the synthetic data records. Paragraphs 5, 46-47, and 51 disclose adding noise to modify and control the generation of the synthetic dataset); and
after the modifying, outputting insights generated based on analytics of the synthetic trends dataset (Kwatra: paragraphs 21 and 52 disclose altering/modifying the recommended frequency of the action to result in improvement of the health parameter falling within the threshold range. Paragraphs 63-64 disclose the personalized actions are sent to the user based on the correlation score and health risk score for each known activity and health parameter). Motivation is similar to the motivation presented in claim 1. An illustrative sketch of this modify-and-re-evaluate utility loop is set forth below.
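The following minimal sketch illustrates the modify-and-re-evaluate utility loop; the utility metric, threshold, and update rule are stand-in assumptions, not the validation actually disclosed by Lesh or Kwatra.

```python
import numpy as np

UTILITY_THRESHOLD = 0.8  # hypothetical minimum insight quality

def utility(synthetic: np.ndarray, real: np.ndarray) -> float:
    # Stand-in metric: agreement of per-feature means (an assumption only;
    # Lesh's validation uses distance metrics and distribution comparisons).
    return float(1.0 - np.abs(synthetic.mean(0) - real.mean(0)).mean())

rng = np.random.default_rng(5)
real = rng.normal(size=(100, 10))
synthetic = rng.normal(loc=0.5, size=(100, 10))

while utility(synthetic, real) < UTILITY_THRESHOLD:
    # Modify the data features (here, nudge the synthetic distribution toward
    # the real one) and re-evaluate, per the claimed modify-then-output flow.
    synthetic += 0.1 * (real.mean(0) - synthetic.mean(0))

print("utility after modification:", round(utility(synthetic, real), 3))
```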
As to claim 9, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein generating the first sequence of embedding vectors comprises capturing a variance of the anonymized data in fewer dimensions than the dimensionality of the anonymized data (Chakraborty: paragraph 33 discloses that in generating the tensors, the numerical data is normalized. Normalization captures the variance/difference of an anonymized data value from the mean, and the mean has fewer dimensions than the anonymized dataset. For word data, paragraphs 39-40 reveal an equation that yields the normalized probabilistic model for language modeling).
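A minimal sketch of the normalization reading above follows; the data shape and values are assumptions. The per-feature mean and standard deviation, which have far fewer dimensions than the records themselves, summarize the spread (variance) of the anonymized data.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical anonymized numeric records: 6 records, 3 features.
anonymized = rng.normal(loc=50.0, scale=7.0, size=(6, 3))

# Per-feature mean and standard deviation (3 values each) summarize the
# spread of the data in fewer dimensions than the data itself.
mean = anonymized.mean(axis=0)
std = anonymized.std(axis=0)
normalized = (anonymized - mean) / std  # variance captured relative to mean
```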
As to claim 10, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the healthcare dataset comprises a plurality of alphanumeric codes, wherein each alphanumeric code is mapped to an embedding vector (Chakraborty: paragraphs 32-34 disclose that the data records include various fields or attributes with both real numbers and strings. These fields are linked to tensors/embedded vectors. Paragraph 48 discloses the output of the Mondrian algorithm is a set of equivalence classes. The equivalence class members are codes (derived from the data vectors using the encoder)).
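A minimal sketch of mapping alphanumeric codes to embedding vectors follows; the codes shown are invented for illustration and the embedding matrix is randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical alphanumeric healthcare codes (invented for illustration, not
# taken from any reference) mapped to rows of an embedding matrix.
codes = ["A10.1", "B20.4", "C33.9"]
vocab = {code: i for i, code in enumerate(codes)}
embedding_matrix = rng.normal(size=(len(codes), 16))

vector = embedding_matrix[vocab["B20.4"]]  # the embedding vector for one code
print(vector.shape)  # (16,)
```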
As to claim 11, Chakraborty teaches a system (Figure 10 discloses a system computing environment) comprising:
one or more computers (Figure 13, reference number 12 “Computing Device”; paragraph 109 discloses the computing device shown in Figure 13 is a general-purpose computing device. The computing device represents computing devices that include the anonymized databases and the user devices shown in Figure 13);
one or more computer-readable media (Figure 13, reference number 28 “Memory”) storing instructions that are operable, when executed by the one or more computers, to perform operations (paragraph 112 reveals memory may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention) for generating a synthetic trends dataset (abstract discloses generating anonymized synthetic data by adding noise to a first set of data from one or more data stores. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 32, 52, 68, and 77-78 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data) for training a machine learning model (paragraphs 31, 60-62 and steps 608 and 612 reveal the embedding vector model is retrained using the output candidate for anonymization. The data record strings are sent through the word embedding vector model to obtain a candidate output for anonymization/synthetic data), the operations comprising:
receiving a healthcare dataset from a database in a first segregated data environment of a federated data cleanroom, the healthcare dataset comprising personally identifiable information (PII) pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with one or more data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 23 and 68 disclose the datastore record can include information such as disease. Paragraphs 77-78 reveal the data can come from private databases such as health data from a hospital; therefore, the databases are federated. See also Figure 9, reference number 910 or Figure 10, reference number 1008);
receiving a supplementary dataset from a database in a second segregated data environment of the federated data cleanroom, the supplementary data comprising PII pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with one or more data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 32, 52, and 68 disclose the datastore record can include information such as education level, zip code, and/or salary. The data can also include the name of a person or SSN. Paragraphs 77-78 reveal the data can come from private databases; therefore, the databases are federated. See also Figure 9, reference number 912 or Figure 10, reference number 1010);
anonymizing the data stored in each database, wherein the anonymized data from each database is stored in the corresponding segregated database (paragraph 52 discloses anonymizing received data values by deleting, masking, changing, and/or encrypting the dataset such that the data are not discernible to a human reader. Paragraph 79 further discloses each of the private databases is anonymized to a corresponding anonymized database. See also Figure 9, reference numbers 904, 906, and 908);
generating, for each segregated data environment, a numerical representation of each data feature of a plurality of data features of the anonymized data stored in the corresponding segregated data environment, wherein the numerical representation comprises a first sequence of embedding vectors (paragraphs 30 and 53-54 disclose, for the plurality of real strings and real number values in the datastores of the anonymized datastore, identifying quasi-identifiers/data features and sensitive identifier values for both string data and numeral data; normalizing the data (if the data is a numerical dataset) and generating one or more tensor values for the string data and the normalized numerical data. Tensors are numerical representations of the data record tokens that are fed through an embedding vector model (for word data) or a normalization step (for numerical data). Vector embedding converts words/numbers into a number sequence), wherein each first sequence of embedding vectors is a compressed representation of the corresponding data feature (paragraphs 30-31 disclose the data tensors are representations of data in a different form than the data itself, such as a token representing the data. When text strings are fed through the word embedding vector model, the output is the data tensors, which may be represented as integers or other real numbers. The data tensors are mapped from a high-dimensionality space to a lower-dimensionality space; thus the data is compressed);
determining, for each segregated data environment, a second sequence of embedding vectors, wherein the second sequence of embedding vectors is a transformation of the corresponding first sequence of embedding vectors, the transformation comprising a reduction of information from the first sequence of embedding vectors (paragraph 56 discloses adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h1, h2, or h3), i.e., h = f(d). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300-dimensional vector represented as a table of token records (d1-d4) can be mapped to a 100-dimensional vector (h1-h3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm);
modifying, upon determining a risk of disclosure is above a disclosure threshold, the plurality of data features for a corresponding segregated data environment, wherein the risk of disclosure is indicative of a likelihood that the PII of the data stored in a database of the corresponding segregated data environment is obtainable from the corresponding second sequence of embedding vectors (paragraph 47 further discloses utilizing the Mondrian algorithm to satisfy k-anonymity. Generalization replaces or changes quasi-identifier values with values that are less specific but semantically consistent (e.g., values that belong to the same ontological class). As a result, more records will have the same set of quasi-identifier values. The Mondrian algorithm recursively chooses the split attribute with the largest normalized range of values and (for continuous or ordinal attributes) partitions/modifies the data around the median value of the split attribute, which is repeated until there is no allowable split. A record in a k-anonymized dataset has a maximum probability 1/k of being re-identified, i.e., the disclosure risk);
generating the synthetic trends dataset (paragraph 48 discloses the output of the Mondrian algorithm is a set of equivalence classes, where each equivalence class is of size at least k to meet k-anonymity. The equivalence class members at this point can be codes (derived from the data vectors using the encoder). Accordingly, at the output of the anonymizer layer (e.g., the anonymizer of FIG. 1), the c1, c2, c3, and c4 are indistinguishable. They are all represented by noisy code C. In some embodiments, different equivalence classes have different centroids and thus different outputs. The output layer is the synthetic trends dataset that comprises the second sequence of embedding vectors/noise associated with the anonymized databases);
training, by a processor configured to access training data comprising the ….synthetic dataset, a machine learning model using the training data (Figure 1 and paragraphs 31 and 62 disclose the word embedding vector model(s) are retrained using the anonymized data. After the decoder turns the data into the anonymized data, the word embedding vector model(s) is retrained, which is described in more detail below. This is indicated by the arrow from the decoder to the word embedding vector model(s)); and
outputting the trained machine learning model configured to process input data indicative of healthcare information and consumer information related to the individual (Figures 1 and 6 and paragraphs 31, 58, and 62 reveal the retrained embedded vector model processes data record strings/input data. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 32, 52, 68, and 77-78 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data).
Chakraborty does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset; and outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action.
Kwatra teaches outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action (paragraph 24 discloses collected data is anonymized. Paragraphs 35-40 reveal collecting the anonymized data, which is input to a machine learning module to determine activities or behavior that the user performs. The system/bot correlates a score value or recommendation that was identified by the trained machine learning component using the anonymized collected data and outliers of the collected data. The correlation/probability score is used to determine which recommended action best matches the user’s known activities/behavior, and a health score is used to determine which activities maximize improvement in health and/or minimize health risks. Paragraphs 32, 38, 60, and 67 disclose the collected data includes supplementary data such as health and/or fast food orders, and electronic health data such as medical records and health reports).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends to include the steps of Kwatra’s teachings of correlating a scored value to recommended actions that best match the user’s activity/behavior, to provide personalized actionable alerts for a user to improve the user’s health (paragraph 16 of Kwatra).
The combination of Chakraborty in view of Kwatra does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Lesh teaches generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment (paragraphs 64-65 and Figure 5B disclose the medical records (from databases; see paragraph 48) for a patient are transformed and, for each record, a numerical vector/embedded vector is generated. The medical records are transformed into a concatenated vector using the embedding vectors and are inputted into a generative model (a shared data environment; per paragraph 48 the synthetic data is generated from a single source database or multiple source databases) to generate the synthetic dataset), wherein the synthetic trends dataset has [threshold metric difference] with the healthcare dataset and the supplementary dataset (paragraph 66 further discloses a discriminator can determine a distribution over the state of the sample based on the received input of the synthetic dataset and the authentic medical datasets (which can be the combined healthcare dataset and the dosage dataset), detecting whether the sample is synthetic. Paragraphs 80-83 also disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. For statistical validation, the generative system may utilize statistical measures over a population to demonstrate that the generated data has similar characteristics to the original/authentic data. For example, the system may utilize a distribution visualization called a violin plot to determine whether the synthetic data is comparable to the authentic data within an acceptable threshold. The system may also utilize other statistical validation such as a plot of a univariate correlation between each pair of variables in the original authentic electronic health record dataset and compare this against the same correlation plot in the synthetic electronic health record dataset).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value, with Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, such that the system can provide quality synthetic datasets that maintain a threshold distance from the authentic datasets and can quickly update the synthetic dataset using the machine learning model, such that the gradient/distance in distribution between the synthetic dataset and the authentic dataset meets a threshold value (paragraph 95 of Lesh).
The combination of Chakraborty in view of Kwatra and Lesh does not teach storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Aptekar teaches storing the synthetic trends dataset in a storage device (column 3, lines 62-65 disclose outputting the synthetic dataset to a storage system), wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage (column 11, lines 11-40 disclose the synthetic dataset is generated based on combined sequences/subsequence patterns of feature vectors. The feature vectors are embedded in a lower-dimension space).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value and Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, with Aptekar’s teachings of storing the synthetic dataset, to improve data fidelity and data privacy by obfuscating the data in a way that preserves its statistical properties while rendering it unidentifiable to unauthorized users. This may allow researchers to more widely access and analyze such data at an earlier stage of clinical trials, which can accelerate the development of new drugs and medical treatments (column 2, lines 2-9 of Aptekar).
The combination of Chakraborty in view of Kwatra, Lesh, and Aptekar does not teach, but Pauly teaches, wherein the synthetic trends dataset has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset (paragraphs 102 and 147 disclose the encoded output data/synthetic data has a dimension that is lower than the medical data. The length of a vector representing medical data of the patient may be longer than the length of a vector that is an encoded output matrix encoded from the vector of medical data by an encoding module. Paragraph 4 reveals the data can include imaging, laboratory test measurements, and clinical and family history data. Family history can be a supplementary dataset. Paragraph 207 reveals the encoded data set has a higher degree of privacy than the raw dataset).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends, in view of Kwatra’s teachings of correlating a scored value, Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, and Aptekar’s teachings of storing the synthetic dataset, with Pauly’s teachings of the encoded data set having less PII, such that the data can be sent to remote computing environments that have a lower privacy standard without compromising the data (paragraph 207 of Pauly).
As to claim 12, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the transformation of the first sequence of embedding vectors comprises one or more of reducing a dimensionality of the first sequence of embedding vectors (Chakraborty: paragraph 56 discloses the transformation of the first embedded vector/sequence by adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h1, h2, or h3), i.e., h = f(d). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300-dimensional vector represented as a table of token records (d1-d4) can be mapped to a 100-dimensional vector (h1-h3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm), a lossy compression of the first sequence of embedding vectors (Chakraborty: paragraph 56 and paragraphs 45-50, as set forth above with respect to the dimensionality reduction; the projection to a lower dimension removes static or non-important features of the data and is therefore lossy), and adding noise to the first sequence of embedding vectors, wherein the noise comprises one or more sources of noise (Chakraborty: paragraph 56 discloses noise is added by the noise propagation module (source of noise) to the tensors, which are the first sequence of embedding vectors).
As to claim 13, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the generation of the first sequence of embedding vectors for each segregated data environment comprises a principal component analysis of the corresponding dataset (Chakraborty: paragraph 27 discloses analysis of the dataset via the embedding vector module to generate the tensor vectors).
As to claim 14, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the operations further comprise generating a token from the PII of the healthcare dataset and a token from the PII of the supplementary dataset, wherein each token is operative to link the corresponding dataset to data stored outside of the federated data cleanroom (Chakraborty: paragraphs 30, 33, and 35 disclose the generated data tensors are representations of data in a different form, such as a token representing the dataset, and are linked/correspond to the real data values in the data records (health and supplementary, as mapped for claim 11 above)).
As to claim 15, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches the operations further comprising:
determining a utility of the synthetic trends dataset, wherein the utility is indicative of a quality of the synthetic trends dataset with respect to a particular task (Kwatra: paragraphs 57 and 63 disclose determining a correlated score, wherein the correlated score is linked to a set of recommended actions based on the analyzed dataset. The correlated score and recommended action are the utility, which provides an indication of a quality of the dataset with respect to a particular task because the recommended action causes the one or more health parameters to fall within the threshold range. Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. See also paragraphs 94-95 regarding determining first and second gradients to generate synthetic data records);
determining that the utility of the synthetic trends dataset is below a utility threshold that represents a minimum required quality of insights generated based on analytics of the synthetic trends dataset (Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved);
modifying, based on the utility of the synthetic trends dataset being below the utility threshold, one or more of the determined data features to increase the utility of the synthetic trends dataset (Lesh: paragraphs 80-83 disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. Paragraphs 94-95 disclose updating first and second gradients to meet a threshold loss distribution that contributes to the generation of the synthetic data records. Paragraphs 5, 46-47, and 51 disclose adding noise to modify and control the generation of the synthetic dataset); and
after the modifying, outputting insights generated based on analytics of the synthetic trends dataset (Kwatra: paragraphs 21 and 52 disclose altering/modifying the recommended frequency of the action to result in improvement of the health parameter falling within the threshold range. Paragraphs 63-64 disclose the personalized actions are sent to the user based on the correlation score and the health risk score for each known activity and health parameter). Motivation is similar to the motivation presented in claim 11.
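For orientation, the distance-metric validation Lesh describes in paragraphs 80-83 (flagging synthetic records that fall too close to real records) can be sketched as below; the record matrices and the threshold value are assumptions of this sketch, not values taken from the reference.

```python
import numpy as np

# Minimal sketch of the distance check Lesh describes: flag any
# synthetic record whose nearest real record lies within a threshold
# distance, as a candidate for manual review (possible data leakage).
rng = np.random.default_rng(2)
real = rng.normal(size=(200, 50))       # authentic records (assumed)
synthetic = rng.normal(size=(200, 50))  # synthetic records (assumed)
threshold = 1.0                         # assumed minimum allowed distance

# pairwise Euclidean distances between each synthetic and real record
dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
nearest = dists.min(axis=1)             # distance to the closest real record

flagged = np.flatnonzero(nearest < threshold)  # pairs highlighted for review
print(f"{flagged.size} synthetic records flagged for manual review")
```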
As to claim 17, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein generating the first sequence of embedding vectors comprises capturing a variance of the anonymized data in fewer dimensions than the dimensionality of the anonymized data (Chakraborty: paragraph 33 discloses that, in generating the tensors, the numerical data is normalized. Normalization captures the variance/difference of an anonymized data value from the mean, the mean having fewer dimensions than the anonymized dataset. For word data, paragraphs 39-40 reveal an equation that yields the normalized probabilistic model for language modeling).
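For reference, the normalization referred to above is conventionally the z-score; the cited paragraph does not give the formula, so the following is an assumed standard form, with μ and σ the per-feature mean and standard deviation:

$$z_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$$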
As to claim 18, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the healthcare dataset comprises a plurality of alphanumeric codes, wherein each alphanumeric code is mapped to an embedding vector (Chakraborty: paragraphs 32-34 disclose that the data records include various fields or attributes with both real numbers and strings. These fields are linked to tensors/embedded vectors. Paragraph 48 discloses the output of the Mondrian algorithm is a set of equivalence classes. The equivalence class members are codes (derived from the data vectors using the encoder)).
As to claim 19, Chakraborty teaches a non-transitory computer-readable medium (Figure 13, reference number 28 “Memory”) storing one or more instructions executable by a computer system to perform operations (paragraph 112 reveals the memory may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention) for generating a synthetic trends dataset (abstract discloses generating anonymized synthetic data by adding noise to a first set of data from one or more data stores. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 77-78, 32, 52, and 68 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data) for training a machine learning model (paragraphs 31, 60-62 and steps 608 and 612 reveal the embedding vector model is retrained using the output candidate for anonymization. The data record strings are sent through the word embedding vector model to obtain a candidate output for anonymization/synthetic data), the operations comprising:
receiving a healthcare dataset from a database in a first segregated data environment of a federated data cleanroom, the healthcare dataset comprising personally identifiable information (PII) pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 23 and 68 disclose the datastore record can include information such as disease. Paragraphs 77-78 reveal the data can come from private databases such as health data from a hospital; therefore, the databases are federated. See also Figure 9, reference number 910, or Figure 10, reference number 1008);
receiving a supplementary dataset from a database in a second segregated data environment of the federated data cleanroom, the supplementary data comprising PII pertaining to the individual (paragraphs 4 and 52 reveal receiving raw data/first data associated with data stores. The raw data comes from database records, unstructured data, and/or any other set of data. Paragraphs 32, 52, and 68 disclose the datastore record can include information such as education level, zip code, and/or salary. The data can also include the name of a person or SSN. Paragraphs 77-78 reveal the data can come from private databases; therefore, the databases are federated. See also Figure 9, reference number 912, or Figure 10, reference number 1010);
anonymizing the data stored in each database, wherein the anonymized data from each database is stored in the corresponding segregated database (paragraph 52 discloses anonymizing received data values by deleting, masking, changing, and/or encrypting the dataset such that the data are not discernable to a human reader. Paragraph 79 further discloses each of the private databases is anonymized into a corresponding anonymized database. See also Figure 9, reference numbers 904, 906, and 908);
generating, for each segregated data environment, a numerical representation of each data feature of a plurality of data features of the anonymized data stored in the corresponding segregated data environment, wherein the numerical representation comprises a first sequence of embedding vectors (paragraphs 30 and 53-54 disclose, for the plurality of real strings and real number values in the datastores of the anonymized datastore, identifying quasi-identifiers/data features and sensitive identifier values for both string data and numeral data; normalizing the data (if the data is a numerical dataset); and generating one or more tensor values for the string data and the normalized numerical data. Tensors are numerical representations of the data record tokens that are fed through an embedding vector model (for word data) or a normalized vector. Vector embedding is converting words/numbers into a number sequence), wherein each first sequence of embedding vectors is a compressed representation of the corresponding data feature (paragraphs 30-31 disclose the data tensors are representations of data in a different form than the data itself, such as a token representing the data. For text strings that are fed through the word embedding vector model, the output is the data tensors, which may be represented as integers or other real numbers. The data tensors are mapped from a high-dimensionality space to a lower-dimensionality space; thus, the data is compressed);
determining, for each segregated data environment, a second sequence of embedding vectors, wherein the second sequence of embedding vectors is a transformation of the corresponding first sequence of embedding vectors, the transformation comprising a reduction of information from the first sequence of embedding vectors (paragraph 56 discloses adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h.sub.1, h.sub.2, or h.sub.3) (h=f(d)). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300 dimensional vector represented as a table of token records (d.sub.1, −d.sub.4) can be mapped to a 100 dimensional vector (h.sub.1, −h.sub.3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm);
modifying, upon determining a risk of disclosure is above a disclosure threshold, the plurality of data features for a corresponding segregated data environment, wherein the risk of disclosure is indicative of a likelihood that the PII of the data stored in a database of the corresponding segregated data environment is obtainable from the corresponding second sequence of embedding vectors (paragraph 47 further discloses utilizing the Mondrian algorithm to satisfy k-anonymity. Generalization replaces or changes quasi-identifier values with values that are less-specific but semantically consistent (e.g., values that belong to the same ontological class). As a result, more records will have the same set of quasi-identifier values. The Mondrian algorithm recursively chooses the split attribute with the largest normalized range of values and (for continuous or ordinal attributes) partitions/modifies the data around the median value of the split attribute, which is repeated until there is no allowable split. A record in a k-anonymized dataset utilizes a maximum probability 1/k of being re-identified/disclosure risk1);
generating the synthetic trends dataset (paragraph 48 discloses the output of the Mondrian algorithm is a set of equivalence classes, where each equivalence class is of size at least k to meet k-anonymity. The equivalence class members at this point can be codes (derived from the data vectors using the encoder). Accordingly, at the output of the anonymizer layer (e.g., the anonymizer of FIG. 1), the c1, c2, c3, and c4 are indistinguishable. They are all represented by noisy code C. In some embodiments, different equivalence classes have different centroids and thus different outputs. The output layer is the synthetic trends dataset that comprises the second sequence of embedding vectors/noise associated with the anonymized databases);
training, by a processor configured to access training data comprising the stored synthetic trends dataset, a machine learning model using the training data (Figure 1 and paragraphs 31 and 62 disclose the word embedding vector model(s) are retrained using the anonymized data. After the decoder turns the data into the anonymized data, the word embedding vector model(s) is retrained, which is described in more detail below. This is indicated by the arrow from the decoder to the word embedding vector model(s)); and
outputting the trained machine learning model configured to process input data indicative of healthcare information and consumer information related to the individual (Figures 1 and 6 and paragraphs 31, 58, and 62 reveal the retrained embedding vector model processes data record strings/input data. Paragraph 78 reveals the data can come from private databases. Claim 19 discloses aggregating/combining the plurality of databases (see Figure 10, reference number 1006). Paragraphs 77-78 reveal the data records are health records from a hospital private database. Paragraphs 77-78, 32, 52, and 68 disclose additional data records from private databases that include secondary information such as education level, zip code, and/or salary data).
Chakraborty does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset; and outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action.
Kwatra teaches outputting the trained machine learning model configured to process input data and to output an output value that is indicative of a probability that the individual takes a particular action (paragraph 24 discloses collected data is anonymized. Paragraphs 35-40 reveal collecting the anonymized data, which is input to a machine learning module to determine activities or behavior that the user performs. The system/bot correlates a score value or recommendation that was identified by the trained machine learning component using the anonymized collected data and outliers of the collected data. The correlation/probability score is used to determine which recommended action best matches the user’s known activities/behavior, and a health score is used to determine which activities maximize improvement in health and/or minimize health risks. Paragraphs 32, 38, 60, and 67 disclose the collected data includes supplementary data such as ordered health and/or fast food and electronic health data such as medical records and health reports).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends to include the steps of Kwatra’s teachings of correlating a scored value to recommended actions that best match user’s activity/behavior to provide personalized actionable alerts for a user to improve the user’s health (paragraph 16 of Kwatra).
The combination of Chakraborty in view of Kwatra does not teach generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment; storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Lesh teaches generating, in a shared data environment, the synthetic trends dataset by combining the second sequence of embedding vectors associated with each segregated data environment (paragraphs 64-65 and Figure 5B disclose the medical records (from databases, see paragraph 48) for a patient are transformed and, for each record, a numerical vector/embedded vector is generated. The medical records are transformed into a concatenated vector using the embedding vectors and are inputted into a generative model (shared data environment; per paragraph 48, the synthetic data is generated from a single or multiple source databases) to generate the synthetic dataset), wherein the synthetic trends dataset has [threshold metric difference] with the healthcare dataset and the supplementary dataset (paragraph 66 further discloses a discriminator can determine a distribution over the state of the sample based on the received input of the synthetic dataset and the authentic medical datasets (which can be the combined healthcare dataset and the dosage dataset), detecting whether the sample is synthetic. Paragraphs 80-83 also disclose computing distance metrics that compare the real records to the synthetic records. After comparing, if any of the pairs have a distance value that is too low or within a threshold distance, the system may highlight that pair for manual review by a human to determine if any authentic patient medical data has leaked into the generated synthetic dataset. The synthetic dataset may be assumed validated if a certain threshold of confidence between a relevant metric of predictive performance is achieved. For statistical validation, the generative system may utilize statistical measures over a population to demonstrate that the generated data has similar characteristics to the original/authentic data. For example, the system may utilize a distribution visualization called a violin plot to determine whether the synthetic data is comparable to the authentic data within an acceptable threshold. The system may also utilize other statistical validation such as a plot of a univariate correlation between each pair of variables in the original authentic electronic health record dataset and compare this against the same correlation plot in the synthetic electronic health record dataset).
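As a minimal sketch of the concatenation step Lesh describes (per-record embedding vectors combined into a single input for a generative model), assuming illustrative embedding sizes and omitting the generative model itself:

```python
import numpy as np

# Minimal sketch, assuming 64-dimensional embeddings per environment:
# the second sequences of embedding vectors from each segregated data
# environment are concatenated record-wise into one combined vector
# that a generative model would consume in the shared environment.
rng = np.random.default_rng(3)
healthcare_emb = rng.normal(size=(10, 64))     # first segregated environment (assumed)
supplementary_emb = rng.normal(size=(10, 64))  # second segregated environment (assumed)

combined = np.concatenate([healthcare_emb, supplementary_emb], axis=1)
assert combined.shape == (10, 128)             # generator input per record
```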
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends in view of the steps of Kwatra’s teachings of correlating a scored value with Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, such that the system can provide quality synthetic datasets that maintain a threshold distance from the authentic datasets and can quickly update the synthetic dataset using the machine learning model, such that the gradient/distance in distribution between the synthetic dataset and the authentic dataset meets a threshold value (paragraph 95 of Lesh).
The combination of Chakraborty in view of Kwatra and Lesh does not teach storing the synthetic trends dataset in a storage device, wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage and has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset.
Aptekar teaches storing the synthetic trends dataset in a storage device (column 3, lines 62-65 disclose outputting the synthetic dataset to a storage system), wherein the stored synthetic trends dataset comprising the combined sequences of embedding vectors requires less storage (column 11, lines 11-40 disclose the synthetic dataset is generated based on combined sequences/subsequence patterns of feature vectors. The feature vectors are embedded in a lower-dimension space).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends in view of the steps of Kwatra’s teachings of correlating a scored value and Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models with Aptekar’s teachings of storing the synthetic dataset to improve data fidelity and data privacy by obfuscating the data in a way that preserves its statistical properties while rendering it unidentifiable to unauthorized users. This may allow researchers to more widely access and analyze such data at an earlier stage of clinical trials, which can accelerate the development of new drugs and medical treatments (column 2, lines 2-9 of Aptekar).
The combination of Chakraborty in view of Kwatra, Lesh, and Aptekar does not teach, but Pauly teaches, wherein the synthetic trends dataset has less PII pertaining to the individual than a combination of the received healthcare dataset and the received supplementary dataset (paragraphs 102 and 147 disclose the encoded output data/synthetic data has a dimension that is lower than the medical data. The length of a vector representing medical data of the patient may be longer than the length of a vector that is an encoded output matrix encoded from the vector of medical data by an encoding module. Paragraph 4 reveals the data can include imaging, laboratory test measurements, and clinical and family history data. Family history can be a supplementary dataset. Paragraph 207 reveals the encoded data set has a higher degree of privacy than the raw dataset).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify Chakraborty’s method of generating synthetic trends in view of the steps of Kwatra’s teachings of correlating a scored value, Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, and Aptekar’s teachings of storing the synthetic dataset with Pauly’s teachings of the encoded data set having less PII, such that the data can be sent to remote computing environments that have a lower privacy standard without compromising the data (paragraph 207 of Pauly).
As to claim 20, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches wherein the transformation of the first sequence of embedding vectors comprises one or more of reducing a dimensionality of the first sequence of embedding vectors (Chakraborty: paragraph 56 discloses the transformation of the first embedded vector/sequence by adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h.sub.1, h.sub.2, or h.sub.3) (h=f(d)). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). For example, a 300 dimensional vector represented as a table of token records (d.sub.1, −d.sub.4) can be mapped to a 100 dimensional vector (h.sub.1, −h.sub.3). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm), a lossy compression of the first sequence of embedding vectors (Chakraborty: paragraph 56 discloses the transformation of the first embedded vector/sequence by adding noise to each tensor vector of the anonymized databases. Figure 4 and paragraphs 45-50 illustrate the noise propagation module, which includes an input layer, a hidden layer, an anonymizer layer, and an output layer. The tensors are represented as a dimensional vector. An encoder maps each data point to a representation point h (e.g., h.sub.1, h.sub.2, or h.sub.3) (h=f(d)). The mapping of the entire data set includes extracting relevant features or weights associated with the data points and projecting the features from a first dimension (e.g., a higher dimension) to another dimension (e.g., a lower dimension). This can be useful for principles such as data compression, which reduces dimensionality by removing static or non-important features of data. Each data point within the hidden layer is then mapped to another representation of the same or analogous dimension within the anonymizer layer. The processing for this mapping includes first using a Mondrian algorithm), and adding noise to the first sequence of embedding vectors, wherein the noise comprises one or more sources of noise (Chakraborty: paragraph 56 discloses noise is added by the noise propagation module (source of noise) to the tensors, which are the first sequence of embedding vectors).
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty et al US 20190156061 (hereinafter Chakraborty), in further view of Kwatra et al US 20230154609 (hereinafter Kwatra), in further view of Lesh et al US 20230010686 (hereinafter Lesh), in further view of Aptekar et al US 11977550 (hereinafter Aptekar), in further view of Pauly US 20200035365 (hereinafter Pauly), and in further view of Mivule, Kato (2012), Utilizing Noise Addition for Data Privacy, an Overview, 10.13140/2.1.4629.2482 (hereinafter Mivule).
As to claim 8, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar, and Pauly teaches all the limitations recited in claim 1 above and further teaches wherein determining the risk of disclosure comprises determining a k-anonymity metric, wherein the k-anonymity metric depends on a similarity probability of each data point of the second sequence of embedding vectors (Chakraborty: paragraph 24 discloses a table satisfies k-anonymity if every record in the table is indistinguishable from at least k−1 (similarity) other records with respect to every set of quasi-identifier attributes. Paragraph 47 further discloses utilizing the Mondrian algorithm to satisfy k-anonymity. Generalization replaces or changes quasi-identifier values with values that are less-specific but semantically consistent (e.g., values that belong to the same ontological class). As a result, more records will have the same set of quasi-identifier values. The Mondrian algorithm recursively chooses the split attribute with the largest normalized range of values and (for continuous or ordinal attributes) partitions/modifies the data around the median value of the split attribute, which is repeated until there is no allowable split. A record in a k-anonymized dataset utilizes a maximum probability 1/k (metric) of being re-identified/disclosure risk1).
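For reference, the 1/k bound cited above follows directly from the definition of k-anonymity: each record is indistinguishable from at least k−1 others in its equivalence class, so

$$\Pr[\text{re-identification of a record}] \le \frac{1}{k}.$$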
While the combination of Chakraborty in view of Kwatra, Lesh, Aptekar, and Pauly teaches adding noise to the tensor using k-anonymity, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar, and Pauly does not teach wherein a metric depends on a signal-to-noise ratio (SNR).
Mivule teaches wherein a privacy metric depends on a signal-to-noise ratio (SNR) (page 4, section 4.7 discloses that, in relation to using noise addition, one can employ SNR to measure how much noise is needed to optimally obfuscate the data).
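As a point of reference for the SNR measure Mivule discusses, a conventional variance formulation (given here as an assumed standard form rather than quoted from Mivule) is

$$\mathrm{SNR} = \frac{P_{\text{signal}}}{P_{\text{noise}}} = \frac{\sigma^2_{\text{data}}}{\sigma^2_{\text{noise}}},$$

so that a lower SNR corresponds to heavier obfuscation of the data.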
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the k-anonymity metric in Chakraborty’s noise propagation module in view of the steps of Kwatra’s teachings of correlating a scored value, Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, Aptekar’s teachings of storing the synthetic dataset, and Pauly’s teachings of the encoded data set having less PII to further include SNR as taught by Mivule to achieve optimal data utility while preserving privacy (page 4, section 4.7 of Mivule), because noise in the context of privacy can be added and quantified by SNR.
As to claim 16, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar and Pauly teaches all the limitations recited in claim 11 above and further teaches wherein determining the risk of disclosure comprises determining a k-anonymity metric, wherein the k-anonymity metric depends on a similarity probability of each data point of the second sequence of embedding vectors (Chakraborty: paragraph 24 discloses a table satisfies k-anonymity if every record in the table is indistinguishable from at least k−1 (similarity) other records with respect to every set of quasi-identifier attributes. Paragraph 47 further discloses utilizing the Mondrian algorithm to satisfy k-anonymity. Generalization replaces or changes quasi-identifier values with values that are less-specific but semantically consistent (e.g., values that belong to the same ontological class). As a result, more records will have the same set of quasi-identifier values. The Mondrian algorithm recursively chooses the split attribute with the largest normalized range of values and (for continuous or ordinal attributes) partitions/modifies the data around the median value of the split attribute, which is repeated until there is no allowable split. A record in a k-anonymized dataset utilizes a maximum probability 1/k (metric) of being re-identified/disclosure risk1).
While the combination of Chakraborty in view of Kwatra, Lesh, Aptekar, and Pauly teaches adding noise to the tensor using k-anonymity, the combination of Chakraborty in view of Kwatra, Lesh, Aptekar, and Pauly does not teach wherein a metric depends on a signal-to-noise ratio (SNR).
Mivule teaches wherein a privacy metric depends on a signal-to-noise ratio (SNR) (page 4, section 4.7 discloses that, in relation to using noise addition, one can employ SNR to measure how much noise is needed to optimally obfuscate the data).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the k-anonymity metric in Chakraborty’s noise propagation module in view of the steps of Kwatra’s teachings of correlating a scored value, Lesh’s teachings of combining the embedded vectors to generate the synthetic dataset using machine learning models, Aptekar’s teachings of storing the synthetic dataset, and Pauly’s teachings of the encoded data set having less PII to further include SNR as taught by Mivule to achieve optimal data utility while preserving privacy (page 4, section 4.7 of Mivule), because noise in the context of privacy can be added and quantified by SNR.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FELICIA FARROW whose telephone number is (571)272-1856. The examiner can normally be reached M - F 7:30am-4:00pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexander Lagor can be reached at (571)270-5143. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/F.F/Examiner, Art Unit 2437
/ALI S ABYANEH/Primary Examiner, Art Unit 2437
1 El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc. 2008 Sep-Oct;15(5):627-37. doi: 10.1197/jamia.M2716. Epub 2008 Jun 25. PMID: 18579830; PMCID: PMC2528029.