Prosecution Insights
Last updated: April 19, 2026
Application No. 17/716,399

SYSTEM AND METHOD FOR PRIVACY-PRESERVING ANALYTICS ON DISPARATE DATA SETS

Final Rejection: §101, §102
Filed
Apr 08, 2022
Examiner
LEE, PO HAN
Art Unit
3623
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
Truata Limited
OA Round
4 (Final)
Grant Probability: 32% (At Risk)
Expected OA Rounds: 5-6
Estimated Time to Grant: 3y 6m
Grant Probability With Interview: 74%

Examiner Intelligence

Career Allow Rate: 32% (grants only 32% of cases; 51 granted / 158 resolved; -19.7% vs TC avg)
Interview Lift: +41.2% allowance lift on resolved cases with interview
Typical Timeline: 3y 6m avg prosecution; 50 applications currently pending
Career History: 208 total applications across all art units

Statute-Specific Performance

§101: 40.9% (+0.9% vs TC avg)
§103: 31.3% (-8.7% vs TC avg)
§102: 11.4% (-28.6% vs TC avg)
§112: 14.8% (-25.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 158 resolved cases.

Office Action

Rejections under 35 U.S.C. §101 and §102
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Status of the Application

The following is a Final Office Action. In response to the Examiner's communication of 7/16/2025, Applicant responded on 11/12/2025, amending claims 1 and 20. Claims 1-21 are pending in this application and have been examined.

Response to Amendment

Applicant's amendments to claims 1 and 20 are not sufficient to overcome the 35 USC 101 rejections set forth in the previous action. Applicant's amendments to claims 1 and 20 are not sufficient to overcome the prior art rejections set forth in the previous action.

Response to Arguments – 35 USC § 101

Applicant’s arguments with respect to the rejections have been fully considered, but they are not persuasive. Applicant submits, “…in Example 21 of the July 2015 Update Appendix 1 to the 2014 Interim Guidance on Subject Matter Eligibility, the USPTO explains how claim 2 provides significantly more than an abstract idea, even though claim 2 is dealing with the business function of distributing stock quotes...The claimed invention addresses the Internet-centric challenge of alerting a subscriber with time sensitive information when the subscriber's computer is offline. This is addressed by transmitting the alert over a wireless communication channel to activate the stock viewer application, which causes the alert to display and enables the connection of the remote subscriber computer to the data source over the Internet when the remote subscriber computer comes online.
These are meaningful limitations that add more than generally linking the use of the abstract idea (the general concept of organizing and comparing data) to the Internet, because they solve an Internet-centric problem with a claimed solution that is necessarily rooted in computer technology, similar to the additional elements in DDR Holdings."…Similarly, here, even though the present claims can be used in business or legal fields, the present claims solve a problem centered in machine learning technology, because they provide a solution for how to train a model to analyze two or more disparate data sets that cannot be combined for analysis due to a restriction…when viewing the claim as a whole that includes the fact that the claims explicitly are limited to performing analytics using a trained model, the present claims solve a problem in the situation where the trained model is unable to access both data sets, yet needs to still analyze both data sets. This is a specific problem in machine learning modeling that nobody has expressly solved yet….Because the claims specifically are limited to performing analytics over the two disparate data sets using a trained model, the claims are also necessarily rooted in the computer technology of machine learning technology, similar to the additional elements in DDR Holdings...what example 21 further shows is that the USPTO could explain that claim 2 of Example 21 solved a business problem - because it is about distributing stock quotes - but it was still patent eligible because it was still rooted in Internet technology. Similarly here, even if the present claims solve a business or a legal problem, it also solves a problem rooted in machine learning technology, which is inherently a computer technology, by providing a way to make a trained model analyze two disparate data sets while not being able to access them together…In Classen Immunotherapies v. Biogen IDEC (659 F.3d 1057, 100 USPQ2d 1492 (Fed. Cir. 2011)), the court found meaningful limitations beyond generally linking the use of the judicial exception to a particular technological environment by introducing an immunization step that integrates an abstract idea of data comparison into a specific process of immunizing that lowers the risk that immunized patients will later develop chronic immune-mediated diseases….the claims in Classen are even more administrative than anything the examiner claims the present claims are doing, because the present claims deal with the problem of using a trained model to analyze disparate data sets that cannot be combined. This is an analytic problem for a trained model, which is more technical than merely an administrative problem, and yet even solving the administrative problem in Classen was found to provide a practical application. If solving the administrative problem in Classen was enough to provide a practical application, then here too, solving the analytical problem provided by the present claims is enough to provide a practical application that is significantly more than an abstract idea…Applicant maintains that a sufficiently practical application has been demonstrated by the claims…”

The Examiner respectfully disagrees. While Applicant’s amendments further prosecution, unlike Example 21, DDR Holdings, and Classen, the claims are directed to, …analyzing commercial data and human recognizing data commonality, anonymizing data to comply “with strict data protection regulations” (i.e. HIPAA governing human healthcare professionals)… to analyze two or more disparate data sets that cannot be combined for analysis due to a restriction (i.e. HIPAA governing human healthcare professionals)…to performing analytics over the two disparate data sets…analyze two disparate data sets while not being able to access them together…, which is a business and legal problem directed to, organizing human activity (i.e.
human recognizing, analyzing, querying, separating and combining human customer behavior and commercial product behavior data based on commonality to preserve human customer privacy as restricted by laws such as HIPAA, which is also human legal interaction and relationship in compliance with the laws; the similarity and commonality is further based on similarity and commonality of human behaviors), a mental process (i.e. human recognizing, analyzing, querying, separating and combining human customer behavior and commercial product behavior data based on commonality to preserve human customer privacy), and mathematical concepts (human using mathematical modeling to determine data similarity and commonality to combine similar data), as established in Step 2A Prong 1.

This problem does not specifically arise in the realm of computer technology; rather, this problem existed and was addressed long before the advent of computers, and does not require the Internet. Thus, the claims do not recite a technical improvement to a technical problem, nor are they necessarily rooted in computing technologies. Additionally, pursuant to the broadest reasonable interpretation, as an ordered combination, each of the additional elements is a computing element recited at a high level of generality implementing the abstract idea, and thus they are no more than applying the abstract idea with generic computer components. Further, these additional elements generally link the abstract idea to a technical environment, namely the environment of a computer and machine learning, performing extra-solution activities. Therefore, as a whole, the additional elements do not integrate the abstract ideas into a practical application in Step 2A Prong 2. Even novel and newly discovered judicial exceptions are still exceptions, despite their novelty. July 2015 Update, p. 3; see SAP America Inc. v. Investpic, LLC, No. 2017-2081, slip op. at 2 (Fed. Cir. May 15, 2018).
Simply reciting specific limitations that narrow the abstract idea does not make an abstract idea non-abstract. 79 Fed. Reg. 74631; buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355 (Fed. Cir. 2014); see SAP America at p. 12. As discussed in SAP America, no matter how much of an advance the claims recite, when “the advance lies entirely in the realm of abstract ideas, with no plausibly alleged innovation in the non-abstract application realm,” “[a]n advance of that nature is ineligible for patenting.” Id. at p. 3. Use of a computer or other machinery in its ordinary capacity for economic or other tasks (e.g., to receive, store, or transmit data) or simply adding a general purpose computer or computer components after the fact to an abstract idea (e.g., a fundamental economic practice or mathematical equation) does not integrate a judicial exception into a practical application or provide significantly more. See Affinity Labs v. DirecTV, 838 F.3d 1253, 1262, 120 USPQ2d 1201, 1207 (Fed. Cir. 2016) (cellular telephone); TLI Communications LLC v. AV Automotive, LLC, 823 F.3d 607, 613, 118 USPQ2d 1744, 1748 (Fed. Cir. 2016) (computer server and telephone unit). Similarly, “claiming the improved speed or efficiency inherent with applying the abstract idea on a computer” does not integrate a judicial exception into a practical application or provide an inventive concept. Intellectual Ventures I LLC v. Capital One Bank (USA), 792 F.3d 1363, 1367, 115 USPQ2d 1636, 1639 (Fed. Cir. 2015).

Response to Arguments – Prior Art

Applicant’s arguments with respect to the rejections have been fully considered, but they are not persuasive. Applicant submits, “…Applicant respectfully asserts that McFall fails to teach or suggest "wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features," as recited in claim 1.
McFall appears to not teach any finding of a common representation according to this level of detail, and as such, fails to teach each and every one of these elements.…Applicant respectfully asserts that McFall fails to teach or suggest each and every element of "training, on a first disparate data set among the two or more disparate data sets, one or more models by performing machine learning classification, conducting a deterministic algorithm, generating a neural network, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set,"…how "sensitive" a dataset may be deemed is not a "behavior" in the likely candidates, as that characteristic is already assumed by virtue of the datasets being disparate from one another. In other words, a "behavior" in the likely candidates in the second disparate data set is not the fact that it is already disparate from the first data set. Therefore, the Examiner has failed to point to what "behavior" McFall actually is trying to recognize in one or more specified subjects within the disparate data sets…”

The Examiner respectfully disagrees. Under the broadest reasonable interpretation, McFall teaches:

wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; (in at least [0565] Publisher's features for automatically detecting sensitive, quasi-identifying, or identifying columns. These features allow the program to assist the user in properly configuring the anonymisation of input datasets and, additionally, in identifying new datasets to anonymise. Publisher takes several approaches to detecting sensitive, quasi-identifying, or identifying columns including using metadata, measuring correlation with known columns, and using machine learning.
[0599] Publisher compares column names with a list of HIPAA constants to detect columns that contain typical personal identifiers such as:

[0617] To match column names with the provided list of template identifier attribute names, the Levenshtein distance between two strings is calculated. Publisher also considers substrings, so that, for example, “account number” is found to be similar to “current account number”.

[0618] Publisher takes values from previously known sources of identifiers and finds similarity between those sources and the new data in question. A key source is the content of Publisher Token Vaults, which are known to contain identifiers. A second source is other columns in the dataset that have been assigned a tokenisation rule. If new data contains a significant overlap with a known list of identifiers, it is more likely to be an identifier itself.

[0619] Publisher calculates the overlap between one column and another either: Using the Jaccard index, which is the cardinality of the intersection of the columns divided by the cardinality of the union of the columns (where the columns are taken as sets). This index is straightforward but inefficient to calculate. For performance, Publisher may approximate the Jaccard index using the “hashing trick”, which hashes each value into a range of values (e.g. 0 to 2^24 − 1), maintains a bitstring of the same length, and flips the bit from 0 to 1 only if one of the values is hashed to that index. Publisher can then efficiently approximate the Jaccard distance using the popcount of the AND of the two bitstrings over the popcount of the OR of the two bitstrings. Or by calculating the cardinality of the intersection of columns divided by the cardinality of the smaller of the two columns (again where the columns are taken as sets).
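The column-similarity measures quoted above from McFall ([0617]-[0619]) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not code from the reference; all function names and the hash-range choice are assumptions.

```python
# Illustrative sketch of McFall's column-similarity measures ([0617]-[0619]):
# Levenshtein matching of column names (with substring handling) and a
# Jaccard overlap approximated via the "hashing trick" and popcounts.
# All names and parameter choices here are assumptions, not from the reference.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def name_distance(name: str, template: str) -> int:
    """Per [0617], substrings are also considered, so "account number"
    is found to be similar to "current account number"."""
    if name in template or template in name:
        return 0  # containment treated as a match
    return levenshtein(name, template)

def jaccard(col_a: set, col_b: set) -> float:
    """Exact Jaccard index: |A ∩ B| / |A ∪ B| (simple but costly)."""
    return len(col_a & col_b) / len(col_a | col_b)

BITS = 2 ** 20  # hash range; [0619] suggests e.g. 0 to 2^24 - 1

def bit_sketch(col: set) -> int:
    """Hashing trick: flip bit h(v) for each value; the bitstring is
    stored as a single Python int."""
    bs = 0
    for v in col:
        bs |= 1 << (hash(v) % BITS)
    return bs

def jaccard_approx(bs_a: int, bs_b: int) -> float:
    """popcount(AND) / popcount(OR) of the two bitstrings."""
    popcount = lambda x: bin(x).count("1")
    return popcount(bs_a & bs_b) / popcount(bs_a | bs_b)
```

With few distinct values relative to the 2^20-bit range, hash collisions are rare, so jaccard_approx(bit_sketch(A), bit_sketch(B)) tracks the exact jaccard(A, B) closely.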
Similarly, Publisher may approximate this using the hashing trick: to approximate this metric, it takes the popcount of the AND of the two bitstrings over the greater of the two bitstrings' popcounts.)

training, on a first disparate data set among the two or more disparate data sets, one or more models by performing machine learning classification, conducting a deterministic algorithm, generating a neural network, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set; (in at least [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data.

[0180] FIG. 10 illustrates a simple diagram where two contributors (Bank1 and Bank2) share data to a recipient. A simple example of how this system might be used would be to help calculate the distribution of net worth of individuals. In this example, the banks are Contributors and the recipient calculates the sum of all credits and debits for an individual across the whole financial system for further analysis.

[0215] SecureLink can be used if a central research organisation wants to collect statistics from individuals who have data at many different service-providing organisations. For example, consider a national health research organisation that is conducting an assessment of nation-wide hospital costs. A single person may visit many hospitals, incurring costs at each hospital. The healthcare research organisation may wish to link the costs of each individual across hospitals in order to have more complete data. A convenient way to link these costs is by social security number, which one assumes is recorded consistently across hospital visits. However, the health organisation wishes to maintain privacy in the data that they collect and thus wishes the identifier to be tokenised.

[0281] A bank might tag both credit card numbers and bank account numbers as Tier 1 sensitive data. Publisher would in this example associate an encryption rule with Tier 1 sensitive data, and so any data recorded as being Tier 1 sensitive in the bank's metadata store will be processed in a standard way.

[0480] Infogain is a function of a parent node and a set of potential child nodes. It examines the class counts in the column marked as “interesting” (see more on this below). For instance, if the interesting column is whether or not the debtor defaulted on their loan, there will be two classes, “Yes” and “No”, and the counts of these can be gathered for any set of records. Infogain is defined as follows:
Let S be the class counts of the parent.
Let T be the set of children; let each child have a proportion of records and class counts.
Let H be the entropy function.
Infogain(S, T) = H(S) − sum_{t in T} proportion(t) · H(t)
And the entropy function is defined as:
H(S) = sum_{x in S} proportion(x) · log_2(1/proportion(x))

[0539] Publisher supports automatically generalising locations to ensure that they will not disclose any sensitive attributes. To make this assurance, Publisher takes a user-provided list of locations of interest, also known as points of interest (POIs). POIs may be hospitals, stores, office buildings, restaurants, bars, cafes, schools, museums, sports facilities, and so on. Publisher can ensure that every generalised location area contains a minimum number of POIs. Because this guarantee is similar to l-diversity, we call this minimum number “l”. For instance, if “l=4”, then an adversary could not tell from the published dataset which location out of at least 4 any target went to.
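The Infogain and entropy definitions quoted above from McFall [0480] can be sketched directly. Class counts are represented here as dicts, and the function names are illustrative assumptions, not from the reference.

```python
# Illustrative sketch of the Infogain/entropy definitions in McFall [0480].
# Class counts (e.g. {"Yes": 5, "No": 5} for loan defaults) are dicts
# mapping class label -> count; names are assumptions, not from the reference.
from math import log2

def entropy(counts: dict) -> float:
    """H(S) = sum_{x in S} proportion(x) * log2(1 / proportion(x))."""
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values() if c)

def infogain(parent: dict, children: list) -> float:
    """Infogain(S, T) = H(S) - sum_{t in T} proportion(t) * H(t)."""
    total = sum(parent.values())
    return entropy(parent) - sum(
        (sum(t.values()) / total) * entropy(t) for t in children)
```

For the loan-default example in [0480], a split that separates the “Yes” and “No” records perfectly yields the maximum gain of one bit, while a split that leaves the class mix unchanged yields zero gain.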
They might know that you were near Waterloo station, but they will not know whether you visited the cafe, cake shop, pub, or gay bar.

[0566] Analysing metadata about the variable: its name, its source, descriptive data about it coming from the file or from external metadata stores, its date of update, and access controls applied to the data. In addition, learning from user behaviour in managing this and similar data with the Privitar application, namely: Analysing how this data has been classified and managed in other privacy policies (if there exist policies requiring a column to be tokenised, that is a strong indicator that it is sensitive); Considering the similarity of a dataset to other data which users have indicated is sensitive through their classification of that data in a privacy policy (this is rather like learning domain knowledge using recommendations for data privacy: since a user judged that data resembling your data was sensitive or identifying, it's more likely that your data is sensitive or identifying); Reading metadata and data lineage information generated from the anonymisation process, in order to tell the difference between sensitive data and very realistic anonymised data of the same structure. Since the tokenisation process produces fields of the same structure as the original, and the generalisation process preserves the data distributions, anonymised data looks very like raw sensitive data, and the metadata recording that it has been anonymised is necessary. Once quasi-identifiers have been discovered, identify the privacy risk by evaluating the k-distribution of the data. Evaluate how privacy risk and data sensitivity are reduced by anonymisation.

[0633]-[0653] Publisher supports a machine learning approach to identifying sensitive or quasi-identifying columns. Publisher constructs a set of training data using the column names and value sets of all datasets that pass through the system, and labels them according to whether they were marked as “sensitive” or not by the user, and separately, whether they were marked as “quasi-identifying” or not by the user. Publisher can randomly subsample the value sets in order to limit the size of the training set…several machine learning approaches can be used to build a model that can score an unknown column as sensitive or non-sensitive (or similarly, quasi-identifying or non-quasi-identifying)…Possible training algorithms include the following: Support vector machines, handling numerical features in the following way: One-hot encode the column type. Omit column name. Omit n-grams of the column name. Nearest-neighbour algorithms, using the following distance metrics: Difference for numeric features. Levenshtein difference for string features (e.g. column name). Fraction of overlapping elements for sets of strings (e.g. n-grams of column name) or cardinality of overlapping elements. Boosted decision trees.…If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model.

[0659] Once Publisher has the datasets, it can conduct column similarity measures (see elsewhere in this section) to determine whether any of the columns in the public dataset are similar to columns in the dataset being anonymised. If there is a similar column that has not been marked as quasi-identifying, the user can be prompted to check whether the column is quasi-identifying. The user can be provided with a link to the relevant public dataset.

[0692] The organisation possesses historical mortgage data which includes some customer information (e.g. age, home region) about the borrower and whether they ultimately defaulted or not.
The organisation can configure Publisher to consider the customer information columns as quasi-identifying, and the default column as interesting. The organisation can also specify a value for k. Publisher's autogen can automatically generalise the customer information columns to the point where k-anonymity is achieved. The resulting dataset retains useful information about the relationships between the customer information and the default status, but is resistant to re-identification. Thus it can be provided to data scientists who can use it to train useful models but cannot re-identify people in the dataset and discover whether they defaulted on a mortgage.

[1060] Privitar Publisher may learn from user behaviour in managing this and similar data with the Privitar application.)

Claim Rejections – 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
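The k-anonymity auto-generalisation example quoted earlier from McFall [0692] (historical mortgage data with quasi-identifying customer columns and an interesting default column) can be sketched as a membership check plus one generalisation step. The column names and the age-banding scheme below are illustrative assumptions, not from the reference.

```python
# Illustrative k-anonymity sketch in the spirit of McFall [0692]: records
# are dicts, and quasi-identifying columns are generalised until every
# combination of their values occurs at least k times. Column names and
# the age-banding scheme are assumptions, not from the reference.
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier value combination occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(n >= k for n in groups.values())

def generalise_age(records, width=10):
    """One generalisation step: replace exact ages with bands, e.g. 34 -> '30-39'."""
    out = []
    for r in records:
        lo = (r["age"] // width) * width
        out.append({**r, "age": f"{lo}-{lo + width - 1}"})
    return out
```

With distinct raw ages the records fail the check; after banding, ages 31, 34, and 38 all fall into “30-39”, so k=3 holds while the interesting default column is untouched, matching the trade-off described in [0692].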
Claim 1 recites, “A method for providing the ability to analyze two or more disparate data sets via the use of either individual-to-segment or segment-to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches, the method comprising: creating a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; training, on a first disparate data set among the two or more disparate data sets, one or more models by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set; identifying, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set, by using the one or more trained models or the one or more queries on the second disparate data set; performing analytics over the likely candidates in the second disparate data set by: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the 
first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and outputting the likely candidates and their respective probabilistic scores.”

Claim 20 recites “A … for providing the ability to use k-anonymous groups to analyze two or more disparate data sets via the use of either individual-to-segment or segment-to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches, the … operating to: a … that creates a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; a describer … that includes training one or more models on a first disparate data set among the two or more disparate data sets by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries to recognize behaviors of one or more specified subjects within the first disparate data set; a finder … that highlights, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set by using the one or more trained models and/or the one or more queries on the second disparate data set; the describer and finder … performing actions over the one or more identified subjects for
each of the two disparate sets, the actions comprising: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and … to output an analytics result produced by the describer and the finder … on the second disparate data set.”

Analyzing under Step 2A, Prong 1: The limitations regarding, …providing the ability to use k-anonymous groups to analyze two or more disparate data sets via the use of either individual-to-segment or segment-to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches… creating a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; training, on a first disparate data set among the two or more disparate data sets, one or more models by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set;
identifying, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set, by using the one or more trained models or the one or more queries on the second disparate data set; performing analytics over the likely candidates in the second disparate data set by: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and outputting the likely candidates and their respective probabilistic scores…. creates a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; a describer … that includes training one or more models on a first disparate data set among the two or more disparate data sets by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries to recognize behaviors of one or more specified subjects within the first disparate data set; a finder … that highlights, in a second disparate data set among the two or 
more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set by using the one or more trained models and/or the one or more queries on the second disparate data set; the describer and finder … performing actions over the one or more identified subjects for each of the two disparate sets, the actions comprising: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and… to output an analytics result produced by the describer and the finder … on the second disparate data set…, under the broadest reasonable interpretation, can include a human using their mind and using pen and paper to perform the identified limitations; therefore, the claims are directed to a mental process. 
Further, the limitations regarding, …providing the ability to use k-anonymous groups to analyze two or more disparate data sets via the use of either individual-to-segment or segment- to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches… creating a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; training, on a first disparate data set among the two or more disparate data sets, one or more models by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set; identifying, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set, by using the one or more trained models or the one or more queries on the second disparate data set; performing analytics over the likely candidates in the second disparate data set by: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject 
in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and outputting the likely candidates and their respective probabilistic scores…. creates a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; a describer … that includes training one or more models on a first disparate data set among the two or more disparate data sets by performing … classification, conducting a deterministic algorithm, generating a …, or conducting federated learning, or defining one or more queries to recognize behaviors of one or more specified subjects within the first disparate data set; a finder … that highlights, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set by using the one or more trained models and/or the one or more queries on the second disparate data set; the describer and finder … performing actions over the one or more identified subjects for each of the two disparate sets, the actions comprising: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified 
subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and… to output an analytics result produced by the describer and the finder … on the second disparate data set…, under the broadest reasonable interpretation, describe a human recognizing, analyzing, querying, separating, and combining customer behavior and commercial product behavior data based on commonality to preserve customer privacy; these are commercial or legal interactions and the management of personal behavior, relationships, or interactions between people. Thus, the claims are directed to certain methods of organizing human activity. Additionally, the limitations regarding, …performing analytics over the likely candidates in the second disparate data set by: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and outputting the likely candidates and their respective probabilistic scores.…performing actions over the one or more identified subjects for each of the two disparate sets, the actions comprising: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified
subject within the first disparate data set; and… to output an analytics result produced by the describer and the finder … on the second disparate data set…, are mathematical concepts. Accordingly, the claims recite a mental process, certain methods of organizing human activity, and mathematical concepts, and are thus directed to an abstract idea under the first prong of Step 2A. Analyzing under Step 2A, Prong 2: This judicial exception is not integrated into a practical application under the second prong of Step 2A. In particular, the claims recite the following additional elements beyond the abstract idea identified under Step 2A, Prong 1. Claims 1 and 20: system, sub-system, the system comprising: at least one processor operatively coupled to a memory, the processor operating to, an input/output device, machine learning classification, conducting a deterministic algorithm, generating a neural network, or conducting federated learning. Pursuant to the broadest reasonable interpretation, each of these additional elements, as an ordered combination, is a computing element recited at a high level of generality implementing the abstract idea, and thus amounts to no more than applying the abstract idea with generic computer components. Further, these additional elements merely link the abstract idea to a technical environment, namely the environment of a computer. Additionally, with respect to "creating a common representation …," "…training one or more models…," "creating…," and "…output …," these elements do not add a meaningful limitation to integrate the abstract idea into a practical application because they are insignificant extra-solution activity (pre- and post-solution activity), i.e., data gathering ("creating a common representation…," "…training one or more models…") and data output ("… output …"). Analyzing under Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under Step 2B.
As noted above, the aforementioned additional elements beyond the recited abstract idea are not sufficient to amount to significantly more than the recited abstract idea because, as an ordered combination, the additional elements are no more than mere instructions to implement the idea using generic computer components (i.e., apply it). Additionally, as an ordered combination, the additional elements append the recited abstract idea to well-understood, routine, and conventional activities in the field as individually evinced by the applicant's own disclosure, as required by the Berkheimer Memo, in at least: [0005] A system and method, which in certain configurations are implemented via a computer, for providing the ability to use k-anonymous groups to analyze disparate data sets via the use of either individual to segment or segment to segment matching using data modeling or querying are disclosed. [0036] The data is used to identify the related segments/groupings of data subjects on both sides as will be described further herein. Analytics may be performed on those segments/groupings in the knowledge that the analytics are effectively linked (i.e. the analytics or insights over an identified group on one data set may be applied to the same identified group on the other data set). The system identifies a behaviorally similar segment/group of people on both sides by sharing the trained model or queries between the consumer and all subsequent producers.
[0037] In performing actions for each producer data set at step 230, the system 100, for each producer data set, may evaluate the trained model(s) or execute queries over the common representation / detailed feature array (as generated in step 210) to produce a vector of probabilities for each segment, compile S probability vectors into an S-dimensional probability array at step 235, sort and group the S-dimensional probability array to identify the most likely subjects in the producer data set for each segment s belongs to S at step 236, and may perform the specified analytics over the grouped/segmented producer data set. [0038] The describer is a modelling process, which may be an encoder, for example. The describer may input a group of candidates from the consumer data set and describe this group in terms of the common representation extracted in the previous step. The description is done via a modelling or querying process. In one embodiment, this description may be represented as a logistic regression model. In another embodiment, the description may represent a neural network. In another embodiment, the description may represent a set of queries over the common representation. The describer may take a defined group of data subjects from one data set and build a model or define a set of queries that describes those data subjects based on the common representation. [0040] Automated feature generation sub-system to create a common representation in step 210 is further depicted in FIG. 3. The input identified as "Common Representation Columns"310 is the semantic description of step 212 described above. The common representation 330 is augmenting the input data and creating the common representation data 340 in step 213 and 230 and step 214, 215, 216, 218, 219 described above. [0057] In an embodiment, data sets that are collected for different purposes and/or by different controllers may be kept separate. 
Analysis may be performed over the same group of individuals or individuals with similar behaviors across the different data sets. Individuals may not be matched deterministically across data sets due to legal / regulatory restrictions (e.g., GDPR). The present system and method may be used to generate a non-deterministic matched grouping across disparate data sets, allowing for matched segment / group level analytics to be performed. [0066] Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the methods described herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable medium for execution by a computer or processor. Examples of computer-readable media include electronic signals (transmitted over wired or wireless connections) and computer-readable storage media. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random-access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer. Furthermore, as an ordered combination, these elements amount to generic computer components receiving or transmitting data over a network, performing repetitive calculations, electronic record keeping, and storing and retrieving information in memory, which, as held by the courts, are well-understood, routine, and conventional. See MPEP 2106.05(d). 
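The per-segment evaluation described in the quoted paragraph [0037], which produces a probability vector for each segment, compiles S such vectors into a probability array, and sorts/groups that array to find the most likely subjects per segment, might be rendered roughly as below. The `score_fn` callable and the list-of-lists array representation are illustrative assumptions; the specification discloses no code.

```python
def compile_probability_array(segment_models, producer_rows, score_fn):
    """Evaluate each segment's model over the producer data set, producing
    one probability vector per segment (a hypothetical rendering of [0037]).
    prob_array[s][i] = score that producer row i belongs to segment s."""
    return [[score_fn(model, row) for row in producer_rows]
            for model in segment_models]

def most_likely_subjects(prob_array, top_n=2):
    """Sort and group the probability array to pick, for each segment s,
    the indices of the most likely subjects in the producer data set."""
    result = {}
    for s, probs in enumerate(prob_array):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        result[s] = ranked[:top_n]
    return result
```

With two toy segment models `[1, 0]` and `[0, 1]`, a dot-product `score_fn`, and three producer rows, the first segment's most likely subjects are the rows weighted toward the first feature, and analogously for the second.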
Moreover, the remaining elements of dependent claims do not transform the recited abstract idea into a patent eligible invention because these remaining elements merely recite further abstract limitations that provide nothing more than simply a narrowing of the abstract idea recited in the independent claims. Looking at these limitations as an ordered combination adds nothing additional that is sufficient to amount to significantly more than the recited abstract idea because they simply provide instructions to use a generic arrangement of generic computer components to “apply” the recited abstract idea, perform insignificant extra-solution activity, and generally link the abstract idea to a technical environment. Thus, the elements of the claims, considered both individually and as an ordered combination, are not sufficient to ensure that the claim as a whole amounts to significantly more than the abstract idea itself. Since there are no limitations in these claims that transform the exception into a patent eligible application such that these claims amount to significantly more than the exception itself, claims 1-21 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 
102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention. Claims 1-21 are rejected under 35 U.S.C. 102 as being anticipated by U.S. Patent Publication No. US 2020/0327252 A1 to McFall et al. (hereinafter "McFall"). As per Claim 1, McFall teaches: (Currently Amended) A method for providing the ability to analyze two or more disparate data sets via the use of either individual-to-segment or segment-to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches, the method comprising: ([0176]-[0181]) creating a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; (in at least [0148] if a dataset contains an age column, and due to HIPAA (Health Insurance Portability and Accountability Act) regulations the data holder wishes all aggregate statistics to meet the HIPAA requirements for age generalisation, Lens can pre-process queries to replace age filter clauses with more general age filter clauses.
For instance, it can replace “AVG(costs) where age=47” with “AVG(costs) WHERE age >=40 AND age <50”. The data holder supplies the information about the desired intervals or groupings of categorical options to generalise to. [0212] While the Recipient needs to join all the information about each individual they do not care about the specific identity of each individual and so they should not be able to easily reidentify the original unique identifier for a particular individual. The blinding step performed by the Intermediary ensures that the Recipient is able to decrypt corresponding identifiers to the same value, which is distinct from the original identifier value and cannot be used to discover the original identifier value. [0213] When a new batch of data is made available to the Recipient, it uses the schema information as in the Intermediary to identify which columns contain identifiers. It then deserializes these values to retrieve the pair of points, and performs an El-Gamal decryption using the recipient private key to retrieve a point that represents the identifier. It may then serialize this point to a string or number. [0215] SecureLink can be used if a central research organisation wants to collect statistics from individuals who have data at many different service-providing organisations. For example, consider a national health research organisation that is conducting an assessment of nation-wide hospital costs. A single person may visit many hospitals, incurring costs at each hospital. The healthcare research organisation may wish to link the costs of each individual across hospitals in order to have more complete data. A convenient way to link these costs is by social security number, which one assumes is recorded consistently across hospital visits. However, the health organisation wishes to maintain privacy in the data that they collect and thus wishes the identifier to be tokenised. 
[0217] SecureLink allows the healthcare research organisation to gather tokenised records of each person's costs linked together across hospitals. [0225] The data may contain identifiers such as names or account numbers and thus be suitable for masking or tokenisation. Alternatively, it can represent categorical attributes such as race, religion or gender, and other sensitive values such as salary or geolocation and thus be applicable for k-anonymisation. [0599] Publisher compares column names are with a list of HIPAA constants to detect columns that contain typical personal identifiers [0617] To match column names with the provided list of template identifier attribute names the Levenshtein distance between two strings is calculated. Publisher also considers substrings, so that, for example, “account number” is found to be similar to “current account number”. ) wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; (in at least [0565] Publisher's features for automatically detecting sensitive, quasi-identifying, or identifying columns. These features allow the program to assist the user in properly configuring the anonymisation of input datasets and, additionally, in identifying new datasets to anonymise. Publisher takes several approaches to detecting sensitive, quasi-identifying, or identifying columns including using metadata, measuring correlation with known columns, and using machine learning. [0599] Publisher compares column names are with a list of HIPAA constants to detect columns that contain typical personal identifiers such as: [0617] To match column names with the provided list of template identifier attribute names the Levenshtein distance between two strings is calculated. Publisher also considers substrings, so that, for example, “account number” is found to be similar to “current account number”. 
[0618] Publisher takes values from previously known sources of identifiers and finds similarity between those sources and the new data in question. A key source is the content of Publisher Token Vaults, which are known to contain identifiers. A second source is other columns in the dataset that have been assigned a tokenisation rule. If new data contains a significant overlap with a known list of identifiers, it is more likely to be an identifier itself. [0619] Publisher calculates the overlap between one column and another either: Using the Jaccard index, which is the cardinality of the intersection of the columns divided by the cardinality of the union of columns (where the columns are taken as sets). This index is straightforward but inefficient to calculate. For performance, Publisher may approximate the Jaccard index using the “hashing trick”, which hashes each value into a range of values (e.g. 0 to 2{circumflex over ( )}24−1), and maintains a bitstring of the same length, and flips the bit from 0 to 1 only if one of the values is hashed to that index. Publisher can then efficiently approximate the Jaccard distance using the popcount of the AND of the two bitstrings over the popcount of the OR of the two bitstrings. By calculating the cardinality of the intersection of columns divided by the cardinality of the smaller of the two columns (again where the columns are taken as sets). Similarly, Publisher may approximate this using the hashing trick—to approximate this metric, it takes the popcount of the AND of the two bitstrings over the greater of the two bitstrings' popcounts.) 
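The column-similarity measures quoted above in [0617]-[0619], namely Levenshtein distance with substring matching and a Jaccard index approximated via the "hashing trick," can be sketched as follows. The bit-range size, the hash function (MD5), and the distance threshold are illustrative assumptions, not McFall's parameters (the publication's example hash range is 0 to 2^24−1).

```python
import hashlib

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def names_similar(column_name, template, max_distance=2):
    """Match a column name to a template identifier name; substrings count,
    so 'account number' is found similar to 'current account number'."""
    if template in column_name or column_name in template:
        return True
    return levenshtein(column_name, template) <= max_distance

def exact_jaccard(col_a, col_b):
    """Cardinality of the intersection over cardinality of the union,
    with the columns taken as sets."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def _bitstring(values, n_bits=1024):
    """Hash each value into [0, n_bits) and set that bit (the 'hashing trick')."""
    bits = 0
    for v in values:
        bits |= 1 << (int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n_bits)
    return bits

def approx_jaccard(col_a, col_b, n_bits=1024):
    """Approximate the Jaccard index as popcount(AND) / popcount(OR)
    over the two columns' bitstrings."""
    a, b = _bitstring(col_a, n_bits), _bitstring(col_b, n_bits)
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0
```

As a usage check, `names_similar("current account number", "account number")` is true via the substring rule, and `exact_jaccard([1, 2, 3], [2, 3, 4])` is 0.5; the hashed approximation converges on the exact value as the bit range grows relative to the column cardinality.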
training, on a first disparate data set among the two or more disparate data sets, one or more models by performing machine learning classification, conducting a deterministic algorithm, generating a neural network, or conducting federated learning, or defining one or more queries, to recognize behaviors of one or more specified subjects within the first disparate data set; (in at least [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. [0180] FIG. 10 illustrates a simple diagram where two contributors (Bank1 and Bank2) share data to a recipient. A simple example of how this system might be used would be to help calculate the distribution of net worth of individuals. In this example, the banks are Contributors and the recipient calculates the sum of all credits and debits for an individual across the whole financial system for further analysis. [0215] SecureLink can be used if a central research organisation wants to collect statistics from individuals who have data at many different service-providing organisations. For example, consider a national health research organisation that is conducting an assessment of nation-wide hospital costs. A single person may visit many hospitals, incurring costs at each hospital. The healthcare research organisation may wish to link the costs of each individual across hospitals in order to have more complete data. A convenient way to link these costs is by social security number, which one assumes is recorded consistently across hospital visits. However, the health organisation wishes to maintain privacy in the data that they collect and thus wishes the identifier to be tokenised. 
[0281] a bank might tag both credit card numbers and bank account numbers as Tier 1 sensitive data. Publisher would in this example associate an encryption rule with Tier 1 sensitive data, and so any data recorded as being Tier 1 sensitive in the bank's metadata store will be processed in a standard way. [0480] Infogain is a function of a parent node and a set of potential child nodes. It examines the class counts in the column marked as “interesting” (see more on this below). For instance, if the interesting column is whether or not the debtor defaulted on their loan, there will be two classes, “Yes” and “No”, and the counts of these can be gathered for any set of records. Infogain is defined as follows: Let S be the class counts of the parent Let T be the set of children; let each child have a proportion of records and class counts Let H be the entropy function Infogain(S, T)=H(S)−sum_{t in T} proportion(t) H(t) And the entropy function is defined as: H(S)=sum_{x in S} proportion(x) log_2 (1/proportion(x)) [0539] Publisher supports automatically generalising locations to ensure that they will not disclose any sensitive attributes. To make this assurance, Publisher takes a user-provided list of locations of interest, also known as points of interest (POIs). POIs may be hospitals, stores, office buildings, restaurants, bars, cafes, schools, museums, sports facilities, and so on. Publisher can ensure that every generalised location area contains a minimum number of POIs. Because this guarantee is similar to l-diversity, we call this minimum number “l”. For instance, if “l=4”, then an adversary could not tell from the published dataset which location out of at least 4 any target went to. They might know that you were near Waterloo station, but they will not know whether you visited the cafe, cake shop, pub, or gay bar. 
[0566] Analysing metadata about the variable: its name, its source, descriptive data about it coming from the file or from external metadata stores, its date of update, access controls applied to the data. In addition, learning from user behaviour in managing this and similar data with the Privitar application, namely: Analysing how other this data has been classified and managed in other privacy policies—if there exist policies requiring a column to be tokenised, that is a strong indicator that it is sensitive; Considering the similarity of a dataset to other data which users have indicated is sensitive through their classification of that data in a privacy policy. This is rather like learning domain knowledge using recommendations for data privacy: since a user judged that data resembling your data was sensitive or identifying, it's more likely that your data is sensitive or identifying; Reading metadata and data lineage information generated from the anonymisation process, in order to tell the difference between sensitive data, and very realistic anonymised data of the same structure. Since the tokenisation process produces fields of the same structure as the original, and the generalisation process preserves the data distributions, anonymised data looks very like raw sensitive data, and the metadata recording that it has been anonymised is necessary. Once quasi-identifiers have been discovered, identify the privacy risk by evaluating the k-distribution of the data. Evaluate how privacy risk and data sensitivity is reduced by anonymisation. [0633]-[0653] Publisher supports a machine learning approach to identifying sensitive or quasi-identifying columns. Publisher constructs a set of training data using the column names and value sets of all datasets that pass through the system, and labels them according to whether they were marked as “sensitive” or not by the user, and separately, whether they were marked as “quasi-identifying” or not by the user. 
Publisher can randomly subsample the value sets in order to limit the size of the training set…several machine learning approaches can be used to build a model that can score an unknown column as sensitive or non-sensitive (or similarly, quasi-identifying or non-quasi-identifying)…Possible training algorithms include the following: Support vector machines, handling numerical features in the following way: One-hot encode the column type. Omit column name. Omit n-grams of the column name. Nearest-neighbour algorithms, using the following distance metrics: Difference for numeric features. Levenshtein difference for string features (e.g. column name). Fraction of overlapping elements for sets of strings (e.g. n-grams of column name) or cardinality of overlapping elements. Boosted decision trees.…If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model. [0659] Once Publisher has the datasets, it can conduct column similarity measures (see elsewhere in this section) to determine whether any of the columns in the public dataset are similar to columns in the dataset being anonymised. If there is a similar column that has not been marked as quasi-identifying, the user can be prompted to check whether that the column is quasi-identifying. The user can be provided with a link to the relevant public dataset. [0692] the organisation possesses historical mortgage data which includes some customer information (e.g. age, home region) about the borrower and whether they ultimately defaulted or not. The organisation can configure Publisher to consider the customer information columns as quasi-identifying, and the default column as interesting. The organisation can also specify a value for k. 
Publisher's autogen can automatically generalise the customer information columns to the point where k-anonymity is achieved. The resulting dataset retains useful information about the relationships between the customer information and the default status, but is resistant to re-identification. Thus it can be provided to data scientists who can use it to train useful models but cannot re-identify people in the dataset and discover whether they defaulted on a mortgage. [1060] Privitar Publisher may learn from user behaviour in managing this and similar data with the Privitar application.) identifying, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set, by using the one or more trained models or the one or more queries on the second disparate data set; (in at least [0276] When defining new Policies, Publisher assists the privacy officer by annotating the data sources in question with summary information obtained from external metadata stores. It is important that organisational knowledge about data sources is taken into account when the protection is defined; this may include data type, data classification, whether data is identifying or not, whether data is quasi-identifying or not, sensitivity, visibility/permitted audience, risk/exposure estimates, data expiration date, and access control requirements. These annotations help the privacy officer to determine which protections (that is, which types of Rule) should be applied to each Column in the Schema. [0538] Many other similarity measures are possible according to the use case in the analysis, for example: character of the area (city centre, residential, rural), presence of notable places (eg places of worship, transport facilities, workplaces, hospitals, schools and universities), political majority, density of residents of a certain attribute. 
These measures may be combined in various ways—for example to preferentially merge regions which are distant and similar, adjacent and similar, or adjacent and dissimilar. Sometimes it is desirable to combine regions which are dissimilar in order to achieve diversity of features within a combined region, and hence protect privacy by providing deniability. [0566] Considering the similarity of a dataset to other data which users have indicated is sensitive through their classification of that data in a privacy policy. This is rather like learning domain knowledge using recommendations for data privacy: since a user judged that data resembling your data was sensitive or identifying, it's more likely that your data is sensitive or identifying; [0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset. To power this feature, Publisher must be provided with many public datasets (for instance, public housing records or public census listings). Publisher has a portal to upload these datasets. [0658] The user may upload datasets that they know to be related to the datasets they hold: for instance, a company holding HR data may upload an extract of publicly available LinkedIn data (e.g. a data extract of names, job titles, and years of employment). Additionally, Privitar hosts a library of standard datasets, such as census and housing registry datasets. Publisher can download these from Privitar's hosting site. [0659] Once Publisher has the datasets, it can conduct column similarity measures (see elsewhere in this section) to determine whether any of the columns in the public dataset are similar to columns in the dataset being anonymised. If there is a similar column that has not been marked as quasi-identifying, the user can be prompted to check whether the column is quasi-identifying. 
The user can be provided with a link to the relevant public dataset.) performing analytics over the likely candidates in the second disparate data set by: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and (in at least [0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model. [0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset.) for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and (in at least [0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model. [0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset.) outputting the likely candidates and their respective probabilistic scores. (in at least [0552] Publisher's automatic generalisation outputs an anonymised copy of the data where it is guaranteed that k-anonymity is achieved. Alongside performance metrics, such as time taken to generalise the data or the number of bad rows detected, Publisher presents a variety of data distortion measures after a successfully finished Job run. 
[0563] Publisher tries to determine a clustering of records that provides a good trade-off between data utility and privacy. The user can adjust the privacy policy by configuration of the priority columns or the values of k and l in such a way that the resulting clusters are of the appropriate size. There is no clear metric for the “optimal” distribution of cluster sizes but the user can evaluate the results of the generalisation through the provided visualisation of cluster size distribution, as shown in FIG. 32. Clusters shown in lighter grey do not meet the minimum cluster size threshold. [0564] FIGS. 33 and 34 each show an example of a cluster size bubble chart displayed to an end-user. The cluster size bubble chart visualises the sizes and counts of clusters in the output. Each dot corresponds to a cluster of records grouped by their quasi-identifier values. The size of the bubble scales with the size of each cluster. The number in the bubble (displayed if the bubble is large enough to hold text) is the size of the cluster. By clicking on a cluster the user can examine the quasi-identifier values of the cluster. Bubbles shown in grey do not meet the minimum cluster size. For these groups, it is ensured that the values of the cluster are not revealed. The option to display the values of the quasi attributes is disabled. These charts give an overview of how much the generalisation specialized the quasi-identifier columns. If the bubbles are all small (between k and 2*k), that means the generalisation is close to optimal and the output will be generalised less, as shown in FIG. 34. If the diagram has certain very large bubbles then that means the generalisation is further from optimal and the output will be generalised more, as shown in FIG. 33. [0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. 
non-quasi-identifying model. [0683] Publisher measures the Pearson's correlation between columns. Then, when a user marks a column as quasi-identifying and there is another column that is highly correlated to it, Publisher may prompt the user to ask whether this other column is quasi-identifying as well.) As per Claim 2, McFall teaches: (Original) The method of claim 1, wherein the creating a common representation includes evaluating an input list. (in at least [0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset. To power this feature, Publisher must be provided with many public datasets (for instance, public housing records or public census listings). Publisher has a portal to upload these datasets. [0658] The user may upload datasets that they know to be related to the datasets they hold: for instance, a company holding HR data may upload an extract of publicly available LinkedIn data (e.g. a data extract of names, job titles, and years of employment). Additionally, Privitar hosts a library of standard datasets, such as census and housing registry datasets. Publisher can download these from Privitar's hosting site.) As per Claim 3, McFall teaches: (Previously Presented) The method of claim 1, wherein the creating a common representation includes creating a detailed feature array, wherein each individual feature complies with privacy or confidentiality restrictions including at least k-anonymity. (in at least [0017] K-anonymisation is the process of accounting for available background information and ensuring that that background information cannot be used to re-identify masked data. In the k-anonymity model, attributes that can be learned via background information—such as gender, age, or place of residence—are called quasi-identifiers. 
A dataset is k-anonymous if every record in the dataset shares their combination of quasi-identifier values with k−1 other records. This poses a significant obstacle to an attacker who tries to re-identify the data, because they cannot use the background information to tell which out of k records corresponds to any target individual. [0469] the user can then indicate a number of columns to be quasis, and choose k. Publisher will then perform the generalisation, partition the records based on their combination of values in each quasi column (i.e. split them up into their anonymity sets), and then drop records from any partition that has fewer than k records. For instance, if there is only one quasi, “Age”, and the user manually generalised the age column into intervals of width 10, and there are less than k records that have the generalised age value of 80-90, then Publisher will drop these records. This yields an output dataset that is k-anonymous for the configured k. [0907] Privitar Publisher may perform statistical generalisation to achieve k-anonymity and l-diversity while offering visibility of, and fine grained control over, the distortion of each transformed quasi-identifier, and allows tuning of the approach to maximise the utility of a dataset for a specific purpose.) As per Claim 4, McFall teaches: (Original) The method of claim 1, wherein the creating a common representation includes forming geo-spatial features. (in at least [0534] in FIG. 29, Publisher has a location generalisation feature that allows automatic generalisation of location territories. First, there is a preprocessing step which analyses shapefiles (required as input) of maps and produces a planar graph where the nodes are the location territories and there are edges between nodes if the territories abut. These graphs can then be stored by Publisher for repeated use. Shapefiles of common territory maps (such as UK postcodes) are publicly available.) 
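The partition-and-suppress step quoted from [0469] can be sketched as follows; the function and variable names and the sample rows are hypothetical, chosen to mirror the "Age" example in the quotation.

```python
# Minimal sketch of the suppression step in [0469]: partition records by
# their combination of quasi-identifier values (their anonymity sets),
# then drop every partition with fewer than k records. Names are
# illustrative, not McFall's API.
from collections import defaultdict

def k_anonymise_by_suppression(records, quasi_columns, k):
    partitions = defaultdict(list)
    for row in records:
        key = tuple(row[c] for c in quasi_columns)   # anonymity-set key
        partitions[key].append(row)
    # Keep only partitions that already satisfy k-anonymity.
    return [row for group in partitions.values() if len(group) >= k
            for row in group]

rows = [{"age": "80-90", "region": "N"},
        {"age": "20-30", "region": "N"},
        {"age": "20-30", "region": "N"}]
# With k=2, the lone "80-90" record is dropped, as in the quoted example.
print(k_anonymise_by_suppression(rows, ["age", "region"], 2))
```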
As per Claim 5, McFall teaches: (Original) The method of claim 1, wherein the creating a common representation includes forming temporal features. (in at least [0553] FIG. 30 shows a table displayed by Publisher which contains the rule and distortion corresponding to a specific data column. [0554] Publisher calculates the following measures of data distortion: Mean absolute error on generalised numeric columns. Information loss on generalised numeric columns as one minus Pearson's correlation between the raw input and generalised output column. Information loss on generalised categorical columns as the average “generalisation height” across data values. Generalisation height is the number of levels up the hierarchy that the value ended up, normalized by the total distance between the leaf node and the root node. For instance, if a value “January” has a parent “Winter” which has a parent “Any”, the root node, and it is generalised to “Winter”, then this is a 50% generalisation height.) As per Claim 6, McFall teaches: (Previously Presented) The method of claim 1, wherein the creating a common representation includes forming features based on spending customer behaviors. (in at least [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. [0367] An example of where this is necessary is when several columns within an input are related—for example, if the input contains columns representing the date that a mortgage was taken out, the term of the mortgage and the date at which it will be fully repaid, changing either of the first two columns would necessitate changing the third. 
Hence if some masking of either of the initial two values is applied, the following script pseudocode might be used to ensure that the end date remains consistent with the other two columns: outputRow[“endDate”]=outputRow[“startDate”]+outputRow[“term”] [0553] FIG. 30 shows a table displayed by Publisher which contains the rule and distortion corresponding to a specific data column. [0554] Publisher calculates the following measures of data distortion: Mean absolute error on generalised numeric columns. Information loss on generalised numeric columns as one minus Pearson's correlation between the raw input and generalised output column. Information loss on generalised categorical columns as the average “generalisation height” across data values. Generalisation height is the number of levels up the hierarchy that the value ended up, normalized by the total distance between the leaf node and the root node. For instance, if a value “January” has a parent “Winter” which has a parent “Any”, the root node, and it is generalised to “Winter”, then this is a 50% generalisation height. [0566] The context in which data appears—eg a date field in an ecommerce transaction dataset is more likely to be a purchase date than it is a birthdate, but a date field in a customer table stored alongside address and other primary information is more likely to be a date of birth) As per Claim 7, McFall teaches: (Original) The method of claim 1, wherein the creating a common representation includes forming features based on product/brand affinities. (in at least [0009] People often use multiple banks for their finances, multiple hospitals and doctors for their medical treatments, multiple phones for their calls, and so on. [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. 
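The "generalisation height" measure quoted from [0554] can be reproduced with a short sketch using the January → Winter → Any example; the child-to-parent dictionary is an assumed encoding of the hierarchy, made for illustration.

```python
# Sketch of the "generalisation height" distortion measure in [0554]:
# levels climbed in the hierarchy, normalised by the distance from the
# leaf node to the root node.
parents = {"January": "Winter", "Winter": "Any"}   # "Any" is the root

def depth(node):
    """Number of levels between a node and the root."""
    d = 0
    while node in parents:
        node = parents[node]
        d += 1
    return d

def generalisation_height(raw, generalised):
    """Fraction of the leaf-to-root path consumed by the generalisation."""
    return (depth(raw) - depth(generalised)) / depth(raw)

print(generalisation_height("January", "Winter"))   # → 0.5, i.e. 50%
```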
For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. [0553] FIG. 30 shows a table displayed by Publisher which contains the rule and distortion corresponding to a specific data column. [0554] Publisher calculates the following measures of data distortion: Mean absolute error on generalised numeric columns. Information loss on generalised numeric columns as one minus Pearson's correlation between the raw input and generalised output column. Information loss on generalised categorical columns as the average “generalisation height” across data values. Generalisation height is the number of levels up the hierarchy that the value ended up, normalized by the total distance between the leaf node and the root node. For instance, if a value “January” has a parent “Winter” which has a parent “Any”, the root node, and it is generalised to “Winter”, then this is a 50% generalisation height.) As per Claim 8, McFall teaches: (Previously Presented) The method of claim 1, wherein the creating a common representation includes forming features based on demographics or other data subject characteristics common to the plurality of disparate data sets. (in at least [0017] K-anonymisation is the process of accounting for available background information and ensuring that that background information cannot be used to re-identify masked data. In the k-anonymity model, attributes that can be learned via background information—such as gender, age, or place of residence—are called quasi-identifiers. A dataset is k-anonymous if every record in the dataset shares their combination of quasi-identifier values with k−1 other records. This poses a significant obstacle to an attacker who tries to re-identify the data, because they cannot use the background information to tell which out of k records corresponds to any target individual. [0553] FIG. 
30 shows a table displayed by Publisher which contains the rule and distortion corresponding to a specific data column. [0554] Publisher calculates the following measures of data distortion: Mean absolute error on generalised numeric columns. Information loss on generalised numeric columns as one minus Pearson's correlation between the raw input and generalised output column. Information loss on generalised categorical columns as the average “generalisation height” across data values. Generalisation height is the number of levels up the hierarchy that the value ended up, normalized by the total distance between the leaf node and the root node. For instance, if a value “January” has a parent “Winter” which has a parent “Any”, the root node, and it is generalised to “Winter”, then this is a 50% generalisation height.) As per Claim 9, McFall teaches: (Original) The method of claim 1, wherein the creating a common representation includes data provided by a third party. (in at least [0658] a company holding HR data may upload an extract of publicly available LinkedIn data (e.g. a data extract of names, job titles, and years of employment).) As per Claim 10, McFall teaches: (Original) The method of claim 1, wherein the performing includes creating a detailed feature array or common representation. (in at least [0219] Publisher operates on tabular datasets. It can handle a set of tables that have relations among them, such as primary/foreign key relationships. Supported data types include, for example: Strings; Numerics; Dates; Location data; Complex structures such as arrays or maps. Columns containing map fields, as long as a comprehensive list of possible keys is known, are broken out into a set of String columns, one column per key. Columns containing array fields, provided all fields have arrays of the same length, are broken out into a set of columns, one column per index in the array.) 
As per Claim 11, McFall teaches: (Original) The method of claim 1, wherein the performing includes evaluating a model. (in at least [0691] Many organisations have data science teams that wish to train predictive models for business purposes. Sometimes the essential data for model training is sensitive data. As above, the data scientists could train models off of raw data, but this would incur privacy risks. An organisation can use the automatic generalisation features of Publisher to create a dataset that preserves as much utility as possible about the variables they want to model.) As per Claim 12, McFall teaches: (Original) The method of claim 1, wherein the performing includes executing queries. (in at least [0387] Interactive lookup via Hive SerDe—Publisher contains a Hive SerDe component that allows Hive queries over anonymised data to be dynamically de-tokenised when the query is run. The behaviour of the SerDe depends on the current user's role or access permissions. If the user has the appropriate permission to see the raw values, then any tokenised values in the result set returned by Hive can be dynamically looked-up in the appropriate Publisher Token Vault. Users without the required permission continue to see the tokenised values.) As per Claim 13, McFall teaches: (Original) The method of claim 1, wherein the performing includes compiling vectors. (in at least [0417] 3. Derive the initialisation vector for the cipher using the identifier of the rule (for example, by concatenating it with itself until it is the right size), so that the same input value appearing in different rules will produce different seeds.) As per Claim 14, McFall teaches: (Previously Presented) The method of claim 1, wherein the performing includes sorting and grouping a feature array. (in at least [0510] If the selected priority column is numeric, the values are sorted into bins and the information gain calculation is based on the resulting categories. 
The user can either choose a fixed number of bins such that the range of values is split into this number of bins with even size, or she can define a fixed numerical range for each bin category. The numerical values in the interesting columns are sorted into the resulting, non-overlapping bins and treated as separate categories for the information gain calculation. If, for example, “age” is selected as interesting column and the age of a person is given in years, one might define a useful binning of this column into age categories from [0-15), [15-22), [22-35), [35-56), [56-71) and [71-100] or decide to split the variable into only three broader categories. If a fixed number of bins is chosen by the user, Publisher will automatically divide the range of values into evenly sized bins.) As per Claim 15, McFall teaches: (Original) The method of claim 1, wherein the performing includes performing analytics. (in at least [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. As shown in FIG. 2 Lens (11) is typically the only gateway through which a data analyst (14) can retrieve information about a dataset (12). The dataset itself is protected in a secure location (13). The data owner or holder (15) (e.g. the bank or health company) can configure Lens and audit analysts' activity through Lens. Lens restricts access for configuration of the query system to a single channel, with a restricted set of ways to retrieve information and types of information that may be retrieved.) 
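The fixed-range binning quoted from [0510] amounts to sorting numeric values into non-overlapping half-open intervals; a minimal sketch, with the bin edges taken from the [0-15), [15-22), ... example and hypothetical function names:

```python
# Minimal sketch of the binning in [0510]: place numeric values from an
# interesting column into non-overlapping, half-open bins so the
# information gain calculation can treat them as categories.
import bisect

def bin_values(values, upper_edges):
    """upper_edges = [15, 22, ...] yields bins [..,15), [15,22), [22,..)."""
    return [bisect.bisect_right(upper_edges, v) for v in values]

ages = [3, 18, 40, 70]
edges = [15, 22, 35, 56, 71, 100]
print(bin_values(ages, edges))   # → [0, 1, 3, 4]
```

Using `bisect_right` makes each interval half-open, so a value equal to an edge (e.g. age 15) falls into the upper bin, matching the [15-22) convention.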
As per Claim 16, McFall teaches: (Original) The method of claim 1, wherein the training occurs via a sub-system for compiling a description of data relating to a group of entities. (in at least [0633] Publisher supports a machine learning approach to identifying sensitive or quasi-identifying columns. Publisher constructs a set of training data using the column names and value sets of all datasets that pass through the system, and labels them according to whether they were marked as “sensitive” or not by the user, and separately, whether they were marked as “quasi-identifying” or not by the user. Publisher can randomly subsample the value sets in order to limit the size of the training set.) As per Claim 17, McFall teaches: (Original) The method of claim 1, wherein the performing occurs via a sub-system for assessing the data of each entity against the compiled description. (in at least [0587] To aid Policy creation, Publisher compiles and stores metadata about which columns in the tables contain identifiers, quasi-identifying attributes, and potentially sensitive personal information. This information can then be presented to a user creating a Policy, to ensure that all identifiers, quasi-identifiers and sensitive columns are appropriately handled by the Policy. Publisher also assesses various types of risk in the columns (detailed below), which are also useful hints when defining a Policy.) As per Claim 18, McFall teaches: (Original) The method of claim 1, wherein a two-sided marketplace enables data controllers to provide data sets for analysis and consume insights produced from other data sets in a privacy-enhanced way. (in at least [0084] Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. 
For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. As shown in FIG. 2 Lens (11) is typically the only gateway through which a data analyst (14) can retrieve information about a dataset (12). The dataset itself is protected in a secure location (13). The data owner or holder (15) (e.g. the bank or health company) can configure Lens and audit analysts' activity through Lens. Lens restricts access for configuration of the query system to a single channel, with a restricted set of ways to retrieve information and types of information that may be retrieved. [0171] Many software-as-a-service (SaaS) companies provide services that streamline operational processes (e.g., purchasing, supplying, payroll, invoicing) at companies. These SaaS companies possess operational data for a large number of companies, many of which are from similar industries and may even be peers or competitors. While each customer company will have strong demands about the privacy of their data, customer companies may be willing to sign up to a service in which each customer company can learn aggregate information about the group of companies similar to themselves. From this learning they may be able to focus efforts on aspects of their business which are substandard. In this use case, Lens could be used as follows. First, the SaaS company (the data holder), after obtaining permission, posts the operational datasets in Lens (for instance, a dataset of salary levels for different job types across different companies). Then, the data holder configures privacy controls on these datasets, defending against disclosures of the row-level data. Then, the data holder builds a product layer for visualization and reporting that uses Lens on the back end (thus working with privacy-preserving aggregate information) and makes this information easy to consume for customers. 
For instance, this data product layer may automatically gather information about the wage patterns in the past month for various job types, turn this into a set of charts, rankings, and visualizations, and refresh this report each month. Lastly, the data product may send this report to the customer companies or display it to them in a web page. [0694] Privitar enables this safe pooling by enabling a central aggregator to collect the data from each party and then make the pooled data available to each party in a privacy-preserving manner. Where each party's data contains information about common individuals or entities, SecureLink oblivious matching may be used to join these records without revealing the sensitive identifier.) As per Claim 19, McFall teaches: (Original) The method of claim 1, wherein self-service capabilities are provided to enable data controllers to create common representations, describer functionality and analytics. (in at least [0246] Publisher provides a main user interface to creating Policies. This process is to create or reference a Rule for each Column of each Table. When processing a data object, Publisher applies the logic for the appropriate Rules as configured in the Policy. A Rule is a value-level transformation. Publisher applies a Rule to each value of a Column when processing a data object. [0553] FIG. 30 shows a table displayed by Publisher which contains the rule and distortion corresponding to a specific data column. [0554] Publisher calculates the following measures of data distortion: Mean absolute error on generalised numeric columns. Information loss on generalised numeric columns as one minus Pearson's correlation between the raw input and generalised output column. Information loss on generalised categorical columns as the average “generalisation height” across data values. 
Generalisation height is the number of levels up the hierarchy that the value ended up, normalized by the total distance between the leaf node and the root node. For instance, if a value “January” has a parent “Winter” which has a parent “Any”, the root node, and it is generalised to “Winter”, then this is a 50% generalisation height.) As per Claim 20, McFall teaches: (Currently Amended) A system for providing the ability to use k-anonymous groups to analyze two or more disparate data sets via the use of either individual-to-segment or segment- to-segment matching, wherein the disparate data sets cannot be combined for analysis due to a restriction, wherein the matching utilizes modelling or querying approaches, the system comprising: at least one processor operatively coupled to a memory, the processor operating to: ([0176]-[0181][0228]-[0232]) a sub system that creates a plurality of common representations between the two or more disparate data sets, wherein the plurality of common representations is based on a data representation that is common across the two or more disparate data sets, wherein each common representation is to be stored separately along with a respective data set and subject to equivalent data protection restrictions to that of the respective data set; (in at least [0148] if a dataset contains an age column, and due to HIPAA (Health Insurance Portability and Accountability Act) regulations the data holder wishes all aggregate statistics to meet the HIPAA requirements for age generalisation, Lens can pre-process queries to replace age filter clauses with more general age filter clauses. For instance, it can replace “AVG(costs) where age=47” with “AVG(costs) WHERE age >=40 AND age <50”. The data holder supplies the information about the desired intervals or groupings of categorical options to generalise to. 
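The filter-clause pre-processing quoted from [0148] can be sketched as a rewrite of an exact age predicate into its enclosing interval. This is only illustrative: a real query gateway would parse the SQL rather than use a regex, and the ten-year width stands in for whatever intervals the data holder supplies.

```python
# Minimal sketch of the query pre-processing in [0148]: replace an exact
# age predicate with its enclosing ten-year interval, e.g. age=47
# becomes age >= 40 AND age < 50. The regex rewrite and default width
# are illustrative assumptions.
import re

def generalise_age_filter(query, width=10):
    def widen(match):
        age = int(match.group(1))
        lo = (age // width) * width
        return f"age >= {lo} AND age < {lo + width}"
    return re.sub(r"age\s*=\s*(\d+)", widen, query)

print(generalise_age_filter("AVG(costs) WHERE age=47"))
# → AVG(costs) WHERE age >= 40 AND age < 50
```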
[0212] While the Recipient needs to join all the information about each individual they do not care about the specific identity of each individual and so they should not be able to easily reidentify the original unique identifier for a particular individual. The blinding step performed by the Intermediary ensures that the Recipient is able to decrypt corresponding identifiers to the same value, which is distinct from the original identifier value and cannot be used to discover the original identifier value. [0213] When a new batch of data is made available to the Recipient, it uses the schema information as in the Intermediary to identify which columns contain identifiers. It then deserializes these values to retrieve the pair of points, and performs an El-Gamal decryption using the recipient private key to retrieve a point that represents the identifier. It may then serialize this point to a string or number. [0215] SecureLink can be used if a central research organisation wants to collect statistics from individuals who have data at many different service-providing organisations. For example, consider a national health research organisation that is conducting an assessment of nation-wide hospital costs. A single person may visit many hospitals, incurring costs at each hospital. The healthcare research organisation may wish to link the costs of each individual across hospitals in order to have more complete data. A convenient way to link these costs is by social security number, which one assumes is recorded consistently across hospital visits. However, the health organisation wishes to maintain privacy in the data that they collect and thus wishes the identifier to be tokenised. [0217] SecureLink allows the healthcare research organisation to gather tokenised records of each person's costs linked together across hospitals. [0225] The data may contain identifiers such as names or account numbers and thus be suitable for masking or tokenisation. 
Alternatively, it can represent categorical attributes such as race, religion or gender, and other sensitive values such as salary or geolocation and thus be applicable for k-anonymisation. [0599] Publisher compares column names with a list of HIPAA constants to detect columns that contain typical personal identifiers [0617] To match column names with the provided list of template identifier attribute names the Levenshtein distance between two strings is calculated. Publisher also considers substrings, so that, for example, “account number” is found to be similar to “current account number”. ) wherein the plurality of common representations is created by evaluating an input list of overlapping features and creating a detailed feature array from the list of overlapping features; (in at least [0565] Publisher's features for automatically detecting sensitive, quasi-identifying, or identifying columns. These features allow the program to assist the user in properly configuring the anonymisation of input datasets and, additionally, in identifying new datasets to anonymise. Publisher takes several approaches to detecting sensitive, quasi-identifying, or identifying columns including using metadata, measuring correlation with known columns, and using machine learning. [0599] Publisher compares column names with a list of HIPAA constants to detect columns that contain typical personal identifiers such as: [0617] To match column names with the provided list of template identifier attribute names the Levenshtein distance between two strings is calculated. Publisher also considers substrings, so that, for example, “account number” is found to be similar to “current account number”. [0618] Publisher takes values from previously known sources of identifiers and finds similarity between those sources and the new data in question. A key source is the content of Publisher Token Vaults, which are known to contain identifiers. 
A second source is other columns in the dataset that have been assigned a tokenisation rule. If new data contains a significant overlap with a known list of identifiers, it is more likely to be an identifier itself.

[0619] Publisher calculates the overlap between one column and another either:

Using the Jaccard index, which is the cardinality of the intersection of the columns divided by the cardinality of the union of the columns (where the columns are taken as sets). This index is straightforward but inefficient to calculate. For performance, Publisher may approximate the Jaccard index using the “hashing trick”, which hashes each value into a range of values (e.g. 0 to 2^24−1), maintains a bitstring of the same length, and flips the bit from 0 to 1 only if one of the values is hashed to that index. Publisher can then efficiently approximate the Jaccard distance using the popcount of the AND of the two bitstrings over the popcount of the OR of the two bitstrings.

By calculating the cardinality of the intersection of the columns divided by the cardinality of the smaller of the two columns (again where the columns are taken as sets). Similarly, Publisher may approximate this using the hashing trick: to approximate this metric, it takes the popcount of the AND of the two bitstrings over the greater of the two bitstrings' popcounts.)

a describer sub-system that includes training one or more models on a first disparate data set among the two or more disparate data sets by performing machine learning classification, conducting a deterministic algorithm, generating a neural network, or conducting federated learning, or defining one or more queries to recognize behaviors of one or more specified subjects within the first disparate data set; (in at least

[0633]-[0653] Publisher supports a machine learning approach to identifying sensitive or quasi-identifying columns.
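The exact and approximate overlap measures in [0619] can be sketched as below. The 2^24-bit signature width follows the example in the text; using Python's built-in `hash` is an illustrative stand-in for whatever hash function Publisher actually uses:

```python
M = 2 ** 24  # hash range from the text's example: 0 to 2^24 - 1

def jaccard(a: set, b: set) -> float:
    # Exact Jaccard index: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b)

def bitsig(values) -> int:
    # "Hashing trick": a 2^24-bit string (held as a Python int) with
    # bit h(v) flipped to 1 for every value v in the column.
    sig = 0
    for v in values:
        sig |= 1 << (hash(v) % M)
    return sig

def approx_jaccard(sig_a: int, sig_b: int) -> float:
    # popcount(AND) / popcount(OR), as described in [0619].
    return bin(sig_a & sig_b).count("1") / bin(sig_a | sig_b).count("1")

def approx_containment(sig_a: int, sig_b: int) -> float:
    # The text's second metric: popcount(AND) over the greater of the
    # two bitstrings' popcounts.
    pa, pb = bin(sig_a).count("1"), bin(sig_b).count("1")
    return bin(sig_a & sig_b).count("1") / max(pa, pb)
```

Because distinct values rarely collide in a 2^24-slot range, the bitwise approximation tracks the exact index closely while needing only one AND/OR and two popcounts per comparison.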
Publisher constructs a set of training data using the column names and value sets of all datasets that pass through the system, and labels them according to whether they were marked as “sensitive” or not by the user, and separately, whether they were marked as “quasi-identifying” or not by the user. Publisher can randomly subsample the value sets in order to limit the size of the training set…several machine learning approaches can be used to build a model that can score an unknown column as sensitive or non-sensitive (or similarly, quasi-identifying or non-quasi-identifying)…Possible training algorithms include the following:

Support vector machines, handling numerical features in the following way: One-hot encode the column type. Omit column name. Omit n-grams of the column name.

Nearest-neighbour algorithms, using the following distance metrics: Difference for numeric features. Levenshtein difference for string features (e.g. column name). Fraction of overlapping elements for sets of strings (e.g. n-grams of column name) or cardinality of overlapping elements.

Boosted decision trees.

…If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model.

[0659] Once Publisher has the datasets, it can conduct column similarity measures (see elsewhere in this section) to determine whether any of the columns in the public dataset are similar to columns in the dataset being anonymised. If there is a similar column that has not been marked as quasi-identifying, the user can be prompted to check whether the column is quasi-identifying. The user can be provided with a link to the relevant public dataset.)
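A minimal nearest-neighbour sketch of the approach in [0633]-[0653], using one of the listed distance ideas (fraction of overlapping column-name n-grams). The single-feature design, helper names, and threshold are assumptions for illustration, not Publisher's model:

```python
def ngrams(s: str, n: int = 3) -> set:
    # Character n-grams of a column name, lower-cased.
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a: str, b: str) -> float:
    # Fraction of overlapping n-grams, one of the distance metrics
    # listed for nearest-neighbour algorithms.
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def score_sensitive(name: str, labelled: list) -> float:
    # 1-nearest-neighbour score: similarity to the closest labelled
    # column name, positive if that neighbour was marked sensitive.
    best_name, best_label = max(labelled, key=lambda t: similarity(name, t[0]))
    sim = similarity(name, best_name)
    return sim if best_label else -sim

THRESHOLD = 0.3  # assumed cut-off: above this, prompt the user

def should_prompt(name: str, labelled: list) -> bool:
    return score_sensitive(name, labelled) > THRESHOLD
```

The labelled list plays the role of columns previously marked “sensitive” (or not) by users; the same structure would apply unchanged to a quasi-identifying vs. non-quasi-identifying model.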
a finder sub-system that highlights, in a second disparate data set among the two or more disparate data sets, likely candidates to compare to the one or more specified subjects in the first disparate data set by using the one or more trained models and/or the one or more queries on the second disparate data set; (in at least

[0276] When defining new Policies, Publisher assists the privacy officer by annotating the data sources in question with summary information obtained from external metadata stores. It is important that organisational knowledge about data sources is taken into account when the protection is defined; this may include data type, data classification, whether data is identifying or not, whether data is quasi-identifying or not, sensitivity, visibility/permitted audience, risk/exposure estimates, data expiration date, and access control requirements. These annotations help the privacy officer to determine which protections (that is, which types of Rule) should be applied to each Column in the Schema.

[0538] Many other similarity measures are possible according to the use case in the analysis, for example: character of the area (city centre, residential, rural), presence of notable places (e.g. places of worship, transport facilities, workplaces, hospitals, schools and universities), political majority, density of residents of a certain attribute. These measures may be combined in various ways, for example to preferentially merge regions which are distant and similar, adjacent and similar, or adjacent and dissimilar. Sometimes it is desirable to combine regions which are dissimilar in order to achieve diversity of features within a combined region, and hence protect privacy by providing deniability.

[0566] Considering the similarity of a dataset to other data which users have indicated is sensitive through their classification of that data in a privacy policy.
This is rather like learning domain knowledge using recommendations for data privacy: since a user judged that data resembling your data was sensitive or identifying, it is more likely that your data is sensitive or identifying;

[0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset. To power this feature, Publisher must be provided with many public datasets (for instance, public housing records or public census listings). Publisher has a portal to upload these datasets.

[0658] The user may upload datasets that they know to be related to the datasets they hold: for instance, a company holding HR data may upload an extract of publicly available LinkedIn data (e.g. a data extract of names, job titles, and years of employment). Additionally, Privitar hosts a library of standard datasets, such as census and housing registry datasets. Publisher can download these from Privitar's hosting site.

[0659] Once Publisher has the datasets, it can conduct column similarity measures (see elsewhere in this section) to determine whether any of the columns in the public dataset are similar to columns in the dataset being anonymised. If there is a similar column that has not been marked as quasi-identifying, the user can be prompted to check whether the column is quasi-identifying. The user can be provided with a link to the relevant public dataset.)

the describer and finder sub-system performing actions over the one or more identified subjects for each of the two disparate sets, the actions comprising: using the trained model to recognize behaviors in the likely candidates in the second disparate data set; and (in at least

[0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs.
non-quasi-identifying model.

[0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset.)

for each likely candidate, using the trained model to generate a probabilistic score that represents a probability that said likely candidate in the second disparate data set matches the specified subject in the first disparate data set, based on how similar the behaviors of the likely candidate are to the behaviors of the specified subject within the first disparate data set; and (in at least

[0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model.

[0657] Publisher supports detecting quasi-identifiers by maintaining a database of public datasets and looking for columns in supplied datasets that are shared with a public dataset.)

an input/output device to output an analytics result produced by the describer and the finder sub-systems on the second disparate data set. (in at least

[0552] Publisher's automatic generalisation outputs an anonymised copy of the data where it is guaranteed that k-anonymity is achieved. Alongside performance metrics, such as time taken to generalise the data or the number of bad rows detected, Publisher presents a variety of data distortion measures after a successfully finished Job run.

[0563] Publisher tries to determine a clustering of records that provides a good trade-off between data utility and privacy. The user can adjust the privacy policy by configuration of the priority columns or the values of k and l in such a way that the resulting clusters are of the appropriate size.
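As one possible reading of the probabilistic-score limitation quoted above, a match score could be derived from the similarity of behaviour feature vectors. The cosine-based formula below is purely an illustrative assumption, not the claimed method or anything disclosed in the cited paragraphs:

```python
import math

def match_score(subject: dict, candidate: dict) -> float:
    # Behaviours as {feature: weight} vectors; cosine similarity
    # rescaled to [0, 1] as an illustrative match probability.
    keys = set(subject) | set(candidate)
    dot = sum(subject.get(k, 0.0) * candidate.get(k, 0.0) for k in keys)
    norm_s = math.sqrt(sum(v * v for v in subject.values()))
    norm_c = math.sqrt(sum(v * v for v in candidate.values()))
    if norm_s == 0.0 or norm_c == 0.0:
        return 0.0
    return (dot / (norm_s * norm_c) + 1.0) / 2.0
```

A candidate whose behaviour vector points in the same direction as the subject's scores near 1, an unrelated candidate near 0.5, and an empty candidate scores 0.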
There is no clear metric for the “optimal” distribution of cluster sizes, but the user can evaluate the results of the generalisation through the provided visualisation of cluster size distribution, as shown in FIG. 32. Clusters shown in lighter grey do not meet the minimum cluster size threshold.

[0564] FIGS. 33 and 34 each show an example of a cluster size bubble chart displayed to an end-user. The cluster size bubble chart visualises the sizes and counts of clusters in the output. Each dot corresponds to a cluster of records grouped by their quasi-identifier values. The size of the bubble scales with the size of each cluster. The number in the bubble (displayed if the bubble is large enough to hold text) is the size of the cluster. By clicking on a cluster the user can examine the quasi-identifier values of the cluster. Bubbles shown in grey do not meet the minimum cluster size. For these groups, it is ensured that the values of the cluster are not revealed. The option to display the values of the quasi attributes is disabled. These charts give an overview of how much the generalisation specialised the quasi-identifier columns. If the bubbles are all small (between k and 2*k), that means the generalisation is close to optimal and the output will be generalised less, as shown in FIG. 34. If the diagram has certain very large bubbles, that means the generalisation is further from optimal and the output will be generalised more, as shown in FIG. 33.

[0653] If the output score of the sensitive vs. non-sensitive model is above a certain threshold, Publisher can prompt a user suggesting that the column may be sensitive. It can do the same for the quasi-identifying vs. non-quasi-identifying model.

[0683] Publisher measures the Pearson's correlation between columns.
Then, when a user marks a column as quasi-identifying and there is another column that is highly correlated with it, Publisher may prompt the user to ask whether this other column is quasi-identifying as well.)

As per Claim 21 for a system (see at least McFall [0228]-[0232]), the claim substantially recites the subject matter of Claims 2-8 and is rejected based on the same reasoning and rationale.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PO HAN MAX LEE, whose telephone number is (571) 272-3821. The examiner can normally be reached Mon-Thurs, 8:00 am - 7:00 pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Rutao Wu, can be reached at (571) 272-6045.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PO HAN LEE/
Primary Examiner, Art Unit 3623

Prosecution Timeline

Apr 08, 2022
Application Filed
Feb 23, 2024
Non-Final Rejection — §101, §102
Jul 29, 2024
Response Filed
Oct 30, 2024
Final Rejection — §101, §102
Apr 29, 2025
Applicant Interview (Telephonic)
May 02, 2025
Examiner Interview Summary
May 05, 2025
Request for Continued Examination
May 08, 2025
Response after Non-Final Action
Jul 14, 2025
Non-Final Rejection — §101, §102
Oct 07, 2025
Applicant Interview (Telephonic)
Oct 10, 2025
Examiner Interview Summary
Nov 12, 2025
Response Filed
Jan 29, 2026
Final Rejection — §101, §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602629
USING MACHINE LEARNING TO PREDICT FLEET MOVES IN HYDRAULIC FRACTURING OPERATIONS
2y 5m to grant Granted Apr 14, 2026
Patent 12548089
OPTIMIZATION OF HYBRID GROWING INFRASTRUCTURE FOR DIFFERENT WEATHER PROFILES AND MARKET CONDITIONS
2y 5m to grant Granted Feb 10, 2026
Patent 12548046
SYSTEM FOR ACCURATE PREDICTIONS USING A PREDICTIVE MODEL
2y 5m to grant Granted Feb 10, 2026
Patent 12547241
SYSTEMS AND METHODS FOR COMPUTER-IMPLEMENTED SURVEYS
2y 5m to grant Granted Feb 10, 2026
Patent 12361363
METHOD AND SYSTEM FOR PROFICIENCY IDENTIFICATION
2y 5m to grant Granted Jul 15, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

5-6
Expected OA Rounds
32%
Grant Probability
74%
With Interview (+41.2%)
3y 6m
Median Time to Grant
High
PTA Risk
Based on 158 resolved cases by this examiner. Grant probability derived from career allow rate.
