DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is made non-final.
Claims 1-17 are pending. Claims 1, 14 and 17 are independent claims.
Drawings
The drawings are objected to because figure 2, item 204 is labeled “Feature calculation module”. The specification refers to this item as “a feature extraction module”, for example in ¶34. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: a preprocessing module configured to: receive…, a feature detection module configured to detect…, a feature extraction module operatively connected with the feature detection module and configured to: extract…, a context recognition module operatively connected to the machine learning module and configured to: contemplate… and a classification module operatively connected to the feature extraction module, wherein the classification module is configured to: receive… in claim 1, and a score generation module configured to generate a score… in claim 6.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. These modules are interpreted to be software stored in memory as suggested by figure 5 or in specification ¶48: Embodiments of the present subject matter may be implemented in conjunction with program modules, including functions, procedures, data structures, and application programs, for performing tasks, or defining abstract data types or low-level hardware contexts. Executable program stored on any of the above-mentioned storage media may be executable by the processor(s) 402. The interpretations also apply to the dependent claims 2-13.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
Claim 11 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the enablement requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to enable one skilled in the art to which it pertains, or with which it is most nearly connected, to make and/or use the invention. Claim 11 recites: the number of personal identifiable information groups comprising a document size, a number of unique personal identifiable information types, and the type of the data source. It is unclear exactly how PII groups can comprise a document size, for example. The specification recites the document size as a possible machine learning module output feature, not as a group: ¶36, The machine learning module 110 includes the following input and output… Output… Features that represent the score of the personal identifiable information types, the number of personal identifiable information, and the like. Consider a non-limiting example, a non-limiting list of personal identifiable information features listed below:
a. Number of personal identifiable information
b. Document size
In this office action, the specified limitation of claim 11 will be interpreted under BRI to refer to simply outputting document size, number of PII types and data source type information from a machine learning module.
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 8 and 11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 8, the phrase "such as" renders the claim indefinite because it is unclear whether the limitations following the phrase are part of the claimed invention. See MPEP § 2173.05(d).
Regarding claim 11, it recites the limitation "the score" in line 2. There is insufficient antecedent basis for this limitation in the claim.
Claim Objections
Claims 14 and 17 are objected to because of the following informalities: they recite a “future detection module” instead of a “feature detection module”. Appropriate correction is required.
Claims 1-17 recite a mixture of “personally identifiable information” and “personal identifiable information”, and claim 13 recites a “personal identifiability score” and “personally identifiable score”. The title of the invention refers to “personal identifiable information”. Please choose a consistent wording for the claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1:
Step 1: This part of the eligibility analysis evaluates whether the claim falls within any statutory category. See MPEP 2106.03. Claim 1 recites: A system for determination and classification of personally identifiable information in a file using machine learning, wherein the system comprises: a processing subsystem hosted on a server, and configured to execute on a network to control bidirectional communications among a plurality of modules comprising… Claim 1 is directed to a system, i.e., an apparatus (Step 1: YES).
Step 2A prong 1: Does the claim recite a judicial exception? Claim 1 recites: detect personally identifiable information features from a group of a plurality of groups, wherein the plurality of groups comprises a plurality of personally identifiable information (detecting or identifying PII features is a mental process); contemplate a plurality of data source-specific features to recognize the context of the personal identifiable information in case of the unstructured data, wherein the plurality of data source-specific features comprises at least one of a visual feature, text feature, per token representations, features indication for consideration of the token as personally identifiable information, and a type of the personally identifiable information (contemplation or consideration of data features to identify PII is a mental process)… predict the presence of personally identifiable information in the data source (predicting the presence of PII is a mental process); and group the personally identifiable information predicted on the web page and predict the presence of the personally identifiable information in an event of the unstructured data source, wherein the grouping is repeated for all the web pages (grouping PII found on a webpage and predicting if PII is present in an unstructured data source is a mental process). These steps can be performed mentally or are mathematical calculations (Step 2A prong 1: YES).
Step 2A prong 2: Does the claim recite additional elements? Do those additional elements, considered individually and in combination, integrate the judicial exception into a practical application? Claim 1 recites: a preprocessing module configured to: receive a data source comprising a plurality of structured data from a web page, a plurality of semi-structured data, and a plurality of unstructured data, wherein the data source comprises a set of information with a personal identifiable information; and convert the data source into a machine-readable format… a machine learning module operatively connected to the preprocessing module wherein the machine learning module comprises: a feature detection module configured to… a feature extraction module operatively connected with the feature detection module and configured to: extract the plurality of personal identifiable information features from the group of at least one of a static list and a stream, wherein the static list is obtained in response to scanning the data source and wherein the stream is generated dynamically in response to the scanning the data source; and featurize each group of the personal identifiable information located in the web page after scanning of the data source; a context recognition module operatively connected to the machine learning module and configured to… and a classification module operatively connected to the feature extraction module, wherein the classification module is configured to: receive the extracted plurality of personally identifiable information features; and… Receiving and converting input data, extracting features from a static list and/or a stream that is generated from scanning a data source, and “featurizing” scanned data is insignificant extra-solution activity of data gathering without significantly more. 
Using various modules, which are interpreted as general software components, is recited at a high level of generality, i.e., as a generic computer performing generic computer functions. Using a machine learning (ML) module is an attempt to use the machine learning module by merely applying the abstract idea (i.e., perform math/mental processes) without placing any limits on how the ML module operates. Further, the claim omits any details as to how the ML module solves a technical problem and instead recites only the idea of a solution or outcome. See MPEP 2106.05(f). Thus, the limitation represents no more than mere instructions to implement the abstract idea which is equivalent to adding the words “apply it” to the recited judicial exceptions (Step 2A prong 2: NO).
Step 2B: These elements are recited at such a high level of generality that they fail to integrate the abstract idea into a practical application, since they only amount to data gathering or outputting without significantly more (MPEP 2106.05(g)) or provide nothing more than mere instructions to implement an abstract idea on a generic computer (MPEP 2106.05(f)). These limitations, taken either alone or in combination, fail to provide an inventive concept (Step 2B: NO). Thus, the claim is not patent eligible.
Regarding claims 2-13, they recite limitations which further narrow the abstract idea by specifying more details of the mental and mathematical process that occurs (Claim 2, specifying that the input data source is in CSV format and grouping based on information being in the same row is an attempt to limit the field of use without significantly more; Claim 3, specifying that the input data source is in JSON format and grouping based on information being in the same object is an attempt to limit the field of use without significantly more; Claim 4, specifying that the input data source is in natural text format and grouping based on information being on the same webpage is an attempt to limit the field of use without significantly more; Claim 5, generating data by scanning data and metadata is insignificant extra-solution activity of data gathering without significantly more; Claim 6, generating a score based on identifiability and uniqueness of PII is a mental process; Claim 7, using a machine learning model that does not perform iterative updates is still mere instructions to implement the abstract idea, equivalent to adding the words “apply it” to the recited judicial exceptions; Claim 8, specifying examples of text data source features is insignificant extra-solution activity of data gathering or selecting particular types of data (see MPEP 2106.05(g)); Claim 9, describing the visual feature examples is insignificant extra-solution activity of data gathering or selecting particular types of data (see MPEP 2106.05(g)); Claim 10, describing that the per token comprises a plurality of word vectors and features indicating that the token is PII as well as the type of PII is insignificant extra-solution activity of data gathering – using data output from language models is an attempt to apply the judicial exception on generic computing components, as the models are recited at a high level of generality; Claim 11, the various outputs of the system are able to be gathered mentally through analysis of a document or data source; Claim 12, selecting PII information in the same row is insignificant extra-solution activity of selecting particular types of data, and predicting the presence of PII in structured data is a mental process; Claim 13, determining fixed personally identifiable/personal identifiability scores for feature types is a mental process).
Regarding claim 14, it is a method implementing steps similar to the system of claim 1 and is rejected on the same grounds – see above.
Regarding claims 15 and 16, they recite limitations which further narrow the abstract idea by specifying more details of the mental and mathematical process that occurs (Claim 15, pre-determining an identifiability score for calculating PII is a mental process; Claim 16, classifying a scanned data source is a mental process, and the use of a machine learning
Regarding claim 17, it is an apparatus that recites similar limitations to claim 1 and is rejected on the same grounds – see above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 4, 5, 8, 14, 16 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl et al. (US 20210125089 A1), herein Nickl, in view of Begun et al. (US 20210081613 A1), herein Begun, and Chan et al. (US 20210406266 A1), herein Chan.
Regarding claim 1, Nickl teaches: A system for determination and classification of personally identifiable information in a file using machine learning, wherein the system comprises: a processing subsystem hosted on a server, and configured to execute on a network to control bidirectional communications among a plurality of modules comprising (¶122, Broadly, the systems and methods herein are configured to facilitate automated review of a first data file collection derived from the enterprise IT network to detect the presence or absence of protected information therein – and – ¶283, Beginning at 102 of FIG. 1A, a (first) data file collection associated with a data breach event is received by at least one computer (e.g., a server or cloud computing system). The data file collection can be generated by analysis of the data breach event): a preprocessing module configured to: receive a data source comprising a plurality of structured data from a web page, a plurality of semi-structured data, and a plurality of unstructured data, wherein the data source comprises a set of information with a personal identifiable information (¶283, The first data file collection can comprise at least some of structured, unstructured, and/or semi-structured data file types – and – ¶151, “Structured data” can also be included within unstructured or semi-structured data. For example, a table that would comprise structured data if configured as a spreadsheet data file (e.g., excel, csv) can be included in a PDF file, in an email, or the like – Nickl also discloses that web pages are an example of semi-structured data – ¶142, Examples of semi-structured data include... Web pages)… a machine learning module operatively connected to the preprocessing module wherein the machine learning module comprises: a feature detection module configured to detect personally identifiable information features from a group of a plurality of groups, wherein the plurality of groups comprises a plurality of personally identifiable information (¶153, To determine whether protected information is incorporated in the first data file collection, each data file in the collection is analyzed automatically by the computer to identify information or elements of information that may comprise protected information therein – and – ¶154, As would be appreciated, for data files comprising structured data, protected information comprising each of PHI, PII, and other defined terms can be readily identifiable therein because the subject protected information will be identifiable by its classification in the database or by operation of relational databases associated therewith. That is, an automatic search for a SSN, passport number, credit card number etc. – this list can be interpreted as personally identifiable information features from a plurality of personally identifiable information)… a context recognition module operatively connected to the machine learning module and configured to: contemplate a plurality of data source-specific features to recognize the context of the personal identifiable information in case of the unstructured data, wherein the plurality of data source-specific features comprises at least one of a visual feature, text feature, per token representations, features indication for consideration of the token as personally identifiable information, and a type of the personally identifiable information (¶107, The context of the text or any other numbers used around the appearance of the SSN and in the data file in which this 9 digit number appears can also be examined via NLP, machine learning, etc. to generate a confidence level of whether a 9 digit number appearing in the data file in fact is likely to comprise a SSN);
and a classification module operatively connected to the feature extraction module, wherein the classification module is configured to: receive the extracted plurality of personally identifiable information features; and predict the presence of personally identifiable information in the data source (¶232, In a further implementation, the automated review results can be presented in a high-level arrangement that classifies the nature and type of protected information identified in the automated analysis. In the context of PII, the system can identify how many data files individually comprise data elements that are commonly associated with PII either on their own terms or in combination with other data elements, how many data files include only contact information, how many data files include both a name and a PII data element, and how many data files contain only PII data element with information that is not associated with contact information); and group the personally identifiable information predicted on the web page and predict the presence of the personally identifiable information in an event of the unstructured data source, wherein the grouping is repeated for all the web pages (¶260, If any protected information that is now linked to a known entity was not previously associated with an entity, that previously unaffiliated information will now be grouped automatically with the known entity in real time. For example, medical information could have been in a data file with only a number as an entity identification. In a later reviewed data file, the number appears along with the person's name. The numbers in each data file can then be linked to the person, and any protected information in the data files will now be associated with that individual by name – and – ¶261, Information associated with entity groupings and any corrections related thereto can be incorporated into the processes herein. In this regard, context associated with the linkage of data files to an entity (e.g., person(s), company, organization etc.) or entity category (e.g., customer, patient, client, etc.) can be incorporated into the processes to further improve the machine learning for this project and others, such as by enhancing the ability to extract useful information out of unstructured and semi-structured data).
Nickl fails to teach: and convert the data source into a machine-readable format…
However, in the same field of endeavor, Begun teaches: and convert the data source into a machine-readable format (¶78, For incoming document that do not have machine-readable text content already, OCR is also applied)…
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to convert data sources to a machine-readable format as disclosed by Begun in the system disclosed by Nickl to enable the selection of features that are used to analyze the document contents (¶78, These features are partly chosen by designers, and partly learned by image and pattern analysis on large number of documents).
Nickl in view of Begun fails to teach: a feature extraction module operatively connected with the feature detection module and configured to: extract the plurality of personal identifiable information features from the group of at least one of a static list and a stream, wherein the static list is obtained in response to scanning the data source and wherein the stream is generated dynamically in response to the scanning the data source; and featurize each group of the personal identifiable information located in the web page after scanning of the data source…
However, in the same field of endeavor, Chan teaches: a feature extraction module operatively connected with the feature detection module and configured to: extract the plurality of personal identifiable information features from the group of at least one of a static list and a stream, wherein the static list is obtained in response to scanning the data source and wherein the stream is generated dynamically in response to the scanning the data source; and featurize each group of the personal identifiable information located in the web page after scanning of the data source (¶28, Some embodiments alternatively or additionally extract another set of features and derive additional feature vectors from some or all additional cells of some or all of the rows and/or columns in the table. In this way, embodiments can use some or all of the “context” or values contained in these additional feature vectors as signals or factors to consider when generating a decision statistic (e.g., a classification), as described in more detail below. Particular embodiments of the present disclosure model or process these data by sequentially encoding particular feature vectors based on using one or more machine learning models. For example, some embodiments convert each feature vector of a row in a table from left to right in an ordered fashion into another concatenated feature vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to extract features from a list originating from the data source as disclosed by Chan in the system disclosed by Nickl in view of Begun to consider relationships that are captured by ordering within the data source (¶28, Various embodiments benefit from these models because of the inherent sequence and order of values within columns and rows of tables).
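For illustration only, the left-to-right encoding of a table row into one concatenated feature vector, as quoted from Chan ¶28, might be sketched as follows. This sketch is not taken from Chan or from the application: plain concatenation of hand-crafted per-cell features stands in for the learned sequential encoding that Chan describes, and the function names and the three features chosen are hypothetical.

```python
def cell_features(cell):
    """Hypothetical per-cell features: length, fraction of digits, contains '@'."""
    text = cell.strip()
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    return [float(n), digits / n if n else 0.0, 1.0 if "@" in text else 0.0]


def row_feature_vector(row):
    """Concatenate the per-cell feature vectors of a row in left-to-right order."""
    vector = []
    for cell in row:
        vector.extend(cell_features(cell))
    return vector


vec = row_feature_vector(["Ada", "ada@example.com", "123-45-6789"])
```

Because the cells are visited in order, the position of each feature in the resulting vector preserves the column order of the row, which is the ordering relationship the motivation statement relies on.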
Regarding claim 4, Nickl further teaches: The system according to claim 1, wherein the unstructured data source is represented as at least one of a form or a natural text (¶127, In contrast, “unstructured data” is data that either does not have a predefined data model or is not organized in a pre-defined manner. Unstructured data has internal structure but is not structured via pre-defined data models or schema. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational database like NoSQL) wherein the group of personally identifiable information is the personally identifiable information located on the same web page (¶161, More specifically, the application of techniques such as information extraction, coreference resolution, part of speech tagging, etc. can enhance the ability to not only automatically identify the information within context for each data file being automatically identified, but also to automatically identify when specific groupings of distributed information in a single data file are related to the same entity).
Regarding claim 5, Chan further teaches: The system according to claim 1, wherein at least one of the static lists and the stream is generated dynamically as the data source, wherein the data source is scanned along with metadata (¶26, Various embodiments of the present disclosure perform information extraction and processing from tables in various ways. For instance, in some embodiments, a set of features are extracted from a first cell (e.g., a field) of a table. These “features” may represent particular content payload values within a cell itself and/or metadata associated with the content. For example, particular embodiments can extract a part-of-speech (POS) tag (e.g., data indicating whether a word is a noun or adjective) for each word in the first cell and a type of character for each character sequence (e.g., for the word “hi,” indicating that both “h” and “i” are “letters”) or perform other natural language processing technique so that computers can process and understand the information contained in the first cell).
Regarding claim 8, Nickl further teaches: The system according to claim 1, wherein the text data source- features comprises simple descriptors such as local text, personal identifiable information density, the meaning of the content, language modelling, and bi-directional encoder representation from transformation (¶197, the identified 9-digit number is validated as likely being an actual SSN by using the US Social Security Administration rule for issuance of SSNs. The context of the text or any other numbers used around the appearance of the SSN and in the data file in which this 9 digit number appears can also be examined via NLP, machine learning, etc. to generate a confidence level of whether a 9 digit number appearing in the data file in fact is likely to comprise a SSN).
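For illustration only, the two-stage check quoted from Nickl (validating a 9-digit number against issuance rules, then using surrounding text to assign a confidence level) might be sketched as follows. This sketch is not taken from Nickl: the rule used is the commonly documented constraint that area numbers 000, 666 and 900-999, group number 00 and serial number 0000 are not issued, and the context keywords, window size and confidence values are hypothetical placeholders for the NLP or machine learning step.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")
CONTEXT_WORDS = ("ssn", "social security")  # hypothetical context cues


def is_plausible_ssn(area, group, serial):
    """Reject number ranges that are not issued as SSNs."""
    if area == "000" or area == "666" or area >= "900":
        return False
    return group != "00" and serial != "0000"


def ssn_candidates(text, window=40):
    """Return (matched_text, confidence) pairs for plausible 9-digit numbers."""
    results = []
    lowered = text.lower()
    for m in SSN_RE.finditer(text):
        if not is_plausible_ssn(*m.groups()):
            continue
        start = max(0, m.start() - window)
        context = lowered[start:m.end() + window]
        # Hypothetical scores: higher when a context cue appears nearby.
        confidence = 0.9 if any(w in context for w in CONTEXT_WORDS) else 0.4
        results.append((m.group(0), confidence))
    return results
```

For example, `ssn_candidates("SSN: 123-45-6789")` yields one high-confidence candidate, while a number with area 000 is discarded before any context is examined.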
Regarding claim 14, it is a method that recites similar limitations to claim 1 and is rejected on the same grounds – see above.
Regarding claim 16, Nickl further teaches: The method according to claim 14, comprises classifying, by the machine learning module, the scanned data source (¶232, In a further implementation, the automated review results can be presented in a high-level arrangement that classifies the nature and type of protected information identified in the automated analysis).
Regarding claim 17, it is an apparatus that recites similar limitations to claim 1 and is rejected on the same grounds – see above.
Claim(s) 2 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Muffat et al. (US 20200250139 A1), herein Muffat.
Regarding claim 2, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, wherein the structured data source is represented in a comma-separated values format wherein the group of personally identifiable information is the personally identifiable information located in the same row.
However, in the same field of endeavor, Muffat teaches: wherein the structured data source is represented in a comma-separated values format wherein the group of personally identifiable information is the personally identifiable information located in the same row (¶60, After structured and semi-structured tables 106, 108 are processed by the PII extraction module 410, semantic information is linked in the PII linking 404 by building 412 a knowledge base 414 and using graph mining on a knowledge graph in the knowledge base 414. Structured documents like excel, csv, odt allow us to extract personal data types and link them based on the value coordinates (i.e., row and column in a spreadsheet)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to link together data that appears in the same row as disclosed by Muffat in the system disclosed by Nickl in view of Begun and Chan to improve information extraction (¶60, After the PII extraction, the PIIS that appear in the same row of a table can be linked together to create the knowledge base 414. This helps to reduce time and cost and get the same or greater accuracy of extraction).
Regarding claim 12, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, wherein the classification module is configured to feature every group of the personal identifiable information located in the same row and the classification module predicts the presence of the personally identifiable information in the structured data source.
However, in the same field of endeavor, Muffat teaches: wherein the classification module is configured to feature every group of the personal identifiable information located in the same row and the classification module predicts the presence of the personally identifiable information in the structured data source (¶60, After structured and semi-structured tables 106, 108 are processed by the PII extraction module 410, semantic information is linked in the PII linking 404 by building 412 a knowledge base 414 and using graph mining on a knowledge graph in the knowledge base 414. Structured documents like excel, csv, odt allow us to extract personal data types and link them based on the value coordinates (i.e., row and column in a spreadsheet)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to predict PII in rows of the data source as disclosed by Muffat in the system disclosed by Nickl in view of Begun and Chan to improve PII extraction performance (¶60, After the PII extraction, the PIIS that appear in the same row of a table can be linked together to create the knowledge base 414. This helps to reduce time and cost and get the same or greater accuracy of extraction).
Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Seilnacht et al. (EP 4198786 A1), herein Seilnacht.
Regarding claim 3, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, wherein the semi-structured data source is a Javascript object notation file wherein the group of personal identifiable information is the personal identifiable information located in the same object.
However, in the same field of endeavor, Seilnacht teaches: wherein the semi-structured data source is a Javascript object notation file wherein the group of personal identifiable information is the personal identifiable information located in the same object (¶13, For example, a rule may involve detection of patterns (e.g., regular expression), identification of structural components in structured documents, such as extensible markup language (XML) documents, JavaScript object notation (JSON) objects, and/or the like, and other types of criteria that may indicate the presence of sensitive information).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to analyze data sources like a JSON file as disclosed by Seilnacht in the system disclosed by Nickl in view of Begun and Chan to handle PII that is present in already existing JSON documents (¶27, a predictive model may be trained to detect certain types of sensitive information based on known instances of those types of sensitive information in historical documents).
Claim(s) 6, 13 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claims 1 and 14 above, and further in view of Muthusrinivasan et al. (US 8561185 B1), herein Muthusrinivasan.
Regarding claim 6, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, comprises a score generation module configured to generate a score corresponding to the extracted plurality of personal identifiable information type, wherein the score is generated by the identifiability and uniqueness of the plurality of personal identifiable information.
However, in the same field of endeavor, Muthusrinivasan teaches: comprises a score generation module configured to generate a score corresponding to the extracted plurality of personal identifiable information type, wherein the score is generated by the identifiability and uniqueness of the plurality of personal identifiable information (Col. 10, line 14, The magnitude of the component risk score is indicative of the confidence that the identified PII type information is actually personally identifiable information. In some implementations, the higher the positive value of the component risk score, the higher the confidence that the identified PII type information is actually personally identifiable information. For example, the keywords CVV, CVV2, and SSN may have respective component risk scores of 0.8, 1.0, and 0.8; the keywords "expiration" and "credit card" may have component risk scores of 0.6; and the provider name of each credit card may have a component risk score of 0.5. Some component risk scores may be zero or even negative values. For example, keywords such as "dummy", "test number", and "sample" may have respective component risk scores of -2.0, -3.0, and -1.0 – keywords that are associated with lower identifiability and uniqueness are tied to lower risk scores, so data that includes a sample name or sample credit card number will be less identifying and less unique).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to generate a score for extracted PII as disclosed by Muthusrinivasan in the system disclosed by Nickl in view of Begun and Chan to avoid false positive identifications of PII (Col. 5, line 55, The paper describes a random number generator algorithm, and includes a table of five numbers that are generated by use of the algorithm. One of the numbers, by chance, happens to be a valid credit card number. Accordingly, the paper includes information that satisfies a PII type definition (e.g., credit card numbers), and the PII system 120 calculates a risk score for the resource 280. However, none of the other content within the predefined textual distance of the number is secondary information that is associated with the corresponding PII type definition. Accordingly, the PII system 120 determines that the risk score for the resource 280 does not meet the confidentiality threshold, and does not classify the resource 280 is a personal information exposure risk. This reflects the fact that the webpage 280 presents no risk of exposure of personally identifying information, even though, by chance, the resource 280 includes an actual credit card number).
Regarding claim 13, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, comprises a personal identifiability score wherein the personally identifiable score comprises a fixed value and is pre-determined, based on the detected feature type.
However, in the same field of endeavor, Muthusrinivasan teaches: comprises a personal identifiability score wherein the personally identifiable score comprises a fixed value and is pre-determined, based on the detected feature type (Col. 10, line 14, The magnitude of the component risk score is indicative of the confidence that the identified PII type information is actually personally identifiable information. In some implementations, the higher the positive value of the component risk score, the higher the confidence that the identified PII type information is actually personally identifiable information. For example, the keywords CVV, CVV2, and SSN may have respective component risk scores of 0.8, 1.0, and 0.8; the keywords "expiration" and "credit card" may have component risk scores of 0.6; and the provider name of each credit card may have a component risk score of 0.5. Some component risk scores may be zero or even negative values. For example, keywords such as "dummy", "test number", and "sample" may have respective component risk scores of -2.0, -3.0, and -1.0 – keywords that are associated with lower identifiability and uniqueness are tied to lower risk scores, so data that includes a sample name or sample credit card number will be less identifying and less unique).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use a predetermined personal identifiability score as disclosed by Muthusrinivasan in the system disclosed by Nickl in view of Begun and Chan to avoid false positive identifications of PII (Col. 5, line 55, The paper describes a random number generator algorithm, and includes a table of five numbers that are generated by use of the algorithm. One of the numbers, by chance, happens to be a valid credit card number. Accordingly, the paper includes information that satisfies a PII type definition (e.g., credit card numbers), and the PII system 120 calculates a risk score for the resource 280. However, none of the other content within the predefined textual distance of the number is secondary information that is associated with the corresponding PII type definition. Accordingly, the PII system 120 determines that the risk score for the resource 280 does not meet the confidentiality threshold, and does not classify the resource 280 is a personal information exposure risk. This reflects the fact that the webpage 280 presents no risk of exposure of personally identifying information, even though, by chance, the resource 280 includes an actual credit card number).
Regarding claim 15, it recites similar limitations to claim 13 and is rejected on the same grounds – see above.
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Zhao et al. (US 20210019753 A1), herein Zhao.
Regarding claim 7, Nickl in view of Begun and Chan fails to teach: The system, according to claim 1, wherein the machine learning module comprises a fixed single machine learning model and prevents the iterative update.
However, in the same field of endeavor, Zhao teaches: wherein the machine learning module comprises a fixed single machine learning model and prevents the iterative update (¶78, the simpler neural network being trained only once and/or offline, which enables relatively simple calculations to be executed in real-time to determine contributions of features in new samples, etc.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use a fixed model as disclosed by Zhao in the system disclosed by Nickl in view of Begun and Chan to improve efficiency (¶78, relatively high efficiency).
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Abreu et al. (US 20230230088 A1), herein Abreu, and Buezas et al. (US 20230137487 A1), herein Buezas.
Regarding claim 9, Nickl in view of Chan fails to teach: The system according to claim 1, wherein the visual feature comprises a continuous representation capturing layout, comprising whitespace, characters, an autoencoder… and field data (the “an autoencoder” limitation will be interpreted to mean an autoencoder output).
However, in the same field of endeavor, Begun teaches: wherein the visual feature comprises a continuous representation capturing layout, comprising whitespace, characters, an autoencoder… and field data (¶78, The system accepts typical word-processor documents (such as MS Word) and page-layout documents (such as PDF or .png files). In each case, visually-contiguous regions, such as headings, paragraphs, table cells, table, images, and the like are identified and represented as chunks, using a combination of their relative positions, surrounding whitespace, font and layout characteristics, and so on. These features are partly chosen by designers, and partly learned by image and pattern analysis on large number of documents – and – ¶90, The bitmap image of the text layout is divided into tiles, preferably of size on the order of 24 pixels square (adjusting for scan resolution), and the tiles are clustered. Autoencoders and neural network processing of these, including their neighbor relationships…).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize visual features like a continuous representation capturing layout, whitespace and autoencoder output as disclosed by Begun in the system disclosed by Nickl in view of Chan to improve document analysis (¶31, the system can provide valuable assistance to users, for example easier creation of higher-quality new documents, and extraction of desired information for downstream uses such as with other software applications, in back-office databases, derived reports, compliance checking, and so on. Such learning may be done with unsupervised and self-supervised learning techniques, which do not require large amounts of pre-labelled or pre-analyzed data, but instead infer patterns from unlabeled or minimally labeled data – and – ¶90, reveal similar visual events such as the boundaries between text and rules, edges and corners of text blocks, even indentation changes and substantial font/style changes. Further neural networks then use this clustering to co-identify similar layout objects, which frequently indicate or characterize important chunks).
Nickl in view of Begun and Chan fails to teach: background text…
However, in the same field of endeavor, Abreu teaches: background text… (¶48, In doing so, the IV 22 may examine the propriety of one or more of embeddings, such as that of security features including patterning and a watermark, microprint (e.g., font and sizing), placement, sizing, and spacing of PII, and material construction (each being measured for compliance against an official, known standard for such aspects of the presented document, as applicable and provided, for example, by an appropriate governmental agency)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize visual features like background text (i.e., a watermark) as disclosed by Abreu in the system disclosed by Nickl in view of Begun and Chan to improve evaluation of more official documents (¶48, to evaluate the authenticity thereof).
Nickl in view of Begun, Chan, and Abreu fails to teach: anchor text…
However, in the same field of endeavor, Buezas teaches: anchor text… (¶49, The set of features 302 may be a set of attributes of the interface object and/or a set of attributes of nearby interface objects. For example, an attribute of the set of features 302 may be the HTML element class (e.g… anchor…)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize visual features like anchor text as disclosed by Buezas in the system disclosed by Nickl in view of Begun, Chan and Abreu to create a more flexible model (¶16, can accurately identify elements of interest and determine their element class regardless of the language the interface is written in because among the numerous features from which the machine learning model of the present disclosure is trained, can be found features that are common to interfaces irrespective of origin. For example, many source code keywords may be in English even in regions where English is not a primary language).
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Buezas.
Regarding claim 10, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, wherein the per token comprises a plurality of word vectors, outputs of different layers of language models, and features indicating that the token is considered as the personal identifiable information and the type of personal identifiable information.
However, in the same field of endeavor, Buezas teaches: wherein the per token comprises a plurality of word vectors (¶14, For example, a set of features for an element of interest may be derived from the list of keywords as described in further detail herein, and the set of features may be tokenized/transformed into a feature vector suitable for input to the machine learning model (with the element class for the element of interest having been identified by a human operator as the ground truth label for the element of interest)), outputs of different layers of language models (¶16, Thus, the present disclosure contemplates that multiple machine learning models may easily be trained and implemented for different regions using separate sets of training data for the different regions), and features indicating that the token is considered as the personal identifiable information and the type of personal identifiable information (¶55, the predicted class 324 may be a suggested most-likely classification for the interface object. In other embodiments, the predicted class 324 may be a set of confidence scores representing probabilities that the interface object corresponds to different classifications. For example, if the interface object is a text box with certain features, the predicted class 324 may be a set of confidence scores indicating a 30% probability that the text box is an address field, a 11% probability that the text box is a first name field, a 2% probability that the text box is a zip code field, and so on).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use per token features as disclosed by Buezas in the system disclosed by Nickl in view of Begun and Chan to improve model adaptability and classification performance (¶16, in order to improve the form-filling accuracy for the different regions – and – ¶55, The highest confidence of the set of confidence scores may indicate which of the interface object classes the machine learning model 322 predicts is most likely the correct one).
Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Nickl in view of Begun and Chan as applied to claim 1 above, and further in view of Buezas and Robbins et al. (US 20220414165 A1), herein Robbins.
Regarding claim 11, Nickl in view of Begun and Chan fails to teach: The system according to claim 1, wherein an output of the system is the plurality of features representing the score of the personal identifiable information types, the number of personal identifiable information groups comprising… a number of unique personal identifiable information types.
However, in the same field of endeavor, Buezas teaches: wherein an output of the system is the plurality of features representing the score of the personal identifiable information types, the number of personal identifiable information groups comprising… a number of unique personal identifiable information types (¶55, the predicted class 324 may be a suggested most-likely classification for the interface object. In other embodiments, the predicted class 324 may be a set of confidence scores representing probabilities that the interface object corresponds to different classifications. For example, if the interface object is a text box with certain features, the predicted class 324 may be a set of confidence scores indicating a 30% probability that the text box is an address field, a 11% probability that the text box is a first name field, a 2% probability that the text box is a zip code field, and so on).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have a model output a score along with a number of unique PII types as disclosed by Buezas in the system disclosed by Nickl in view of Begun and Chan to consider multiple PII types that may be present in a document or file (¶55, The highest confidence of the set of confidence scores may indicate which of the interface object classes the machine learning model 322 predicts is most likely the correct one).
Nickl in view of Begun, Chan and Buezas fails to teach: a document size… and the type of the data source.
However, in the same field of endeavor, Robbins teaches: a document size… and the type of the data source (¶29, The output of the crawl 152 performed by the data crawler 104 may comprise various information such as, but not limited to, metadata describing any aspect of a file, record, or other type or collection of data, such as file identifiers, data length, file size and file type, file content, file creation and modification time, discovered LU or calculated meta labels, and file owner information, among other attributes).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to output features such as document size and data source type as disclosed by Robbins in the system disclosed by Nickl in view of Begun, Chan and Buezas to extract additional data from the data source that may be used in downstream tasks (¶30, that the more information that is captured in the metadata and associated labels, the better the insights are that may be produced in subsequent operations, as described below).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hoffman et al. (US 20230216835 A1) discloses PII detection during screen recording based on text matching and webpage layout details; McCloskey et al. (US 11250876 B1) discloses using regular expressions to remove and replace PII in a variety of domains; and Enuka et al. (US 20210256156 A1) discloses predicting the presence of PII in documents using metadata.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HARRISON CHAN YOUNG KIM whose telephone number is (571)272-0713. The examiner can normally be reached Monday - Thursday 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CESAR PAULA, can be reached at (571) 272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HARRISON C KIM/ Examiner, Art Unit 2145
/CESAR B PAULA/ Supervisory Patent Examiner, Art Unit 2145