DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA and is in response to communications filed on 2/24/2026 in which claims 1-20 are presented for examination.
Priority
Acknowledgment is made of provisional application No. 63/227,809, filed on 7/29/2022.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Shukla et al. US 20200133967 A1 (hereinafter referred to as “Shukla”) in view of Gullapudi et al. US 20230153382 A1 (hereinafter referred to as “Gullapudi”).
As per claim 1, Shukla teaches:
A method comprising:
receiving a targeted data request, wherein the targeted data request identifies a data subject (Shukla, [0045] – A search query may be received, wherein this is interpreted as a targeted data request. In addition to including the tokens associated with the search query within the web document, the retrieved set of web documents may be provided based on an associated relevance score, wherein tokens are interpreted as data subjects);
based on the targeted data request, accessing a plurality of documents from one or more data sources, wherein each document in the plurality of documents comprises unstructured content (Shukla, [0057] – Examples of such web services include websites that provide online content, such as news websites (e.g., websites for the NY Times®, Wall Street Journal®, Washington Post®, and/or other news websites), social networking websites (e.g., Facebook®, Google®, LinkedIn®, Twitter®, or other social network websites), merchant websites (e.g., Amazon®, Walmart®, or other merchant websites), or any other websites provided via websites/web services (e.g., that provide access to online content or other web services), wherein news articles or social media posts contain unstructured natural language content. [0282] – The documents (e.g., any set of data, such as any unstructured corpus of data) can then be classified);
generating a first feature representation of each document of the plurality of documents by utilizing a first embedding model to process the unstructured content associated with each document in the plurality of documents, wherein the first feature representation comprises first numerical representations of each document in the plurality of documents (Shukla, [0135] – When k=100, each entity can be represented as a 100 dimensional space vector of web documents and each web document can be represented as a 100 dimensional space vector of entities (e.g., each entity can be embedded in the 100 dimensional space). [0136] – In the event the distance between two 100 dimensional space vectors is less than or equal to a document similarity threshold, the two interests are determined to be similar. In the event the distance between two 100 dimensional space vectors is greater than a document similarity threshold, the two interests are determined to be dissimilar, wherein the distance between documents in the vector space is interpreted as a numerical representation of each document. Also, the position within the vector space can serve as a numerical representation, though it is meaningful only when compared, via a similarity distance measure, against other documents or a subject of interest such as a search query);
processing the first feature representation of each document of a plurality of documents using a classifier machine-learning model to generate a prediction as to a likelihood that the document contains the targeted data, wherein the first feature representation of each document comprises at least one first dimension representing a first feature of corresponding unstructured content found in the document (Shukla, [0135] – Dimensions within a matrix can represent all web documents that are associated with various entities. [0170] – A “count-based” method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. [0282] – Pages under a specific URL can be fed into a classifier system for deep learning models. The documents (e.g., any set of data, such as any unstructured corpus of data) can then be classified into a particular category (e.g., a sports category such as baseball, football, or another sport, or a technology category such as computers, routers, medical devices, or another technology));
generating, based on the prediction for each document of the plurality of documents, a first subset of documents from the plurality of documents …
wherein each document in the first subset of documents comprises a prediction that satisfies a threshold based on the prediction as to the likelihood that the document contains the targeted data (Shukla, [0164] – These disclosed techniques can provide an estimate based on old/known terms (e.g., to predict that a document is relevant to an interest, an interest is relevant to a document, a new document is relevant to an old document, etc.). [0167] – A search can then be performed using the embedding space based on identifying online content that is near (e.g., within a predetermined/threshold distance and/or other criteria, such as freshness, popularity, prior user activity, etc.) that interest vector for that search query. Shukla, [0195] – Can be applied to determine/predict interests (e.g., entities, n-grams) based on a document by determining such interests that are in the neighborhood of/nearby the document in the embedding space (e.g., SpaceX launch can be an interest determined from a document about a rocket launch). In an example implementation, a relatedness parameter can be tuned, such as based on distance/smear (e.g., using a configurable parameter, such as 0.5, 1.0, or another value), wherein the SpaceX launch is interpreted as targeted data because SpaceX launch is a subject which corresponds to the definition given in the beginning of the claims for “targeted data request”);
generating a second feature representation of each document of the first subset of documents by utilizing a second embedding model that differs from the first embedding model to process the unstructured content associated with each document in the first subset of documents, wherein the second feature representation comprises second numerical representations that differ from the first numerical representations (Shukla, [0129] and [0136] – An initial space of documents that are below a distance and therefore above a confidence threshold. [0167] – Other features can be determined, such as freshness of a document, which will inherently contain a different set of numerical representations in the form of scores within the subset. [0218] – A second filter can be applied to the list of recommended interests which further removes interests with inappropriate content, wherein this is interpreted as a second feature representation that differs from the first. [0243] – A vector-based model (e.g., a vector model) for each document in the index. Furthermore, [0313] – Determining if the document exceeds a popularity threshold is performed (e.g., or some other threshold or combination of thresholds based on usefulness factors/signals as described herein or other metrics associated with the document and online activity/sources), wherein a combination of thresholds based on other metrics is interpreted as multiple feature representations which change numerical representations of the documents. At 2310, modifying the reevaluation rate based on a threshold change in the document's popularity is performed, wherein a similarity threshold combined with any other threshold is interpreted as a second feature representation of each document of the first subset);
processing the second feature representation of each document of the first subset of documents using a clustering machine-learning model to generate a plurality of document clusters,
wherein the second feature representation of each document comprises at least one second dimension representing a second feature of the unstructured content found in the document and each document cluster of the plurality of document clusters comprises a subset of similar documents from the first subset of documents, wherein each document in the subset of similar documents comprises similar unstructured features (Shukla, [0135] – The collaborative filtering scheme represents all entities and all documents as a matrix. Given the vast number of web documents and the vast number of potential interests, an m×n matrix X (e.g., a co-occurrence matrix of dimensions m by n) can represent all the web documents and whether a particular web document is about a particular entity that corresponds to a particular interest. [0161] – Clustering techniques for filtering data especially with regards to dimensional space vectors); and
providing the plurality of document clusters so that an analysis can be performed on each document cluster of the plurality of document clusters to at least one of
eliminate the document cluster as having the targeted data associated with the data subject; or
identify the targeted data associated with the data subject found in a document cluster by reviewing less than all of the subset of similar documents for the document cluster (Shukla, [0161] – Clustering techniques for filtering data especially with regards to dimensional space vectors).
Although Shukla teaches embeddings, Shukla does not explicitly teach that a second set of documents is filtered and therefore excluded from a second processing module. However, Gullapudi teaches:
…, the first subset of documents excluding one or more documents from the plurality of documents from being processed by a second embedding model (Gullapudi, [0019] – Problem spaces are interpreted as embedding models. [0056] – A ML model loading module 420, a first document (D1) processing module 422, a second document (D2) loading module 424, and a second document (D2) filtering module. [0058] – The second document (D2) filter 426 selectively filters entities represented in the second document from being processed for inference. In some examples, the second document (D2) filter 426 receives a set of matched keys, each matched key being associated with an entity represented in the second document that has already been matched to an entity represented in the first document).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shukla’s invention in view of Gullapudi in order to filter out documents from being processed for inference; this is advantageous because the lowest possible probability threshold is selected that still achieves a target accuracy, which enables more entities to be removed from further consideration in inference than would be removed by selecting a higher probability threshold (Gullapudi, [0053]).
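For illustration only, the two mechanisms the combination relies on — vector-distance similarity against a document similarity threshold (Shukla, [0135]-[0136]) and threshold-based exclusion of documents from a second processing stage (Gullapudi, [0053], [0058]) — can be sketched in Python as follows. This is a minimal sketch with hypothetical names, scores, and thresholds; it is not the implementation of either reference.

```python
import math

def euclidean_distance(u, v):
    # Distance between two embedding vectors (e.g., the 100 dimensional
    # space vectors of Shukla [0135]); a distance at or below a document
    # similarity threshold indicates similarity per Shukla [0136].
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def filter_for_second_stage(docs, scores, threshold):
    # Keep only documents whose first-stage prediction satisfies the
    # threshold; the remainder are excluded from the second embedding
    # model, as in the Shukla/Gullapudi combination.
    return [d for d, s in zip(docs, scores) if s >= threshold]

# Hypothetical documents and first-stage classifier scores.
docs = ["doc_a", "doc_b", "doc_c"]
scores = [0.92, 0.40, 0.75]
subset = filter_for_second_stage(docs, scores, threshold=0.5)
# subset keeps doc_a and doc_c; doc_b is excluded from further processing
```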
As per claim 2, Shukla as modified teaches:
The method of Claim 1, wherein the first feature representation comprises a Word2Vec representation (Shukla, [0243] – The unsupervised machine learning can learn a representation of a word, a sequence of words, parts of a document such as title, and finally, a representation for the entire document itself), and
the second feature representation comprises a term frequency - inverse document frequency (TF-IDF) representation (Shukla, [0085] – Each word/combination of words within the news article can be assigned a term-frequency-inverse document frequency (TF-IDF) value).
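For context, a TF-IDF value of the kind Shukla [0085] assigns to each word/combination of words can be sketched as follows. The corpus, tokenization, and smoothing choice here are hypothetical; this is an illustrative sketch of the general TF-IDF technique, not Shukla's formula.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # Term frequency within this document, weighted by how rarely the
    # term appears across the corpus (a common smoothed IDF variant).
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

# Hypothetical three-document corpus of tokenized text.
corpus = [["spacex", "rocket", "launch"],
          ["rocket", "engine", "test"],
          ["baseball", "game", "score"]]
score = tf_idf("spacex", corpus[0], corpus)
# "spacex" scores higher than "rocket" in the same document because it
# appears in fewer documents of the corpus.
```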
As per claim 3, Shukla as modified teaches:
The method of Claim 1 further comprising:
identifying, based on at least one of a type of the targeted data request or the data subject, a plurality of data sources (Shukla, [0282] – The documents (e.g., any set of data, such as any unstructured corpus of data) can then be classified into a particular category (e.g., a sports category such as baseball, football, or another sport, or a technology category such as computers, routers, medical devices, or another technology)); and
querying, based on a parameter provided with the targeted data request, the plurality of data sources to retrieve the plurality of documents (Shukla, [0045] – A search and feed service may receive a search query that is comprised of a plurality of query terms corresponding to a large number of tokens to retrieve one or more web documents that match the search query).
As per claim 4, Shukla as modified teaches:
The method of Claim 1 further comprising identifying top words found in the subset of similar documents for a particular document cluster of the plurality of document clusters, wherein the top words are also provided along with the plurality of document clusters (Shukla, [0085] – Each word/combination of words can be assigned a term-frequency-inverse document frequency (TF-IDF) value).
As per claim 5, Shukla as modified teaches:
The method of Claim 4, wherein the top words are based on at least one of a top number of words with respect to frequency of appearance in the subset of similar documents for the particular document cluster (Shukla, [0085] – Each word/combination of words can be assigned a term-frequency-inverse document frequency (TF-IDF) value),
a top percentage of words with respect to frequency of appearance in the subset of similar documents for the particular document cluster (Shukla, [0107] – The certain threshold can be a threshold endorsement score, a top percentage of interests (e.g., top 10%), a top tier of interests (e.g., top 20 interests), etc.), or
words that satisfy a second threshold with respect to frequency of appearance in the subset of similar documents (Shukla, [0085] – The score is normalized to a value between 0 and 1. A word/combination of words with a score above a threshold value is determined to be an interest associated with the user account).
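The three alternatives recited in claim 5 — a top number of words, a top percentage of words, or words satisfying a frequency threshold — can be sketched as follows. The cluster contents and cutoffs are hypothetical; this illustrates the claimed selection criteria generically, not any implementation of record.

```python
from collections import Counter

def top_words(cluster_docs, top_n=None, top_pct=None, min_count=None):
    # Count word frequency across all documents in the cluster, then
    # select by top count, by top percentage of the ranked vocabulary,
    # or by a frequency threshold (the claim 5 alternatives).
    counts = Counter(w for doc in cluster_docs for w in doc)
    ranked = [w for w, _ in counts.most_common()]
    if top_n is not None:
        return ranked[:top_n]
    if top_pct is not None:
        return ranked[:max(1, int(len(ranked) * top_pct))]
    return [w for w, c in counts.items() if c >= (min_count or 1)]

# Hypothetical cluster of two tokenized documents.
cluster = [["rocket", "launch", "rocket"], ["launch", "pad"]]
top_two = top_words(cluster, top_n=2)      # two most frequent words
frequent = top_words(cluster, min_count=2)  # words appearing twice or more
```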
As per claim 6, Shukla as modified teaches:
The method of Claim 1 further comprising:
processing features of at least one document of the subset of similar documents for a particular document cluster of the plurality of document clusters using a multi-label machine learning model to generate a second prediction as to a likelihood that a certain type of the targeted data is present in the subset of similar documents for the particular document cluster (Shukla, [0091] – The machine learning model can be implemented using machine-learning based classifiers. [0161] – Clustering techniques for filtering dimensional spaces. [0191] – The disclosed techniques can utilize a different set of labels (e.g., point-wise mutual information (PMI) as similarly described above) to generate the embeddings as further described. Also, [0284]-[0286]); and
determining, based on the second prediction satisfying a second threshold, that the certain type of the targeted data is present in the subset of similar documents for the particular document cluster, wherein the certain type of the targeted data is also provided along with the plurality of document clusters (Shukla, [0282] – The documents (e.g., any set of data, such as any unstructured corpus of data) can then be classified into a particular category (e.g., a sports category such as baseball, football, or another sport, or a technology category such as computers, routers, medical devices, or another technology)).
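The claim 6 determination — a multi-label model emitting a score per category of targeted data, with a category reported when its score satisfies a second threshold — can be sketched as follows. The label names and scores are hypothetical; this illustrates the claimed thresholding step generically, not the classifiers of Shukla.

```python
def labels_present(label_scores, threshold):
    # A multi-label model yields one score per category of targeted
    # data; a category is determined to be present in the cluster when
    # its score satisfies the (second) threshold.
    return [label for label, score in label_scores.items()
            if score >= threshold]

# Hypothetical per-category scores for one document cluster.
scores = {"ssn": 0.91, "email": 0.30, "phone": 0.66}
detected = labels_present(scores, threshold=0.5)
# categories whose score met the threshold are reported with the cluster
```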
As per claim 7, Shukla as modified teaches:
The method of Claim 1, wherein providing the plurality of document clusters involves providing the plurality of document clusters to a computing system configured to perform the analysis and use the targeted data associated with the data subject to perform an automated task (Shukla, [0161] – Clustering techniques for filtering data especially with regards to dimensional space vectors).
As per claim 8, Shukla as modified teaches:
The method of Claim 7, wherein the automated task comprises at least one of generating a report comprising the targeted data associated with the data subject, creating a map of where the targeted data associated with the data subject is found in the plurality of documents, or deleting the targeted data associated with the data subject (Shukla, [0163] – Embedding generally refers to a technique for mapping discrete items to a vector of real numbers (e.g., represented by floating-point numbers/computation using a computer). [0351] – The selected and ranked set of documents can then be generated and communicated to the client application).
Claims 9-14 are directed to a system performing steps recited in claims 1-8 with substantially the same limitations. Therefore, the rejections made to claims 1-8 are applied to claims 9-14.
Claims 15-20 are directed to a non-transitory computer-readable medium performing steps recited in claims 1-8 with substantially the same limitations. Therefore, the rejections made to claims 1-8 are applied to claims 15-20.
Response to Arguments
Applicant’s arguments with respect to claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Xu et al., “GatorShare: A File System Framework for High-Throughput Data Management”, ACM, 2010, pp. 776-786, https://dl.acm.org/doi/pdf/10.1145/1851476.1851588.
Enuka et al. US 20200050966 A1 teaches scanning any number of data sources in order to provide users with visibility into stored personal information, risk associated with storing such information and/or usage activity relating to such information (Abstract).
Boyer et al. US 11200510 B2 teaches text classifier training (Abstract).
Liu et al. US 11755626 B1 teaches a multi-dimensional vector from tokenized topics resulting in one or more vectorized topics. Liu also teaches a similarity score for documents with respect to each topic, as well as classifications for documents that are most similar to the respective topics, in at least column 39, lines 45-65.
Tapuhi et al. US 10455088 B2 teaches a plurality of feature vectors, each feature vector corresponding to one of the recorded interactions; computing, by the processor, similarities between pairs of the feature vectors; grouping, by the processor, similar feature vectors based on the computed similarities into groups of interactions; rating, by the processor, feature vectors within each group of interactions based on one or more criteria, wherein the criteria include at least one of interaction time, success rate, and customer satisfaction; and outputting, by the processor, a dialog tree in accordance with the rated feature vectors for configuring the automated self-help system (Abstract).
Barday et al. US 9892444 B2 teaches presenting a threshold privacy assessment that includes a first set of privacy-related questions for a privacy campaign.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Matthew Ellis whose telephone number is (571)270-3443. The examiner can normally be reached on Monday-Friday 8AM-5PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Neveen Abel-Jalil can be reached on (571)270-0474. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
March 18, 2026
/MATTHEW J ELLIS/Primary Examiner, Art Unit 2152