Prosecution Insights
Last updated: April 19, 2026
Application No. 17/986,865

DAY ZERO NATURAL LANGUAGE PROCESSING MODEL

Current OA: Non-Final (§103, §112)
Filed: Nov 14, 2022
Examiner: ZHU, RICHARD Z
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: Movius Interactive Corporation
OA Round: 3 (Non-Final)

Grant Probability: 69% (Favorable)
Projected OA Rounds: 3-4
Projected Time to Grant: 3y 2m
Grant Probability With Interview: 85%
Examiner Intelligence

Career Allow Rate: 69% (above average), 498 granted / 718 resolved, +7.4% vs TC avg
Interview Lift: strong, +15.4% among resolved cases with interview
Typical Timeline: 3y 2m avg prosecution, 32 currently pending
Career History: 750 total applications across all art units
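The headline allow rate above follows from the reported career counts; a quick arithmetic check (the rounding convention is an assumption):

```python
# Check the dashboard's career allow rate from the reported counts:
# 498 applications granted out of 718 resolved.
granted, resolved = 498, 718
allow_rate = granted / resolved
print(round(allow_rate * 100))  # dashboard shows 69%
```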

Statute-Specific Performance

§101: 16.0% (-24.0% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 19.7% (-20.3% vs TC avg)
§112: 4.2% (-35.8% vs TC avg)

Tech Center averages are estimates. Based on career data from 718 resolved cases.

Office Action

§103, §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/09/2026 has been entered.

Status of the Claims

Claims 1-12 are pending.

Response to Applicant's Arguments

In response to “The claims do not merely use a computer as a tool. They improve the operation of the computer system itself by: Enabling NLP model generation without labeled datasets; Reducing computational inefficiencies associated with irrelevant or cross-domain data; and Altering how inference is performed at runtime. This constitutes a technical improvement under Enfish and MPEP § 2106.05(a)”:

The Supreme Court held that when a claim containing an abstract idea (e.g., a mathematical formula) implements or applies that abstract idea in a structure or process which, when considered as a whole, is performing a function which the patent laws were designed to protect (e.g., transforming or reducing an article to a different state or thing), then the claim satisfies the requirements of §101. Diamond v. Diehr, 450 U.S. 175, 192 (1981); MPEP 2106.04(d)(I) (“Implementing a judicial exception with, or using a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim, as discussed in MPEP 2106.05(b)”). 
In one example, the CAFC applied the Alice inquiry to ask whether the focus of the claims is on the specific asserted improvement in computer capabilities (i.e., the self-referential table for a computer database) or, instead, on a process that qualifies as an abstract idea for which computers are invoked merely as a tool. Enfish L.L.C. v. Microsoft Corp., 822 F.3d 1327, 1335-36 (Fed. Cir. 2016). In Enfish, the claims were specifically directed to a self-referential table for a computer database. Id. at 1337. In particular, the claim language required a four-step algorithm specifically directed to a self-referential table for a computer database that improves upon prior art information search and retrieval systems by employing a flexible, self-referential table to store data. Id. at 1336-37. The CAFC determined that the plain focus of the claims was on an improvement to computer functionality itself (i.e., the self-referential table for a computer database), not on economic or other tasks for which a computer is used in its ordinary capacity. Id. at 1335-36. Therefore, the focus of the claims is on a specific asserted improvement in computer capabilities (i.e., the self-referential table for a computer database), not on economic or other tasks for which a computer is used in its ordinary capacity. Id. at 1336. See also MPEP 2106.04(d)(I) (“an improvement in the functioning of a computer or an improvement to other technology or technical field, as discussed in MPEP 2106.04(d)(1) and 2106.05(a)”). 
In the instant application, claim 1 recites in part “a dataset generator, operating on the networked computer system, generating a training dataset by automatically labeling the aggregated data using the domain-specific embeddings without manual annotation; the networked computer system launching the natural language processing model with the training dataset such that the model dynamically restricts semantic interpretation to the identified industry-specific embeddings when responding to user queries; whereby user queries relevant to the particular industry that are provided to the natural language processing model with the training dataset are functionally responded to and processed upon launching”. Claim 7 recites in part “a dataset generator configured to automatically generate a training dataset by assigning semantic labels to portions of the aggregated corpus based on similarity relationships between the industry-specific semantic embeddings, without manual annotation; and a processor configured to deploy the natural language processing model using the training dataset by constraining inference operations of the model to the industry-specific semantic embeddings, wherein, upon deployment, the system responds to user inquiries by executing the natural language processing model using the constrained inference operations to produce responses that are specific relevant to the particular industry”. Just as the four-step algorithm in Enfish was specifically directed to a self-referential table for a computer database that improved upon prior art information search and retrieval systems, the deployment of the natural language processing model with the training dataset, such that the model restricts semantic interpretation or constrains inference operations to industry-specific semantic embeddings when responding to user inquiries, is specifically directed to the computer functionality of search and retrieval in an industry-specific domain. 
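The claimed constraint of restricting semantic interpretation to industry-specific embeddings at inference time can be illustrated with a minimal sketch. The vectors, intent labels, and 0.5 similarity threshold below are illustrative assumptions, not the applicant's or any cited reference's implementation:

```python
# Minimal sketch: a query is interpreted only against the domain's embedding
# table; queries falling below a similarity threshold against every
# industry-specific embedding are treated as out of domain.
from math import sqrt

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# Hypothetical auto-labeled, industry-specific intent embeddings.
DOMAIN_EMBEDDINGS = {
    "refill_prescription": [0.9, 0.1, 0.0],
    "drug_interaction":    [0.1, 0.9, 0.1],
}

def interpret(query_vec, threshold=0.5):
    best = max(DOMAIN_EMBEDDINGS,
               key=lambda k: cosine(DOMAIN_EMBEDDINGS[k], query_vec))
    if cosine(DOMAIN_EMBEDDINGS[best], query_vec) < threshold:
        return None  # query not relevant to the particular industry
    return best

print(interpret([0.8, 0.2, 0.0]))  # in-domain query
print(interpret([0.0, 0.0, 1.0]))  # out-of-domain query -> None
```

In this sketch the model's inference is "constrained" simply by limiting the candidate interpretations to the domain embedding table.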
Therefore, claims 1-12 are eligible because the claims are directed to a specific asserted improvement in computer search and retrieval functionality.

In response to “Stated differently, the focus of the present application is to build a functional NLP system from scratch, even where no training data exists — Zhao does not describe, suggest or teach this functionality” and “The present invention, as recited in the claims, is directed towards generating a “day zero” model as an NLP model that does not need to rely on any specific data obtained from a historical data process. The “zero” in “day zero” means that at the initial launching of the NLP, it is ready to be employed without requiring the actions that are presented and identified in the Zhoa reference”:

Zhao teaches creating an intent flow training dataset to train a natural language processing model (¶17) that does not require a pre-existing dialog dataset (¶15). Therefore, Zhao does describe building a functional NLP system from scratch (e.g., ¶16, create a training dataset from scratch), even where no training data exists.

In response to “The various embodiments of the present invention are directed to generating a dataset and an NLP model for any industry, field or application that does not have a historical dataset already built that can be utilized. In Zhao, the generalization capabilities of the LLM is leveraged based on the input from domain experts and paraphrasing workers to generate processes for handling a new task. 
In contrast, in the present invention, the various embodiments operate to build an applicable data set and NLP model in an automated fashion without leveraging the capabilities of the LLM, domain experts and workers to paraphrase”:

The Examiner notes that claim 1 recites “gathering data from one or more sources, wherein at least one source does not require human interaction” and claim 7 recites “automatically gather unstructured textual data from one or more network-accessible sources, wherein at least one of the sources provides data without human interaction”. In other words, the claims only require at least one source providing data without human interaction, while the scope of the claims still allows other sources to provide data with human interaction.

In response to “With regards to claim 1, the Office argues that paragraphs 0016-0017 describe “generating a natural language processing model for a particular industry without utilizing a predefined or historical dataset for that particular industry”. However, as noted by the Office, Zhao uses “intent flow” to create a training dataset to train a zero-shot intent recognition model in paragraphs 0016-0017. The Office argues that in paragraph 0015, Zhao states that “intent flow” does not require a “pre-existing dialog dataset”. However, as noted above, the “intent flow” is created by a domain expert 33 and is provided to the system, and as such, IS a predefined dataset. Claim 1 is premised on the fact that an NLP is created for a particular industry without utilizing such an “intent flow””:

According to Zhao, “intent flow” is a task definition format (¶12); it is not a pre-defined dataset. 
Rather, a domain expert uses an intent flow interface to draw an intent flow (¶36) such that, for a given intent flow, a paraphrase task generator generates a set of paraphrase tasks and dispatches the set of paraphrase tasks to crowd annotators or workers to paraphrase these paraphrase tasks into different utterances with the same intentions to create a training dataset (¶16). In other words, the domain expert merely defines a task definition format, not a training dataset that can be used to train a natural language processing model.

In response to “Further, claim 1 recites the element of “gathering data from one or more sources, wherein each of the one or more sources include data that is relevant to the particular industry”. The Office argues that paragraph 0036 of Zhao and FIG. 7 describe this element by arguing that “domain experts” are “industry experts” that create intent flows via a web interface. As presented herein-above, this is not the same as the system gathering data from one or more sources and aggregating the gathered data to create a language model and generating a training set. In essence, the process of “domain experts creating intent flows via a web interface” is completely eliminated by the present invention. Or, said another way, Zhao relies upon “domain experts” to perform the actions that the present invention performs automatically by gathering the data from one or more sources. It is clear that the “domain experts” are not included in the recited “one or more sources” as the domain experts are actually generating the intent flows, whereas in the present invention, the data gathered is used to create the NLP, and as such, generate the intent flows without the use of domain experts. 
Although, it is appreciated that in some embodiments, such as recited in claim 2, experts can be utilized to provide particular industry related intents, the present invention only utilizes that as one of multiple sources and still requires the use of non-human generated intents”:

Upon further search and consideration, please see the details of a new combination of references set forth below.

Claim Rejections - 35 USC § 112

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 112 that form the basis for the rejections under this section made in this Office action:

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-12 are rejected under 35 USC 112(a) due to lack of written description of the claimed invention in the specification. Claim 1 recites, in part, “wherein the language model generator automatically identifies industry-specific terminology by performing unsupervised semantic clustering on the aggregated data to generate domain-specific embeddings”. Similarly, claim 7 recites, in part, “the language model generator configured to generate industry-specific semantic embeddings by performing unsupervised vector-based clustering on the aggregated corpus to identify industry-specific language features”. According to the specification, US 2024/0160851 A1 at ¶¶68-69: “There are two popular models of word embedding Word2Vac and Glove. 
Word2vec takes as its input a large corpus of text and produces a vector space with each unique word being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. Word2Vec is very famous at capturing meaning and demonstrating it on tasks like calculating analogy questions of the form a is to b as c is to ?. For example, man is to woman as uncle is to ? (aunt) using a simple vector offset method based on cosine distance” and at ¶74: “Glove, or Global Vectors for Word Representation, is an algorithm that is basically an extension to the word2vec method for efficiently learning word vectors. GLOVE constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may result in generally better word embeddings”.

While the specification contains written description of using Word2Vec or GloVe to generate semantic embeddings, the specification does not contain written description of “performing unsupervised semantic clustering on the aggregated data to generate domain specific embedding” or “performing unsupervised vector based clustering on the aggregated corpus to identify industry specific language features”. Therefore, the specification does not contain written description support for the respective limitations in independent claims 1 and 7. Because claims 2-6 and 8-12 are dependent on claims 1 and 7, claims 2-6 and 8-12 are rejected under the same rationale.

Claims 3-6 and 9-12 are rejected under 35 USC 112(b) for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. 
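The vector-offset analogy method quoted from the specification can be sketched as follows; the toy 3-dimensional vectors are hypothetical stand-ins, not real Word2Vec output:

```python
# Sketch of the "vector offset" analogy method described in the quoted
# specification text: "a is to b as c is to ?" is answered by finding the
# vocabulary word whose vector is closest, by cosine similarity, to b - a + c.
from math import sqrt

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def analogy(vocab, a, b, c):
    # target = b - a + c; return the nearest word other than a, b, c
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

toy_vocab = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "uncle": [0.2, 0.0, 1.0],
    "aunt":  [0.2, 1.0, 1.0],
    "drug":  [0.0, 0.1, 0.0],
}

print(analogy(toy_vocab, "man", "woman", "uncle"))  # prints "aunt"
```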
Regarding claims 3 and 9, claim 1 recites “a method, performed by a networked computer system, for generating a natural language processing model for a particular industry without utilizing a predefined or historical dataset for that particular industry…” and claim 7 recites “a system for generating a natural language processing model for a particular industry without utilizing a predefined or historical dataset for that particular industry…”. The limitations of claims 3 and 9 explicitly contradict the requirements set forth in claims 1 and 7 because claims 3 and 9 require (1) “obtaining a third set of intents from historical data related to the particular industry”, in direct contradiction of the requirement “without utilizing a historical dataset for that particular industry”, and (2) “obtaining a fourth set of intents from industry specific data sources that include one or more processes that resemble a process in the particular industry”, in direct contradiction of the requirement “without utilizing a predefined dataset for that particular industry”. In other words, applicant cannot require claims 1 and 7 to generate the natural language processing model “without utilizing a predefined or historical dataset for that particular industry” and thereafter require dependent claims 3 and 9 to generate the natural language processing model based on gathered data “utilizing a predefined or historical dataset for that particular industry”. Either the claims generate the natural language processing model “without utilizing a predefined or historical dataset for that particular industry” or they generate it “utilizing a predefined or historical dataset for that particular industry”; the claims cannot require both. 
Therefore, claims 3-5 and 9-11 are rejected for failing to particularly point out and distinctly claim the invention “generating a natural language processing model for a particular industry without utilizing a predefined or historical dataset for that particular industry” because the scopes of the claims are unclear as to whether the natural language processing model is generated “without utilizing a predefined or historical dataset for that particular industry” or not. Further, while claims 2-5 and 8-11 set forth the first, second, third, fourth, and fifth sets of intents, claims 6 and 12 are not dependent on claims 5 and 11. Therefore, claims 6 and 12 do not particularly point out and distinctly claim “…the first, second, third, fourth, and fifth sets of intents”. Appropriate correction of the claim dependency of claims 6 and 12 is required.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 that forms the basis for the rejections under this section made in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 USC 103 as being unpatentable over Shin et al. (US 2023/0394232 A1) in view of Chandra et al. (US 2022/0208177 A1).

Regarding Claim 1, Shin discloses a method, performed by a networked computer system (Figs. 
2, 7, and ¶57, data center computer configured as a service for users to train or perform inferencing of information), for generating a natural language processing model for a particular industry (¶27, client device 202 requests input manager 210 to train a language model 212, the request being to develop a new domain on which to train the language model 212; ¶28, convert database datasets 216 into text-based information to train language model 212) without utilizing a predefined or historical dataset for that particular industry (¶29, datasets 216 in a domain directed toward pharmaceutical compound interactions for medical application; per ¶20, database information sets provide useful relational information but the format of these information sets is not suitable (i.e., cannot be utilized) for training the machine learning systems; i.e., without conversion / modification, datasets 216 are not in a format that can be utilized to train the language model for the medical industry, and therefore datasets 216 are neither a predefined nor a historical dataset for the particular industry),1 the method comprising the actions of: a data collector, operating on the networked computer system, gathering data from one or more sources (¶29, data processing unit 214 acquires datasets 216 and converts the datasets into text-based information to train the language model 212; ¶¶20-21, convert / modify the information sets to be used as training data by converting a series of micro-records into plaintext format (e.g., Fig. 3B) while maintaining the relationships), wherein at least one source does not require human interaction (¶17 and ¶29, train language model 212 using rich hierarchical and relational data without requiring the use of a human annotated dataset, thereby saving time and cost) and, wherein each of the one or more sources includes data that is relevant to the particular industry (Fig. 
3A, drug / medical industry); an aggregator, operating on the networked computer system, aggregating the gathered data (¶29, store training data in training database 218; i.e., database 218 aggregates converted datasets as training data); a language model generator, operating on the networked computer system, creating a language model from the aggregated data (¶27, input manager 210 receives a request to train a language model from client device 202, processes the request, and then proceeds with fulfilling the request to train the language model); a dataset generator, operating on the networked computer system, generating a training dataset by automatically labeling the aggregated data using domain-specific labels without manual annotation (¶28, data processing unit 214 converts one or more datasets 216 into text-based information; per ¶21, conversion system 104 (i.e., data processing unit 214) uses a set of rules to analyze input information and then represent various hierarchical or structural relationships between the information to convert a series of micro-records into plaintext format; per ¶29, rich hierarchical and relational data without providing human reviewers to annotate and provide information, thereby saving time and cost); the networked computer system launching the natural language processing model with the training dataset such that the model dynamically restricts semantic interpretation to the identified industry-specific domain when responding to user queries (¶30, search system uses the language model 212 to enhance search engine 222 with search domain 224 to achieve improved search results, the search domain 224 being associated with the input query substantially similar to or related to the domain on which the language model 212 has been trained); whereby user queries relevant to the particular industry that are provided to the natural language processing model with the training dataset are functionally responded to and processed upon launching 
(¶30, a user searching for novel findings receives better results than using search engine 222 alone, which may not have knowledge of the relationships maintained and learned at the language model 212; using language model 212 to enhance search engine 222 achieves improved search results through relationships maintained via training data 218). Shin does not disclose wherein the language model generator automatically identifies industry-specific terminology by performing unsupervised semantic clustering on the aggregated data to generate domain-specific embeddings for the dataset generator to automatically label the aggregated data. Chandra discloses a method, performed by a networked computer system (Fig. 4 and ¶74, host computer 400 connected to a network), for generating a natural language processing model (¶7, hierarchical approach to train clusters of models; ¶8, training at least an intent cluster classification model; ¶55 and ¶56, train an intent cluster classification model as a first layer in a hierarchy of models and train an intent classification model for each intent cluster of clusters 1-N) comprising a language model generator to create a language model from aggregated data (¶48, collect first utterance data 202 from conversation logs), wherein the language model generator automatically identifies industry-specific terminology by performing unsupervised semantic clustering on the aggregated data to generate domain-specific embeddings (¶¶49-50, using pre-trained embedding models to perform embedding of the first utterance data to create n-dimensional vectors based on semantic similarity specific to particular domains (i.e., industry) such as medical science; ¶51, using unsupervised clustering to create clusters of the dataset using cosine similarity to find the distance between vectors in n dimensions and cluster them into clusters 1-N); a dataset generator, operating on the networked computer system, generating a training dataset by automatically labeling 
the aggregated data using the domain-specific embeddings without manual annotation (¶51, creating clusters 1-N of the dataset using an unsupervised machine learning algorithm eases any annotation requirement because the clustering of utterances is associated with a pre-defined intent such that the clustering of utterances essentially creates clusters 1-N of intents; ¶34, using an unsupervised learning algorithm eliminates cost associated with manual efforts); the networked computer system launching the natural language processing model with the training dataset such that the model dynamically restricts semantic interpretation to the identified industry-specific embeddings when responding to user queries (¶39 and ¶70, configure each intent classification model to receive a user utterance and identify an intent from the respective intent cluster; see Fig. 3B and ¶71). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to automatically identify industry-specific terminology by performing unsupervised semantic clustering on the aggregated data to generate domain-specific embeddings for the dataset generator to automatically label the aggregated data, in order to eliminate cost associated with manual efforts (Chandra, ¶34; compare Shin, ¶29, implement unsupervised clustering of datasets into clusters of intents / semantics to generate rich hierarchical and relational data without providing human reviewers to annotate, thereby saving time and cost).

Claim 7 is rejected under 35 USC 103 as being unpatentable over Shin et al. (US 2023/0394232 A1) in view of Chandra et al. (US 2022/0208177 A1) and Kunnumma et al. (US 2020/0320430 A1).

Regarding Claim 7, Shin discloses a system (Figs. 
2, 7, and ¶57, data center computer configured as a service for users to train or perform inferencing of information) for generating a natural language processing model (¶27, client device 202 requests input manager 210 to train a language model 212, the request being to develop a new domain on which to train the language model 212; ¶28, convert database datasets 216 into text-based information to train language model 212) for a particular industry without utilizing a predefined or historical dataset for that particular industry (¶29, datasets 216 in a domain directed toward pharmaceutical compound interactions for medical application; per ¶20, database information sets provide useful relational information but the format of these information sets is not suitable (i.e., cannot be utilized) for training the machine learning systems; i.e., without conversion / modification, datasets 216 are not in a format that can be utilized to train the language model for the medical industry, and therefore datasets 216 are neither a predefined nor a historical dataset for the particular industry),2 the system comprising: a data collector configured to automatically gather structured textual data from one or more network-accessible sources (¶29, data processing unit 214 acquires datasets 216 (i.e., ¶20, information set from one or more databases 102) and converts the datasets into text-based information to train the language model 212), wherein at least one of the sources provides data without human interaction (¶17 and ¶29, train language model 212 using rich hierarchical and relational data without requiring the use of a human annotated dataset, thereby saving time and cost) and wherein the data is industry-specific terminology (Fig. 
3A, drug / medical industry); an aggregator configured to normalize and aggregate the gathered structured textual data into a unified corpus (¶29, store training data in training database 218; i.e., database 218 aggregates converted datasets (which were converted / normalized into a format to train the machine learning system) as training data); a language model generator configured to create a language model from the unified corpus / training dataset (¶27, input manager 210 receives a request to train a language model from client device 202, processes the request, and then proceeds with fulfilling the request to train the language model); a dataset generator configured to automatically generate the training dataset without manual annotation (¶28, data processing unit 214 converts one or more datasets 216 into text-based information; per ¶21, conversion system 104 (i.e., data processing unit 214) uses a set of rules to analyze input information and then represent various hierarchical or structural relationships between the information to convert a series of micro-records into plaintext format; per ¶29, rich hierarchical and relational data without providing human reviewers to annotate and provide information, thereby saving time and cost); and a processor (¶49, data center CPU; ¶57, data center uses CPU to perform training and inferencing of information such as speech recognition) configured to deploy the natural language processing model using the training dataset by constraining inference operations of the model to the industry-specific semantic domain (¶30, search system uses the language model 212 to enhance search engine 222 with search domain 224 to achieve improved search results, the search domain 224 being associated with the input query substantially similar to or related to the domain on which the language model 212 has been trained), wherein, upon deployment, the system responds to user inquiries by executing the natural language processing model using 
the constrained inference operations to produce responses that are specific relevant to the particular industry (¶30, a user searching for novel findings receives better results than using search engine 222 alone, which may not have knowledge of the relationships maintained and learned at the language model 212; using language model 212 to enhance search engine 222 achieves improved search results through relationships maintained via training data 218). Shin does not disclose the language model generator configured to generate industry-specific semantic embeddings by performing unsupervised vector-based clustering on the aggregated corpus to identify industry-specific language features. Chandra discloses a networked computer system (Fig. 4 and ¶74, host computer 400 connected to a network) for generating a natural language processing model (¶7, hierarchical approach to train clusters of models; ¶8, training at least an intent cluster classification model; ¶55 and ¶56, train an intent cluster classification model as a first layer in a hierarchy of models and train an intent classification model for each intent cluster of clusters 1-N) comprising a language model generator to create a language model from aggregated data (¶48, collect first utterance data 202 from conversation logs), the language model generator configured to generate industry-specific semantic embeddings by performing unsupervised vector-based clustering on the aggregated corpus to identify industry-specific language features (¶¶49-50, using pre-trained embedding models to perform embedding of the first utterance data to create n-dimensional vectors based on semantic similarity specific to particular domains (i.e., industry) such as medical science; ¶51, using unsupervised clustering to create clusters of dataset using cosine similarity to find the distance between vectors in n-dimensions and cluster them into clusters 1-N); a dataset generator configured to automatically generate a training dataset 
by assigning semantic labels to portions of the aggregated corpus based on similarity relationships between the industry-specific semantic embeddings, without manual annotation (¶51, creating clusters 1-N of the dataset using an unsupervised machine learning algorithm eases any annotation requirement because the clustering of utterances is associated with a pre-defined intent such that the clustering of utterances essentially creates clusters 1-N of intents; ¶34, using an unsupervised learning algorithm eliminates cost associated with manual efforts); and a processor configured to deploy the natural language processing model using the training dataset by constraining inference operations of the model to the industry-specific semantic embeddings (¶39 and ¶70, configure each intent classification model to receive a user utterance and identify an intent from the respective intent cluster; see Fig. 3B and ¶71). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to automatically identify industry-specific terminology by performing unsupervised semantic clustering on the aggregated data to generate domain-specific embeddings for the dataset generator to automatically label the aggregated data, in order to eliminate cost associated with manual efforts (Chandra, ¶34; compare Shin, ¶29, implement unsupervised clustering of datasets into clusters of intents / semantics to generate rich hierarchical and relational data without providing human reviewers to annotate, thereby saving time and cost). Shin does not disclose gathering unstructured textual data from the network-accessible sources and normalizing and aggregating the unstructured textual data into a unified corpus. 
Kunnumma discloses automatically gathering unstructured textual data from one or more network-accessible sources (¶16, system for classification of data in a machine learning system includes a computer network, processors, and a storage location that receives an input stream of unstructured data associated with the storage location; e.g., ¶51, unstructured data in the form of supplier contract documents), normalizing and aggregating the gathered unstructured textual data into a unified corpus / aggregated corpus (¶51, clean, tokenize, and vectorize the unstructured data for machine learning tasks; ¶52, the clauses in the contracts and the labels corresponding to the clauses are used as initial training data / seed data), and a model generator configured to generate industry-specific semantic embeddings by performing unsupervised vector-based clustering on the aggregated corpus to identify industry-specific language features to generate a classifier model (¶52, generate a data classifier based on training data comprising the seed data with labeled clauses in addition to unlabeled clauses, by using the seed data to identify the unlabeled clauses; e.g., ¶65, using a clustering technique to determine vector distance, where a distance value of 1 indicates the vectors are completely similar and a value of 0 indicates they are completely dissimilar). Shin generates a language model based on training data with preserved structure (¶24).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to gather unstructured textual data from the network-accessible sources and to normalize and aggregate the unstructured textual data into structured text comprising labeled relationships in the unified corpus / aggregated corpus, to provide seed data to generate the data classifier / language model (Kunnumma, ¶52).

Claims 2-6 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (US 2023/0394232 A1) and Chandra et al.
(US 2022/0208177 A1) as applied to claim 1, and further in view of Zhao (US 2020/0251091 A1). Claims 8-12 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (US 2023/0394232 A1) in view of Chandra et al. (US 2022/0208177 A1) and Kunnumma et al. (US 2020/0320430 A1) as applied to claim 7, and further in view of Zhao (US 2020/0251091 A1).

Regarding claims 2 and 8, Shin discloses wherein the data collector is configured to gather data from one or more sources from a human reviewer (¶20). In particular, Chandra teaches obtaining a first set of intents from one or more experts in a particular industry (¶48, in some embodiments, obtain a minimal set of utterances through experts (e.g., linguists)). The combination does not disclose searching social media data forums related to the particular industry to obtain a second set of intents.

Zhao discloses a networked computer system (Fig. 2) for generating a natural language processing model for a particular industry without utilizing a predefined or historical dataset for that particular industry (¶¶16-17, use intent flow to create a training dataset to train a zero-shot intent recognition model; per ¶15, intent flow does not require a pre-existing dialog dataset), the system comprising: a data collector operating on the networked computer system configured to gather data from one or more sources (¶34, step 11 defines user intents using the intent flow format (i.e., domain experts 33 create intent flows per 36) and step 12 involves collection of user input data (¶37, crowd annotators / workers 36 create training set 35); per ¶66, database 93 saves intent flow tuple pairs), wherein each of the one or more sources includes data that is relevant to the particular industry (¶36 and Fig.
7, domain (i.e., industry) experts create intent flows via a web interface; and ¶37, workers 36 paraphrase the respective intent flows into different utterances with the same intentions) to obtain a first set of intents generated by one or more experts in the particular industry (¶36 and Fig. 7, domain (i.e., industry) experts create intent flows via a web interface); and search social media data forums related to the particular industry to obtain a second set of intents (¶37, generate and dispatch a set of paraphrase tasks 34 associated with intent flows to workers (e.g., employees) who paraphrase these tasks into different utterances with the same intentions; per ¶66, employees from a crowdsourcing platform).

Shin noted that language models are effective for question answering (¶1), i.e., dialog. Zhao teaches creating potential user intentions in the dialogs from a particular domain, i.e., industry (¶14). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to gather data from one or more sources, comprising obtaining a first set of intents generated by one or more experts in the particular industry and searching social media data forums related to the particular industry to obtain a second set of intents, in order to create a dataset / training data comprising intent questions and corresponding sample answers (Zhao, ¶66) and create a language model for intent recognition in a speech / text dialog system (Zhao, ¶2) for question answering, based on the training data (Zhao, ¶37).
Regarding claims 3 and 9, Zhao discloses wherein the data collector is further configured to: obtain a third set of intents from historical data related to the particular industry (¶38, the list of intent labels can be obtained from a prior implementation of the training stage and, in particular, previously developed intent flow graphs); and obtain a fourth set of intents from industry-specific data sources that include one or more processes that resemble a process in the particular industry (Fig. 2 and ¶17, Intent Definition 27 can come from existing methods; per ¶9, from companies and products such as Dialogflow, Chatflow, Wit.ai, and LUIS).

Regarding claims 4 and 10, Zhao discloses wherein the data collector is further configured to: obtain a set of questions related to the particular industry (paraphrase tasks of Fig. 2 per ¶¶59-61; context: you are in a shop, a sale asks how can she/he help you? Intent Label: you want to express: "I am looking for dress shoes"; Task: please write N utterances that are semantically similar but syntactically different, that express the above intent); run the questions through a search engine (¶66, dispatch as a question to a crowdsourcing platform such as Amazon Mechanical Turk to obtain an answer); and convert the results of the search engine into a fifth set of intents (per ¶63, the result dataset will create data in the tuple formats of ¶¶64-65).
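The (context, intent) tuple pairs that Zhao's intent flow parser stores (¶¶63-66), and their pairing with crowd-sourced paraphrases into a result dataset, can be illustrated with a minimal sketch. The function names, dictionary keys, and sample data below are hypothetical illustrations, not details taken from Zhao.

```python
def parse_intent_flows(flows):
    # Convert each intent-flow entry into a (context, intent) tuple
    # pair, in the spirit of Zhao's parser 92 saving pairs to a database.
    return [(flow["context"], flow["intent"]) for flow in flows]

def build_training_dataset(flows, paraphrases):
    # Join each intent with the worker utterances that paraphrase it,
    # yielding (utterance, intent) examples for the result dataset.
    dataset = []
    for _context, intent in parse_intent_flows(flows):
        for utterance in paraphrases.get(intent, []):
            dataset.append((utterance, intent))
    return dataset

# Hypothetical sample data modeled on the shoe-shop task above.
flows = [{"context": "you are in a shoe shop", "intent": "find_dress_shoes"}]
paraphrases = {
    "find_dress_shoes": [
        "I am looking for dress shoes",
        "Do you have any formal shoes?",
    ],
}
dataset = build_training_dataset(flows, paraphrases)
# dataset holds two labeled utterances for the single intent
```

Under this reading, the fifth set of intents is simply more (utterance, intent) rows produced by dispatching the questions to crowd workers and folding their answers back into the same tuple format.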
Regarding claims 5 and 11, Zhao discloses an intent convertor configured to convert the first (¶66, intent flow parser 92 converts an input intent flow into tuple pairs (context, intent) and saves the pairs into a database 93), second (store worker answers from employees on the company crowdsourcing platform as the result dataset of ¶¶63-65 into database 93), third (¶38 in view of ¶66, intent labels from a prior implementation of the training stage stored in database 93 as (context, intent) pairs), fourth (¶9 and ¶17, intent definitions from existing companies and products), and fifth sets of intents (store worker answers from Amazon Mechanical Turk as the result dataset of ¶¶63-65 into database 93) into respective corpora.

Regarding claims 6 and 12, Zhao discloses wherein the aggregator is further configured to aggregate the respective corpora of the first, second, third, fourth, and fifth sets of intents into a single corpus (¶16 and ¶37, create a training dataset to train the ZSIR model; see Model Training Data 29 of Fig. 2 and Training Dataset 35 of Fig. 3).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu, whose telephone number is 571-270-1587, or to the examiner's supervisor, Hai Phan, whose telephone number is 571-272-6338. Examiner Richard Zhu can normally be reached M-Th, 0730-1700.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RICHARD Z ZHU/
Primary Examiner, Art Unit 2654
02/13/2026

1 Applicant argued on p. 7 of the amendment filed on 12/19/2025: "The various embodiments of the present invention are directed to generating a dataset and an NLP model for any industry, field or application that does not have a historical dataset already built that can be utilized". See also specification US 2024/016-851 A1 at ¶49: "One of the biggest challenges for deploying an NLP model right out on day-zero is the lack of relevant training datasets that are specific to that process. In addressing this challenge, various embodiments include a novel method that operates to synthesize training data sets so as to deploy a day zero NLP model for any interaction" and ¶50: "The premises of the various embodiments is to create a unique synthesized dataset that is applicable and relevant for a particular industry, application or process and that can be used by the NLP model to provide day-zero operation and also for training".

2 Applicant argued on p. 7 of the amendment filed on 12/19/2025: "The various embodiments of the present invention are directed to generating a dataset and an NLP model for any industry, field or application that does not have a historical dataset already built that can be utilized". See also specification US 2024/016-851 A1 at ¶49: "One of the biggest challenges for deploying an NLP model right out on day-zero is the lack of relevant training datasets that are specific to that process. In addressing this challenge, various embodiments include a novel method that operates to synthesize training data sets so as to deploy a day zero NLP model for any interaction" and ¶50: "The premises of the various embodiments is to create a unique synthesized dataset that is applicable and relevant for a particular industry, application or process and that can be used by the NLP model to provide day-zero operation and also for training".

Prosecution Timeline

Nov 14, 2022
Application Filed
Feb 21, 2025
Non-Final Rejection — §103, §112
May 02, 2025
Response Filed
May 30, 2025
Response after Non-Final Action
Aug 15, 2025
Final Rejection — §103, §112
Dec 19, 2025
Request for Continued Examination
Jan 06, 2026
Response after Non-Final Action
Feb 14, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592228
SPEECH INTERACTION METHOD AND APPARATUS, COMPUTER READABLE STORAGE MEDIUM, AND ELECTRONIC DEVICE
2y 5m to grant Granted Mar 31, 2026
Patent 12592222
APPARATUSES, COMPUTER PROGRAM PRODUCTS, AND COMPUTER-IMPLEMENTED METHODS FOR ADAPTING SPEECH RECOGNITION CONFIDENCE SCORES BASED ON EXPECTED RESPONSE
2y 5m to grant Granted Mar 31, 2026
Patent 12586574
ELECTRONIC DEVICE FOR PROCESSING UTTERANCE, OPERATING METHOD THEREOF, AND STORAGE MEDIUM
2y 5m to grant Granted Mar 24, 2026
Patent 12579978
NETWORKED DEVICES, SYSTEMS, & METHODS FOR INTELLIGENTLY DEACTIVATING WAKE-WORD ENGINES
2y 5m to grant Granted Mar 17, 2026
Patent 12572739
GENERATING MACHINE INTERPRETABLE DECOMPOSABLE MODELS FROM REQUIREMENTS TEXT
2y 5m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
69%
Grant Probability
85%
With Interview (+15.4%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 718 resolved cases by this examiner. Grant probability derived from career allow rate.
