DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The listing of references in the specification is not a proper information disclosure statement. 37 CFR 1.98(b) requires a list of all patents, publications, or other information submitted for consideration by the Office, and MPEP §609.04(a) states, “the list may not be incorporated into the specification but must be submitted in a separate paper.”
The Specification, ¶[0031], ¶[0080], ¶[0082], ¶[0084], ¶[0085], and ¶[0086] includes various citations of non-patent literature, but there are no copies of this non-patent literature and there is no proper citation of this non-patent literature on an Information Disclosure Statement. Generally, any prior art cited in the Specification may be relevant to examination, may not be readily available to the USPTO, and consequently should be cited on an Information Disclosure Statement, Form PTO-1449. Applicants are requested to submit copies of this non-patent literature with a proper citation in an Information Disclosure Statement.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 to 2, 5, 8 to 9, 12, 15 to 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Specia et al. (U.S. Patent Publication 2023/0419117) in view of De Freitas Adiwardana et al. (U.S. Patent No. 12,086,713).
Concerning independent claims 1, 8, and 15, Specia et al. discloses a method, computer system, and computer program product for a multi-task neural network with toxicity detection, comprising:
“receiving, via a communication interface, a text input of a sequence of tokens” – data input is obtained comprising a representation of user-generated text content (Abstract); multi-task neural network 202 receives input data comprising a representation of user-generated textual content 201 (¶[0041]: Figures 2A to 2B); input data comprising a representation of user-generated textual content is obtained (¶[0100] - ¶[0101]: Figure 4);
“generating, by a first neural network model that is trained [to generate output] tokens belonging to a predefined vocabulary set and having a specific attribute, a first output probability [for a next token] in response to the text input” – a trained multi-task neural network (“a first neural network model that is trained”) is trained to identify one or more attributes of user-generated textual content as additional tasks; a multi-task neural network may predict whether there is profanity in the user-generated textual content (¶[0026]); keyword-based approaches utilize a lexicon to check for the presence of words in a lexicon of posts (“a predefined vocabulary set”) (¶[0027]); a toxicity prediction may be a score or probability indicating how likely a post is considered to be toxic (“a first output probability . . . in response to the text input”); trained multi-task neural network 103 generates toxicity predictions for each of one or more attributes for a post; pre-processing of posts may include tokenization of user-generated textual content; input data to multi-task neural network 103 is a sequence of vectors with each vector corresponding to a particular token of the textual content (¶[0031] - ¶[0033]: Figure 1); multi-task neural network 202 is trained to generate toxicity prediction 203 and a prediction 204-1 to 204N for each of N other attributes (¶[0040]: Figures 2A to 2B); a toxicity feature representation for each of one or more attributes and a common feature representation is generated; the one or more attributes for the user-generated textual content comprises a representation for one or more of presence of profanity, topic class, sentiment, group identity class, presence of a joke, presence of sarcasm, and presence of an idiom (¶[0103] - ¶[0104]: Figure 4); here, multi-task neural network includes an attribute for toxicity prediction which can be construed as a ‘model’ (“a first neural network model”); that is, multi-task neural network 202 can be considered to be composed of a plurality of “neural network models” that each detect one of a plurality of ‘specific attributes’, e.g., an attribute of profanity;
“generating, by a second neural network model that is trained [to generate output] tokens belonging to the predefined vocabulary set without any emphasis on the specific attribute, a second output probability [of the next token] in response to the text input” – a trained multi-task neural network (“a second neural network model that is trained”) is trained to identify one or more attributes of user-generated textual content as additional tasks; a multi-task neural network may predict a presence of sarcasm in the user-generated textual content (¶[0026]); keyword-based approaches utilize a lexicon to check for the presence of words in a lexicon of posts (“a predefined vocabulary set”) (¶[0027]); a toxicity prediction may be a score or probability indicating how likely a post is considered to be toxic (“a second output probability . . . in response to the text input”); trained multi-task neural network 103 generates toxicity predictions for each of one or more attributes for a post; pre-processing of posts may include tokenization of user-generated textual content; input data to multi-task neural network 103 is a sequence of vectors with each vector corresponding to a particular token of the textual content (¶[0031] - ¶[0033]: Figure 1); multi-task neural network 202 is trained to generate toxicity prediction 203 and a prediction 204-1 to 204N for each of N other attributes (¶[0040]: Figures 2A to 2B); a toxicity feature representation for each of one or more attributes and a common feature representation is generated; the one or more attributes for the user-generated textual content comprises a representation for one or more of presence of profanity, topic class, sentiment, group identity class, presence of a joke, presence of sarcasm, and presence of an idiom (¶[0103] - ¶[0104]: Figure 4); here, multi-task neural network includes an attribute for toxicity prediction which can be construed as a ‘model’ (“a second neural network model”); that is, multi-task neural network 202 can be considered to be composed of a plurality of “neural network models” that each detect one of a plurality of additional attributes, e.g., an attribute of sarcasm that is “without any emphasis on the specific attribute”;
“generating, in response to the text input, [the next token for] a text output based on a combined output probability computed based on a correction item reflective of the first output probability and the second output probability” – a plurality of combined feature representations are generated comprising each of the toxicity feature representations (¶[0002]); any user-generated textual content predicted to be toxic over a certain probability threshold, that is also predicted to be directed towards one or more particular groups (e.g. based on sex, gender, race), may be automatically removed by a content moderation system utilizing the multi-task neural network (¶[0026]); toxicity predictions generated by the multi-task neural network 103 may be scores which may be used to rank and prioritize posts in the moderation log 104 (¶[0036]: Figure 1); toxicity prediction 203 may be a score indicating the toxicity of the textual content 201; identity prediction 204-2 may be a vector of scores comprising a score for each identity of a set of identities (¶[0056]: Figure 2); a post of user-generated textual content may be flagged for moderation based on the toxicity prediction and/or the prediction for one or more of the attributes (¶[0109]: Figure 4); here, a plurality of probabilities or scores are combined for each attribute (“reflective of the first output probability and the second output probability”) to determine if textual content should be moderated (“generating, in response to the text input, a text output”); broadly, content moderation can be construed as “a correction item”.
Concerning independent claims 1, 8, and 15, Specia et al. discloses a basic concept of moderating textual input by a plurality of neural network ‘models’ corresponding to a plurality of different attributes to generate a probability or score for each attribute. That is, a trained neural network can be considered to have a plurality of sub-components as ‘models’, and each of the ‘models’ is directed to determining a probability of a specific attribute or a different attribute that does not emphasize a specific attribute. These probabilities for the attributes are combined to determine if content should be moderated so as to generate text output based on text input. However, Specia et al. is mainly directed to a neural network that only determines if text is to be moderated, and is not directed to a generative neural network for text generation of a next token. Specia et al., then, does not disclose “generating . . . output tokens . . . for a next token” by a first neural network model and “generating . . . output tokens . . . of the next token” by a second neural network model.
Concerning independent claims 1, 8, and 15, De Freitas Adiwardana et al. teaches evaluating output sequences using an auto-regressive language model neural network to generate a candidate output sequence. (Abstract) Auto-regressive neural network 110 generates a particular token in an output sequence conditioned on a current input sequence that includes any tokens that precede the particular text token in the output sequence, i.e., tokens that are already generated. (Column 3, Lines 41 to 50: Figure 1) To generate a particular token in a particular position within a candidate output sequence 120, neural network 110 can process a current input sequence to generate a score distribution, e.g., a probability distribution that assigns a score to each token in a vocabulary of tokens. Neural network 110 can then select as the particular token a token from the vocabulary using the score distribution. Neural network 110 can greedily select the highest-scoring token or can sample, e.g., using nucleus sampling, a token from the distribution. (Column 3, Line 66 to Column 4, Line 11: Figure 1) System 100 uses a criteria evaluation system 130 to generate a respective rating score 140 for each of the one or more criteria measuring the degree to which candidate output sequence 120 satisfies the criteria. (Column 6, Lines 1 to 6: Figure 1) Each criteria engine 220A-N receives a candidate output sequence 120 and generates a rating score 140 for the corresponding criterion representing a degree to which candidate output sequence 120 satisfies the criterion. (Column 7, Lines 6 to 14: Figure 2) Each engine 220A-N determines from the respective scores for the tokens that are a subset of the vocabulary of tokens, rating score 140. Engine 220A-N computes rating score 140 as equal to a weighted sum. 
(Column 7, Line 61 to Column , Line 14: Figure 2) System 100 generates using the auto-regressive language model neural network a first candidate output sequence that includes a plurality of tokens that are each selected from a vocabulary of tokens. Scores may be generated in parallel using multiple copies of the auto-regressive language model neural network. (Column 9, Lines 21 to 55: Figure 3) De Freitas Adiwardana et al., then, generates a next token in a token sequence by evaluating a plurality of probability scores for a plurality of criteria to determine what a next token should be from a combination of the plurality of probability scores. An objective is to provide task-specific fine-tuning of a model so that the quality of the output sequence can be significantly increased. (Column 2, Lines 10 to 17) It would have been obvious to one having ordinary skill in the art to generate a next token in a sequence of tokens from a combination of probability scores as taught by De Freitas Adiwardana et al. to perform content moderation with a multi-task neural network for toxicity detection of Specia et al. for a purpose of increasing a quality of an output sequence.
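For illustration only, and not as a characterization of the reference's actual implementation, the token selection and weighted-sum rating score summarized above can be sketched as follows. The function names, the toy score distribution, and the weights are hypothetical and do not appear in De Freitas Adiwardana et al.

```python
import random


def greedy_select(scores):
    """Return the highest-scoring token from a score distribution."""
    return max(scores, key=scores.get)


def sample_select(scores, rng):
    """Sample one token in proportion to its score (treated as a probability)."""
    tokens = list(scores)
    weights = [scores[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def rating_score(token_scores, weights):
    """Weighted sum over the scores of a subset of the vocabulary.

    Only the tokens named in `weights` contribute, which mirrors the idea of
    a rating score computed from a subset of the vocabulary of tokens.
    """
    return sum(weights[t] * token_scores[t] for t in weights)


# Hypothetical toy distribution over a three-token vocabulary.
dist = {"hello": 0.6, "world": 0.3, "!": 0.1}
next_token = greedy_select(dist)
```

Greedy selection is deterministic, whereas sampling depends on the random number generator that is passed in; the sketch takes the generator as a parameter so that a run can be reproduced.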
Concerning claims 2, 9, and 16, De Freitas Adiwardana et al. teaches that neural network 100 is pre-trained with a training system that fine-tunes to accurately generate rating scores with a set of training examples; training examples include (i) a training output sequence and (ii) one or more tokens that specify a particular criterion from the set of criteria; ground truth rating scores can be obtained as a result of manual labeling of output sequences by users or as output of an automatic labeling system (“the neural network is trained using a training pair of a text input and a corresponding labeled output belonging to the predefined vocabulary set”). (Column 8, Lines 21 to 38: Figure 1)
Concerning claims 5, 12, and 19, Specia et al. discloses a multi-task neural network 202 that comprises a toxicity predictor 203 and a plurality of attribute predictors 204-1 to 204-N. (¶[0040]: Figures 2A to 2B) Broadly, toxicity predictor 203 and a plurality of attribute predictors 204-1 to 204-N are construed as ‘models’ and all “share a same neural network structure” because they are all components of neural network 202. Alternatively, De Freitas Adiwardana et al. teaches using multiple copies of an autoregressive language model neural network with parallelization to generate scores representing a degree to which an output sequence satisfies a plurality of criteria. (Column 9, Lines 39 to 46: Figure 3) Here, providing multiple copies of a same language model neural network is “wherein the first neural network model and the second neural network model share a same neural network structure.”
Claims 3 to 4, 10 to 11, and 17 to 18 are rejected under 35 U.S.C. 103 as being unpatentable over Specia et al. (U.S. Patent Publication 2023/0419117) in view of De Freitas Adiwardana et al. (U.S. Patent No. 12,086,713), as applied to claims 1 to 2, 8 to 9, and 15 to 16 above, and further in view of Wu et al. (U.S. Patent Publication 2022/0253688).
Specia et al. discloses that a training process makes use of one or more adversarial losses to train the multi-task neural network to generate task-specific attribute feature representations that only capture aspects of a particular task (i.e. without capturing aspects relating to the other tasks). (¶[0028]) Training using the specific loss comprises updating parameters of the multi-task neural network to minimize a measure of difference between the specific output 314 and the target output 303 for the current task. The objective function may comprise a loss function which compares the combined output 315 to the target output 303. (¶[0087] - ¶[0088]: Figure 3) Specia et al., then, discloses “wherein training the first neural network model further comprises: generating, by the first neural network model . . . a training output in response to the text input” and “a loss comparing the training output and the corresponding labeled output”. Here, “virtual tokens” appear to correspond to training examples as described in ¶[0030] - ¶[0034] of the Specification. Mainly, Specia et al. discloses training “of the number of virtual tokens based on a loss comparing the training output and the corresponding labeled output”, but does not disclose “updating embeddings . . . while keeping weights of the first neural network model unchanged” or “wherein the number of virtual tokens have the embeddings that are tunable.”
However, Wu et al. teaches training a recommendation system over a plurality of training iterations so that system parameters are learned including (i) a set of model embeddings and (ii) weight parameters. (Abstract) Loss computation module 220 performs operations required to compute inner level loss and outer level loss. Model embeddings Θ are updated based on the inner level loss, and weight generator parameters Λ are updated based on an outer level loss. System parameters are learned through a two stage interactive training process. An inner optimization model embedding Θ update stage is performed during which the weight generator parameters Λ are fixed, and model embeddings Θ are updated using gradient descent. A pseudocode representation of the bilevel optimization process for training indicates pseudocode for the inner optimization model embedding Θ update stage during which the weight generator parameters Λ are fixed and model embeddings are updated during a first time-step t according to an inner level objective function (“updating embeddings of the number of virtual tokens”). Additionally, proxy model embeddings are model embeddings determined by scaling by a hyperparameter scaling value α in a proxy function, so that a proxy function provides a manual adjustment of the model embeddings by one step of gradient descent (“wherein the number of virtual tokens have the embeddings that are tunable”). (¶[0065] - ¶[0068]: Figures 2 and 4 to 5) Wu et al., then, teaches “updating embeddings . . . while keeping weights of the first neural network model unchanged” in an inner optimization model process. Providing a manual adjustment of a proxy function with a hyperparameter scaling value α is equivalent to providing “embeddings that are tunable.” That is, hyperparameter scaling value α can be manually adjusted to tune a training optimization process in a proxy function for embeddings Θ_(t+1) = Θ_t − α ∇J(Θ_t, Λ). 
(¶[0066] - ¶[0067]) Compare Specification, ¶[0025], which describes tuning with a hyperparameter α to calculate correction item c in c= αΔP. An objective is to provide a bilevel optimization for learning of weight parameters that can improve efficiency so that a user is not presented with irrelevant or misleading item options. (¶[0021]) It would have been obvious to one having ordinary skill in the art to perform training of a multi-task neural network of Specia et al. according to bilevel optimization that updates embedding while keeping weights unchanged along with embeddings that are tunable as taught by Wu et al. for a purpose of improving efficiency in training so that a user is not presented with irrelevant information.
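For illustration only, one gradient-descent step of the form Θ_(t+1) = Θ_t − α ∇J(Θ_t, Λ), in which the embeddings are updated while the other parameters are held fixed, can be sketched as follows. The squared-error objective, the function names, and the numeric values are assumptions chosen to keep the example self-contained; they are not taken from Wu et al. or from the Specification.

```python
def predict(embedding, frozen_weights):
    """Toy linear model: dot product of the embedding and the frozen weights."""
    return sum(w * e for w, e in zip(frozen_weights, embedding))


def loss(embedding, frozen_weights, target):
    """Hypothetical objective J: squared error between prediction and target."""
    return (predict(embedding, frozen_weights) - target) ** 2


def embedding_gradient(embedding, frozen_weights, target):
    """Gradient of J with respect to the embedding only."""
    err = predict(embedding, frozen_weights) - target
    return [2.0 * err * w for w in frozen_weights]


def update_embedding(embedding, frozen_weights, target, alpha):
    """One step Theta_(t+1) = Theta_t - alpha * grad J; the weights are never modified."""
    grad = embedding_gradient(embedding, frozen_weights, target)
    return [e - alpha * g for e, g in zip(embedding, grad)]


# Hypothetical values: the weights stay frozen, only the embedding moves.
weights = (1.0, 2.0)
theta_t = [0.0, 0.0]
theta_next = update_embedding(theta_t, weights, target=1.0, alpha=0.1)
```

In this sketch α plays the role of the step size; a larger α moves the embedding further per step, which is the sense in which the hyperparameter tunes the update.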
Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Specia et al. (U.S. Patent Publication 2023/0419117) in view of De Freitas Adiwardana et al. (U.S. Patent No. 12,086,713), as applied to claims 1, 8, and 15 above, and further in view of Altschul et al. (U.S. Patent Publication 2022/0277149).
De Freitas Adiwardana et al. teaches that system 100 can maintain a respective threshold value for a subset of the criteria and determine whether a respective rating score for a given candidate output sequence for the criterion satisfies the threshold value. System 100 can determine not to output a given candidate output sequence if the given candidate output sequence does not satisfy the threshold. (Column 6, Lines 35 to 47: Figure 1) Arguably, then, De Freitas Adiwardana et al. teaches “wherein the generating, by the first neural network model the first output probability for the next token further comprises: restricting the next token to a number of tokens having corresponding cumulative output probabilities that are greater than a pre-defined threshold.” However, De Freitas Adiwardana et al. does not clearly teach “cumulative” output probabilities. Still, Altschul et al. teaches token scoring component 540 that computes values for scores that may be used to select a likely next token. Token selection component 550 processes the output of token scoring component 540 to select one or more tokens to follow the input tokens. Any appropriate techniques may be used including nucleus or top-p sampling or top-k sampling. With nucleus sampling, a number of most probable tokens may be selected so that their cumulative probabilities are above a threshold. (¶[0047] - ¶[0048]) An objective is to iteratively generate tokens of a simulated conversation. (¶[0049]) It would have been obvious to one having ordinary skill in the art to perform token selection based on cumulative output probabilities of a sequence of tokens as taught by Altschul et al. to output tokens with probabilities greater than a threshold in De Freitas Adiwardana et al. for a purpose of iteratively generating tokens to simulate a conversation.
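For illustration only, nucleus (top-p) sampling as generally understood in the art can be sketched as follows: the most probable tokens are retained until their cumulative probability exceeds a threshold, and the next token is drawn from that restricted set. The function names, the toy distribution, and the threshold value are hypothetical and are not taken from Altschul et al.

```python
import random


def nucleus_candidates(probs, threshold):
    """Keep the most probable tokens until their cumulative probability
    is above the threshold; all remaining tokens are excluded."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative > threshold:
            break
    return kept


def nucleus_sample(probs, threshold, rng):
    """Sample the next token from the restricted candidate set only."""
    kept = nucleus_candidates(probs, threshold)
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution and threshold.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
candidates = [token for token, _ in nucleus_candidates(probs, 0.7)]
```

With this toy distribution and a threshold of 0.7, the candidate set is limited to the two most probable tokens, so the low-probability tokens can never be selected.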
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Specia et al. (U.S. Patent Publication 2023/0419117) in view of De Freitas Adiwardana et al. (U.S. Patent No. 12,086,713), as applied to claims 1 and 8 above, and further in view of Krause et al. (U.S. Patent Publication 2021/0374341).
Specia et al. discloses generating a decision 208 if a post should be deleted based on a set of probabilities corresponding to a plurality of predictions of attributes. (¶[0060]: Figure 2B) Broadly, this combination of probabilities provides “a correction item” as to if a post should be deleted or output (“generating . . . a text output based on a combined output probability computed based on a correction item reflective of the first output probability and the second output probability”). However, Specia et al. does not disclose “wherein the correction term is computed based on a difference between the second output probability and the first output probability.” Krause et al. teaches generative discriminative language modeling for controllable text generation in which a logarithmic probability difference between a first class conditional probability and a second class conditional probability is determined for each token candidate. A combined probability is determined by combining the unconditional probability and the logarithmic probability difference for each token candidate. The next token is selected from the token candidates based on the combined probabilities of the token candidates. (Abstract) An objective is to determine a next token in a text sequence using a generative-discriminative model. (¶[0002]) It would have been obvious to one having ordinary skill in the art to provide a correction term in Specia et al. based on a difference between a second output probability and a first output probability as taught by Krause et al. for a purpose of determining a next token in a text sequence using a generative-discriminative model.
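For illustration only, combining an unconditional probability with a logarithmic probability difference between two class conditional probabilities can be sketched as follows. The additive combination in log space, the `scale` parameter, the renormalization step, and all names and numeric values are assumptions made for this sketch; the exact formula used by Krause et al. is not reproduced here.

```python
import math


def combined_probabilities(unconditional, first_class, second_class, scale=1.0):
    """Add a scaled log-probability difference to each unconditional
    log-probability, then renormalize over the token candidates."""
    scores = {}
    for token, p in unconditional.items():
        log_diff = math.log(first_class[token]) - math.log(second_class[token])
        scores[token] = math.log(p) + scale * log_diff
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {token: math.exp(s - top) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def select_next_token(combined):
    """Pick the token candidate with the highest combined probability."""
    return max(combined, key=combined.get)


# Hypothetical probabilities for two token candidates.
unconditional = {"good": 0.5, "bad": 0.5}
first_class = {"good": 0.8, "bad": 0.2}
second_class = {"good": 0.2, "bad": 0.8}
combined = combined_probabilities(unconditional, first_class, second_class)
```

In this toy case the two candidates are equally likely unconditionally, so the selection is driven entirely by the difference term, which favors the candidate that is more probable under the first class than under the second.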
Response to Arguments
Applicants’ arguments filed 10 March 2026 have been considered but are moot in view of new grounds of rejection necessitated by amendment.
Applicants provide some fairly significant amendments to independent claims 1, 8, and 15, and present arguments traversing the prior rejection of the independent claims as being obvious under 35 U.S.C. §103 over Imanigooghari et al. (U.S. Patent Publication 2025/0209271) in view of Spencer et al. (U.S. Patent Publication 2014/0297267). Specifically, Applicants argue that their provisional application, including their Appendix I, provides sufficient support for the claimed invention as amended so that Imanigooghari et al. is not prior art.
Applicants’ amendments overcome the objections to the Specification.
Applicants have not addressed the absence of copies of non-patent literature cited in Specification, and which should be properly cited on an Information Disclosure Statement. This informality is being maintained so that this non-patent literature may be submitted in a proper Information Disclosure Statement with a Request for Continued Examination.
Applicants’ argument that the claimed invention is fully supported by the provisional application is not completely persuasive, but new grounds of rejection are set forth as directed to the independent claims being obvious under 35 U.S.C. §103 over Specia et al. (U.S. Patent Publication 2023/0419117) in view of De Freitas Adiwardana et al. (U.S. Patent No. 12,086,713). The examiner reserves an option to reconsider support for the present claim limitations in the provisional application. However, Applicants’ amendments necessitate these new grounds of rejection because the scope of the independent claims is now changed in a fairly significant manner by these amendments. An obviousness rejection of some dependent claims continues to rely upon Wu et al. (U.S. Patent Publication 2022/0253688), and new grounds of rejection are necessitated for obviousness further in view of Altschul et al. (U.S. Patent Publication 2022/0277149) and Krause et al. (U.S. Patent Publication 2021/0374341).
Mainly, Specia et al. and De Freitas Adiwardana et al. are maintained to teach the new limitations of the independent claims as amended. Specia et al. discloses a basic concept of determining toxicity of input text based on probabilities of a plurality of attributes using a neural network, and moderating textual content of the input text based on a combination of probabilities. Specia et al. provides one multi-task neural network, and a given task is construed as a ‘model’ of a multi-task neural network for a given attribute. That is, a multi-task neural network can be understood to include a plurality of neural network ‘models’ with a given ‘model’ performing a task of attribute prediction for a given attribute. Attribute predictors 203 and 204-1 to 204-N each produce an output probability score of a given attribute, so that a first attribute predictor produces a probability of “a specific attribute” and a second attribute predictor produces a probability of an attribute “without any emphasis on the specific attribute”. A first attribute could be toxicity or profanity, and a second attribute could be sarcasm or a joke. Specia et al., then, discloses combining outputs of probabilities of attributes to produce a decision 208 as to whether content moderation is performed. Broadly, Specia et al.’s content moderation produces “a correction term” so as to disclose “generating . . . a text output based on a combined output probability computed based on a correction term reflective of the first output probability and the second output probability.” Specia et al. does not disclose a generative neural network that generates output tokens for “a next token”, but this is taught by De Freitas Adiwardana et al. Specifically, De Freitas Adiwardana et al. teaches an auto-regressive language model neural network that generates a candidate output sequence for a next token based on probabilities from a plurality of criteria engines 220A-N.
A rationale for obviousness can be predicated on KSR International Co. v. Teleflex Inc. (KSR), 550 U.S. 398, 82 USPQ2d 1385 (2007): (A) Combining prior art elements according to known methods to yield predictable results. Specia et al. discloses a known method of content moderation that includes a neural network that determines combined probabilities for a plurality of attributes. De Freitas Adiwardana et al. teaches a known method of generating output text sequences for a next token using a neural network from combined probabilities for a plurality of criteria. It would be a predictable result to generate a next token in a sequence of tokens as taught by De Freitas Adiwardana et al. in toxicity detection of Specia et al.
Applicants’ amendments necessitate these new grounds of rejection. Accordingly, this rejection is properly FINAL.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Fang et al., Tongya, Ngo et al., and Palangi et al. disclose related prior art.
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP §706.07(a). Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608. The examiner can normally be reached Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARTIN LERNER/Primary Examiner
Art Unit 2658 April 10, 2026