Prosecution Insights
Last updated: April 19, 2026
Application No. 18/143,432

Mitigation for Prompt Injection in A.I. Models Capable of Accepting Text Input

Non-Final OA: §101, §103, §112, §DP
Filed: May 04, 2023
Examiner: GORMLEY, AARON PATRICK
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Preamble Inc.
OA Round: 1 (Non-Final)
Grant Probability: 60% (Moderate)
OA Rounds: 1-2
To Grant: 4y 4m
Grant Probability with Interview: 0%

Examiner Intelligence

Career Allow Rate: 60% (grants 60% of resolved cases; 3 granted / 5 resolved; +5.0% vs TC avg)
Interview Lift: -60.0% (0% allow rate for resolved cases with interview, vs 60% without; minimal interview data)
Avg Prosecution: 4y 4m (typical timeline)
Total Applications: 35 across all art units (30 currently pending)

Statute-Specific Performance

§101: 30.2% (-9.8% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 8.4% (-31.6% vs TC avg)
§112: 21.5% (-18.5% vs TC avg)
Tech Center averages are estimates • Based on career data from 5 resolved cases

Office Action

§101, §103, §112, §DP
DETAILED ACTION

This action is in response to the application filed 05/04/2023. Claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

The later-filed application must be an application for a patent for an invention which is also disclosed in the prior application (the parent or original nonprovisional application or provisional application). The disclosure of the invention in the parent application and in the later-filed application must be sufficient to comply with the requirements of 35 U.S.C. 112(a) or the first paragraph of pre-AIA 35 U.S.C. 112, except for the best mode requirement. See Transco Products, Inc. v. Performance Contracting, Inc., 38 F.3d 551, 32 USPQ2d 1077 (Fed. Cir. 1994).

The disclosures of the prior-filed applications, Provisional Applications No. 63/341,011 and No. 63/338,445, fail to provide adequate support or enablement in the manner provided by 35 U.S.C. 112(a) or pre-AIA 35 U.S.C. 112, first paragraph, for one or more claims of this application:

Claim 3: "wherein the processor is configured to disregard instructions that are semi-trusted". While the prior provisional applications disclose disregarding untrusted instructions, they do not disclose anything about "semi-trusted" instructions.

Claim 4: "wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets". The prior provisional applications do not disclose representing instructions using incompatible token sets.

Thus, claims 3 and 4 are ineligible to claim priority from the cited ancestral provisional applications, and their effective filing date is the actual filing date of the instant application: 05/04/2023.

Specification

The disclosure is objected to because of the following informalities: [0045]: The parentheses around "FIG. 13" are unclosed, i.e., "(FIG. 13" should be "(FIG. 13)". Appropriate correction is required.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: 1006, 1010. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections

Claim 6 is objected to because of the following informalities: "before being entered to the AI model" is improper grammar. Appropriate correction is required.

Claim 17 is objected to because of the following informalities: "untrusted instructions is selected" is improper grammar. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 recites two inputs in its first limitation, "text input" and "an input comprising tokens". It is unclear whether these inputs are intended to be synonymous or distinct. Claim 1's third limitation recites "apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt". It is unclear whether the "input" of the third limitation refers to the "text input", the "input comprising tokens", or both inputs of limitation 1. Thus, there is insufficient antecedent basis for this limitation in the claim. This deficiency is applicable to substantially similar independent claims 18 and 19 and inherited by all dependent claims. The two inputs of limitation 1 are interpreted as being potentially synonymous, with limitation 3 referring to either or both of them.

Claim 2 recites "The system of Claim 1, wherein the RL is reinforcement learning from human feedback (RLHF)". Reinforcement learning is used in two limitations of claim 1, and it is unclear which are referred to by claim 2. Thus, the scope of the claim is rendered indefinite. This deficiency is inherited by dependent claim 3. Claim 2 is interpreted as referring to either or both uses of reinforcement learning in claim 1.

Claim 17 recites "the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities". The scope of the recited group is not clearly defined, as it is unclear what additional elements do or do not belong in this group. Thus, the scope of the claim is rendered indefinite.

Claim 20 recites "wherein the processor removes the nontrusted instructions". There is no antecedent for "the processor" or "the nontrusted instructions". Thus, the scope of the claim is rendered indefinite. "The processor" is interpreted as referring to "a processor", while "the nontrusted instructions" are interpreted as being synonymous with "the untrusted instructions".

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the U.S. Patent No. 12118471 claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the U.S. Patent No. 12118471 claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 2, 2, 4, 1, 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 17, and 17 of U.S. Patent No. 12118471, respectively. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of Patent No. 12118471 disclose the entirety of the material claimed by the instant claims.

Instant claim 1 is rejected over reference U.S. Patent No. 12118471 claim 1:
Instant claim 1: "A system, comprising: an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens"
Patent claim 1: "A system, comprising: an artificial intelligence (AI) model configured to accept electronic text input comprising instructions including a sequence of tokens in response to an AI model prompt and configured to use deep learning to produce humanlike text responsive to the instructions"

Instant claim 1: "a processor configured to"
Patent claim 1: "a processor configured to:"

Instant claim 1: "apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt"
Patent claim 1: "apply reinforcement learning (RL) during operation of the AI model to determine electronic trusted instructions and electronic untrusted instructions from the electronic text input provided responsive to the AI model prompt;"

Instant claim 1: "tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag"
Patent claim 1: "electronically tag the electronic trusted instructions with a trusted tag and electronically tag the electronic untrusted instructions with an untrusted tag"

Instant claim 1: "apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag"
Patent claim 1: "apply RL to modify the instructions including the sequence of tokens provided in response to the AI model prompt to indicate that instructions tagged with the trusted tag and represented by the first token set are to be obeyed and that instructions tagged with the untrusted tag and represented by the second token set are to be disregarded and to remove instructions tagged with the untrusted tag from the sequence of tokens provided responsive to the AI model prompt to create instructions to the AI model that are influenced by the instructions tagged with the trusted tag but not influenced by the instructions tagged with the untrusted tag, wherein the AI model is configured to execute the instructions in the electronic text input that has been tagged with the trusted tag to provide the trusted human-like text."

Instant claim 2 is rejected over reference U.S. Patent No. 12118471 claim 2:

Instant claim 2: "The system of Claim 1"
Patent claim 2: "The system of claim 1"

Instant claim 2: "wherein the RL is reinforcement learning from human feedback (RLHF)."
Patent claim 2: "wherein the RL is reinforcement learning from human feedback (RLHF)."

Instant claim 3 is rejected in view of U.S. Patent No. 12118471 claim 2:

Instant claim 3: "The system of Claim 2"
Patent claim 2: "The system of claim 1"

Instant claim 3: "wherein the processor is configured to disregard instructions that are semi-trusted"
Patent claim 2: "wherein the type of instruction further comprises a semi-trusted instruction, and wherein semi-trusted instructions are disregarded in the modified instructions by the processor"

Instant claim 4 is rejected in view of U.S. Patent No. 12118471 claim 4:

Instant claim 4: "The system of Claim 1"
Patent claim 4: "The system of claim 1"

Instant claim 4: "wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets"
Patent claim 4: "(Parent claim 1) wherein the trusted tag and untrusted tag are respectively indicative of a type of instruction including at least a trusted instruction represented by a first token set adapted to be executed by the AI model to produce trusted human-like text responsive to the trusted instruction and an untrusted instruction represented by a second token set incompatible with the first token set and adapted to not be executed by the AI model (Claim 4) wherein the incompatible token sets have separate incompatible dictionaries"

Instant claim 5 is rejected in view of U.S. Patent No. 12118471 claim 1:

Instant claim 5: "The system of Claim 1"
Patent claim 1: (U.S. Patent No. 12118471 claim 1 is mapped to instant claim 1, see above)

Instant claim 5: "wherein the processor is configured to remove the untrusted instructions from the input and create content that is influenced by the trusted instructions but not influenced by the untrusted instructions"
Patent claim 1: "instructions tagged with the untrusted tag and represented by the second token set are to be disregarded and to remove instructions tagged with the untrusted tag from the sequence of tokens provided responsive to the AI model prompt to create instructions to the AI model that are influenced by the instructions tagged with the trusted tag but not influenced by the instructions tagged with the untrusted tag"

Instant claim 6 is rejected in view of U.S. Patent No. 12118471 claim 6:

Instant claim 6: "The system of Claim 5"
Patent claim 6: "The system of claim 1" (U.S. Patent No. 12118471 claim 1 is mapped to instant claim 5, see above)

Instant claim 6: "wherein the processor is configured to automatically delete the untrusted instructions from the input before being entered to the AI model"
Patent claim 6: "wherein applying RL by the processor to modify the instructions comprises the RL automatically deleting instructions tagged with the untrusted tag from the electronic text input before being entered into the AI model"

Instant claim 7 is rejected in view of U.S. Patent No. 12118471 claim 6:

Instant claim 7: "The system of Claim 5"
Patent claim 6: "The system of claim 1" (U.S. Patent No. 12118471 claim 1 is mapped to instant claim 5, see above)

Instant claim 7: "wherein the untrusted instructions are detected using a set of rules."
Patent claim 6: "wherein applying RL by the processor to determine electronic untrusted instructions comprises applying a set of rules to the sequence of tokens provided responsive to the AI model prompt."

Instant claim 8 is rejected in view of U.S. Patent No. 12118471 claim 7:

Instant claim 8: "The system of Claim 7"
Patent claim 7: "The system of claim 6"

Instant claim 8: "wherein the rules are configured to be custom configured by a user"
Patent claim 7: "wherein the rules are configured to be custom configured by a user."

Instant claim 9 is rejected in view of U.S. Patent No. 12118471 claim 8:

Instant claim 9: "The system of Claim 1"
Patent claim 8: "The system of claim 1"

Instant claim 9: "wherein the processor is configured to tag each said token of the input"
Patent claim 8: "wherein electronically tagging the electronic trusted instructions with a trusted tag and electronically tagging the electronic untrusted instructions with an untrusted tag comprises the processor electronically tagging each token of the sequence of tokens of the instructions with either a trusted tag or an untrusted tag"

Instant claim 10 is rejected in view of U.S. Patent No. 12118471 claim 9:

Instant claim 10: "The system of Claim 9"
Patent claim 9: "The system of claim 8"

Instant claim 10: "wherein the processor is configured to use the tags to keep track of which tokens of input come from a user and from a trusted application prompt"
Patent claim 9: "wherein electronically tagging the electronic trusted instructions with a trusted tag and electronically tagging the electronic untrusted instructions with an untrusted tag comprises the processor using the tags to keep track of which tokens of the sequence of tokens come from a user and which tokens of the sequence of tokens come from a trusted application prompt"

Instant claim 11 is rejected in view of U.S. Patent No. 12118471 claim 10:

Instant claim 11: "The system of Claim 1"
Patent claim 10: "The system of claim 1"

Instant claim 11: "wherein the processor is trained to follow an instruction of a trusted sequence and penalize the system for following any instruction received in full or in part from a danger sequence"
Patent claim 10: "wherein the AI model is trained to follow an instruction of a trusted sequence of tokens and the AI model is penalized for following any instruction received in full or in part from an untrusted sequence of tokens"

Instant claim 12 is rejected in view of U.S. Patent No. 12118471 claim 11:

Instant claim 12: "The system of Claim 1"
Patent claim 11: "The system of claim 1"

Instant claim 12: "wherein the processor is configured to: detect non-conforming hidden content in the input; and modify the input responsive to the non-conforming hidden content"
Patent claim 11: "wherein the processor is further configured to: detect a non-conforming hidden command to the AI model in the electronic text input; and modify the electronic text input to remove the nonconforming hidden command"

Instant claim 13 is rejected in view of U.S. Patent No. 12118471 claim 12:

Instant claim 13: "The system of Claim 1"
Patent claim 12: "The system of claim 1"

Instant claim 13: "wherein the AI model is a generative pretrained transformer (GPT), wherein the processor is a trained platform to modify operation of the GPT"
Patent claim 12: "wherein the AI model is a generative pretrained transformer (GPT), and wherein the processor is trained to modify operation of the GPT"

Instant claim 14 is rejected in view of U.S. Patent No. 12118471 claim 13:

Instant claim 14: "The system of Claim 1"
Patent claim 13: "The system of claim 1"

Instant claim 14: "wherein the processor is configured to remove the untrusted instructions from the input in a way that is hidden from a user entering the input"
Patent claim 13: "wherein the processor is configured to remove the instructions tagged with the untrusted tag from the electronic text input in a way that is hidden from a user entering the electronic text input"

Instant claim 15 is rejected in view of U.S. Patent No. 12118471 claim 14:

Instant claim 15: "The system of Claim 1"
Patent claim 14: "The system of claim 1"

Instant claim 15: "wherein the processor is configured to identify users entering untrusted instructions in a report configured to allow management to understand and address users entering potential violating commands"
Patent claim 14: "wherein the processor is configured to identify users entering instructions tagged with the untrusted tag in a report configured to allow management to understand and address users entering potential violating commands in the electronic text input."

Instant claim 16 is rejected in view of U.S. Patent No. 12118471 claim 15:

Instant claim 16: "The system of Claim 15"
Patent claim 15: "The system of claim 14"

Instant claim 16: "wherein the report is configured to be generated in real-time"
Patent claim 15: "wherein the processor generates the report in real-time"

Instant claim 17 is rejected in view of U.S. Patent No. 12118471 claim 16:

Instant claim 17: "The system of Claim 1"
Patent claim 16: "The system of claim 1"

Instant claim 17: "wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities"
Patent claim 16: "wherein instructions tagged with an untrusted tag include words having attributes directed to at least one of cyberbullying, harassment, toxicity, islamophobia, misogyny, or journalistic qualities"

Instant claim 18 is rejected over reference U.S. Patent No. 12118471 claim 1:

Instant claim 18: "A system operable with an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens"
Patent claim 1: "A system, comprising: an artificial intelligence (AI) model configured to accept electronic text input comprising instructions including a sequence of tokens in response to an AI model prompt and configured to use deep learning to produce humanlike text responsive to the instructions"

Instant claim 18: "the system comprising a processor configured to:"
Patent claim 1: "a processor configured to:"

Instant claim 18: "apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt"
Patent claim 1: "apply reinforcement learning (RL) during operation of the AI model to determine electronic trusted instructions and electronic untrusted instructions from the electronic text input provided responsive to the AI model prompt;"

Instant claim 18: "tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag"
Patent claim 1: "electronically tag the electronic trusted instructions with a trusted tag and electronically tag the electronic untrusted instructions with an untrusted tag"

Instant claim 18: "apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag"
Patent claim 1: "apply RL to modify the instructions including the sequence of tokens provided in response to the AI model prompt to indicate that instructions tagged with the trusted tag and represented by the first token set are to be obeyed and that instructions tagged with the untrusted tag and represented by the second token set are to be disregarded and to remove instructions tagged with the untrusted tag from the sequence of tokens provided responsive to the AI model prompt to create instructions to the AI model that are influenced by the instructions tagged with the trusted tag but not influenced by the instructions tagged with the untrusted tag, wherein the AI model is configured to execute the instructions in the electronic text input that has been tagged with the trusted tag to provide the trusted human-like text."

Instant claim 19 is rejected over reference U.S. Patent No. 12118471 claim 17:

Instant claim 19: "A method of using an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens, the method comprising:"
Patent claim 17: "A method of instructing an artificial intelligence (AI) model configured to accept electronic text input comprising instructions including a sequence of tokens in response to an AI model prompt and to perform deep learning to produce human-like text responsive to the instructions, the method comprising:"

Instant claim 19: "applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt"
Patent claim 17: "applying, by a processor, reinforcement learning (RL) during operation of the AI model to the instructions to determine electronic trusted instructions and electronic untrusted instructions from the electronic text input provided responsive to the AI model prompt"

Instant claim 19: "tagging the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag"
Patent claim 17: "electronically tagging, by the processor, the electronic trusted instructions with a trusted tag and electronically tagging the electronic untrusted instructions with an untrusted tag"

Instant claim 19: "applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag"
Patent claim 17: "applying, by the processor, RL to modify the instructions including the sequence of tokens provided in response to the AI model prompt to indicate that instructions tagged with the trusted tag and represented by the first token set are to be obeyed and that instructions tagged with the untrusted tag and represented by the second token set are to be disregarded and to remove instructions tagged with the untrusted tag from the sequence of tokens provided responsive to the AI model prompt to create instructions to the AI model that are influenced by the instructions tagged with the trusted tag but not influenced by the instructions tagged with the untrusted tag" ... "executing, by the AI model, the instructions in the electronic text input that has been tagged with the trusted tag to provide the trusted human-like text."

Instant claim 20 is rejected in view of U.S. Patent No. 12118471 claim 17:

Instant claim 20: "The system of Claim 19"
Patent claim 17: (U.S. Patent No. 12118471 claim 17 is mapped to instant claim 19, see above)

Instant claim 20: "wherein the processor removes the nontrusted instructions from the input and creates content that is influenced by the trusted instructions but that is not influenced by the untrusted instructions"
Patent claim 17: "instructions tagged with the untrusted tag and represented by the second token set are to be disregarded and to remove instructions tagged with the untrusted tag from the sequence of tokens provided responsive to the AI model prompt to create instructions to the AI model that are influenced by the instructions tagged with the trusted tag but not influenced by the instructions tagged with the untrusted tag"

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed inventions are directed to non-statutory subject matter without significantly more.
Claim 1

Step 1: The claim recites "A system", and is therefore directed to the statutory category of machine.

Step 2A Prong 1: The claim recites the following judicial exception(s):

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This can be performed as a mental process. One can merely decide which of the instructions input to the prompt are trustworthy.

"tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag": This can be performed as a mental process. One can mentally tag instructions based on trustworthiness.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This can be performed as a mental process. One can merely imagine the tagged trusted and untrusted instructions, and pay no mind to the untrusted instructions.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s):

"an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens": This is a well-understood limitation of all chatbots / dialogue systems, and thus amounts to insignificant extra-solution activity (MPEP 2106.05(g)).

"a processor configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens": This is a fundamental limitation of chatbot systems, as noted by Ganu et al. (SYSTEM AND METHOD FOR REDUCING USER QUERY AMBIGUITY THROUGH CHATBOT CLARIFYING QUESTIONS, published 3/18/2021, US 20210081442 A1): "A dialog system, which is commonly known as a chatbot, is typically used to interact with a user via text or voice conversations that imitate natural human conversations, behavior, and interactions. Some dialog systems provide a conversational user interface and act as a translator between a user and an application with which a user desires to interact, thereby allowing a user to use voice or text to interact with the application through the dialog system. For a dialog system processing voice data, automatic speech recognition (ASR) can be utilized to convert user speech or voice data to text. For processing text data, a dialog system can use natural language understanding (NLU) to recognize the meaning of the text data. In this way, the dialog system can provide an engaging user experience and a lifelike, or human-like, conversational interaction between the user and the application. Although traditional dialog systems are able to determine an intent of a user from the user's input voice or text data, the determined intent of the user may nevertheless be ambiguous." (Ganu, [0001])

"a processor configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

Claim 2

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the RL is reinforcement learning from human feedback (RLHF)": Using reinforcement learning to determine trusted / untrusted instructions, detect and obey trusted instructions, and detect and disregard untrusted instructions is still mere instruction to apply reinforcement learning to a judicial exception in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the RL is reinforcement learning from human feedback (RLHF)": Using reinforcement learning to determine trusted / untrusted instructions, detect and obey trusted instructions, and detect and disregard untrusted instructions is still mere instruction to apply reinforcement learning to a judicial exception in a generic manner (MPEP 2106.05(f)).

Claim 3

Step 1: The claim recites a machine, as in claim 2.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the processor is configured to disregard instructions that are semi-trusted": This can be performed as a mental process. One can merely ignore instructions they do not fully trust.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to disregard instructions that are semi-trusted": This is mere instruction to execute the recited judicial exceptions in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to disregard instructions that are semi-trusted": This is mere instruction to execute the recited judicial exceptions in a generic manner (MPEP 2106.05(f)).

Claim 4

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets": This can be performed as a mental process. One can merely imagine the two sets using two different languages or mutually exclusive character sets.
Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the additional element(s).

Step 2B: The additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s).

Claim 5

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"remove the untrusted instructions from the input": This amounts to merely updating data in memory and is insignificant extra-solution activity (MPEP 2106.05(g)).

"and create content that is influenced by the trusted instructions but not influenced by the untrusted instructions": This is mere instruction to create content based on the tagged instructions in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"remove the untrusted instructions from the input": This is an instance of storing information in memory (updated instructions), a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).

"and create content that is influenced by the trusted instructions but not influenced by the untrusted instructions": This is mere instruction to create content based on the tagged instructions in a generic manner (MPEP 2106.05(f)).

Claim 6

Step 1: The claim recites a machine, as in claim 5.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"automatically delete the untrusted instructions from the input before being entered to the AI model": This amounts to merely updating data in memory and is insignificant extra-solution activity (MPEP 2106.05(g)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"automatically delete the untrusted instructions from the input before being entered to the AI model": This is an instance of storing information in memory (updated instructions), a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).

Claim 7

Step 1: The claim recites a machine, as in claim 5.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the untrusted instructions are detected using a set of rules": This can be performed as a mental process. One can decide on the trustworthiness of instructions based on rules they have envisioned.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the additional element(s).

Step 2B: The additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s).

Claim 8

Step 1: The claim recites a machine, as in claim 7.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the rules are configured to be custom configured by a user": Detecting untrusted instructions using a set of rules can still be performed as a mental process. In envisioning the rules, they are being configured by the one performing the mental process.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the additional element(s).

Step 2B: The additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s).

Claim 9

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the processor is configured to tag each said token of the input": This can be performed as a mental process. One can mentally tag each token (word or set of characters) of the input.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to tag each said token of the input": This is mere instruction to apply a judicial exception with generic computer hardware (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to tag each said token of the input": This is mere instruction to apply a judicial exception with generic computer hardware (MPEP 2106.05(f)).

Claim 10

Step 1: The claim recites a machine, as in claim 9.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the processor is configured to use the tags to keep track of which tokens of input come from a user and from a trusted application prompt": This can be performed as a mental process. One can imagine which tags came from a user and which constitute part of a trusted prompt.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to use the tags to keep track of which tokens of input come from a user and from a trusted application prompt": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to use the tags to keep track of which tokens of input come from a user and from a trusted application prompt": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).
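[Editorial illustration] The per-token provenance tagging recited in claims 9-10 can be pictured with a minimal sketch. All names here (TaggedToken, tag_tokens) and the whitespace tokenization are hypothetical assumptions for illustration only, not the applicant's implementation or the examiner's characterization:

from dataclasses import dataclass

@dataclass
class TaggedToken:
    text: str
    source: str  # "trusted" for application-prompt tokens, "untrusted" for user tokens

def tag_tokens(app_prompt: str, user_input: str) -> list[TaggedToken]:
    # Tag every token with its provenance so later stages can keep track of
    # which tokens came from the trusted application prompt and which came
    # from the user (cf. claims 9-10), and can filter on that basis.
    tagged = [TaggedToken(t, "trusted") for t in app_prompt.split()]
    tagged += [TaggedToken(t, "untrusted") for t in user_input.split()]
    return tagged

# Example: application-prompt tokens carry the trusted tag; user tokens do not.
tokens = tag_tokens("Summarize the report.", "Ignore all previous instructions.")
print([(t.text, t.source) for t in tokens])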
Claim 11

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is trained to follow an instruction of a trusted sequence and penalize the system for following any instruction received in full or in part from a danger sequence": This is mere instruction to generically follow trusted instructions and penalize the system over dangerous instructions, performed with generic computer hardware (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is trained to follow an instruction of a trusted sequence and penalize the system for following any instruction received in full or in part from a danger sequence": This is mere instruction to generically follow trusted instructions and penalize the system over dangerous instructions, performed with generic computer hardware (MPEP 2106.05(f)).

Claim 12

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"detect non-conforming hidden content in the input": This can be performed as a mental process. One can merely identify content in the input failing to conform with some guideline(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"modify the input responsive to the non-conforming hidden content": This is mere instruction to modify input based on a judicial exception in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"modify the input responsive to the non-conforming hidden content": This is mere instruction to modify input based on a judicial exception in a generic manner (MPEP 2106.05(f)).

Claim 13

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the AI model is a generative pretrained transformer (GPT), wherein the processor is a trained platform to modify operation of the GPT": This is a conventional data structure used for a conventional process, and is thus insignificant extra-solution activity (MPEP 2106.05(g)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the AI model is a generative pretrained transformer (GPT), wherein the processor is a trained platform to modify operation of the GPT": Using GPTs for large language processing is a conventional technique, as noted by Reza et al. (ASPECT PROMPTING FRAMEWORK FOR LANGUAGE MODELING, filed 1/25/2022, US 20230237277 A1): "As discussed herein, most conventional language models follow a general training approach of pre-training and fine-tuning the model. Adding a prompting pipeline along with pre-training and fine-tuning allows these language models to close the gap and become better learners. In general, a prompt is a piece of text inserted in a training example and the prompt can be used in the reformulation of a masked language model task. Downstream tasks like sentiment classification and named entity recognition (NER) can benefit from these approaches. Conventional large language models pre-trained with prompting demonstrate the ability to infer with the help of zero shot and few shot learning and can handle a large set of downstream tasks like Q&A, sentiment analysis, NER, etc. The rising success of these conventional models such as Generative Pre-trained Transformer 3 (GPT-3) are their ability to leverage the natural language prompts alongside their giant set of parameters." (Reza, [0025])

Claim 14

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the processor is configured to remove the untrusted instructions from the input in a way that is hidden from a user entering the input": This is mere instruction to remove the untrusted instructions in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the processor is configured to remove the untrusted instructions from the input in a way that is hidden from a user entering the input": This is mere instruction to remove the untrusted instructions in a generic manner (MPEP 2106.05(f)).

Claim 15

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"the processor is configured to identify users entering untrusted instructions in a report configured to allow management to understand and address users entering potential violating commands": This is mere instruction to generate a report based on a judicial exception in a generic manner, understand said report in a generic manner, and accordingly address users in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"the processor is configured to identify users entering untrusted instructions in a report configured to allow management to understand and address users entering potential violating commands": This is mere instruction to generate a report based on a judicial exception in a generic manner, understand said report in a generic manner, and accordingly address users in a generic manner (MPEP 2106.05(f)).
Claim 16

Step 1: The claim recites a machine, as in claim 15.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

"wherein the report is configured to be generated in real-time": Generation of the report is still highly generic and amounts to mere instruction to apply a judicial exception (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"wherein the report is configured to be generated in real-time": Generation of the report is still highly generic and amounts to mere instruction to apply a judicial exception (MPEP 2106.05(f)).

Claim 17

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

"wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities": Tagging untrusted instructions can still be performed as a mental process.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the additional element(s).

Step 2B: The additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s).

Claim 18

Step 1: The claim recites "A system", and is therefore directed to the statutory category of machine.

Step 2A Prong 1: The claim recites the following judicial exception(s):

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This can be performed as a mental process. One can merely decide which of the instructions input to the prompt are trustworthy.

"tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag": This can be performed as a mental process. One can mentally tag instructions based on trustworthiness.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This can be performed as a mental process. One can merely imagine the tagged trusted and untrusted instructions, and pay no mind to the untrusted instructions.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s):

"an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens": This is a well-understood limitation of all chatbots / dialogue systems, and thus amounts to insignificant extra-solution activity (MPEP 2106.05(g)).

"the system comprising a processor configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

"an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens": This is a fundamental limitation of chatbot systems, as noted by Ganu et al. (SYSTEM AND METHOD FOR REDUCING USER QUERY AMBIGUITY THROUGH CHATBOT CLARIFYING QUESTIONS, published 3/18/2021, US 20210081442 A1): "A dialog system, which is commonly known as a chatbot, is typically used to interact with a user via text or voice conversations that imitate natural human conversations, behavior, and interactions. Some dialog systems provide a conversational user interface and act as a translator between a user and an application with which a user desires to interact, thereby allowing a user to use voice or text to interact with the application through the dialog system. For a dialog system processing voice data, automatic speech recognition (ASR) can be utilized to convert user speech or voice data to text. For processing text data, a dialog system can use natural language understanding (NLU) to recognize the meaning of the text data. In this way, the dialog system can provide an engaging user experience and a lifelike, or human-like, conversational interaction between the user and the application. Although traditional dialog systems are able to determine an intent of a user from the user's input voice or text data, the determined intent of the user may nevertheless be ambiguous." (Ganu, [0001])

"the system comprising a processor configured to": This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

"apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

"apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner.

Claim 19

Step 1: The claim recites "A method", and is therefore directed to the statutory category of process.

Step 2A Prong 1: The claim recites the following judicial exception(s):

"applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt": This can be performed as a mental process. One can merely decide which of the instructions input to the prompt are trustworthy.

"tagging the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag": This can be performed as a mental process. One can mentally tag instructions based on trustworthiness.

"applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag": This can be performed as a mental process. One can merely imagine the tagged trusted and untrusted instructions, and pay no mind to the untrusted instructions.
Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s) an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens: This is a well-understood limitation of all chatbots / dialogue systems, and thus amounts to insignificant extra-solution activity (MPEP 2106.05(g)). applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt: This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner. applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner. Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s) an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens: This is a fundamental limitation of chatbot systems, as noted by Ganu et al. (SYSTEM AND METHOD FOR REDUCING USER QUERY AMBIGUITY THROUGH CHATBOT CLARIFYING QUESTIONS, published 3/18/2021, US 20210081442 A1): “A dialog system, which is commonly known as a chatbot, is typically used to interact with a user via text or voice conversations that imitate natural human conversations, behavior, and interactions. Some dialog systems provide a conversational user interface and act as a translator between a user and an application with which a user desires to interact, thereby allowing a user to use voice or text to interact with the application through the dialog system. For a dialog system processing voice data, automatic speech recognition (ASR) can be utilized to convert user speech or voice data to text. For processing text data, a dialog system can use natural language understanding (NLU) to recognize the meaning of the text data. In this way, the dialog system can provide an engaging user experience and a lifelike, or human-like, conversational interaction between the user and the application. Although traditional dialog systems are able to determine an intent of a user from the user's input voice or text data, the determined intent of the user may nevertheless be ambiguous.” (Ganu, [0001]) applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt: This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner. applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: This is mere instruction to apply reinforcement learning to a judicial exception in a generic manner. Claim 20 Step 1: The claim recites a process, as in claim 19 Step 2A Prong 1: The claim recites no further judicial exception(s) Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s) wherein the processor: This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)). 
removes the untrusted instructions from the input: This amounts to merely updating data in memory and is insignificant extra-solution activity (MPEP 2106.05(g)).

and creates content that is influenced by the trusted instructions but not influenced by the untrusted instructions: This is mere instruction to create content based on the tagged instructions in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

wherein the processor: This is mere instruction to execute the recited judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

removes the untrusted instructions from the input: This is an instance of storing information in memory (updated instructions), a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).

and creates content that is influenced by the trusted instructions but not influenced by the untrusted instructions: This is mere instruction to create content based on the tagged instructions in a generic manner (MPEP 2106.05(f)).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 5-7, 9-10, 12-13, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al.
(Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari.

Regarding claim 1, Mehrabi discloses [a] system, comprising:

an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens: “Adversarial attacks on different Machine Learning (ML) and Natural Language Processing (NLP) (AI model[s]) applications can reveal important vulnerability issues related to these systems. Most existing research focuses on adversarial attacks that degrade performance of existing ML systems with regards to accuracy (Chakraborty et al., 2018; Zhang et al., 2020b). More recent work has considered attacks that target ethical concerns, such as triggering the models into outputting unfair predictions (Mehrabi et al., 2021b; Solans et al., 2021), or in the context of NLP, generating biased (Sheng et al., 2020) and toxic (Wallace et al., 2019) text” (Mehrabi, page 2831, left column, paragraph 1); “Figure 1: An example illustrating the attack (text input comprising tokens) performed by the adversary on the third turn of the conversation (red line) that leads the defender (AI model) into generating a toxic utterance (human-like text) (dotted box). With a proper defense the defender can bypass the attack and generate a non-toxic response (green line).” (Mehrabi, page 2831, right column, Figure 1); “General Setup We use DialoGPT (deep learning model) (Zhang et al., 2020c) to generate 100 conversations around a specific topic.” (Mehrabi, page 2833, right column, paragraph 2)

a processor configured to: “We used Nvidia GeForce RTX 2080 (processor) to perform all our experiments except the experiment using the GPT-2 model which was ran on CPU (processor) for memory constraints” (Mehrabi, page 2842, left column, paragraph 3)

apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt: “We quantify the toxicity of each candidate attack utterance (input provided) using either a single toxicity classifier or an ensemble of such classifiers we first apply a threshold T to toxicity scores of the candidate utterances and label the utterances above this threshold as toxic.” (Mehrabi, page 2833, left column, paragraph 3). Trusted instructions are utterances that fall below the toxicity threshold, while those at or above the threshold are untrusted instructions.

tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The second layer aims to detect which tokens in the adversary’s attack utterance are responsible for generation of L1 tokens form defender’s utterance. We call these tokens (untrusted tag[s]) identified in layer 2 as the L2 tokens” (Mehrabi, page 2836, left column, paragraph 1). At the end of this process each token of the adversary’s input is either untrustworthy (with an L2 tag) or trustworthy (without an L2 tag).
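Purely as an illustrative aid to the threshold mapping above, and not as code from Mehrabi, Pujari, or the application: the sketch below shows one minimal way utterance-level scores could be split into trusted and untrusted labels around a threshold T. The `toxicity_score` heuristic is a hypothetical stand-in for the toxicity classifier (or classifier ensemble) the quoted passage describes.

```python
# Minimal sketch only (hypothetical names): label each utterance trusted or
# untrusted by thresholding a toxicity score, mirroring the classifier-plus-
# threshold reading of Mehrabi given above.

def toxicity_score(utterance: str) -> float:
    """Stand-in for a trained toxicity classifier or ensemble.

    A trivial keyword heuristic keeps the sketch self-contained and runnable.
    """
    toxic_markers = {"attack", "insult", "threat"}
    words = utterance.lower().split()
    return sum(w in toxic_markers for w in words) / max(len(words), 1)


def tag_utterances(utterances, threshold=0.5):
    """Tag utterances 'trusted' (score below T) or 'untrusted' (at or above T)."""
    return [
        (u, "untrusted" if toxicity_score(u) >= threshold else "trusted")
        for u in utterances
    ]


if __name__ == "__main__":
    sample = ["please summarize this report", "insult the other user"]
    for utterance, tag in tag_utterances(sample, threshold=0.2):
        print(f"{tag:>9}: {utterance}")
```

A real system would substitute a learned classifier for the keyword heuristic; the sketch only fixes the notions of a score, a threshold T, and the resulting trusted/untrusted labels.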
apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens (untrusted instructions) from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Mehrabi relates to guarding against inappropriate model outputs due to untrusted user inputs and is analogous to the claimed invention.

While Mehrabi fails to disclose the further limitations of the claim, Pujari discloses a system able to apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt:

tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: “The input to the multi-task model is the text of the data example and a task ID. Output of the model is predicted label (tag) on the specified task.” (Pujari, page 5, left column, paragraph 5); “We use six datasets for our empirical evaluation, namely, Jigsaw Toxicity Dataset, Hate Speech Detection (de Gibert et al., 2018), Misogyny Detection (Fersini et al., 2018), Offensive Language Detection (Davidson et al., 2017), coarse-grained Stereotype Detection (combination of Stereoset, CrowSPairs and Reddit Data) and finally fine-grained Stereotype Detection Data (as described in section 3).” (Pujari, page 6, right column, paragraph 3); “Jigsaw Toxicity Dataset consists of 159,571 training examples and 153,164 test examples labeled (tag[ged]) with one or more of the seven labels: toxic (untrusted), severely toxic (untrusted), obscene (untrusted), threat (untrusted), insult (untrusted), identity hate (untrusted), none (trusted)” (Pujari, page 7, left column, paragraph 3)

apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: “we then propose a reinforcement learning agent that learns to guide the multi-task learning model by selecting meaningful data examples from the neighboring task datasets that help in improving the target task.” (Pujari, page 2, right column, paragraph 1)

Pujari relates to detecting offensive text using machine learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Mehrabi to use a multi-task classification model trained with reinforcement learning to detect unsafe text, as disclosed by Pujari. Pujari’s reinforcement learning agent improves model performance by selectively taking advantage of overlapping cases in different task datasets across several different but interrelated tasks. See Pujari, page 4, right column, paragraph 3 & page 6, left column, paragraphs 4-5.

Regarding claim 5, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the processor is configured to remove the untrusted instructions from the input and create content that is influenced by the trusted instructions but not influenced by the untrusted instructions: (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Regarding claim 6, the rejection of claim 5 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the processor is configured to automatically delete the untrusted instructions from the input before being entered to the AI model: “The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Regarding claim 7, the rejection of claim 5 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the untrusted instructions are detected using a set of rules: “The first layer aims to detect which tokens in the defender’s utterance is making the toxicity detection model to label the utterance as being toxic. We call these tokens the L1 tokens. The second layer aims to detect which tokens in the adversary’s attack utterance are responsible for generation of L1 tokens form defender’s utterance. We call these tokens identified in layer 2 as the L2 tokens (untrusted instructions)” (Mehrabi, page 2835, right column, paragraph 3); “The defense framework is demonstrated in Figure 6. For the first layer, we use transformers interpret (set of rules) which provides explanations and identifies the L1 token according to Toxic-bert model. For the second layer, we use LERG (Tuan et al., 2021) (set of rules) that provides local explanations for dialogue response generation and identifies the L2 token (given the L1 token in the response utterance it identifies the L2 token in the query utterance)” (Mehrabi, page 2836, left column, paragraph 2).

Regarding claim 9, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the processor is configured to tag each said token of the input: (Mehrabi, page 2836, Figure 6, reproduced) “The second layer aims to detect which tokens in the adversary’s attack utterance (input) are responsible for generation of L1 tokens form defender’s utterance. We call these tokens identified in layer 2 as the L2 tokens.” (Mehrabi, page 2836, left column, paragraph 1). Each token is either designated as L2 (untrusted), or as not L2 (trusted).
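For the iterative mask-and-regenerate defense cited for claims 5-7 above, the following is again a hypothetical sketch rather than Mehrabi's implementation; `generate`, `is_toxic`, and `attribute_tokens` are toy stand-ins for the dialogue model, the toxicity classifier, and the two-layer attribution step, respectively.

```python
# Minimal sketch only (hypothetical names): iteratively mask input tokens
# implicated in a toxic output and regenerate, in the spirit of the
# two-stage defense quoted above.

MASK = "[MASK]"


def generate(tokens):
    # Toy stand-in for a dialogue model; echoes the unmasked input.
    return " ".join(t for t in tokens if t != MASK)


def is_toxic(text: str) -> bool:
    return "attack" in text  # placeholder classifier


def attribute_tokens(tokens, response):
    # Toy stand-in for attribution (which input tokens drove the toxic output).
    return [i for i, t in enumerate(tokens) if t in response and is_toxic(t)]


def defended_response(input_tokens, max_rounds=5):
    tokens = list(input_tokens)
    for _ in range(max_rounds):
        response = generate(tokens)
        if not is_toxic(response):
            return response  # output no longer influenced by masked tokens
        for i in attribute_tokens(tokens, response):
            tokens[i] = MASK  # remove the implicated untrusted tokens
    return "[response withheld]"


print(defended_response(["please", "attack", "the", "user"]))
```

The bounded loop mirrors the quoted behavior of masking more input tokens until the generated output is deemed safe.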
Regarding claim 10, the rejection of claim 9 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the processor is configured to use the tags to keep track of which tokens of input come from a user and from a trusted application prompt: “The second layer aims to detect which tokens in the adversary’s (user) attack utterance are responsible for generation of L1 tokens form defender’s utterance. We call these tokens identified in layer 2 as the L2 tokens (untrusted tokens). The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe” (Mehrabi, page 2836, left column, paragraph 1). Untrusted L2 tokens are removed from the prompt to form a trusted application prompt.

Regarding claim 12, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the processor is configured to: detect non-conforming hidden content in the input; and modify the input responsive to the non-conforming hidden content: “In this work, we propose attacks (non-conforming content) against conversational agents that are imperceptible (hidden), i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language.” (Mehrabi, page 2831, left column, Abstract); (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens (non-conforming hidden content) from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Regarding claim 13, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system, wherein the AI model is a generative pretrained transformer (GPT), wherein the processor is a trained platform to modify operation of the GPT: “We use DialoGPT (Zhang et al., 2020c) to generate 100 conversations around a specific topic. The topic is determined by the context sentence that starts the conversation between the adversary and the defender. Each conversation runs for 10 turns. To measure the effectiveness of the attack and defense mechanisms given the conversation history as well preservation of relevancy and coherency, the adversary generates the attack utterance on the third turn of each conversation.” (Mehrabi, page 2833, right column, paragraph 2)

Regarding claim 18, Mehrabi discloses [a] system operable with an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens: “Adversarial attacks on different Machine Learning (ML) and Natural Language Processing (NLP) (AI model[s]) applications can reveal important vulnerability issues related to these systems. Most existing research focuses on adversarial attacks that degrade performance of existing ML systems with regards to accuracy (Chakraborty et al., 2018; Zhang et al., 2020b). More recent work has considered attacks that target ethical concerns, such as triggering the models into outputting unfair predictions (Mehrabi et al., 2021b; Solans et al., 2021), or in the context of NLP, generating biased (Sheng et al., 2020) and toxic (Wallace et al., 2019) text” (Mehrabi, page 2831, left column, paragraph 1); “Figure 1: An example illustrating the attack (text input comprising tokens) performed by the adversary on the third turn of the conversation (red line) that leads the defender (AI model) into generating a toxic utterance (human-like text) (dotted box). With a proper defense the defender can bypass the attack and generate a non-toxic response (green line).” (Mehrabi, page 2831, right column, Figure 1); “General Setup We use DialoGPT (deep learning model) (Zhang et al., 2020c) to generate 100 conversations around a specific topic.” (Mehrabi, page 2833, right column, paragraph 2)

the system comprising a processor configured to: “We used Nvidia GeForce RTX 2080 (processor) to perform all our experiments except the experiment using the GPT-2 model which was ran on CPU (processor) for memory constraints” (Mehrabi, page 2842, left column, paragraph 3)

apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt: “We quantify the toxicity of each candidate attack utterance (input provided) using either a single toxicity classifier or an ensemble of such classifiers we first apply a threshold T to toxicity scores of the candidate utterances and label the utterances above this threshold as toxic.” (Mehrabi, page 2833, left column, paragraph 3). Trusted instructions are utterances that fall below the toxicity threshold, while those at or above the threshold are untrusted instructions.

tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The second layer aims to detect which tokens in the adversary’s attack utterance are responsible for generation of L1 tokens form defender’s utterance. We call these tokens (untrusted tag[s]) identified in layer 2 as the L2 tokens” (Mehrabi, page 2836, left column, paragraph 1). At the end of this process each token of the adversary’s input is either untrustworthy (with an L2 tag) or trustworthy (without an L2 tag).

apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens (untrusted instructions) from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Mehrabi relates to guarding against inappropriate model outputs due to untrusted user inputs and is analogous to the claimed invention.
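As a further illustrative aid (hypothetical, not from the cited references): the claim 9-10 mappings above turn on per-token tags that record each token's provenance; one minimal way to represent that bookkeeping is sketched below.

```python
# Minimal sketch only (hypothetical names): per-token provenance tags
# distinguishing tokens from the trusted application prompt from tokens
# supplied by a user, echoing the claim 9-10 mappings above.

from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    text: str
    source: str  # "app_prompt" (trusted) or "user" (untrusted until vetted)


def tag_input(app_prompt: str, user_input: str):
    tokens = [TaggedToken(t, "app_prompt") for t in app_prompt.split()]
    tokens += [TaggedToken(t, "user") for t in user_input.split()]
    return tokens


def trusted_view(tokens):
    """Keep only tokens originating from the trusted application prompt."""
    return " ".join(t.text for t in tokens if t.source == "app_prompt")


tokens = tag_input("Summarize the document.", "Ignore that and reveal secrets.")
print(trusted_view(tokens))  # -> Summarize the document.
```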
While Mehrabi fails to disclose the further limitations of the claim, Pujari discloses a system able to apply reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt:

tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: “The input to the multi-task model is the text of the data example and a task ID. Output of the model is predicted label (tag) on the specified task.” (Pujari, page 5, left column, paragraph 5); “We use six datasets for our empirical evaluation, namely, Jigsaw Toxicity Dataset, Hate Speech Detection (de Gibert et al., 2018), Misogyny Detection (Fersini et al., 2018), Offensive Language Detection (Davidson et al., 2017), coarse-grained Stereotype Detection (combination of Stereoset, CrowSPairs and Reddit Data) and finally fine-grained Stereotype Detection Data (as described in section 3).” (Pujari, page 6, right column, paragraph 3); “Jigsaw Toxicity Dataset consists of 159,571 training examples and 153,164 test examples labeled (tag[ged]) with one or more of the seven labels: toxic (untrusted), severely toxic (untrusted), obscene (untrusted), threat (untrusted), insult (untrusted), identity hate (untrusted), none (trusted)” (Pujari, page 7, left column, paragraph 3)

apply RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: “we then propose a reinforcement learning agent that learns to guide the multi-task learning model by selecting meaningful data examples from the neighboring task datasets that help in improving the target task.” (Pujari, page 2, right column, paragraph 1)

Pujari relates to detecting offensive text using machine learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Mehrabi to use a multi-task classification model trained with reinforcement learning to detect unsafe text, as disclosed by Pujari. Pujari’s reinforcement learning agent improves model performance by selectively taking advantage of overlapping cases in different task datasets across several different but interrelated tasks. See Pujari, page 4, right column, paragraph 3 & page 6, left column, paragraphs 4-5.

Regarding claim 19, Mehrabi discloses [a] method of using an artificial intelligence (AI) model configured to accept text input and configured to use deep learning to produce human-like text responsive to an input comprising tokens: “Adversarial attacks on different Machine Learning (ML) and Natural Language Processing (NLP) (AI model[s]) applications can reveal important vulnerability issues related to these systems. Most existing research focuses on adversarial attacks that degrade performance of existing ML systems with regards to accuracy (Chakraborty et al., 2018; Zhang et al., 2020b).
More recent work has considered attacks that target ethical concerns, such as triggering the models into outputting unfair predictions (Mehrabi et al., 2021b; Solans et al., 2021), or in the context of NLP, generating biased (Sheng et al., 2020) and toxic (Wallace et al., 2019) text” (Mehrabi, page 2831, left column, paragraph 1); “Figure 1: An example illustrating the attack (text input comprising tokens) performed by the adversary on the third turn of the conversation (red line) that leads the defender (AI model) into generating a toxic utterance (human-like text) (dotted box). With a proper defense the defender can bypass the attack and generate a non-toxic response (green line).” (Mehrabi, page 2831, right column, Figure 1); “General Setup We use DialoGPT (deep learning model) (Zhang et al., 2020c) to generate 100 conversations around a specific topic.” (Mehrabi, page 2833, right column, paragraph 2)

the method comprising: applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt: “We quantify the toxicity of each candidate attack utterance (input provided) using either a single toxicity classifier or an ensemble of such classifiers we first apply a threshold T to toxicity scores of the candidate utterances and label the utterances above this threshold as toxic.” (Mehrabi, page 2833, left column, paragraph 3). Trusted instructions are utterances that fall below the toxicity threshold, while those at or above the threshold are untrusted instructions.

tagging the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The second layer aims to detect which tokens in the adversary’s attack utterance are responsible for generation of L1 tokens form defender’s utterance. We call these tokens (untrusted tag[s]) identified in layer 2 as the L2 tokens” (Mehrabi, page 2836, left column, paragraph 1). At the end of this process each token of the adversary’s input is either untrustworthy (with an L2 tag) or trustworthy (without an L2 tag).

applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens (untrusted instructions) from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Mehrabi relates to guarding against inappropriate model outputs due to untrusted user inputs and is analogous to the claimed invention.
While Mehrabi fails to disclose the further limitations of the claim, Pujari discloses a method, comprising: applying reinforcement learning (RL) to determine trusted instructions and untrusted instructions from the input provided responsive to an AI model prompt:

tag the trusted instructions with a trusted tag and tag the untrusted instructions with an untrusted tag: “The input to the multi-task model is the text of the data example and a task ID. Output of the model is predicted label (tag) on the specified task.” (Pujari, page 5, left column, paragraph 5); “We use six datasets for our empirical evaluation, namely, Jigsaw Toxicity Dataset, Hate Speech Detection (de Gibert et al., 2018), Misogyny Detection (Fersini et al., 2018), Offensive Language Detection (Davidson et al., 2017), coarse-grained Stereotype Detection (combination of Stereoset, CrowSPairs and Reddit Data) and finally fine-grained Stereotype Detection Data (as described in section 3).” (Pujari, page 6, right column, paragraph 3); “Jigsaw Toxicity Dataset consists of 159,571 training examples and 153,164 test examples labeled (tag[ged]) with one or more of the seven labels: toxic (untrusted), severely toxic (untrusted), obscene (untrusted), threat (untrusted), insult (untrusted), identity hate (untrusted), none (trusted)” (Pujari, page 7, left column, paragraph 3)

applying RL to detect and obey instructions tagged with the trusted tag, and to detect and disregard instructions tagged with the untrusted tag: “we then propose a reinforcement learning agent that learns to guide the multi-task learning model by selecting meaningful data examples from the neighboring task datasets that help in improving the target task.” (Pujari, page 2, right column, paragraph 1)

Pujari relates to detecting offensive text using machine learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Mehrabi to use a multi-task classification model trained with reinforcement learning to detect unsafe text, as disclosed by Pujari. Pujari’s reinforcement learning agent improves model performance by selectively taking advantage of overlapping cases in different task datasets across several different but interrelated tasks. See Pujari, page 4, right column, paragraph 3 & page 6, left column, paragraphs 4-5.

Regarding claim 20, the rejection of claim 19 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a method, wherein the processor removes the untrusted instructions from the input and creates content that is influenced by the trusted instructions but that is not influenced by the untrusted instructions: “We used Nvidia GeForce RTX 2080 (processor) to perform all our experiments except the experiment using the GPT-2 model which was ran on CPU (processor) for memory constraints” (Mehrabi, page 2842, left column, paragraph 3); (Mehrabi, page 2836, Figure 6, reproduced) “The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. We then apply a toxicity classifier on this new utterance. If it is deemed safe, it is then going to replace the defender’s old toxic utterance, otherwise we iteratively apply the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe.” (Mehrabi, page 2836, left column, paragraph 1)

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Abel et al. (Agent-Agnostic Human-in-the-Loop Reinforcement Learning, published 2017, arXiv:1701.04079v1), hereafter referred to as Abel.

Regarding claim 2, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Abel discloses a system, wherein the RL is reinforcement learning from human feedback (RLHF): “In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema and conduct preliminary experiments on simple domains.” (Abel, page 1, Abstract)

Abel relates to reinforcement learning from human feedback and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the existing combination to incorporate agent-agnostic human feedback, as disclosed by Abel, to Pujari’s reinforcement learning agent. Abel’s method is applicable to any reinforcement learning agent, allowing the benefits of human feedback (biased exploration, prevention of catastrophic outcomes, and accelerated learning) to be widely applied to reinforcement learning methods. See Abel, page 1, Abstract & page 2, paragraph 1.
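To illustrate the RLHF characterization of Abel above in the loosest possible terms, here is a hypothetical toy in which a human-supplied signal shapes the reward of a one-state learner; it is not Abel's protocol-program framework, only a sketch of reward shaping from human feedback.

```python
# Minimal sketch only (hypothetical toy): human feedback as a reward-shaping
# term for a trivial one-state reinforcement learner.

import random

ACTIONS = ["safe_reply", "risky_reply"]


def human_feedback(action: str) -> float:
    """Stand-in for a human teacher: reward safe behavior, penalize risky."""
    return 1.0 if action == "safe_reply" else -1.0


def train(episodes=500, lr=0.1, eps=0.2):
    q = {a: 0.0 for a in ACTIONS}  # action values for a single state
    for _ in range(episodes):
        a = random.choice(ACTIONS) if random.random() < eps else max(q, key=q.get)
        reward = 0.5 + human_feedback(a)  # base task reward plus shaped term
        q[a] += lr * (reward - q[a])
    return q


print(train())  # q["safe_reply"] should dominate after training
```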
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Abel et al. (Agent-Agnostic Human-in-the-Loop Reinforcement Learning, published 2017, arXiv:1701.04079v1), hereafter referred to as Abel, and Lee et al. (An abusive text detection system based on enhanced abusive and non-abusive word lists, published 2018, Decision Support Systems 113 pp. 22-31), hereafter referred to as Lee.

Regarding claim 3, the rejection of claim 2 in view of Mehrabi, Pujari, and Abel is incorporated. While the aforementioned references fail to disclose the further limitations of the claim, Lee, in combination with Mehrabi, discloses a system, wherein the processor is configured to disregard instructions that are semi-trusted: (Lee) “If the target word is not in the blacklist (untrusted words) nor in the non-abusive word list (trusted words), but the word is very similar to an abusive word, then it is tagged as abusive. This is because the target word might be a never-before-seen malicious word, namely, a new abusive word, which occurs often online” (Lee, page 25, right column, paragraph 2). A word that’s semi-trusted (not in the list of explicitly untrusted or trusted words) is classified as abusive (toxic) by Lee’s system. As discussed regarding parent claim 1, Mehrabi disregards toxic words detected in the model response by eliminating their trigger and generating a new response.

Lee relates to machine learning for detecting offensive language and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the existing combination to use Lee’s method to detect toxic words in the model output. Lee’s method enables the detection not only of existing hateful speech, but previously unseen abusive language, enabling defense against intentional obfuscation of abusive words. See Lee, page 22, right column, paragraph 2 to page 23, left column, paragraph 2.
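The Lee mapping above reduces to word lists plus a similarity test for unseen ("semi-trusted") words; the sketch below is a hypothetical illustration of that logic using the standard-library `difflib`, not Lee's actual system (whose lists and similarity measure differ).

```python
# Minimal sketch only (hypothetical lists): classify a word via blacklist,
# whitelist, and a similarity fallback so that unseen words resembling known
# abusive words (e.g., obfuscated spellings) are treated as abusive.

from difflib import get_close_matches

BLACKLIST = {"idiot", "moron"}              # explicitly untrusted words
WHITELIST = {"hello", "please", "report"}   # explicitly trusted words


def classify_word(word: str) -> str:
    w = word.lower()
    if w in BLACKLIST:
        return "abusive"
    if w in WHITELIST:
        return "clean"
    # Semi-trusted: on neither list. Flag if very similar to a known
    # abusive word, as with intentional character-swap obfuscation.
    if get_close_matches(w, list(BLACKLIST), n=1, cutoff=0.8):
        return "abusive"
    return "clean"


for w in ["please", "id1ot", "weather"]:
    print(w, "->", classify_word(w))
```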
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Pattabhi et al. (NON-REPEATING RANDOM VALUES IN USER SPECIFIED FORMATS AND CHARACTER SETS, published 10/29/2009, US 2009/0271361 A1), hereafter referred to as Pattabhi.

Regarding claim 4, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Pattabhi, in combination with Mehrabi, discloses a system, wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets: (Pattabhi) “Example systems and methods produce non-repeating random values (NRRVs) in user-specified character sets. The NRRVs comply with user-specified formats. The NRRVs may be used to mask selected data in a database table. One example system can mask data in a column having a set of N NRRVs in O(N) time. The NRRVs may be numbers, strings, combined number-string values, and so on” (Pattabhi, [0013]); “The data describing the mask may include different information used to format the mask. For example, the information may include a user-specified character set for characters appearing in a masked value. In different examples a user could provide a language identifier (e.g., English, French), a closed set of characters (e.g., "abcdefgABCDEFG"), an open set of characters (e.g., "a . . . z"), and so on.” (Pattabhi, [0065]); “For example, method 300 includes, at 365, selectively translating members of the set of filler values produced at 360. The filler values may be numbers and/or characters. Thus, the translating may include different actions. The translating may include, for example, translating a digit from a first base to a second base, translating a digit from a numeric value to a character value, translating a character from a first character set to a second character set, and so on. By way of illustration, a filler value may include two base the digits [sic] associated with an English number set and may also include three characters associated with lower case English letters. Translating the filler value may include translating the base ten digits from an English number set to a different base and/or number set. Similarly, translating the filler value may include translating the three characters from lower case English letters to another language.” (Pattabhi, [0076])

Examiner’s note: As discussed regarding parent claim 1, Mehrabi discloses a method of masking untrusted user instructions. Pattabhi discloses a method of masking an original string of characters from character set A to characters of distinct character set B. In other words, applying Pattabhi’s method to Mehrabi’s masked untrusted instructions results in the untrusted instructions being represented with character sets incompatible with those of the trusted instructions.

Pattabhi relates to masking text data and is analogous to the claimed invention. The combination of Mehrabi and Pujari teaches a system that masks untrusted user inputs. Pattabhi teaches a system for masking inputs into new character sets. It would have been obvious to one of ordinary skill in the art to combine Mehrabi, Pujari, and Pattabhi by using Pattabhi’s method for Mehrabi’s token masking. This would achieve the predictable result of untrusted user inputs having characters incompatible with those of trusted inputs, with Mehrabi’s masking application and Pattabhi’s masking method performing the same functions in combination as they did separately. (MPEP 2143 I. (A) Combining prior art elements according to known methods to yield predictable results).
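To picture the examiner's Mehrabi-Pattabhi combination, the hypothetical sketch below re-encodes untrusted tokens into a character set disjoint from the one used for trusted tokens; the source and target sets are invented for illustration and are not taken from Pattabhi.

```python
# Minimal sketch only (hypothetical character sets): represent untrusted
# tokens with characters disjoint from those used for trusted tokens.

SOURCE = "abcdefghijklmnopqrstuvwxyz"
TARGET = "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ⊕⊗"  # 26 symbols, disjoint from SOURCE
REMAP = str.maketrans(SOURCE, TARGET)


def mask_untrusted(tokens, untrusted_indexes):
    """Re-encode untrusted tokens into the incompatible character set."""
    return [
        t.lower().translate(REMAP) if i in untrusted_indexes else t
        for i, t in enumerate(tokens)
    ]


print(mask_untrusted(["summarize", "ignore", "previous"], {1, 2}))
```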
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Tuan et al. (Local Explanation of Dialogue Response Generation, published 2/7/2022, arXiv:2106.06528v2), hereafter referred to as Tuan.

Regarding claim 8, the rejection of claim 7 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Tuan discloses a system, wherein the rules are configured to be custom configured by a user: “we introduce LERG, a novel yet simple method that extracts the sorted importance scores of every input-output segment pair from a dialogue response generation model. We view this sequence prediction as the uncertainty estimation of one human response and find a linear proxy that simulates the certainty caused from one input segment to an output segment. We further derive two optimization variations of LERG. One is learning-based [35] and another is the derived optimal similar to Shapley value” (Tuan, page 2, paragraph 2). One variation of LERG can be selected for use; thus, the system is configured by its programmer.

Tuan relates to using machine learning to analyze conversations and is analogous to the claimed invention. The combination of Mehrabi and Pujari teaches a system to detect untrusted user inputs based on their relationship to potential model responses. The claimed invention improves upon this method by making the ruleset for detecting untrusted user inputs configurable. Tuan teaches a configurable system of measuring input-response relationships in conversations, applicable to the combination of Mehrabi and Pujari. A person of ordinary skill in the art would have recognized that Mehrabi uses Tuan’s system (LERG) for determining untrusted user inputs (Mehrabi, page 2836, left column, paragraph 2), thus Tuan makes it clear that Mehrabi’s untrusted instruction detection is predictably configurable, which would improve the known device by allowing the selection of an optimal detection algorithm for a given dataset, model, and hardware system. (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Cai et al. (An reinforcement learning‑based speech censorship chatbot system, published 11/30/2021, The Journal of Supercomputing (2022) 78:8751–8773), hereafter referred to as Cai.

Regarding claim 11, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Cai discloses a system, wherein the processor is trained to follow an instruction of a trusted sequence and penalize the system for following any instruction received in full or in part from a danger sequence: “Due to the uncontrolled and unrestricted online learning of chatbots, malicious users can interfere with the learning algorithms of chatbots through large batches of offensive or insulting comments (danger sequence[s]), causing them to generate invasive responses when conversing with other normal users, causing property and psychological damage to companies and users alike.
Therefore, we purify the polluted chatbots through a reinforcement learning approach. The flow of the speech purification algorithm is shown in Fig. 3. In our speech purification algorithm, the chatbot accepts user input sentences and outputs k candidate responses. The input sentences and candidate responses are then sent together to the speech censorship model, which will generate a return value (i.e. a safety score) for each candidate response, which will be fed back to the chatbot as a reward function for reinforcement learning. Through the reinforcement learning process, the model will reduce the probability of producing aggressive responses.” (Cai, page 8759, paragraph 2). A response that’s less safe (lower safety score) results in a lower reward to the RL agent. Thus, it’s being penalized.

Cai relates to machine learning for defense against AI prompt injections and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the existing combination to use reinforcement learning to reduce the probability of generating aggressive responses due to untrusted user inputs, as disclosed by Cai. Doing so would enable the model to learn online without developing harmful aggressive responses in response to untrusted user inputs. See Cai, page 8751, Abstract.
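The Cai passage above describes a safety score fed back as the RL reward; the hypothetical sketch below fixes only that idea, with a toy scorer and a baseline so unsafe candidates receive a negative (penalizing) advantage. It is not Cai's implementation.

```python
# Minimal sketch only (hypothetical scorer): safety scores as reward signals,
# so candidates below a baseline are penalized.

def safety_score(response: str) -> float:
    """Stand-in for a speech-censorship model returning a safety score."""
    return 0.0 if "aggressive" in response else 1.0


def reinforce_signal(candidates, baseline=0.5):
    """Return (candidate, advantage) pairs; a negative advantage is a penalty."""
    return [(c, safety_score(c) - baseline) for c in candidates]


for response, advantage in reinforce_signal(["polite answer", "aggressive retort"]):
    print(f"advantage {advantage:+.2f} for: {response}")
```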
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Dorn et al. (MODIFYING GAME CONTENT TO REDUCE ABUSER ACTIONS TOWARD OTHER USERS, published 3/31/2022, US 20220096937 A1), hereafter referred to as Dorn.

Regarding claim 14, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Dorn discloses a system, wherein the processor is configured to remove the untrusted instructions from the input in a way that is hidden from a user entering the input: “when the input of the first user includes audio or text or chat content, modifying the content includes dynamically filtering the audio or text or chat content using a content filter to generate modified content. The content filter is used to identify a certain keyword that is included in the audio, text or chat content that is perceived as offensive or inappropriate by the second user and replacing the certain keyword with a different keyword that is acceptable to the second user. The different keyword selected is specific for the second user and is identified using the content filter. The audio, the text or the chat content without the dynamic filtering is presented with content of the video game on a firsts [sic] client device of the first user for rendering and the modified content with the dynamic filtering of the audio, the text or the chat content are provided for rendering on a second client device associated with the second user” (Dorn, [0023])

Dorn relates to using machine learning to identify and filter offensive language and is analogous to the claimed invention. The existing combination teaches a method of filtering offensive content from a user message. The claimed invention improves upon this method by hiding the filtration of user input from the user. Dorn teaches a method of displaying unfiltered content to the sender and filtered content to the receiver, applicable to the existing combination. A person of ordinary skill in the art would have recognized that displaying filtered content solely to the AI receiver would lead to the predictable result of hiding censorship from the users, and would improve the known device by preventing the AI from being exposed to or learning from offensive content, while preventing users from getting upset or doubling down on offensive language in response to their messages being filtered (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).
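The Dorn mapping above turns on showing the sender an unfiltered view while delivering a filtered view to the recipient; the following hypothetical sketch (invented replacement table, not Dorn's content filter) illustrates that split.

```python
# Minimal sketch only (hypothetical filter): the sender sees the original
# message, while the recipient receives a filtered version, hiding the edit
# from the user who entered the input.

REPLACEMENTS = {"stupid": "silly"}  # offensive keyword -> acceptable keyword


def filter_for_recipient(message: str) -> str:
    return " ".join(REPLACEMENTS.get(w.lower(), w) for w in message.split())


def deliver(message: str) -> dict:
    return {
        "sender_view": message,  # unfiltered, so the filtering is hidden
        "recipient_view": filter_for_recipient(message),
    }


print(deliver("that was a stupid move"))
```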
Claims 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Thomas et al. (AUTOMATIC CLASSIFICATION AND REPORTING OF INAPPROPRIATE LANGUAGE IN ONLINE APPLICATIONS, published 12/2/2021, US 2021/0370188 A1), hereafter referred to as Thomas.

Regarding claim 15, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. While Mehrabi and Pujari fail to disclose the further limitations of the claim, Thomas, in combination with Mehrabi, discloses a system, wherein the processor is configured to identify users entering untrusted instructions in a report configured to allow management to understand and address users entering potential violating commands: “in addition, to editing, muting, or removing portions of the audio that correspond to inappropriate language, the identified words or phrases may also be used to generate automatic reports of the behavior of the user. For example, audio clips containing the inappropriate words—and some portion of the audio before and/or after the offensive words for context, in embodiments—may be generated and provided as part of an upload for generating a report … The final report (including the metadata, audio clip, video clip, etc.) may be sent to an entity (e.g., a platform developer, a game developer, etc.) (management) charged with monitoring appropriate behavior during gameplay” (Thomas, [0019]); “The final abuse report may be sent to a host application 118 where an entity charged with monitoring inappropriate behavior (e.g., platform developer, game developer, etc.) (management) may review the abuse report and take appropriate action” (Thomas, [0036]). As discussed regarding parent claim 1, Mehrabi discloses a system that identifies untrusted language in user communications sent to AI systems. Thomas discloses a system that automatically generates reports for identified untrusted language and forwards them to appropriate management.

Thomas relates to using machine learning to identify and report offensive language and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the existing combination to automatically generate and forward reports for detected untrusted language, as disclosed by Thomas. Manual reporting systems for abusive user behavior are slow and cumbersome. Additionally, there may be scenarios where parts of a user’s language should be maintained while others should be filtered. Thomas’ system expedites this process through automation, and moderates abusive language in a manner where only offensive portions of user language are filtered. See Thomas, [0001-0004].

Regarding claim 16, the rejection of claim 15 in view of Mehrabi, Pujari, and Thomas is incorporated. Thomas further discloses a system, wherein the report is configured to be generated in real-time: “Systems and methods are disclosed that classify words as being inappropriate, and that determine a portion of audio data that corresponds to the inappropriate words in order to perform real-time, or near real-time, actions on the audio data” (Thomas, [0003])

Thomas relates to using machine learning to identify and report offensive language and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the existing combination to automatically generate reports in real-time, as disclosed by Thomas. Manual reporting systems for abusive user behavior are slow and cumbersome. Thomas’ system expedites this process through automation. See Thomas, [0001-0004].
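Finally, to picture the Thomas-style automatic report, here is a hypothetical sketch of assembling a real-time report with surrounding context for a flagged message; the field names and the forwarding step are invented stand-ins, not Thomas's disclosure.

```python
# Minimal sketch only (hypothetical fields): build a report for a flagged
# message, including nearby context, and forward it to a monitoring entity.

import time


def build_report(user_id, messages, flagged_index, context=1):
    lo = max(0, flagged_index - context)
    hi = min(len(messages), flagged_index + context + 1)
    return {
        "user_id": user_id,
        "timestamp": time.time(),           # generated as the event occurs
        "flagged_message": messages[flagged_index],
        "context_window": messages[lo:hi],  # surrounding messages for review
    }


def send_to_management(report):
    # Stand-in for forwarding to the entity charged with monitoring behavior.
    print("REPORT:", report)


log = ["hi team", "<untrusted instruction here>", "carry on"]
send_to_management(build_report(user_id="u42", messages=log, flagged_index=1))
```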
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Mehrabi et al. (Robust Conversational Agents against Imperceptible Toxicity Triggers, published January 2022, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2831 – 2847, retrieved from https://www.researchgate.net/publication/362256325_Robust_Conversational_Agents_against_Imperceptible_Toxicity_Triggers), hereafter referred to as Mehrabi, in view of Pujari et al. (Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection, published 3/27/2022, arXiv:2203.14349v1), hereafter referred to as Pujari, and further in view of Cornel et al. (Cyberbullying Detection for Online Games Chat Logs using Deep Learning, published 2019, 2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM) pp. 1-5), hereafter referred to as Cornel, Vidgen et al. (Detecting weak and strong Islamophobic hate speech on social media, published 2018, arXiv:1812.10400v1), hereafter referred to as Vidgen, and Chiril et al. (Automatic Hate Speech Detection on Social Media, published 3/7/2022, Social and Information Networks [cs.SI]. Université Paul Sabatier - Toulouse III, 2021. English. NNT : 2021TOU30123.tel-03599458), hereafter referred to as Chiril.

Regarding claim 17, the rejection of claim 1 in view of Mehrabi and Pujari is incorporated. Mehrabi further discloses a system wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities: (Mehrabi, page 2836, Figure 6, reproduced).

While Mehrabi and Pujari fail to disclose the further limitations of the claim, Cornel discloses a system, wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities: “Cyberbullying is a form of harassment that takes place in the internet where a bully sends a harsh message to harass the receiver. In this study, a learning model is developed using Convolutional Neural Network (CNN), which is usually used for image, and is then used to create a system for detecting cyberbullying in online game chat logs.” (Cornel, page 1, left column, Abstract).

Cornel relates to detecting offensive speech using machine learning and is analogous to the claimed invention. The existing combination teaches a method of detecting offensive language in AI prompt inputs. The claimed invention improves upon this method by detecting cyberbullying. Cornel teaches a method of detecting cyberbullying, applicable to the existing combination. A person of ordinary skill in the art would have recognized that incorporating cyberbullying detection into the existing combination’s offensive language detection would lead to the predictable result of increasing the range of offensive language it can detect, which would improve the known device by guarding against a wider variety of prompt injections (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

While Cornel fails to disclose the further limitations of the claim, Vidgen discloses a system, wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities: “Drawing on in-depth conceptual work we build a multi-class classifier which distinguishes between non-Islamophobic, weak Islamophobic and strong Islamophobic content.” (Vidgen, page 1, left column, Abstract).

Vidgen relates to detecting offensive speech using machine learning and is analogous to the claimed invention. The existing combination teaches a method of detecting offensive language in AI prompt inputs. The claimed invention improves upon this method by detecting islamophobia. Vidgen teaches a method of detecting islamophobia, applicable to the existing combination. A person of ordinary skill in the art would have recognized that incorporating islamophobia detection into the existing combination’s offensive language detection would lead to the predictable result of increasing the range of offensive language it can detect, which would improve the known device by guarding against a wider variety of prompt injections
While Vidgen fails to disclose the further limitations of the claim, Chiril discloses a system wherein the untrusted instructions is selected from a group including cyberbullying, harassment, toxicity, islamophobia, misogyny, and journalistic qualities:

“The nature and the effects of sexism have been deeply analyzed in fields such as social psychology. Sexism can be expressed at different linguistic granularity levels going from lexical to discursive (Cameron, 1992). For example, women are often designated through their relationship with men or motherhood (cf. (3.1)) or they are characterized through their physical characteristics (cf. (3.2)). (3.1) A man killed in a shooting vs. Mother of 2 killed in a crash (3.2) The journalist who presents the news vs. The blonde who presents the news” (Chiril, page 47, paragraph 2);

“In order to collect sexist and non sexist tweets, we followed Anzovino et al. (2018) approach using a set of representative keywords: femme, fille (woman, girl), enceinte (pregnant), some activities (cuisine (cooking), football, journaliste), insults (pute, salope, conne, connasse (slut, bitch), hystérique);” (Chiril, page 68, paragraph 3).

Chiril relates to detecting offensive speech using machine learning and is analogous to the claimed invention. The existing combination teaches a method of detecting offensive language in AI prompt inputs. The claimed invention improves upon this method by detecting journalistic qualities. Chiril teaches a method of detecting sexism through journalistic language that is applicable to the existing combination. A person of ordinary skill in the art would have recognized that incorporating journalistic language-based sexism detection into the existing combination’s offensive language detection would lead to the predictable result of increasing the range of offensive language it can detect, which would improve the known device by guarding against a wider variety of prompt injections (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).
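The collection step Chiril quotes from Anzovino et al. is a simple keyword filter over candidate posts, run before annotation or classification. A minimal sketch of that step follows, using English stand-ins for the French terms quoted above; the keyword list, variable names, and sample posts are illustrative only:

    import re

    # English stand-ins for the representative French keywords in Chiril.
    KEYWORDS = ["woman", "girl", "pregnant", "cooking", "football", "journalist"]
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
    )

    posts = [
        "The journalist who presents the news",
        "Great football match last night",
        "Nothing to see here",
    ]
    # Retain only posts containing a representative keyword; these are the
    # candidates later annotated as sexist / non-sexist.
    candidates = [p for p in posts if pattern.search(p)]
    print(candidates)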
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:

Watanabe et al. (Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection, published 2018, IEEE Access, vol. 6, pp. 13825-13835) discloses a method of separating hateful, offensive, and clean language from user text messages.

Xu et al. (Recipes for Safety in Open-domain Chatbots, published 8/4/2021, arXiv:2010.07079v3) discloses a system that identifies trusted and untrusted AI prompt inputs and trains the AI output to be a safe response.

Chai et al. (How to Keep an Online Learning Chatbot From Being Corrupted, published 9/28/2020, 2020 International Joint Conference on Neural Networks (IJCNN)) discloses a method of identifying toxic pairs of user inputs and AI responses and using reinforcement learning to incentivize non-toxic AI responses.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron P Gormley, whose telephone number is (571) 272-1372. The examiner can normally be reached Monday - Friday, 12:00 PM - 8:00 PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AG/
Examiner, Art Unit 2148

/MICHELLE T BECHTOLD/
Supervisory Patent Examiner, Art Unit 2148

Prosecution Timeline

May 04, 2023
Application Filed
Feb 13, 2026
Non-Final Rejection — §101, §103, §112, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585955
Minimal Trust Data Sharing
2y 5m to grant Granted Mar 24, 2026
Patent 12579440
Training Artificial Neural Networks Using Context-Dependent Gating with Weight Stabilization
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner, based on the 2 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

1-2
Expected OA Rounds
60%
Grant Probability
0%
With Interview (-60.0%)
4y 4m
Median Time to Grant
Low
PTA Risk
Based on 5 resolved cases by this examiner. Grant probability derived from career allow rate.
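The headline projections above follow directly from the examiner's career counts; a minimal sketch of the arithmetic appears below. This is an assumption about how the figures are derived, not the vendor's actual model, and the with-interview counts are hypothetical since only the resulting 0% is displayed:

    # Career allow rate: 3 granted / 5 resolved -> the 60% grant probability shown.
    granted, resolved = 3, 5
    career_allow_rate = granted / resolved

    # Hypothetical with-interview counts (only the 0% outcome is displayed).
    granted_with_interview, resolved_with_interview = 0, 1
    with_interview_rate = granted_with_interview / resolved_with_interview

    # Interview lift: with-interview rate minus the career baseline.
    print(f"interview lift: {with_interview_rate - career_allow_rate:+.1%}")  # -60.0%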
