DETAILED ACTION
This action is in response to the claims filed 11/12/2025 for Application number 17/808,511. Claims 1-2, 4-5, 9-10, 12-14, and 17-20 have been amended, claims 7 and 15 have been canceled, and claims 21-22 are new. Thus, claims 1-6, 8-14, and 16-22 are currently pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 6, 8-10, 14, 16-18, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. ("Inverse Reinforcement Learning with Natural Language Goals", cited by Applicant in the IDS filed 06/23/2022) in view of Guo et al. ("CN 115688734 A", hereinafter "Guo") and further in view of Zoph et al. ("Neural Architecture Search with Reinforcement Learning", hereinafter "Zoph").
Regarding claim 1, Zhou teaches A computer-implemented method for modifying a training dataset, comprising:
benchmarking the training dataset using a State Of The Art (SOTA) neural network to determine a benchmark for the training dataset (“In this problem setting, the policy and the reward function are conditioned on a NL goal G” (the goal corresponds to the “benchmark”) … “We use soft actor-critic (SAC)” (corresponds to the “SOTA neural network”), “one of the state-of-the-art off-policy RL algorithms” [pg. 2, §3.2, Inverse Reinforcement Learning with Natural Language Goals, ¶1]);
dividing the training dataset into a plurality of slices (“The dataset is split into train (61 environments and 14,025 instructions), seen validation (61 environments same as train set, and 1,020 instructions), unseen validation (11 new environments and 2,349 instructions), and test (18 new environments and 4,173 instructions)” [pg. 4, §4. Experiments, ¶1]);
selecting, using a selection strategy generator operating on one of the plurality of slices, a sequence of a plurality of [atomic operations] (“A goal generator (“selection strategy generator”) takes a trajectory τ (sequence of states and actions) as input” (taking a sequence of actions corresponds to “selecting a sequence of [atomic operations]”) “and outputs a NL goal G. Given expert demonstrations, a straightforward way of learning a goal generator is to train an encoder-decoder model that encodes a trajectory and decodes a NL goal.” [pg. 3, left col, ¶2]);
performing reverse reinforcement learning on the revised one of the plurality of slices using the benchmark and the SOTA neural network (“In this problem setting, the policy and the reward function are conditioned on a NL goal G” (“benchmark”) … “We use soft actor-critic (SAC)” (corresponds to the “SOTA neural network”), “one of the state-of-the-art off-policy RL algorithms, to optimize policy π given the reward function gθ(s, a, G).” [pg. 2, §3.2, Inverse Reinforcement Learning with Natural Language Goals, ¶1]), and wherein a reward corresponding to the reverse reinforcement learning is generated based at least in part on the benchmark (“utilizing goal relabeling and sampling strategies to improve the generalization of both the policy and the reward function to new NL goals and new environments,” [pg. 2, top para]);
modifying the training dataset by replacing the one of the plurality of slices with the revised one of the plurality of slices to generate a modified training dataset (“A good discriminator learns a good reward function that generalizes well to new goals and new environments such that the learned policy can also well generalize to new goals (“replacing”) and new environment. To improve the discriminator, we propose to augment the positive examples of the discriminator by relabeling (corresponds to the “modifying by replacing” limitation) the goals of the sampled trajectory” [pg. 3, top right col]); and
training the SOTA neural network using the modified training dataset (“To update the discriminator, we sample negative (s, a, s′, G) examples from the replay buffer and sample positive (s, a, s′, G) examples from the expert demonstrations. To update the policy, we sample a batch of (s, a, s′, G) from the replay buffer and use gθ to estimate their rewards; then we update the Q- and policy network using these reward-augmented samples” [pg. 2, §3.2 Inverse Reinforcement Learning with Natural Language Goals, ¶2]).
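For orientation, the update scheme quoted above can be sketched in code. This is a minimal illustration, not Zhou's implementation: the network shapes, the optimizers, and the simplified weighted-regression step standing in for the SAC actor update are all assumptions.

# Minimal sketch of the quoted loop: the discriminator g_theta is trained with
# positive (s, a, s', G) examples from expert demonstrations and negative
# examples from the replay buffer, and its output then serves as the reward
# for the policy update. Shapes and networks are simplified assumptions.
import torch
import torch.nn as nn

STATE, ACT, GOAL = 8, 2, 16  # assumed feature sizes

disc = nn.Sequential(nn.Linear(STATE + ACT + GOAL, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(STATE + GOAL, 64), nn.ReLU(), nn.Linear(64, ACT))
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def features(s, a, g):
    return torch.cat([s, a, g], dim=-1)

def update(expert_batch, replay_batch):
    # Discriminator step: expert transitions are positives, replay negatives.
    pos, neg = disc(features(*expert_batch)), disc(features(*replay_batch))
    d_loss = bce(pos, torch.ones_like(pos)) + bce(neg, torch.zeros_like(neg))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Policy step: the learned g_theta scores the replay batch, and those
    # scores are used as rewards (a crude stand-in for the SAC actor update).
    s, a, g = replay_batch
    with torch.no_grad():
        reward = disc(features(s, a, g))
    pred_a = policy(torch.cat([s, g], dim=-1))
    p_loss = (((pred_a - a) ** 2).sum(-1, keepdim=True) * reward).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()

batch = lambda: (torch.randn(32, STATE), torch.randn(32, ACT), torch.randn(32, GOAL))
update(batch(), batch())  # one discriminator/policy update with random stand-in data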
However, Zhou fails to explicitly teach a sequence of a plurality of atomic operations;
applying the sequence of the plurality of atomic operations to modify the one of the plurality of slices to generate a revised one of the plurality of slices; and
wherein the benchmark corresponds to an accuracy of the SOTA neural network using the revised one of the plurality of slices.
Guo teaches a sequence of a plurality of atomic operations (“the atomic operation sequence comprises a plurality of atomic operations performed in order, the atomic operation comprises a replacement operation for the word, an insertion operation, deleting one of the operations” [pg. 2, ¶4]).
Guo further teaches applying the sequence of the plurality of atomic operations to modify the one of the plurality of slices to generate a revised one of the plurality of slices (“using the atomic operation sequence to operate the text sample to be operated, in the operation process, using blank word to insert operation in the atomic operation sequence, deleting operation and replacing operation are uniformly replaced operation;” [pg. 2, ¶5; a revised slice would be inherently generated after an atomic operation is performed.]).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Zhou’s teachings by substituting the action sequences with the atomic operation sequences as taught by Guo. One would have been motivated to make this modification to improve the robustness, security and analyzing capability of the machine learning model/algorithm. [bottom of pg. 1 – top of pg. 2, Guo]
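To illustrate the “atomic operation sequence” concept quoted from Guo, a minimal sketch follows. The tuple encoding of operations and the "[BLANK]" token are illustrative assumptions, not Guo's actual scheme.

# Apply an ordered sequence of atomic operations (replace, insert, delete)
# to a tokenized text sample, yielding a revised sample.
def apply_atomic_ops(tokens, ops):
    tokens = list(tokens)  # work on a copy so the original slice is preserved
    for kind, pos, word in ops:
        if kind == "replace":
            tokens[pos] = word
        elif kind == "insert":
            tokens.insert(pos, "[BLANK]")  # Guo describes inserting a blank word
        elif kind == "delete":
            del tokens[pos]
    return tokens

sample = ["the", "model", "reads", "the", "text"]
ops = [("replace", 1, "network"), ("insert", 2, None), ("delete", 3, None)]
print(apply_atomic_ops(sample, ops))  # ['the', 'network', '[BLANK]', 'the', 'text']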
Although Zhou does disclose the use of a benchmark, the reference does not explicitly disclose that the benchmark corresponds to an accuracy of the SOTA neural network; thus, Zhou/Guo fails to explicitly teach wherein the benchmark corresponds to an accuracy of the SOTA neural network using the revised one of the plurality of slices.
Zoph teaches wherein the benchmark corresponds to an accuracy of the SOTA neural network using the revised one of the plurality of slices (“Training the network specified by the string– the “child network”– on the real data will result in an accuracy on a validation set. Using this accuracy as the reward signal, we can compute the policy gradient to update the controller. As a result, in the next iteration, the controller will give higher probabilities to architectures that receive high accuracies.” [pg. 2, top para; note: Guo teaches the revised one of the plurality of slices, while Zoph teaches training and updating the neural network using training data; thus, the combination would read on the recited limitation.])
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Zhou’s/Guo’s teachings to use a benchmark that corresponds to an accuracy of a neural network, as taught by Zoph. One would have been motivated to make this modification in order to maximize the accuracy of the neural network, which works well for many difficult learning tasks in image, speech, and natural language understanding. [Abstract, Zoph]
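For context, the accuracy-as-reward loop Zoph describes can be sketched schematically. Everything here is an assumption for illustration: train_child_and_get_accuracy is a hypothetical stand-in for the expensive train-and-validate step, and the update is a crude policy-gradient-style rule rather than Zoph's RNN controller.

import math
import random

def train_child_and_get_accuracy(arch):
    # Hypothetical stand-in: would train the "child network" described by
    # arch and return its validation accuracy.
    return random.random()

choices = ["conv3x3", "conv5x5", "maxpool", "identity"]
prefs = {c: 0.0 for c in choices}  # controller preferences, one per operation

def sample_arch(n_layers=4):
    weights = [math.exp(prefs[c]) for c in choices]  # softmax-style sampling
    return random.choices(choices, weights=weights, k=n_layers)

baseline = 0.5
for step in range(100):
    arch = sample_arch()
    acc = train_child_and_get_accuracy(arch)  # validation accuracy is the reward
    advantage = acc - baseline
    for op in arch:
        prefs[op] += 0.1 * advantage  # favor ops that appear in accurate children
    baseline = 0.9 * baseline + 0.1 * acc  # running baseline reduces variance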
Regarding claim 2, Zhou/Guo/Zoph teaches The method of claim 1, wherein the reverse reinforcement learning includes:
Zhou teaches the SOTA neural network being an environment (“A task is defined as a pair (E, G), where E is an environment that the agent can interact with and G is a NL goal that the agent has to fulfill” [pg. 2, 2 Problem Formulation, ¶1]),
and
Guo teaches the modifying the one of the plurality of slices being an action (“using the atomic operation sequence to operate the text sample to be operated, in the operation process, using blank word to insert operation in the atomic operation sequence, deleting operation and replacing operation are uniformly replaced operation;” [pg. 2, ¶5])
The same motivation to combine the teachings of Zhou/Guo/Zoph as set forth for claim 1 applies.
Regarding claim 6, Zhou/Guo/Zoph teaches The method of claim 1,
Zhou teaches wherein the performing the reverse reinforcement learning is performed for a plurality of iterations (“The process from Equation (2) to Equation (5) can be seen as using G_1 and τ as seed, and iteratively sample τ_v from G_v using the policy, and then sample G_(v+1) from τ_v using the goal generator” [pg. 3, right col, ¶3]).
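The iterative scheme quoted above can be written compactly. The rollout and goal_gen stubs below are hypothetical placeholders so the sketch runs; they stand in for the learned policy and goal generator.

# Starting from seed goal G_1, alternately sample a trajectory tau_v from the
# current goal G_v with the policy, then sample G_(v+1) from tau_v with the
# goal generator.
def iterate_goals(rollout, goal_generator, goal, n_iters=5):
    goals = [goal]
    for _ in range(n_iters):
        tau = rollout(goal)            # tau_v sampled from G_v
        goal = goal_generator(tau)     # G_(v+1) sampled from tau_v
        goals.append(goal)
    return goals

rollout = lambda g: [(g, "action")] * 3                    # stub trajectory
goal_gen = lambda tau: "goal({} steps)".format(len(tau))   # stub generator
print(iterate_goals(rollout, goal_gen, "G_1"))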
Regarding claim 8, Zhou/Guo/Zoph teaches The method of claim 1, Zhou teaches wherein the plurality of slices includes more than two slices, at least two of the plurality of slices are modified, and one of the plurality of slices is unmodified. (“The dataset is split into train (61 environments and 14,025 instructions), seen validation (61 environments same as train set, and 1,020 instructions), unseen validation (11 new environments and 2,349 instructions), and test (18 new environments and 4,173 instructions)” [pg. 4, §4. Experiments, ¶1])
Regarding claim 9, it is substantially similar to claim 1 and is rejected in the same manner, with the same art and reasoning applying.
Regarding claim 10, it is substantially similar to claim 2 and is rejected in the same manner, with the same art and reasoning applying.
Regarding claims 14 and 16, they are substantially similar to claims 6 and 8, respectively, and are rejected in the same manner, with the same art and reasoning applying.
Claim 17 recites features similar to claim 1 and is rejected for at least the same reasons set forth therein. Claim 17 additionally requires A computer program product, comprising: a computer readable storage medium having stored therein program code for training a training dataset, the program code, which when executed by a computer hardware system, cause the computer hardware system to perform (“One run includes 1 million interactions with the environments and takes about 30 hours to train on a NVIDIA V100 GPU” [pg. 12, left col]).
Regarding claim 18, it is substantially similar to claim 2 and is rejected in the same manner, with the same art and reasoning applying.
Regarding claim 21, it is substantially similar to claim 6 and is rejected in the same manner, with the same art and reasoning applying.
Regarding claim 22, it is substantially similar to claim 8 and is rejected in the same manner, with the same art and reasoning applying.
Claims 3-5, 11-13 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou in view of Guo and Zoph and further in view of Wang et al. ("Feature Extraction and Analysis of Natural Language Processing for Deep Learning English Language", hereinafter "Wang").
Regarding claim 3, Zhou/Guo/Zoph teaches The method of claim 1, where Zhou teaches wherein the selection strategy generator includes a long short term memory (LSTM) neural network (“The natural language goal is encoded by a LSTM model:” [pg. 10, § Base Model]).
However, Zhou/Guo/Zoph fails to explicitly teach a conditional random field (CRF) layer.
Wang teaches a conditional random field (CRF) layer. (“We input the part-of-speech probability distribution of the words into the conditional random field layer, perform sequence labeling at the sentence level, and obtain the final word segmentation processing result.” [pg. 46342, left col, step (4)])
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Zhou’s/Guo’s/Zoph’s teachings to use a conditional random field layer as taught by Wang. One would have been motivated to make this modification because conditional random field models are traditionally used to resolve long-distance dependencies in text and to capture the user’s true intentions. [pg. 46336, right col, ¶2, Wang]
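Purely to illustrate the claimed combination of an LSTM-based generator (as in Zhou) with a CRF labeling layer (as in Wang), a minimal sketch follows. The sizes and the use of the third-party pytorch-crf package are assumptions; the code is not taken from any cited reference.

# A bidirectional LSTM encodes the token sequence; a CRF layer performs
# sentence-level sequence labeling over the per-token emission scores.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: pip install pytorch-crf

VOCAB, EMB, HID, TAGS = 1000, 64, 128, 5  # assumed sizes

class LstmCrfTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * HID, TAGS)     # per-token tag scores
        self.crf = CRF(TAGS, batch_first=True)   # sentence-level labeling layer

    def loss(self, tokens, tags):
        emissions = self.emit(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags)        # negative log-likelihood

    def predict(self, tokens):
        emissions = self.emit(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions)        # best tag sequence per sentence

model = LstmCrfTagger()
tokens = torch.randint(0, VOCAB, (2, 7))  # batch of 2 sentences, 7 tokens each
tags = torch.randint(0, TAGS, (2, 7))
print(model.loss(tokens, tags).item(), model.predict(tokens))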
Regarding claim 4, Zhou/Guo/Zoph/Wang teaches The method of claim 3, where Guo teaches wherein the sequence of the plurality of atomic operations generated by the selection strategy generator includes at least one of a group consisting of: a mask atomic operation and an out of order atomic operation. (“using blank word to insert operation in the atomic operation sequence” [pg. 2, ¶5; “blank word” corresponds to a “mask”])
The same motivation to combine the teachings of Zhou/Guo/Zoph/Wang as set forth for claim 3 applies.
Regarding claim 5, Zhou/Guo/Zoph/Wang teaches The method of claim 3, where Guo teaches wherein the sequence of the plurality of atomic operations generated by the selection strategy generator includes at least one of a group consisting of: a data deletion atomic operation, a data copy atomic operation, and a hidden layer transition atomic operation (“deleting operation” [pg. 2, ¶5]).
The same motivation to combine the teachings of Zhou/Guo/Zoph/Wang as set forth for claim 3 applies.
Regarding claims 11-13, they are substantially similar to claims 3-5, respectively, and are rejected in the same manner, with the same art and reasoning applying.
Regarding claims 19-20, they are substantially similar to claims 3-5 and are rejected in the same manner, with the same art and reasoning applying.
Response to Arguments
Regarding the 35 U.S.C. §101 Rejection:
Applicant’s arguments regarding the previous 101 rejection have been considered and are persuasive. In light of the newly amended claims and arguments provided by the applicant, the previous 101 rejection has been withdrawn.
Regarding the 35 U.S.C. §103 Rejection:
Applicant argues that the previously applied prior art references fail to explicitly teach the newly amended limitations of the independent claims; however, these arguments are not found to be persuasive. Examiner asserts that the previously applied prior art of Zhou still teaches some of the limitations that were amended into claim 1, such as “wherein a reward corresponding to the reverse reinforcement learning is generated based at least in part on the benchmark” and “training the SOTA neural network using the modified training dataset”. No substantive arguments were provided with regard to why Zhou would not read on these particular limitations. Additionally, the newly amended limitation of “wherein the benchmark corresponds to an accuracy of the SOTA neural network using the revised one of the plurality of slices” is now taught by the newly presented prior art of Zoph in combination with Zhou/Guo. Please see the updated 103 rejection above.
Applicant’s arguments with respect to the rejections of the dependent claims have been fully considered, but they are not persuasive because they rely upon the allowability of the independent claims.
Conclusion
Applicant's amendment necessitated the new ground of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL H HOANG/Examiner, Art Unit 2122