Prosecution Insights
Last updated: April 19, 2026
Application No. 18/171,845

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING COMPUTER PROGRAM PRODUCT

Non-Final OA: §101, §102, §103
Filed
Feb 21, 2023
Examiner
NYE, LOUIS CHRISTOPHER
Art Unit
2141
Tech Center
2100 — Computer Architecture & Software
Assignee
Kabushiki Kaisha Toshiba
OA Round
1 (Non-Final)
22%
Grant Probability
At Risk
1-2
OA Rounds
3y 2m
To Grant
58%
With Interview

Examiner Intelligence

Grants only 22% of cases
22%
Career Allow Rate
2 granted / 9 resolved
-32.8% vs TC avg
Strong +36% interview lift
+35.7%
Interview Lift
with vs. without interview, across resolved cases
Typical timeline
3y 2m
Avg Prosecution
27 currently pending
Career history
36
Total Applications
across all art units

Statute-Specific Performance

§101
38.3%
-1.7% vs TC avg
§103
50.0%
+10.0% vs TC avg
§102
7.8%
-32.2% vs TC avg
§112
3.9%
-36.1% vs TC avg
Black line = Tech Center average estimate • Based on career data from 9 resolved cases

Office Action

§101 §102 §103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-16 is/are rejected under 35 U.S.C. 101 because they are directed to an abstract idea without significantly more. Regarding claims 1-16, Applying step 1: The preamble of claims 1-14 recite a device, which falls within the statutory category of an apparatus. The preamble of claim 15 recites a method, which falls within the statutory category of a process. The preamble of claim 16 recites a non-transitory computer readable medium, which falls within the statutory category of a manufacture. Regarding claim 1, Step 2A – Prong One: Claim 1 recites: An information processing device comprising: one or more hardware processors configured to function as: an acquisition unit that acquires a current state of a device; a first action value function specifying unit that has a function of learning a first inference model by reinforcement learning, and specifies a first action value function of the device on a basis of the current state and the first inference model; a second action value function specifying unit that specifies a second action value function of the device on a basis of the current state and a second inference model that is not a parameter update target; and an action determination unit that determines a first action of the device on a basis of the first action value function and the second action value function. The broadest reasonable interpretation of the bolded limitations above, in light of the specification, are abstract ideas directed to mathematical concepts. The limitations of “a first action value specifying unit that specifies a first action value function of the device on a basis of the current state” and “a second action value function specifying unit that specifies a second action value function of the device on a basis of the current state” are mathematical relationships and thus fall within the mathematical concepts grouping of abstract ideas. The limitation of “an action determination unit that determines a first action of the device on a basis of the first action value function and the second action value function” is a mathematical calculation and thus falls within the mathematical concepts grouping of abstract ideas. Step 2A – Prong One (Yes). Step 2A – Prong Two: The additional element of claim 1 regarding “an acquisition unit that acquires a current state of a device;” is insignificant extra-solution activity that amounts to no more than mere data gathering (See MPEP 2106.05(g)). The additional elements of claim 1 regarding “An information processing device comprising: one or more hardware processors configured to function as:”, “a function of learning a first inference model by reinforcement learning”, “the first inference model”, and “a second inference model that is not a parameter update target” are instructions to apply the judicial exception on a generic computer (See MPEP 2106.05(f)). Even when viewed in combination, the additional elements do not integrate the judicial exception into a practical application. Step 2A – Prong Two (No). 
Step 2B: The additional element of claim 1 regarding “an acquisition unit that acquires a current state of a device;” is insignificant extra-solution activity that amounts to no more than mere data gathering (See MPEP 2106.05(g)). Data gathering is a well-understood, routine conventional activity as recognized by the courts (See MPEP 2106.05(d)(II)). The additional elements of claim 1 regarding “An information processing device comprising: one or more hardware processors configured to function as:”, “a function of learning a first inference model by reinforcement learning”, “the first inference model”, and “a second inference model that is not a parameter update target” are instructions to apply the judicial exception on a generic computer (See MPEP 2106.05(f)). The computer is recited at a high level of generality and imposes no meaningful limitations on the claim. Even when viewed in combination, the additional elements do not amount to significantly more than the judicial exception. Step 2B (No). Claim 1 is ineligible. Regarding claims 15-16, These claims contain substantively all the limitations of claim 1 in a process and non-transitory computer readable medium, and are rejected under similar rationale. The processors and memory recited in these claims are also generic computing components. Claims 15 and 16 are ineligible. Dependent claims: Claims 2-10: These claims recite further abstract ideas (mathematical concepts) and thus are ineligible. Claims 11-13: Recite further additional elements that amount to instructions to apply the judicial exception on a generic computer (See MPEP 2106.05(f)). The computer is recited at a high level of generality and imposes no meaningful limitations on the claim. These additional elements do not integrate the judicial exception into practical application and do not amount to significantly more than the judicial exception, and thus these claims are ineligible. Claims 14: Recite further additional elements that are insignificant extra-solution activities that amount to no more than mere data gathering (See MPEP 2106.05(g)). Data gathering is a well-understood, routine conventional activity as recognized by the courts (See MPEP 2106.05(d)(II)). These additional elements do not integrate the judicial exception into practical application and do not amount to significantly more than the judicial exception, and thus the claim is ineligible. Claim Rejections - 35 USC § 102 The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1-5 and 12-16 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Fuji et al. (NPL From IDS: US Pub. No. 2018/0181089, published June 2018, hereinafter Fuji). 
Regarding claim 1, Fuji teaches an information processing device comprising: one or more hardware processors configured to function as (Fuji, [0035] – “The control device 4 can be configured on, for example, a general-purpose computer, and a hardware configuration (not illustrated) of the control device 4 includes an arithmetic unit configured by a central processing unit (CPU), a random access memory (RAM), and the like, a storage unit configured by a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD) using a flash memory or the like, and the like, a connection device of a parallel interface format or a serial interface format, and the like.”): an acquisition unit that acquires a current state of a device (Fuji, [0027] – “the control device 4 according to the present embodiment includes a state acquisition unit 51 that processes input values from at least one sensor 2 or the like mounted inside the machine and determines state values that are output to control units 11 to 1n.sub.2 and 21 to 2n.sub.2 and a learning unit 71,” – teaches an acquisition unit that acquires a current state of a device (state acquisition unit 51 acquires state values from control device 4)); a first action value function specifying unit that has a function of learning a first inference model by reinforcement learning, and specifies a first action value function of the device on a basis of the current state and the first inference model (Fuji, [0028] – “The control device 4 according to the present embodiment operates the control units 11 to 1n.sub.2 identifying the control models 31 to 3n.sub.1 by learning and the control units 21 to 2n.sub.2 having one or more existing control models 41 to 4n.sub.2, which are illustrated in FIG. 1 in parallel to output the action value and the action of each of the control units 11 to 1n.sub.2 and 21 to 2n.sub.2 to the action value selection unit 61,” – teaches a first action value function specifying unit that has a function of learning a first inference model, and specifies a first action value function of the device on a basis of the current state and the first inference model (device operates control units identifying control models 31, models are identified by learning and output an action value based on a basis of current state, as state is input to control units in [0027], and the first inference model), and in [0050] – “In the present embodiment, an example using Q learning in reinforcement learning is illustrated as a learning method for acquiring a control model.” – teaches a function of learning a first inference model by reinforcement learning (uses Q-learning in reinforcement learning as learning method for acquiring a control model)); a second action value function specifying unit that specifies a second action value function of the device on a basis of the current state and a second inference model that is not a parameter update target (Fuji, [0028] – “The control device 4 according to the present embodiment operates the control units 11 to 1n.sub.2 identifying the control models 31 to 3n.sub.1 by learning and the control units 21 to 2n.sub.2 having one or more existing control models 41 to 4n.sub.2, which are illustrated in FIG. 
1 in parallel to output the action value and the action of each of the control units 11 to 1n.sub.2 and 21 to 2n.sub.2 to the action value selection unit 61,” – teaches a second action value specifying unit (control models 41) that specifies a second action value function of the device (outputs action value of control unit 21) on a basis of current state (state is input to control unit at [0027]), and a second inference model, and in [0058] – “One control unit 11a that updates a parameter of the control model 31a and a control unit 21a having one existing control model 41a are operated in parallel.” – teaches that the second inference model is not a parameter update target (updates first inference model and existing second inference model is not a parameter update target)); and an action determination unit that determines a first action of the device on a basis of the first action value function and the second action value function (Fuji, [0028] – “The control device 4 according to the present embodiment operates the control units 11 to 1n.sub.2 identifying the control models 31 to 3n.sub.1 by learning and the control units 21 to 2n.sub.2 having one or more existing control models 41 to 4n.sub.2, which are illustrated in FIG. 1 in parallel to output the action value and the action of each of the control units 11 to 1n.sub.2 and 21 to 2n.sub.2 to the action value selection unit 61, outputs a control output value (action) selected by the action value selection unit 61 to at least one actuator 3 or the like mounted inside a machine, and updates the parameters of the control models 31 to 3n.sub.1 of the learning destination control units 11 to 1n.sub.1, based on observation data output from the sensor 2 and the selected action value.” – teaches an action determination unit (action selection unit 61) that determines a first action of the device on a basis of the first action value function and the second action value function (action value selection unit 61 selects one of the action values output by control models 31 and 41)). Claims 15 and 16 incorporate substantively all the limitations of claim 1 in a method and non-transitory computer-readable storage medium and are rejected on the same grounds as above. Regarding claim 2, Fuji teaches the information processing device according to claim 1, wherein the action determination unit selects, as a third action value function, any one of the first action value function and the second action value function, and determines the first action on a basis of the selected third action value function (Fuji, Eq. 3 and [0055] – “In general Q learning, the Q learning is updated by selecting the action with the highest action value in a certain state, but in Formula (2) and Formula (3), an action is selected by comparing the maximum action values of the synthesis destination control model Q.sub.Z and the existing control model Q.sub.A. At least one of the respective control models is required.” – a third action value function which is any one of the first action value function and second action value function, and determines the first action on a basis of the selected third action value function (the determined first action is based on the third action value function which is the maximum of the first and second action value functions, action is selected based on the comparison)). 
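The claim 2 mapping just above turns on Fuji, as characterized in this Office Action, comparing the maximum action values of the learned and existing models and acting on whichever is larger. The sketch below illustrates that selection step only; it is written from the characterization above, not from Fuji's actual Formulas (2)-(3), and the function and variable names are hypothetical.

```python
# Illustrative sketch of the claim 2 reading of Fuji: the "third" action value
# function is whichever of the first (learned) or second (existing) functions
# has the larger maximum in the current state, and the first action is its argmax.
import numpy as np

def determine_action(q_first: np.ndarray, q_second: np.ndarray) -> int:
    """q_first, q_second: action values for the current state, shape (n_actions,)."""
    q_third = q_first if q_first.max() >= q_second.max() else q_second  # selected "third" function
    return int(np.argmax(q_third))  # the determined "first action"

q_first = np.array([0.2, 0.9, 0.1])   # from the model being trained (parameter update target)
q_second = np.array([0.5, 0.4, 0.6])  # from the existing model (not a parameter update target)
print(determine_action(q_first, q_second))  # -> 1, because max(q_first) exceeds max(q_second)
```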
Regarding claim 3, Fuji teaches the information processing device according to claim 2, wherein the action determination unit changes a first selection probability of selecting the first action value function as the third action value function and a second selection probability of selecting the second action value function as the third action value function according to a learning time of the first inference model (Fuji, Eq.4 and [0056] – “Furthermore, in order to reduce a probability that an existing model is selected even in a state where learning is sufficiently progressed, for example, an oblivion factor f may be defined as in Formula (4), and a factor f multiplied by the action value according to the progress of learning may be provided.” – teaches wherein the action determination unit changes a first selection probability of selecting the first action value function as the third action value function and a second selection probability of selecting the second action value function as the third action value function according to a learning time of the first inference model (oblivion factor f changes probabilities associated with selecting the first and second action value functions according to the progress of learning)), decreases the first selection probability and increases the second selection probability as the learning time is shortened (Fuji, Eq. 4 and [0057] – “As for the factor f, a constant value may be subtracted from the oblivion factor for each trial, and a method of gradually making a selection probability of the existing control model approach zero may be adopted” – teaches decreasing the first selection probability and increasing the second selection probability as the learning time is shortened (f at shorter learning progress increases probability of existing (or second) control model being selected and thus decreases the first selection probability)), and increases the first selection probability and decreases the second selection probability as the learning time is lengthened(Fuji, Eq. 4 and [0057] – “As for the factor f, a constant value may be subtracted from the oblivion factor for each trial, and a method of gradually making a selection probability of the existing control model approach zero may be adopted” – teaches increasing the first selection probability and decreasing the second selection probability as the learning time is lengthened (method of updating f after each trial such that selection probability of existing (second) control model may approach 0, thus increasing the first selection probability as learning time is lengthened)). Regarding claim 4, Fuji teaches the information processing device according to claim 1, wherein the action determination unit includes a third action value function specifying unit that specifies a third action value function obtained by synthesizing the first action value function and the second action value function, and an action selection unit that selects the first action on a basis of the third action value function (Fuji, Eq. 2 and [0053] – “The existing control model Q.sub.A is synthesized (learned) to the control model Q.sub.Z of a synthesis target by the following method. For example, Q.sub.A can be synthesized with Q.sub.Z by establishing the following updating formula.”, and [0055] – “in Formula (2) and Formula (3), an action is selected by comparing the maximum action values of the synthesis destination control model Q.sub.Z and the existing control model Q.sub.A. 
At least one of the respective control models is required.” – teaches wherein the action determination unit includes a third action value function specifying unit that specifies a third action value function obtained by synthesizing the first action value function and the second action value function, and an action selection unit that selects the first action on a basis of the third action value function (action is selected by a synthesis of the first and second action value functions, thus the third action value function is a synthesis of the first and second action value functions and the selection unit selects the first action on a basis of the third action value function)). Regarding claim 5, Fuji teaches the information processing device according to claim 4, wherein the third action value function specifying unit specifies, as the third action value function, a maximum function of the first action value function and the second action value function (Fuji, Eq. 3 and [0055] – “In general Q learning, the Q learning is updated by selecting the action with the highest action value in a certain state, but in Formula (2) and Formula (3), an action is selected by comparing the maximum action values of the synthesis destination control model Q.sub.Z and the existing control model Q.sub.A. At least one of the respective control models is required.” – teaches wherein the third action value function specifying unit specifies a maximum function of the first action value function and the second action value function (action selected by comparing maximum action values of first and second action value functions)). Regarding claim 12, Fuji teaches the information processing device according to claim 1, wherein the second inference model performs learning in advance on a basis of the current state and data of an action of the device based on a first rule (Fuji, [0053] – “In the present embodiment, the existing control model is specifically set as a Q table (Q.sub.A) in which a convergence condition is obtained when continuously reaching the goal 10 times on the shortest path in a shortest path search problem movable in the vertical and horizontal four directions. In addition, the control model of a synthesis destination (the control model for updating parameters) is specifically set as a Q table Q.sub.Z in which a convergence condition is obtained when continuously reaching the goal 10 times on the shortest path in a condition movable in eight directions to which the diagonal four directions are added.” – teaches wherein the second inference model (existing control model) performs learning in advance on a basis of the current state and data of an action of the device based on a first rule (performs Q-learning in advance, Q-learning utilizes state and action to perform learning as in [0050] and is based on a first rule, which is the convergence condition that is obtained by shortest path search problem)). 
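The claim 3-5 readings above rest on two mechanisms in Fuji as characterized here: an oblivion factor f that is reduced on each trial so the existing model is selected less often as learning progresses, and a synthesized "third" function built from a maximum of the two models. The toy sketch below shows that combination only as an illustration; it is not Fuji's Formula (4), and the constants, decay rule, and names are made up.

```python
# Illustrative: oblivion-factor weighting of an existing (second) action value
# function against a learned (first) one, per the characterization of Fuji above.
import numpy as np

def third_function(q_first: np.ndarray, q_second: np.ndarray, f: float) -> np.ndarray:
    """Synthesize a 'third' action value function (claims 4-5 reading: a maximum)."""
    return np.maximum(q_first, f * q_second)

q_first = np.array([0.1, 0.2, 0.1])   # learned model, still weak early in training
q_second = np.array([0.8, 0.3, 0.5])  # pre-trained existing model
f = 1.0                               # oblivion factor, reduced each trial
for trial in range(5):
    q3 = third_function(q_first, q_second, f)
    print(f"trial {trial}: f={f:.2f}, action={int(np.argmax(q3))}")
    f = max(0.0, f - 0.25)            # "a constant value may be subtracted ... for each trial"
    q_first = q_first + 0.2           # stand-in for learning progress on the first model
```

Run as written, the selected action shifts from the existing model's preferred action in early trials to the learned model's preferred action once f has decayed, which is the behavior the claim 3 reading attributes to Fuji.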
Regarding claim 13, Fuji teaches the information processing device according to claim 1, wherein the one or more hardware processors are configured to further function as a plurality of second action value function specifying units (Fuji, [0028] – “The control device 4 according to the present embodiment operates the control units 11 to 1n.sub.2 identifying the control models 31 to 3n.sub.1 by learning and the control units 21 to 2n.sub.2 having one or more existing control models 41 to 4n.sub.2,” – teaches wherein the one or more hardware processors (processors as in [0035]) are configured to further function as plurality of second action value function specifying units (one or more existing control models 41 to 4n.sub.2)). Regarding claim 14, Fuji teaches the information processing device according to claim 1, wherein the one or more hardware processors are configured to further function as: a display control unit that displays, on a display unit, information indicating at least one of a progress of learning of the first inference model, a selection probability of the action determination unit selecting at least one of the first action value function and the second action value function, a number of times of selection of the action determination unit selecting at least one of the first action value function and the second action value function, whether the first action is an action that maximizes an action value represented by one of the first action value function and the second action value function, a selection probability that an action that maximizes an action value represented by the second action value function is selected as the first action, and a transition of the selection probability (Fuji, [0046] – “The selection monitoring unit 91 monitors a situation of learning by displaying the action value and the action selected by the action value selection unit 61 and the number of times of each of the selected control units 11 to 1n.sub.1 and 21 to 2n.sub.2, on, for example, a visualization tool such as a display connected to the outside of the control device 4, or by taking a log and describing in text. For example, it can be used as information for changing a connection relationship with the learning units 71 of the control models 31 to 3n.sub.1 of a learning destination and the existing control models 41 to 4n.sub.2, based on the monitoring results.” – teaches a display control unit that displays information indicating at least one of a progress of learning of the first inference model (monitors situation of learning), a number of times of selection of the action determination unit selecting at least one of the first action value function and the second action value function (number of times of each of the selected control units), and whether the first action is an action that maximizes an action value represented by one of the first action value function and the second action value function (selected action would be an action that maximizes an action value of either action value functions)). Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 6, 8, and 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fuji in view of Zhu et al. (NPL dated Feb. 2021, Self-correcting Q-Learning, hereinafter “Zhu”). Regarding claim 6, Fuji teaches the information processing device according to claim 4, wherein the third action value function specifying unit specifies, as the third action value function, a maximum function of [a] the fourth action value function and the second action value function (Fuji, Eq. 3 and [0055] – “In general Q learning, the Q learning is updated by selecting the action with the highest action value in a certain state, but in Formula (2) and Formula (3), an action is selected by comparing the maximum action values of the synthesis destination control model Q.sub.Z and the existing control model Q.sub.A. At least one of the respective control models is required.” – teaches wherein the third action value function specifying unit specifies a maximum function of a fourth action value function and the second action value function (action selected by comparing maximum action values of fourth and second action value functions)). Fuji fails to explicitly teach wherein the action determination unit includes an action value function correction unit that specifies a fourth action value function obtained by correcting the first action value function on a basis of the second action value function. However, analogous to the field of the claimed invention, Zhu teaches: wherein the action determination unit includes an action value function correction unit that specifies a fourth action value function obtained by correcting the first action value function on a basis of the second action value function (Zhu, Section 5 Paragraph 3 – “Specifically, our proposed Self-correcting Deep Q Network algorithm (ScDQN) equates the current and previous estimates of the action value function Qn(s 0 , a) and Qn−1(s 0 , a) in Eqn. (4) to the target network Q(s, a; θ −) and the online network Q(s, a; θ), respectively. In other words, we compute Eq. (5)” – teaches an action value function correction unit that specifies a fourth action value function (the result of Eq. (5)) obtained by correcting the first action value function on a basis of the second action value function (corrects target network based on online network)), Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the action value function correction of Zhu to the action value functions and device of Fuji. Doing so would combine correlated estimators of the optimal action value into a corrected estimator, which would remove maximization bias and attain faster convergence speed (Zhu, Introduction). 
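Claim 6 is rejected over Fuji in view of Zhu's idea of correcting one estimate of the action value function using another. The fragment below is only a schematic of that general idea: it is not Zhu's Eq. (5) and does not reproduce the ScDQN algorithm. It simply shows a "fourth" function formed by adjusting the first function on the basis of the second, which is then used in the max-style comparison attributed to Fuji; the blending weight and all names are hypothetical.

```python
# Illustrative only: a generic "correct one Q-estimate using another" step (claim 6
# reading of Zhu), followed by the max-comparison from Fuji (claims 4-5 reading).
# This is NOT Zhu's Eq. (5); beta and the blend rule are made up for illustration.
import numpy as np

def corrected_fourth(q_first: np.ndarray, q_second: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """'Fourth' action value function: the first function corrected using the second."""
    return q_first + beta * (q_second - q_first)

def third_from_fourth(q_fourth: np.ndarray, q_second: np.ndarray) -> np.ndarray:
    """'Third' function as a maximum of the corrected (fourth) and second functions."""
    return np.maximum(q_fourth, q_second)

q_first = np.array([1.2, 0.4, 0.9])   # learned estimate (possibly overestimated)
q_second = np.array([0.6, 0.5, 0.7])  # existing / frozen estimate
q_fourth = corrected_fourth(q_first, q_second)
print("fourth:", q_fourth, "-> action", int(np.argmax(third_from_fourth(q_fourth, q_second))))
```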
Regarding claim 8, the combination of Fuji and Zhu teaches the information processing device according to claim 6, such that a second selection probability of selecting the second action value function as the third action value function at start of learning of the first inference model becomes a predetermined selection probability (Fuji, Eq.4 and [0056] – “Furthermore, in order to reduce a probability that an existing model is selected even in a state where learning is sufficiently progressed, for example, an oblivion factor f may be defined as in Formula (4), and a factor f multiplied by the action value according to the progress of learning may be provided.” – teaches a predetermined selection probability (oblivion factor f) that represents a probability of selecting the second action value function as the third action value function at the start of learning (predetermined probability selected at any state of learning, even when learning has progressed or has just started)). Fuji fails to explicitly teach wherein the action value function correction unit specifies the fourth action value function obtained by correcting the first action value function. However, analogous to the field of the claimed invention, Zhu teaches: wherein the action value function correction unit specifies the fourth action value function obtained by correcting the first action value function (Zhu, Section 5 Paragraph 3 – “Specifically, our proposed Self-correcting Deep Q Network algorithm (ScDQN) equates the current and previous estimates of the action value function Qn(s 0 , a) and Qn−1(s 0 , a) in Eqn. (4) to the target network Q(s, a; θ −) and the online network Q(s, a; θ), respectively. In other words, we compute Eq. (5)” – teaches an action value function correction unit that specifies a fourth action value function (the result of Eq. (5)) obtained by correcting the first action value function on a basis of the second action value function (corrects target network based on online network)) Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the action value function correction of Zhu to the action value functions and predetermined selection probability of Fuji. Doing so would combine correlated estimators of the optimal action value into a corrected estimator, which would remove maximization bias and attain faster convergence speed (Zhu, Introduction). Regarding claim 11, the combination of Fuji and Zhu teaches the information processing device according to claim 6, wherein the first action value function specifying unit learns the first inference model by reinforcement learning by using the current state, a reward in the current state, and the first action, and learns the first inference model by reinforcement learning by using the first action when the fourth action value function is used to specify the third action value function (Fuji, [0050] – “ In the present embodiment, an example using Q learning in reinforcement learning is illustrated as a learning method for acquiring a control model. 
The Q learning is a method of learning a value (action value) Q(s,a) for selecting an action a under a certain state value s obtained by processing the observation data from the sensor 2 by using the state acquisition unit 51.”, [0051] – “ A Q table according to the present embodiment holds the grid square of each maze, and a coordinate value represented by symbols 1 to 10 and A to P in the vertical and horizontal directions is set as the state value s. In addition, scores are allocated for each grid square (predefined by a designer), and this is searched as a reward value r. The control model 330 in eight directions is handled one by one in the vertical, horizontal, and diagonal directions as the action a. For the Q learning, state transition calculation is performed by using the following updating formula.”, and in [0055] – “In general Q learning, the Q learning is updated by selecting the action with the highest action value in a certain state, but in Formula (2) and Formula (3), an action is selected by comparing the maximum action values of the synthesis destination control model Q.sub.Z and the existing control model Q.sub.A. At least one of the respective control models is required.” – teaches wherein the first action value function specifying unit learns the first inference model by reinforcement learning using the current state, a reward in the current state, and the first action (uses Q learning in reinforcement learning to learn a control model using current state, action, and reward), and learns the first inference model by reinforcement learning by using the first action when the fourth action value function is used to specify the third action value function (selects action with highest action value in certain state between action value functions, thus learning by reinforcement learning using the action defined by the fourth action value function or the control models)). Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fuji and Zhu as applied to claims 1 and 15-16 above, and further in view of Mousavi et al. (NPL dated Dec. 2014, Context Transfer in Reinforcement Learning Using Action-Value Functions, hereinafter “Mousavi”). Regarding claim 7, the combination of Fuji and Zhu teaches the information processing device according to claim 6, wherein the action value function correction unit specifies the fourth action value function obtained by correcting the first action value function (Zhu, Section 5 Paragraph 3 – “Specifically, our proposed Self-correcting Deep Q Network algorithm (ScDQN) equates the current and previous estimates of the action value function Qn(s 0 , a) and Qn−1(s 0 , a) in Eqn. (4) to the target network Q(s, a; θ −) and the online network Q(s, a; θ), respectively. In other words, we compute Eq. (5)” – teaches an action value function correction unit that specifies a fourth action value function (the result of Eq. (5)) obtained by correcting the first action value function on a basis of the second action value function (corrects target network based on online network)) The combination of Fuji and Zhu fails to explicitly teach an action value for an action represented by the first action value function becomes a value between a maximum value and a minimum value of action values for actions represented by the second action value function. 
However, analogous to the field of the claimed invention, Mousavi teaches: such that an action value for an action represented by the first action value function becomes a value between a maximum value and a minimum value of action values for actions represented by the second action value function (Mousavi, Section 5 Paragraphs 3-4 – “Eq. (12) This is the set of possible Q-values for the pair of (sl, al) using the knowledge of the source tasks. These definitions are used to initialize the Q-values of the target task. We can use a statistical average operator to estimate a single value from the set-value CT(sl, al) as an initial value of Ql(sl, al). For example, we can use mean, median, or midrange operators. In this paper, we use the midrange operator, defined as follows: Eq. (13) where mathematical equation is an initial estimation of Ql(sl, al) and Eq. (14)” – teaches an action value for an action represented by the first action value function becoming a value between a maximum and minimum value of action values for actions represented by the second action value function (takes midrange, which is the average of the minimum and maximum values for the action value function, to initialize Q-values of the next target task)). Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the action values between maximum and minimum action values of an action value function of Mousavi to the action value functions and system of Fuji and Zhu. Doing so would combine the knowledge of different source tasks to be used by the target task (Mousavi, Section 3) and increase average reward while decreasing regret (Mousavi, Section 7.2). Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fuji and Zhu as applied to claims 1 and 15-16 above, and further in view of Chen et al. (US Pub. No. 2017/0286860, published Oct. 2017, hereinafter “Chen”). Regarding claim 9, the combination of Fuji and Zhu teaches the information processing device according to claim 8, wherein the action value function correction unit specifies the fourth action value function obtained by correcting the first action value function (Zhu, Section 5 Paragraph 3 – “Specifically, our proposed Self-correcting Deep Q Network algorithm (ScDQN) equates the current and previous estimates of the action value function Qn(s 0 , a) and Qn−1(s 0 , a) in Eqn. (4) to the target network Q(s, a; θ −) and the online network Q(s, a; θ), respectively. In other words, we compute Eq. (5)” – teaches an action value function correction unit that specifies a fourth action value function (the result of Eq. (5)) obtained by correcting the first action value function on a basis of the second action value function (corrects target network based on online network)) The combination of Fuji and Zhu fails to explicitly teach a selection probability input by a user. However, analogous to the field of the claimed invention, Chen teaches: so as to become a selection probability input by a user (Chen, [0046] – “In some examples, the control program 142 can be configured to receive inputs, e.g., via a keyboard, transmit corresponding queries to a computing device 102, receive responses from computing device 102, and present the responses, e.g., via a display. In some examples, training and operation of computational models are carried out on computing device(s) 102. In some examples, training and operation are carried out on a computing device 104. 
In some of these examples, the control program 142 can be configured to receive inputs, train and/or operate computational model(s) 128 using instructions of representation engine 120 and action engine 122 based at least in part on those inputs to determine an action, and implement the determined action.” and in [0049] – “In some examples, computing device 104 can include a user interface 146. For example, computing device 104(3) can provide user interface 146 to control and/or otherwise interact with cluster 106 and/or computing devices 102 therein. For example, processing unit(s) 136 can receive inputs of user actions via user interface 146 and transmit corresponding data via communications interface 144 to computing device(s) 102.” – teaches a computing device receiving inputs provided by a user to determine an action based at least in part on the user inputs (user interface enables user to control and/or otherwise interact with computing device)). Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the user interface of Chen to the selection probabilities and action value functions of Fuji and Zhu in order to utilize a user input selection probability. Doing so would permit users to control and interact with the computing devices (Chen, [0049]) and adapt the system according to user interactions to obtain effective results (Chen, [0017]). Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fuji, Zhu, and Mousavi as applied to claims 1 and 15-16 above, and further in view of Chen et al. (US Pub. No. 2017/0286860, published Oct. 2017, hereinafter “Chen”). Regarding claim 10, the combination of Fuji, Zhu, and Mousavi teaches the information processing device according to claim 7, wherein the action value function correction unit specifies the fourth action value function obtained by correcting the first action value function (Zhu, Section 5 Paragraph 3 – “Specifically, our proposed Self-correcting Deep Q Network algorithm (ScDQN) equates the current and previous estimates of the action value function Qn(s 0 , a) and Qn−1(s 0 , a) in Eqn. (4) to the target network Q(s, a; θ −) and the online network Q(s, a; θ), respectively. In other words, we compute Eq. (5)” – teaches an action value function correction unit that specifies a fourth action value function (the result of Eq. (5)) obtained by correcting the first action value function on a basis of the second action value function (corrects target network based on online network)) such that an action value for an action represented by the first action value function becomes a value that is between a maximum value and a minimum value of action values for actions represented by the second action value function (Mousavi, Section 5 Paragraphs 3-4 – “Eq. (12) This is the set of possible Q-values for the pair of (sl, al) using the knowledge of the source tasks. These definitions are used to initialize the Q-values of the target task. We can use a statistical average operator to estimate a single value from the set-value CT(sl, al) as an initial value of Ql(sl, al). For example, we can use mean, median, or midrange operators. In this paper, we use the midrange operator, defined as follows: Eq. (13) where mathematical equation is an initial estimation of Ql(sl, al) and Eq. 
(14)” – teaches an action value for an action represented by the first action value function becoming a value between a maximum and minimum value of action values for actions represented by the second action value function (takes midrange, which is the average of the minimum and maximum values for the action value function, to initialize Q-values of the next target task)). The combination of Fuji, Zhu, and Mousavi fails to explicitly teach a value that is input by a user. However, analogous to the field of the claimed invention, Chen teaches: a value that is input by a user (Chen, [0046] – “In some examples, the control program 142 can be configured to receive inputs, e.g., via a keyboard, transmit corresponding queries to a computing device 102, receive responses from computing device 102, and present the responses, e.g., via a display. In some examples, training and operation of computational models are carried out on computing device(s) 102. In some examples, training and operation are carried out on a computing device 104. In some of these examples, the control program 142 can be configured to receive inputs, train and/or operate computational model(s) 128 using instructions of representation engine 120 and action engine 122 based at least in part on those inputs to determine an action, and implement the determined action.” and in [0049] – “In some examples, computing device 104 can include a user interface 146. For example, computing device 104(3) can provide user interface 146 to control and/or otherwise interact with cluster 106 and/or computing devices 102 therein. For example, processing unit(s) 136 can receive inputs of user actions via user interface 146 and transmit corresponding data via communications interface 144 to computing device(s) 102.” – teaches a computing device receiving inputs provided by a user to determine an action based at least in part on the user inputs (user interface enables user to control and/or otherwise interact with computing device)). Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the user interface of Chen to the action value function correction and values between a minimum and maximum of Fuji, Zhu, and Chen in order to enable users to input values between a minimum and maximum based on the second action value function. Doing so would permit users to control and interact with the computing devices (Chen, [0049]) and adapt the system according to user interactions to obtain effective results (Chen, [0017]). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Qin et al. (US Pub. No. 2020/0074354, published March 2021) teaches systems and methods for ride order dispatching and vehicle repositioning. Teaches selecting a maximum value between two action value functions learned in reinforcement learning. Teaches utilizing a Deep Q-Network to learn a Q-value function for all drivers in an environment using states and actions. Wang et al. (NPL dated 2021: Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback) teaches an ensemble method for Q-learning, wherein the ensemble comprises several action value functions based on a state and action. Teaches computing the average across all approximators, or action value functions, to determine a final approximation. 
Teaches an estimation bias for the estimators, or action value functions, with upper and lower bounds to prevent both overestimation and underestimation. Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOUIS C NYE whose telephone number is 571-272-0636. The examiner can normally be reached Monday - Friday 9:00AM - 5:00PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATT ELL can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /LOUIS CHRISTOPHER NYE/Examiner, Art Unit 2141 /MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141
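Several of the rejections above (claims 1, 11, and 12) rest on Fuji's use of tabular Q-learning over a state value, an action, and a reward (Fuji [0050]-[0051]). For reference, the standard Q-learning update that description corresponds to is sketched below; the learning rate, discount, and transition values are arbitrary, and the snippet is not taken from Fuji or from the application.

```python
# Reference sketch of the standard tabular Q-learning update referred to above:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
import numpy as np

alpha, gamma = 0.1, 0.9           # learning rate and discount factor (arbitrary here)
Q = np.zeros((4, 2))              # 4 states x 2 actions, all values start at zero

s, a, r, s_next = 0, 1, 1.0, 2    # one observed transition: state, action, reward, next state
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])
print(Q[s, a])                    # 0.1 after a single update from zero
```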

Prosecution Timeline

Feb 21, 2023
Application Filed
Jan 22, 2026
Non-Final Rejection — §101, §102, §103
Apr 13, 2026
Examiner Interview Summary
Apr 13, 2026
Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12524683
METHOD FOR PREDICTING REMAINING USEFUL LIFE (RUL) OF AERO-ENGINE BASED ON AUTOMATIC DIFFERENTIAL LEARNING DEEP NEURAL NETWORK (ADLDNN)
2y 5m to grant Granted Jan 13, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

1-2
Expected OA Rounds
22%
Grant Probability
58%
With Interview (+35.7%)
3y 2m
Median Time to Grant
Low
PTA Risk
Based on 9 resolved cases by this examiner. Grant probability derived from career allow rate.
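The with-interview figure appears to follow from adding the lift to the baseline: a 22.2% career allow rate (2 of 9 resolved) plus the 35.7-point interview lift gives roughly 58%.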
