Prosecution Insights
Last updated: April 19, 2026
Application No. 17/754,699

INFORMATION PROCESSING APPARATUS AND INFORMATION PROCESSING METHOD

Non-Final OA — §102, §103, §112
Filed: Apr 08, 2022
Examiner: KASSIM, IMAD MUTEE
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: Sony Group Corporation
OA Round: 1 (Non-Final)
Grant Probability: 72% (Favorable)
OA Rounds: 1-2
To Grant: 3y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 72% (116 granted / 160 resolved) — above average, +17.5% vs TC avg
Interview Lift: +33.8% — strong; allow rate with vs. without an examiner interview, among resolved cases with an interview
Typical Timeline: 3y 8m average prosecution; 23 applications currently pending
Career History: 183 total applications across all art units
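For readers who want to reproduce the headline figures above, here is a minimal sketch of the arithmetic, assuming the usual definitions (allow rate = granted / resolved; interview lift = allow rate with an interview minus allow rate without). The 116 / 160 totals come from this page; the with/without-interview split is not shown here, so those counts are placeholders rather than examiner data.

```python
# Minimal sketch (not the vendor's model): reproducing the headline examiner
# statistics from resolved-case counts. The 116/160 totals come from this page;
# the with/without-interview split is NOT given here, so the counts below are
# hypothetical placeholders chosen only to illustrate the arithmetic.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allow rate = granted / resolved, as a percentage."""
    return 100.0 * granted / resolved

career_rate = allow_rate(116, 160)  # -> 72.5%, displayed on the page as 72%

# Hypothetical interview split (placeholder numbers, not from the page).
with_interview = dict(granted=40, resolved=41)      # assumed
without_interview = dict(granted=76, resolved=119)  # assumed

lift = allow_rate(**with_interview) - allow_rate(**without_interview)
print(f"career allow rate: {career_rate:.1f}%")
print(f"interview lift:    {lift:+.1f} percentage points")
```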

Statute-Specific Performance

§101: 22.6% (-17.4% vs TC avg)
§103: 44.2% (+4.2% vs TC avg)
§102: 11.8% (-28.2% vs TC avg)
§112: 12.9% (-27.1% vs TC avg)
Comparison baseline is a Tech Center average estimate • Based on career data from 160 resolved cases

Office Action

§102 §103 §112
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. 
Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph because the claim limitation(s) do not recite(s) sufficient structure, materials, or acts to entirely perform the recited function. Such claim limitation(s) is/are: “acquisition unit that acquires a machine learning model…” (claims 1, 2, 3, 14, 16, 18). “a reception unit that receives training data…” (claims 1-3, and 12-18). “reinforcement learning unit that trains a machine learning model…” (claim 19). “estimation unit that estimates the weight of each of the rewards…”(claim 19). For an analysis of the structure, material, or acts corresponding to the claimed functions, see rejection under 35 USC § 112(b) infra. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant wishes to provide further explanation or dispute the examiner’s interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant regards as the invention. This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph because the claim limitation(s) do not recite(s) sufficient structure, materials, or acts to entirely perform the recited function. Such claim limitation(s) is/are: “acquisition unit that acquires a machine learning model…” (claim 1). 
“a reception unit that receives training data…” (claim 1). “reinforcement learning unit that trains a machine learning model…” (claim 19). “estimation unit that estimates the weight of each of the rewards…”(claim 19). However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph. For the purpose of examination, any computer capable of performing the claimed functions reads on the claims. Applicant may: (a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph; (b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)). If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: (a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181. Claims 2-18 are rejected as they are being directly or indirectly dependent on rejected claim 1. Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Regarding claims 1, 19 and 20, the phrase "such that" renders the claim indefinite because it is unclear whether the limitations following the phrase are part of the claimed invention. See MPEP § 2173.05(d). All the dependent claims that are dependence on claim 1 are rejected based on the dependency. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 1-18 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US 20190354859 A1) in view of Lee et al. (US 20150379429 A1). Regarding claim 1. Xu teaches an information processing apparatus comprising: an acquisition unit that acquires a machine learning model trained with reinforcement learning such that, when first state information indicating a first state has been input, the model will output first action information indicating a first action corresponding to the first state, based on a plurality of rewards weighted by a weight of each of the rewards (see ¶ 16, “The one or more policy parameters may be one or more parameters that define the functioning of the reinforcement learning neural network (one or more parameters that define the actions taken by the neural network). The one or more return parameters may be parameters that define how returns are determined based on the rewards.”, also see ¶ 29, “The one or more parameters may comprise one or more of a discount factor of the return function and a bootstrapping factor of the return function. The return function may apply the discount factor γ to provide a discounted return. This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards).”, also see ¶110 -¶114, and ¶ 142-143); a reception unit that receives training data being a set of second state information indicating a second state and second action information indicating a second action corresponding to the second state (see ¶ 31-31, “update the one or more policy parameters for the reinforcement learning neural network based on the second set of the experiences; and update the one or more return parameters of the return function based on the one or more updated policy parameters and the first set of the experiences, wherein the one or more return parameters are updated via the gradient ascent or descent method. This improves the efficiency and effectiveness of training by repeating the update steps but swapping over the sets of experiences that are used for each update. As mentioned above, using different sets of experiences avoids overfitting. Nevertheless, this reduces the amount of training data that may be used for each update. 
To improve data efficiency, the first and second sets can be swapped over after the updates, so that the second set experiences can then be used to train the policy and the second set of experiences can then be used to train the return function.”, also see ¶ 98, “Each of the plurality of experiences 250 comprises an observation characterizing a state of the environment, an action performed by the agent in response to the observation and a reward received in response to the action.”, also see ¶ 33-50, formula and description of the policy for the reinforcement learning neural network for determining actions A′ from states S′ taken from the second set of experiences τ′, the policy π.sub.θ, operating according to the one or more updated policy parameters θ); and regarding the weight of each of the rewards estimated by training the machine learning model in which the weight of each of the rewards is defined as a part of a connection coefficient of the machine learning model such that, when the second state information included in the training data and a value based on the weight of each of the rewards have been input, the model will output the second action information included in the training data (see ¶ 22, “By conditioning the policy and/or the value function on the one or more return parameters, the agent is enforced to learn universal policies and/or value functions for various sets of return parameters.”, see ¶ 152, “In this implementation, the agent then explicitly learns the value function ν.sub.θ and the policy π 220 most appropriate for any given value of the meta-parameters η 225. This has the advantage of allowing the meta-parameters η to be adjusted without any need to wait for the approximator to “catch up”.”, also see ¶ 138-139, “in this formula, the first term represent a control objective that configures the policy π.sub.θ 210 to maximize the measured reward of the return function 220. The second term represents a prediction objective that configures the value function approximator ν.sub.θ to accurately estimate the return of the return function G.sub.η(τ). The third term is a term for entropy H that regularizes the policy 210, and c and d are coefficients that appropriately weight the different terms in the meta-objective function.”, also ¶ 140-144). Xu do not teach a display unit that displays information regarding the weight. Lee teaches display unit that displays information regarding the weight (see ¶ 78, “a number of MLS programmatic interfaces (such as application programming interfaces (APIs)) may be defined by the service, which guide non-expert users to start using machine learning best practices relatively quickly, without the users having to expend a lot of time and effort on tuning models, or on learning advanced statistics or artificial intelligence techniques. The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems. 
At the same time, expert users may customize the parameters or settings they wish to use for various types of machine learning tasks, such as input record handling, feature processing, model building, execution and evaluation.”, also see ¶ 342, “The dynamic display of the effects of various possible settings changes may be made possible in various embodiments by efficient communications between the back-end components of the MLS (e.g., various MLS servers where the model execution results are obtained and stored, and where the impacts of the changes are rapidly quantified) and the front-end or client-side devices (e.g., web browsers or GUIs being executed at laptops, desktops, smart phones and the like) at which the execution results are displayed and the interactions of the clients with various control elements of the interface are first captured. As a client changes a setting via the interface, an indication of the change may be transmitted rapidly to a back-end server of the MLS in some embodiments. The back-end server may compute the results of the change on the data set to be displayed quickly, and transmit the data necessary to update the display back to the front-end device.”, also see ¶ 342). Both Xu and Lee pertain to the problem of machine learning systems, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Xu and Lee to teach the above limitations. The motivation for doing so would be to increase the flexibility of the system, which can be accomplished by allowing experts to customize it to their needs, “At the same time, expert users may customize the parameters or settings they wish to use for various types of machine learning tasks, such as input record handling, feature processing, model building, execution and evaluation.” (Lee ¶ 78). Regarding claim 2. Xu and Lee teach The information processing apparatus according to claim 1, Xu further teach wherein the reception unit receives a range of the weight of each of the rewards, and the acquisition unit acquires the machine learning model trained with reinforcement learning based on the plurality of rewards weighted by the weight of each of the rewards falling within a range of the weight of each of the rewards received by the reception unit (see 19, “The one or more return parameters may be updated each time that the policy parameters are updated. This provides a computationally simpler and more efficient mechanism for training the system.”, also ¶ 22, “By conditioning the policy and/or the value function on the one or more return parameters, the agent is enforced to learn universal policies and/or value functions for various sets of return parameters.”, also see 152, “the agent then explicitly learns the value function ν.sub.θ and the policy π 220 most appropriate for any given value of the meta-parameters η 225. This has the advantage of allowing the meta-parameters η to be adjusted without any need to wait for the approximator to “catch up”.”). Regarding claim 3. 
Xu and Lee teach The information processing apparatus according to claim 1, Xu further teach wherein the reception unit receives information regarding the plurality of rewards, and the acquisition unit acquires the machine learning model trained with reinforcement learning based on a plurality of rewards based on information regarding the plurality of rewards received by the reception unit (see 9, “retrieve a plurality of experiences from a reinforcement learning neural network configured to control an agent interacting with an environment to perform a task in an attempt to achieve a specified result based on one or more policy parameters for the reinforcement learning neural network, each experience comprising an observation characterizing a state of the environment, an action performed by the agent in response to the observation and a reward received in response to the action.”, also see ¶ 41, “G.sub.η(τ) is the return function that calculates returns from the first set of experiences τ based on the one or more return parameters η.sub.t;”). Regarding claim 4. Xu and Lee teach The information processing apparatus according to claim 1, Xu further teach see ¶ 29, “The one or more parameters may comprise one or more of a discount factor of the return function and a bootstrapping factor of the return function. The return function may apply the discount factor γ to provide a discounted return. This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards).”, also see ¶110 -¶114 and also see ¶ 138-139, “in this formula, the first term represent a control objective that configures the policy π.sub.θ 210 to maximize the measured reward of the return function 220. The second term represents a prediction objective that configures the value function approximator ν.sub.θ to accurately estimate the return of the return function G.sub.η(τ). The third term is a term for entropy H that regularizes the policy 210, and c and d are coefficients that appropriately weight the different terms in the meta-objective function.”, also ¶ 140-144). Lee teaches Wherein the display unit displays information (see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 4. Regarding claim 5. 
Xu and Lee teach The information processing apparatus according to claim 4, Xu further teach wherein the training data includes subject information regarding a subject of an action of the second action information, see ¶ 16, “The one or more policy parameters may be one or more parameters that define the functioning of the reinforcement learning neural network (one or more parameters that define the actions taken by the neural network). The one or more return parameters may be parameters that define how returns are determined based on the rewards.”, also see ¶ 98, “Each of the plurality of experiences 250 comprises an observation characterizing a state of the environment, an action performed by the agent in response to the observation and a reward received in response to the action.”, also see ¶ 33-50, formula and description of the policy for the reinforcement learning neural network for determining actions A′ from states S′ taken from the second set of experiences τ′, the policy π.sub.θ, operating according to the one or more updated policy parameters θ). Lee teaches the display unit displays information indicating the weight of the reward illustrated in different colors according to a difference between the subjects (see ¶ 320, “other visual cues such as color coding, lines of varying thickness, varying fonts etc. may be used to distinguish among the various parts of Graph G1, Bar B1, region 6351 etc”, also see ¶ 324, “The impact of the increase is illustrated in FIG. 64b. As the slider is moved towards the right, the visual properties (e.g., shadings, colors etc.) of several sub-areas of the graph G2 that would be affected by the changed cutoff may be changed in real time.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 5. Regarding claim 6. Xu and Lee teach The information processing apparatus according to claim 1, Xu further teach wherein see ¶ 29, “The one or more parameters may comprise one or more of a discount factor of the return function and a bootstrapping factor of the return function. The return function may apply the discount factor γ to provide a discounted return. This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards).”, also see ¶110 -¶114 and also see ¶ 138-139, “in this formula, the first term represent a control objective that configures the policy π.sub.θ 210 to maximize the measured reward of the return function 220. The second term represents a prediction objective that configures the value function approximator ν.sub.θ to accurately estimate the return of the return function G.sub.η(τ). The third term is a term for entropy H that regularizes the policy 210, and c and d are coefficients that appropriately weight the different terms in the meta-objective function.”, also ¶ 140-144). 
Lee teaches wherein the display unit displays statistical information (see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 6. Regarding claim 7. Xu and Lee teach The information processing apparatus according to claim 6, Lee further teach wherein the display unit displays a message related to a suggestion of retraining based on the statistical information (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 7. Regarding claim 8. Xu and Lee teach The information processing apparatus according to claim 6, Xu further teach wherein the training data includes environmental information regarding an environment in the second state, the statistical information is correlation information indicating a correlation between the weight of the reward and the environmental information (see ¶ 31-31, “update the one or more policy parameters for the reinforcement learning neural network based on the second set of the experiences; and update the one or more return parameters of the return function based on the one or more updated policy parameters and the first set of the experiences, wherein the one or more return parameters are updated via the gradient ascent or descent method. This improves the efficiency and effectiveness of training by repeating the update steps but swapping over the sets of experiences that are used for each update. As mentioned above, using different sets of experiences avoids overfitting. Nevertheless, this reduces the amount of training data that may be used for each update. To improve data efficiency, the first and second sets can be swapped over after the updates, so that the second set experiences can then be used to train the policy and the second set of experiences can then be used to train the return function.”, also see ¶ 98, “Each of the plurality of experiences 250 comprises an observation characterizing a state of the environment, an action performed by the agent in response to the observation and a reward received in response to the action.”, also see ¶ 33-50, formula and description of the policy for the reinforcement learning neural network for determining actions A′ from states S′ taken from the second set of experiences τ′, the policy π.sub.θ, operating according to the one or more updated policy parameters θ). 
Lee teaches the display unit displays the correlation information (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 8. Regarding claim 9. Xu and Lee teach The information processing apparatus according to claim 8, Lee further teach wherein, based on the correlation information, the display unit displays a message related to a suggestion of retraining in consideration of a reward based on the environmental information (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 9. Regarding claim 10. Xu and Lee teach The information processing apparatus according to claim 6, Xu further teach wherein the statistical information is correlation information indicating a correlation between weights of at least two rewards among the weights of each of the rewards (see ¶ 29, “This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards). Optimizing the bootstrapping factor (potentially in combination with the discount factor) leads to more efficient and accurate reinforcement learning.”). Lee teaches the display unit displays the correlation information (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 10. Regarding claim 11. 
Xu and Lee teach The information processing apparatus according to claim 10, Xu further teach wherein see ¶ 29, “This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards). Optimizing the bootstrapping factor (potentially in combination with the discount factor) leads to more efficient and accurate reinforcement learning.”, also see ¶ 89, he discount factor γ determines the time-scale of the return. A discount factor close to γ=1 provides a long-sighted goal that accumulates rewards far into the future, while a discount factor with a value close to γ=0 provides a short-sighted goal that prioritizes short-term rewards. Even in problems where long-sightedness is desired, it is frequently observed that discount factor values of γ<1 achieve better results, especially during early learning. It is known that many algorithms converge faster with lower discounts factor values, but too low a discount factor value can lead to sub-optimal policies. In practice it can therefore be better to first optimize for a short-sighted horizon, e.g., with γ=0 at first, and then to repeatedly increase the discount factor value at later stages. “”). Lee teaches the display unit displays a message (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 11. Regarding claim 12. Xu and Lee teach The information processing apparatus according to claim 4, Xu further teach wherein the reception unit receives a selection operation for information indicating the weight of the reward see ¶ 86, “An experience tuple corresponding to a time step may include: (i) an observation characterizing the state of the environment at the time step, (ii) an action that was selected to be performed by the agent at the time step, (iii) a subsequent observation characterizing a subsequent state of the environment subsequent to the agent performing the selected action, (iv) a reward received subsequent to the agent performing the selected action, and (v) a subsequent action that was selected to be performed at the subsequent time step.”). 
Lee teaches wherein information indicating the weight of the reward displayed by the display unit, and the display unit displays information regarding the training data (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”, also see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 12. Regarding claim 13. Xu and Lee teach The information processing apparatus according to claim 4, Xu further teach wherein the reception unit receives a deletion operation for information indicating the weight of the reward displayed see ¶ 65, “by learning an improved return function as the system trains, the system is able to converge on an optimal set of policy parameters quicker, using fewer updates. As improved rewards are used during training, the final trained system is also displays more accurate and effective learned behaviors.”, also see ¶ 110, “The meta-gradient reinforcement learning approach described in this specification is based on the principle of online cross-validation, using successive samples of experience. The underlying reinforcement learning method is applied to the first set of experiences τ, and its performance is measured using a second set of experiences τ′. Specifically, the method starts with policy parameters θ 215, and applies the update function to the first set of experiences τ, resulting in new parameters θ′. The gradient dθ/dη of these updates then indicates how the meta-parameters η 225 affected these new policy parameters θ′. The method then measures the performance of the new policy parameters θ′ on a subsequent, independent second set of experiences τ′, utilizing a differentiable meta-objective J′(τ′, θ′, η′). When validating the performance on the second set of experiences τ′, a fixed meta-parameter η′ in J′ is used as a reference value. In this way, a differentiable function of the meta-parameters η is formed, and the gradient of η can be obtained by taking the derivative of meta-objective J′ with respect to η and applying the chain rule”, also see ¶ 124, “the meta-gradient can be assessed for the one or more meta-parameters η 225, and the one or more meta-parameters η 225 can be adjusted accordingly in step 408 to ensure the optimum return function G, forming updated meta-parameters. These updated meta-parameters can then be used as the meta-parameters η 225 in the subsequent update iteration (where it is concluded that the optimum return function has not been reached).”). 
Lee teaches wherein information indicating the weight of the reward displayed by the display unit, and the display unit displays information regarding … (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”, also see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 13. Regarding claim 14. Xu and Lee teach The information processing apparatus according to claim 1, Xu further teach see ¶ 16, “The one or more policy parameters may be one or more parameters that define the functioning of the reinforcement learning neural network (one or more parameters that define the actions taken by the neural network). The one or more return parameters may be parameters that define how returns are determined based on the rewards.”, also see ¶ 29, “The one or more parameters may comprise one or more of a discount factor of the return function and a bootstrapping factor of the return function. The return function may apply the discount factor γ to provide a discounted return. This discounted return may comprise a weighted sum of the rewards, with the discount factor defining the decay of the weighted sum. Optimizing the discount factor has been found to be a particularly effective method of improving the efficiency and accuracy of reinforcement learning. Equally, the return function might apply the bootstrapping factor λ to a geometrically weighted combination of returns (a bootstrapping parameter return function, or λ-return). The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards).”, also see ¶ 91, “the return function itself is learned, in addition to the policy, by treating it as a parametric function with tunable meta-return parameters, or meta-parameters, η. Such meta-parameters η may for instance include the discount factor γ, or the bootstrapping parameter λ. For the avoidance of doubt, the “meta-return parameters” and “return-parameters” described thus far are equivalent to the “meta-parameters” η described hereafter.”, also see ¶110 -¶114). Lee teach wherein the display unit displays a graph with a weight of at least one reward among the weights of the respective rewards defined as an axis, the reception unit receives a designation operation for a point in a region of the graph displayed by the display unit (see ¶ 230, “In the depicted graph, the prediction speed (for a given data set size for which predictions are expected to be made after training) increases from left to right along the X-axis. 
Each point 4110 (e.g., any of the twelve points 4110A-4110N) represents a prediction run of a model with a corresponding set of FPTs being used for training the model.”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 14. Regarding claim 15. Xu and Lee teach The information processing apparatus according to claim 2, Xu further teach wherein the reception unit (see ¶ 22, “By conditioning the policy and/or the value function on the one or more return parameters, the agent is enforced to learn universal policies and/or value functions for various sets of return parameters.” ,also see ¶ 19, “The one or more return parameters may be updated each time that the policy parameters are updated. This provides a computationally simpler and more efficient mechanism for training the system.”, also see ¶ 116, “This update may be done for example by applying a stochastic gradient descent to update the meta-parameters η in the direction of the meta-gradient. Alternatively, the meta-objective function J′ may be optimized by any other known gradient ascent or decent method.”, also see ¶ 124, “the meta-gradient can be assessed for the one or more meta-parameters η 225, and the one or more meta-parameters η 225 can be adjusted accordingly in step 408 to ensure the optimum return function G, forming updated meta-parameters. These updated meta-parameters can then be used as the meta-parameters η 225 in the subsequent update iteration (where it is concluded that the optimum return function has not been reached).”, also see ¶¶ 129, 134-137, i.e. updating training and adjusting for new parameter ranges). Lee teaches the display unit displays information … (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”, also see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 15. Regarding claim 16. Xu and Lee teach The information processing apparatus according to claim 15, Xu further teach wherein the reception unit receives a change operation for the information indicating the range of the weight of the reward see ¶ 19, “The one or more return parameters may be updated each time that the policy parameters are updated. 
This provides a computationally simpler and more efficient mechanism for training the system.”, see ¶ 22, “By conditioning the policy and/or the value function on the one or more return parameters, the agent is enforced to learn universal policies and/or value functions for various sets of return parameters.”, also see ¶ 116, “This update may be done for example by applying a stochastic gradient descent to update the meta-parameters η in the direction of the meta-gradient. Alternatively, the meta-objective function J′ may be optimized by any other known gradient ascent or decent method.”, also see ¶ 124, “the meta-gradient can be assessed for the one or more meta-parameters η 225, and the one or more meta-parameters η 225 can be adjusted accordingly in step 408 to ensure the optimum return function G, forming updated meta-parameters. These updated meta-parameters can then be used as the meta-parameters η 225 in the subsequent update iteration (where it is concluded that the optimum return function has not been reached).”, also see ¶¶ 129, 134-137, i.e. updating training and adjusting for new parameter ranges). Lee teaches information indicating the range of the weight of the reward displayed by the display unit … (see ¶ 336, “the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run. The MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.”, also see ¶ 78, “The interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems…”). The motivation utilized in the combination of claim 1, super, applies equally as well to claim 16. Regarding claim 17. Xu and Lee teach The information processing apparatus according to claim 3, Xu further teach wherein see ¶ 89, “The discount factor γ determines the time-scale of the return. A discount factor close to γ=1 provides a long-sighted goal that accumulates rewards far into the future, while a discount factor with a value close to γ=0 provides a short-sighted goal that prioritizes short-term rewards.” ,also see ¶ 90, “The return may also be bootstrapped to different time horizons. An n-step return accumulates rewards over n time-steps before then adding the value function at the nth time-step. The λ-return is a geometrically weighted combination of n-step returns. In either case, the parameters n or λ may be important to the performance of the algorithm, trading off bias and variance, and therefore an efficient selection of these parameters is desirable.”, also see ¶ 29, “The return function may calculate a weighted combination of returns, with each return being estimated over multiple steps (e.g., being a decaying weighted sum of rewards). Optimizing the bootstrapping factor (potentially in combination with the discount factor) leads to more efficient and accurate reinforcement learning.” and 65). 
Lee teaches the display unit displays information … (see ¶ 336, “the graphical interface may also display alerts or informational messages
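The §103 rejection above leans repeatedly on Xu's description of a return function parameterized by a discount factor γ (a decaying weighted sum of rewards) and a bootstrapping factor λ (a geometrically weighted combination of n-step returns). As a reading aid only, the sketch below spells out those two quantities; it is not code from Xu, Lee, or the application, and the reward and value numbers are made-up stand-ins.

```python
# Reading aid only: the discounted return and lambda-return that Xu's quoted
# passages (e.g. ¶29, ¶89-¶90) describe in prose. Not code from any reference;
# `values` are stand-in state-value estimates needed for bootstrapping.

from typing import List

def discounted_return(rewards: List[float], gamma: float) -> float:
    """G_t = sum_k gamma^k * r_{t+k}: a decaying weighted sum of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_return(rewards: List[float], values: List[float],
                  gamma: float, n: int) -> float:
    """Accumulate n rewards, then bootstrap with the value estimate at step n."""
    g = sum((gamma ** k) * rewards[k] for k in range(n))
    return g + (gamma ** n) * values[n]

def lambda_return(rewards: List[float], values: List[float],
                  gamma: float, lam: float) -> float:
    """Geometrically (lambda-) weighted combination of n-step returns."""
    T = len(rewards)
    total = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, gamma, n)
                for n in range(1, T))
    # The final term falls back to the full discounted (Monte-Carlo) return.
    return total + lam ** (T - 1) * discounted_return(rewards, gamma)

rewards = [1.0, 0.0, 0.5, 2.0]       # example reward sequence (made up)
values = [0.0, 1.2, 1.0, 1.5, 0.0]   # stand-in value estimates V(s_{t+n})
print(discounted_return(rewards, gamma=0.9))
print(lambda_return(rewards, values, gamma=0.9, lam=0.8))
```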

Prosecution Timeline

Apr 08, 2022 — Application Filed
Oct 15, 2025 — Non-Final Rejection: §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596923
MACHINE LEARNING OF KEYWORDS
2y 5m to grant — Granted Apr 07, 2026
Patent 12572843
AGENT SYSTEM FOR CONTENT RECOMMENDATIONS
2y 5m to grant — Granted Mar 10, 2026
Patent 12572854
ROOT CAUSE DISCOVERY ENGINE
2y 5m to grant — Granted Mar 10, 2026
Patent 12566980
SYSTEM AND METHOD HAVING THE ARTIFICIAL INTELLIGENCE (AI) ALGORITHM OF K-NEAREST NEIGHBORS (K-NN)
2y 5m to grant — Granted Mar 03, 2026
Patent 12566861
IDENTIFYING AND CORRECTING VULNERABILITIES IN MACHINE LEARNING MODELS
2y 5m to grant — Granted Mar 03, 2026
Study what changed in these cases to get them past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 72%
With Interview: 99% (+33.8%)
Median Time to Grant: 3y 8m
PTA Risk: Low
Based on 160 resolved cases by this examiner. Grant probability derived from career allow rate.
