Prosecution Insights
Last updated: April 19, 2026
Application No. 18/686,882

LEARNING SYSTEM, LEARNING METHOD, AND PROGRAM

Office Action: Non-Final, with rejections under §101, §102, §103, and §112
Filed: Feb 27, 2024
Examiner: SORRIN, AARON JOSEPH
Art Unit: 2672
Tech Center: 2600 — Communications
Assignee: Rakuten Group Inc.
OA Round: 1 (Non-Final)

Grant probability: 74% (Favorable)
Expected OA rounds: 1-2
Expected time to grant: 3y 5m
Grant probability with interview: 99%

Examiner Intelligence

Career allow rate: 74% (46 granted / 62 resolved), +12.2% vs Tech Center average (above average)
Interview lift: +50.6% (strong), comparing resolved cases with vs. without an interview
Typical timeline: 3y 5m average prosecution
Currently pending: 22 applications
Career history: 84 total applications across all art units

Statute-Specific Performance

§101: 20.4% (-19.6% vs TC avg)
§102: 14.1% (-25.9% vs TC avg)
§103: 35.6% (-4.4% vs TC avg)
§112: 29.3% (-10.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 62 resolved cases.

Office Action

Rejections: §101, §102, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 02/27/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Specification

The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 16 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 16 recites, “calculate a plurality of first losses each of which is the first loss based on the first estimation result and the first numerical value”. As written, it is unclear how the ‘plurality of first losses’ can all be ‘the first loss’. In other words, the first loss is singular, so it is unclear how a plurality of first losses, as a plural statement, is actually one singular thing. Accordingly, this is being interpreted such that a plurality of first losses are calculated.
Claim 17 recites, “calculate a plurality of related losses each of which is the related loss based on the first processing result and the second processing result”. As written, this limitation is unclear for the same reasons as described above with respect to claim 16. Accordingly, this is being interpreted such that a plurality of related losses are calculated.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101.

Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to the abstract idea of estimating a number associated with an object, without significantly more. The claim recites: “A learning system, comprising at least one processor configured to: acquire a first training image relating to a first object having a first numerical value; acquire a second training image relating to a second object having a second numerical value; and execute learning processing of a learning model which estimates a numerical value to be estimated relating to an object to be estimated included in an estimation-target image, based on metric learning using the first training image and the second training image.”

The limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitation in the mind. First, the receiving of the first and second training images amounts to insignificant extra-solution activity (data collection). A person can estimate a number related to an object in an image (for example, the age of the person in the image). The person can make this estimation based on the training images. This judicial exception is not integrated into a practical application.
In particular, the claim recites the additional elements of a processor and a learning model that uses metric learning. The processor is recited at a level of generality such that it amounts to no more than generic computer equipment for the performance of the abstract idea. The learning model that uses metric learning is recited at a level of generality such that it amounts to a mathematical concept, which is also practicable by the human mind or by pen and paper. Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements are recited at a high level of generality. It is therefore a judicial exception that is not integrated into a practical application, and does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This claim is not patent eligible.

Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to acquiring a third training image (insignificant extra-solution activity, data gathering) and using all three images for the learning processing of claim 1, which amounts to a mathematical concept that could also be performed manually. The claim is not patent eligible.

Claims 3-6, 8-9, and 11 are rejected under 35 U.S.C. 101 because the claimed invention is directed to generic similarity/dissimilarity-based training using three different training images, which amounts to mathematical operations. These could also be done mentally or using pen and paper, especially because there are only three training images claimed. The claims are not patent eligible.
Claims 7 and 13 are rejected under 35 U.S.C. 101 because the claimed invention is directed to describing the estimation result as a distribution with probabilities, and using these for the learning. This amounts to mathematical concepts, which could also be performed by the human mind. Claim 13 also describes Kullback-Leibler divergence, which amounts to a mathematical formula. The claims are not patent eligible.

Claims 10 and 12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to calculating feature amounts and executing learning based on these feature amounts (claim 10) and acquiring estimation results based on the training images and executing learning based on these estimation results (claim 12). These amount to mathematical concepts that could also be performed by the human mind. The claims are not patent eligible.

Claims 14-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to calculating losses as part of the learning processing, which amounts to mathematical operations. These could also be done mentally or using pen and paper, especially because there are only three training images claimed. The claims are not patent eligible.

Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to describing the object as human and the numerical values as ages. These elements do not integrate the abstract idea of claim 1 into a practical application. Additionally, the limitations of claim 1 are still mathematical concepts and mentally performable even with the modifications of claim 18. The claim is not patent eligible.

Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to a method analogous to the system of claim 1. The claim is not patent eligible.

Claim 20 is rejected under 35 U.S.C.
101 because the claimed invention is directed to a non-transitory computer-readable information storage medium for storing a program for causing a computer to perform steps analogous to the system steps of claim 1. The non-transitory computer-readable information storage medium is recited generically, such that it amounts to generic storage. The claim is not patent eligible.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-10, 12, and 14-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Liu (Label-Sensitive Deep Metric Learning for Facial Age Estimation).

Regarding claim 1, Liu teaches “A learning system, comprising at least one processor” (Liu, Table VI and section E.5, Paragraph 2, “We also investigated the computational time with different deep networks based on the efficient Caffe [51] toolbox, and the whole architectures were built on a speed-up parallel computing GPU with NVIDIA GTX 1080. Table VI tabulates the comparisons of the computational time during the testing phase. From these results, we see that the deep architectures achieve the real-time age estimation with a GPU for the feature extraction procedure.
In addition, we carefully implemented the OHRANK by following the details provided in [9]. The OHRANK takes 0.04 seconds by using an Intel i5-CPU@3.20GHz PC, which also satisfies the real-time requirements.”)

“configured to: acquire a first training image relating to a first object having a first numerical value; acquire a second training image relating to a second object having a second numerical value;” (Liu, Figure 2 and section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” The faces in the images are mapped to the objects and the groundtruth ages are mapped to the numerical values. Figure 2 shows training data, each associated with an image representing an age (numerical value).)

“and execute learning processing of a learning model which estimates a numerical value to be estimated relating to an object to be estimated included in an estimation-target image,” (Liu, Sections IV.D.
and IV.E., “1) Mean Absolute Error: For the evaluation metrics, we utilized the mean absolute error (MAE) [1], [19], [25], [33] to measure the error between the predicted age and the ground-truth, which is computed as follows:

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y*_i|

where ŷ and y* denote predicted and ground-truth age value, respectively, and N denotes the number of the testing samples. 2) Cumulative Score Curve: We also applied the cumulative score (CS) [23], [24], [26], [33] curve to quantitatively evaluate the performance of age estimation methods. The cumulative prediction accuracy at the error ϵ is computed as:

CS(n) = (K_n / K) × 100%

where K is the total number of testing images, K_n is the number of testing images whose absolute error between the estimated age and the ground-truth age is not greater than n years.”; “In our settings, we performed five folds cross-validation of our proposed approach on the MORPH (Album2) dataset. Specifically, we divided the whole dataset into five equal-size folds. Then we used one fold (20% of total data) as the testing set and the other four folds (80% of total data) as the training set. We repeated this procedure ten times and finally averaged the results as the facial age estimation results.” Note that the MORPH dataset has faces (objects to be estimated) in 55000 face images (estimation-target images). The model of Liu estimates age (numerical value to be estimated) related to these faces.)

“based on metric learning using the first training image and the second training image.” (Liu, section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace.
Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” Accordingly, the metric learning of Liu uses the training images.)

Regarding claim 2, Liu teaches “The learning system according to claim 1,” “wherein the at least one processor configured to: acquire a third training image relating to a third object having a third numerical value, and execute the learning processing based on the metric learning using the first training image, the second training image, and the third training image.” (Liu, Figure 2, section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture.
Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” It is understood that there are more than two in the training set of N samples. Still, this is confirmed by Liu which describes that 80% of the 55000 MORPH images are used for training (see claim 1 rejection). Figure 2 also shows a plurality of samples, each representative of an image with an age.)

Regarding claim 3, Liu teaches “The learning system according to claim 2,” “wherein the first numerical value is the same as the second numerical value, wherein the first numerical value is different from the third numerical value,” (Liu, Figure 2 shows that in the mini-batch samples, two samples (i, j) represent the same age (numerical value) and the third and fourth samples (k, l) represent a different age (numerical value).)
“and wherein the at least one processor is configured to: acquire a first processing result obtained by the learning model based on the first training image; acquire a second processing result obtained by the learning model based on the second training image; acquire a third processing result obtained by the learning model based on the third training image; and execute the learning processing so that a difference between the first processing result and the second processing result becomes smaller and a difference between the first processing result and the third processing result becomes larger.” (Liu, Figures 2-3 and section III.A. Paragraphs 1 and 3, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.”; “The crucial part of our LSDML is to learn the network parameters f(·).
To achieve this goal, we first pass a given mini-batch forward the deep network, and we select each quadruplet of (i, j, k, l), such that (x_i, x_j) ∈ P, (x_i, x_k) ∈ N and (x_j, x_l) ∈ N, where P and N denote the positive and negative pair set, respectively. More details are illustrated in Fig. 3. Moreover, to achieve the discriminativeness of the feature similarity, our LSDML enforces each d_f(x_i, x_j) pair in positive set is close to each other, and at the same time d_f(x_i, x_j) and d_f(x_k, x_l) in negative set is pushed far away. As a result, the distance of inter-class pairs is minimized, and the distance of intra-class pairs is larger than a margin τ in the transformed subspace.” Note that Liu calculates feature representations (processing results) for each training image based on the model, and uses these for the learning processing as described above and in Figure 3. As described above and in Figure 2, right, the difference between the first and second is minimized, while the distance between the first and third is increased.)

Regarding claim 4, Liu teaches “The learning system according to claim 2,” “wherein the first numerical value is different from both the second numerical value and the third numerical value, wherein a difference between the first numerical value and the third numerical value is larger than a difference between the first numerical value and the second numerical value,” (Liu, Figure 2: the first numerical value is mapped to i, the second is mapped to k, and the third is mapped to l. Numerical value i is different from k and l, and the difference between i and l is larger than the difference between i and k.)
“and wherein the at least one processor is configured to: acquire a first processing result obtained by the learning model based on the first training image; acquire a second processing result obtained by the learning model based on the second training image; acquire a third processing result obtained by the learning model based on the third training image; and execute the learning processing so that a difference between the first processing result and the third processing result becomes larger than a difference between the first processing result and the second processing result.” (Liu, Figures 2-3 and section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.”; Note that Liu calculates feature representations (processing results) for each training image based on the model, and uses these for the learning processing as described above and in Figure 3.
As shown in Figure 2, right, the difference between the first and third is greater than the distance between the first and second.)

Regarding claim 5, Liu teaches “The learning system according to claim 3,” “wherein the at least one processor is configured to execute the learning processing so that the difference between the first processing result and the third processing result becomes a difference corresponding to a difference between the first numerical value and the third numerical value.” (Liu, Figure 2, Algorithm 1, Section III.A. Paragraph 3, “The crucial part of our LSDML is to learn the network parameters f(·). To achieve this goal, we first pass a given mini-batch forward the deep network, and we select each quadruplet of (i, j, k, l), such that (x_i, x_j) ∈ P, (x_i, x_k) ∈ N and (x_j, x_l) ∈ N, where P and N denote the positive and negative pair set, respectively. More details are illustrated in Fig. 3. Moreover, to achieve the discriminativeness of the feature similarity, our LSDML enforces each d_f(x_i, x_j) pair in positive set is close to each other, and at the same time d_f(x_i, x_j) and d_f(x_k, x_l) in negative set is pushed far away. As a result, the distance of inter-class pairs is minimized, and the distance of intra-class pairs is larger than a margin τ in the transformed subspace.” Note that parameter optimization for minimizing loss, as shown in Algorithm 1, amounts to making the processing result differences correspond to the real ground truth differences. This is also embodied in Figure 2, which shows differences in the feature space correspond to real age differences.)
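The pull-together / push-apart behavior that the quoted passage of Liu's Section III.A describes can be sketched as a generic margin loss on precomputed embeddings. This is a minimal illustration of the technique class, not Liu's exact LSDML objective; the function names and the choice of Euclidean distance are assumptions for illustration only.

```python
import math

def euclidean(u, v):
    # plain Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def metric_loss(anchor, positive, negative, tau=1.0):
    """Hinge-style margin loss: zero once the same-label pair is closer
    than the different-label pair by at least the margin tau."""
    d_pos = euclidean(anchor, positive)   # same-age pair: should shrink
    d_neg = euclidean(anchor, negative)   # different-age pair: should exceed the margin
    return max(0.0, d_pos - d_neg + tau)
```

Minimizing this loss over many sampled pairs drives same-label embeddings together and pushes different-label embeddings beyond the margin, which is the effect the examiner reads onto the "smaller difference" / "larger difference" limitations.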
Regarding claim 6, Liu teaches “The learning system according to claim 3,” “wherein the first processing result is a first estimation result obtained by the learning model, wherein the second processing result is a second estimation result obtained by the learning model, wherein the third processing result is a third estimation result obtained by the learning model, and wherein the at least one processor is configured to execute the learning processing based on the first estimation result, the second estimation result, and the third estimation result.” (As described in the rejection of claim 3, the processing results for the three images are mapped to feature representations. These feature representations further correspond to the estimation results of claim 6. These feature representations are used for the learning processing, as recited in the rejection of claim 3 and in Figure 3.)

Regarding claim 7, Liu teaches “The learning system according to claim 6,” “wherein the first estimation result is a first distribution including each of a plurality of numerical values and a first probability that the first object has the each of the plurality of numerical values, wherein the second estimation result is a second distribution including each of the plurality of numerical values and a second probability that the second object has the each of the plurality of numerical values, wherein the third estimation result is a third distribution including each of the plurality of numerical values and a third probability that the third object has the each of the plurality of numerical values, and wherein the at least one processor is configured to execute the learning processing based on the first distribution, the second distribution, and the third distribution.” (Liu, as described above, the feature representations are mapped to the estimation results for each image.
The feature representations are vectors (distributions with a plurality of numerical values) with an associated probability (see Figure 3, ‘Similarity Scores’). Figure 3 shows that these are used for the learning processing.)

Regarding claim 8, Liu teaches “The learning system according to claim 3,” “wherein the at least one processor is configured to execute the learning processing so that a difference between the second processing result and the third processing result becomes larger.” (Liu, Algorithm 1, Section III.A. Paragraph 3, “The crucial part of our LSDML is to learn the network parameters f(·). To achieve this goal, we first pass a given mini-batch forward the deep network, and we select each quadruplet of (i, j, k, l), such that (x_i, x_j) ∈ P, (x_i, x_k) ∈ N and (x_j, x_l) ∈ N, where P and N denote the positive and negative pair set, respectively. More details are illustrated in Fig. 3. Moreover, to achieve the discriminativeness of the feature similarity, our LSDML enforces each d_f(x_i, x_j) pair in positive set is close to each other, and at the same time d_f(x_i, x_j) and d_f(x_k, x_l) in negative set is pushed far away. As a result, the distance of inter-class pairs is minimized, and the distance of intra-class pairs is larger than a margin τ in the transformed subspace.” Note that parameter optimization for minimizing loss, as shown in Algorithm 1, amounts to making the processing results correspond to the ground truth. In this case, making the difference between the second and third processing results larger, as the difference between the ground truth ages is larger. This is also embodied in Figure 2, which shows how in the feature space, the feature representations of farther ages move farther apart, and the arrow direction confirms this in the right panel.)
Regarding claim 9, Liu teaches “The learning system according to claim 8,” “wherein the learning module at least one processor is configured to execute the learning processing so that the difference between the second processing result and the third processing result becomes a difference corresponding to a difference between the second numerical value and the third numerical value.” (Liu, Figure 2, Algorithm 1, Section III.A. Paragraph 3, “The crucial part of our LSDML is to learn the network parameters f(·). To achieve this goal, we first pass a given mini-batch forward the deep network, and we select each quadruplet of (i, j, k, l), such that (x_i, x_j) ∈ P, (x_i, x_k) ∈ N and (x_j, x_l) ∈ N, where P and N denote the positive and negative pair set, respectively. More details are illustrated in Fig. 3. Moreover, to achieve the discriminativeness of the feature similarity, our LSDML enforces each d_f(x_i, x_j) pair in positive set is close to each other, and at the same time d_f(x_i, x_j) and d_f(x_k, x_l) in negative set is pushed far away. As a result, the distance of inter-class pairs is minimized, and the distance of intra-class pairs is larger than a margin τ in the transformed subspace.” Note that parameter optimization for minimizing loss, as shown in Algorithm 1, amounts to making the processing result differences correspond to the real ground truth differences. Accordingly, feature representations are optimized to have differences correlated to ground truth differences, including the differences for the second and third processing result. This is also embodied in Figure 2, which shows differences in the feature space correspond to real age differences.)
Regarding claim 10, Liu teaches “The learning system according to claim 1,” “wherein the at least one processor is configured to: calculate a first feature amount relating to the first training image based on the first training image and the learning model; calculate a second feature amount relating to the second training image based on the second training image and the learning model; and execute the learning processing based on the first feature amount and the second feature amount.” (Liu, Figure 3 and section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” Note that Liu calculates feature representations (feature amounts) for each training image based on the model, and uses these for the learning processing as described above and in Figure 3.)
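The claim 10 mapping rests on computing a per-image feature representation, L2-normalizing it (as the quoted passage of Liu describes for the fully connected layer outputs), and comparing images by embedding distance. A minimal sketch of that pipeline follows; `toy_model` is a hypothetical stand-in for Liu's network, and all names here are illustrative.

```python
import math

def toy_model(image_pixels):
    # hypothetical stand-in feature extractor: any fixed-length vector per image
    return [sum(image_pixels), max(image_pixels), min(image_pixels)]

def l2_normalize(vec):
    # scale the vector to unit Euclidean norm, as Liu applies to FC outputs
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def feature_amount(image_pixels):
    # "feature amount" for an image = normalized model output
    return l2_normalize(toy_model(image_pixels))

def distance(feat_a, feat_b):
    # distance between two feature amounts, used by the learning processing
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
```

With a real network in place of `toy_model`, the learning processing would operate on the distances between these normalized feature amounts.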
Regarding claim 12, Liu teaches “The learning system according to claim 1,” “wherein the at least one processor is configured to: acquire a first estimation result obtained by the learning model based on the first training image; acquire a second estimation result obtained by the learning model based on the second training image; and execute the learning processing based on the first estimation result and the second estimation result.” (Liu, Figure 3 and section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^N denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” Note that Liu calculates feature representations (estimation results) for each training image using the model, and uses these feature representations for the learning processing as described above and in Figure 3.)
Regarding claim 14, Liu teaches “The learning system according to claim 1,” “wherein the at least one processor is configured to: acquire a first estimation result obtained by the learning model based on the first training image; calculate a first loss based on the first estimation result and the first numerical value;” (Liu, Algorithm 1, Figure 3, and section III.A. Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^{N} denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” Note that the feature representations computed for each image by the learning model are mapped to the estimation result. Also, the ‘residual learning method’ recited above requires loss value calculations (mapped to the first loss), as further demonstrated by the back-propagation discussed in Algorithm 1 and Figure 3 caption.)
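The “first loss” mapping above pairs an estimation result with the groundtruth value y_i. The Office Action does not reproduce Liu's exact per-sample loss, so the sketch below uses a generic squared-error term purely as an illustration of the claimed calculation:

```python
def first_loss(estimated_age: float, groundtruth_age: float) -> float:
    # Hypothetical regression loss: squared error between the first
    # estimation result and the first numerical value (the age label).
    return (estimated_age - groundtruth_age) ** 2

groundtruth = 25.0                        # y_i, the first numerical value
estimate = 30.0                           # hypothetical first estimation result
loss = first_loss(estimate, groundtruth)  # the "first loss"
```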
“calculate a related loss based on a relationship between a first processing result of the learning model based on the first training image and a second processing result of the learning model based on the second training image;” (Liu, Figures 2-3, and Section III.A. Paragraph 2, including equations 1-2. The distances between feature representations amount to the relationship between processing results.) “and execute the learning processing based on the first loss and the related loss.” (Liu, Figures 2 and 3 in combination demonstrate that both the first loss and related loss are used for the learning processing.)

Regarding claim 15, Liu teaches “The learning system according to claim 14,” “wherein the at least one processor is configured to execute the learning processing based on the first loss, a weighting coefficient relating to the related loss, and the related loss.” (Liu, Figure 2, Algorithm 1, Section III.A. Paragraph 3, “The crucial part of our LSDML is to learn the network parameters f(⋅). To achieve this goal, we first pass a given mini-batch forward the deep network, and we select each quadruplet of (i, j, k, l), such that (x_i, x_j) ∈ P, (x_i, x_k) ∈ N and (x_j, x_l) ∈ N, where P and N denote the positive and negative pair set, respectively. More details are illustrated in Fig. 3. Moreover, to achieve the discriminativeness of the feature similarity, our LSDML enforces each d_f(x_i, x_j) pair in positive set is close to each other, and at the same time d_f(x_i, x_j) and d_f(x_k, x_l) in negative set is pushed far away. As a result, the distance of inter-class pairs is minimized, and the distance of intra-class pairs is larger than a margin τ in the transformed subspace.” Note that the training to pull certain sets together and push other sets apart amounts to applying weight coefficients based on related losses. The use of the first loss and related loss, as claimed, is addressed in the rejection of claim 14, last limitation.)

Regarding claim 16, Liu teaches “The learning system according to claim 14,” “wherein the at least one processor is configured to: calculate a plurality of first losses each of which is the first loss based on the first estimation result and the first numerical value; and execute the learning processing based on the plurality of first losses and the related loss.” (This claim amounts to the performance of claim 14, but with the distinction that there are a plurality of first loss values used for the learning processing, along with the related loss. This is fully embodied in Figures 2-3 of Liu, which show that there are a plurality of loss values used for the learning processing.)

Regarding claim 17, Liu teaches “The learning system according to claim 14,” “wherein the at least one processor is configured to: calculate a plurality of related losses each of which is the related loss based on the first processing result and the second processing result; and execute the learning processing based on the first loss and the plurality of related losses.” (This claim amounts to the performance of claim 14, but with the distinction that there are a plurality of related losses used for the learning processing, along with the first loss. This is fully embodied in Figures 2-3 of Liu, which show that there are a plurality of loss values used for the learning processing.)

Regarding claim 18, Liu teaches “The learning system according to claim 1,” “wherein the first object and the second object are humans different from each other, wherein the first numerical value is an age of the first object, wherein the second numerical value is an age of the second object,” (Liu, section III.A.
Paragraph 1, “To learn robust and discriminative feature similarity for facial age estimation, the basic idea of our LSDML is to exploit the label correlation among face samples in the transformed subspace. Unlike recent deep metric learning methods [37], [38] which utilize hand-crafted features to be fed to the deep networks, our model jointly optimizes both tasks of learning similarity and embedding features for face representation in a unified deep architecture. Let X = {(x_i, y_i)}_{i=1}^{N} denote the training set which consists of N samples, where x_i ∈ R^D denote the i-th face image of D pixels and y_i ∈ R^1 is the groundtruth age value, respectively. Our model is to compare the distance of face pairs by computing the feature representation f(x_i) for the i-th face image x_i via deep neural networks. In terms of network architecture, we employ the residual learning method to optimize the whole network parameters, which have achieved superior performance in a volume of visual recognition tasks [43]. To better measure the learned face descriptors, we apply L2 normalization on the obtained outcomes from the fully connected layers.” Accordingly, each sample includes an image of a face and an associated groundtruth age value. One skilled in the art would understand that there would be images of different people. Liu describes the use of the MORPH database, which includes thousands of different people (humans different from each other).) “wherein the object to be estimated is a human for which his or her age is to be estimated, and wherein the numerical value to be estimated is the age of the object to be estimated.” (Liu, Section IV.D. and IV.E., as referenced in the rejection of claim 1, describes that age is predicted (numerical value is estimated) for the K testing image dataset, wherein the K testing image dataset is the 20% of the MORPH dataset which includes human faces (human object whose age is to be estimated).)
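Stepping back to the loss structure mapped for claims 14 through 16: per-sample first losses are combined with a weighted, pairwise related loss. The hinge form and the weight value below are illustrative assumptions (echoing Liu's margin-τ push/pull behavior), not Liu's exact formulation:

```python
def related_loss(dist_pos: float, dist_neg: float, margin: float) -> float:
    # Hinge-style term: a positive-pair distance should fall below a
    # negative-pair distance by at least the margin, else a penalty accrues.
    return max(0.0, dist_pos - dist_neg + margin)

def total_loss(first_losses, rel_loss, weight):
    # Claim 15/16 pattern: a plurality of first losses plus a
    # weighting coefficient applied to the related loss.
    return sum(first_losses) + weight * rel_loss

first_losses = [25.0, 4.0]         # plurality of first losses (claim 16)
rel = related_loss(0.8, 1.5, 1.0)  # pairwise (related) loss
total = total_loss(first_losses, rel, weight=0.5)
```

The weighting coefficient lets the training balance the per-sample regression objective against the pairwise metric-learning objective.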
Regarding claim 19, Claim 19 recites a method with steps corresponding to the elements of the system recited in Claim 1. Therefore, the recited steps of this claim are mapped to the analogous elements in the corresponding system claim.

Regarding claim 20, Claim 20 recites a non-transitory computer-readable information storage medium storing a program with instructions corresponding to the steps recited in Claim 1. Therefore, the recited programming instructions of this claim are mapped to the analogous steps in the corresponding method claim. Additionally, as described in the rejection of Claim 1, Liu teaches neural network operations performed by a processor of a computer. These operations are therefore program instructions that are inherently stored for execution.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Ustinova (US 20200218971 A1).

Regarding claim 11, Liu teaches “The learning system according to claim 10,” While Liu calculates similarity between the first and second feature amounts and executes learning processing based on the similarity (Liu, Figure 3 shows learning based on similarity and Section III.A.
Paragraph 2, including equations 1-2. Here, Liu describes similarity calculation as Euclidean distance between the first and second feature amounts.), Liu does not expressly disclose the use of cosine similarities. Ustinova discloses the use of cosine similarities with respect to image representations (Ustinova, Paragraph 80, “These maps are combined into a single vector of deep representation of the original image with the length of 500 elements using a fully connected layer, in which each element of the output vector is linked to each element of the map of attributes of each part. Subsequently, L2-normalization is performed. The obtained deep representations of input images along with the marks of classes are used to determine the measures of similarity for all possible pairs through the calculation of a cosine measure of similarity between deep representations, then the probability distributions of similarity measures of positive and negative pairs are formed on the basis of marks of classes using histograms. The obtained distributions are used to calculate the loss function proposed, then the back propagation of the derived loss is performed to adjust the neural network parameters. The process of deep neural network learning at all bases differed only by the loss function selected. The learning with a binomial loss function was performed with two values of losses for negative pairs: c=10 and c=25.”) It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to replace the Euclidean distance-based similarity calculation of Liu with the cosine similarity calculation of Ustinova. The motivation for doing so would have been to use a well-known and conventional alternative (cosine similarity) to compare the feature representations.
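Since the rejection turns on swapping Liu's Euclidean distance for Ustinova's cosine similarity, it is worth noting that for L2-normalized feature amounts the two measures are monotonically related, which supports the interchangeability rationale. A small sketch with hypothetical unit vectors:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0]) / 5.0   # unit-norm first feature amount
b = np.array([0.0, 1.0])         # unit-norm second feature amount

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
lhs = euclidean_distance(a, b) ** 2
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
```

Minimizing Euclidean distance between normalized embeddings is therefore equivalent to maximizing their cosine similarity.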
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Liu with the above teaching of Ustinova to fully disclose, “wherein the at least one processor is configured to: calculate a cosine similarity based on the first feature amount and the second feature amount; and execute the learning processing based on the cosine similarity.”

Allowable Subject Matter

Claim 13 is rejected under 35 USC 101 and objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and amended to overcome the rejection under 35 USC 101. The following is a statement of reasons for the indication of allowable subject matter: With respect to claim 13, in addition to other limitations in the claims, the Prior Art of Record fails to teach, disclose or render obvious the applicant's invention as claimed, in particular: Claim 13 recites: “The learning system according to wherein the first estimation result is a first distribution including each of a plurality of numerical values and a first probability that the first object has the each of the plurality of numerical values, wherein the second estimation result is a second distribution including each of the plurality of numerical values and a second probability that the second object has the each of the plurality of numerical values, and wherein the at least one processor is configured to: calculate a Kullback-Leibler divergence based on the first distribution and the second distribution; and execute the learning processing based on the Kullback-Leibler divergence.” Liu teaches a label-sensitive deep metric learning (LSDML) model for recognizing the age of a person in an image.
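For reference, the Kullback-Leibler computation recited in claim 13 can be sketched as follows; the age bins and probabilities are hypothetical, chosen only to illustrate the claimed calculation over two per-object distributions:

```python
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions over the same age bins."""
    p = np.asarray(p, dtype=float) + eps   # eps guards against log(0)
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

ages = [20, 30, 40]       # the plurality of numerical values (age bins)
p = [0.7, 0.2, 0.1]       # first distribution (first object's probabilities)
q = [0.6, 0.3, 0.1]       # second distribution (second object's probabilities)
kl = kl_divergence(p, q)  # input to the claimed learning processing
```

KL divergence is zero only when the two distributions coincide, so a learning step driven by it can relate the two objects' predicted age distributions, which is the feature the examiner indicates as allowable.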
Ustinova teaches training of deep neural networks using pairwise similarity metrics, such as cosine similarity, for image recognition tasks. Schroff (FaceNet: A unified embedding for face recognition and clustering) teaches a strategy for face recognition, verification, and clustering using a deep convolutional neural network trained to optimize embeddings. Georgescu (Teacher-Student Training and Triplet Loss to Reduce the Effect of Drastic Face Occlusion) teaches facial recognition tasks, including age estimation, for cases where faces in images are partially occluded, including surgical settings and occlusion by virtual reality headsets. Seung (KR20190140824A) teaches age estimation from facial images using triplet loss calculations performed as part of a CNN. However, none of these references expressly disclose the limitations quoted above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON JOSEPH SORRIN whose telephone number is (703)756-1565. The examiner can normally be reached Monday - Friday 9am - 5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON JOSEPH SORRIN/
Examiner, Art Unit 2672

/SUMATI LEFKOWITZ/
Supervisory Patent Examiner, Art Unit 2672

[1] The courts consider a mental process (thinking) that “can be performed in the human mind, or by a human using a pen and paper” to be an abstract idea. CyberSource Corp. v. Retail Decisions, Inc., 654 F.3d 1366, 1372, 99 USPQ2d 1690, 1695 (Fed. Cir. 2011). MPEP 2106.04(a)(2).

Prosecution Timeline

Feb 27, 2024
Application Filed
Mar 04, 2026
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592054
LOW-LIGHT VIDEO PROCESSING METHOD, DEVICE AND STORAGE MEDIUM
2y 5m to grant Granted Mar 31, 2026
Patent 12586245
ROBUST LIDAR-TO-CAMERA SENSOR ALIGNMENT
2y 5m to grant Granted Mar 24, 2026
Patent 12566954
SOLVING MULTIPLE TASKS SIMULTANEOUSLY USING CAPSULE NEURAL NETWORKS
2y 5m to grant Granted Mar 03, 2026
Patent 12555394
IMAGE PROCESSING APPARATUS, METHOD, AND STORAGE MEDIUM FOR GENERATING DATA BASED ON A CAPTURED IMAGE
2y 5m to grant Granted Feb 17, 2026
Patent 12547658
RETRIEVING DIGITAL IMAGES IN RESPONSE TO SEARCH QUERIES FOR SEARCH-DRIVEN IMAGE EDITING
2y 5m to grant Granted Feb 10, 2026
Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
74%
Grant Probability
99%
With Interview (+50.6%)
3y 5m
Median Time to Grant
Low
PTA Risk
Based on 62 resolved cases by this examiner. Grant probability derived from career allow rate.
