Prosecution Insights
Last updated: April 19, 2026
Application No. 17/616,674

LEARNING SYSTEM AND LEARNING METHOD FOR MULTI-LABEL DATA RECOGNITION

Non-Final OA: §101, §103, §112
Filed: Dec 06, 2021
Examiner: GORMLEY, AARON PATRICK
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Rakuten Group Inc.
OA Round: 3 (Non-Final)
Grant Probability: 60% (Moderate)
OA Rounds: 3-4
To Grant: 4y 4m
With Interview: 0%

Examiner Intelligence

Career Allow Rate: 60% (grants 60% of resolved cases; 3 granted / 5 resolved; +5.0% vs TC avg)
Interview Lift: -60.0% (minimal; without vs. with interview, among resolved cases with interview)
Avg Prosecution (typical timeline): 4y 4m
Career History: 35 total applications across all art units; 30 currently pending

Statute-Specific Performance

§101: 30.2% (-9.8% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 8.4% (-31.6% vs TC avg)
§112: 21.5% (-18.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 5 resolved cases.
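The derived figures above are simple arithmetic over the raw counts. The sketch below shows one reading of the dashboard, assuming (these combination rules are not documented by the tool) that "vs TC avg" is a plain difference and that the interview lift is applied additively to the base grant probability; all variable names are hypothetical. Notably, the implied TC average works out to 40.0% for every statute.

```python
# Minimal sketch of the dashboard arithmetic above. Assumes (not confirmed by
# the source) that "vs TC avg" is a simple difference and that interview lift
# is applied additively to the base grant probability. Names are hypothetical.

career_allow_rate = 3 / 5                             # 3 granted / 5 resolved -> 60%
interview_lift = -0.60                                # displayed as "-60.0%"
with_interview = career_allow_rate + interview_lift   # 0.60 - 0.60 = 0.00 -> "0%"

# Statute-specific rates and their displayed deltas vs the Tech Center average.
statute_rates = {"101": 0.302, "103": 0.360, "102": 0.084, "112": 0.215}
deltas = {"101": -0.098, "103": -0.040, "102": -0.316, "112": -0.185}

# Back out the implied TC average for each statute; each works out to 40.0%.
tc_avg = {s: statute_rates[s] - deltas[s] for s in statute_rates}

print(f"with interview: {with_interview:.0%}")        # 0%
print({s: f"{v:.1%}" for s, v in tc_avg.items()})     # all 40.0%
```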

Office Action

§101 §103 §112
DETAILED ACTION

This action is in response to the application filed 12/06/2021. Claims 1-20 are pending and have been examined.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/06/2021 has been entered.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation

The following is a quotation of MPEP 2111.04 II:

The broadest reasonable interpretation of a method (or process) claim having contingent limitations requires only those steps that must be performed and does not include steps that are not required to be performed because the condition(s) precedent are not met. For example, assume a method claim requires step A if a first condition happens and step B if a second condition happens. If the claimed invention may be practiced without either the first or second condition happening, then neither step A nor B is required by the broadest reasonable interpretation of the claim. If the claimed invention requires the first condition to occur, then the broadest reasonable interpretation of the claim requires step A. If the claimed invention requires both the first and second conditions to occur, then the broadest reasonable interpretation of the claim requires both steps A and B.

The broadest reasonable interpretation of a system (or apparatus or product) claim having structure that performs a function, which only needs to occur if a condition precedent is met, requires structure for performing the function should the condition occur. The system claim interpretation differs from a method claim interpretation because the claimed structure must be present in the system regardless of whether the condition is met and the function is actually performed.

Limitation 4 of claim 12 ("calculating, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output") is found to be contingent, and is consequently interpreted as not being a required step of the claimed method under the broadest reasonable interpretation. To become a required limitation of the method, it must be rewritten as a positively recited element.

Claim Objections

Claim 3 is objected to because of the following informalities: "pieces" was previously lined through in the amended claims submitted on 8/14/2025, but it is once again lined through in the latest instant amendments. Appropriate correction is required.

Claim 6 is objected to because of the following informalities: in "calculate, for each unique combination of the three or more labels, the second loss", "unique" was previously lined through in the amended claims submitted on 8/14/2025, but it is once again lined through in the latest instant amendments. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 recites "each of the plurality of images of the support data" and "feature amount of the multi-label query image" in its 6th and 7th limitations. There is insufficient antecedent basis for "the support data", "the plurality of images of the support data", and "the multi-label query image". Thus, the scope of the claim is rendered indefinite. These deficiencies are present in substantially similar independent claims 12-13, and are inherited by all dependent claims.

For purposes of examination, "the support data" is interpreted as referring to the plurality of support images and any data associated with them; "the plurality of images of the support data" is interpreted as referring to the "plurality of support images" referenced in limitation 5; and "the multi-label query image" is interpreted as synonymous with "the query image".

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed inventions are directed to non-statutory subject matter without significantly more.

Claim 1

Step 1: The claim recites "A learning system", and is therefore directed to the statutory category of a machine.

Step 2A Prong 1: The claim recites the following judicial exception(s):

- determine a first label for the query image: This can be performed as a mental process. One can merely decide a first label for the query image.
- determine a second label for the query image: This can be performed as a mental process. One can merely decide a second label for the query image.
- calculate, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: This can be performed as a mental process. One can merely assign a value based on the difference between the model output and the target output.
- acquire a plurality of support images where each of the support images have at least the first label or the second label in common with the query image: This can be performed as a mental process. One can merely determine which support images in a set share label(s) with the query image.
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This can be performed as a mental process. One can merely identify a subset of features present in the query and support data.
- calculate a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: This can be performed as a mental process. One can merely assign a value proportional to the differences between the (feature) subset of the query data and each member of the (feature) subset of the support data.
- wherein each of the first label and second label of the query image are related to an edited feature of the query image: Determining the first and second labels can still be performed as a mental process. The process can merely be performed with a query image that has edited features.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s):

- receive the query image: This amounts to mere reception of data and is insignificant extra-solution activity (MPEP 2106.05(g)).
- comprising at least one processor configured to: This is mere instruction to apply the judicial exceptions by a generic computing machine (MPEP 2106.05(f)).
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- receive the query image: This is an instance of retrieving information from memory, a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).
- comprising at least one processor configured to: This is mere instruction to apply the judicial exceptions by a generic computing machine (MPEP 2106.05(f)).
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Claim 2

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the query image and the support data have at least one label which is the same: The calculation of the losses and acquisition of feature amounts can still be performed as mental processes.
- wherein the at least one processor is configured to calculate the second loss so that, as a difference between the feature amount of the query image and the feature amount of the support data becomes larger, the second loss becomes larger: This can be performed as a mental process. One can merely assign a value proportional to the difference between the feature amount of the query data and the feature amount of the support data.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):
- wherein the at least one processor is configured to calculate the second loss so that, as a difference between the feature amount of the query image and the feature amount of the support data becomes larger, the second loss becomes larger: This is mere instruction to execute a recited judicial exception with a generic machine (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the at least one processor is configured to calculate the second loss so that, as a difference between the feature amount of the query image and the feature amount of the support data becomes larger, the second loss becomes larger: This is mere instruction to execute a recited judicial exception with a generic machine (MPEP 2106.05(f)).

Claim 3

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the at least one processor is configured to calculate an average feature amount based on the feature amount of each of the plurality of images of the support data: This can be performed as a mental process. One can merely determine the most common class for images of support data with matching feature subsets.
- acquire the second loss based on the feature amount of the query image and the average feature amount: This can be performed as a mental process. One can merely assign a value proportional to the difference between the feature amount of the query data and the average feature amount.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the at least one processor is configured to…: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the at least one processor is configured to…: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).

Claim 4

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the at least one processor is configured to: calculate a total loss based on the first loss and the second loss: This can be performed as a mental process. One can merely add the two losses together.
- adjust the parameter based on the total loss: This can be performed as a mental process. One can merely add the total loss value to a parameter.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the at least one processor is configured to calculate a total loss based on the first loss and the second loss: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- adjust the parameter based on the total loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).
Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the at least one processor is configured to calculate a total loss based on the first loss and the second loss: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- adjust the parameter based on the total loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Claim 5

Step 1: The claim recites a machine, as in claim 4.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the at least one processor is configured to calculate the total loss based on the first loss, the second loss, and a weighting coefficient specified by a creator: This can be performed as a mental process. One can merely perform a weighted summation of the two losses.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the at least one processor is configured to calculate the total loss based on the first loss, the second loss, and a weighting coefficient specified by a creator: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the at least one processor is configured to calculate the total loss based on the first loss, the second loss, and a weighting coefficient specified by a creator: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).

Claim 6

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the learning model is configured to recognize three or more labels, which can be arranged in unique combinations: While this further narrows the scope of the learning model, it's still a generic data structure. Thus, calculating a first loss corresponding to its output is still considered a mental process.
- wherein the at least one processor is configured to calculate, for each unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: This can be performed as a mental process. One can merely assign a value based on the difference between the model's output classifying the query image and the target output.
- wherein the at least one processor is configured to acquire, for each unique combination of the three or more labels, the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This can be performed as a mental process. One can merely identify a subset of features present in the query and support data for each combination of labels.
- wherein the at least one processor is configured to calculate, for each unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This can be performed as a mental process. One can merely assign a value proportional to the differences between the subset of query data and the subset of support data.
Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein, for each unique combination of the three or more labels, a data set which includes the query image and the support data exists: This merely links the judicial exceptions to a particular field of use (multi-label supervised learning with three or more labels) (MPEP 2106.05(h)).
- wherein the at least one processor is configured to…: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss which are calculated for each unique combination of the three or more labels: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein, for each unique combination of the three or more labels, a data set which includes the query image and the support data exists: This merely links the judicial exceptions to a particular field of use (multi-label supervised learning with three or more labels) (MPEP 2106.05(h)).
- wherein the at least one processor is configured to…: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss which are calculated for each unique combination of the three or more labels: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Claim 7

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the at least one processor is configured to calculate the first loss based on the parameter of the first learning model: This can be performed as a mental process. One can merely subtract the output of the first learning model from the desired output.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the query image is input to a first learning model: This amounts to mere data transfer and is insignificant extra-solution activity (MPEP 2106.05(g)).
- wherein the support data is input to a second learning model: This amounts to mere data transfer and is insignificant extra-solution activity (MPEP 2106.05(g)).
- wherein a parameter of the first learning model and a parameter of the second learning model are shared: While this further narrows the scope of the learning models, they're still generic data structures. Thus, this constitutes mere instruction to apply a judicial exception (MPEP 2106.05(f)).
- wherein the at least one processor is configured to: calculate the first loss based on the parameter of the first learning model: This is mere instruction to use generic computer hardware to calculate loss based on a parameter in a generic manner (MPEP 2106.05(f)).
- acquire the feature amount of the query image calculated based on the parameter of the first learning model: This is mere instruction to use generic computer hardware to determine feature amounts based on a parameter in a generic manner (MPEP 2106.05(f)).
- acquire … the feature amount of the support data calculated based on the parameter of the second learning model: This is mere instruction to use generic computer hardware to determine feature amounts based on a parameter in a generic manner (MPEP 2106.05(f)).
- adjust each of the parameter of the first learning model and the parameter of the second learning model: This is mere instruction to use generic computer hardware to adjust model parameters in a generic manner (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- the multi-label query data is input to a first learning model: This is an instance of providing training data to a machine learning model, something that would be well-known to one of ordinary skill in the art: "For example, as well-known, neural networks or other machine learning systems can be trained to produce configured output based on training data provided to the neural network or other machine learning system in a training phase" (Yan, US 2019/0367019 A1, [0049]).
- the support data is input to a second learning model: This is an instance of providing training data to a machine learning model, something that would be well-known to one of ordinary skill in the art (Yan, US 2019/0367019 A1, [0049]).
- the parameter of the first learning model and the parameter of the second learning model are shared: The machine learning models are still generic data structures. Thus, this constitutes mere instruction to apply a judicial exception (MPEP 2106.05(f)).
- wherein the at least one processor is configured to perform operations: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).

Claim 8

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the query image and the support data have at least one label which is the same: The calculation of the losses and acquisition of feature amounts can still be performed as mental processes.
- wherein the at least one processor is configured to acquire the second loss based on the feature amount of the query image, the feature amount of the support data, and a coefficient corresponding to a label similarity between the query image and the support data: This can be performed as a mental process. One can merely assign a value proportional (a coefficient) to the difference between the feature amount of the query data and the feature amount of the support data.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the at least one processor is configured to acquire the second loss based on the feature amount of the query image, the feature amount of the support data, and a coefficient corresponding to a label similarity between the query image and the support data: This is mere instruction to execute the judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):
- wherein the at least one processor is configured to acquire the second loss based on the feature amount of the query image, the feature amount of the support data, and a coefficient corresponding to a label similarity between the query image and the support data: This is mere instruction to execute the judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

Claim 9

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- perform operations: This is mere instruction to execute the judicial exceptions with a generic machine (MPEP 2106.05(f)).
- wherein the at least one processor is configured to acquire the query image and the support data from a data group having a long-tail distribution for multi-labels: This amounts to mere data gathering and is insignificant extra-solution activity (MPEP 2106.05(g)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the at least one processor is configured to acquire the query image and the support data from a data group having a long-tail distribution for multi-labels: This is an instance of retrieving information from memory, a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).

Claim 10

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the at least one processor is configured to calculate the first loss based on the output of the learning model in which the last layer is replaced with the layer corresponding to the plurality of labels, and based on the target output: This can be performed as a mental process. One can merely assign a value proportional to the difference between model output and target output.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the learning model is configured such that a last layer of a model which has learned another label other than a plurality of labels to be recognized is replaced with a layer corresponding to the plurality of labels: While this further limits the scope of the learning model, it's insufficient to be considered more than a generic data structure. Thus, calculating feature amounts based on a parameter of the learning model is still considered mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate the first loss based on the output of the learning model in which the last layer is replaced with the layer corresponding to the plurality of labels, and based on the target output: This is mere instruction to execute the judicial exceptions with generic computer hardware (MPEP 2106.05(f)).
Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the learning model is configured such that a last layer of a model which has learned another label other than a plurality of labels to be recognized is replaced with a layer corresponding to the plurality of labels: While this further limits the scope of the learning model, it's insufficient to be considered more than a generic data structure. Thus, calculating feature amounts based on a parameter of the learning model is still considered mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate the first loss based on the output of the learning model in which the last layer is replaced with the layer corresponding to the plurality of labels, and based on the target output: This is mere instruction to execute the judicial exceptions with generic computer hardware (MPEP 2106.05(f)).

Claim 11

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the learning model is a model for recognizing an object included in an image: This merely links the judicial exceptions to a field of use (computer vision) (MPEP 2106.05(h)).
- wherein the query image is a multi-label query image: This merely links the judicial exceptions to a particular field of use (multi-label image processing) (MPEP 2106.05(h)).
- wherein the support data is a support image corresponding to the multi-label query image: This merely links the judicial exceptions to a particular field of use (multi-label image processing) (MPEP 2106.05(h)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the learning model is a model for recognizing an object included in an image: This merely links the judicial exceptions to a field of use (computer vision) (MPEP 2106.05(h)).
- wherein the query image is a multi-label query image: This merely links the judicial exceptions to a particular field of use (multi-label image processing) (MPEP 2106.05(h)).
- wherein the support data is a support image corresponding to the multi-label query image: This merely links the judicial exceptions to a particular field of use (multi-label image processing) (MPEP 2106.05(h)).

Claim 12

Step 1: The claim recites "A learning method", and is therefore directed to the statutory category of a process.

Step 2A Prong 1: The claim recites the following judicial exception(s):

- determining a first label for the query image: This can be performed as a mental process. One can merely decide a first label for the query image.
- determining a second label for the query image: This can be performed as a mental process. One can merely decide a second label for the query image.
- calculating, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: This can be performed as a mental process. One can merely assign a value based on the difference between the model output and the target output.
- acquiring a plurality of support images where each of the support images have at least the first label or the second label in common with the query image: This can be performed as a mental process. One can merely determine which support images in a set share label(s) with the query image.
- acquiring a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This can be performed as a mental process. One can merely identify a subset of features present in the query and support data.
- calculating a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: This can be performed as a mental process. One can merely assign a value proportional to the differences between the (feature) subset of the query data and each member of the (feature) subset of the support data.
- wherein each of the first label and second label of the query image are related to an edited feature of the query image: Determining the first and second labels can still be performed as a mental process. The process can merely be performed with a query image that has edited features.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s):

- receiving the query image: This amounts to mere reception of data and is insignificant extra-solution activity (MPEP 2106.05(g)).
- acquiring a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjusting the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- receiving the query image: This is an instance of retrieving information from memory, a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).
- acquiring a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjusting the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Claim 13

Step 1: The claim recites "A non-transitory computer-readable information storage medium", and is therefore directed to the statutory category of an article of manufacture.

Step 2A Prong 1: The claim recites the following judicial exception(s):

- determine a first label for the query image: This can be performed as a mental process. One can merely decide a first label for the query image.
- determine a second label for the query image: This can be performed as a mental process. One can merely decide a second label for the query image.
- calculate, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: This can be performed as a mental process. One can merely assign a value based on the difference between the model output and the target output.
- acquire a plurality of support images where each of the support images have at least the first label or the second label in common with the query image: This can be performed as a mental process. One can merely determine which support images in a set share label(s) with the query image.
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This can be performed as a mental process. One can merely identify a subset of features present in the query and support data.
- calculate a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: This can be performed as a mental process. One can merely assign a value proportional to the differences between the (feature) subset of the query data and each member of the (feature) subset of the support data.
- wherein each of the first label and second label of the query image are related to an edited feature of the query image: Determining the first and second labels can still be performed as a mental process. The process can merely be performed with a query image that has edited features.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the following additional element(s):

- receive the query image: This amounts to mere reception of data and is insignificant extra-solution activity (MPEP 2106.05(g)).
- A non-transitory computer-readable information storage medium for storing a program for causing a computer to: This is mere instruction to apply the judicial exceptions by a generic computing machine (MPEP 2106.05(f)).
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Step 2B: The following additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- receive the query image: This is an instance of retrieving information from memory, a limitation known to be well-understood, routine, and conventional (MPEP 2106.05(d) II. iv.).
- A non-transitory computer-readable information storage medium for storing a program for causing a computer to: This is mere instruction to apply the judicial exceptions by a generic computing machine (MPEP 2106.05(f)).
- acquire a feature amount of the multi-label query image and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image, which are calculated based on a parameter of the learning model: This is mere instruction to use a generic data structure to execute a judicial exception in a generic manner (MPEP 2106.05(f)).
- adjust the parameter based on the first loss and the second loss: This is mere instruction to apply a judicial exception to a generic data structure in a generic manner (MPEP 2106.05(f)).

Claim 14

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- the learning model recognizes multi-label data of the query image: This merely links the recited judicial exceptions to a field of use (multi-label image recognition) (MPEP 2106.05(h)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- the learning model recognizes multi-label data of the query image: This merely links the recited judicial exceptions to a field of use (multi-label image recognition) (MPEP 2106.05(h)).

Claim 15

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the second loss is a contrastive loss between the feature amount of the query image and the feature amount of each of the plurality of images of the support data: Calculating the second loss can still be performed as a mental process. One can merely assign values proportional to the difference between a query data point and a support data point of matching classes, and the difference between a query data point and a non-matching support data point.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the additional element(s).

Step 2B: The additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s).

Claim 16

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites no further judicial exception(s).

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the plurality of images of the support data are acquired using random sampling: This is an instance of randomly sampling image data, a conventional technique in machine learning, and is insignificant extra-solution activity (MPEP 2106.05(g)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the plurality of images of the support data are acquired using random sampling: This is an instance of randomly sampling image data, a conventional technique for acquiring image data for machine learning, as noted by Park et al. (GENERATING MODIFIED DIGITAL IMAGES UTILIZING A GLOBAL AND SPATIAL AUTOENCODER, published 11/18/2021, US 2021/0358177 A1): "many conventional systems utilize conventional generative models to generate digital images from random samples" (Park, [0033]).

Claim 17

Step 1: The claim recites a machine, as in claim 6.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the learning model is configured to recognize three or more labels, which can be arranged in unique combinations; wherein, for every unique combination of the three or more labels, the data set which includes the query image and the support data exists: This can be performed as a mental process. One can merely recognize combinations of labels in the multi-label query dataset.
- calculate, for every unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: This can be performed as a mental process. One can merely compare the labels of the query data to each unique combination of labels, and assign a value based on whether or not they match.
- acquire, for every unique combination of the three or more labels, the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This can be performed as a mental process. One can merely identify a subset of features present in the query and support data across all unique label combinations.
- calculate, for every unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This can be performed as a mental process. One can merely assign a value proportional to the differences between the (feature) subset of the query data and each member of the (feature) subset of the support data across all unique combinations.
- adjust the parameter based on the first loss and the second loss which are calculated for every unique combination of the three or more labels: This can be performed as a mental process. One can merely add the two loss values to the parameter across all unique combinations.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- the learning model is configured to recognize three or more labels, which can be arranged in unique combinations: This is mere instruction to apply a judicial exception with a generic data structure (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate, for every unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to acquire, for every unique combination of the three or more labels, the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate, for every unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to adjust the parameter based on the first loss and the second loss which are calculated for every unique combination of the three or more labels: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- the learning model is configured to recognize three or more labels, which can be arranged in unique combinations: This is mere instruction to apply a judicial exception with a generic data structure (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate, for every unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to acquire, for every unique combination of the three or more labels, the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to calculate, for every unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).
- wherein the at least one processor is configured to adjust the parameter based on the first loss and the second loss which are calculated for every unique combination of the three or more labels: This is mere instruction to execute a judicial exception with generic computer hardware (MPEP 2106.05(f)).

Claim 18

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the learning system makes a determination as to if the query image was edited: This can be performed as a mental process. One can merely decide if a query image was edited upon observation.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the learning system makes a determination as to if the query image was edited: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the learning system makes a determination as to if the query image was edited: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).
Claim 19

Step 1: The claim recites a machine, as in claim 1.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein the learning system makes a determination as to how the query image was edited: This can be performed as a mental process. One can merely imagine how an observed query image was edited.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein the learning system makes a determination as to how the query image was edited: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein the learning system makes a determination as to how the query image was edited: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).

Claim 20

Step 1: The claim recites a machine, as in claim 19.

Step 2A Prong 1: The claim recites the following further judicial exception(s):

- wherein in determining how the query image was edited, the learning system determines if digital text was added to the query image: This can be performed as a mental process. One can merely identify whether there's digital text in an observed query image.

Step 2A Prong 2: The judicial exception(s) are not integrated into a practical application through the further additional element(s):

- wherein in determining how the query image was edited, the learning system determines if digital text was added to the query image: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).

Step 2B: The further additional element(s) of the claim, taken alone or in combination, do not amount to significantly more than the recited judicial exception(s):

- wherein in determining how the query image was edited, the learning system determines if digital text was added to the query image: This is mere instruction to execute a judicial exception with a generic data structure (MPEP 2106.05(f)).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors.
In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 4, 6, 8, and 11-19 are rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, CVPR 2019, pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, 2016, pp. 5-10), hereafter referred to as Bayar.

Regarding claim 1, Karlinsky discloses [a] learning system for assigning multiple labels to a query image, comprising at least one processor configured to:

receive the query image: "Each episode, for the case of the n-shot, m-way few-shot detection task, contains random n training examples for each of the m randomly chosen classes, and 10 ·m random query images containing one or more instances belonging to these classes" (Karlinsky, page 5202, left column, paragraph 3).

determine a first label for the query image; determine a second label for the query image: "For few-shot object detection, we build upon modern approaches … to generate regions of interest, and a classifier 'head' that classifies these ROIs into one of the object categories or a background region. The input to this subnet are the feature vectors pooled from the ROIs, and the class posteriors (class label[s]) for a given ROI are computed by comparing its embedding vector to the set of representatives for each category." (Karlinsky, page 5198, left column, paragraph 1); "we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f computed by the backbone for the given image (query image) (or ROI)." (Karlinsky, page 5199, right column, paragraph 4).

[Figure 2 of Karlinsky, page 5198: "Overview of our approach. (a) Train time: backbone, embedding space, and mixture models for the classes are learned jointly."] An instance where the query image is mapped to a first label (a dog) and a second label (a bicycle).

calculate, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: "Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image representation) computed by the backbone for the given image (query image) (or ROI)" (Karlinsky, page 5199, right column, paragraph 4); "Having P(C = i|X) and P(B|X) (output of the learning model) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone). The first loss is the regular cross-entropy (CE) with the ground truth labels (target output) given for the image (or ROI) corresponding to X" (Karlinsky, page 5200, right column, paragraph 3).
Note for the further mappings below, "query image" and values derived from the query image, such as regions of interest and pooled feature vectors computed from the image, are used somewhat interchangeably.

acquire a plurality of support images: [Figure 2 of Karlinsky, page 5198: "Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives (support images) are mixture mode centers in the embedding space."] Representatives, each a mode of a distribution representing an image class, can be represented as images. "We represent each class by a mixture model with multiple modes, and consider the centers of these modes as the representative vectors (support images) for the class. Unlike previous methods, we simultaneously learn (acquire) the embedding space, backbone network parameters, and the representative vectors of the training categories, in a single end-to-end training process." (Karlinsky, page 5197, right column, paragraph 2).

…where each of the support images have at least the first label or the second label in common with the query image: "in the few-shot classification literature, it is a common practice to evaluate the approaches by averaging the performance on multiple instances of the few-shot task, called episodes. We offer such an episodic benchmark for the few-shot detection problem, built on a challenging fine-grained few-shot detection task" (Karlinsky, page 5199, left column, paragraph 1); "An N-way, M-shot episode is an instance of the few-shot task represented by a set of M training examples (support images) from each of the N categories, and one query image of an object from one of the categories (at least one label in common). The goal is to determine the correct category for the query" (Karlinsky, page 5199, left column, paragraph 3).

acquire a feature amount of the multi-label query image: [Figure 3 of Karlinsky, page 5200.] The DML embedding module creates an embedding vector (feature amount) of the feature vector input (query image). "Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount)" (Karlinsky, page 5199, right column, paragraph 4). As indicated by paragraph [0087] of the published application, a feature amount can be an embedding vector.

…and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image: "As an additional set of trained parameters, we hold a set of 'representatives' R_ij ∈ R^e (feature amount of each of the plurality of images of the support data). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution in the embedding space, for the i-th class out of the total of N classes. We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K" (Karlinsky, page 5200, left column, paragraph 1). Each support image has a label corresponding to the multi-label query image.
…which are calculated based on a parameter of the learning model: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers (parameter[s]) with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount of the query image)” (Karlinsky, page 5199, right column, paragraph 4); “In our implementation, the representatives (feature amount of each of the plurality of images of the support data) are realized as weights (parameter[s]) of an FC layer of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2).

calculate a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the multi-label query image) and the representatives R_ij (feature amount of the plurality of images of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “we use a sum of two losses to train our model ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example” (Karlinsky, page 5200, right column, paragraph 3). Here d, a measure of distance between the feature amounts of the query data and the feature amounts of the support data, is used to calculate this second loss. Since the distance function is minimized over all representatives of all classes, the distance between the query embedding and each of the plurality of images of the support data must be calculated.

adjust the parameter based on the first loss and the second loss: “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3).

Karlinsky relates to few-shot learning for multi-label image processing systems and is analogous to the claimed invention.
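To make the two-loss scheme mapped above concrete, the following is a minimal PyTorch-style sketch of the margin-based “second loss” over the N × K distance matrix, and of a training step that adjusts the parameters based on the sum of the two losses. It is a sketch under the assumptions stated in the comments, not Karlinsky’s actual code; all identifiers are hypothetical:

    import torch
    import torch.nn.functional as F

    def second_loss(E: torch.Tensor, R: torch.Tensor, i_star: int,
                    alpha: float = 1.0) -> torch.Tensor:
        """Margin loss between the query feature amount E (shape (e,)) and
        the support feature amounts R (representatives, shape (N, K, e));
        i_star is the correct class index and alpha the margin."""
        n, k, e = R.shape
        d = torch.cdist(E.unsqueeze(0), R.reshape(n * k, e)).reshape(n, k)  # d_ij(E), an N x K matrix
        d_correct = d[i_star].min()                     # min_j d_{i*j}(E)
        mask = torch.arange(n) != i_star
        d_wrong = d[mask].min()                         # min_{i != i*, j} d_ij(E)
        return F.relu(d_correct - d_wrong + alpha)      # |...|_+ realized as ReLU

    def training_step(optimizer, logits, labels, E, R, i_star):
        """Adjust the parameters based on the first loss and the second loss."""
        loss = F.cross_entropy(logits, labels) + second_loss(E, R, i_star)
        optimizer.zero_grad()
        loss.backward()   # gradients flow to backbone, embedding, and representative weights
        optimizer.step()
        return loss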
While Karlinsky fails to disclose the further limitations of the claim, Hu teaches [a] learning system for assigning multiple labels to a query image, comprising at least one processor: “[0012] Embodiments disclosed herein include a computer vision model that identifies a combination of graphic elements present in a query image based on a support set of images that include other various combinations of the graphic features.” (Hu, [0012]); “Software or firmware to implement the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A "machine-readable medium", as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, any device with one or more processors, etc.)” (Hu, [0058]). Hu relates to few-shot learning for image processing systems and is analogous to the claimed invention.

Karlinsky teaches a method of training an image classifier with few-shot learning. The claimed invention improves upon this method by executing it with computer processors. Hu teaches a method of training an image classifier with few-shot learning which can be executed on computer processors, applicable to Karlinsky. A person of ordinary skill in the art would have recognized that running Karlinsky’s method on Hu’s processor hardware would lead to the predictable result of the method being performed by a computer as described, and would improve the known device by allowing it to be used to process real data on a computer (MPEP 2143 I. (D): Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

While Karlinsky and Hu fail to disclose the further limitations of the claim, Bayar discloses a system, wherein each of the first label and the second label of the query image are related to an edited feature of the query image: “In this paper, we proposed a novel CNN-based universal forgery detection technique that can automatically learn how to detect different image manipulations (edit[s]). To prevent the CNN from learning features that represent an image's content, we proposed a new form of convolutional [layer] specifically designed to suppress an image's content and learn manipulation detection features. We accomplished this by specifically constraining this new convolutional layer to learn prediction error filters. Through a series of experiments, we demonstrated that our CNN-based universal forensic approach can automatically learn how to detect multiple image manipulations (multiple label[s]) without relying on pre-selected features or any preprocessing.” (Bayar, page 7, left column, paragraph 4). Bayar relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky and Hu teaches a system that generates losses based on similarities of query images and support images, and based on model output vs. intended output. Bayar teaches a method of detecting edits in images. It would have been obvious to one of ordinary skill in the art to combine the combination of Karlinsky and Hu with Bayar by detecting edits in the query images with Bayar’s method.
This would achieve the predictable result of identifying whether and how a query image has been manipulated, with the system of Karlinsky and Hu and Bayar’s method performing the same functions in combination as they did separately (MPEP 2143 I. (A): Combining prior art elements according to known methods to yield predictable results).

Regarding claim 2, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further teaches a method, wherein the query image and the support data have at least one label which is the same: (Karlinsky) “An N-way, M-shot episode is an instance of the few-shot task represented by a set of M training examples (support images) from each of the N categories, and one query image of an object from one of the categories (at least one label which is the same). The goal is to determine the correct category for the query” (Karlinsky, page 5199, left column, paragraph 3).

the at least one processor is configured to calculate the second loss so that, as a difference between the feature amount of the query image and the feature amount of the support data becomes larger, the second loss becomes larger: (Karlinsky) “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) (difference between the feature amount[s]) between E and the representatives R_ij” (Karlinsky, page 5200, left column, paragraph 2); (Karlinsky) “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone) ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example and |·|_+ is the ReLU function.” (Karlinsky, page 5200, right column, paragraph 3). The loss increases as the distance d between the query feature amount and the support feature amounts of correct classes increases. As noted in the rejection of claim 1, Hu discloses using a computer processor to perform similar operations as Karlinsky.

Regarding claim 4, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further teaches a system, wherein the at least one processor is configured to calculate a total loss based on the first loss and the second loss: (Karlinsky) “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3); adjust the parameter based on the total loss: (Karlinsky) “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2).

Regarding claim 6, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further teaches a method, wherein the learning model is configured to recognize three or more labels, which can be arranged in unique combinations: (Karlinsky) “we trained our DML classifier on the ImageNet Attributes dataset defined in [25], which contains 116236 images from 90 classes (labels)” (Karlinsky, page 5201, right column, paragraph 4).
In this instance, a combination of 90 classes is being used for training.

for each unique combination of the three or more labels, a data set which includes the query image and the support data exists: (Karlinsky) [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). This is part of a diagram showing Karlinsky’s training architecture. As shown here, an input (query image) and a set of representatives (support data) exist in the training scheme, supporting a variable number of labels (N classes). [Karlinsky, Figure 2] “Overview of our approach. (a) Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives are mixture mode centers in the embedding space” (Karlinsky, page 5198, Figure 2). Figure 2(a) shows the process of mapping classes to an embedding space during training. This training methodology supports at least one unique combination of three labels (bike class, dog class, and truck class).

the at least one processor is configured to calculate, for each unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: (Karlinsky) “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI)” (Karlinsky, page 5199, right column, paragraph 4); “The first loss is the regular cross-entropy (CE) with the ground truth labels given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3).

acquire, for each unique combination of the three or more labels, the feature amount of the query image corresponding to the combination: (Karlinsky) [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). The DML embedding module creates an embedding vector (feature amount) of the feature vector input (query image). This corresponds to the N classes of the representative set, against which it will be compared. “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query data) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount), where commonly embedding size e ≪ f” (Karlinsky, page 5199, right column, paragraph 4). As indicated by page 25, line 23 to page 26, line 3 of the instant Specification, a feature amount can be an embedding vector.

acquire, for each combination of the three or more labels, ... the feature amount of the support data corresponding to the combination: (Karlinsky) “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e (feature amount). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution (support data) in the embedding space, for the i-th class out of the total of N classes. We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K” (Karlinsky, page 5200, left column, paragraph 1).
calculate, for each unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: (Karlinsky) “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the query data) and the representatives R_ij (feature amount of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone) ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+” (Karlinsky, page 5200, right column, paragraph 3). Here d, a measure of distance between the feature amounts of the query data and the feature amounts of the support data, is used to calculate this second loss.

adjust the parameter based on the first loss and the second loss which are calculated for each unique combination of the three or more labels: (Karlinsky) “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3).

Regarding claim 8, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further teaches a system, wherein the query image and the support data have at least one label which is the same: (Karlinsky) “in the few-shot classification literature, it is a common practice to evaluate the approaches by averaging the performance on multiple instances of the few-shot task, called episodes. We offer such an episodic benchmark for the few-shot detection problem, built on a challenging fine-grained few-shot detection task” (Karlinsky, page 5199, left column, paragraph 1); “An N-way, M-shot episode is an instance of the few-shot task represented by a set of M training examples from each of the N categories, and one query image of an object from one of the categories. The goal is to determine the correct category for the query” (Karlinsky, page 5199, left column, paragraph 3). If there is a correct category for a query, the query shares a label with at least one support image.

the at least one processor is configured to acquire the second loss based on the feature amount of the query image, the feature amount of the support data, and a coefficient corresponding to a label similarity between the query image and the support data: (Karlinsky) “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) (label similarity) between E (feature amount of the query image) and the representatives R_ij (feature amount of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone) ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example and |·|_+ is the ReLU function.” (Karlinsky, page 5200, right column, paragraph 3). Here d measures the distance between the query embedding and a support embedding. It is minimized for classes that include the query and maximized for classes that do not; thus it can be considered a measure of label similarity.

Regarding claim 11, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky further teaches a method, wherein the learning model is a model for recognizing an object included in an image: “we demonstrate the effectiveness of our approach on the problem of few-shot object detection, by incorporating the proposed DML architecture as a classification head into a standard object detection model. We achieve the best results on the ImageNet-LOC dataset” (Karlinsky, page 5197, Abstract).

the query image is a multi-label query image: [Karlinsky, Figure 2] “Figure 2. Overview of our approach. (a) Train time: backbone, embedding space, and mixture models for the classes are learned jointly” (Karlinsky, page 5198, Figure 2). The figure shows an instance where the query image is mapped to a first label (dog) and a second label (bicycle).

the support data is a support image corresponding to the multi-label query image: “Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives (support images) are mixture mode centers in the embedding space” (Karlinsky, page 5198, Figure 2); “We represent each class by a mixture model with multiple modes, and consider the centers of these modes as the representative vectors (support images) for the class. Unlike previous methods, we simultaneously learn (acquire) the embedding space, backbone network parameters, and the representative vectors of the training categories, in a single end-to-end training process.” (Karlinsky, page 5197, right column, paragraph 2).

Regarding claim 12, Karlinsky teaches a learning method, comprising:

receiving the query image: “Each episode, for the case of the n-shot, m-way few-shot detection task, contains random n training examples for each of the m randomly chosen classes, and 10·m random query images containing one or more instances belonging to these classes” (Karlinsky, page 5202, left column, paragraph 3).

determining a first label for the query image; determining a second label for the query image: “For few-shot object detection, we build upon modern approaches … to generate regions of interest, and a classifier ‘head’ that classifies these ROIs into one of the object categories or a background region. The input to this subnet are the feature vectors pooled from the ROIs, and the class posteriors (class label[s]) for a given ROI are computed by comparing its embedding vector to the set of representatives for each category.” (Karlinsky, page 5198, left column, paragraph 1); “we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f computed by the backbone for the given image (query image) (or ROI).” (Karlinsky, page 5199, right column, paragraph 4); [Karlinsky, Figure 2] “Figure 2. Overview of our approach. (a) Train time: backbone, embedding space, and mixture models for the classes are learned jointly” (Karlinsky, page 5198, Figure 2). The figure shows an instance where the query image is mapped to a first label (dog) and a second label (bicycle).

calculating, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image representation) computed by the backbone for the given image (query image) (or ROI)” (Karlinsky, page 5199, right column, paragraph 4); “Having P(C = i|X) and P(B|X) (output of the learning model) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone). The first loss is the regular cross-entropy (CE) with the ground truth labels (target output) given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3). As noted above, “query image” and values derived from the query image are used somewhat interchangeably in these mappings.

acquiring a plurality of support images: [Karlinsky, Figure 2] “Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives (support images) are mixture mode centers in the embedding space” (Karlinsky, page 5198, Figure 2). Representatives, each a mode of a distribution representing an image class, can be represented as images. “We represent each class by a mixture model with multiple modes, and consider the centers of these modes as the representative vectors (support images) for the class. Unlike previous methods, we simultaneously learn (acquire) the embedding space, backbone network parameters, and the representative vectors of the training categories, in a single end-to-end training process.” (Karlinsky, page 5197, right column, paragraph 2).

…where each of the support images have at least the first label or the second label in common with the query image: “in the few-shot classification literature, it is a common practice to evaluate the approaches by averaging the performance on multiple instances of the few-shot task, called episodes. We offer such an episodic benchmark for the few-shot detection problem, built on a challenging fine-grained few-shot detection task” (Karlinsky, page 5199, left column, paragraph 1); “An N-way, M-shot episode is an instance of the few-shot task represented by a set of M training examples (support images) from each of the N categories, and one query image of an object from one of the categories (at least one label in common). The goal is to determine the correct category for the query” (Karlinsky, page 5199, left column, paragraph 3).

acquiring a feature amount of the multi-label query image: [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). The DML embedding module creates an embedding vector (feature amount) of the feature vector input (query image). “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount)” (Karlinsky, page 5199, right column, paragraph 4). As indicated by paragraph [0087] of the published application, a feature amount can be an embedding vector.

…and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image: “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e (feature amount of each of the plurality of images of the support data). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution in the embedding space, for the i-th class out of the total of N classes. We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K” (Karlinsky, page 5200, left column, paragraph 1). Each support image has a label corresponding to the multi-label query image.

…which are calculated based on a parameter of the learning model: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers (parameter[s]) with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount of the query image)” (Karlinsky, page 5199, right column, paragraph 4); “In our implementation, the representatives (feature amount of each of the plurality of images of the support data) are realized as weights (parameter[s]) of an FC layer of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2).

calculating a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the multi-label query image) and the representatives R_ij (feature amount of the plurality of images of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “we use a sum of two losses to train our model ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example” (Karlinsky, page 5200, right column, paragraph 3). Here d, a measure of distance between the feature amounts of the query data and the feature amounts of the support data, is used to calculate this second loss. Since the distance function is minimized over all representatives of all classes, the distance between the query embedding and each of the plurality of images of the support data must be calculated.
adjusting the parameter based on the first loss and the second loss: “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3). Karlinsky relates to few-shot learning for multi-label image processing systems and is analogous to the claimed invention.

While Karlinsky fails to disclose the further limitations of the claim, Bayar discloses a system, wherein each of the first label and the second label of the query image are related to an edited feature of the query image: “In this paper, we proposed a novel CNN-based universal forgery detection technique that can automatically learn how to detect different image manipulations (edit[s]). To prevent the CNN from learning features that represent an image's content, we proposed a new form of convolutional [layer] specifically designed to suppress an image's content and learn manipulation detection features. We accomplished this by specifically constraining this new convolutional layer to learn prediction error filters. Through a series of experiments, we demonstrated that our CNN-based universal forensic approach can automatically learn how to detect multiple image manipulations (multiple label[s]) without relying on pre-selected features or any preprocessing.” (Bayar, page 7, left column, paragraph 4). Bayar relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky and Hu teaches a system that generates losses based on similarities of query images and support images, and based on model output vs. intended output. Bayar teaches a method of detecting edits in images. It would have been obvious to one of ordinary skill in the art to combine the combination of Karlinsky and Hu with Bayar by detecting edits in the query images with Bayar’s method. This would achieve the predictable result of identifying whether and how a query image has been manipulated, with the system of Karlinsky and Hu and Bayar’s method performing the same functions in combination as they did separately (MPEP 2143 I. (A): Combining prior art elements according to known methods to yield predictable results).

Regarding claim 13, Karlinsky teaches program instructions to:

receive the query image: “Each episode, for the case of the n-shot, m-way few-shot detection task, contains random n training examples for each of the m randomly chosen classes, and 10·m random query images containing one or more instances belonging to these classes” (Karlinsky, page 5202, left column, paragraph 3).

determine a first label for the query image; determine a second label for the query image: “For few-shot object detection, we build upon modern approaches … to generate regions of interest, and a classifier ‘head’ that classifies these ROIs into one of the object categories or a background region. The input to this subnet are the feature vectors pooled from the ROIs, and the class posteriors (class label[s]) for a given ROI are computed by comparing its embedding vector to the set of representatives for each category.” (Karlinsky, page 5198, left column, paragraph 1); “we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f computed by the backbone for the given image (query image) (or ROI).” (Karlinsky, page 5199, right column, paragraph 4); [Karlinsky, Figure 2] “Figure 2. Overview of our approach. (a) Train time: backbone, embedding space, and mixture models for the classes are learned jointly” (Karlinsky, page 5198, Figure 2). The figure shows an instance where the query image is mapped to a first label (dog) and a second label (bicycle).

calculate, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image representation) computed by the backbone for the given image (query image) (or ROI)” (Karlinsky, page 5199, right column, paragraph 4); “Having P(C = i|X) and P(B|X) (output of the learning model) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone). The first loss is the regular cross-entropy (CE) with the ground truth labels (target output) given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3). As noted above, “query image” and values derived from the query image are used somewhat interchangeably in these mappings.

acquire a plurality of support images: [Karlinsky, Figure 2] “Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives (support images) are mixture mode centers in the embedding space” (Karlinsky, page 5198, Figure 2). Representatives, each a mode of a distribution representing an image class, can be represented as images. “We represent each class by a mixture model with multiple modes, and consider the centers of these modes as the representative vectors (support images) for the class. Unlike previous methods, we simultaneously learn (acquire) the embedding space, backbone network parameters, and the representative vectors of the training categories, in a single end-to-end training process.” (Karlinsky, page 5197, right column, paragraph 2).

…where each of the support images have at least the first label or the second label in common with the query image: “in the few-shot classification literature, it is a common practice to evaluate the approaches by averaging the performance on multiple instances of the few-shot task, called episodes. We offer such an episodic benchmark for the few-shot detection problem, built on a challenging fine-grained few-shot detection task” (Karlinsky, page 5199, left column, paragraph 1); “An N-way, M-shot episode is an instance of the few-shot task represented by a set of M training examples (support images) from each of the N categories, and one query image of an object from one of the categories (at least one label in common). The goal is to determine the correct category for the query” (Karlinsky, page 5199, left column, paragraph 3).
acquire a feature amount of the multi-label query image: [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). The DML embedding module creates an embedding vector (feature amount) of the feature vector input (query image). “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount)” (Karlinsky, page 5199, right column, paragraph 4). As indicated by paragraph [0087] of the published application, a feature amount can be an embedding vector.

…and a feature amount of each of the plurality of images of the support data corresponding to the multi-label query image: “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e (feature amount of each of the plurality of images of the support data). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution in the embedding space, for the i-th class out of the total of N classes. We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K” (Karlinsky, page 5200, left column, paragraph 1). Each support image has a label corresponding to the multi-label query image.

…which are calculated based on a parameter of the learning model: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers (parameter[s]) with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount of the query image)” (Karlinsky, page 5199, right column, paragraph 4); “In our implementation, the representatives (feature amount of each of the plurality of images of the support data) are realized as weights (parameter[s]) of an FC layer of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2).

calculate a second loss based on the feature amount of the multi-label query image and the feature amount of each of the plurality of images of the support data: “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the multi-label query image) and the representatives R_ij (feature amount of the plurality of images of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “we use a sum of two losses to train our model ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example” (Karlinsky, page 5200, right column, paragraph 3). Here d, a measure of distance between the feature amounts of the query data and the feature amounts of the support data, is used to calculate this second loss. Since the distance function is minimized over all representatives of all classes, the distance between the query embedding and each of the plurality of images of the support data must be calculated.

adjust the parameter based on the first loss and the second loss: “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3). Karlinsky relates to few-shot learning for multi-label image processing systems and is analogous to the claimed invention.

While Karlinsky fails to disclose the further limitations of the claim, Hu teaches [a] non-transitory computer-readable information storage medium for storing a program for causing a computer to: “FIG. 8 is a high-level block diagram showing an example of a processing device 800 that can represent a system to run any of the methods/algorithms described above” (Hu, [0054]); “Physical and functional components (e.g., devices, engines, modules, and data repositories, etc.) associated with processing device 800 can be implemented as circuitry, firmware, software, other executable instructions, or any combination thereof ... the functional components described can be implemented as instructions on a tangible storage memory capable of being executed by a processor or other integrated circuit chip (e.g., software, software libraries, application program interfaces, etc.). The tangible storage memory can be computer readable data storage. The tangible storage memory may be volatile or non-volatile memory. In some embodiments, the volatile memory may be considered "non-transitory" in the sense that it is not a transitory signal. Memory space and storages described in the figures can be implemented with the tangible storage memory as well, including volatile or nonvolatile memory” (Hu, [0059]). Hu relates to few-shot learning for image processing systems and is analogous to the claimed invention.

Karlinsky teaches a method of training an image classifier with few-shot learning. The claimed invention improves upon this method by storing it as instructions on non-transitory computer media. Hu teaches a method of training an image classifier with few-shot learning which can be encoded in instructions and stored in non-transitory computer media, applicable to Karlinsky. A person of ordinary skill in the art would have recognized that storing Karlinsky’s method on Hu’s computer media would lead to the predictable result of the method being executed in volatile memory or stored to be executed at a later time in non-volatile memory, and would improve the known device by allowing it to be used to process real data on a computer (MPEP 2143 I. (D): Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

While Karlinsky and Hu fail to disclose the further limitations of the claim, Bayar discloses a system, wherein each of the first label and the second label of the query image are related to an edited feature of the query image: “In this paper, we proposed a novel CNN-based universal forgery detection technique that can automatically learn how to detect different image manipulations (edit[s]). To prevent the CNN from learning features that represent an image's content, we proposed a new form of convolutional [layer] specifically designed to suppress an image's content and learn manipulation detection features. We accomplished this by specifically constraining this new convolutional layer to learn prediction error filters. Through a series of experiments, we demonstrated that our CNN-based universal forensic approach can automatically learn how to detect multiple image manipulations (multiple label[s]) without relying on pre-selected features or any preprocessing.” (Bayar, page 7, left column, paragraph 4). Bayar relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky and Hu teaches a system that generates losses based on similarities of query images and support images, and based on model output vs. intended output. Bayar teaches a method of detecting edits in images. It would have been obvious to one of ordinary skill in the art to combine the combination of Karlinsky and Hu with Bayar by detecting edits in the query images with Bayar’s method. This would achieve the predictable result of identifying whether and how a query image has been manipulated, with the system of Karlinsky and Hu and Bayar’s method performing the same functions in combination as they did separately (MPEP 2143 I. (A): Combining prior art elements according to known methods to yield predictable results).

Regarding claim 14, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky further discloses a system, wherein the learning model recognizes multi-label data of the query image: [Karlinsky, Figure 2] “Figure 2. Overview of our approach. (a) Train time: backbone, embedding space, and mixture models for the classes are learned jointly” (Karlinsky, page 5198, Figure 2). The figure shows an instance where the query image is mapped to a first label (dog) and a second label (bicycle).

Regarding claim 15, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further discloses a system, wherein the second loss is a contrastive loss between the feature amount of the query image and the feature amount of each of the plurality of images of the support data: (Karlinsky) “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the query image) and the representatives R_ij (feature amount of the plurality of images of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone) ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+” (Karlinsky, page 5200, right column, paragraph 3).

Examiner’s note: the term min_j d_{i*j}(E) minimizes the distance between the query input and the nearest correct-class representative in the embedding space; the term −min_{i≠i*, j} d_{ij}(E) maximizes the distance between the query input and the nearest incorrect-class representatives in the embedding space. This loss is substantially similar to the contrastive loss disclosed on page 27, line 22 to page 28, line 23 of the instant Specification. Query and support data pairs are compared by distance in the embedding space, with loss increasing as incorrect class pairings are closer, and decreasing as correct class pairings are closer. Since the distance function is minimized over all representatives of all classes, the distance between the query embedding and each of the plurality of images of the support data must be calculated.

Regarding claim 16, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky further discloses a system, wherein the plurality of images of the support data are acquired using random sampling: “our proposed approach end-to-end trains a single (monolithic) network architecture capable of learning the DML embedding together with the representatives (plurality of images of the support data)” (Karlinsky, page 5198, right column, paragraph 2); “In our implementation, the representatives (plurality of images of the support data) are realized as weights of an FC layer of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “In all of our DML-based classification experiments, we set σ = 0.5 and use K = 3 representatives per category … Each training batch was constructed by randomly sampling M = 12 categories and sampling D = 4 random instances from each of those categories” (Karlinsky, page 5201, left column, paragraph 1); “For few-shot detection, we used our DML sub-net … We trained using K = 5 representatives per class, and σ = 0.5” (Karlinsky, page 5201, left column, paragraph 1).
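For illustration of the randomized batch construction quoted above (M random categories, D random instances from each), a minimal Python sketch follows; the function and variable names are hypothetical and are not drawn from Karlinsky’s implementation:

    import random

    def sample_support_batch(images_by_class: dict,
                             m_categories: int = 12,
                             d_instances: int = 4) -> dict:
        """Randomly sample support data: M random categories, then D random
        instances from each sampled category, mirroring the quoted batch
        construction."""
        classes = random.sample(sorted(images_by_class), m_categories)
        return {c: random.sample(images_by_class[c], d_instances) for c in classes}

Regarding claim 17, the rejection of claim 6 in view of Karlinsky, Hu, and Bayar is incorporated. Karlinsky, in combination with Hu, further teaches a method wherein: the learning model is configured to recognize three or more labels, which can be arranged in unique combinations: (Karlinsky) “we trained our DML classifier on the ImageNet Attributes dataset defined in [25], which contains 116236 images from 90 classes (labels)” (Karlinsky, page 5201, right column, paragraph 4). In this instance, a combination of 90 classes is being used for training.

wherein, for every unique combination of the three or more labels, the data set which includes the query image and the support data exists: (Karlinsky) [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). This is part of a diagram showing Karlinsky’s training architecture.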
As shown here, an input (query image) and a set of representatives (support data) exist in the training scheme, supporting one unique combination of N labels. [Karlinsky, Figure 2] “Overview of our approach. (a) Train time: backbone, embedding space and mixture models for the classes are learned jointly, class representatives are mixture mode centers in the embedding space” (Karlinsky, page 5198, Figure 2). Figure 2(a) shows the process of mapping classes to an embedding space during training. This training methodology supports at least one unique combination of three labels (bike class, dog class, and truck class).

wherein the at least one processor is configured to calculate, for every unique combination of the three or more labels, the first loss based on the query image corresponding to the combination: (Karlinsky) “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query data representation) computed by the backbone for the given image (or ROI)” (Karlinsky, page 5199, right column, paragraph 4); “The first loss is the regular cross-entropy (CE) with the ground truth labels given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3).

acquire, for every unique combination of the three or more labels, the feature amount of the query image corresponding to the combination: (Karlinsky) [Karlinsky, Figure 3] (Karlinsky, page 5200, Figure 3). The DML embedding module creates an embedding vector (feature amount) of the feature vector input (query image). “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module, which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount)” (Karlinsky, page 5199, right column, paragraph 4). As indicated by paragraph [0087] of the published application, a feature amount can be an embedding vector.

acquire, for every combination of the three or more labels, … the feature amount of the support data corresponding to the combination: (Karlinsky) “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e (feature amount of each of the plurality of images of the support data). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution in the embedding space, for the i-th class out of the total of N classes. We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K” (Karlinsky, page 5200, left column, paragraph 1). Each support image has a label corresponding to the multi-label query image.
calculate, for every unique combination of the three or more labels, the second loss based on the feature amount of the query image corresponding to the combination and the feature amount of the support data corresponding to the combination: (Karlinsky) “For a given image (or detector ROI) and its corresponding embedding vector E, our network computes a matrix of N × K distances d_ij(E) = d(E, R_ij) between E (feature amount of the multi-label query image) and the representatives R_ij (feature amount of the plurality of images of the support data)” (Karlinsky, page 5200, left column, paragraph 2); “we use a sum of two losses to train our model ... The other (second loss) is intended to ensure there is at least α margin between the distance of E to the closest representative of the correct class, and the distance of E to the closest representative of a wrong class: |min_j d_{i*j}(E) − min_{i≠i*, j} d_{ij}(E) + α|_+ where i* is the correct class index for the current example” (Karlinsky, page 5200, right column, paragraph 3).

adjust[ing] the parameter based on the first loss and the second loss which are calculated for every unique combination of the three or more labels: (Karlinsky) “During training, this simple construction flows the gradients to the weights (parameter[s]) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2); “Having P(C = i|X) and P(B|X) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone)” (Karlinsky, page 5200, right column, paragraph 3).

Regarding claim 18, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. Bayar further discloses a system, wherein the learning system makes a determination as to if the query image was edited: “In this paper, we proposed a novel CNN-based universal forgery detection technique that can automatically learn how to detect different image manipulations (edit[s]). To prevent the CNN from learning features that represent an image's content, we proposed a new form of convolutional [layer] specifically designed to suppress an image's content and learn manipulation detection features. We accomplished this by specifically constraining this new convolutional layer to learn prediction error filters. Through a series of experiments, we demonstrated that our CNN-based universal forensic approach can automatically learn how to detect multiple image manipulations (multiple label[s]) without relying on pre-selected features or any preprocessing.” (Bayar, page 7, left column, paragraph 4). Bayar relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky, Hu, and Bayar teaches a system that generates losses based on similarities of query images and support images, and based on model output vs. intended output. Bayar teaches a method of detecting edits in images. It would have been obvious to one of ordinary skill in the art to further modify the combination of Karlinsky, Hu, and Bayar by detecting edits in the query images with Bayar’s method. This would achieve the predictable result of identifying whether and how a query image has been manipulated, with the system of Karlinsky and Hu and Bayar’s method performing the same functions in combination as they did separately (MPEP 2143 I. (A): Combining prior art elements according to known methods to yield predictable results).

Regarding claim 19, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated.
Bayar further discloses a system, wherein the learning system makes a determination as to how the query image was edited: “In this paper, we proposed a novel CNN-based universal forgery detection technique that can automatically learn how to detect different image manipulations (edit[s]). To prevent the CNN from learning features that represent an image's content, we proposed a new form of convolutional [layer] specifically designed to suppress an image's content and learn manipulation detection features. We accomplished this by specifically constraining this new convolutional layer to learn prediction error filters. Through a series of experiments, we demonstrated that our CNN-based universal forensic approach can automatically learn how to detect multiple image manipulations (multiple label[s]) without relying on pre-selected features or any preprocessing.” (Bayar, page 7, left column, paragraph 4); “In our first set of experiments, we trained different CNNs to detect each of the four manipulations discussed in Section 5.1. Each CNN corresponds to a binary classifier that detects one type of possible image operation with the same architecture outlined in Section 4.” (Bayar, page 6, left column, paragraph 5). Bayar relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky, Hu, and Bayar teaches a system that generates losses based on similarities of query images and support images, and based on model output vs. intended output. Bayar teaches a method of detecting edits in images. It would have been obvious to one of ordinary skill in the art to further modify the existing combination by detecting edits in the query images with Bayar’s method. This would achieve the predictable result of identifying whether and how a query image has been manipulated, with the system of Karlinsky and Hu and Bayar’s method performing the same functions in combination as they did separately (MPEP 2143 I. (A): Combining prior art elements according to known methods to yield predictable results).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Liu et al. (Transductive Prototypical Network For Few-Shot Classification, published 9/30/2020, 2020 IEEE International Conference on Image Processing (ICIP), pp. 1671-1675), hereafter referred to as Liu, and Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, pp. 5-10), hereafter referred to as Bayar.

Regarding claim 3, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated.
While Karlinsky fails to disclose the further limitations of the claim, Liu, in combination with Hu, discloses a system, wherein the at least one processor is configured to: calculate an average feature amount based on the feature amount of each of the plurality of images of the support data: (Liu) “Specifically, we construct each episode by randomly sampling N classes from Ctrain and K labeled samples per class as the support set S” (Liu, page 2, left column, paragraph 2); (Liu) “it computes the mean vector (average feature amount) of embedded features (feature amount) per class from the support set (support data) as the corresponding prototype representation. So, the prototype of the class j can be expressed using the following formulation: p_j = (1/|S_j|) Σ_{(x_i, y_i) ∈ S_j} f_θ(x_i)” (Liu, page 2, right column, paragraph 1); “We evaluate our Td-PN approach compared with the recent state-of-the-art methods on two datasets, miniImageNet [3] and tieredImageNet [13]” (Liu, page 3, right column, paragraph 1); “The miniImageNet dataset is a subset of ILSVRC-12 [1], which is the most popular few-shot learning benchmark. It contains 100 classes randomly selected from ILSVRC-12 with 600 images per class” (Liu, page 3, right column, paragraph 2). Liu’s system is compatible with image classes.

acquire the second loss based on the feature amount of the query image and the average feature amount: “Specifically, we construct each episode by randomly sampling N classes from Ctrain and K labeled samples per class as the support set S … while a portion of the rest samples from the same N classes as the query set, denoted as Q” (Liu, page 2, left column, paragraph 2); “In order to find out the nearest class prototype for each query sample, Prototypical Network [6] calculates a soft assignment of embedded features in the query set to each class prototype (average feature amount) as follows: A_ij = exp(−d(f_θ(x̃_i), p_j)) / Σ_{j′} exp(−d(f_θ(x̃_i), p_{j′})), where A ∈ R^{T×N} is an assignment matrix, and d(f_θ(x̃_i), p_j) denotes the euclidean distance between the embedded feature of the query sample i (feature amount of the query image) and the prototype of the class j (average feature amount)” (Liu, page 3, left column, paragraph 1); “we design a weighted contrastive loss (second loss) to obtain a classifying-friendly (discriminative) feature embedding space. The designed loss conduces to select top k confident embedded features from the query set for each class in the next subsection. Specifically, if the query sample i belongs to the class j, we expect the value of the soft assignment Aij in the equation (2) to be close to 1; otherwise, we expect the value of the soft assignment Aij to be close to 0. Therefore, the contrastive loss function can be computed as: [equation (3) of Liu: a weighted contrastive loss over the soft assignments A_ij]” (Liu, page 3, left column, paragraph 1).

Examiner’s note: While Liu does not disclose the usage of multi-label data or a processor, these attributes are already present in the combination of Karlinsky, Hu, and Bayar (see claim 1), and would simply necessitate using multi-label query data for Liu’s loss, and running Liu’s method on a generic computer processor. Liu relates to few-shot learning for image classification and is analogous to the claimed invention.
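For illustration, a minimal PyTorch-style sketch of the prototype (“average feature amount”) computation and the soft assignment described above; the identifiers are hypothetical and are not drawn from Liu’s code:

    import torch

    def prototypes(support_embeddings: torch.Tensor) -> torch.Tensor:
        """Average feature amount per class: the mean of the K embedded
        support samples of each of the N classes; input shape (N, K, e),
        output shape (N, e)."""
        return support_embeddings.mean(dim=1)

    def soft_assignment(query_embeddings: torch.Tensor,
                        protos: torch.Tensor) -> torch.Tensor:
        """Soft assignment A (shape (T, N)) of each embedded query feature to
        each class prototype: a softmax over negative Euclidean distances, so
        A_ij is near 1 when query i is close to prototype j."""
        d = torch.cdist(query_embeddings, protos)  # (T, N) Euclidean distances
        return torch.softmax(-d, dim=1)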
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security pp. 5-10), hereafter referred to as Bayar, and Gunel (SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINED LANGUAGE MODEL FINE-TUNING, November 2020, arXiv:2011.01403v2).

Regarding claim 5, the rejection of claim 4 in view of Karlinsky, Hu, and Bayar is incorporated.

While Karlinsky and Hu fail to disclose the further limitations of the claim, Gunel, in combination with Hu, discloses a system, wherein the at least one processor is configured to calculate the total loss based on the first loss, the second loss, and a weighting coefficient specified by a creator: (Gunel) “λ is a scalar weighting hyperparameter (weighting coefficient) that we tune for each downstream task. The loss is given by the following formulas: [equations: overall loss L = (1 − λ)·L_CE + λ·L_SCL, with L_CE the cross-entropy term and L_SCL the supervised contrastive term] The overall loss is a weighted average of CE (first loss) and the SCL loss (second loss), as given in equation (1)” (Gunel, page 2, paragraph 5).

Gunel relates to few-shot learning for classification and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Karlinsky, Hu, and Bayar to use a combined loss function, as disclosed by Gunel. Adding a contrastive loss term to a cross-entropy loss can improve performance in few-shot learning models, and makes the model more robust to noise in the training data while imbuing it with more generalization ability on related tasks. See (Gunel, page 2, paragraph 2) and (Gunel, page 8, paragraph 2).
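The weighting Gunel describes reduces to a one-line combination. In the sketch below, lam plays the role of the claimed weighting coefficient; its value is a placeholder, not one taken from Gunel or from the application.

    def total_loss(first_loss, second_loss, lam=0.5):
        # Weighted average of the cross-entropy (first) and contrastive (second)
        # losses; lam would be tuned per task by the model's creator.
        return (1.0 - lam) * first_loss + lam * second_loss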
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security pp. 5-10), hereafter referred to as Bayar, and Chen (Exploring Simple Siamese Representation Learning, November 2020, arXiv:2011.10566v1).

Regarding claim 7, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated.

Karlinsky, in combination with Hu, further teaches a method, wherein:

the query image is input to a first learning model: (Karlinsky) [figure image omitted] (Karlinsky, page 5200, Figure 3). The pooled feature vector (query image) is input into the DML embedding module of the model. “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module (first learning model), which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e, where commonly embedding size e ≪ f” (Karlinsky, page 5199, right column, paragraph 4).

the support data is input to a second learning model: (Karlinsky) “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e. Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution (support data) in the embedding space, for the i-th class out of the total of N classes” (Karlinsky, page 5200, left column, paragraph 1); “In our implementation, the representatives are realized as weights of an FC layer (second learning model) of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor” (Karlinsky, page 5200, left column, paragraph 2).

the at least one processor is configured to calculate the first loss based on the parameter of the first learning model: “Having P(C = i|X) and P(B|X) (output of the first learning model) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone). The first loss is the regular cross-entropy (CE) with the ground truth labels given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3). The first loss is based on the output of the first learning model, and is necessarily based on its parameters.

acquire the feature amount of the query image calculated based on the parameter of the first learning model: “Batch-training is used, but for simplicity we will refer to the input of the subnet as a single (pooled) feature vector X ∈ R^f (query image) computed by the backbone for the given image (or ROI) ... We first employ a DML embedding module (first learning model), which consists of a few fully connected (FC) layers with batch normalization (BN) and ReLU nonlinearity (2-3 such layers in our experiments). The output of the embedding module is a vector E = E(X) ∈ R^e (feature amount), where commonly embedding size e ≪ f” (Karlinsky, page 5199, right column, paragraph 4). Outputs of a learning model are based on parameters of that model.

acquire the feature amount of the ... support data calculated based on the parameter of the second learning model: “As an additional set of trained parameters, we hold a set of ‘representatives’ R_ij ∈ R^e (feature amount). Each vector R_ij represents the center of the j-th mode of the learned discriminative mixture distribution (support data) in the embedding space, for the i-th class out of the total of N classes.
We assume a fixed number of K modes (peaks) in the distribution of each class, so 1 ≤ j ≤ K” (Karlinsky, page 5200, left column, paragraph 1); “In our implementation, the representatives are realized as weights (parameter[s]) of an FC layer (second learning model) of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2). Outputs of a learning model are based on parameters of that model.

adjust each of the parameter of the first learning model and the parameter of the second learning model: “In our implementation, the representatives are realized as weights of an FC layer (second learning model) of size N·K·e receiving a fixed scalar input 1. The output of this layer is reshaped to an N×K×e tensor. During training, this simple construction flows the gradients to the weights (parameter of the second learning model) of the FC layer and learns the representatives” (Karlinsky, page 5200, left column, paragraph 2).

While Karlinsky, Hu, and Bayar fail to disclose the further limitations of the claim, Chen teaches a method, wherein a parameter of the first learning model and a parameter of the second learning model are shared: “Siamese networks are weight-sharing neural networks applied on two or more inputs. They are natural tools for comparing (including but not limited to “contrasting”) entities” (Chen, page 1, left column, paragraph 1).

Chen relates to n-shot learning for image classification and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Karlinsky, Hu, and Bayar to share parameters between the two embedding modules, as disclosed by Chen. Siamese networks such as these can learn meaningful representations with minimal data, not requiring negative sample pairs, large batches, or momentum encoders. They can also model invariance with respect to complicated transformations and augmentations. See (Chen, page 1, Abstract) and (Chen, page 2, left column, paragraph 1).
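A minimal sketch (assumed architecture, not Chen's code) of the weight-sharing point made above: a single encoder embeds both inputs, so the “first” and “second” learning models share every parameter.

    import torch.nn as nn

    class SiameseEmbedder(nn.Module):
        def __init__(self, dim_in=512, dim_emb=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_emb))

        def forward(self, query, support):
            # The same module (shared weights) processes both inputs.
            return self.encoder(query), self.encoder(support)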
A "machine-readable medium", as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, any device with one or more processors, etc.)” (Hu, [0058]). Hu relates to few-shot learning for image processing systems and is analogous to the claimed invention. The existing combination teaches a method of training an image classifier with few-shot learning. The claimed invention improves upon this method by executing it with computer processors. Hu teaches a method of training an image classifier with few-shot learning which can be executed on computer processors, applicable to The existing combination. A person of ordinary skill in the art would have recognized that running The existing combination’s method on Hu’s processor hardware would lead to the predictable result of the method being performed by a computer as-described, and would improve the known device by allowing it to be used to process real data on a computer (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results). While Karlinsky, Hu, and Bayar fail to disclose the further limitations of the claim, Rios teaches a system, able to acquire the query image and the support data from a data group having a long-tail distribution for multi-labels: “Rubin et al. (2012) refer to datasets that have long-tail frequency distributions as “power-law datasets”. Methods that predict in- frequent labels fall under the paradigm of few-shot classification which refers to supervised methods in which only a few examples, typically between 1 and 5, are available in the training dataset for each label ... time. In this paper, we explore both of these issues, long documents and power-law datasets, with an emphasis on analyzing the few- and zero-shot aspects of large-scale multi-label problems.” (Rios, page 1, right column, paragraph 1). Rios relates to few-shot learning for classification and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Karlinsky, Hu, and Bayar to acquire data from long-tail distributions, as disclosed by Rios. Doing so would build a model capable of resisting data sparsity, particularly for datasets with a large number of labels. See (Rios, page 1, right column, paragraph 1). Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security pp. 5-10), hereafter referred to as Bayar, and Szegedy (Going deeper with convolutions, 2014, arXiv:1409.4842v1). Regarding claim 10, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated. 
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security pp. 5-10), hereafter referred to as Bayar, and Szegedy (Going deeper with convolutions, 2014, arXiv:1409.4842v1).

Regarding claim 10, the rejection of claim 1 in view of Karlinsky, Hu, and Bayar is incorporated.

Karlinsky, in combination with Hu, further teaches a method, wherein the learning model is configured such that a last layer of a model which has learned another label other than a plurality of labels to be recognized is replaced with a layer corresponding to the plurality of labels: (Karlinsky) “We propose a subnet architecture and corresponding losses that allow us to train a DML embedding jointly with the multi-modal mixture distribution used for computing the class posterior in the resulting embedding space. This subnet then becomes a DML-based classifier head, which can be attached on top of a classification or a detection backbone. It is important to note that our DML-subnet is trained jointly with the feature producing backbone.” (Karlinsky, page 5199, right column, paragraph 3). The few-shot classifier taught by Karlinsky is attached to the head of an image classifier backbone for initial feature processing. [figure image omitted] (Karlinsky, page 5201, Figure 4). InceptionV3 can perform initial feature extraction of images, acting as a backbone of Karlinsky's classifier. “For the DML-based classification experiments, we used the InceptionV3 backbone (model), attaching the proposed DML subnet to the layer before its last FC layer (last layer)” (Karlinsky, page 5201, left column, paragraph 1). In Karlinsky's experiments, the backbone is an Inception network. “We tested our approach on a set of fine-grained classification datasets, widely adopted in the state-of-the-art DML classification works: Stanford Dogs, Oxford-IIIT Pet, Oxford 102 Flowers, and ImageNet Attributes” (Karlinsky, page 5201, right column, paragraph 3). Karlinsky trained and tested their classifier using these four image datasets.

the at least one processor is configured to calculate the first loss based on the output of the learning model in which the last layer is replaced with the layer corresponding to the plurality of labels, and based on the target output: (Karlinsky) “Having P(C = i|X) and P(B|X) (output of the learning model) computed in the network, we use a sum of two losses to train our model (DML subnet + backbone). The first loss is the regular cross-entropy (CE) with the ground truth labels given for the image (or ROI) corresponding to X” (Karlinsky, page 5200, right column, paragraph 3).

While Karlinsky, Hu, and Bayar fail to disclose the further limitations of the claim, Szegedy teaches a method of constructing a model which has learned another label other than a plurality of labels to be recognized: “We propose a deep convolutional neural network architecture codenamed Inception (model) ... One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection” (Szegedy, page 1, Abstract). “The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet (another label [set]) hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions” (Szegedy, page 8, paragraph 4).
The Examiner notes that Karlinsky discloses that their model is an Inception model (Karlinsky, page 5201, left column, paragraph 1); Szegedy discloses training an Inception model using the ImageNet image set (Szegedy, page 8, paragraph 4); and Karlinsky further discloses training the Inception model using the Stanford Dogs, Oxford-IIIT Pet, and Oxford Flowers image sets (Karlinsky, page 5201, right column, paragraph 3). Accordingly, the combination of Karlinsky and Szegedy discloses a model that has learned another label other than the initial labels. “We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them” (Szegedy, page 8, paragraph 5).

Szegedy relates to image classification with neural networks and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Karlinsky, Hu, and Bayar to use Inception as the backbone network, as disclosed by Szegedy. The Inception architecture is particularly useful as the base model of an object detection network, and is easily adapted or fine-tuned to different label sets. See (Szegedy, page 4, paragraph 2) and (Szegedy, page 6, paragraph 3).
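The last-layer replacement at issue is a routine operation. The sketch below (using torchvision for illustration; the weights argument varies by library version, and neither paper's training code is reproduced here) loads a backbone pretrained on one label set and swaps its final fully connected layer for a head sized to the new labels:

    import torch.nn as nn
    import torchvision

    num_new_labels = 120  # e.g., a fine-grained label set such as Stanford Dogs
    model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, num_new_labels)  # replaced last layer
    # Note: InceptionV3 also has an auxiliary head (model.AuxLogits.fc) that would
    # need the same replacement if auxiliary logits are used during training.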
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Karlinsky (RepMet: Representative-based metric learning for classification and few-shot object detection, 2019, CVPR 2019 pp. 5192-5201) in view of Hu (US 2021/0142054 A1), and further in view of Bayar et al. (A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer, published 2016, IH&MMSec '16: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security pp. 5-10), hereafter referred to as Bayar, and Huang et al. (SYSTEMS AND METHODS FOR AUTOMATED VIDEO CLASSIFICATION, filed 12/27/2018, US 10,922,548 B1), hereafter referred to as Huang.

Regarding claim 20, the rejection of claim 19 in view of Karlinsky, Hu, and Bayar is incorporated.

While the currently cited prior art fails to disclose the further limitations of the claim, Huang discloses a system, wherein in determining how the query image was edited, the learning system determines if digital text was added to the query image: “In general, a meme video may include text overlaid on image or video content. The text may be added after the image and/or video content has been captured, such that the text does not belong to the originally captured image or video content, but is added after-the-fact. In various embodiments, the meme video detector module 602 can identify meme videos based on identification of such synthetically added text. As shown in the example of FIG. 6A, the meme video detector module 602 can include a dynamic region detection module 604 and a synthetic text detection module 606.” (Huang, column 22, paragraph 2). [figure image omitted] (Huang, Figure 6B). “FIGS. 6B and 6C illustrate example scenarios 620, 640 that illustrate functionality of the meme video detector module 602, according to an embodiment of the present technology. In the example scenario 620 depicted in FIG. 6B, a video frame 622 from a meme video is depicted. The meme video includes a video of a couple dancing (arrow 626) overlaid with static, synthetic text 624 which reads "When you thought it was Thursday and realize it's actually Friday."” (Huang, column 23, paragraph 4).

Huang relates to machine learning for image analysis and is analogous to the claimed invention.

The combination of Karlinsky, Hu, and Bayar teaches a system that determines image manipulations in images. Huang teaches a system to detect the presence of synthetic text in images. It would have been obvious to one of ordinary skill in the art to combine the existing combination with Huang by detecting synthetic text in the query images with Huang's system. This would achieve the predictable result of identifying whether a query image has had text added after being taken, with the system of the existing combination and the system of Huang performing the same functions together as they did separately. (MPEP 2143 I. (A) Combining prior art elements according to known methods to yield predictable results).

Response to Arguments

The following responses address arguments and remarks made in the instant remarks dated 12/02/2025.

Claim Interpretation

On page 10 of the instant remarks, the Applicant argues that amended claim 12 has patentable weight for all its features: “Claim Interpretation: The examiner indicates that a feature of claim 12 does not have patentable weight. Applicant amended claim 12 and submits that its features have patentable weight.”

Regarding the Applicant's argument above, the Examiner notes that the previous Office action did not state that claim 12 lacks patentable weight under claim interpretation. Rather, the Examiner has noted that the fourth limitation of claim 12 (“calculate, when the query image is input to a learning model, a first loss based on an output of the learning model and a target output”) is interpreted as contingent (i.e., the calculation only occurs if the query image is input to a learning model, which is not guaranteed, and thus the limitation is not a required step of the method under broadest reasonable interpretation).

Objections

Previous objections to the claims have been withdrawn in light of the instant amendments. However, new objections to the claims have been made in light of the instant amendments.

112 Rejections

The Examiner notes that new rejections have been made for the claims under 35 U.S.C. 112(b).

101 Rejections

On pages 10-12 of the instant remarks, the Applicant argues that the claimed invention improves on existing technology: “Claim Rejections under 35 U.S.C. § 101: Claims 1-17 are rejected under 35 U.S.C. 101 because the claimed inventions are allegedly directed to non-statutory subject matter without significantly more. The examiner considered applicant's previous arguments that the claimed invention resolved a technical problem and improved an existing technological process. The examiner contends that the cited improvements are not sufficiently detailed and that calculations of a loss can be performed as a mental process. The Director of the USPTO authored a recent (but issued after the rejection) § 101 decision providing guidance to examiners when analyzing for § 101 issues. In the decision, a patent applicant at the PTAB argued under the same basis as applicant: that the claims recited an "improvement in the functioning of a computer or improvement to other technology or technical field". In the holding, the USPTO director noted: ... [C]laims directed to an improvement in the functioning of a computer, or an improvement to other technology or technical field are patent eligible. (Emphasis added.)
The USPTO Director went on to state: Categorically excluding AI innovations from patent protection in the United States jeopardizes America's leadership in this critical emerging technology. Yet, under the panel's reasoning, many AI innovations are potentially unpatentable - even if they are adequately described and nonobvious - because the panel essentially equated any machine learning with an unpatentable "algorithm" and the remaining additional elements as "generic computer components," without adequate explanation. Dec. 24. Examiners and panels should not evaluate claims at such a high level of generality. (Emphasis added.)

In light of the USPTO Director's guidance to patent examiners, applicant respectfully suggests that the claims recite an "improvement in the functioning of a computer or improvement to other technology or technical field". Applicant directs the examiner's attention to paragraphs [0041], [0063] and [0092] of the published application, which provide a detailed discussion of how the technical problem is overcome. The claims as a whole achieve this result.

Specifically, paragraph [0041] of the published application recites: In view of the above, the learning system S of this embodiment creates a learning model capable of handling multi-labels by applying few-shot learning which is based on a contrastive learning approach. As a result, even in cases in which the images have a long-tail distribution and features that are not noticeable like fine grains (even when the above-mentioned first and second reasons exist), the accuracy of the learning model is increased by using less training data. The details of the learning system S are now described.

Paragraph [0063] recites: The learning model M in this embodiment is a model which recognizes objects included in an image, and therefore a multi-label query image xQ is described as an example of query data. Further, an example of support data is the support image xS corresponding to the query image xQ. The query image xQ and the support image xS are each images used in few-shot learning.

Paragraph [0092] recites: The second loss LCL shows an error (difference) between the embedded vector of the query image xQ and the embedded vector of each support image xS. The second loss LCL is an index which can be used to measure the accuracy of the learning models M1 and M2. A high second loss LCL means a large error and a low accuracy. A low second loss LCL means a small error and a high accuracy. In this embodiment, there is described a case in which the second loss LCL is a contrastive loss, but the second loss LCL can be calculated by using any method. It is sufficient that the second loss LCL can be calculated based on a predetermined loss function.

In light of the technical solutions to the technical problems, applicant submits that the claims are eligible under § 101. Additionally, as discussed below, applicant further defined the claims, which furthers the above arguments.”

The Examiner acknowledges the recent guidance from the Director of the USPTO regarding Ex parte Desjardins. However, the Examiner contends that while the improvement to technology was clearly noted in the specification and represented by the claims in Desjardins, such an improvement is less apparent in the claims of the instant application.
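For reference, a conventional pairwise contrastive loss of the kind paragraph [0092] names for the second loss is sketched below; the specification states that any loss function may be used, so this is one common choice rather than necessarily the applicant's.

    import torch.nn.functional as F

    def pairwise_contrastive(query_emb, support_emb, same_label, margin=1.0):
        # query_emb, support_emb: (B, e); same_label: (B,) of 0/1 floats
        d = F.pairwise_distance(query_emb, support_emb)
        pos = same_label * d.pow(2)                                # pull matching pairs together
        neg = (1 - same_label) * (margin - d).clamp(min=0).pow(2)  # push others apart
        return (pos + neg).mean()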
The improvement of a claimed invention must be sufficiently detailed, as noted in MPEP 2106.05(a): “If it is asserted that the invention improves upon conventional functioning of a computer, or upon conventional technology or technological processes, a technical explanation as to how to implement the invention should be present in the specification. That is, the disclosure must provide sufficient details such that one of ordinary skill in the art would recognize the claimed invention as providing an improvement. The specification need not explicitly set forth the improvement, but it must describe the invention such that the improvement would be apparent to one of ordinary skill in the art … After the examiner has consulted the specification and determined that the disclosed invention improves technology, the claim must be evaluated to ensure the claim itself reflects the disclosed improvement in technology. Intellectual Ventures I LLC v. Symantec Corp., 838 F.3d 1307, 1316, 120 USPQ2d 1353, 1359 (Fed. Cir. 2016) (patent owner argued that the claimed email filtering system improved technology by shrinking the protection gap and mooting the volume problem, but the court disagreed because the claims themselves did not have any limitations that addressed these issues). That is, the claim must include the components or steps of the invention that provide the improvement described in the specification.”

The alleged improvement argued by the applicant is the application of few-shot learning to multi-label classification (Published application, [0041]). As one of ordinary skill in the art would recognize, this few-shot learning depends on comparing multi-label query data to support data (Id. at [0063]), a first loss measuring error between the target output and actual output of the learning model (Id. at [0077]), and a second loss measuring error between embeddings of the query and support data (Id. at [0092]).

The Examiner disagrees that the first loss described above is represented within the claimed invention. At no point do the claims require the claimed first loss to measure an error (difference) between the target output and the actual output of the learning model; it is merely stated as being “based on an output of the learning model and a target output” in claim 1, “based on the query image corresponding to the combination” in claim 6, “based on the parameter of the first learning model” in claim 7, and “based on the output of the learning model” in claim 10. Even considering all of these limitations together, the claimed first loss is much broader than what is being argued. Thus, no rejections are withdrawn on these grounds.

103 Rejections

On pages 13-16 of the instant remarks, the Applicant argues that Karlinsky and Hu fail to disclose amended claim 1: “Applicant argued that the primary reference of Karlinsky did not disclose the feature of calculating a second loss. As mentioned in applicant's previous response, this feature is discussed in paragraphs [0091]-[0098] of the published application. The second loss is calculated based on the feature vector of the query image and the feature vector of each of the support data images. (Paragraph [0095] clarifies that the feature vector of the support images is averaged.) … The examiner contends that the support data of claim 1 is disclosed by the "class representatives" of Karlinsky. A class representative is a vector that represents a center of a cluster which captures a key characteristic of the class's appearance or features.
Examples of classes are dogs, bikes and trucks, as shown in FIG. 2 of Karlinsky. Thus, while both Karlinsky and the claimed invention have additional images that support the query image, Karlinsky does not disclose the same purpose of the supporting images as the present invention. That is, the support images chosen in claim 1 are selected based on the query image. The support images therefore all have commonalities with the query image. (In FIG. 8, shown above, the commonality is label 1 and label 2.) In contrast, Karlinsky is attempting to classify the test image by determining which class representative the test image is closest to (e.g. dog, bike, truck, etc.). Thus, the "class representatives" of Karlinsky do not disclose the claimed "support data". In order to expedite prosecution, applicant has further defined the feature of "support data". As discussed above, in the claimed invention, the support images are selected to have a commonality with the query image (e.g. having multiple labels in common with the query image). In contrast, Karlinsky is directed toward a single image classification (e.g. dog, cat, bike, etc.). Karlinsky does not know the classification of the image before the alleged support images are selected. Thus, the examiner relies on Hu in attempting to disclose the multiple label feature. Each of the images in the support set of Hu, however, does not have a common label with a query image. As discussed in paragraph [0053] of Hu, the support set includes every possible "label" (i.e. every checkbox filled out), regardless of a query image. As such, claim 1 is not rendered obvious over Hu and Karlinsky, as claim 1 clarifies that each support image has at least one common label with the query image: acquire a plurality of support images where each of the support images have at least the first label or the second label in common with the query image.

Another difference between Karlinsky and the claimed invention is that the claimed invention is directed toward classifying edited content of the query image, not classifying the image itself. Accordingly, claim 1 has been amended to recite: wherein each of the first label and second label of the query image are related to an edited feature of the query image. For example, edited content would be any of the labels 1-6, which are completely unrelated to the primary image itself. In contrast, Karlinsky is directed to object classification and does not disclose or suggest the above feature of independent claim 1.

Independent Claims 12 and 13: Independent claims 12 and 13 were similarly amended as independent claim 1. Please see applicant's discussion of independent claim 1 above.

Dependent Claims 2-11 and 14-17: As each of claims 2-11 and 14-17 ultimately depends from independent claim 1, please see applicant's discussion of claim 1 above.”

In response to applicant's arguments above, it is noted that the specific second loss calculation upon which the Applicant relies (as disclosed by published application paragraphs [0091]-[0098]) is not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).

Regarding the argument that the representatives of Karlinsky differ substantially from the support images of the claimed invention, the Examiner respectfully disagrees.
The relationship between the support images and the query images is defined very loosely in the claimed invention. Amended claim 1 states that each support image has at least one label in common with the query image, and that the support data corresponds to the query image data. The representatives of Karlinsky, analogous to the claimed support images, similarly each share at least one label in common with the query image (Karlinsky, page 5199, left column, paragraphs 1-3), commensurate in scope with support images that correspond to the query image data and share common labels as claimed.

Regarding the argument that the classification method of Karlinsky differs significantly from that of the claimed invention, the Examiner respectfully disagrees. The claimed invention does not recite selecting support images that have multiple labels in common with the query image; it merely states “the support images have at least the first label or the second label in common with the query”, i.e., at least one label in common. Karlinsky discloses a method of identifying multiple labels in a query image by comparing query images with representative images (Karlinsky, page 5198, Figure 2), commensurate in scope with the claimed invention's limitations.

The argument that Hu fails to remedy the deficiencies of Karlinsky is rendered moot in light of new rejections following further search and consideration, as Karlinsky discloses using multi-label query images as input (Karlinsky, page 5198, left column, paragraph 1; page 5199, right column, paragraph 4; page 5198, Figure 2).

Regarding the argument that Karlinsky fails to disclose determining labels in relation to edited features of the query image, the Examiner agrees. However, upon further search and consideration, this limitation is found to be obvious over Bayar, which discloses detecting multiple different types of image manipulations on an input dataset (Bayar, page 7, left column, paragraph 4). Thus, no rejections are withdrawn on these grounds. See the 103 rejections section for more detail.

On page 16 of the instant remarks, the Applicant argues that newly added claims 18-20 are not disclosed or rendered obvious by the cited references: “New Claims 18-20: Applicant added new claims 18-20 and respectfully submits that the features therein are not disclosed or rendered obvious by the cited references.”

Upon further search and consideration, claims 18-20 are found to be obvious over Karlinsky, Hu, Bayar, and Huang. See the 103 rejections section for more detail.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Fu (Transductive Multi-label Zero-shot Learning, 2015, arXiv:1503.07790v1) teaches a method of synthesizing a multi-label dataset containing all possible combinations of labels, applicable to zero-shot learning.

Saliou (US 2019/0073565 A1) teaches a method of using CNNs with shared parameters to map multiple inputs to the same embedding space.

Alfassy (LaSO: Label-Set Operations networks for multi-label few-shot learning, 2019, arXiv:1902.09811v1).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron P Gormley, whose telephone number is (571) 272-1372. The examiner can normally be reached Monday - Friday, 12:00 PM - 8:00 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Michelle T Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AG/ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148

Prosecution Timeline

Dec 06, 2021
Application Filed
May 07, 2025
Non-Final Rejection — §101, §103, §112
Jul 18, 2025
Interview Requested
Jul 30, 2025
Applicant Interview (Telephonic)
Jul 30, 2025
Examiner Interview Summary
Aug 14, 2025
Response Filed
Sep 02, 2025
Final Rejection — §101, §103, §112
Nov 05, 2025
Interview Requested
Dec 02, 2025
Request for Continued Examination
Dec 08, 2025
Response after Non-Final Action
Jan 30, 2026
Non-Final Rejection — §101, §103, §112
Apr 08, 2026
Interview Requested

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585955
Minimal Trust Data Sharing
2y 5m to grant Granted Mar 24, 2026
Patent 12579440
Training Artificial Neural Networks Using Context-Dependent Gating with Weight Stabilization
2y 5m to grant Granted Mar 17, 2026
Based on the 2 most recent grants by this examiner.

Prosecution Projections

3-4
Expected OA Rounds
60%
Grant Probability
0%
With Interview (-60.0%)
4y 4m
Median Time to Grant
High
PTA Risk
Based on 5 resolved cases by this examiner. Grant probability derived from career allow rate.
