DETAILED ACTION
This Office Action is sent in response to the Applicant’s Communication received on 09/16/2025 for application number 17/867,311. The Office hereby acknowledges receipt of the following items, which have been placed of record in the file: Specification, Drawings, Abstract, Oath/Declaration, IDS, and Claims.
Claims 1-19 and 21 are pending.
Claim 20 is canceled.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 17 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
The term “substantially” in claim 17 is a relative term which renders the claim indefinite. The term “substantially” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-9, 11-12, 19, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (Explaining Neural Networks Using Attentive Knowledge Distillation, published 11 February 2021), hereinafter Lee, in view of Kingetsu (US 20220215294 A1), hereinafter Kingetsu, and Kuen et al. (US 20220108131 A1), hereinafter Kuen.
Regarding claim 1, Lee teaches,
A method, comprising: configuring a surrogate hierarchical multi-task machine learning model [Abstract, we proposed a novel approach to explain the model prediction by developing an attentive surrogate network using the knowledge distillation] to perform both (i) a knowledge distillation task associated with a pre-trained machine learning model classifier [Sect 3.1, para 3, knowledge distillation aims to transfer the prediction capability of the large target network, called the teacher network; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network; Sect 4.1, para 3, We used ResNet-50 that was pretrained on ImageNet as the target network T] and (ii) an explanation task to predict a plurality of semantic concepts for explainability associated with the knowledge distillation task [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input], wherein the surrogate hierarchical multi-task machine learning model includes: a concept layer to perform the explanation task (Sect 1, para 5, explanation network); a decision layer to perform the knowledge distillation task (Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation), wherein the output of the concept layer is utilized as an input to the decision layer [See also Fig. 
2 below, Explanation Network [Wingdings font/0xE0] Saliency Map [Wingdings font/0xE0] Explanation [Wingdings font/0xE0] Teacher Network; Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network];
[media_image1.png: greyscale reproduction of Fig. 2 of Lee]
Lee does not teach receiving training data, wherein the training data includes input records and corresponding concept labels; using one or more computer processors to train the surrogate hierarchical multi-task machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of the pre-trained machine learning model classifier.
Kingetsu teaches,
receiving training data, wherein the training data includes input records and corresponding concept labels [Para 0091, The validation data 141b is data for validate the machine training model trained by the training data set 141a. A correct answer label is given to the validation data 141b. For example, when the validation data 141b is input to the machine training model, if an output result that is output from the machine training model matches the correct answer label that is given to the validation data 141b, this state indicates that the machine training model is appropriately trained by the training data set 141a];
Kingetsu is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kingetsu and provide training data including input records and concept labels in order to improve training by providing a validation dataset for comparison with output results.
Lee-Kingetsu teach the above limitations of claim 1 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract) and the pre-trained machine learning model classifier (Lee, Sect 3.1, para 3).
Lee-Kingetsu do not teach using one or more computer processors to train the machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of the classifier.
Kuen teaches,
using one or more computer processors [Para 0110, the components of the knowledge distillation system 102 can include one or more instructions stored on a computer-readable storage medium and executable by processors] to train the machine learning model (Para 0019, Over multiple iterations, or epochs, the knowledge distillation system repeats the process… to modify neural network parameters to improve the accuracy of the source neural network) including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task (Para 0024, knowledge distillation loss function) and a loss function associated with the explanation task (Para 0024, a classification loss function) [Para 0019, the knowledge distillation system further back propagates to modify parameters of the source neural network to reduce a measure of error or loss associated with the classification loss function. Over multiple iterations, or epochs, the knowledge distillation system repeats the process of generating classifications for lightly augmented digital images, comparing the classifications with ground truth labels, and back propagating to modify neural network parameters to improve the accuracy of the source neural network; Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function… the knowledge distillation system back propagates to modify parameters of the distilled neural network to improve classification accuracy by reducing a measure of loss determined via the knowledge distillation loss function (and/or a classification loss function); Para 0066, In particular, the knowledge distillation system 102 determines one or more measures of loss associated with the various comparisons described above and further modifies the parameters of the distilled neural network 118 to reduce the measure(s) of loss. Indeed, over multiple training iterations of classifying different digital images, applying loss functions to compare classifications, and modifying parameters to reduce loss(es) associated with the loss functions, the knowledge distillation system 102 learns parameters for the distilled neural network 118 that result in the distilled neural network 118 mimicking predictions of the source neural network 116],
wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer (Para 0024, distilled neural network) and an output of the classifier (Para 0024, classifications of the source neural network) [Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function].
Kuen is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kuen and minimize a joint loss function in order to improve prediction accuracy (Kuen, para 0024).
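For illustration only (a hypothetical sketch, not drawn from Lee, Kingetsu, or Kuen; all names and the weighting scheme are assumptions), the joint objective discussed above, combining a distillation term on the decision-layer output with an explanation term on the concept-layer output, could be computed as follows:

```python
# Hypothetical sketch of a joint training loss combining a knowledge
# distillation term and an explanation term. Loss choices (MSE) and the
# weighting factor alpha are illustrative assumptions only.

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(decision_out, teacher_out, concept_out, concept_labels, alpha=0.5):
    """Weighted sum of a distillation loss (decision layer vs. the
    pre-trained classifier's output) and an explanation loss (concept
    layer vs. concept labels)."""
    kd_loss = mse(decision_out, teacher_out)      # knowledge distillation task
    expl_loss = mse(concept_out, concept_labels)  # explanation task
    return alpha * kd_loss + (1.0 - alpha) * expl_loss

# Example: the surrogate matches the teacher exactly but misses one concept,
# so only the explanation term contributes to the joint loss.
loss = joint_loss([0.9, 0.1], [0.9, 0.1], [1.0, 0.0], [1.0, 1.0])
```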
Regarding claim 2, Lee-Kingetsu-Kuen teach the limitations of claim 1.
Lee further teaches,
wherein the output of the pre-trained machine learning model classifier (Sect 3.1, para 3, the target model) is utilized as an input (Sect 3.1, para 3, implant the knowledge of the target model to the surrogate networks; Sect 3.1, para 4, E generates a saliency map Hx0 by exploiting the attentive features transferred from S) to at least one layer of the concept layer (Sect 3.1, para 4, branch E) [Sect 3.1, para 3, we use knowledge distillation [18] to implant the knowledge of the target model to the surrogate networks that effectively reveal meaningful information of the target model; Sect 3.1, para 4, S encodes the knowledge of T using attention to better learn the features of T. We train S using the knowledge distillation. Another branch E generates a saliency map Hx0 by exploiting the attentive features transferred from S].
Regarding claim 3, Lee-Kingetsu-Kuen teach the limitations of claim 1.
Lee further teaches,
pre-training the concept layer [Sect 4.1 (Experimental Setups), para 3, The explanation network E was trained as in the case of S but with the different intervals to adjust the learning rate, which were 10 epochs for the initial training and 2 epochs of the duration to reduce the learning rate, respectively].
Regarding claim 4, Lee-Kingetsu-Kuen teach the limitations of claim 1.
Lee further teaches,
wherein the surrogate hierarchical multi-task machine learning model includes an attention layer [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention] and an input to the attention layer includes at least one of:
the input records;
the output of the concept layer; or
the output of the pre-trained machine learning model classifier [Sect 3.1, para 3, we use knowledge distillation [18] to implant the knowledge of the target model to the surrogate networks that effectively reveal meaningful information of the target model; Sect 3.1, para 4, S encodes the knowledge of T using attention to better learn the features of T].
Regarding claim 5, Lee-Kingetsu-Kuen teach the limitations of claim 1 including input records (Kingetsu, para 0091).
Lee further teaches,
wherein: the concept layer includes a common layer (Sect 3.4, para 1, three layers at different scales, as shown in Figure 4) to receive input [Sect 3.4, para 1, As explained earlier, the student network provides the explanation network with information on the attentive features that are taken from the three layers at different scales, as shown in Figure 4]; and the common layer is coupled to at least one of:
the decision layer or
another layer of the concept layer [Sect 3.4, para 1, The explanation network has three main blocks, called upsample, each of which consists of convolutional and interpolation layers, as shown on the right of Figure 4. In this way, the dimensions of the features in the explanation network grow toward those of the input].
[media_image2.png: greyscale reproduction of Fig. 4 of Lee]
Regarding claim 6, Lee-Kingetsu-Kuen teach the limitations of claim 5.
Lee further teaches,
wherein the output of the pre-trained machine learning model classifier is utilized as an input to the common layer [Sect 3.1, para 3, we use knowledge distillation [18] to implant the knowledge of the target model to the surrogate networks that effectively reveal meaningful information of the target model; Sect 3.1, para 4, S encodes the knowledge of T using attention to better learn the features of T. We train S using the knowledge distillation. Another branch E generates a saliency map Hx0 by exploiting the attentive features transferred from S].
Regarding claim 7, Lee-Kingetsu-Kuen teach the limitations of claim 5.
Lee further teaches,
pre-training the common layer [Sect 4.1 (Experimental Setups), para 3, The explanation network E was trained as in the case of S but with the different intervals to adjust the learning rate, which were 10 epochs for the initial training and 2 epochs of the duration to reduce the learning rate, respectively].
Regarding claim 8, Lee-Kingetsu-Kuen teach the limitations of claim 5.
Lee further teaches,
wherein the surrogate hierarchical multi-task machine learning model includes an attention layer [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention] and an input to the attention layer includes at least one of:
the input records;
the output of the common layer;
the output of at least one layer of the concept layer; or
the output of the pre-trained machine learning model classifier (Sect 3.1, para 3, the target model) [Sect 3.1, para 3, we use knowledge distillation [18] to implant the knowledge of the target model to the surrogate networks that effectively reveal meaningful information of the target model; Sect 3.1, para 4, S encodes the knowledge of T using attention to better learn the features of T].
Regarding claim 9, Lee-Kingetsu-Kuen teach the limitations of claim 5.
Lee further teaches,
wherein the surrogate hierarchical multi-task machine learning model includes an attention layer [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention] and an input to the attention layer includes input [See Fig. 4, Input and ResBlock w/ attention module].
Lee does not teach that the input includes the input records.
Kingetsu teaches,
the input records [Para 0091, The validation data 141b is data for validate the machine training model trained by the training data set 141a];
Kingetsu is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kingetsu and provide training data including input records in order to improve training by providing a validation dataset for comparison with output results.
[media_image3.png: greyscale reproduction of Fig. 4 of Lee]
Regarding claim 11, Lee-Kingetsu-Kuen teach the limitations of claim 5.
Lee further teaches,
wherein the surrogate hierarchical multi-task machine learning model includes an attention layer [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention] and an input to the attention layer includes the output of the pre-trained machine learning model classifier [Sect 3.1, para 3, we use knowledge distillation [18] to implant the knowledge of the target model to the surrogate networks that effectively reveal meaningful information of the target model; Sect 3.1, para 4, S encodes the knowledge of T using attention to better learn the features of T].
Regarding claim 12, Lee-Kingetsu-Kuen teach the limitations of claim 1 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract).
Kuen further teaches,
wherein using the one or more computer processors to train the machine learning model includes backpropagating a calculated gradient of the joint loss function to update weights of the machine learning model [Para 0024, the knowledge distillation system back propagates to modify parameters of the distilled neural network to improve classification accuracy by reducing a measure of loss determined via the knowledge distillation loss function (and/or a classification loss function). By thus utilizing the knowledge distillation loss function and modifying parameters over multiple training iterations, the knowledge distillation system improves the prediction accuracy of the distilled neural network to more closely mimic predictions of the source neural network (e.g., by modifying the parameters of the distilled neural network to more closely resemble those of the source neural network)].
Kuen is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kuen and provide backpropagation in order to improve prediction accuracy (Kuen, para 0024).
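For illustration only (a hypothetical numeric sketch, not drawn from any cited reference; the toy loss, learning rate, and targets are assumptions), gradient-based weight updates that reduce a joint loss over repeated iterations, as recited in claim 12, could be sketched as:

```python
# Hypothetical sketch of iteratively updating a weight by descending the
# gradient of a joint loss. A central-difference estimate stands in for
# automatic differentiation; the loss terms are illustrative only.

def loss_fn(w):
    # Toy joint loss: a "distillation" term pulling w toward 2.0 plus an
    # "explanation" term pulling w toward 4.0, equally weighted.
    return 0.5 * (w - 2.0) ** 2 + 0.5 * (w - 4.0) ** 2

def grad(f, w, eps=1e-6):
    """Central-difference gradient estimate of f at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def train(w=0.0, lr=0.1, steps=200):
    """Repeatedly step the weight against the joint-loss gradient."""
    for _ in range(steps):
        w -= lr * grad(loss_fn, w)
    return w

w_final = train()  # converges toward 3.0, the minimizer of the joint loss
```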
Regarding claim 19, Lee teaches,
A system, comprising: a processor [associated system and processor implementing process of Fig. 2] adapted to: configure a surrogate hierarchical multi-task machine learning model [Abstract, we proposed a novel approach to explain the model prediction by developing an attentive surrogate network using the knowledge distillation] to perform both (i) a knowledge distillation task associated with a pre-trained machine learning model classifier [Sect 3.1, para 3, knowledge distillation aims to transfer the prediction capability of the large target network, called the teacher network; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network; Sect 4.1, para 3, We used ResNet-50 that was pretrained on ImageNet as the target network T] and (ii) an explanation task to predict a plurality of semantic concepts for explainability associated with the knowledge distillation task [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input], wherein the surrogate hierarchical multi-task machine learning model includes: a concept layer to perform the explanation task (Sect 1, para 5, explanation network); a decision layer to perform the knowledge distillation task (Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation), wherein the output of the concept layer is utilized as an input to the decision layer [See also Fig. 2 below, Explanation Network → Saliency Map → Explanation → Teacher Network; Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network];
[media_image1.png: greyscale reproduction of Fig. 2 of Lee]
and a memory coupled to the processor and configured to provide the processor with instructions [associated memory, processor, and instructions implementing process of Fig. 2].
Lee does not explicitly teach receive training data, wherein the training data includes input records and corresponding concept labels; and use one or more computer processors to train the surrogate hierarchical multi-task machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of the pre-trained machine learning model classifier. Furthermore, Lee implicitly teaches, but does not explicitly teach, A system, comprising: a processor and a memory coupled to the processor and configured to provide the processor with instructions.
Kingetsu teaches,
receive training data, wherein the training data includes input records and corresponding concept labels [Para 0091, The validation data 141b is data for validate the machine training model trained by the training data set 141a. A correct answer label is given to the validation data 141b. For example, when the validation data 141b is input to the machine training model, if an output result that is output from the machine training model matches the correct answer label that is given to the validation data 141b, this state indicates that the machine training model is appropriately trained by the training data set 141a];
Kingetsu is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kingetsu and provide training data including input records and concept labels in order to improve training by providing a validation dataset for comparison with output results.
Lee-Kingetsu teach the above limitations of claim 19 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract) and the pre-trained machine learning model classifier (Lee, Sect 3.1, para 3).
Lee-Kingetsu do not teach A system, comprising: a processor; use one or more computer processors to train the machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of the classifier; and a memory coupled to the processor and configured to provide the processor with instructions.
Kuen teaches,
A system [Abstract, The present disclosure relates to systems], comprising: a processor; use one or more computer processors [Para 0110, the components of the knowledge distillation system 102 can include one or more instructions stored on a computer-readable storage medium and executable by processors] to train the machine learning model (Para 0019, Over multiple iterations, or epochs, the knowledge distillation system repeats the process… to modify neural network parameters to improve the accuracy of the source neural network) including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task (Para 0024, knowledge distillation loss function) and a loss function associated with the explanation task (Para 0024, a classification loss function) [Para 0019, the knowledge distillation system further back propagates to modify parameters of the source neural network to reduce a measure of error or loss associated with the classification loss function. Over multiple iterations, or epochs, the knowledge distillation system repeats the process of generating classifications for lightly augmented digital images, comparing the classifications with ground truth labels, and back propagating to modify neural network parameters to improve the accuracy of the source neural network; Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function… the knowledge distillation system back propagates to modify parameters of the distilled neural network to improve classification accuracy by reducing a measure of loss determined via the knowledge distillation loss function (and/or a classification loss function); Para 0066, In particular, the knowledge distillation system 102 determines one or more measures of loss associated with the various comparisons described above and further modifies the parameters of the distilled neural network 118 to reduce the measure(s) of loss. Indeed, over multiple training iterations of classifying different digital images, applying loss functions to compare classifications, and modifying parameters to reduce loss(es) associated with the loss functions, the knowledge distillation system 102 learns parameters for the distilled neural network 118 that result in the distilled neural network 118 mimicking predictions of the source neural network 116],
wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer (Para 0024, distilled neural network) and an output of the classifier (Para 0024, classifications of the source neural network) [Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function]; a memory coupled to the processor and configured to provide the processor with instructions [Para 0128, Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions].
Kuen is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kuen and minimize a joint loss function in order to improve prediction accuracy (Kuen, para 0024).
Regarding claim 21, Lee teaches,
A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions [associated computer program product, non-transitory computer readable medium, and computer instructions implementing process of Fig. 2] for: configuring a surrogate hierarchical multi-task machine learning model [Abstract, we proposed a novel approach to explain the model prediction by developing an attentive surrogate network using the knowledge distillation] to perform both (i) a knowledge distillation task associated with a pre-trained machine learning model classifier [Sect 3.1, para 3, knowledge distillation aims to transfer the prediction capability of the large target network, called the teacher network; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network; Sect 4.1, para 3, We used ResNet-50 that was pretrained on ImageNet as the target network T] and (ii) an explanation task to predict a plurality of semantic concepts for explainability associated with the knowledge distillation task [Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input], wherein the surrogate hierarchical multi-task machine learning model includes: a concept layer to perform the explanation task (Sect 1, para 5, explanation network); a decision layer to perform the knowledge distillation task (Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation), wherein the output of the concept layer is utilized as an input to the decision layer [See also Fig. 2, Explanation Network → Saliency Map → Explanation → Teacher Network; Sect 1, para 5, The surrogate networks have two network branches: an attentive encoder network that approximates the features of the target network and extracts layer-wise attention, and an explanation network that takes the learned features from the encoder network and generates the final saliency map for the input; Sect 3.2, para 2, the target model to be explained is the teacher network to use the knowledge distillation… any classification network can be the teacher network because the knowledge distillation only requires the output of the last layers of the teacher network].
[media_image1.png: greyscale reproduction of Fig. 2 of Lee]
Lee does not explicitly teach receiving training data, wherein the training data includes input records and corresponding concept labels; and using one or more computer processors to train the surrogate hierarchical multi-task machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of the classifier. Furthermore, Lee implicitly teaches, but does not explicitly teach, A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions.
Kingetsu teaches,
receiving training data, wherein the training data includes input records and corresponding concept labels [Para 0091, The validation data 141b is data for validate the machine training model trained by the training data set 141a. A correct answer label is given to the validation data 141b. For example, when the validation data 141b is input to the machine training model, if an output result that is output from the machine training model matches the correct answer label that is given to the validation data 141b, this state indicates that the machine training model is appropriately trained by the training data set 141a];
Kingetsu is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kingetsu and provide training data including input records and concept labels in order to improve training by providing a validation dataset for comparison with output results.
Lee-Kingetsu teach the above limitations of claim 21 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract) and the pre-trained machine learning model classifier (Lee, Sect 3.1, para 3).
Lee-Kingetsu do not teach A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions; using one or more computer processors to train the machine learning model including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task and a loss function associated with the explanation task, wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer and an output of classifier.
Kuen teaches,
A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions [Para 0128, Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions];
using one or more computer processors [Para 0110, the components of the knowledge distillation system 102 can include one or more instructions stored on a computer-readable storage medium and executable by processors] to train the machine learning model (Para 0019, Over multiple iterations, or epochs, the knowledge distillation system repeats the process… to modify neural network parameters to improve the accuracy of the source neural network) including by minimizing a joint loss function that combines a loss function associated with the knowledge distillation task (Para 0024, knowledge distillation loss function) and a loss function associated with the explanation task (Para 0024, a classification loss function) [Para 0019, the knowledge distillation system further back propagates to modify parameters of the source neural network to reduce a measure of error or loss associated with the classification loss function. Over multiple iterations, or epochs, the knowledge distillation system repeats the process of generating classifications for lightly augmented digital images, comparing the classifications with ground truth labels, and back propagating to modify neural network parameters to improve the accuracy of the source neural network; Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function… the knowledge distillation system back propagates to modify parameters of the distilled neural network to improve classification accuracy by reducing a measure of loss determined via the knowledge distillation loss function (and/or a classification loss function); Para 0066, In particular, the knowledge distillation system 102 determines one or more measures of loss associated with the various comparisons described above and further modifies the parameters of the distilled neural network 118 to reduce the measure(s) of loss. Indeed, over multiple training iterations of classifying different digital images, applying loss functions to compare classifications, and modifying parameters to reduce loss(es) associated with the loss functions, the knowledge distillation system 102 learns parameters for the distilled neural network 118 that result in the distilled neural network 118 mimicking predictions of the source neural network 116],
wherein the loss function associated with the knowledge distillation task is determined by comparing an output of the decision layer (Para 0024, distilled neural network) and an output of classifier (Para 0024, classifications of the source neural network) [Para 0024, the knowledge distillation system further compares the respective classifications of the source neural network and distilled neural network via a knowledge distillation loss function].
Kuen is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Kuen and provide minimizing a joint loss function in order to [Kuen, para 0024] improve prediction accuracy.
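As a hedged illustration of the claimed joint-loss training (a sketch under assumed loss forms, not Kuen's actual implementation; the function names, temperature, and weighting are assumptions), the knowledge distillation loss and explanation loss might be combined as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Knowledge distillation loss: cross-entropy between the decision-layer
    # output and the (temperature-softened) output of the pre-trained classifier.
    p_teacher = softmax(teacher_logits / T)
    log_p_student = np.log(softmax(student_logits / T))
    return -np.sum(p_teacher * log_p_student)

def explanation_loss(concept_probs, concept_labels):
    # Binary cross-entropy between predicted and labeled semantic concepts.
    eps = 1e-12
    p = np.clip(concept_probs, eps, 1 - eps)
    y = concept_labels
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def joint_loss(student_logits, teacher_logits,
               concept_probs, concept_labels, alpha=0.5):
    # Weighted combination of the two task losses, minimized during training.
    return (alpha * kd_loss(student_logits, teacher_logits)
            + (1 - alpha) * explanation_loss(concept_probs, concept_labels))
```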
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Liu et al. (CN110059717A), hereinafter Liu.
Regarding claim 10, Lee-Kingetsu-Kuen teach the limitations of claim 5 including the surrogate hierarchical multi-task machine learning model (claim 1: Lee, Abstract) and an attention layer (claim 5: Lee, Sect 3.4, para 1).
Lee-Kingetsu-Kuen do not teach an input to the attention layer includes the output of the common layer.
Liu teaches,
an input to the attention layer (Para 0078, mapping function F) includes the output of the common layer (Para 0078, one layer of a convolutional neural network and its corresponding activation tensor (feature map)) [Para 0075, Spatial attention is a type of heatmap used to decode the contribution of spatial regions of an input image to the output; Para 0078, Consider one layer of a convolutional neural network and its corresponding activation tensor (feature map) A∈R<sup>C×H×W</sup>, which consists of C feature planes with spatial dimensions H×W. The mapping function F of this layer takes the aforementioned three-dimensional feature map A as input and outputs a two-dimensional spatial attention map].
Liu is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Liu and provide an attention layer in order to improve model interpretability by outputting maps that are easily understood by human operators.
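One common realization of such a mapping function F (sketched here as a channel-wise sum of squared activations followed by normalization; this particular choice of F is an assumption and may differ from Liu's) is:

```python
import numpy as np

def spatial_attention_map(A):
    """Collapse an activation tensor A of shape (C, H, W) into a 2-D spatial
    attention map by summing squared activations over the C channels, then
    normalizing the result to [0, 1]."""
    attn = (A ** 2).sum(axis=0)  # (H, W) spatial map
    attn -= attn.min()
    span = attn.max()
    return attn / span if span > 0 else attn

# Hypothetical feature map from a convolutional layer: 16 channels, 7x7 grid.
A = np.random.default_rng(1).standard_normal((16, 7, 7))
M = spatial_attention_map(A)  # 2-D attention map over the 7x7 spatial grid
```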
Claim(s) 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Rastrow et al. (US 20210295833 A1), hereinafter Rastrow.
Regarding claim 13, Lee-Kingetsu-Kuen teach the limitations of claim 12 including backpropagating of the calculated gradient of the joint loss function to update the weights of machine learning model (claim 12: Kuen, para 0024), the surrogate hierarchical multi-task machine learning model (claim 1: Lee, Abstract), and the decision layer (claim 1: Lee, Sect 3.2, para 2).
Lee-Kingetsu-Kuen do not teach interrupt between decision and a concept classifier.
Rastrow teaches,
interrupt between decision (Para 0215, decision table) and a concept classifier (Para 0215, classifier) [Para 0215, When the interrupt detector output indicates that speech is not detected, the device directed classifier 1020 does not process the audio data and nothing else occurs, indicated in the decision table 1300 as not applicable (e.g., N/A).].
Rastrow is analogous to the claimed invention as they both relate to utilization of interrupt architecture. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Rastrow and provide an interrupt for computational efficiency by preventing unnecessary execution of additional processing.
Regarding claim 14, Lee-Kingetsu-Kuen-Rastrow teach the limitations of claim 13.
Rastrow further teaches,
wherein the concept classifier is configured to receive class labels [Para 0057, Examples of trained classifiers include Support Vector Machines (SVMs), neural networks, decision trees, adaptive boosting (AdaBoost) combined with decision trees, and random forests. For example, SVM is a supervised learning model with associated learning algorithms that analyze data and recognize patterns in the data, and which are commonly used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other].
Rastrow is analogous to the claimed invention as they both relate to utilization of interrupt architecture. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Rastrow and provide receiving class labels in order to provide a means for analyzing prediction accuracy.
Claim(s) 15 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Zhang et al. (US 20200175384 A1), hereinafter Zhang.
Regarding claim 15, Lee-Kingetsu-Kuen teach the limitations of claim 1 including the concept labels (Kingetsu, Para 0091).
Lee-Kingetsu-Kuen do not teach labels are obtained using a concept extractor.
Zhang teaches,
labels are obtained using a concept extractor [Para 0084, The system then processes the user feedback to extract image features and label the object].
Zhang is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Zhang and provide extracting concepts in order to learn more accurate methodologies of object identification.
Claim(s) 16 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Chen et al. (US 20250131184 A1), hereinafter Chen.
Regarding claim 16, Lee-Kingetsu-Kuen teach the limitations of claim 1 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract) and the machine learning model classifier (Lee, Sect 3.2).
Lee-Kingetsu-Kuen do not teach wherein the surrogate hierarchical multi-task machine learning model and the machine learning model classifier are executed in parallel.
Chen teaches,
wherein machine learning model and machine learning model are executed in parallel [Abstract, The system can execute multiple machine learning models on the medical records in parallel using multi-threaded approach wherein each machine learning model executes using its own, dedicated computational thread in order to significantly speed up the time with which relevant information can be identified from documents by the system. The multi-threaded machine learning models can include, but are not limited to, sentence classification models, comorbidity models, ICD models, body parts models, prescription models, and provider name models. The system can also utilize combined convolutional neural networks and long short-term models (CNN+LSTMs) as well as ensemble machine learning models to categorize sentences in medical records. The system can also extract service provider, medical specializations, and dates of service information from medical records.].
Chen is analogous to the claimed invention as they both relate to parallel processing. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Chen and provide parallel processing in order to [Chen, Abstract] improve time efficiency.
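A minimal sketch of executing two models in parallel, each on its own dedicated thread (the model functions below are hypothetical stand-ins, not Chen's models):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two models; names are illustrative only.
def surrogate_model(x):
    return ("concepts", x * 2)

def classifier_model(x):
    return ("class", x + 1)

def run_in_parallel(x):
    # Submit both models concurrently; each executes on its own worker thread.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(surrogate_model, x)
        f2 = pool.submit(classifier_model, x)
        return f1.result(), f2.result()

out = run_in_parallel(10)
```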
Claim(s) 17 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Duerig et al. (US 20200401929 A1), hereinafter Duerig.
Regarding claim 17, Lee-Kingetsu-Kuen teach the limitations of claim 1 including the surrogate hierarchical multi-task machine learning model (Lee, Abstract) and the machine learning model classifier (Lee, Sect 3.2).
Lee-Kingetsu-Kuen do not teach wherein the surrogate hierarchical multi-task machine learning model and the machine learning model classifier are trained substantially simultaneously.
Duerig teaches (as best understood by the Examiner in view of 35 USC 112(b) rejection above),
wherein machine learning model and model classifier are trained substantially simultaneously [Para 0033, The trust model(s) for each pre-trained model can learn to predict, based on the input, when the pre-trained model is providing a correct output (e.g., which can be structured, for each pre-trained model, as a binary classification problem)… additionally, a single trust model can be trained to select, based on the input and from all available pre-trained machine-learned models, one or more of the pre-trained machine-learned models to serve as an expert specifically for that input (e.g., which can be structured as a multi-class and/or multi-label classification problem); Para 0034, training the distillation model may be performed in parallel with training the one or more trust models].
Duerig is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee’s teachings to incorporate the teachings of Duerig and provide simultaneous training for improved efficiency via learning related tasks at one instance.
Claim(s) 18 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kingetsu and Kuen, and in further view of Hall et al. (WO 2021056043 A1), hereinafter Hall.
Regarding claim 18, Lee-Kingetsu-Kuen teach the limitations of claim 1 including the loss function associated with the knowledge distillation task (Kuen, para 0024), the output of the pre-trained machine learning model classifier (Lee, Sect 3.1, para 3), the output of the decision layer (Kuen, para 0024), and the surrogate hierarchical multi-task machine learning model (Lee, Abstract).
Lee-Kingetsu-Kuen do not teach wherein the loss function is determined by calculating a binary cross entropy between output of classifier and output of layer of machine learning model.
Hall teaches,
wherein the loss function is determined by calculating a binary cross entropy between output of classifier (Para 0005, binary output label) and output of layer of machine learning model (Para 0005, a ground truth label) [Para 0005, A neural network is trained by optimizing the parameters or weights of the model to minimize a task-dependent ‘loss function’. This loss function encodes the method of measuring the success of the neural network at optimizing the parameters for a given problem. For example, if we consider a Binary Image Classification problem, that is, separating a set of images into exactly two categories, the input images are first run through the model where a binary output label is computed e.g. 0 or 1 - to represent the two categories of interest. The predicted output is then compared against a ground truth label, and a loss (or error) is calculated. In the binary classification example, a Binary Cross-Entropy loss function is the most commonly used loss function. Using the loss value obtained from this function, we can compute the error gradients with respect to the input for each layer in the network. This process is known as back-propagation. The gradients are vectors that describe the direction in which the neural network parameters (or ‘weights’) are being altered during the optimization process in order to minimize the loss function; Para 0087, The training process comprises pre-processing data... The data may also be labelled and cleaned. Once the data is suitably pre-processed it can then be used to train one or more AI models.].
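For reference, the binary cross-entropy described in Hall's paragraph 0005 can be sketched as follows (a standard textbook formulation, not code from Hall; the clamping constant is an assumption for numerical stability):

```python
import math

def binary_cross_entropy(p, y):
    # p: predicted probability from one model's output;
    # y: corresponding 0/1 target taken from the other model's output.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```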
Hall is analogous to the claimed invention as they both relate to knowledge distillation. Therefore, it would have been obvious for one of ordinary skill in the art before the effect