Prosecution Insights
Last updated: April 19, 2026
Application No. 17/959,900

SYSTEM AND METHOD FOR MACHINE LEARNING ARCHITECTURE FOR MULTI-TASK LEARNING WITH DYNAMIC NEURAL NETWORKS

Status: Final Rejection (§103)
Filed: Oct 04, 2022
Examiner: Bracero, Andrew Angel
Art Unit: 2126
Tech Center: 2100 — Computer Architecture & Software
Assignee: Royal Bank of Canada
OA Round: 2 (Final)
Grant Probability: 100% (Favorable)
Expected OA Rounds: 3-4
Estimated Time to Grant: 3y 3m
Grant Probability with Interview: 99%

Examiner Intelligence

Grants 100% — above average
Career Allow Rate: 100% (5 granted / 5 resolved; +45.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; based on resolved cases with interview)
Typical Timeline: 3y 3m average prosecution; 26 applications currently pending
Career History: 31 total applications across all art units

Statute-Specific Performance

§101: 34.9% (-5.1% vs TC avg)
§103: 44.0% (+4.0% vs TC avg)
§102: 9.6% (-30.4% vs TC avg)
§112: 10.5% (-29.5% vs TC avg)
Tech Center averages are estimates; based on career data from 5 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claims 1-20 are presented for examination in this application (17/959,900), filed 10/04/2022, claiming priority to provisional 63/252,003 and being granted an effective filing date of 10/04/2021.

The examiner cites particular sections in the references as applied to the claims below for the convenience of the applicant(s). Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant(s) fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.

Response to Arguments

Applicant's arguments and remarks filed 1/23/2026 have been fully considered. The arguments and remarks regarding the 35 U.S.C. 112 rejections were found to be persuasive. The arguments and remarks regarding the 35 U.S.C. 101 rejections were found to be persuasive. The arguments and remarks regarding the 35 U.S.C. 103 rejections were found to be persuasive; however, the amendments have necessitated a change in the references applied. The 35 U.S.C. 103 rejections have been maintained via a new ground of rejection.

35 U.S.C. 103

Applicant's response: Applicant asserts "Neither Veit nor Masse disclose, teach or suggest a gating mechanism including both a task gating and an instance gating. To the contrary, Veit's gated inference (section 3.1 as referenced by the Office Action) only includes a gate conditioned on the input to a layer. Veit provides no teaching of a task gating or any other gating hierarchy apart from single layer input gating. These features are also not disclosed by Masse. As such, no combination of these references would have led the skilled person to the subject matter of claim 1. For at least these reasons, claim 1, and for at least similar reasons, claims 2-7, 9, 11-16, 18 and 20 are non-obvious and comply with 35 USC 103 having regard to Veit and Masse. The Office Action rejects claims 8, 10, 17 and 19 under 35 USC 103 having regard to Veit, Masse and Sun ("AdaShare: Learning What to Share For Efficient Deep Multi-Task Learning"). As noted above, Veit and Masse fail to disclose a gating mechanism including both a task gating and an instance gating. This is not remedied by further combination with Sun. As such, at least by virtue of their dependencies, claims 8, 10, 17 and 19 are non-obvious and comply with 35 USC 103 having regard to Veit, Masse and Sun."

Examiner's response: Examiner agrees that Veit does not teach a task gating mechanism. However, the claims have necessitated a change in the references applied. Examiner asserts, using broadest reasonable interpretation in light of the specification, that Sun does teach a task gating mechanism, as seen in figs. 1 and 2. The gating mechanism uses task-specific, as well as shared, features to determine what layers should be skipped, therefore acting as a gating mechanism. For at least these reasons the claims remain rejected under 35 U.S.C. 103.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 8-12, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. ("AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning"; hereinafter, Sun) in view of Masse et al. (US20200250483A1; hereinafter, Masse).

Regarding claim 1 (currently amended): Sun teaches a computer-implemented system for computing an action for an automated agent (pg. 1, abstract: "Unlike existing methods, we propose an adaptive sharing approach, called AdaShare, that decides what to share across which tasks to achieve the best recognition accuracy, while taking resource efficiency into account. Specifically, our main idea is to learn the sharing pattern through a task-specific policy that selectively chooses which layers to execute for a given task in the multi-task network. We efficiently optimize the task-specific policy jointly with the network weights, using standard back-propagation."), the system comprising:

a multi-task neural network comprising weights for learning a plurality of tasks, the neural network including a plurality of layers wherein two or more subsets of layers of the plurality of layers are connected by a gating mechanism, the gating mechanism configured to dynamically activate or deactivate at least one layer of the corresponding subset of layers, the gating mechanism including a task gating configured to activate or deactivate based on a task-specific policy and an instance gating configured to activate or deactivate based on feature representations of an input (see pg. 4, section 3: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output. During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

train the neural network to update the weights and the gating mechanism (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

after training, receive, (see pg. 3, section 3: "Given a set of K tasks T = {T1, T2, ···, TK} defined over a dataset, our goal is to seek an adaptive feature sharing mechanism that decides what network layers should be shared across which tasks and what layers should be task-specific in order to improve the accuracy, while taking the resource efficiency into account for scalable multi-task learning.");

propagating the input data through the multi-task neural network including the gating mechanism which (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling."):

the task gating activates or deactivates a subset of layers based on at least the task type (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

the instance gating activates or deactivates at least one layer of the subset of layers (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

generate an action signal based on a forward pass of the neural network using the dynamically activated at least one layer of the neural network (see pg. 5, section 3: "During the training, we use the soft task-specific decision vl,k given by Eq. 2 in both forward and backward passes").

Sun does not explicitly teach a communication interface, at least one processor, or memory in communication with the at least one processor. Masse, however, analogously teaches a communication interface, at least one processor, and memory in communication with the at least one processor (see para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110.").
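For orientation, the following is a minimal, hypothetical sketch (PyTorch-style Python; not code from the application, Sun, or Veit, and all module and parameter names are invented) of the gating arrangement recited in claim 1 as the examiner maps it: a learned, task-specific select-or-skip policy ("task gating") and a per-input decision computed from the features ("instance gating") jointly control whether each block executes, and the output head produces the claimed action signal.

```python
# Hypothetical sketch of claim 1's two-level gating; not the applicant's implementation
# and not AdaShare's or Veit's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultiTaskNet(nn.Module):
    def __init__(self, dim: int, num_blocks: int, num_tasks: int):
        super().__init__()
        # Shared residual blocks whose execution can be skipped per task and per input.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_blocks)]
        )
        # Task gating: learned per-task, per-block select-or-skip logits (the "policy").
        self.task_logits = nn.Parameter(torch.zeros(num_tasks, num_blocks, 2))
        # Instance gating: small estimators mapping current features to skip/execute scores.
        self.instance_gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 2))
             for _ in range(num_blocks)]
        )
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int, hard: bool = True) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            # Task gate: differentiable sample from the task-specific policy
            # (straight-through Gumbel-Softmax when hard=True).
            task_gate = F.gumbel_softmax(self.task_logits[task_id, i], tau=1.0, hard=hard)[1]
            # Instance gate: execute/skip decision from this input's feature representation.
            inst_gate = F.gumbel_softmax(self.instance_gates[i](x), tau=1.0, hard=hard)[:, 1:]
            # The block contributes only when both gates are "on"; otherwise the
            # residual connection passes the features through unchanged.
            x = x + task_gate * inst_gate * block(x)
        return self.heads[task_id](x)  # action signal for the selected task


net = GatedMultiTaskNet(dim=32, num_blocks=4, num_tasks=3)
print(net(torch.randn(8, 32), task_id=1).shape)  # torch.Size([8, 1])
```

In the applicant's framing quoted above, the first gate corresponds to what AdaShare's task-specific policy supplies, while the second is the kind of input-conditioned gate the argument attributes to Veit; the sketch simply shows the two operating together.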
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun and Masse before him or her, to modify the system of claim 1 to include attributes of a communication interface, at least one processor, and memory in communication with the at least one processor in order to perform example embodiments with a computing device (see Masse para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110. The bus could also provide power from a power supply (not shown). In particular, computing device 100 may be configured to perform at least one function of and/or related to components of artificial neural network 200, gating table 500 and/or 502, machine learning system 700, and/or method 800, all of which are described below.").

Regarding claim 2: Sun in view of Masse teaches the system of claim 1. Sun further teaches wherein training the neural network includes training a loss function which is a function of a layer execution probability and a task knowledge sharing parameter (see pg. 5, section 3: "Furthermore, we introduce a loss Lsharing that encourages residual block sharing across tasks to avoid the whole network being split up by tasks with little knowledge shared among them." Also see pg. 5, eqs. 3-5).

Regarding claim 12: Claim 12 recites analogous limitations to claim 2 and therefore is rejected on the same grounds as claim 2.

Regarding claim 8: Sun in view of Masse teaches the system of claim 1. Sun further teaches wherein the selection of the subset of layers based on at least the task type is determined based on a task-specific policy (see abstract: "Specifically, our main idea is to learn the sharing pattern through a task-specific policy that selectively chooses which layers to execute for a given task in the multi-task network.").

Regarding claim 17: Claim 17 recites analogous limitations to claim 8 and therefore is rejected on the same grounds as claim 8.

Regarding claim 9: Sun in view of Masse teaches the system of claim 1. Sun does not explicitly teach wherein the dynamically activating the at least one layer of the subset of layers comprises: determining an output using a Rectified Linear Unit. Masse, however, analogously teaches wherein the dynamically activating the at least one layer of the subset of layers comprises: determining an output using a Rectified Linear Unit (ReLU) (see section 3.2, 'Estimating Layer Relevance': "To capture the dependencies between channels, we add a simple non-linear function of two fully-connected layers connected with a ReLU [7] activation function. The output of this operation is the relevance score for the layer. Specifically, it is a vector β containing unnormalized scores for the two actions of (a) computing and (b) skipping the following layer, respectively. [equation image not reproduced]").
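Schematically, the objective recited in claims 2/12 above (and elaborated for claim 10 below) combines the per-task losses with a term tied to the layer execution probability and a term that promotes cross-task knowledge sharing. In illustrative notation only (hypothetical weights λ; Sun's actual equations 3-5 are not reproduced in this record):

L_total = Σ_k L_task,k + λ_sparsity · L_sparsity + λ_sharing · L_sharing

where L_sparsity penalizes the probability that each block is executed and L_sharing penalizes a weighted L1 distance between the task-specific policy logits so that related tasks keep sharing blocks.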
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun and Masse before him or her, to modify the system of claim 9 to include attributes of wherein the dynamically activating the at least one layer of the subset of layers comprises: determining an output using a Rectified Linear Unit (ReLU) in order to train the neural network (see Masse para [0008]: "The neuron sums the weighted input values together with a bias term, computes the activation function of the sum, and outputs the activation function to the next layer or as output from the ANN. It is the weights and biases of all the connections of the ANN that are adjusted or tuned during training.").

Regarding claim 18: Claim 18 recites analogous limitations to claim 9 and therefore is rejected on the same grounds as claim 9.

Regarding claim 10: Sun in view of Masse teaches the system of claim 1. Sun further teaches wherein training of the neural network comprises: optimizing a loss function that includes a first term for reducing a probability of an execution of a given layer and a second term that increases knowledge sharing between a plurality of tasks (see section 3, 'Proposed Method', subsection 'Loss Functions': "Furthermore, we introduce a loss Lsharing that encourages residual block sharing across tasks to avoid the whole network being split up by tasks with little knowledge shared among them. Encouraging sharing reduces the redundancy of knowledge separately kept in task-specific blocks of related tasks and results in a more efficient sharing scheme that better utilizes residual blocks. Specifically, we minimize the weighted sum of L1 distances between the policy logits of different tasks with an emphasis on encouraging the sharing of bottom blocks which contain low-level knowledge. More formally, we define Lsharing as [equation image not reproduced]. Finally, the overall loss L is defined as [equation image not reproduced].").

Regarding claim 19: Claim 19 recites analogous limitations to claim 10 and therefore is rejected on the same grounds as claim 10.

Regarding claim 11: Sun teaches a computer-implemented system for computing an action for an automated agent (pg. 1, abstract: "Unlike existing methods, we propose an adaptive sharing approach, called AdaShare, that decides what to share across which tasks to achieve the best recognition accuracy, while taking resource efficiency into account. Specifically, our main idea is to learn the sharing pattern through a task-specific policy that selectively chooses which layers to execute for a given task in the multi-task network. We efficiently optimize the task-specific policy jointly with the network weights, using standard back-propagation."), the system comprising:

instantiating, (see pg. 4, section 3: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output. During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

train the neural network to update the weights and the gating mechanism (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

after training, receive, with a task type (see pg. 3, section 3: "Given a set of K tasks T = {T1, T2, ···, TK} defined over a dataset, our goal is to seek an adaptive feature sharing mechanism that decides what network layers should be shared across which tasks and what layers should be task-specific in order to improve the accuracy, while taking the resource efficiency into account for scalable multi-task learning.");

propagating the input data through the multi-task neural network including the gating mechanism which (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling."):

the task gating activates or deactivates a subset of layers based on at least the task type (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

the instance gating activates or deactivates at least one layer of the subset of layers (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

generate an action signal based on a forward pass of the neural network using the dynamically activated at least one layer of the neural network (see pg. 5, section 3: "During the training, we use the soft task-specific decision vl,k given by Eq. 2 in both forward and backward passes").

Sun does not explicitly teach a communication interface, at least one processor, or memory in communication with the at least one processor. Masse, however, analogously teaches a communication interface, at least one processor, and memory in communication with the at least one processor (see para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110.").
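The "train the neural network to update the weights and the gating mechanism" limitation, mapped above to Sun's statement that policy logits and network parameters are jointly learned through Gumbel-Softmax sampling, amounts to a single backward pass updating both sets of parameters. A minimal, hypothetical training step (reusing the GatedMultiTaskNet sketch above; not AdaShare's released code):

```python
# Hypothetical joint update of network weights and task-gating policy logits.
# Because the Gumbel-Softmax samples are straight-through differentiable, one
# backward pass produces gradients for both.
import torch
import torch.nn.functional as F

net = GatedMultiTaskNet(dim=32, num_blocks=4, num_tasks=3)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # covers blocks, heads, and task_logits

x, target = torch.randn(8, 32), torch.randn(8, 1)          # toy batch for task 1

optimizer.zero_grad()
loss = F.mse_loss(net(x, task_id=1, hard=True), target)    # gated forward pass
loss.backward()                                             # gradients reach weights and gate logits
optimizer.step()
```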
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun and Masse before him or her, to modify the system of claim 11 to include attributes of a communication interface, at least one processor, and memory in communication with the at least one processor in order to perform example embodiments with a computing device (see Masse para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110. The bus could also provide power from a power supply (not shown). In particular, computing device 100 may be configured to perform at least one function of and/or related to components of artificial neural network 200, gating table 500 and/or 502, machine learning system 700, and/or method 800, all of which are described below.").

Regarding claim 20: Sun teaches a multi-task neural network comprising weights for learning a plurality of tasks, the neural network including a plurality of layers wherein two or more subsets of layers of the plurality of layers are connected by a gating mechanism, the gating mechanism configured to dynamically activate or deactivate at least one layer of the corresponding subset of layers, the gating mechanism including a task gating configured to activate or deactivate based on a task-specific policy and an instance gating configured to activate or deactivate based on feature representations of an input (see pg. 4, section 3: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output. During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

train the neural network to update the weights and the gating mechanism (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling.");

after training, receive, (see pg. 3, section 3: "Given a set of K tasks T = {T1, T2, ···, TK} defined over a dataset, our goal is to seek an adaptive feature sharing mechanism that decides what network layers should be shared across which tasks and what layers should be task-specific in order to improve the accuracy, while taking the resource efficiency into account for scalable multi-task learning.");

propagating the input data through the multi-task neural network including the gating mechanism which (see pg. 4, fig. 2: "During training, both policy logits and network parameters are jointly learned using standard back-propagation through Gumbel-Softmax Sampling."):

the task gating activates or deactivates a subset of layers based on at least the task type (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

the instance gating activates or deactivates at least one layer of the subset of layers (see pg. 4, fig. 2: "In AdaShare, we learn the select-or-skip policy U and network weights W jointly through standard backpropagation from our designed loss functions." Also see pg. 4, fig. 2: "AdaShare learns the layer sharing pattern among multiple tasks through predicting a select-or-skip policy decision sampled from the learned task-specific policy distribution (logits). These select-or-skip vectors define which blocks should be executed in different tasks. A block is said to be shared across two tasks if it is being used by both of them or task-specific if it is being used by only one task for predicting the output."); and

generate an action signal based on a forward pass of the neural network using the dynamically activated at least one layer of the neural network (see pg. 5, section 3: "During the training, we use the soft task-specific decision vl,k given by Eq. 2 in both forward and backward passes").

Sun does not explicitly teach a non-transitory computer-readable storage medium storing instructions, communication interface, or at least one processor. Masse, however, analogously teaches a non-transitory computer-readable storage medium storing instructions, communication interface, at least one processor, and memory in communication with the at least one processor (see para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110." Also see para [0016]: "In still another respect, example embodiments may involve an article of manufacture comprising non-transitory computer readable media having computer-readable instructions stored thereon that, when executed by one or more processors of a computing device, cause the computing device to carry out operations.").

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun and Masse before him or her, to modify the system of claim 20 to include attributes of a non-transitory computer-readable storage medium storing instructions, communication interface, or at least one processor in order to perform example embodiments with a computing device (see Masse para [0041]: "FIG. 1 is a simplified block diagram of a computing device 100, in accordance with example embodiments. As shown, the computing device 100 may include processor(s) 102, memory 104, network interface(s) 106, and an input/output unit 108. By way of example, the components are communicatively connected by a bus 110. The bus could also provide power from a power supply (not shown). In particular, computing device 100 may be configured to perform at least one function of and/or related to components of artificial neural network 200, gating table 500 and/or 502, machine learning system 700, and/or method 800, all of which are described below.").

Claims 3-7 and 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. ("AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning"; hereinafter, Sun) in view of Masse et al. (US20200250483A1; hereinafter, Masse) in further view of Veit et al. ("Convolutional Networks with Adaptive Inference Graphs"; hereinafter, Veit).

Regarding claim 3: Sun in view of Masse teaches the system of claim 2. Sun does not explicitly teach wherein the respective gating mechanism dynamically activates the respective layer of the subset of layers by: computing, by a relevance estimator, a relevance metric of an intermediate feature input to the respective layer connected to the respective gating mechanism, and dynamically activating the respective layer connected to the respective gating mechanism based on the relevance metric. Veit, however, analogously teaches wherein the respective gating mechanism dynamically activates the respective layer of the subset of layers by: computing, by a relevance estimator, a relevance metric of an intermediate feature input to the respective layer connected to the respective gating mechanism (see section 3.2, 'Estimating Layer Relevance': "To capture the dependencies between channels, we add a simple non-linear function of two fully-connected layers connected with a ReLU [7] activation function. The output of this operation is the relevance score for the layer. Specifically, it is a vector β containing unnormalized scores for the two actions of (a) computing and (b) skipping the following layer, respectively. [equation image not reproduced]"), and dynamically activating the respective layer connected to the respective gating mechanism based on the relevance metric (see section 3.2, 'Estimating Layer Relevance': "Specifically, it is a vector β containing unnormalized scores for the two actions of (a) computing and (b) skipping the following layer, respectively." Also see section 3.3, 'Greedy Gumbel Sampling': "The goal of the second component is to make a discrete decision based on the relevance scores … Ideally, we would like to choose among the two options proportional to their relevance scores. A standard way to introduce such stochasticity is to add noise to the scores … Then, we can sample from the discrete variable X by sampling from the Gumbel random variables [equation image not reproduced]". Also see fig. 2.).

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun, Masse, and Veit before him or her, to modify the system of claim 3 to include attributes of wherein the respective gating mechanism dynamically activates the respective layer of the subset of layers by: computing, by a relevance estimator, a relevance metric of an intermediate feature input to the respective layer connected to the respective gating mechanism, and dynamically activating the respective layer connected to the respective gating mechanism based on the relevance metric in order to group parameters into layers for related classes, thereby improving both efficiency and overall classification quality (see Veit pg. 1, abstract: "By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality").

Regarding claim 13: Claim 13 recites analogous limitations to claim 3 and therefore is rejected on the same grounds as claim 3.

Regarding claim 4: Sun in view of Masse in further view of Veit teaches the system of claim 3. Sun does not explicitly teach wherein the respective gating unit dynamically activates the respective layer of the subset of layers when the relevance metric is at or above a predetermined threshold. Veit, however, analogously teaches wherein the respective gating unit dynamically activates the respective layer of the subset of layers when the relevance metric is at or above a predetermined threshold (see fig. 2, depicting whether a forward pass of a layer is executed [activated] based on whether the argmax value is 1).

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun, Masse, and Veit before him or her, to modify the system of claim 4 to include attributes of wherein the respective gating unit dynamically activates the respective layer of the subset of layers when the relevance metric is at or above a predetermined threshold in order to group parameters into layers for related classes, thereby improving both efficiency and overall classification quality (see Veit pg. 1, abstract: "By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality").

Regarding claim 14: Claim 14 recites analogous limitations to claim 4 and therefore is rejected on the same grounds as claim 4.

Regarding claim 5: Sun in view of Masse in further view of Veit teaches the system of claim 3. Sun does not explicitly teach wherein the respective gating unit dynamically deactivates the respective layer of the subset of layers when the relevance metric is below a predetermined threshold. Veit, however, analogously teaches wherein the respective gating unit dynamically deactivates the respective layer of the subset of layers when the relevance metric is below a predetermined threshold (see fig. 2, depicting whether a forward pass of a layer is skipped [deactivated] based on whether the argmax value is not 1).

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun, Masse, and Veit before him or her, to modify the system of claim 5 to include attributes of wherein the respective gating unit dynamically deactivates the respective layer of the subset of layers when the relevance metric is below a predetermined threshold in order to group parameters into layers for related classes, thereby improving both efficiency and overall classification quality (see Veit pg. 1, abstract: "By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality").

Regarding claim 15: Claim 15 recites analogous limitations to claim 5 and therefore is rejected on the same grounds as claim 5.

Regarding claim 6: Sun in view of Masse in further view of Veit teaches the system of claim 3. Sun does not explicitly teach wherein the relevance estimator comprises two convolution layers and an activation function. Veit, however, analogously teaches wherein the relevance estimator comprises two convolution layers and an activation function (see section 3.2, 'Estimating Layer Relevance': "To capture the dependencies between channels, we add a simple non-linear function of two fully-connected layers connected with a ReLU [7] activation function." Also see fig. 2.).

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun, Masse, and Veit before him or her, to modify the system of claim 6 to include attributes of wherein the relevance estimator comprises two convolution layers and an activation function in order to group parameters into layers for related classes, thereby improving both efficiency and overall classification quality (see Veit pg. 1, abstract: "By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality").

Regarding claim 16: Claim 16 recites analogous limitations to claim 6 and therefore is rejected on the same grounds as claim 6.

Regarding claim 7: Sun in view of Masse in further view of Veit teaches the system of claim 4. Sun does not explicitly teach wherein the relevance estimator comprises an average pooling function between the convolution layers and the activation function. Veit, however, analogously teaches wherein the relevance estimator comprises an average pooling function between the convolution layers and the activation function (see section 3.2, 'Estimating Layer Relevance': "The goal of the gate's first component is to estimate its layer's relevance given the input features. The input to the gate is the output of the previous layer xl−1 ∈ RW×H×C. Since operating on the full feature map is computationally expensive, we build upon recent studies [13, 17, 23] which show that much of the information in convolutional features is captured by the statistics of the different channels and their interdependencies. In particular, we only consider channel-wise means gathered by global average pooling. This compresses the input features into a 1 × 1 × C channel descriptor. [equation image not reproduced] To capture the dependencies between channels, we add a simple non-linear function of two fully-connected layers connected with a ReLU [7] activation function." Also see fig. 2.).

Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Sun, Masse, and Veit before him or her, to modify the system of claim 7 to include attributes of wherein the relevance estimator comprises an average pooling function between the convolution layers and the activation function in order to group parameters into layers for related classes, thereby improving both efficiency and overall classification quality (see Veit pg. 1, abstract: "By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality").

Pertinent Prior Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

US11494626 — discloses using gating to turn off parts of a neural network in the field of tasks.
US12271800B2 — discloses gating in neural networks and conditional computation in the field of tasks.
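The Veit passages quoted for claims 3-7 describe a gate built from channel-wise global average pooling, two fully-connected layers joined by a ReLU producing a two-element relevance score β, and a discrete execute/skip decision drawn with Gumbel sampling. A minimal, hypothetical rendering of that structure follows (invented names; not Veit's released code, and note that the quoted passage uses fully-connected layers whereas claims 6-7 recite convolution layers):

```python
# Hypothetical sketch of a Veit-style relevance gate; not Veit's released ConvNet-AIG code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelevanceGate(nn.Module):
    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, 2)  # unnormalized scores for (skip, execute)

    def forward(self, x: torch.Tensor, hard: bool = True) -> torch.Tensor:
        # x: (N, C, H, W) feature map entering the gated layer.
        z = F.adaptive_avg_pool2d(x, 1).flatten(1)   # channel-wise means -> (N, C) descriptor
        beta = self.fc2(F.relu(self.fc1(z)))         # relevance scores beta -> (N, 2)
        if hard:
            # Inference: execute the following layer only when the "execute" score wins
            # (argmax == 1), i.e. the relevance decision clears the threshold.
            return (beta.argmax(dim=1) == 1).float()  # (N,), 1.0 = execute, 0.0 = skip
        # Training: differentiable straight-through Gumbel-Softmax sample of the decision.
        return F.gumbel_softmax(beta, tau=1.0, hard=True)[:, 1]


gate = RelevanceGate(channels=64)
print(gate(torch.randn(4, 64, 8, 8)))  # e.g. tensor([1., 0., 1., 1.])
```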
Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Andrew A Bracero, whose telephone number is (571) 270-0592. The examiner can normally be reached Monday - Friday, 9:00 a.m. - 5:00 p.m. ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, David Yi, can be reached Monday - Friday, 9:00 a.m. - 5:00 p.m. ET at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANDREW BRACERO/
Examiner, Art Unit 2126

/DAVID YI/
Supervisory Patent Examiner, Art Unit 2126

Prosecution Timeline

Oct 04, 2022 — Application Filed
Jul 21, 2025 — Non-Final Rejection (§103)
Jan 23, 2026 — Response Filed
Mar 24, 2026 — Final Rejection (§103, current)


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 100%
With Interview: 99% (+0.0%)
Median Time to Grant: 3y 3m
PTA Risk: Moderate
Based on 5 resolved cases by this examiner. Grant probability derived from career allow rate.
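The note above says the grant probability is derived from the examiner's career allow rate. As a minimal illustration of that derivation (a hypothetical helper, not this report's actual projection model), the base rate is simply granted cases over resolved cases:

```python
# Hypothetical helper illustrating a base grant probability from a career allow rate;
# not the report's actual projection model.
def career_allow_rate(granted: int, resolved: int) -> float:
    return granted / resolved if resolved else 0.0

print(f"{career_allow_rate(5, 5):.0%}")  # 100%, matching the 5 granted / 5 resolved shown above
```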
