Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/29/2025 has been entered.
Remarks
This Office Action is responsive to Applicant's Amendment filed on October 29, 2025, in which claims 1, 12, 17, and 21 are currently amended. Claims 1-21 are currently pending.
Response to Arguments
Applicant's arguments with respect to the rejection of claims 1-21 under 35 U.S.C. 103, based on the amendment, have been fully considered and are persuasive as to the previous rejection; however, they are moot in view of the new ground of rejection set forth below.
Specification
The disclosure is objected to because of the following informalities:
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Claim Objections
Claims 1 and 17 are objected to because of the following informalities: "training the neural network by training" is redundant; "training the neural network" is recommended. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-11 and 17-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding claims 1 and 17, "training the neural network by training, based on a model parameter of the previous task, the model parameter and an adaptive parameter of a previous task of recognizing another object with respect to the current task" is indefinite. First, it is ambiguous whether "a previous task" and "the previous task" should be interpreted as the same task or as different tasks. Taken at face value, if "a previous task" is intended to be a different task, then subsequent recitations of "the previous task" in claim 1 lack antecedent basis. Second, the claim limitation is structurally ambiguous: it is unclear whether "based on a model parameter of the previous task" modifies "training", "the model parameter", "an adaptive parameter", or something else altogether. Because these interpretations are contradictory, the scope of the claim cannot reasonably be determined. In the interest of compact prosecution, the claim limitation is interpreted as "training the neural network based on a model parameter and an adaptive parameter of the previous task, the previous task directed to recognizing an object other than the object recognized by the current task".
Regarding claims 2-11 and 18-20, these claims are rejected by virtue of their dependence on claims 1 and 17, respectively.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-6, 9-13, and 15-21 are rejected under 35 U.S.C. § 103 as being unpatentable over Dixit (US20200193296A1) in view of Ketz (US20200134426A1).
Regarding claim 1, Dixit teaches A processor-implemented neural network method, the method comprising: determining an adaptive parameter ([¶0065] "Some aspects of the technology described herein present a framework for network adaptation that is more efficient and scalable compared to fine-tuning as well as the residual adapter scheme. Some aspects learn lightweight adaptation modules that attach to the filters in a pre-trained large-scale neural network and attenuate their activations using generated attention masks" Adaptation module interpreted as synonymous with adaptive parameter.)
and an adaptive mask of a current task of recognizing an object to be learned among a plurality of tasks of a neural network;([¶0019] "an image recognition neural network may be adapted for problems such as pedestrian recognition, traffic control device recognition, road hazard recognition, geographic location indicator recognition, and the like" [¶0020] "During the process of across-task transfer, the adaptation modules are attached across convolutional units of the original network and trained from scratch using a supervised objective for the target task. The adapters generate attention masks to attenuate responses of the convolutional units to which they are connected. These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem" Attention/confidence mask/map interpreted as synonymous with adaptive mask.)
determining a model parameter of the current task based on the adaptive parameter, the adaptive mask, and a shared parameter of the plurality of tasks; ([¶0064] "A possible approach to achieving efficient adaptation is to set aside a small number of layers in a pre-trained network for task-specificity. For every new task, only these parameters are learned through adaptation while the rest of the network is kept unchanged [...] One scheme uses a residual module that combines a batch normalization layer and a linear layer embedded in the original network for adaptation [...] This scheme, therefore, is perhaps more suited to a multi-task learning scenario, where the whole system is trained concurrently on the original large-scale problem (e.g., an object recognition neural network trained using ImageNet) and other related problems (e.g., scene classification, object detection, etc.)." the small number of layers set aside for task-specificity are interpreted as shared parameters learned through adapter module.)
wherein the adaptive parameter of the previous task and the shared parameter are trained with respect to the previous task ([¶0020] "These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem" [¶0021] "For example, an object recognition network trained on the ImageNet dataset can be adapted with relatively few scene images for a scene recognition task" [¶0022] "ImageNet is referred to herein as an example of a data set for which a previously trained neural network may exist. However, it should be noted that any dataset can be used in place of ImageNet" [¶0045] "training of a DNN architecture, a regression, which is structured as a set of statistical processes for estimating the relationships among variables, can include a minimization of a cost function" See FIG. 7A. Dixit explicitly teaches guiding the stages of the original neural network (previous task) towards regions of interest of the new problem based on minimization of the adaptive parameter. Dixit teaches training the model parameter and an adaptive parameter of a previous task (object recognition) with respect to the current task (scene recognition). In FIG. 5A/5B, 510A/B, W0 is the shared parameter (depicted as 790A, w0 CONVkxk, in FIG. 7A); 515B, H(., ϕ0), in FIG. 5B is the adaptive parameter, depicted as 700A in FIG. 7A. Dixit explicitly teaches that 515B is trained with respect to the shared parameter (see Eqn. 8 in ¶0084, ẑ0_{i,j} ∝ a0_{i,j} · W0 * (x0_{i,j} − μ̂0) (8); ¶0085 "Model adaptation is performed using the encoding in Equation 8 by training a task-specific branch H(., ϕ0) that generates the attention or confidence map a0").
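The attenuation of Eqn. 8 can be sketched as follows. This is an illustration only, not Dixit's implementation: the convolution W0 is reduced to a scalar stand-in, and the sigmoid attention map is a hypothetical substitute for the task-specific branch H(., ϕ0).

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=(4, 4))        # input activations x0_{i,j}
mu_hat = x0.mean()                  # estimated mean mu-hat of the input
w0 = 0.5                            # scalar stand-in for the frozen shared filter W0
a0 = 1.0 / (1.0 + np.exp(-x0))      # hypothetical attention map a0 in (0, 1)

# Eqn. 8 form: z0_{i,j} is proportional to a0_{i,j} * W0 * (x0_{i,j} - mu_hat),
# i.e., the attention map attenuates the mean-centered shared-filter response.
z0 = a0 * w0 * (x0 - mu_hat)

# Because a0 is strictly positive, attenuation rescales responses without
# flipping their sign.
assert np.all(np.sign(z0) == np.sign(w0 * (x0 - mu_hat)))
```

Only the attention branch generating a0 is trained for the new task; the shared filter W0 stays fixed, consistent with the examiner's mapping above.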
However, Dixit does not explicitly teach and training the neural network by training, based on a model parameter of the previous task, the model parameter and an adaptive parameter of a previous task of recognizing another object with respect to the current task,
wherein the model parameter of the previous task, determined based on the adaptive parameter of the previous task and the shared parameter, is maintained to be a same value as prior to training of the adaptive parameter
by updating the adaptive parameter of the previous task based on a change in the shared parameter resulting from training of the model parameter of the current task for the current task.
FIG. 1 of Ketz
Ketz, in the same field of endeavor, teaches and training the neural network by training, based on a model parameter of the previous task, the model parameter and an adaptive parameter of a previous task of recognizing another object with respect to the current task, ([¶0038] "the embodiments of the present disclosure may enable an autonomous or semi-autonomous system to learn to navigate in a variety of conditions (e.g., wet, icy, foggy) without the need for specifying what all those conditions would be a priori, or re-experiencing the various conditions it has already learned to perform well in. For instance, the methods of the present disclosure would enable, for example, a self-driving car to learn to recognize tricycles without forgetting how to recognize bicycles [...] can then be trained to perform a new task on demand (e.g., washing windows) while also retaining its ability to perform its original task" new/current task interpreted as recognizing tricycles, previous task interpreted as recognizing bicycles)
wherein the model parameter of the previous task, determined based on the adaptive parameter of the previous task and the shared parameter, is maintained to be a same value as prior to training of the adaptive parameter ([¶0014] "to train a temporal prediction network on a first set of samples from an environment of an autonomous or semi-autonomous system during performance of a first task, train a controller on the first set of samples from the environment and a hidden state output by the temporal prediction network, store a preserved copy of the temporal prediction network, store a preserved copy of the controller, generate simulated rollouts from the preserved copy of the temporal prediction network and the preserved copy of the controller, and interleave the simulated rollouts with a second set of samples from the environment during performance of a second task to preserve knowledge of the temporal prediction network for performing the first task." [¶0045] "The preserved copy of the temporal prediction network 104 and the preserved copy of the controller 105 are configured to generate samples from simulated past experiences, which may be interleaved with samples from actual experiences during training on subsequent tasks. " Preserving a copy of the neural network model and using the preserved copy as a teacher is interpreted as synonymous with maintaining the model parameter of the previous task to be a same value as prior to training of the adaptive parameter)
by updating the adaptive parameter of the previous task based on a change in the shared parameter resulting from training of the model parameter of the current task for the current task ([¶0054] "M*,C*<-M,C" [¶0052] "provided a given simulated sample zt sim as input, the temperature modulated softmax of the controller's 103 output distribution [...] is forced to be similar to the temperature modulated softmax of the simulated output distribution" See also FIG. 1. M*,C* are the preserved parameters (adaptive parameters of the previous task); M,C are the shared parameters, which have changed due to current-task training (C<-dLRL(C,M(Vtaski))). Therefore, when Ketz performs "M*,C*<-M,C", the prior-task adaptive parameters (M*,C*) are updated based on the changed shared parameters (M,C), the change having resulted from training on the current task. During current-task training, the previous-task model (the preserved copy) is maintained (the student distribution is forced to be similar), while the live shared parameters are updated; then the preserved-copy parameters are refreshed based on the updated shared parameters.).
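The examiner's reading of Ketz's "M*,C*<-M,C" update (FIG. 1, ¶0054) can be sketched as follows. The names train_step, live, and preserved, and the trivial "+1" training step, are illustrative assumptions, not Ketz's implementation.

```python
import copy

def train_step(params):
    # Stand-in for current-task training that changes the live shared parameters.
    return {k: v + 1 for k, v in params.items()}

live = {"M": 1, "C": 2}          # live shared parameters (M, C)
preserved = copy.deepcopy(live)  # preserved prior-task copy (M*, C*), the teacher

# During current-task training, the preserved copy is held fixed...
live = train_step(live)
assert preserved == {"M": 1, "C": 2}

# ...then the preserved parameters are refreshed from the now-changed shared
# parameters: M*, C* <- M, C.
preserved = copy.deepcopy(live)
assert preserved == {"M": 2, "C": 3}
```

The two assertions mirror the two phases the examiner identifies: the prior-task model is maintained at its pre-training values during current-task training, and only afterward is it updated from the changed shared parameters.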
Dixit and Ketz are both directed towards transfer learning and are therefore analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Dixit with the teachings of Ketz by using Ketz's distillation loss algorithm in the student-teacher network in Dixit. Ketz provides additional motivation for the combination ([¶0006] "many artificial neural networks are susceptible to a phenomenon known as catastrophic forgetting in which the artificial neural network rapidly forgets previously learned tasks when presented with new training data" [¶0037] "The artificial neural networks of the present disclosure are configured to learn new tasks without forgetting the tasks they have already learned (i.e., learn new tasks without suffering catastrophic forgetting)"). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 2, the combination of Dixit and Ketz teaches The method of claim 1, wherein the training comprises training the adaptive parameter of the previous task such that a change in a model parameter of the previous task is minimized as the shared parameter is trained with respect to the current task(Dixit [¶0065] "During the disclosed adaptation process, most units from the original pre-trained network are kept unchanged while the parameters within the adapter modules are learned to minimize a supervised loss for the new task. The attention masks generated by the adapters disclosed herein guide the subsequent layers of the pre-trained network towards areas of the receptive field relevant to the new problem").
Regarding claim 3, the combination of Dixit and Ketz teaches The method of claim 1, wherein the training comprises training the model parameter based on training data of the current task(Dixit [¶0047] "Training set 302 is illustrates a training set, which includes multiple classes 304. Each class 304 includes multiple images 306 associated with the class 304. Each class 304 may correspond to a type of object in the image 306 (e.g., a digit 0-9, a man or a woman, a cat or a dog, etc.)" [¶0049] "The training set 302 includes a plurality of images 306 for each class 304 (e.g., image 306), and each image 306 is associated with one of the categories to be recognized (e.g., a class 304)" [¶0056] "Determining a subset of the initial features is called feature selection. The selected features are expected to contain the relevant information from the input data, so that the desired task can be performed by using this reduced representation instead of the complete initial data. DNN utilizes a stack of layers, where each layer performs a function. For example, the layer could be a convolution, a non-linear transform, the calculation of an average, etc. Eventually this DNN produces outputs by classifier layer 414. In FIG. 4, the data travels from left to right and the features are extracted. The goal of training the neural network is to find the parameters of all the layers that make them adequate for the desired task.").
Regarding claim 4, the combination of Dixit and Ketz teaches The method of claim 1, wherein the determining of the model parameter comprises determining the model parameter of the current task by applying the adaptive mask of the current task to the shared parameter and then adding the adaptive parameter to a result of the applying (Dixit, see FIG. 5B, X0+X2).
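The mask-then-add construction recited in claim 4 can be sketched as follows. This is a hedged illustration of the claim limitation as mapped onto Dixit FIG. 5B; the variable names and the binary random mask are illustrative assumptions, not drawn from either reference.

```python
import numpy as np

rng = np.random.default_rng(1)

w_shared = rng.normal(size=(3, 3))                # shared parameter of the plurality of tasks
mask = (rng.random((3, 3)) > 0.5).astype(float)   # adaptive mask of the current task
w_adaptive = rng.normal(size=(3, 3))              # adaptive parameter of the current task

# "applying the adaptive mask of the current task to the shared parameter and
# then adding the adaptive parameter to a result of the applying"
w_task = mask * w_shared + w_adaptive

assert w_task.shape == w_shared.shape
# Where the mask is zero, the model parameter reduces to the adaptive parameter alone.
assert np.allclose(w_task[mask == 0], w_adaptive[mask == 0])
```

Under this reading, the shared parameter is never overwritten per task; each task's model parameter is derived from it through the task's own mask and adaptive parameter.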
Regarding claim 5, the combination of Dixit and Ketz teaches The method of claim 1, wherein the determining of the adaptive parameter and the adaptive mask comprises determining the adaptive parameter based on the shared parameter trained with respect to the previous task, (Dixit [¶0020] "This is achieved with the help of lightweight adaptation modules that learn to modify the signals generated within such networks (hidden layer responses). During the process of across-task transfer, the adaptation modules are attached across convolutional units of the original network and trained from scratch using a supervised objective for the target task. The adapters generate attention masks to attenuate responses of the convolutional units to which they are connected. These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem")
and determining the adaptive mask at random(Dixit [¶0035] "Once an epoch is run, the models are evaluated and the values of their variables are adjusted to attempt to better refine the model in an iterative fashion. In various aspects, the evaluations are biased against false negatives, biased against false positives, or evenly biased with respect to the overall accuracy of the model. The values may be adjusted in several ways depending on the machine learning technique used. For example, in a genetic or evolutionary algorithm, the values for the models that are most successful in predicting the desired outputs are used to develop values for models to use during the subsequent epoch, which may include random variation/mutation to provide additional data points" Using random variation via genetic algorithm to generate second model which is masked with first model is interpreted as synonymous with determining the adaptive mask at random.).
Regarding claim 6, the combination of Dixit and Ketz teaches The method of claim 1, wherein the determining of the adaptive parameter and the adaptive mask, the determining of the model parameter, and the training are iteratively performed with respect to each of the plurality of tasks(Dixit [¶0020] "This is achieved with the help of lightweight adaptation modules that learn to modify the signals generated within such networks (hidden layer responses). During the process of across-task transfer, the adaptation modules are attached across convolutional units of the original network and trained from scratch using a supervised objective for the target task. The adapters generate attention masks to attenuate responses of the convolutional units to which they are connected. These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem").
Regarding claim 9, the combination of Dixit and Ketz teaches The method of claim 1, wherein a structure of the neural network is maintained unchanged, and a connection weight between nodes included in the neural network is determined based on the model parameter(Dixit [¶0064] "A possible approach to achieving efficient adaptation is to set aside a small number of layers in a pre-trained network for task-specificity. For every new task, only these parameters are learned through adaptation while the rest of the network is kept unchanged" [¶0068] "FIG. 5A illustrates a building block 500A of a residual neural network, in accordance with some embodiments. As shown in FIG. 5A, an input x0 is provided to a first layer 510A, where a weight W0 is applied to the input to generate the output x1. The mean (μ0), standard deviation (σ0), gamma scaling parameter (γ0), and beta scaling parameter (β0) may be computed for the first layer 510A" See also FIG. 5B where nodes W0 and W1 are unchanged structure and mu, sigma, gamma, and beta are determined based on the model parameter and are seen as connection weights between nodes.).
Regarding claim 10, the combination of Dixit and Ketz teaches The method of claim 1, further comprising obtaining output data based on the trained model parameter and input data to be inferred(Dixit [¶0075] "FIG. 6 illustrates an example of response attenuation 600, in accordance with some embodiments. Block 610 represents an image that corresponds to the input to a layer of a neural network (e.g., layer 510B), such as an image recognition neural network trained with ImageNet. Block 620 corresponds to the output of that layer. Block 630 corresponds to the output of an adaptation module (e.g., adaptation module 515B) coupled with that layer of the neural network. As shown in FIG. 6, block 620 and block 630 are pointwise multiplied together to yield the adapted output 640 (e.g., x1) of the layer of the neural network. This adapted output 640 serves as the input to the next layer of the neural network or the output of the neural network if the layer of block 610 is the final layer.").
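The response attenuation the examiner cites from Dixit FIG. 6 (¶0075) can be sketched as follows. This is illustrative only; the sigmoid adapter output is an assumption standing in for Dixit's adaptation module.

```python
import numpy as np

rng = np.random.default_rng(2)

layer_out = rng.normal(size=(2, 4))                            # block 620: layer output
adapter_out = 1.0 / (1.0 + np.exp(-rng.normal(size=(2, 4))))   # block 630: adapter output

# Blocks 620 and 630 are pointwise multiplied to yield the adapted output
# (block 640, x1), which feeds the next layer or serves as the network output.
adapted = layer_out * adapter_out

# An attenuation mask in (0, 1) can only shrink response magnitudes, never grow them.
assert np.all(np.abs(adapted) <= np.abs(layer_out))
```

This matches the inference-time behavior relied on for claim 10: output data is obtained by running inference input through layers whose responses are gated by the trained adapter modules.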
Regarding claim 11, the combination of Dixit and Ketz teaches A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, configure the processor to perform the method of claim 1(Dixit [¶0108] "Example 9 is a non-transitory machine-readable medium storing instructions which, when executed by one or more computing machines, cause the one or more computing machines to perform operations comprising: accessing an input vector, the input vector comprising a numeric representation of an input to a neural network; providing the input vector to the neural network comprising a plurality of ordered layers, wherein each layer in at least a subset of the plurality of ordered layers is coupled with an adaptation module, wherein the adaptation module receives a same input value as a coupled layer for the adaptation module, and wherein an output value of the adaptation module is pointwise multiplied with an output value of the coupled layer to generate a next layer input value; and generating an output of the neural network based on an output of a last one of the plurality of ordered layers in the neural network.").
Regarding claim 12, Dixit teaches A processor-implemented neural network method, the method comprising: selecting an adaptive parameter ([¶0065] "Some aspects of the technology described herein present a framework for network adaptation that is more efficient and scalable compared to fine-tuning as well as the residual adapter scheme. Some aspects learn lightweight adaptation modules that attach to the filters in a pre-trained large-scale neural network and attenuate their activations using generated attention masks" Adaptation module interpreted as synonymous with adaptive parameter.)
and an adaptive mask of a target task of recognizing an object to be performed among a plurality of tasks of a neural network;([¶0020] "During the process of across-task transfer, the adaptation modules are attached across convolutional units of the original network and trained from scratch using a supervised objective for the target task. The adapters generate attention masks to attenuate responses of the convolutional units to which they are connected. These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem" Attention/confidence mask/map interpreted as synonymous with adaptive mask.)
determining a model of the target task based on the adaptive parameter, the adaptive mask, and a shared parameter of the plurality of tasks;([¶0064] "A possible approach to achieving efficient adaptation is to set aside a small number of layers in a pre-trained network for task-specificity. For every new task, only these parameters are learned through adaptation while the rest of the network is kept unchanged [...] One scheme uses a residual module that combines a batch normalization layer and a linear layer embedded in the original network for adaptation [...] This scheme, therefore, is perhaps more suited to a multi-task learning scenario, where the whole system is trained concurrently on the original large-scale problem (e.g., an object recognition neural network trained using ImageNet) and other related problems (e.g., scene classification, object detection, etc.)." the small number of layers set aside for task-specificity are interpreted as shared parameters learned by the adaptor module for a target task of the plurality of tasks (multi-task learning).)
and obtaining output data from the model by inputting input data to be inferred into the determined model([¶0075] "FIG. 6 illustrates an example of response attenuation 600, in accordance with some embodiments. Block 610 represents an image that corresponds to the input to a layer of a neural network (e.g., layer 510B), such as an image recognition neural network trained with ImageNet. Block 620 corresponds to the output of that layer. Block 630 corresponds to the output of an adaptation module (e.g., adaptation module 515B) coupled with that layer of the neural network. As shown in FIG. 6, block 620 and block 630 are pointwise multiplied together to yield the adapted output 640 (e.g., x1) of the layer of the neural network. This adapted output 640 serves as the input to the next layer of the neural network or the output of the neural network if the layer of block 610 is the final layer.").
However, Dixit does not explicitly teach wherein the model is trained, based on a model parameter of a previous task of recognizing another object, such that the model parameter of the previous task, determined based on an adaptive parameter of the previous task and the shared parameter, is maintained to be a same value as prior to training of the adaptive parameter
by updating the adaptive parameter of the previous task based on a change in the shared parameter resulting from training of a model parameter of a current task for the current task.
Ketz, in the same field of endeavor, teaches wherein the model is trained, based on a model parameter of a previous task of recognizing another object, such that the model parameter of the previous task, determined based on an adaptive parameter of the previous task and the shared parameter, is maintained to be a same value as prior to training of the adaptive parameter([¶0014] "to train a temporal prediction network on a first set of samples from an environment of an autonomous or semi-autonomous system during performance of a first task, train a controller on the first set of samples from the environment and a hidden state output by the temporal prediction network, store a preserved copy of the temporal prediction network, store a preserved copy of the controller, generate simulated rollouts from the preserved copy of the temporal prediction network and the preserved copy of the controller, and interleave the simulated rollouts with a second set of samples from the environment during performance of a second task to preserve knowledge of the temporal prediction network for performing the first task." [¶0045] "The preserved copy of the temporal prediction network 104 and the preserved copy of the controller 105 are configured to generate samples from simulated past experiences, which may be interleaved with samples from actual experiences during training on subsequent tasks. " Preserving a copy of the neural network model and using the preserved copy as a teacher is interpreted as synonymous with maintaining the model parameter of the previous task to be a same value as prior to training of the adaptive parameter)
by updating the adaptive parameter of the previous task based on a change in the shared parameter resulting from training of a model parameter of a current task for the current task ([¶0054] "M*,C*<-M,C" [¶0052] "provided a given simulated sample zt sim as input, the temperature modulated softmax of the controller's 103 output distribution [...] is forced to be similar to the temperature modulated softmax of the simulated output distribution" See also FIG. 1. M*,C* are the preserved parameters (adaptive parameters of the previous task); M,C are the shared parameters, which have changed due to current-task training (C<-dLRL(C,M(Vtaski))). Therefore, when Ketz performs "M*,C*<-M,C", the prior-task adaptive parameters (M*,C*) are updated based on the changed shared parameters (M,C), the change having resulted from training on the current task. During current-task training, the previous-task model (the preserved copy) is maintained (the student distribution is forced to be similar), while the live shared parameters are updated; then the preserved-copy parameters are refreshed based on the updated shared parameters.).
Dixit and Ketz are both directed towards transfer learning and are therefore analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Dixit with the teachings of Ketz by using Ketz's distillation loss algorithm in the student-teacher network in Dixit. Ketz provides additional motivation for the combination ([¶0006] "many artificial neural networks are susceptible to a phenomenon known as catastrophic forgetting in which the artificial neural network rapidly forgets previously learned tasks when presented with new training data" [¶0037] "The artificial neural networks of the present disclosure are configured to learn new tasks without forgetting the tasks they have already learned (i.e., learn new tasks without suffering catastrophic forgetting)"). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 13, the combination of Dixit and Ketz teaches The method of claim 12, wherein the determining of the model comprises determining the model parameter of the target task by applying the adaptive mask of the target task to the shared parameter and adding the adaptive parameter to a result of the applying (Dixit, see FIG. 5B, X0+X2),
and determining a connection weight between nodes included in the neural network based on the model parameter(Dixit [¶0020] "This is achieved with the help of lightweight adaptation modules that learn to modify the signals generated within such networks (hidden layer responses). During the process of across-task transfer, the adaptation modules are attached across convolutional units of the original network and trained from scratch using a supervised objective for the target task. The adapters generate attention masks to attenuate responses of the convolutional units to which they are connected. These masks effectively guide the subsequent stages of the original network towards possible regions of interest to the new problem").
Regarding claim 15, the combination of Dixit and Ketz teaches The method of claim 12, wherein an adaptive parameter of a task to be removed from among the plurality of tasks is deleted (Dixit [¶0040] "the neural network 204 (e.g., deep learning, deep convolutional, or recurrent neural network) comprises a series of neurons 208, such as Long Short Term Memory (LSTM) nodes, arranged into a network. A neuron 208 is an architectural element used in data processing and artificial intelligence, particularly machine learning, which includes memory that may determine when to “remember” and when to “forget” values held in that memory based on the weights of inputs provided to the given neuron 208. Each of the neurons 208 used herein is configured to accept a predefined number of inputs from other neurons 208 in the neural network 204 to provide relational and sub-relational outputs for the content of the frames being analyzed. Individual neurons 208 may be chained together and/or organized into tree structures in various configurations of neural networks to provide interactions and relationship learning modelling for how each of the frames in an utterance is related to one another." [¶0041] "an LSTM serving as a neuron includes several gates to handle input vectors (e.g., phonemes from an utterance), a memory cell, and an output vector (e.g., contextual representation). The input gate and output gate control the information flowing into and out of the memory cell, respectively, whereas forget gates optionally remove information from the memory cell based on the inputs from linked cells earlier in the neural network.").
Regarding claim 16, the combination of Dixit and Ketz teaches The method of claim 12, wherein the plurality of tasks have a same data type to be input into the neural network (Dixit [¶0075] "FIG. 6 illustrates an example of response attenuation 600, in accordance with some embodiments. Block 610 represents an image that corresponds to the input to a layer of a neural network (e.g., layer 510B), such as an image recognition neural network trained with ImageNet. Block 620 corresponds to the output of that layer. Block 630 corresponds to the output of an adaptation module (e.g., adaptation module 515B) coupled with that layer of the neural network." See also FIG. 6. The data type input into the networks is the same type, "image".).
Regarding claims 17-20, claims 17-20 are directed towards a system for performing the processor-implemented methods of claims 1-4, respectively. Therefore, the rejections applied to claims 1-4 also apply to claims 17-20.
Regarding claim 21, claim 21 is directed towards a system for performing the processor-implemented method of claim 12. Therefore, the rejection applied to claim 12 also applies to claim 21.
Claims 7, 8, and 14 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Dixit and Ketz, and further in view of Yan (“FPGAN: An FPGA Accelerator for Graph Attention Networks With Software and Hardware Co-Optimization”, 2020).
Regarding claim 7, the combination of Dixit and Ketz teaches The method of claim 1, further comprising: grouping a plurality of adaptive parameters of the plurality of tasks into a plurality of groups (Dixit [¶0034] "a model is developed to cluster the dataset into n groups, and is evaluated over several epochs as to how consistently it places a given input into a given group and how reliably it produces the n desired clusters across each epoch").
However, the combination of Dixit and Ketz does not explicitly teach and decomposing each of the adaptive parameters into a locally shared parameter shared by adaptive parameters grouped into a same group
and a second adaptive parameter sparser than the respective adaptive parameter, based on whether elements included in each of the adaptive parameters grouped into the same group satisfy a predetermined condition.
Yan, in the same field of endeavor, teaches and decomposing each of the adaptive parameters into a locally shared parameter shared by adaptive parameters grouped into a same group ([p. 171610 §II.A] "As the first step, a weight matrix, W ∈ ℝ^{F×F′}, as a shared parameter, is applied to every node for linear transformation. The results of first step is expressed as h◦ and the first step can be expressed as" [p. 171615 §III.F] "The adjacency matrix is conducive to parallel computing, and the adjacency nodes of a certain node are stored in one row, which is convenient for using optimization techniques such as sequential memory access, cache, and vector computation. However, the adjacency matrix is usually sparse because graphs in real life follow power law")
and a second adaptive parameter sparser than the respective adaptive parameter, based on whether elements included in each of the adaptive parameters grouped into the same group satisfy a predetermined condition ([p. 171610 §II.A] "As the first step, a weight matrix, W ∈ ℝ^{F×F′}, as a shared parameter, is applied to every node for linear transformation. The results of first step is expressed as h◦ and the first step can be expressed as" [p. 171615 §III.F] "The adjacency matrix is conducive to parallel computing, and the adjacency nodes of a certain node are stored in one row, which is convenient for using optimization techniques such as sequential memory access, cache, and vector computation. However, the adjacency matrix is usually sparse because graphs in real life follow power law").
Dixit and Yan are both directed towards masked attention networks; therefore, Dixit and Yan are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Dixit with the teachings of Yan by using a sparse adjacency matrix. Yan provides additional motivation for the combination ([p. 171615 §III.F] "In Figure 4, data before and after the conversion can be matched according to the color and the number in the box. The vectorization and alignment of other data are the same as nodes features. After data vectorization and alignment, we can perform scalable vector calculations and efficient memory accesses for performance improvement."). This motivation for combination also applies to the remaining claims that depend on this combination.
Regarding claim 8, the combination of Dixit, Ketz, and Yan teaches The method of claim 7, wherein the model parameter of the current task is determined based on the shared parameter, the locally shared parameter of the group to which the current task belongs, and a second adaptive parameter and the adaptive mask of the current task (Yan [p. 171610 §II.A] "As the first step, a weight matrix, W ∈ ℝ^{F×F′}, as a shared parameter, is applied to every node for linear transformation. The results of first step is expressed as h◦ and the first step can be expressed as" [p. 171615 §III.F] "The adjacency matrix is conducive to parallel computing, and the adjacency nodes of a certain node are stored in one row, which is convenient for using optimization techniques such as sequential memory access, cache, and vector computation. However, the adjacency matrix is usually sparse because graphs in real life follow power law").
Regarding claim 14, the combination of Dixit and Ketz teaches The method of claim 12, wherein the adaptive parameter is among adaptive parameters of the plurality of tasks grouped into a plurality of groups (Dixit [¶0034] "a model is developed to cluster the dataset into n groups, and is evaluated over several epochs as to how consistently it places a given input into a given group and how reliably it produces the n desired clusters across each epoch").
However, the combination of Dixit and Ketz does not explicitly teach and the adaptive parameter is determined based on a locally shared parameter of a group to which the target task belongs and a second adaptive parameter corresponding to the target task and being sparser than the adaptive parameter.
Yan, in the same field of endeavor, teaches and the adaptive parameter is determined based on a locally shared parameter of a group to which the target task belongs and a second adaptive parameter corresponding to the target task and being sparser than the adaptive parameter ([p. 171610 §II.A] "As the first step, a weight matrix, W ∈ ℝ^{F×F′}, as a shared parameter, is applied to every node for linear transformation. The results of first step is expressed as h◦ and the first step can be expressed as" [p. 171615 §III.F] "The adjacency matrix is conducive to parallel computing, and the adjacency nodes of a certain node are stored in one row, which is convenient for using optimization techniques such as sequential memory access, cache, and vector computation. However, the adjacency matrix is usually sparse because graphs in real life follow power law").
Dixit and Yan are both directed towards masked attention networks; therefore, Dixit and Yan are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Dixit with the teachings of Yan by using a sparse adjacency matrix. Yan provides additional motivation for the combination ([p. 171615 §III.F] "In Figure 4, data before and after the conversion can be matched according to the color and the number in the box. The vectorization and alignment of other data are the same as nodes features. After data vectorization and alignment, we can perform scalable vector calculations and efficient memory accesses for performance improvement.").
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhao (US20200104699A1) is directed towards a neural network system which fixes adapter weights between tasks.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124