DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/31/2025 has been entered.
Response to Amendment/Status of Claims
Claims 1 and 14 were amended.
Claims 1-2 and 4-15 are pending and examined herein.
Claims 7 and 8 are rejected under 35 U.S.C. 112(b).
Claims 1-2 and 4-15 are rejected under 35 U.S.C. 103.
Response to Arguments
Applicant’s arguments, see page 13, filed 12/31/2025, with respect to the rejection(s) of claim(s) 1-2 and 4-15 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020).
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 7 and 8 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 7 recites the limitation "a parametrizable distribution, the parametrizable distribution providing a parametrizable probability distribution function". However, claim 1 recites the limitation "a parameterizable distribution describing a probability function". It is unclear whether the “a parametrizable distribution” in claim 7 refers to the same parametrizable distribution in claim 1 or if it is an additional parametrizable distribution. For purposes of examination, the limitations will be treated as referring to the same element.
Additionally, it is unclear whether the “parametrizable probability distribution function” in claim 7 refers to the “probability function" in claim 1 or if it is an additional probability function. For purposes of examination, the limitations will be treated as referring to the same element.
Dependent claim 8 fails to resolve the issues and is rejected with the same rationale.
Claim 8 recites "the parametrizable distribution". It is unclear to which “parametrizable distribution” claim 8 refers, as claims 1 and 7 each recite “a parametrizable distribution.” As stated above, the limitations in claims 1 and 7 will be treated as referring to the same element; claim 8 will therefore also be treated as referring to that element.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1, 9, 10, 11, and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020).
Regarding claim 1, Vandenhende teaches
A computer-implemented method for determining clusters of tasks, the clusters at least partially including multiple tasks to be executed by a computer processor configured with at least a joint encoder portion of a neural network, the method comprising (Section 1, page 2 states "The proposed method aims to find an effective task grouping for the sharable layers f_l of the encoder, i.e. grouping related tasks together in the same branches of the tree." The grouping is interpreted as clustering, and the sharable layers of the encoder are interpreted as the joint encoder portion of a neural network. Algorithm 1 on page 4 shows the method, which one of ordinary skill in the art would realize is implemented on a computer which necessarily uses a processor to perform the method.)
a) receiving information regarding a tuple of processing tasks to be performed by a neural network; (Algorithm 1, line 1 states that the tasks are an input to the method. Figure 1(b) shows the tasks in a branched multi-task network, a type of neural network.)
b) training a first neural network for performing a first processing task of the tuple of tasks and a second neural network for performing a second processing task of the tuple of tasks; (Section 3.1, page 4 states “As a first step, we train a single-task model for each task t_i ∈ T.” Figure 1(a) shows that there are at least four tasks, which means there is a first and second task and a corresponding first and second neural network.)
d) forming an estimation neural network, the estimation neural network comprising the trained first neural network, the trained second neural network, and an auxiliary [function] which receives information from the trained first and second neural networks; (Section 3.1 states “To calculate these task affinities, we have to compare the representation dissimilarity matrices (RDM) of the single-task networks – trained in the previous step – at the specified D locations.” Section 3.1 further states “Specifically, RDM_{d,i,j} is found by calculating the dissimilarity score between the features at location d for image i and j.” Section 3.1 further states “For a specific location d in the network, the computed RDMs are symmetrical, with a diagonal of zeros. For every such location, we measure the similarity between the upper or lower triangular part of the RDMs belonging to the different single-task networks. We use the Spearman’s correlation coefficient r_s to measure similarity. When repeated for every pair of tasks, at a specific location d, the result is a symmetrical matrix of size N × N, with a diagonal of ones. Concatenating over the D locations in the sharable encoder, we end up with the desired task affinity tensor of size D × N × N.” The function that obtains the task affinity is interpreted as the auxiliary function. It receives the features from images processed through the trained neural networks, and therefore receives information from the trained first and second neural networks.)
e) receiving image information as an input to the estimation neural network, the trained first neural network generating first encoded image information from the input image information and the trained second neural network generating second encoded image information from the input image information; (Section 3.1 states “The single-task models use an identical encoder E – made of all sharable layers f_l – followed by a task-specific decoder D_{t_i}.” Section 3.1 further states “To do this, a held-out subset of K images is required. The latter images serve to compare the dissimilarity of their feature representations in the single-task networks for every pair of images. Specifically, for every task t_i, we characterize these learned feature representations at the selected locations by filling a tensor of size D × K × K.” As the images are input to the trained single-task networks, and the single-task networks comprise an encoder and a decoder, it is inherent that each neural network will generate encoded image information.)
f) estimating an information share measure using the auxiliary [function], the information share measure indicating how much of a first portion of the received image information for completing one of the first processing task and the second processing task is included in one of the second encoded image information or the first encoded image information, wherein the second encoded image information and the first encoded image also include second image information for completing the second processing task and the first processing task, respectively; (Section 1 states “To this end, we base the layer sharing on measurable levels of task affinity or task relatedness: two tasks are strongly related, if their single task models rely on a similar set of features.” Features are obtained from the encoder-decoder single-task models, and, as established above, the single-task models provide encoded image information. Section 1 further states “Given a dataset and a number of tasks, our approach uses RSA [representation similarity analysis] to assess the task affinity at arbitrary locations in a neural network.” As one of ordinary skill in the art would understand, similarity between representations/features (interpreted as encoded information) measures how much information is shared between the encoded information that is being compared. The two neural networks are single-task neural networks. Therefore, finding the similarity measure is finding the amount of information that is in each neural network for completing the other task. Section 3.1 states “Specifically, RDM_{d,i,j} is found by calculating the dissimilarity score between the features at location d for image i and j.” At location d, images i and j are encoded. Therefore, as discussed above, the task affinity, calculated using the RDM, is interpreted as the information share measure, and the encoded images are compared using the information share measure.)
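For illustration only, the RSA-style affinity computation quoted above can be sketched as follows. All names, shapes, and the cosine-dissimilarity choice are illustrative assumptions, not Vandenhende's exact implementation: each task's features over K held-out images form a K × K representation dissimilarity matrix (RDM), and the upper-triangular parts of two tasks' RDMs are compared with Spearman's correlation.

```python
import numpy as np

def rdm(features):
    """K x F feature matrix -> K x K dissimilarity matrix (1 - cosine similarity)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def spearman(x, y):
    """Spearman correlation as Pearson correlation on ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def task_affinity(features_i, features_j):
    """Compare the upper-triangular RDM entries of two single-task networks."""
    iu = np.triu_indices(features_i.shape[0], k=1)  # diagonal of zeros excluded
    return spearman(rdm(features_i)[iu], rdm(features_j)[iu])

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))  # K=8 held-out images, F=16 features
# A task compared against itself has maximal affinity.
assert abs(task_affinity(feats, feats) - 1.0) < 1e-9
```

Repeating `task_affinity` over every task pair and every location d yields the D × N × N affinity tensor described in the quotation.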
[having a measure for the] first and second encoded image information which measures a difficulty for the first processing task to be executed together with the second processing task in a shared multitask architecture (Page 2 states "Given a dataset and a number of tasks, our approach uses RSA to assess the task affinity at arbitrary locations in a neural network. The task affinity scores are then used to construct a branched multi task network in a fully automated manner. In particular, our task clustering algorithm groups similar tasks together in common branches, and separates dissimilar tasks by assigning them to different branches, thereby reducing the negative transfer between tasks." The negative transfer is interpreted as the difficulty for the first and second processing task to be executed together in the shared multitask architecture. As RSA reduces this, it is a measure of the difficulty for task grouping.)
h) repeating steps b) – f) for further tuples of processing tasks, thereby obtaining multiple information share measures for different tuples of processing tasks; (Line 6 in Algorithm 1 includes a for loop that calculates the task affinity (which includes Spearman’s correlation) for each pair of tasks in the set of tasks.)
i) providing a threshold value for the multiple information share measures, the threshold value indicating a limit for an information overlap according to which processing tasks of the different processing tasks of the different tuples should be executed in a joint encoder portion of the neural network; (Section 3.2, page 5 states “The task dissimilarity score of a tree is defined as C_cluster = Σ_l C_cluster^l, where C_cluster^l is found by averaging the maximum distance between the dissimilarity scores of the elements in every cluster.” Section 3.2 further states “The branched multi-task network is built with the intention to separate dissimilar tasks by assigning them to separate branches. To this end, we define the dissimilarity score between two tasks t_i and t_j at location d as 1 − A_{d,i,j}, with A the task affinity tensor.” By using the maximum distance between tasks in a cluster, there exists a functional threshold of dissimilarity, beyond which tasks will increase the cluster cost and be excluded from the same group. Tasks below this threshold are clustered together. As dissimilarity involves the task affinity, which measures information overlap between representations, the threshold on dissimilarity also indicates information overlap.)
j) determining clusters of processing tasks to be executed in a joint encoder portion of the neural network based on the information share measures and the threshold value, and (The claim mapping of step i) explains how the clusters are determined based on the information share measures and the threshold value. Additionally, Section 3.2 states “By taking into account the clustering cost at all depths, the procedure can find a task grouping that is considered optimal in a global sense.” The task grouping is interpreted as the clusters of tasks.)
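For illustration only, the thresholded grouping discussed in steps i) and j) can be sketched with a toy greedy rule (the greedy rule, names, and the example affinity matrix are illustrative simplifications of Vandenhende's exhaustive tree search): a task joins a cluster only if its dissimilarity 1 − A[i, j] to every current member stays below the threshold.

```python
import numpy as np

def cluster_tasks(affinity, threshold):
    """Greedily group tasks whose pairwise dissimilarity (1 - affinity)
    stays below the threshold; otherwise open a new cluster."""
    n = affinity.shape[0]
    clusters = []
    for j in range(n):
        for c in clusters:
            if all(1.0 - affinity[j, m] < threshold for m in c):
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

# Tasks 0 and 1 are strongly related; task 2 is unrelated to both.
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
assert cluster_tasks(A, threshold=0.5) == [[0, 1], [2]]
```

Raising the threshold toward 1 merges all tasks into one branch; lowering it toward 0 gives every task its own branch.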
k) allocating computational resources to perform the clusters of processing tasks in the joint encoder portion of the neural network based on the determination (Page 5 states "Given a computational budget C, we need to derive how the layers (or blocks) in the sharable f_l encoder should be shared among the tasks in T." Page 5 further states "Since the number of tasks is finite, we can enumerate all possible trees that fall within the given computational budget C." Page 6 states "Depending on the available computational budget C, our method generates a specific task grouping." As the available computational budget is applied to the task grouping, the resources (computational budget) are allocated based on the determination of the task grouping.)
Vandenhende does not appear to explicitly teach
wherein parameters of each of the trained first and second neural networks are fixed;
c) selecting from each of the trained first and second neural networks multiple fixed parameters that establish a parameterizable distribution describing a probability function;
[an auxiliary] neural network
[forming a neural network] using the multiple fixed parameters selected from each of the trained first and second neural networks;
g) reducing a cross entropy loss defined on information output of the auxiliary neural network by training the auxiliary neural network based on a conditional entropy between [pairs]
However, Rusu—directed to analogous art—teaches
wherein parameters of each of the trained first and second neural networks are fixed; (Page 2 states "Progressive networks integrate these desiderata directly into the model architecture: catastrophic forgetting is prevented by instantiating a new neural network (a column) for each task being solved, while transfer is enabled via lateral connections to features of previously learned columns." Page 2 further states “A progressive network starts with a single column: a deep neural network having L layers with hidden activations h_i^(1) ∈ R^{n_i}, with n_i the number of units at layer i ≤ L, and parameters Θ^(1) trained to convergence. When switching to a second task, the parameters Θ^(1) are “frozen” and a new column with parameters Θ^(2) is instantiated (with random initialization), where layers h_i^(2) receive input from both h_{i-1}^(2) and h_{i-1}^(1) via lateral connections. This generalizes to K tasks as follows: h_i^(k) = f(W_i^(k) h_{i-1}^(k) + Σ_{j<k} U_i^(k:j) h_{i-1}^(j)) (1), where W_i^(k) ∈ R^{n_i × n_{i-1}} is the weight matrix of layer i of column k, U_i^(k:j) ∈ R^{n_i × n_j} are the lateral connections from layer i−1 of column j, to layer i of column k and h_0 is the network input.” Page 3 states "Because also the parameters Θ^(j); j < k are kept frozen (i.e. are constants for the optimizer) when training Θ^(k) there is no interference between tasks and hence no catastrophic forgetting." Therefore, the parameters of the first and second neural networks are fixed.)
c) selecting from each of the trained first and second neural networks multiple fixed parameters that establish a parameterizable distribution describing a probability function; (As the networks are trained separately and then their fixed parameters are used in the multi-task neural network, the parameters are selected from each of the neural networks. Page 3 states "In this case, each column is trained to solve a particular Markov Decision Process (MDP): the k-th column thus defines a policy π_k(a|s) taking as input a state s given by the environment, and generating probabilities over actions π_k(a|s) := h_L^(k)(s). At each time-step, an action is sampled from this distribution and taken in the environment, yielding the subsequent state. This policy implicitly defines a stationary distribution p^{π_k}(s, a) over states and actions." As each column is a trained neural network, its parameters h_L^(k) define a parameterizable distribution p^{π_k}(s, a) describing a probability function π_k(a|s) := h_L^(k)(s).)
[forming a neural network] using the multiple fixed parameters selected from each of the trained first and second neural networks; (Figure 1 shows an example neural network generated from the trained first and second neural networks and their fixed parameters h_L^(1) and h_L^(2).)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende with the teachings of Rusu because, as stated by Rusu on page 3, "Progressive networks are a stepping stone towards a full continual learning agent: they contain the necessary ingredients to learn multiple tasks, in sequence, while enabling transfer and being immune to catastrophic forgetting."
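For illustration only, the progressive-network layer update quoted above from Rusu, h_i^(k) = f(W_i^(k) h_{i-1}^(k) + Σ_{j<k} U_i^(k:j) h_{i-1}^(j)), can be sketched as follows; the shapes and the choice of ReLU for f are illustrative assumptions, and the earlier columns' activations stand in for frozen, fixed-parameter networks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def progressive_layer(h_prev, W, U):
    """Layer-i activation of the newest column k.
    h_prev: layer-(i-1) activations, one per column (newest column last).
    W: weight matrix of column k; U: lateral matrices from frozen columns j < k."""
    out = W @ h_prev[-1]
    for U_j, h_j in zip(U, h_prev[:-1]):
        out += U_j @ h_j  # lateral connection from a frozen earlier column
    return relu(out)

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))           # trainable weights of the new column
U = [rng.standard_normal((n, n))]         # lateral weights from one frozen column
h_prev = [rng.standard_normal(n), rng.standard_normal(n)]
h = progressive_layer(h_prev, W, U)
assert h.shape == (n,)
```

During training only W and U of the newest column would receive gradients; the earlier columns' parameters are constants, which is the "fixed parameters" property relied on in the mapping above.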
The combination of Vandenhende and Rusu does not appear to explicitly teach
[an auxiliary] neural network
g) reducing a cross entropy loss defined on information output of the auxiliary neural network by training the auxiliary neural network based on a conditional entropy between [pairs]
However, Belghazi—directed to analogous art—teaches
[an auxiliary] neural network (Section 3.1 states “Using both Eqn. 3 for the mutual information and the dual representation of the KL-divergence, the idea is to choose F to be the family of functions T_θ : X × Z → R parameterized by a neural network with parameters θ ∈ Θ. We call this network the statistics network.” The statistics network is interpreted as the auxiliary neural network, as it computes an information share measure (mutual information).)
[estimating an information share measure based on the auxiliary] neural network (The abstract states "We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent." Mutual information is interpreted as the information share measure.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende and Rusu with the neural network mutual information estimator taught by Belghazi because, as taught by Belghazi in section 1, "Mutual information is a fundamental quantity for measuring the relationship between random variables." Additionally, Belghazi states "Despite being a pivotal quantity across data science, mutual information has historically been difficult to compute (Paninski, 2003)." Belghazi states that "Our estimator is scalable, flexible, and completely trainable via back-propagation."
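For illustration only, Belghazi's statistics network maximizes the Donsker–Varadhan lower bound I(X;Z) ≥ E_joint[T_θ] − log E_marginal[exp(T_θ)]. The toy sketch below evaluates that bound with a fixed stand-in for T_θ rather than a trained network; the function choice, sample size, and data-generating process are illustrative assumptions.

```python
import numpy as np

def dv_lower_bound(T, joint, marginal):
    """Donsker-Varadhan estimate: E_joint[T] - log E_marginal[exp(T)].
    joint: samples from p(x, z); marginal: samples from p(x)p(z)."""
    t_joint = np.array([T(x, z) for x, z in joint])
    t_marg = np.array([T(x, z) for x, z in marginal])
    return t_joint.mean() - np.log(np.exp(t_marg).mean())

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
z = x + 0.1 * rng.standard_normal(1000)       # z strongly depends on x
joint = list(zip(x, z))
marginal = list(zip(x, rng.permutation(z)))   # shuffling breaks the dependence

T = lambda x, z: np.tanh(x * z)  # bounded stand-in for the statistics network
est = dv_lower_bound(T, joint, marginal)
assert est > 0.0  # dependent samples yield a positive lower bound on MI
```

In MINE the stand-in T would be a neural network trained by gradient ascent on this same objective, tightening the lower bound on the mutual information.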
The combination of Vandenhende, Rusu, and Belghazi does not appear to explicitly teach
wherein parameters of each of the trained first and second neural networks are fixed;
by training the auxiliary neural network based on a conditional entropy between [pairs]
However, Boudiaf—directed to analogous art—teaches
g) reducing a cross entropy loss defined on information output of the auxiliary neural network by training the auxiliary neural network based on a conditional entropy between [pairs] (Page 555 states "Alternately minimizing the cross-entropy loss L_CE with respect to the encoder’s parameters W and the classifier’s weights θ can be viewed as an approximate bound-optimization of a Pairwise Cross-Entropy (PCE) loss, which we define as follows:". The minimization of the cross-entropy loss is interpreted as the training of the auxiliary neural network. As the parameters and weights are used, the information outputs of the auxiliary neural network are used in the determination of the cross-entropy loss. Page 555 further states "Second, we show that, more generally, minimization of the cross-entropy is actually equivalent to maximization of the mutual information, to which we connected various DML losses. These findings indicate that the cross-entropy represents a proxy for maximizing I(Ẑ; Y), just like pairwise losses, without the need for dealing with the complex sample mining and optimization schemes associated to the latter." Page 551 states "The Mutual Information (MI) is a well-known measure designed to quantify the amount of information shared by two random variables.")
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, and Belghazi with the cross-entropy loss of Boudiaf because, as Boudiaf states on page 555, “Second, we show that, more generally, minimization of the cross-entropy is actually equivalent to maximization of the mutual information, to which we connected various DML losses. These findings indicate that the cross-entropy represents a proxy for maximizing I(Ẑ; Y), just like pairwise losses, without the need for dealing with the complex sample mining and optimization schemes associated to the latter." Additionally, page 551 states "The Mutual Information (MI) is a well-known measure designed to quantify the amount of information shared by two random variables."
Regarding claim 9, the rejection of claim 1 is incorporated herein. Vandenhende teaches
wherein estimating an information share measure based on the auxiliary neural network comprises calculating information share measures for multiple different data points of multi-dimensional image information and calculating a mean information share measure by averaging the information share measures. (Section 3.2, page 5 states “The task dissimilarity score of a tree is defined as C_cluster = Σ_l C_cluster^l, where C_cluster^l is found by averaging the maximum distance between the dissimilarity scores of the elements in every cluster.” Section 3.2, page 5 further states “To this end, we define the dissimilarity score between two tasks t_i and t_j at location d as 1 − A_{d,i,j}, with A the task affinity tensor.” Therefore, by averaging the distance between the dissimilarity scores, an average of task affinity, interpreted as the information share measure, is found.)
Regarding claim 10, the rejection of claim 1 is incorporated herein. Vandenhende teaches
wherein information share measures for different tuples of tasks, for each task in the different tuples of tasks against each other task in the different tuples of tasks. (Section 3.1, page 5 states “When repeated for every pair of tasks, at a specific location d, the result is a symmetrical matrix of size N × N, with a diagonal of ones.” As the matrix is symmetrical, one of ordinary skill in the art could reason that the tasks are compared both ways. Section 1 states “Given a dataset and a number of tasks, our approach uses RSA [representation similarity analysis] to assess the task affinity at arbitrary locations in a neural network.” As one of ordinary skill in the art would understand, similarity between representations/features (interpreted as encoded information) measures how much information is shared between the encoded information that is being compared. Section 1 further states “To this end, we base the layer sharing on measurable levels of task affinity or task relatedness: two tasks are strongly related, if their single task models rely on a similar set of features.” This means that the information in the representations is indicative of the tasks used to encode the representation.)
Regarding claim 11, the rejection of claim 10 is incorporated herein. Vandenhende teaches
wherein determining clusters of processing tasks comprises grouping processing tasks together to be executed in the joint encoder portion of the neural network if the first and second information share measure is below the threshold value. (Section 3.2, page 5 states “The task dissimilarity score of a tree is defined as C_cluster = Σ_l C_cluster^l, where C_cluster^l is found by averaging the maximum distance between the dissimilarity scores of the elements in every cluster.” Section 3.2 further states “The branched multi-task network is built with the intention to separate dissimilar tasks by assigning them to separate branches. To this end, we define the dissimilarity score between two tasks t_i and t_j at location d as 1 − A_{d,i,j}, with A the task affinity tensor.” By using the maximum distance between tasks in a cluster, there exists a functional threshold of dissimilarity, beyond which tasks will increase the cluster cost and be excluded from the same group. Tasks below this threshold are clustered together. As dissimilarity involves the task affinity, which measures information overlap between representations, the threshold on dissimilarity also indicates information overlap. As stated above with regard to claim 10, the first and second information share measures are identical, as the task affinity tensor is symmetrical, meaning that comparing one to a threshold would be the same as comparing both to a threshold.)
Regarding claim 14, Vandenhende teaches
A system for determining clusters of tasks, the clusters at least partially including multiple processing tasks to be executed in a joint encoder portion of a neural network, the system comprising: (Section 1, page 2 states "The proposed method aims to find an effective task grouping for the sharable layers f_l of the encoder, i.e. grouping related tasks together in the same branches of the tree." The grouping is interpreted as clustering, and the sharable layers of the encoder are interpreted as the joint encoder portion of a neural network. Algorithm 1 on page 4 shows the method, which one of ordinary skill in the art would realize is implemented on a computer. Page 5 states that “In this section, we quantitatively and qualitatively evaluate the proposed method on a number of diverse multi-tasking datasets, that range from real to semi-real data, from few to many tasks, from dense prediction to classification tasks, and so on.” This means that a system was used to execute the method.)
a processor configured to execute instructions to perform operations including: (As the method is implemented on a computer, the computer necessarily has a processor that executes the instructions to perform the method.)
The remainder of claim 14 recites substantially similar subject matter to claim 1 and is rejected with the same rationale, mutatis mutandis.
Claim(s) 2, 5, and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Belghazi (“Mutual Information Neural Estimation”, 2018), Rusu (“Progressive Neural Networks”, 2016), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Poole (“On Variational Bounds of Mutual Information”, May 2019).
Regarding claim 2, the rejection of claim 1 is incorporated herein. Vandenhende teaches
second encoded image information (See below.)
received image information included in the first encoded image information (Section 3.1 states “The single-task models use an identical encoder E – made of all sharable layers f_l – followed by a task-specific decoder D_{t_i}.” Section 3.1 further states “To do this, a held-out subset of K images is required. The latter images serve to compare the dissimilarity of their feature representations in the single-task networks for every pair of images. Specifically, for every task t_i, we characterize these learned feature representations at the selected locations by filling a tensor of size D × K × K.” As the images are input to the trained single-task networks, and the single-task networks comprise an encoder and a decoder, it is inherent that each neural network will provide encoded image information.)
The combination of Vandenhende and Rusu does not appear to explicitly teach
estimating an information share measure based on the auxiliary neural network includes approximating an upper bound of the received information missing in [data] compared to the amount of [other data].
estimating an information share measure based on the auxiliary neural network includes approximating a lower bound of information included in [data] compared to the amount of [other data].
However, Belghazi—directed to analogous art—teaches
estimating an information share measure based on the auxiliary neural network includes approximating a lower bound of information included in [data] compared to the amount of [other data]. (Algorithm 1 in Section 3 states that the neural estimator includes evaluating the lower bound. In the algorithm, the mutual information, interpreted as the information share measure, is estimated for variables X and Z. The algorithm uses samples from the data.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende with the teachings of Belghazi for the reasons given above in regards to claim 1.
The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to teach
estimating an information share measure based on the auxiliary neural network includes approximating an upper bound of information missing in [data] compared to the amount of [other data].
However, Poole—directed to analogous art—teaches
estimating an information share measure based on the auxiliary neural network includes approximating an upper bound of information missing in [data] compared to the amount of [other data]. ("Upper bounding MI is challenging, but is possible when the conditional distribution p(y|x) is known (e.g. in deep representation learning where y is the stochastic representation). We can build a tractable variational upper bound by introducing a variational approximation q(y) to the intractable marginal p(y) = ∫ dx p(x) p(y|x). By multiplying and dividing the integrand in MI by q(y) and dropping a negative KL term, we get a tractable variational upper bound". Deep representation learning includes a neural network. Mutual information is again interpreted as the information share measure. In this method, x and y are the data being compared.)
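The variational upper bound quoted from Poole may be summarized as follows (this is the examiner’s restatement in Poole’s notation, provided for convenience, and is not a direct quotation):

```latex
I(X;Y)
  \;=\; \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(y\mid x)}{q(y)}\right]
  \;-\; D_{\mathrm{KL}}\!\left(p(y)\,\middle\|\,q(y)\right)
  \;\le\; \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(y\mid x)}{q(y)}\right]
```

Dropping the nonnegative KL term between the true marginal p(y) and the variational approximation q(y) yields the tractable upper bound relied upon above.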
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the variational upper bound of Poole because obtaining an upper bound on mutual information, as one of ordinary skill in the art would realize, can further provide information on how similar the two representations are. As stated by Belghazi, mutual information is difficult to compute, and as one of ordinary skill in the art would understand, finding an additional bound on mutual information will provide a better estimate.
Regarding claim 5, the rejection of claim 1 is incorporated herein. The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein estimating an information share measure is performed based on at least a random probability function
However, Poole—directed to analogous art—teaches
wherein estimating an information share measure is performed based on a random probability function (Section 5 states "Upper bounding MI is challenging, but is possible when the conditional distribution p(y|x) is known (e.g. in deep representation learning where y is the stochastic representation). We can build a tractable variational upper bound by introducing a variational approximation q(y) to the intractable marginal p(y) = ∫ dx p(x) p(y|x). By multiplying and dividing the integrand in MI by q(y) and dropping a negative KL term, we get a tractable variational upper bound". Mutual information is again interpreted as the information share measure. In this method, x and y are the data being compared. q(y) is interpreted as the random probability function.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vandenhende, Rusu, Belghazi, and Boudiaf with the variational upper bound of Poole because of the reasons given above in regards to claim 2.
Regarding claim 15, the rejection of claim 5 is incorporated herein. The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to teach
wherein the random probability function is determined by performing a variational mutual information maximization approach.
However, Poole—directed to analogous art—teaches
wherein the random probability function is determined by performing a variational mutual information maximization approach. (Page 8 states “We use the convolutional encoder architecture from Burgess et al. (2018); Locatello et al. (2018) for p(y|x), and a two hidden layer fully-connected neural network to parameterize the unnormalized variational marginal q(y) used by I_JS. Empirically, we find that this variational regularized info-max objective is able to learn x and y position, and scale, but not rotation”. q(y) is again interpreted as the random probability function.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vandenhende, Rusu, Belghazi, and Boudiaf with the variational maximization taught by Poole because, as stated by Poole on page 9, “We showed that our new interpolated bounds are able to trade off bias for variance to yield better estimates of MI.”
Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Hjelm (“Learning Deep Representations by Mutual Information Estimation and Maximization”, February 2019).
Regarding claim 4, the rejection of claim 1 is incorporated herein. Vandenhende teaches
the trained first neural network [and] the trained second neural network (Section 3.1, page 4 states “As a first step, we train a single-task model for each task t_i ∈ T.” Figure 1(a) shows that there are at least four tasks, which means there is a first and second task and a corresponding first and second neural network.)
The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein estimating an information share measure includes training the auxiliary neural network by adapting the weights of the auxiliary neural network and keeping weights of [a trained neural network] constant.
However, Hjelm—directed to analogous art—teaches
wherein estimating an information share measure includes training the auxiliary neural network by adapting the weights of the auxiliary neural network and keeping weights of [a trained neural network] constant. (Section 4.1 states, "To summarize, we use the following metrics for evaluating representations. For each of these, the encoder is held fixed unless noted otherwise:" and further states, as one of the options, “Mutual information neural estimate (MINE), Î_ρ(X, E_ψ(x)), between the input, X, and the output representation, E_ψ(x), by training a discriminator with parameters ρ to maximize the DV estimator of the KL-divergence.” MINE is interpreted as the auxiliary neural network that estimates mutual information, interpreted as the information share measure, and the encoder is interpreted as the trained neural network. As one of ordinary skill in the art would understand, holding the encoder fixed means keeping the weights of the encoder constant. Training a discriminator with parameters, as one of ordinary skill in the art would understand, is adapting the weights of the discriminator.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the teachings of Hjelm because, as stated by Hjelm, "Evaluation of representations is case-driven and relies on various proxies. Linear separability is commonly used as a proxy for disentanglement and mutual information (MI) between representations and class labels. Unfortunately, this will not show whether the representation has high MI with the class labels when the representation is not disentangled." In the instant application’s claimed invention, the representations obtained by the trained neural networks are not disentangled, and would therefore require another method of evaluation as taught by Hjelm. Therefore, one of ordinary skill in the art would be motivated to add this element.
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Deng (“Multi-Task Learning with Multi-View Attention for Answer Selection and Knowledge Base Question Answering”, January 2019).
Regarding claim 6, the rejection of claim 1 is incorporated herein. Vandenhende teaches
wherein training a first neural network comprises training an encoder of the first neural network for performing the first processing task, and (Section 3.1 states “The single-task models use an identical encoder E – made of all sharable layers f_l – followed by a task-specific decoder D_{t_i}.” Section 3.1 further states “To do this, a held-out subset of K images is required. The latter images serve to compare the dissimilarity of their feature representations in the single-task networks for every pair of images. Specifically, for every task t_i, we characterize these learned feature representations at the selected locations by filling a tensor of size D × K × K.” As the images are used as input to the trained single-task networks, and the single-task networks comprise an encoder and a decoder, it is inherent that each neural network will generate encoded image information. Page 4 states “We train a single-task model for every task t in T.”)
The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
training a second neural network comprises training an encoder of the second neural network for performing the second processing task.
However, Deng—directed to analogous art—teaches
training a second neural network comprises training an encoder of the second neural network for performing the second processing task. (Page 6319, “Task-specific Encoder Layer”, states “Therefore, each task is equipped with a task-specific Siamese encoder for both question and answers, and each task-specific encoder contains a word encoder and a knowledge encoder to learn the integral sentence representations, as shown in Figure 2.” Page 6320, “Multi-Task Learning” states “The overall multi-task learning model is trained” which means that the task-specific encoders are also trained. As there is more than one task, a second encoder is trained for the second task.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the teachings of Deng because, as stated on page 6319, “Task-specific Encoder Layer”, "Different QA tasks are supposed to be diverse in data distributions and low-level representations. Therefore, each task is equipped with a task-specific siamese encoder for both questions and answers, and each task-specific encoder contains a word encoder and a knowledge encoder to learn the integral sentence representations, as shown in Figure 2." One would be motivated to combine because the tasks, as in Deng, may be diverse in data distributions and low-level representations.
Claim(s) 7 and 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Chen (“InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets”, 2016).
Regarding claim 7, the rejection of claim 1 is incorporated herein. Vandenhende teaches
the first encoded image [and] the second encoded image (Section 3.1 states “The single-task models use an identical encoder E – made of all sharable layers f_l – followed by a task-specific decoder D_{t_i}.” Section 3.1 further states “To do this, a held-out subset of K images is required. The latter images serve to compare the dissimilarity of their feature representations in the single-task networks for every pair of images. Specifically, for every task t_i, we characterize these learned feature representations at the selected locations by filling a tensor of size D × K × K.” As the images are used as input to the trained single-task networks, and the single-task networks comprise an encoder and a decoder, it is inherent that each neural network will provide encoded image information.)
The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein estimating an information share measure based on the auxiliary neural network comprises choosing a parametrizable distribution, the parametrizable distribution providing a parametrizable probability distribution function used for determining the conditional entropy of [data] given [other data], the conditional entropy indicating how much information content in [the data] is not covered by [the other data].
However, Chen—directed to analogous art—teaches
wherein estimating an information share measure based on the auxiliary neural network comprises choosing a parametrizable distribution, the parametrizable distribution providing a parametrizable probability distribution function used for determining the conditional entropy of [Y] given [X], that is, how much information content in [Y] is not covered by [X]. (Page 3, section 5 states "In practice, the mutual information term I(c; G(z,c)) is hard to maximize directly as it requires access to the posterior P(c|x). Fortunately we can obtain a lower bound of it by defining an auxiliary distribution Q(c|x) to approximate P(c|x): I(c; G(z,c)) = H(c) − H(c|G(z,c))". In the equation, one can see that the part of the equation in the underbrace is equal to the conditional entropy H(c|G(z,c)). Wikipedia (“Conditional Entropy”) states that the conditional entropy is written as H(Y|X); in this case, the variable Y is instead c and X is G(z,c). Further, Wikipedia, in the Venn diagram on page 1, where X is on the left in red and Y is on the right in blue, states that the conditional entropy is the part of the Venn diagram that is blue. This means that conditional entropy is the part of Y that is not covered by X. Therefore, this is inherent in Chen. Chen further states on page 4, section 5, that “Eq. (4) shows that the lower bound becomes tight as the auxiliary distribution Q approaches the true posterior distribution: E_x[D_KL(P(·|x) ∥ Q(·|x))] → 0.” The auxiliary distribution Q therefore determines the conditional entropy when it is maximized. Section 6 states “In practice, we parameterize the auxiliary distribution Q as a neural network. In most experiments Q and D share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution Q(c|x), which means InfoGAN only adds a negligible computation cost to GAN.” The conditional distribution Q(c|x) is interpreted as the parametrizable probability distribution function.)
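The variational lower bound from Chen underlying the above mapping may be summarized as follows (this is the examiner’s restatement of Chen’s section 5 derivation, provided for convenience, and is not a direct quotation):

```latex
I\!\left(c;\, G(z,c)\right)
  \;=\; H(c) \;-\; H\!\left(c \mid G(z,c)\right)
  \;\ge\; \mathbb{E}_{x \sim G(z,c)}\!\left[
      \mathbb{E}_{c' \sim P(c \mid x)}\!\left[ \log Q(c' \mid x) \right]
    \right] \;+\; H(c)
  \;=\; L_I(G, Q)
```

The gap in the inequality is exactly the expected KL divergence E_x[D_KL(P(·|x) ∥ Q(·|x))], which is why the bound becomes tight as the auxiliary distribution Q approaches the true posterior P, as quoted above.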
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the teachings of Chen because, as Chen states on page 4, “We note in addition that the entropy of latent codes H(c) can be optimized over as well since for common distributions it has a simple analytical form. However, in this paper we opt for simplicity by fixing the latent code distribution and we will treat H(c) as a constant. So far we have bypassed the problem of having to compute the posterior P(c|x) explicitly via this lower bound but we still need to be able to sample from the posterior in the inner expectation. Next we state a simple lemma, with its proof deferred to Appendix, that removes the need to sample from the posterior.”
Regarding claim 8, the rejection of claim 7 is incorporated herein. The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein estimating an information share measure based on the auxiliary neural network comprises determining parameters of the parametrizable distribution by training the auxiliary neural network in order to obtain the parametrizable probability distribution function.
However, Chen—directed to analogous art—teaches
wherein estimating an information share measure based on the auxiliary neural network comprises determining parameters of the parametrizable distribution by training the auxiliary neural network in order to obtain the parametrizable probability distribution function. (As stated above in regards to claim 7, the conditional distribution is parameterized by the neural network. As the output of the neural network is the parameters, the neural network is trained to obtain the parametrizable probability distribution function.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the teachings of Chen for the reasons stated above in regards to claim 7.
Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Zheng (“Data-driven Task Allocation for Multi-task Transfer Learning on the Edge”, July 2019).
Regarding claim 12, the rejection of claim 1 is incorporated herein. Vandenhende teaches
joint encoder portion of the neural network (Section 1, page 2 states "The proposed method aims to find an effective task grouping for the sharable layers f_l of the encoder, i.e. grouping related tasks together in the same branches of the tree." The sharable layers of the encoder are interpreted as the joint encoder portion of a neural network.)
The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein the computing resources are allocated to each [section] based on a total number of processing tasks being handled by the respective [section].
However, Zheng —directed to analogous art—teaches
wherein computing resources are allocated to each [section] based on the number of tasks being handled by the respective [section]. (Page 1043 states “Since each task is assigned to exactly one processor, we have the following constraint: ∑_{p∈P} u_{j,p} = 1, ∀j ∈ J. Additionally, the execution time and resource of all tasks assigned to the processor p should satisfy the following constraints: ∑_{j∈J} t_j · u_{j,p} ≤ T, ∀p ∈ P, and ∑_{j∈J} v_j · u_{j,p} ≤ V_p, ∀p ∈ P, where t_j denotes the execution time of task j; T denotes the time limit; v_j denotes the resource required for task j; V_p denotes the resource capacity of processor p.” Each task is allocated to exactly one processor, meaning that the resources are based on the number of tasks.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the teachings of Zheng because, as Zheng states on page 1041, “The benefits of multiple tasks come in mainly two ways. First, similar tasks can transfer their knowledge between each other during the training process, which reduces the negative effect of data scarcity, especially on the edge. Second, in the real-world scenario, it is common to make the final decision by aggregating the output of multiple tasks. Maintaining the high performance of all these tasks contribute to the final aggregated decision performance. Again in the example of a self-driving car, the final driving operation of the car is conducted based on the result of multiple data-driven tasks, e.g., the neighboring car, traffic-sign, and pedestrian detection.”
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vandenhende (“Branched Multi-Task Networks: Deciding What Layers to Share”, 2019), Rusu (“Progressive Neural Networks”, 2016), Belghazi (“Mutual Information Neural Estimation”, 2018), and Boudiaf (“A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses”, 2020) as applied to claim 1 above, further in view of Hattori (“Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance”, 2018).
Regarding claim 13, the rejection of claim 1 is incorporated herein. The combination of Vandenhende, Rusu, Belghazi, and Boudiaf does not appear to explicitly teach
wherein the tuple of processing tasks to be processed by the neural network include at least one of: depth estimation, detection of pedestrians, detection of traffic signs, pose detection of pedestrians, detection of drivable area.
However, Hattori —directed to analogous art—teaches
wherein the set of tasks to be processed by the neural network include at least one of the following tasks: depth estimation, detection of pedestrians, detection of traffic signs, pose detection of pedestrians, detection of drivable area. (Page 1031, Fig. 2 shows “Overview of our fully convolutional neural network architecture for multi-task learning method: with physically grounded and geometrically accurate renders of pedestrians for every grid location, our region specific pedestrian detection and pose estimation networks are trained on this synthetic data. At test time, our model takes a single image and outputs pedestrian detections, segmentation mask and body pose estimates.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Vandenhende, Rusu, Belghazi, and Boudiaf with the task of Hattori because, as stated by Hattori in the abstract, "We demonstrate that when real human annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for an array of tasks for human activity analysis including detection, pose estimation and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data, and (3) surprisingly, our method using purely synthetic data is able to outperform models trained on real scene-specific data when data is limited." Additionally, as stated by Hattori in the abstract, "We consider scenarios where we have zero instances of real pedestrian data (e.g., a newly installed surveillance system in a novel location in which no labeled real data or unsupervised real data exists yet) and a pedestrian detector must be developed prior to any observations of pedestrians. Given a single image and auxiliary scene information in the form of camera parameters and geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate potential images of synthetic pedestrians along with purely accurate ground-truth labels through the use of computer graphics rendering engine."
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Liebel (“MultiDepth: Single-Image Depth Estimation via Multi-Task Regression and Classification”, October 2019)
Luo (“Traffic Sign Recognition Using a Multi-Task Convolutional Neural Network”, 2018)
Wang (“ShuDA-RFBNet for Real-time Multi-task Traffic Scene Perception”, November 22, 2019)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JESSICA THUY PHAM whose telephone number is (571)272-2605. The examiner can normally be reached Monday - Friday, 9:00 A.M. - 5:00 P.M.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.T.P./Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121