Prosecution Insights
Last updated: April 19, 2026
Application No. 17/883,439

METHOD AND APPARATUS FOR CONSTRUCTING MULTI-TASK LEARNING MODEL, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Final Rejection (§101, §103)

Filed: Aug 08, 2022
Examiner: DAY, ROBERT N
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Tencent Technology (Shenzhen) Company Limited
OA Round: 2 (Final)

Grant Probability: 23% (At Risk)
OA Rounds: 3-4
To Grant: 4y 3m
With Interview: 46%

Examiner Intelligence

Career Allow Rate: 23% (5 granted / 22 resolved; -32.3% vs TC avg)
Interview Lift: +23.2% (resolved cases with interview vs. without)
Avg Prosecution: 4y 3m (typical timeline); 38 applications currently pending
Total Applications: 60 across all art units (career history)

Statute-Specific Performance

§101: 32.6% (-7.4% vs TC avg)
§103: 35.3% (-4.7% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 18.3% (-21.7% vs TC avg)
Tech Center averages are estimates • Based on career data from 22 resolved cases

Office Action

§101 §103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . DETAILED ACTION This action is in response to the amendments filed 26 November 2025. Claims 7, 9, and 17 are cancelled. Claims 1, 6, 8, 10, 11, 12, 13, 16, 18, and 19 are amended. Claims 21-23 are newly added. Claims 1-6, 8, 10-16, and 18-23 are pending and have been examined. Response to Arguments Applicant's arguments, see page 14, filed 26 November 2025, with respect to the rejection of Claims 1-20 under 35 U.S.C. 101 have been fully considered but they are not persuasive. APPLICANT'S ARGUMENT: Applicant argues (page 14, paragraphs 1-2) that "Claim 1 as amended is directed to a method for constructing a multi-task learning model for predicting multiple tasks of an information recommendation system. Such multitask learning model may be used for recommending videos or news to a target user based on the values of multiple tasks predicted by the model. ... ¶ Moreover, by incorporating claims 7 and 9 as well as new claim features defining the search space and the process of training the model for multiple tasks in an information recommendation system, the original claim 1 has been further amended to integrate the invention into a practical application that is significantly more than an abstract idea." EXAMINER'S RESPONSE: Examiner respectfully disagrees. Amended Claim 1 currently recites an intended use of the claimed method, "for predicting multiple tasks of an information recommendation system," which is given no patentable weight and therefore cannot integrate the claimed method into a practical application. Examiner notes that, although the claims are interpreted in light of the specification, limitations from the specification, such as recommending videos or news, are not read into the claims. The limitations of amended Claim 1 previously recited by Claim 7, "sampling each search block in a respective search layer ... to obtain a local structure ...," appear to recite a mental process step. The limitation of amended Claim 1 previously recited by Claim 9, "determining ... an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model," appear to recite a mental process step. The steps of training the multi-task learning model previously recited by Claim 9, "training the network parameters ..." and "training the structural parameters ...," appear to recite additional elements of the mental process steps that invoke a computer or other machinery merely as a tool to perform an existing process. Amended Claim 1 also recites additional elements pertaining to tasks of the claimed information recommendation system and related sample data, which appear to amount to no more than generally linking the mental processes steps of the claimed method to a particular field of use. In the absence of additional elements providing a practical application or significantly more, the amended Claim 1 is directed to the recited mental processes. Applicant' s arguments, see pages 15-17, filed 26 November 2025, with respect to the rejection of Claims 1-20 under 35 U.S.C. 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. 
APPLICANT'S ARGUMENT: Applicant argues (page 17, paragraphs 1-2) that "There are a many-to-one mapping relationship between a layer of subnetwork modules and a subsequent layer of search blocks and a one-to-one mapping relationship between the layer of search blocks and a subsequent layer of subnetwork modules. ¶ Wierstra fails to teach the specific way of constructing the search space and then searching the search space for the optimized multi-task learning model as newly added to the pending claims. ... Nor does Cai salvage the deficiencies of Wiestra." EXAMINER'S RESPONSE: Examiner notes that Applicant's arguments are moot. Amended Claim 1 is now rejected under 35 U.S.C. 103 as being obvious in view of Wierstra in view of Guo. Guo is relied on to teach the argued features involving the relationship between subnetwork and search blocks of the multi-task network. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Regarding Claim 1 Step 1 Claim 1 recites a method for constructing a multi-task learning model for predicting multiple tasks of an information recommendation system, and thus the claimed process falls within a statutory category of invention. Step 2A Prong 1 The claim recites constructing a search space between an input node and a plurality of ... nodes ... by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner, therebetween, wherein a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers in an order of many-to-one mapping relationship and one-to-one mapping relationship, each search layer of the plurality of search layers having a plurality of search blocks and each subnetwork layer of the plurality of subnetwork layers having a plurality of subnetwork modules the input node is connected to each of a plurality of subnetwork modules within a first subnetwork layer of the plurality of subnetwork layers and each of the plurality of task nodes is connected to a corresponding one of a plurality of search blocks of a last search layer of the plurality of search layers, which is a mental process, as it can be practically performed in the human mind, or with use a physical aid, such as pen and paper. The claim recites sampling a plurality of paths from the input node to the plurality of ... nodes through the search space to obtain a plurality of candidate paths as a plurality of candidate network structures, which is a mental process. The claim recites sampling each search block in a respective search layer of the plurality of search layers in the search space according to a network parameter and a structural parameter of the search space to obtain a local structure corresponding to the search block wherein the network parameter and the structural parameter define a mapping relationship between the search block and a plurality of subnetwork modules within a respective subnetwork layer of the plurality of subnetwork layers immediately before the respective search layer of the plurality of search layers, which is a mental process. 
The claim recites connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers iteratively until establishing the plurality of candidate paths from the input node to the plurality of task nodes as the plurality of candidate network structures, which is a mental process. The claim recites determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model, which is a mental process. The claim recites the information recommendation system is configured to recommend information to a target user based on values of the multiple tasks associated with the target user predicted by the multi-task learning model, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The additional element task nodes corresponding to the multiple tasks of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters and the structural parameters of the candidate network structures according to ... sample data to generate the multi-task learning model for performing a multi-task prediction invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element recommendation sample data ..., wherein the recommendation sample data includes data corresponding to a first task of the information recommendation system and data corresponding to a second task of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters of the candidate network structures to obtain an optimized network parameters of the candidate network structures invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element training the structural parameters of the search space according to the optimized network parameter of the candidate network structures to obtain optimized structural parameters of the search space invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 2 Step 1 Regarding Claim 2, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites prior to constructing the search space, for a respective subnetwork layer of the plurality of subnetwork layers: performing sampling processing on outputs of a plurality of subnetwork modules in the respective subnetwork layer to obtain a plurality of sampled outputs of the plurality of subnetwork modules, which is a mental process. 
The claim recites performing weighted summation on the plurality of sampled outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules, which is a mental process. The claim recites constructing a transmission path of a search block using a result of the weighted summation as an output of a local structure of the search block, wherein the search block is a module in a search layer adjacent to the subnetwork layer, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 3 Step 1 Regarding Claim 3, the rejection of Claim 2 is incorporated. Step 2A Prong 1 The claim recites constructing a transmission path of a search block using a result of the weighted summation as an output of a local structure of the search block, wherein the search block is a module in a search layer adjacent to the subnetwork layer (as recited by Claim 2), wherein the search block further comprises a gated node, which is a mental process, as the gated node of the claim recites a description of a functional structure of the previously recited model, which can be practically performed in the human mind, or with use a physical aid, such as pen and paper. The claim recites after performing sampling processing on the outputs of the plurality of subnetwork modules in the respective subnetwork layer: sampling a signal source from a signal source set of the subnetwork layer, the signal source being an output of the input node or an output of a predecessor subnetwork module in the subnetwork layer, which is a mental process. The claim recites predicting the signal source by using the gated node, to obtain a predicted value of each subnetwork module of the plurality of subnetwork modules, which is a mental process. The claim recites performing normalization processing on the predicted value of each subnetwork module to obtain the weight of each subnetwork module, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 4 Step 1 Regarding Claim 4, the rejection of Claim 3 is incorporated. Step 2A Prong 1 The claim recites constructing a search space between an input node and a plurality of ... nodes by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner, wherein a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers(as recited by Claim 1), wherein the search space comprises N subnetwork layers and N search layers, and N is a natural number greater than 1, which is a mental process, as the claim recites a structural property of the previously recited model, which can be practically performed in the human mind, or with use a physical aid, such as pen and paper. The claim recites sampling outputs of a plurality of subnetwork modules from a first subnetwork layer using an ith search block in a first search layer, wherein i is a positive integer, which is a mental process. 
The claim recites performing weighted summation on the outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules when the signal source is the output of the input node, which is a mental process. The claim recites using a result of the weighted summation as an output of a local structure of the ith search block, to construct a transmission path of the ith search block, until transmission paths of all local structures of the ith search block in the first search layer are constructed, which is a mental process. The claim recites sampling outputs of a plurality of subnetwork modules from a jth subnetwork layer by using an ith search block in a jth search layer, 1 <j<=N, and j being a positive integer, which is a mental process. The claim recites performing weighted summation on the outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules when an output of a predecessor subnetwork module in the jth subnetwork layer, which is a mental process. The claim recites using a result of the weighted summation as an output of a local structure of the ith search block in the jth search layer, to construct a transmission path of the ith search block in the jth search layer, until transmission paths of all local structures of the ith search block in the jth search layer are constructed, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 5 Step 1 Regarding Claim 5, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites when a successor node in the search layer is a subnetwork module in a subsequent subnetwork layer, an output of a search block in the search layer is an input of the subnetwork module, which is a mental process. The claim recites when the successor node in the search layer is the task node, the output of the search block in the search layer is an input of the task node, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 6 Step 1 Regarding Claim 6, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites combining nodes of the plurality of subnetwork layers and the plurality of search layers and the edges of a directed graph between the plurality of subnetwork layers and the plurality of search layers to obtain the search space for multi-task learning, wherein subnetwork modules in the plurality of subnetwork layers and search blocks in the plurality of search layers are nodes of the directed graph, and wherein transmission paths from (i) the input node to a first subnetwork layer, (ii) intermediate subnetwork layers to adjacent search layers, and (iii) a last search layer to the task nodes are edges of the directed graph, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. 
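Editor's note: the Claim 1, 6, and 8 limitations analyzed in this rejection describe an algorithmic procedure (a staggered search space whose search blocks each draw a local structure from a distribution over structural parameters). The toy Python sketch below is the editor's illustration of that kind of procedure under assumed names and sizes; it is not code from the application, the Office Action, or the cited references.

import math
import random

def softmax(logits):
    # Map structural parameters (logits) to sampling probabilities.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def build_structural_params(num_search_layers=3, blocks_per_layer=2, modules_in_prev_layer=4):
    # One logit per (search layer, search block, candidate subnetwork module in
    # the immediately preceding subnetwork layer). Sizes are assumed.
    return {
        (layer, block): [random.gauss(0.0, 1.0) for _ in range(modules_in_prev_layer)]
        for layer in range(num_search_layers)
        for block in range(blocks_per_layer)
    }

def sample_candidate_structure(structural_params):
    # Per search block: softmax -> categorical (multinomial) distribution ->
    # draw one local structure (here, the index of the connected module).
    structure = {}
    for key, logits in sorted(structural_params.items()):
        probs = softmax(logits)
        structure[key] = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return structure

params = build_structural_params()
print(sample_candidate_structure(params))   # e.g. {(0, 0): 2, (0, 1): 0, ...}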
Regarding Claim 8 Step 1 Regarding Claim 8, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites performing mapping processing on the structural parameter of the search space to obtain sampling probabilities corresponding to local structures of each search block in the search space, which is a mental process. The claim recites constructing a polynomial distribution of each search block according to the sampling probabilities of the local structures of each search block, which is a mental process. The claim recites sampling the polynomial distribution of each search block to obtain the local structure corresponding to each search block, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 10 Step 1 Regarding Claim 10, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites constructing a loss function of the candidate network structure according to the multi task prediction result and a multi-task label of the sample data, which is a mental process. The claim recites updating the network parameter of the candidate network structure until the loss function converges, which is a mental process. The claim recites setting the updated network parameter of the candidate network structure as the optimized network parameter of the candidate network structure when the loss function converges, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The additional element performing multi-task prediction processing on the sample data using the candidate network structure to obtain a multi-task prediction result of the sample data invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 11 Step 1 Regarding Claim 11, the rejection of Claim 1 is incorporated. Step 2A Prong 1 The claim recites evaluating a network structure according to the sample data and the optimized network parameter of the candidate network structure, to obtain an evaluation result of the optimized candidate network structure, which is a mental process. The claim recites constructing a target function of the structural parameter of the search space according to the evaluation result, which is a mental process. The claim recites updating the structural parameter of the search space until the target function converges, which is a mental process. The claim recites setting the updated structural parameter of the search space as the optimized structural parameter of the search space when the target function converges, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 12 Step 1 Regarding Claim 12, the rejection of Claim 1 is incorporated. 
Step 2A Prong 1 The claim recites determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model, which is a mental process. The claim recites performing mapping processing on the optimized structural parameter of the search space to obtain sampling probabilities corresponding to the local structures of each search block in the search space, which is a mental process. The claim recites selecting a local structure having a maximum sampling probability in the local structures of each search block as a local structure of the candidate network structure for multi-task prediction, which is a mental process. The claim recites combining the local structure of each candidate network structure to obtain the multi-task learning model, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Regarding Claim 13 Step 1 Claim 13 recites an electronic device, and thus the claimed machine falls within a statutory category of invention. Step 2A Prong 1 The claim recites constructing a search space between an input node and a plurality of ... nodes ... by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner, therebetween, wherein a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers in an order of many-to-one mapping relationship and one-to-one mapping relationship, each search layer of the plurality of search layers having a plurality of search blocks and each subnetwork layer of the plurality of subnetwork layers having a plurality of subnetwork modules the input node is connected to each of a plurality of subnetwork modules within a first subnetwork layer of the plurality of subnetwork layers and each of the plurality of task nodes is connected to a corresponding one of a plurality of search blocks of a last search layer of the plurality of search layers, which is a mental process, as it can be practically performed in the human mind, or with use a physical aid, such as pen and paper. The claim recites sampling a plurality of paths from the input node to the plurality of ... nodes through the search space to obtain a plurality of candidate paths as a plurality of candidate network structures, which is a mental process. The claim recites sampling each search block in a respective search layer of the plurality of search layers in the search space according to a network parameter and a structural parameter of the search space to obtain a local structure corresponding to the search block wherein the network parameter and the structural parameter define a mapping relationship between the search block and a plurality of subnetwork modules within a respective subnetwork layer of the plurality of subnetwork layers immediately before the respective search layer of the plurality of search layers, which is a mental process. 
The claim recites connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers iteratively until establishing the plurality of candidate paths from the input node to the plurality of task nodes as the plurality of candidate network structures, which is a mental process. The claim recites determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model, which is a mental process. The claim recites the information recommendation system is configured to recommend information to a target user based on values of the multiple tasks associated with the target user predicted by the multi-task learning model, which is a mental process. Thus, the claim recites an abstract idea. Step 2A Prong 2, Step 2B The additional element one or more processors; and memory storing one or more programs, the one or more programs comprising instructions that, when executed by the one or more processors, cause the one or more processors to perform a method for constructing a multi-task learning model for predicting multiple tasks of an information recommendation system invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element task nodes corresponding to the multiple tasks of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters and the structural parameters of the candidate network structures according to ... sample data to generate the multi-task learning model for performing a multi-task prediction invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element recommendation sample data ..., wherein the recommendation sample data includes data corresponding to a first task of the information recommendation system and data corresponding to a second task of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters of the candidate network structures to obtain an optimized network parameters of the candidate network structures invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element training the structural parameters of the search space according to the optimized network parameter of the candidate network structures to obtain optimized structural parameters of the search space invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Claims 14-17, dependent on Claim 13, incorporate the rejection of Claim 13. 
Claims 14 and 15 incorporate substantively all the limitations of Claims 2, 3, 6, and 7, respectively, in electronic device form and are rejected under the same rationales. Regarding Claim 18 Step 1 Claim 18 recites a non-transitory computer-readable storage medium, storing a computer program, and thus the claimed manufacture falls within a statutory category of invention. Step 2A Prong 1 The claim recites constructing a search space between an input node and a plurality of ... nodes ... by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner, therebetween, wherein a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers in an order of many-to-one mapping relationship and one-to-one mapping relationship, each search layer of the plurality of search layers having a plurality of search blocks and each subnetwork layer of the plurality of subnetwork layers having a plurality of subnetwork modules the input node is connected to each of a plurality of subnetwork modules within a first subnetwork layer of the plurality of subnetwork layers and each of the plurality of task nodes is connected to a corresponding one of a plurality of search blocks of a last search layer of the plurality of search layers, which is a mental process, as it can be practically performed in the human mind, or with use a physical aid, such as pen and paper. The claim recites sampling a plurality of paths from the input node to the plurality of ... nodes through the search space to obtain a plurality of candidate paths as a plurality of candidate network structures, which is a mental process. The claim recites sampling each search block in a respective search layer of the plurality of search layers in the search space according to a network parameter and a structural parameter of the search space to obtain a local structure corresponding to the search block wherein the network parameter and the structural parameter define a mapping relationship between the search block and a plurality of subnetwork modules within a respective subnetwork layer of the plurality of subnetwork layers immediately before the respective search layer of the plurality of search layers, which is a mental process. The claim recites connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers iteratively until establishing the plurality of candidate paths from the input node to the plurality of task nodes as the plurality of candidate network structures, which is a mental process. The claim recites determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model, which is a mental process. The claim recites the information recommendation system is configured to recommend information to a target user based on values of the multiple tasks associated with the target user predicted by the multi-task learning model, which is a mental process. Thus, the claim recites an abstract idea. 
Step 2A Prong 2, Step 2B The additional element the computer program, when executed by one or more processors of an electronic device, cause the one or more processors to perform a method for constructing a multi-task learning model for predicting multiple tasks of an information recommendation system invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element task nodes corresponding to the multiple tasks of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters and the structural parameters of the candidate network structures according to ... sample data to generate the multi-task learning model for performing a multi-task prediction invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element recommendation sample data ..., wherein the recommendation sample data includes data corresponding to a first task of the information recommendation system and data corresponding to a second task of the information recommendation system does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The additional element training the network parameters of the candidate network structures to obtain an optimized network parameters of the candidate network structures invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element training the structural parameters of the search space according to the optimized network parameter of the candidate network structures to obtain optimized structural parameters of the search space invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Claims 19 and 20, dependent on Claim 18, incorporate the rejection of Claim 18. Claims 19 and 20 incorporate substantively all the limitations of Claims 2 and 3, respectively, in non-transitory computer-readable storage medium form and are rejected under the same rationales. Regarding Claim 21 Step 1 Regarding Claim 21, the rejection of Claim 1 is incorporated. Step 2A Prong 1 Claim 21 recites the abstract ideas recited by parent Claim 1. Step 2A Prong 2, Step 2B The additional element wherein the first task is a click-through rate of the target user after receiving a piece of recommended information and the second task is a degree of completion by the target user after receiving the piece of recommended information does not amount to more than generally linking the use of a judicial exception to a particular field of use (see MPEP 2106.05(h), "limit the use of the abstract idea to a particular technological environment"). The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible. Claim 22, dependent on Claim 13, incorporates the rejection of Claim 13. 
Claim 22 incorporates substantively all the limitations of Claim 21 in electronic device form and is rejected under the same rationale. Claim 23, dependent on Claim 18, incorporates the rejection of Claim 18. Claim 23 incorporates substantively all the limitations of Claim 21 in non-transitory computer-readable storage medium form and is rejected under the same rationale. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claims 1, 2, 5, 6, 8, 11-13, 14, 16, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wierstra, et al. (US 2019/0354868 A1, hereinafter "Wierstra") in view of Guo, et al., "Learning to Branch for Multi-Task Learning" (hereinafter "Guo"). Regarding Claim 1, Wierstra teaches: A method for constructing a multi-task learning model (Wierstra, Claim 1: "A system comprising ... instructions that when executed by the one or more computers cause the one or more computers to implement: a super neural network" and [0044]: "the neural network system 100 is capable of receiving network inputs and generating network outputs for multiple different machine learning tasks") for predicting multiple tasks of an information recommendation system (Wierstra, [0042]: "the multiple machine learning tasks may include multiple different content recommendation tasks, e.g., each task may be to effectively recommend content to different users or user groups. The tasks may also include processing input data ... to determine a score representing a likelihood that a resource relates to a particular topic"), comprising: constructing a search space (Wierstra, [0089]-[0091]: "The system initializes a population of candidate paths (step 302).
¶ Each of the candidate paths specifies, for each of the layers of the super neural network, a respective proper subset of the modular neural networks in the layer to designate as active when performing the particular machine learning task. ¶ In particular, the system selects a fixed number of candidate paths randomly, subject to certain criteria," where Wierstra's population of candidates and selection correspond to the instant search space) between an input ... (Wierstra, Fig. 1A 130A, Layer A, comprising modular subnetworks, which receive input 102 for super network 110, where [0044]: "The neural network system 100 is a system that receives a network input 102 and processes the network input 102 using a super neural network 110 to generate a network output 112 for the network input 102") and a plurality of task nodes corresponding to the multiple tasks of the information recommendation system (Wierstra, [0006]: "The super neural network also comprises a plurality of sets of one or more output layers, wherein each set of output layers corresponds to a different machine learning task from a plurality of machine learning tasks, and wherein each set of one or more output layers is (collectively) configured to receive a stack output and to generate a neural network output that is specific to the corresponding machine learning task," where Wierstra's output layers correspond to the instant task nodes, and [0042]: "the multiple machine learning tasks may include multiple different content recommendation tasks") by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner therebetween, wherein a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers ... (Wierstra, Fig. 1A, depicting alternating organization of Modular Subnetwork and Combining Layer layers, corresponding to the instant subnetwork and search layers, respectively) ... each subnetwork layer of the plurality of subnetwork layers having a plurality of subnetwork modules (Wierstra, Fig. 1A, depicting, e.g., Layer N comprising Modular Subnetworks N-P, where Wierstra's Modular Subnetwork N-P correspond to the instant modules), ... each of the plurality of task nodes is connected to ... a last search layer of the plurality of search layers (Wierstra, Fig. 1A, depicting, e.g., Combining Layer N connected to Output Layer A, where Wierstra's Combining Layer N and Output Layer A correspond to the instant search layer and task node, respectively); sampling a plurality of paths (Wierstra, Fig. 3, step 302, "Initialize population," and step 304, "Select candidate paths," where [0091]: "the system selects a fixed number of candidate paths randomly, subject to certain criteria," where Wierstra's random selection corresponds to the instant sampling) from the input ... to the plurality of task nodes through the search space to obtain a plurality of candidate paths as a plurality of candidate network structures (Wierstra, [0094]: "for each of the candidate paths, the system trains the super neural network while processing training inputs using only the modular neural networks designated as active by the candidate path and the output layer corresponding to the particular machine learning task, i.e., and not using any modular neural networks that are not designated as active by the candidate path"); ... 
connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers (Wierstra, Fig. 1A, depicting a path connecting subnetwork and combining layers, and [0015]: "The method may comprise: selecting a plurality of candidate paths through the plurality of layers, each of the candidate paths specifying, for each of the layers, a respective proper subset of the modular neural networks in the layer that are designated as active when performing the particular machine learning task") iteratively until establishing the plurality of candidate paths from the input ... to the plurality of task nodes as the plurality of candidate network structures (Wierstra, [0016]: "Selecting the plurality of candidate paths may comprise selecting a first candidate path and a second candidate path ... and based on determining that the first candidate path has a better fitness than the second candidate path: mutating the first candidate path by changing one or more of the active modular neural networks in the first candidate path; and replacing the second candidate path with the mutated first candidate path," where Wierstra's selecting and mutating corresponds to the instant iteratively); and training ... network parameters ... of the candidate network structures (Wierstra, [0103]: "The system can repeatedly perform steps 304-314 for all of the candidate paths in the population to update the population and to adjust the values of the parameters of the super neural network" where steps 304-314 of Fig. 3 comprise [0033]: "FIG. 3 is a flow diagram of an example process for training a super neural network on a new machine learning task") according to recommendation sample data to generate the multi-task learning model for performing a multi-task prediction wherein the recommendation sample data includes data corresponding to a first task of the information recommendation system and data corresponding to a second task of the information recommendation system (Wierstra, [0021]: "The method may comprise obtaining first training data for a first machine learning task; and training the super neural network on the first training data to determine a best fit path through the plurality of layers for the first machine learning task" and [0023]: "The method may further comprise ... obtaining second training data for a second machine learning task that follows the first machine learning task in the sequence; and training the super neural network on the second training data to determine a best fit path through the plurality of layers for the second machine learning task" and [0042]: "the multiple machine learning tasks may include multiple different content recommendation tasks, e.g., each task may be to effectively recommend content to different users or user groups. The tasks may also include processing input data ... to determine a score representing a likelihood that a resource relates to a particular topic," where Wierstra's different content recommendations correspond to the instant first and second tasks), further including: training ... network parameters of the candidate network structures (Wierstra, Fig. 
3, Step 306, "Train super neural network on selected paths," where training includes network parameters, as in [0067]: "when training the super neural network 110 on a given task, the system determines both (i) the path for the machine learning task and (ii) trained values of the parameters of the modular neural networks in the path") to obtain an optimized network parameters of the candidate network structures (Wierstra, [0118]: "After the training has been completed, i.e., after the last iteration of the steps 404-416 has been performed, the system selects the candidate path in the population having the best fitness as the path for the new machine learning task" and [0019]: "Training the super neural network on each of the plurality of candidate paths may comprise, during the training, holding fixed values of parameters of any modular neural networks that are in best fit paths for any machine learning tasks in the plurality of machine learning tasks for which a best fit path has already been determined," where Wierstra's best fitness correspond to the instant optimized); ...; and determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized ... parameters of the search space as the multi-task learning model (Wierstra, [0118]: "After the training has been completed, i.e., after the last iteration of the steps 404-416 has been performed, the system selects the candidate path in the population having the best fitness as the path for the new machine learning task" and [0019]: "Training the super neural network on each of the plurality of candidate paths may comprise, during the training, holding fixed values of parameters of any modular neural networks that are in best fit paths for any machine learning tasks in the plurality of machine learning tasks for which a best fit path has already been determined," where Wierstra's best fitness correspond to the instant optimized), wherein the information recommendation system is configured to recommend information to a target user based on values of the multiple tasks associated with the target user predicted by the multi-task learning model (Wierstra, [0042]: "the multiple machine learning tasks may include multiple different content recommendation tasks, e.g., each task may be to effectively recommend content to different users or user groups. The tasks may also include processing input data ... to determine a score representing a likelihood that a resource relates to a particular topic," where Wierstra's likelihood score corresponds to the instant predicted). Wierstra teaches a method for constructing a multi-task learning model for predicting multiple tasks of an information recommendation system, comprising constructing a search space between an input and a plurality of task nodes, sampling a plurality of paths from the input to the plurality of task nodes through the search space, connecting the local structures via subnetwork modules, training network parameters of the candidate network structures according to recommendation sample data, training network parameters of the candidate network structures to obtain an optimized network parameters of the candidate network structures, and determining an optimized network structure for the multi-task prediction according to the optimized parameters of the search space. 
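Editor's note: the Wierstra passages cited above describe an evolutionary loop over candidate paths (initialize a population, compare fitness of two candidates, mutate the better path, replace the worse one). The sketch below is the editor's schematic reading of that loop with a placeholder fitness function; it is not Wierstra's implementation and it omits the actual training of the super neural network.

import random

def random_path(num_layers=4, modules_per_layer=5, active_per_layer=2):
    # A candidate path: for each layer, a proper subset of module indices
    # designated as active (editor's toy reading of the cited passages).
    return [frozenset(random.sample(range(modules_per_layer), active_per_layer))
            for _ in range(num_layers)]

def mutate(path, modules_per_layer=5):
    # Change one active module in one randomly chosen layer.
    new_path = list(path)
    layer = random.randrange(len(new_path))
    active = list(new_path[layer])
    active[random.randrange(len(active))] = random.randrange(modules_per_layer)
    new_path[layer] = frozenset(active)
    return new_path

def evolve(fitness, population_size=8, generations=50):
    population = [random_path() for _ in range(population_size)]
    for _ in range(generations):
        a, b = random.sample(range(population_size), 2)
        # In the cited scheme, each candidate would be trained and evaluated
        # here; the placeholder `fitness` stands in for that step.
        if fitness(population[a]) >= fitness(population[b]):
            winner, loser = a, b
        else:
            winner, loser = b, a
        population[loser] = mutate(population[winner])
    return max(population, key=fitness)

# Placeholder fitness: prefer paths that keep reusing module 0 (illustrative only).
print(evolve(lambda path: sum(0 in layer for layer in path)))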
Wierstra does not explicitly teach a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers in an order of many-to-one mapping relationship and one-to-one mapping relationship ... sampling each search block in a respective search layer of the plurality of search layers in the search space according to a network parameter and a structural parameter of the search space to obtain a local structure corresponding to the search block; wherein the network parameter and the structural parameter define a mapping relationship between the search block and a plurality of subnetwork modules within a respective subnetwork layer of the plurality of subnetwork layers immediately before the respective search layer of the plurality of search layers; and connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers iteratively until establishing the plurality of candidate paths from the input node to the plurality of task nodes as the plurality of candidate network structures; and training the structural parameters of the search space according to the optimized network parameter of the candidate network structures to obtain optimized structural parameters of the search space. However, Guo teaches: a search layer in the plurality of search layers is arranged between two subnetwork layers of the plurality of subnetwork layers in an order of many-to-one mapping relationship and one-to-one mapping relationship (Guo, p. 3, Figure 1: "Illustration of the proposed branching block. Each child node j is equipped with a categorical distribution so it can sample a parent node to receive input data after the training," where Guo's parent and child layers correspond to the instant subnetwork layers, connected by way of a branching layer corresponding to the instant search layer, as in p. 3, 3.1. Formulation Setup: "The tree structure in the network is realized by branching operations at certain layers. Each branching layer can have an arbitrary number of child (next) layers up to the computational budget available"), each search layer of the plurality of search layers having a plurality of search blocks (Guo, p. 3, Figure 1, depicting the branching layer comprising four search blocks) ... an input node ... the input node is connected to each of a plurality of subnetwork modules within a first subnetwork layer of the plurality of subnetwork layers (Guo, p. 5, Figure 3, "Learned network architectures by our method in three different experimental settings," depicting in network (a) an input node connected to first subnetwork layer comprising W 0,0 , W 0,1 , and W 0,2 ) and each of the plurality of task nodes is connected to a corresponding one of a plurality of search blocks of a last search layer of the plurality of search layers (Guo, p. 8, Figure 4: "Four randomly sampled network architectures trained on Taskonomy dataset. ... Network (b) branches out at one layer later compared to the others but still shares the same task grouping strategy," where network 4(b) depicts task nodes Normal, Depth, Segmentation, Edge, and Keypoints connected to prior nodes W 15,2 and W 15,3 , and thus directly and indirectly connected to the nodes of the branching layers corresponding to the network connections, as in p. 3, 3.1. Formulation Setup: "The tree structure in the network is realized by branching operations at certain layers. 
Each branching layer can have an arbitrary number of child (next) layers up to the computational budget available"); sampling each search block in a respective search layer of the plurality of search layers in the search space according to a network parameter and a structural parameter of the search space to obtain a local structure corresponding to the search block (Guo, p. 5, Figure 3, "our proposed tree-structured network topology is end-to-end trainable -- the network architecture Ω and the weight matrices ω of the network are jointly optimized during training," where Guo's network architecture and weight matrices correspond to the instant structural and network parameters, respectively); wherein the network parameter and the structural parameter define a mapping relationship between the search block and a plurality of subnetwork modules within a respective subnetwork layer of the plurality of subnetwork layers immediately before the respective search layer of the plurality of search layers (Guo, p. 3, 3.2. Network Topological Space: "each parent node at layer l propagates its output activations as input x j l + 1 to one or more child nodes j based on the sampling distributions. The sampling distribution is parameterized by θ j . ... ¶ We update the parameter θ j of the sampling distribution p θ j ... to make it more likely to generate network configurations Ω toward the direction of minimizing the overall loss L t o t a l " and p. 4, 3.3. Differentiable Branching Operation: "For every two layers in a branching block (shown in Figure 1), we construct a matrix M ∈ R I × J to represent the connectivity from parent nodes i to child nodes j . Each entry θ i , j in such a matrix M stores the probability value that represents how likely the parent node i would be sampled to connect with the child node j ," where Guo's probability of a parent-child connection corresponds to the instant mapping relationship, defined at [0090] as obtaining sampling probabilities for search-block connections); and connecting the local structures in two different search layers via subnetwork modules in a subnetwork layer between the two different search layers (Guo, p. 3, 3.1. Formulation Setup: "we construct multiple parent nodes and child nodes for each block and allow a child node to sample a path from all the paths between it and all its parent nodes. The selected connectivities therefore define the tree structure by such sampling (branching) procedure") iteratively until establishing the plurality of candidate paths from the input node to the plurality of task nodes as the plurality of candidate network structures (Guo, p. 3, 3.1. Formulation Setup: "During training, we first sample a network configuration from the design space distribution and then perform forward propagation to compute the overall loss value L t o t a l . We then obtain corresponding gradients to update both the design space distribution and the weight matrices ω in the network in backward fashion. We iterate through the process until the overall validation loss converges and then we sample our final network configuration using the converged design space distribution"); and training the structural parameters of the search space according to the optimized network parameter of the candidate network structures to obtain optimized structural parameters of the search space (Guo, p. 
5, Figure 3, "our proposed tree-structured network topology is end-to-end trainable -- the network architecture Ω and the weight matrices ω of the network are jointly optimized during training," where Guo's network architecture and weight matrices correspond to the instant structural and network parameters, respectively); and determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model (Guo, p. 4, 3.4. Final Architecture Selection: "During the training stage, the network topology distribution and the weight matrices of the network are jointly optimized over the loss L t o t a l across all tasks. Once the validation loss converges, we simply select the final network configuration using the same categorical distribution but without the noise ϵ for every block in the network"). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the above teachings of Wierstra with those above of Guo. The motivation to do so would be to facilitate end-to-end training of multi-task networks that does not result in over-generalized networks and does not rely on manual layer splitting (Guo, p. 1, Abstract: "we present an automated multi-task learning algorithm that learns where to share or branch within a network, designing an effective network topology that is directly optimized for multiple objectives across tasks. Specifically, we propose a novel tree-structured design space that casts a tree branching operation as a gumbel-softmax sampling procedure. This enables differentiable network splitting that is end-to-end trainable" and p. 2, 1. Introduction: "A key challenge towards answering the question is then deciding what layers should be shared across tasks and what layers should be untied. Over-sharing a network could erroneously enforce over-generalization, causing negative knowledge transfer across tasks. In this work, we propose a tree-structured network design space that can automatically learn how to branch a network such that the overall multi-task loss is minimized. ... This data-driven network structure searching approach does not require prior knowledge of the relationship between tasks nor human intuition on what layers capture task-specific features and should be split"). 
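Editor's note: the Guo passages relied on above describe a branching block in which each child node samples a parent node from a categorical distribution parameterized by θ, made differentiable via Gumbel-Softmax so that the structural parameters and the layer weights can be optimized jointly. The PyTorch sketch below is the editor's minimal illustration of that mechanism; the module names, dimensions, loss, and hyperparameters are assumptions, not Guo's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchingBlock(nn.Module):
    # Editor's illustration (not Guo's code): each child node samples a parent
    # node via a Gumbel-Softmax over connection logits theta, so both the
    # connectivity (structural parameters) and the linear-layer weights
    # (network parameters) receive gradients from the task loss.
    def __init__(self, num_parents=3, num_children=2, dim=16):
        super().__init__()
        self.parents = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_parents))
        self.child_layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_children))
        # theta[j, i]: logit that child j connects to parent i.
        self.theta = nn.Parameter(torch.zeros(num_children, num_parents))

    def forward(self, x, tau=1.0):
        parent_out = torch.stack([p(x) for p in self.parents], dim=1)  # (batch, parents, dim)
        outputs = []
        for j, child in enumerate(self.child_layers):
            # Approximately one-hot, differentiable sample of a parent for child j.
            d_j = F.gumbel_softmax(self.theta[j], tau=tau, hard=True)  # (parents,)
            x_j = (d_j.unsqueeze(0).unsqueeze(-1) * parent_out).sum(dim=1)
            outputs.append(child(x_j))
        return outputs

block = BranchingBlock()
optimizer = torch.optim.Adam(block.parameters(), lr=1e-2)  # theta and weights jointly
x = torch.randn(4, 16)
targets = [torch.randn(4, 16), torch.randn(4, 16)]  # one toy target per child/task
loss = sum(F.mse_loss(out, tgt) for out, tgt in zip(block(x), targets))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(block.theta.grad.abs().sum() > 0)  # structural logits also receive gradients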
Regarding Claim 13, Wierstra teaches: An electronic device, comprising: one or more processors (Wierstra, [0124]: "The term 'data processing apparatus' refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers"); and memory storing one or more programs (Wierstra, [0123]: "Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus"), the one or more programs comprising instructions that, when executed by the one or more processors (Wierstra, [0122]: "For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions"), cause the one or more processors to perform precisely those steps recited by the method of Claim 1. Claim 13 is rejected under the same rationale as Claim 1.

Regarding Claim 18, Wierstra teaches: A non-transitory computer-readable storage medium, storing a computer program (Wierstra, [0123]: "Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus"), the computer program, when executed by one or more processors of an electronic device (Wierstra, [0124]: "The term 'data processing apparatus' refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers"), causing the one or more processors to perform precisely those steps recited by the method of Claim 1. Claim 18 is rejected under the same rationale as Claim 1.

Regarding Claim 2, the method of Claim 1 is incorporated. Guo further teaches: prior to constructing the search space, for a respective subnetwork layer of the plurality of subnetwork layers: performing sampling processing on outputs of a plurality of subnetwork modules in the respective subnetwork layer to obtain a plurality of sampled outputs of the plurality of subnetwork modules (Guo, p. 3, 3.2. Network Topological Space: "we construct multiple parent nodes and child nodes for each block and allow a child node to sample a path from all the paths between it and all its parent nodes. The selected connectivities therefore define the tree structure by such sampling (branching) procedure. We formulate the branching operation at layer l as: x_j^{l+1} = E_{d_j ∼ p_{θ_j}}[d_j · Y^l] (2), where Y^l = [y_1^l, …, y_I^l] concatenates outputs from all parent nodes at layer l, and d_j is an indicator vector sampled from a certain distribution p_{θ_j}. The indicator d_j is a one-hot vector. Hence the dot product in Eq 2 essentially assigns one of the parent nodes to each child node j. In other words, each parent node at layer l propagates its output activations as input x_j^{l+1} to one or more child nodes j based on the sampling distributions"); performing weighted summation on the plurality of sampled outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules (Guo, p. 3, 3.2. Network Topological Space: "d_j is an indicator vector sampled from a certain distribution p_{θ_j}. The indicator d_j is a one-hot vector. Hence the dot product in Eq 2 essentially assigns one of the parent nodes to each child node j," where Guo's dot product corresponds to the instant weighted sum); and constructing a transmission path of a search block using a result of the weighted summation as an output of a local structure of the search block, wherein the search block is a module in a search layer adjacent to the subnetwork layer (Guo, p. 3, 3.2. Network Topological Space: "In other words, each parent node at layer l propagates its output activations as input x_j^{l+1} to one or more child nodes j based on the sampling distributions. The sampling distribution is parameterized by θ_j. The proposed topological space degenerates into a conventional single-path (convolutional) neural network if each block only contains one parent node and one child node").

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding constructing a search space between an input node and a plurality of task nodes with the further teachings of Guo regarding prior to constructing the search space, for a respective subnetwork layer of the plurality of subnetwork layers, performing sampling processing on outputs of a plurality of subnetwork modules in the respective subnetwork layer to obtain a plurality of sampled outputs of the plurality of subnetwork modules, performing weighted summation on the plurality of sampled outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules, and constructing a transmission path of a search block using a result of the weighted summation as an output of a local structure of the search block, wherein the search block is a module in a search layer adjacent to the subnetwork layer. The motivation to do so would be to facilitate training of a multi-task model that is end-to-end trainable and is based on desired model capacity (Guo, p. 3, 3.2. Network Topological Space: "The branching blocks in Figure 1 can be stacked to form a deeper tree-structured neural network (illustrated in Figure 2(d)) and the number of parent nodes and the number of child nodes can be adjusted based on the desired model capacity. Different from the greedy layer-wise optimization approach ..., our proposed tree-structured network topology is end-to-end trainable -- the network architecture and the weight matrices ω of the network are jointly optimized during training").

Claims 14 and 19 incorporate substantively all the limitations of Claim 2 in electronic device and non-transitory computer-readable storage medium forms, respectively, and are rejected under the same rationale.
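The Eq. 2 branching operation quoted above can be checked numerically in a few lines. The snippet below is an assumed illustration with invented shapes and values, not Guo's code: the one-hot indicator d_j picks out a single parent's output, and using the sampling probabilities in place of the one-hot indicator gives the weighted summation of sampled outputs that the rejection maps to Claim 2.

# Assumed numeric check of d_j · Y^l (hard, one-hot selection) versus p · Y^l (weighted summation).
import numpy as np

rng = np.random.default_rng(1)
I, feat = 3, 4                                  # three parent subnetwork modules, feature size 4
Y = rng.normal(size=(I, feat))                  # Y^l: stacked outputs of the parent modules
p = np.array([0.2, 0.5, 0.3])                   # sampling distribution p_{theta_j} for child (search block) j

d = np.zeros(I)
d[rng.choice(I, p=p)] = 1.0                     # one-hot indicator d_j sampled from p_{theta_j}
hard_input = d @ Y                              # selects exactly one parent's output
soft_input = p @ Y                              # probability-weighted summation over the parent outputs
print(hard_input, soft_input, sep="\n")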
Regarding Claim 5, the method of Claim 1 is incorporated. The Wierstra/Guo combination teaches: when a successor node in the search layer is a subnetwork module in a subsequent subnetwork layer, an output of a search block in the search layer is an input of the subnetwork module (Wierstra, Fig. 1A, depicting the output of block 132A Combining Layer A as input to modular subnetworks of block 130N Layer N); and when the successor node in the search layer is the task node, the output of the search block in the search layer is an input of the task node (Wierstra, Fig. 1A, depicting the output of block 132N Combining Layer N as input to block 150A Output Layer A).

Regarding Claim 6, the method of Claim 1 is incorporated. The Wierstra/Guo combination teaches: wherein constructing the search space comprises: combining nodes and edges of a directed graph (Wierstra, Fig. 1B, depicting edges and nodes of a graph directed from inputs to outputs) to obtain the search space for multi-task learning (Wierstra, Fig. 3, blocks 304, "select candidate paths," and 310, "determine which path has best fitness"), wherein subnetwork modules in the plurality of subnetwork layers and search blocks in the plurality of search layers are nodes of the directed graph (Wierstra, Fig. 1A, depicting directed edges connecting subnetwork and combining layers from input to output), and wherein transmission paths from (i) the input node to a first subnetwork layer (Wierstra, Fig. 1A, edge from Network Input 102 to Layer A 130A), (ii) intermediate subnetwork layers to adjacent search layers (Wierstra, Fig. 1A, edge from Layer A 130A to Combining Layer A 132A), and (iii) a last search layer to the task nodes are edges of the directed graph (Wierstra, Fig. 1A, edge from Combining Layer N 132N to Output Layer A 150A). Claim 16 incorporates substantively all the limitations of Claim 6 in electronic device form and is rejected under the same rationale.

Regarding Claim 8, the method of Claim 1 is incorporated. Guo further teaches: wherein sampling each search block in the respective search layer of the plurality of search layers in the search space according to the structural parameter of the search space to obtain a local structure corresponding to each search block comprises: performing mapping processing on the structural parameter of the search space to obtain sampling probabilities corresponding to local structures of each search block in the search space (Guo, p. 3, 3.2. Network Topological Space: "each parent node at layer l propagates its output activations as input x_j^{l+1} to one or more child nodes j based on the sampling distributions. The sampling distribution is parameterized by θ_j. ... ¶ We update the parameter θ_j of the sampling distribution p_{θ_j} ... to make it more likely to generate network configurations Ω toward the direction of minimizing the overall loss L_total" and p. 4, 3.3. Differentiable Branching Operation: "For every two layers in a branching block (shown in Figure 1), we construct a matrix M ∈ R^{I×J} to represent the connectivity from parent nodes i to child nodes j. Each entry θ_{i,j} in such a matrix M stores the probability value that represents how likely the parent node i would be sampled to connect with the child node j," where Guo's probability of a parent-child connection corresponds to the instant mapping relationship, defined at [0090] as obtaining sampling probabilities for search-block connections); constructing a polynomial distribution of each search block according to the sampling probabilities of the local structures of each search block (Guo, p. 4, 3.3. Differentiable Branching Operation: "During every forward propagation, each child node j makes a discrete decision drawn from a categorical distribution based on the distribution: ... Again d_j ∈ R^I is a one-hot vector with dimension the same as the number of parent nodes I at the current level," where Guo's dimension of I corresponds to the instant polynomial); and sampling the polynomial distribution of each search block to obtain the local structure corresponding to each search block (Guo, p. 4, 3.3. Differentiable Branching Operation: "For every two layers in a branching block (shown in Figure 1), we construct a matrix ... to represent the connectivity from parent nodes i to child nodes j. Each entry ... in such a matrix ... stores the probability value that represents how likely the parent node i would be sampled to connect with the child node j. During every forward propagation, each child node j makes a discrete decision drawn from a categorical distribution").

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding sampling each search block in the respective search layer of the plurality of search layers in the search space according to the structural parameter of the search space to obtain a local structure corresponding to each search block with the further teachings of Guo regarding performing mapping processing on the structural parameter of the search space to obtain sampling probabilities corresponding to local structures of each search block in the search space, constructing a polynomial distribution of each search block according to the sampling probabilities of the local structures of each search block, and sampling the polynomial distribution of each search block to obtain the local structure corresponding to each search block. The motivation to do so would be to facilitate training of a neural network according to a training loss (Guo, p. 4, 3.3. Differentiable Branching Operation: "To sample a categorical value from the continuous sampling distribution, we utilize the gumbel-softmax estimator trick ... to enable the differentiability for the branching operation ... ¶ [T]he branching probabilities are fully differentiable with respect to the training loss and can readily be inserted to a neural network and stacked to construct a tree-structured neural network").
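The Claim 8 steps as mapped above (mapping the structural parameter to sampling probabilities, forming a categorical distribution for each search block, and sampling a local structure from it) can be sketched as follows. This is an assumed illustration, not Guo's implementation; the last lines mirror the gumbel-softmax relaxation that the quoted Section 3.3 refers to, with an invented temperature value.

# Assumed sketch: structural parameter -> softmax probabilities -> categorical draw of a local structure.
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta_block = rng.normal(size=5)            # structural parameter for one search block (5 local structures)
probs = softmax(theta_block)                # "mapping processing" -> sampling probabilities
local_structure = rng.choice(5, p=probs)    # sample the categorical ("polynomial") distribution

tau = 0.5                                   # temperature of the gumbel-softmax approximation
gumbel = -np.log(-np.log(rng.uniform(size=5)))
relaxed = softmax((np.log(probs) + gumbel) / tau)   # differentiable surrogate for the discrete draw
print(probs, local_structure, relaxed, sep="\n")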
Regarding Claim 11, the method of Claim 1 is incorporated. Guo further teaches: wherein training the structural parameter of the search space according to the optimized network parameter of the candidate network structure to obtain the optimized structural parameter of the search space comprises: evaluating a network structure according to the sample data and the optimized network parameter of the candidate network structure, to obtain an evaluation result of the optimized candidate network structure (Guo, p. 3, 3.2. Network Topological Space: "We formulate the branching operation at layer l as: x_j^{l+1} = E_{d_j ∼ p_{θ_j}}[d_j · Y^l] (2), where Y^l = [y_1^l, …, y_I^l] concatenates outputs from all parent nodes at layer l, and d_j is an indicator vector sampled from a certain distribution p_{θ_j}. The indicator d_j is a one-hot vector. Hence the dot product in Eq 2 essentially assigns one of the parent nodes to each child node j. In other words, each parent node at layer l propagates its output activations as input x_j^{l+1} to one or more child nodes j based on the sampling distributions"); constructing a target function of the structural parameter of the search space according to the evaluation result (Guo, p. 3, 3.2. Network Topological Space: "We update the parameter θ_j of the sampling distribution p_{θ_j} using the chain rule with respect to the final loss, ∂L_total/∂θ_j = (∂L_total/∂x_j^{l+1})(∂x_j^{l+1}/∂θ_j) = (∂L_total/∂x_j^{l+1}) ∂/∂θ_j E_{d_j ∼ p_{θ_j}}[d_j · Y^l] (3); the backward pass then adjusts the sampling distribution p_{θ_j} to make it more likely to generate network configurations toward the direction of minimizing the overall loss L_total"); and updating the structural parameter of the search space until the target function converges, and setting the updated structural parameter of the search space as the optimized structural parameter of the search space when the target function converges (Guo, p. 4, 3.4. Final Architecture Selection: "During the training stage, the network topology distribution and the weight matrices of the network are jointly optimized over the loss L_total across all tasks. Once the validation loss converges, we simply select the final network configuration using the same categorical distribution but without the noise ϵ for every block in the network.... We then re-train the final network architecture from scratch to obtain the final performance").

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding training the structural parameter of the search space according to the optimized network parameter of the candidate network structure to obtain the optimized structural parameter of the search space with the further teachings of Guo regarding evaluating a network structure according to the sample data and the optimized network parameter of the candidate network structure to obtain an evaluation result of the optimized candidate network structure, constructing a target function of the structural parameter of the search space according to the evaluation result, updating the structural parameter of the search space until the target function converges, and setting the updated structural parameter of the search space as the optimized structural parameter of the search space when the target function converges. The motivation to do so would be to facilitate training of a multi-task network where the performance of the final trained network is highly correlated with the performance of the network during architecture search (Guo, p. 4, 3.4. Final Architecture Selection: "The same procedure has also been shown effective in previous literature ... where such weight sharing network search schema demonstrates high correlation between the intermediate network performance during search phase and the final performance obtained by re-train the network from scratch").

Regarding Claim 12, the method of Claim 1 is incorporated. Guo further teaches: wherein determining, from the candidate network structures, an optimized network structure for the multi-task prediction according to the optimized structural parameters of the search space as the multi-task learning model comprises: performing mapping processing on the optimized structural parameter of the search space to obtain sampling probabilities corresponding to the local structures of each search block in the search space (Guo, p. 4, 3.3. Differentiable Branching Operation: "During every forward propagation, each child node j makes a discrete decision drawn from a categorical distribution based on the distribution: d_j ... Again d_j ∈ R^I is a one-hot vector with dimension the same as the number of parent nodes I at the current level. ... ¶ To enable differentiability of the discrete sampling function, we use the gumbel-softmax trick ... to relax d_j during backward propagation as ... [Eq. 5] ... with i equal to the sampled index value of parent node during forward pass. The discrete categorical sampling function is approximated by a softmax operation over the parent nodes, and the parameter τ is the temperature that controls how sharp the distribution is after the approximation"); selecting a local structure having a maximum sampling probability in the local structures of each search block as a local structure of the candidate network structure for multi-task prediction (Guo, p. 4, 3.4. Final Architecture Selection: "During the training stage, the network topology distribution and the weight matrices of the network are jointly optimized over the loss L_total across all tasks. Once the validation loss converges, we simply select the final network configuration using the same categorical distribution but without the noise ϵ for every block in the network.... We then re-train the final network architecture from scratch to obtain the final performance"); and combining the local structure of each candidate network structure to obtain the multi-task learning model (Guo, p. 4, 3.4. Final Architecture Selection: "Once the validation loss converges, we simply select the final network configuration using the same categorical distribution but without the noise ϵ for every block in the network.... We then re-train the final network architecture from scratch to obtain the final performance").
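Once training has converged, the "final architecture selection" quoted above reduces to reading off each search block's maximum-probability local structure without the gumbel noise and combining the per-block selections into the model. A minimal assumed sketch (the logit values are invented):

# Assumed sketch: converged structural parameters -> per-block argmax -> combined final structure.
import numpy as np

converged_theta = np.array([[2.1, -0.3, 0.4],   # one row of logits per search block
                            [-1.0, 1.7, 0.2],
                            [0.1, 0.0, 2.5]])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(converged_theta)
selected_local_structures = probs.argmax(axis=1)   # maximum-probability local structure per block
print(selected_local_structures)                   # the selections are combined to form the model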
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding determining the candidate network structure for multi-task prediction from the optimized candidate network structures with the further teachings of Guo regarding wherein determining the candidate network structure for prediction from the optimized candidate network structures according to the optimized structural parameter of the search space as the learning model comprises: performing mapping processing on the optimized structural parameter of the search space to obtain sampling probabilities corresponding to the local structures of each search block in the search space; selecting a local structure having a maximum sampling probability in the local structures of each search block as a local structure of the candidate network structure for prediction; and combining the local structure of each candidate network structure to obtain the learning model. The motivation to do so would be to facilitate training of a multi-task network where the performance of the final trained network is highly correlated with the performance of the network during architecture search (Guo, p. 4, 3.4. Final Architecture Selection: "The same procedure has also been shown effective in previous literature ... where such weight sharing network search schema demonstrates high correlation between the intermediate network performance during search phase and the final performance obtained by re-train the network from scratch").

Claims 3, 4, 10, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wierstra, et al., (US 2019/0354868 A1, hereinafter "Wierstra") in view of Guo, et al., "Learning to Branch for Multi-Task Learning" (hereinafter "Guo") in further view of Ma, et al., "Modeling task relationships in multi-task learning with multi-gate mixture-of-experts" (hereinafter "Ma-1").

Regarding Claim 3, the method of Claim 2 is incorporated. The Wierstra/Guo combination teaches a multi-task learning model wherein each search layer of the plurality of search layers having a plurality of search blocks and each subnetwork layer of the plurality of subnetwork layers having a plurality of subnetwork modules. The Wierstra/Guo combination does not explicitly teach the search block ... comprises a gated node, and the method ... comprises: after performing sampling processing on the outputs of the plurality of subnetwork modules in the respective subnetwork layer: sampling a signal source from a signal source set of the subnetwork layer, the signal source being an output of the input node or an output of a predecessor subnetwork module in the subnetwork layer; predicting the signal source by using the gated node, to obtain a predicted value of each subnetwork module of the plurality of subnetwork modules; and performing normalization processing on the predicted value of each subnetwork module to obtain the weight of each subnetwork module.

However, Ma-1 teaches: the search block ... comprises a gated node (Ma-1, p. 1931, Figure 1, (c) Multi-gate MoE model, depicting block Gate A with connection to gated nodes), and the method ... comprises: after performing sampling processing on the outputs of the plurality of subnetwork modules in the respective subnetwork layer: sampling a signal source from a signal source set of the subnetwork layer, the signal source being an output of the input node or an output of a predecessor subnetwork module in the subnetwork layer (Ma-1, p. 1934, 4.2 Multi-gate Mixture-of-Experts: "The gating networks are simply linear transformations of the input with a softmax layer.... Each gating network can learn to 'select' a subset of experts to use conditioned on the input example," where input data is sampled, as in "Randomly sample an input data point"); predicting the signal source by using the gated node, to obtain a predicted value of each subnetwork module of the plurality of subnetwork modules (Ma-1, p. 1934, 4.2 Multi-gate Mixture-of-Experts: "we add a separate gating network g^k for each task k. More precisely, the output of task k is y_k = h^k(f^k(x)), (6) where f^k(x) = Σ_{i=1}^{n} g^k(x)_i f_i(x) (7) ... The gating networks are simply linear transformations of the input with a softmax layer: g^k(x) = softmax(W_{gk} x) (8) ... Each gating network can learn to 'select' a subset of experts to use conditioned on the input example," where Ma-1's learning to select corresponds to the instant predicting); and performing normalization processing on the predicted value of each subnetwork module to obtain the weight of each subnetwork module (Ma-1, p. 1934, 4.2 Multi-gate Mixture-of-Experts: "The gating networks are simply linear transformations of the input with a softmax layer," where Ma-1's softmax layer corresponds to the instant normalization processing, as in the normalizing exponential function of the instant spec at [0037]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding sampling processing on the outputs of the plurality of subnetwork modules in the respective subnetwork layer with those of Ma-1 regarding after performing sampling processing on the outputs of the plurality of subnetwork modules in the respective subnetwork layer: sampling a signal source from a signal source set of the subnetwork layer, the signal source being an output of the input node or an output of a predecessor subnetwork module in the subnetwork layer and predicting the signal source by using the gated node, to obtain a predicted value of each subnetwork module of the plurality of subnetwork modules. The motivation to do so would be to facilitate more efficient training and improved performance of multi-task models (Ma-1, p. 1931, 1 Introduction: "the gating networks for different tasks can learn different mixture patterns of experts assembling, and thus capture the task relationships. ¶ ... Our approach outperforms baseline methods under this setup, especially when task correlation is low. In this set of experiments, we also discover that MMoE is easier to train and converges to a better loss during multiple runs. This relates to recent discoveries that modulation and gating mechanisms can improve the trainability in training non-convex deep neural networks").

Claims 15 and 20 incorporate substantively all the limitations of Claim 3 in electronic device and non-transitory computer-readable storage medium forms, respectively, and are rejected under the same rationale.
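The Ma-1 (MMoE) equations quoted above can be reproduced numerically as follows. This is an assumed sketch with invented sizes and random weights, not Ma-1's implementation; it only shows how each per-task gate softmax(W_{gk} x) yields normalized weights that mix the expert outputs before the task-specific tower.

# Assumed numeric sketch of MMoE Eqs. 6-8: gate -> weighted mixture of experts -> task tower.
import numpy as np

rng = np.random.default_rng(3)
d, h, n_experts, n_tasks = 6, 4, 3, 2
x = rng.normal(size=d)

experts = [rng.normal(size=(d, h)) for _ in range(n_experts)]    # f_i: expert subnetwork modules
W_g = [rng.normal(size=(n_experts, d)) for _ in range(n_tasks)]  # W_{gk}: per-task gating weights
towers = [rng.normal(size=h) for _ in range(n_tasks)]            # h^k: per-task towers

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for k in range(n_tasks):
    gate = softmax(W_g[k] @ x)                                        # Eq. 8: predicted, then normalized, weights
    mixed = sum(gate[i] * (x @ experts[i]) for i in range(n_experts)) # Eq. 7: weighted sum of expert outputs
    y_k = towers[k] @ mixed                                           # Eq. 6: task-specific output
    print(f"task {k}: gate={np.round(gate, 3)}, y={y_k:.3f}")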
Regarding Claim 4, the method of Claim 3 is incorporated. Guo further teaches: the search space comprises N subnetwork layers and N search layers, and N is a natural number greater than 1 (Guo, p. 3, Figure 1: "Illustration of the proposed branching block," depicting the parent and child subnetwork layers and the branching layer, and p. 3, 3.1. Formulation Setup: "The tree structure in the network is realized by branching operations at certain layers. Each branching layer can have an arbitrary number of child (next) layers up to the computational budget available," where Guo reasonably suggests an arbitrary number N of branching layers and at least N next child layers), and the method further comprises: prior to constructing the search space: sampling outputs of a plurality of subnetwork modules from a first subnetwork layer using an ith search block in a first search layer, wherein i is a positive integer (Guo, p. 3, Figure 1: "Illustration of the proposed branching block. Each child node j is equipped with a categorical distribution so it can sample a parent node to receive input data after the training," depicting first subnetwork layer parent i and subsequent search layer, as in p. 3, 3.2. Network Topological Space: "Figure 1 illustrates a certain block of a DAG which contains parent nodes i for i ∈ {1, …, I} and child nodes j for j ∈ {1, …, J}," and p. 3, 3.2. Network Topological Space: "we construct multiple parent nodes and child nodes for each block and allow a child node to sample a path from all the paths between it and all its parent nodes. The selected connectivities therefore define the tree structure by such sampling (branching) procedure"); performing weighted summation on the outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules when the signal source is the output of the input node, and using a result of the weighted summation as an output of a local structure of the ith search block, to construct a transmission path of the ith search block, until transmission paths of all local structures of the ith search block in the first search layer are constructed (Guo, p. 3, 3.2. Network Topological Space: "we construct multiple parent nodes and child nodes for each block and allow a child node to sample a path from all the paths between it and all its parent nodes. The selected connectivities therefore define the tree structure by such sampling (branching) procedure. We formulate the branching operation at layer l as: x_j^{l+1} = E_{d_j ∼ p_{θ_j}}[d_j · Y^l] (2), where Y^l = [y_1^l, …, y_I^l] concatenates outputs from all parent nodes at layer l, and d_j is an indicator vector sampled from a certain distribution p_{θ_j}. The indicator d_j is a one-hot vector. Hence the dot product in Eq 2 essentially assigns one of the parent nodes to each child node j. In other words, each parent node at layer l propagates its output activations as input x_j^{l+1} to one or more child nodes j based on the sampling distributions," where Guo's dot product corresponds to the instant weighted summation); sampling outputs of a plurality of subnetwork modules from a jth subnetwork layer by using an ith search block in a jth search layer, 1 < j <= N, and j being a positive integer; and performing weighted summation on the outputs of the plurality of subnetwork modules according to a weight of each subnetwork module of the plurality of subnetwork modules when the signal source is an output of a predecessor subnetwork module in the jth subnetwork layer, and using a result of the weighted summation as an output of a local structure of the ith search block in the jth search layer, to construct a transmission path of the ith search block in the jth search layer, until transmission paths of all local structures of the ith search block in the jth search layer are constructed (Guo, p. 3, 3.2. Network Topological Space: "Figure 1 illustrates a certain block of a DAG which contains parent nodes i for i ∈ {1, …, I} and child nodes j for j ∈ {1, …, J}. The nodes can perform any common operations of choice such as convolution or pooling. The input to a certain node is denoted as x and the output is denoted as y. ... ¶ The branching blocks in Figure 1 can be stacked to form a deeper tree-structured neural network (illustrated in Figure 2(d)) and the number of parent nodes and the number of child nodes can be adjusted based on the desired model capacity. ... [O]ur proposed tree-structured network topology is end-to-end trainable -- the network architecture Ω and the weight matrices ω of the network are jointly optimized during training," where Guo's parent/child network layers for i and j greater than 1 correspond to the instant subsequent subnetwork layers).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding constructing a search space between an input node and a plurality of task nodes corresponding to the multiple tasks of the information recommendation system by arranging a plurality of subnetwork layers and a plurality of search layers in a staggered manner therebetween with the further teachings of Guo regarding sampling outputs of the subnetwork modules and performing weighted summation thereupon. The motivation to do so would be to facilitate training of a multi-task model that is end-to-end trainable and is based on desired model capacity (Guo, p. 3, 3.2. Network Topological Space: "The branching blocks in Figure 1 can be stacked to form a deeper tree-structured neural network (illustrated in Figure 2(d)) and the number of parent nodes and the number of child nodes can be adjusted based on the desired model capacity. Different from the greedy layer-wise optimization approach ..., our proposed tree-structured network topology is end-to-end trainable -- the network architecture and the weight matrices ω of the network are jointly optimized during training").
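The staggered arrangement discussed for Claim 4, in which N subnetwork layers alternate with N search layers and the ith search block in the jth search layer carries a distribution over the modules of the jth subnetwork layer, can be sketched as a plain data structure. The layout and names below are assumptions for illustration only, not the claimed structure or Guo's code.

# Assumed sketch: N subnetwork layers staggered with N search layers; each search block holds
# a structural parameter (logits) over the modules of the subnetwork layer it samples from.
import numpy as np

rng = np.random.default_rng(4)
N, modules_per_layer, blocks_per_layer = 3, 4, 2

search_space = []
for j in range(1, N + 1):
    subnetwork_layer = [f"module_{j}_{m}" for m in range(modules_per_layer)]
    search_layer = [
        {"block": f"block_{j}_{i}",
         "theta": rng.normal(size=modules_per_layer)}   # logits over the jth subnetwork layer's modules
        for i in range(blocks_per_layer)
    ]
    search_space.append({"subnetwork_layer": subnetwork_layer, "search_layer": search_layer})

for layer in search_space:
    print(layer["subnetwork_layer"], [b["block"] for b in layer["search_layer"]])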
Regarding Claim 10, the method of Claim 1 is incorporated. The Wierstra/Guo combination teaches: wherein training the network parameter of the candidate network structure to obtain the optimized network parameter of the candidate network structure comprises: performing multi-task prediction processing ... using the candidate network structure to obtain a multi-task prediction result ... (Wierstra, [0006]: "The super neural network also comprises a plurality of sets of one or more output layers, wherein each set of output layers corresponds to a different machine learning task from a plurality of machine learning tasks," where Wierstra's tasks include prediction, as in [0003]: "Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input") ... on the sample data (Wierstra, [0021]: "The method may comprise obtaining first training data for a first machine learning task" and [0023]: "The method may further comprise ... obtaining second training data for a second machine learning task" and [0042]: "the multiple machine learning tasks may include multiple different content recommendation tasks.... The tasks may also include processing input data ... to determine a score representing a likelihood that a resource relates to a particular topic").

Ma-1 further teaches: constructing a loss function of the candidate network structure (Ma-1, p. 1934, Figure 4, "Average performance of MMoE ... on synthetic data with different correlations," depicting model loss against number of training steps, where Ma-1's MMoE [Multi-gate Mixture-of-Experts] corresponds to the instant candidate network structure) according to the multi-task prediction result and a multi-task label of the sample data (Ma-1, p. 1932, 3.2 Synthetic Data Generation: "we generate two regression tasks and use the Pearson correlation of the labels of these two tasks as the quantitative indicator of task relationships ... Specifically, we generate the synthetic data as follows. ... Generate two labels y_1, y_2 for two regression tasks"); updating the network parameter of the candidate network structure (Ma-1, p. 1937, 6.4.1 Experiment Setup: "For the Shared-Bottom model, we implement the shared bottom network as a feedforward neural network with several fully-connected layers with ReLU activation. ... For MMoE, we simply change the top layer of the shared bottom network to an MMoE layer and keep the output hidden units with the same dimensionality. Therefore, we don't add extra noticeable computation costs in model training and serving") until the loss function converges (Ma-1, p. 1931, 1 Introduction: "To understand how MMoE learns its experts and task gating networks for different levels of task relatedness, we conduct a synthetic experiment ... Our approach outperforms baseline methods under this setup, especially when task correlation is low. In this set of experiments, we also discover that MMoE is easier to train and converges to a better loss during multiple runs"); and setting the updated network parameter of the candidate network structure as the optimized network parameter of the candidate network structure when the loss function converges (Ma-1, 5.1 Performance on Data with Different Task Correlations: "We note that the total number of model parameters in the shared experts and the towers is ... 13056. ... All the models are trained with the Adam optimizer" and Ma-1, p. 1931, 1 Introduction: "we also discover that MMoE is easier to train and converges to a better loss during multiple runs").
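The Claim 10 training steps as mapped above can be illustrated with a toy loop: run multi-task prediction on sample data, form a loss against the multi-task labels, and update the network parameter until the loss stops changing. The sketch below uses invented data, linear task heads, and plain gradient descent purely for illustration; the Ma-1 experiments quoted above use the Adam optimizer.

# Assumed sketch: multi-task prediction -> loss against multi-task labels -> update until convergence.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(128, 5))
labels = np.stack([x @ rng.normal(size=5), x @ rng.normal(size=5)], axis=1)  # multi-task labels

W = np.zeros((5, 2))                 # network parameter of the candidate structure (two linear heads)
lr, prev_loss = 0.05, np.inf
for step in range(10_000):
    preds = x @ W                                # multi-task prediction result
    loss = np.mean((preds - labels) ** 2)        # loss function over both tasks
    if abs(prev_loss - loss) < 1e-9:             # convergence test
        break
    prev_loss = loss
    W -= lr * x.T @ (preds - labels) / len(x)    # gradient step on the network parameter

optimized_network_parameter = W                  # set once the loss has converged
print(step, loss)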
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding performing multi-task prediction processing on the sample data using the candidate network structure to obtain a multi-task prediction result of the sample data with those of Ma-1 regarding constructing a loss function of the candidate network structure according to the multi-task prediction result and a multi-task label of the sample data, updating the network parameter of the candidate network structure until the loss function converges, and setting the updated network parameter of the candidate network structure as the optimized network parameter of the candidate network structure when the loss function converges. The motivation to do so would be to facilitate training for multi-task learning with the benefit of comparable knowledge transfer while limiting additional model parameters (Ma-1, 4.2 Multi-gate Mixture-of-Experts: "We propose a new MoE model that is designed to capture the task differences without requiring significantly more model parameters compared to the shared-bottom multi-task model. ... [T]he MMoE only has several additional gating networks, and the number of model parameters in the gating network is negligible. Therefore the whole model still enjoys the benefit of knowledge transfer in multi-task learning as much as possible").

Claims 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Wierstra, et al., (US 2019/0354868 A1, hereinafter "Wierstra") in view of Guo, et al., "Learning to Branch for Multi-Task Learning" (hereinafter "Guo") in further view of Ma, et al., "Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate" (hereinafter "Ma-2").

Regarding Claim 21, the method of Claim 1 is incorporated. The Wierstra/Guo combination teaches a multi-task learning model for predicting multiple tasks of an information recommendation system. The Wierstra/Guo combination does not teach wherein the first task is a click-through rate of the target user after receiving a piece of recommended information and the second task is a degree of completion by the target user after receiving the piece of recommended information. However, Ma-2 teaches: wherein the first task is a click-through rate of the target user after receiving a piece of recommended information and the second task is a degree of completion by the target user after receiving the piece of recommended information (Ma-2, p. 1, 1 Introduction: "In this paper, we focus on the task of post-click CVR [conversion rate] estimation. To simplify the discussion, we take the CVR modeling in recommender system in e-commerce site as an example. Given recommended items, users might click interested ones and further buy some of them. In other words, user actions follow a sequential pattern of impression → click → conversion. In this way, CVR modeling refers to the task of estimating the post-click conversion rate. ... ¶ ... In ESMM [Entire Space Multitask Model], two auxiliary tasks of predicting the post-view click-through rate (CTR) and post-view clickthrough & conversion rate (CTCVR) are introduced," where Ma-2's CTR and CTCVR correspond to the instant first and second tasks, respectively).
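As background on the Ma-2 passage just quoted (context on the reference, not part of the claim mapping): ESMM models the sequential impression → click → conversion pattern with pCTR = p(click | impression), pCVR = p(conversion | click), and pCTCVR = p(click, conversion | impression), so that pCTCVR = pCTR × pCVR and the post-click quantity can be recovered over all impressions as pCVR = pCTCVR / pCTR wherever pCTR > 0.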
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Wierstra/Guo combination regarding a multi-task learning model for predicting multiple tasks of an information recommendation system with those of Ma-2 regarding wherein the first task is a click-through rate of the target user after receiving a piece of recommended information and the second task is a degree of completion by the target user after receiving the piece of recommended information. The motivation to do so would be to alleviate issues relating to sample selection bias and data sparsity while training multi-task models (Ma-2, p. 2, 1 Introduction: "two auxiliary tasks of predicting the post-view click-through rate (CTR) and post-view clickthrough & conversion rate (CTCVR) are introduced. ... Both pCTCVR and pCTR are estimated over the entire space with samples of all impressions, thus the derived pCVR is also applicable over the entire space. It indicates that SSB [sample selection bias] problem is eliminated. Besides, parameters of feature representation of CVR network is shared with CTR network. The latter one is trained with much richer samples. This kind of parameter transfer learning [7] helps to alleviate the DS [data sparsity] trouble remarkablely").

Claims 22 and 23 incorporate substantively all the limitations of Claim 21 in electronic device and non-transitory computer-readable storage medium forms, respectively, and are rejected under the same rationale.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT N DAY whose telephone number is (703)756-1519. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/R.N.D./
Examiner, Art Unit 2122

/KAKALI CHAKI/
Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Aug 08, 2022
Application Filed
Aug 25, 2025
Non-Final Rejection — §101, §103
Nov 26, 2025
Response Filed
Mar 05, 2026
Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12406181
METHOD, DEVICE, AND COMPUTER PROGRAM PRODUCT FOR UPDATING MODEL
2y 5m to grant Granted Sep 02, 2025
Patent 12229685
MODEL SUITABILITY COEFFICIENTS BASED ON GENERATIVE ADVERSARIAL NETWORKS AND ACTIVATION MAPS
2y 5m to grant Granted Feb 18, 2025
Study what changed to get past this examiner. Based on 2 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
23%
Grant Probability
46%
With Interview (+23.2%)
4y 3m
Median Time to Grant
Moderate
PTA Risk
Based on 22 resolved cases by this examiner. Grant probability derived from career allow rate.
