Prosecution Insights
Last updated: April 19, 2026
Application No. 18/341,892

METHOD AND APPARATUS WITH MACHINE LEARNING MODEL

Non-Final OA: §101, §103, §112
Filed: Jun 27, 2023
Examiner: LU, HWEI-MIN
Art Unit: 2142
Tech Center: 2100 — Computer Architecture & Software
Assignee: Samsung Electronics Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 62% (Moderate)
OA Rounds: 1-2
To Grant: 3y 1m
With Interview: 99%

Examiner Intelligence

Grants 62% of resolved cases.

Career Allow Rate: 62% (134 granted / 217 resolved; +6.8% vs TC avg) (see the quick cross-check below)
Interview Lift: +39.5% for resolved cases with an interview versus without (strong, roughly +40%)
Typical Timeline: 3y 1m average prosecution
Currently Pending: 37 applications
Career History: 254 total applications across all art units
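As a quick cross-check of the allow-rate figure above, the displayed 62% follows directly from the granted and resolved counts; the snippet below only restates that arithmetic, and the rounding to a whole percent is an assumption about how the dashboard displays the value.

```python
granted, resolved = 134, 217
allow_rate = granted / resolved
print(f"{allow_rate:.1%}")  # 61.8%, shown above rounded to 62%
```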

Statute-Specific Performance

§101: 11.2% (-28.8% vs TC avg)
§103: 43.8% (+3.8% vs TC avg)
§102: 9.4% (-30.6% vs TC avg)
§112: 33.0% (-7.0% vs TC avg)

Tech Center averages are estimates. Based on career data from 217 resolved cases.

Office Action

§101 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office action is responsive to the following communication(s): the original application filed on 06/27/2023; said application claims a priority filing date of 11/02/2022. Claims 1-20 are pending. Claims 1, 13, and 20 are independent.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: PR21. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b), are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 4-6 and 8-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 4 recites the limitation "the outputting of the first soft label" in line 1. There is insufficient antecedent basis for this limitation in the claim. Since "the outputting of the first soft label" is recited in Claim 3 and not in Claim 2, for examination purposes Claim 4 is interpreted as depending from Claim 3 instead of Claim 2.

Claim 5 recites the limitation "the correcting of the first label and the first prediction label" in lines 1-2. There is insufficient antecedent basis for this limitation in the claim. Since "the correcting of the first label and the first prediction label" is recited in Claim 4 and not in Claim 3, for examination purposes Claim 5 is interpreted as depending from Claim 4 instead of Claim 3.

Claim 6 recites the limitation "the first soft label" in line 6. There is insufficient antecedent basis for this limitation in the claim. Since "the first soft label" is recited in Claim 3 and not in Claim 2, for examination purposes Claim 6 is interpreted as depending from Claim 3 instead of Claim 2.

Claims 8 and 10 are rejected for fully incorporating the deficiencies of their respective base claims.

Claim 8 recites the limitation "the outputting of the second soft label" in line. There is insufficient antecedent basis for this limitation in the claim.
Since "the outputting of the second soft label" is recited in Claim 7 and not Claim 6, for examination purpose, Claim 8 is depending on Claim 7 instead of Claim 6. Claim 9 recites the limitation "the correcting of the second label and the second prediction label" in lines 1-2. There is insufficient antecedent basis for this limitation in the claim. Since "the correcting of the second label and the second prediction label" is recited in Claim 8 and not Claim 7, for examination purpose, Claim 9 is depending on Claim 8 instead of Claim 7. Claim 10 recites the limitation "the second soft label" in line 6. There is insufficient antecedent basis for this limitation in the claim. Since "the second soft label" is recited in Claim 7 and not in Claim 6, for examination purpose, Claim 10 is depending on Claim 7 instead of Claim 6. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract idea without significantly more. Independent Claim 1 Step 1: Claim 1 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of ". Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 2 Step 1: Claim 2 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 3 Step 1: Claim 3 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 4 Step 1: Claim 4 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 5 Step 1: Claim 5 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . 
Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 6 Step 1: Claim 6 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 7 Step 1: Claim 7 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 8 Step 1: Claim 8 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 9 Step 1: Claim 9 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 10 Step 1: Claim 10 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 11 Step 1: Claim 11 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 12 Step 1: Claim 12 is an apparatus claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. 
Independent Claim 13 Step 1: Claim 13 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of ". Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 14 Step 1: Claim 14 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 15 Step 1: Claim 15 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 16 Step 1: Claim 16 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 17 Step 1: Claim 17 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 18 Step 1: Claim 18 is a process claim which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim 19 Step 1: Claim 19 is a claim for a non-transitory computer-readable storage medium which is within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) does/do further recite the elements/limitations. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) . Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Independent Claim 20 Step 1: Claim 20 is apparatus claim which is within at least one of the four categories of patent eligible subject matter. 
Step 2A Prong 1: The claim(s) recite(s) ". Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of ". Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because t. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("DIVIDEMIX: LEARNING WITH NOISY LABELS AS SEMI-SUPERVISED LEARNING", in ICLR 2020, arXiv:2002.07394, Feb. 18, 2020, pp. 1-14), hereinafter Li in view of Gao et al. ("A Novel Semi-Supervised Learning Approach for Network Intrusion Detection on Cloud-Based Robotic System ", IEEE Access, VOLUME 6, 2018, September 13, 2018, pp. 50927-50938), hereinafter Gao and Shao et al. ("Ensemble Learning with Manifold-Based Data Splitting for Noisy Label Correction", arXiv:2103.07641, Mar 13, 2021, pp. 1-12), hereinafter Shao. Independent Claim 1 Li discloses an apparatus, the apparatus comprising: one or more processors (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU) configured to: (Li, ABSTRACT in Page 1: dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples; Section 1 in Pages 1-2: DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; Section 3 in Page 3: divide the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U); FIG. 
1 in Page 3: at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy); Section 3.1 in Pages 3-5: aim to find the probability of a sample being clean by fitting a mixture model to the per-sample loss distribution; Gaussian Mixture Model (GMM) can better distinguish clean and noisy samples due to its flexibility in the sharpness of distribution; therefore, fit a two-component GMM to l using the Expectation-Maximization algorithm; divide the training data into a labeled set and an unlabeled set by setting a threshold τ on wi); train a first neural network using a semi-supervised learning scheme based on the first training data set comprising the first label, and an unlabeled second training data set, and train a second neural network using the semi-supervised learning scheme based on the first training data set comprising the first label, and an unlabeled second training data set (Li, ABSTRACT in Page 1: propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques; DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner; to avoid confirmation bias, simultaneously train two diverged networks where each network uses the dataset division from the other network; during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: propose DivideMix, which addresses learning with label noise in a semi-supervised manner; different from most existing LNL (learning with noisy labels) approaches, DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; propose co-divide, which trains two networks simultaneously; for each network, dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; co-divide keeps the two networks diverged, so that they can filter different types of error and avoid confirmation bias in self-training; during SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network’s predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; Section 2.1 in Page 2: the method discards the labels that are highly likely to be noisy, and utilize the noisy samples as unlabeled data to regularize training in a SSL manner; the method can avoid the confirmation bias problem by training two networks to filter error for each other; compared to Co-teaching and Co-teaching+, the method is more robust to noise by enabling the two networks to teach each other implicitly at each epoch (co-divide) and explicitly at each mini-batch (label co-refinement and co-guessing); Section 3 in Page 3: to avoid confirmation bias of self-training where the model would accumulate its errors, simultaneously train two 
networks to filter errors for each other through epoch-level implicit teaching and batch-level explicit teaching; at each epoch, perform co-divide, where one network divides the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U), which are then used by the other network; at each mini-batch, one network utilizes both labeled and unlabeled samples to perform semi-supervised learning guided by the other network; FIG. 1 in Page 3: DivideMix trains two networks (A and B) simultaneously; at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide); at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.1 in Pages 3-5: training a model using the data divided by itself could lead to confirmation bias (i.e. the model is prone to confirm its mistakes), as noisy samples that are wrongly grouped into the labeled set would keep having lower loss due to the model overfitting to their labels; therefore, propose co-divide to avoid error accumulation; in co-divide, the GMM for one network is used to divide training data for the other network; the two networks are kept diverged from each other due to different (random) parameter initialization, different training data division, different (random) mini-batch sequence, and different training targets; being diverged offers the two networks distinct abilities to filter different types of error, making the model more robust to noise; Section 3.2 in Pages 5-6: at each epoch, having divided the training data, train the two networks one at a time while keeping the other one fixed; MixMatch utilizes unlabeled data by merging consistency regularization (i.e. encourage the model to output same predictions on perturbed unlabeled data) and entropy minimization (i.e. encourage the model to output confident predictions on unlabeled data) with the MixUp augmentation (i.e. encourage the model to have linear behavior between samples); to account for label noise, take two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label yb with the network’s prediction pb (averaged across multiple augmentations of xb), guided by the clean probability wb produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X ^ (and U ^ ) which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X ^ and U ^ ; to prevent assigning all samples to a single class, apply the regularization term which uses a uniform prior distribution π (i.e. πc = 1/C) to regularize the model’s average output across all samples in the mini-batch). 
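For orientation only, below is a minimal sketch of the "co-divide" step described in the Li passages quoted above: fit a two-component Gaussian Mixture Model to the per-sample training losses and threshold the clean-component posterior to split the data into a (mostly clean) labeled set and a (mostly noisy) unlabeled set, with one network's split used as the other network's training division. The function name, the min-max normalization, the threshold value, and the use of scikit-learn's GaussianMixture are illustrative assumptions, not details taken from the reference.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, tau=0.5):
    """Split sample indices into a (mostly clean) labeled set and a
    (mostly noisy) unlabeled set by fitting a two-component GMM to the
    per-sample loss distribution, as the cited Li passages describe.
    `tau` is an illustrative threshold on the clean-component posterior."""
    losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    # Normalize losses to [0, 1] so the threshold is comparable across epochs.
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    # The component with the smaller mean models the clean samples.
    clean_component = np.argmin(gmm.means_.ravel())
    w = gmm.predict_proba(losses)[:, clean_component]  # P(clean | loss)
    labeled_idx = np.where(w >= tau)[0]    # kept together with their labels
    unlabeled_idx = np.where(w < tau)[0]   # labels discarded, used as unlabeled data
    return labeled_idx, unlabeled_idx, w

# Illustrative use: the split produced from network A's losses would serve as
# the training-data division for network B, and vice versa (co-divide).
```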
Li fails to explicitly disclose (1) randomly split a training data set into a first training data set and a second training data set; (2) train a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set.

Gao teaches a system and a method relating to Semi-Supervised Learning (Gao, Title), wherein randomly split a training data set into a first training data set and a second training data set (Gao, Section IV.A in Pages 50933-50934: from the training dataset, randomly select 2000 examples as the labeled data Sl and the remaining examples are used as the unlabeled data Su; Section II.A.2) in Page 50929: co-training is also a well-known semi-supervised learning method; unlike self-training, it splits the features into two disjoint views and then separately trains two classifiers in an iterative manner; for successful training, one hypothesis should be satisfied, that is, two views should be conditionally independent given the categorical attributes; as a result, a better feature-splitting method seems to be more important; however, this is not easy work; Feger and Koprinska have tried to find the optimal splitting by using conditional mutual information; unfortunately, they failed to improve the performance, as random splitting performed better; Nigam and Ghani also showed that the random splitting method appears to be better in performance with sufficient redundancy in data; Salaheldin and El Gayar proposed a new feature-splitting method in which the best splitting point is obtained by using GA; they finally found that their method was competitive with random splitting; despite the difficulty of finding the best splitting, co-training is still a popular approach to implementing semi-supervised learning).

Li and Gao are analogous art because they are from the same field of endeavor, a system and a method relating to Semi-Supervised Learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Gao to Li. Motivation for doing so would be to improve performance (Gao, Section II.A.2) in Page 50929).

Li in view of Gao fails to explicitly disclose: train a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set.
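Likewise, the following is a minimal sketch of the randomly-split, cross-wise labeled/unlabeled arrangement at issue in this limitation: split a labeled training set into two halves, then give each network one half with its labels and the other half with the labels removed. The function name, the half-and-half split, and the NumPy-based shuffling are illustrative assumptions rather than details from Gao or from the claims.

```python
import numpy as np

def random_split_for_cross_training(samples, labels, seed=0):
    """Randomly split a labeled training set (NumPy arrays) into two halves.
    Each half keeps its labels for one network and, with labels removed,
    serves as the unlabeled set for the other network (the cross-wise
    arrangement recited in the claim limitation). Illustrative only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    half = len(samples) // 2
    first_idx, second_idx = order[:half], order[half:]

    first_set = {"x": samples[first_idx], "y": labels[first_idx]}     # first training data set + first label
    second_set = {"x": samples[second_idx], "y": labels[second_idx]}  # second training data set + second label

    # Network 1: labeled first set + unlabeled second set (labels dropped).
    net1_data = (first_set, {"x": second_set["x"]})
    # Network 2: labeled second set + unlabeled first set (labels dropped).
    net2_data = (second_set, {"x": first_set["x"]})
    return net1_data, net2_data
```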
Shao teaches a system and a method relating to machine learning on noisy data (Shao, Abstract) , wherein train a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set (Shao, Abstract in Page 1: focus on the problem that noisy labels are primarily mislabeled samples, which tend to be concentrated near decision boundaries, rather than uniformly distributed, and whose features should be equivocal; propose an ensemble learning method to correct noisy labels by exploiting the local structures of feature manifolds; different from typical ensemble strategies that increase the prediction diversity among sub-models via certain loss terms, the method trains sub-models on disjoint subsets, each being a union of the nearest-neighbors of randomly selected seed samples on the data manifold; as a result, each sub-model can learn a coarse representation of the data manifold along with a corresponding graph; moreover, only a limited number of sub-models will be affected by locally-concentrated noisy labels; the constructed graphs are used to suggest a series of label correction candidates, and accordingly, the method derives label correction results by voting down inconsistent suggestions; Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; as illustrated in Fig. 
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract totally M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for given the m-th partial-label set ( x j m , y j m ) [Symbol font/0xCE] Sm where only labels y j m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets is unlabeled for m-th ensemble branch); as a result, use the M partial-label sets to predict totally M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels y ^ l m ∈ Y ^ l m and the M sample-correction sets x l m ,   y ^ l m ∀ x l m ∈ S m   obtained in the l-th training epoch, generate another M × M label propagation suggestions; finally, derive the most likely correction result Y - = y - i from the total 2×M2 label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi [Symbol font/0xCE] X ; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., ( X , Y -   ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data slitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter per local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to a as good performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and 
the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M·B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ^m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ Sm and ii) the sample-correction pair (x_j^m, ŷ_j^{m,l}) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^{m,l} denotes the j-th sample's label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset Sm and ii) sample-correction pairs (x_j^m, ŷ_j^{m,l}) in turn through each of the M graphs to obtain totally 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch).

Li and Shao are analogous art because they are from the same field of endeavor, a system and a method relating to machine learning on noisy data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Shao to Li. Motivation for doing so would be to improve performance.

Claim 2

Li in view of Gao and Shao discloses all the elements as stated in Claim 1 and further discloses wherein the unlabeled first training data set is generated by removing the first label from the first training data set, and the unlabeled second training data set is generated by removing the second label from the second training data set (Shao, Section III.A with FIG. 2 in Pages 3-4: Fig.
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract totally M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for given the m-th partial-label set ( x j m , y j m ) [Symbol font/0xCE] Sm where only labels y j m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets is unlabeled for m-th ensemble branch); as a result, use the M partial-label sets to predict totally M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels y ^ l m ∈ Y ^ l m and the M sample-correction sets x l m ,   y ^ l m ∀ x l m ∈ S m   obtained in the l-th training epoch, generate another M × M label propagation suggestions; finally, derive the most likely correction result Y - = y - i from the total 2×M2 label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi [Symbol font/0xCE] X ; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., ( X , Y -   ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data slitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly). Claim 3 Li in view of Gao and Shao discloses all the elements as stated in Claim 1 and further discloses wherein, for the training of the first neural network, the one or more processors are configured to: output a first soft label by correcting the first training data set; and train the first neural network using the semi-supervised learning scheme based on the first training data set, the first soft label, and the unlabeled second training data set (Shao, Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 
2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually considered as soft-labels for training another student model; as illustrated in Fig. 2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract totally M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for given the m-th partial-label set ( x j m , y j m ) [Symbol font/0xCE] Sm where only labels y j m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets is unlabeled for m-th ensemble branch); as a result, use the M partial-label sets to predict totally M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels y ^ l m ∈ Y ^ l m and the M sample-correction sets x l m ,   y ^ l m ∀ x l m ∈ S m   obtained in the l-th training epoch, generate another M × M label propagation suggestions; finally, derive the most likely correction result Y - = y - i from the total 2×M2 label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi [Symbol font/0xCE] X ; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., ( X , Y -   ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data slitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter per local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to a as good performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and 
the baseline; since real-word label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M•B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each resemble branch to learn a different coarse approximation of the data manifold to prevent the resemble branches from overfitting and being biased to the same training data; initialize the model parameter θ l m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f i m = f x i ,   θ m extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample label pair ( x j m , y j m ) [Symbol font/0xCE] Sm and ii) the sample-correction pair ( x j m , y ^ j m , l ) obtained in the l-th training epoch as different starting partial-labels, where y - j m , l denotes the j-th sample’s label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirectional adjacency matrices, each representing the m-th graph constructed based on f i m , for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs ( x j m , y j m ) in the m-th disjoint subset Sm and ii) sample-correct pairs ( x j m , y ^ j m , l ) in turn through each of the M graphs to obtain totally 2M2 label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch; Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived the ensemble branches, to compute both the MAE loss and the cross entropy loss). Claim 4 Li in view of Gao and Shao discloses all the elements as stated in Claim 3 (see 112 rejection to Claim 4) and further discloses wherein, for the outputting of the first soft label, control the second neural network to estimate a first prediction label for the first training data set based on the first training data set; and correct the first label and the first prediction label to output the first soft label (Shao, Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 
2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually considered as soft-labels for training another student model; as illustrated in Fig. 2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; the ensemble learning scheme aims to split D into M disjoint subsets, namely S_m for m = 1, …, M, for training M ensemble branches; during training, the method first trains M classifiers Φ_m with parameters θ_m through the M ensemble branches, independently on the M disjoint subsets S_m derived by the data splitting strategy; second, by feeding all training samples x_i into each classifier, it extracts M different feature vector sets in total, each of which can be used to derive one graph representation of the data manifold of the training dataset X = {x_i}; third, for the given m-th partial-label set (x_j^m, y_j^m) ∈ S_m, where only the labels y_j^m belonging to S_m are available, it constructs the corresponding graph representation and uses the graph to predict label correction suggestions through label propagation for all data samples in the remaining M−1 partial-label sets (i.e., the remaining M−1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, the M partial-label sets are used to predict M×M label correction suggestions in total, based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets {(x_l^m, ŷ_l^m) for all x_l^m ∈ S_m} obtained in the l-th training epoch, another M×M label propagation suggestions are generated; finally, the most likely correction result Ȳ = {ȳ_i} is derived from the total 2M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets S_m, each being the training set for one ensemble branch; in this way, S_m is expected to retain the shapes of k randomly-selected local neighborhoods, whether noisy or not, as well as to assemble a random subset of x_i ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all x_i; meanwhile, the corrected label branch is trained on the corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via the local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by the majority decision; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into the disjoint subsets; when a noisy local neighborhood contaminates only a minority of ensemble branches, the method can vote down the negative influence of the noisy labels by a majority decision; this case leads to a performance leap for the method: the performance upper bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution, so the ensemble model performs only as well as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide performance between the upper bound and the baseline; since real-world label noise tends to be concentrated near decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M·B packages into M disjoint subsets; each subset S_m therefore stands for a coarse global approximation of the source data manifold while holding fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold, preventing the ensemble branches from overfitting and being biased to the same training data; initialize the model parameters θ_l^m for each ensemble branch to guarantee fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ^m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pairs (x_j^m, y_j^m) ∈ S_m and ii) the sample-correction pairs (x_j^m, ŷ_j^{m,l}) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^{m,l} denotes the j-th sample's label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works on the features extracted by the classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed from f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset S_m and ii) the sample-correction pairs (x_j^m, ŷ_j^{m,l}) in turn through each of the M graphs to obtain 2M² label correction suggestions in total; finally, the partial-label suggested by the m-th ensemble branch based on the label information of S_j becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch; Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss). Claim 5 Li in view of Gao and Shao discloses all the elements as stated in Claim 4 (see 112 rejection to Claim 5) and further discloses wherein, for the correcting of the first label and the first prediction label, perform a convex combination on the first label and the first prediction label to output the first soft label (Shao, Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss).
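For orientation on the Shao passage quoted above, the splitting-and-voting idea the examiner relies on can be sketched in a few lines of NumPy. This is an illustrative simplification, not Shao's implementation: the per-class "packages" of nearest neighbors, the nearest-neighbor label suggestion inside each branch, and all function and variable names are assumptions made for the sketch (Shao's branches are trained classifiers and the suggestions come from graph-based label propagation rather than raw k-NN lookups).

```python
import numpy as np

def knn_split(features, labels, M, k, rng):
    """Group each class into packages of a seed plus its (k-1) nearest unassigned
    neighbours, then scatter the packages uniformly at random over M disjoint subsets."""
    subsets = [[] for _ in range(M)]
    for c in np.unique(labels):
        remaining = list(rng.permutation(np.where(labels == c)[0]))
        while remaining:
            seed = remaining.pop(0)
            rest = np.array(remaining, dtype=int)
            if len(rest) > 0:
                d = np.linalg.norm(features[rest] - features[seed], axis=1)
                nearest = rest[np.argsort(d)[:k - 1]]
            else:
                nearest = np.array([], dtype=int)
            package = [int(seed)] + nearest.tolist()
            taken = set(nearest.tolist())
            remaining = [r for r in remaining if r not in taken]
            subsets[rng.integers(M)].extend(package)
    return [np.array(s, dtype=int) for s in subsets]

def branch_suggestions(features, labels, subsets, k=5):
    """Each branch suggests a label for every sample via a k-NN vote inside its own subset."""
    n = len(labels)
    out = np.zeros((len(subsets), n), dtype=labels.dtype)
    for m, sub in enumerate(subsets):
        sub_feats, sub_labels = features[sub], labels[sub]
        for i in range(n):
            d = np.linalg.norm(sub_feats - features[i], axis=1)
            nn = np.argsort(d)[:k]
            vals, counts = np.unique(sub_labels[nn], return_counts=True)
            out[m, i] = vals[np.argmax(counts)]
    return out

def majority_correction(suggestions):
    """Majority decision across branches; ties keep the first most frequent label."""
    corrected = np.empty(suggestions.shape[1], dtype=suggestions.dtype)
    for i in range(suggestions.shape[1]):
        vals, counts = np.unique(suggestions[:, i], return_counts=True)
        corrected[i] = vals[np.argmax(counts)]
    return corrected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))          # toy features
y_noisy = rng.integers(0, 3, size=300)  # toy (noisy) labels
subsets = knn_split(X, y_noisy, M=4, k=8, rng=rng)
y_corrected = majority_correction(branch_suggestions(X, y_noisy, subsets))
```

Because each local neighborhood lands in only one subset, a concentrated patch of bad labels can outvote the other branches only within that one branch, which is the intuition the quoted passage describes.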
Claim 6 Li in view of Gao and Shao discloses all the elements as stated in Claim 3 (see 112 rejection to Claim 6) and further discloses wherein, for the training of the first neural network, control the first neural network to output a second pseudo label for the unlabeled second training data set; and train the first neural network using the semi-supervised learning scheme based on the first training data set, the first soft label, the unlabeled second training data set, and the second pseudo label (Shao, Section I in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types, random label noise and confusing label noise, as illustrated in Fig. 1; the former typically involves mismatched descriptions or tags, usually due to the negligence of an annotator; for this type of error, not only may a label error occur randomly in the sample space, but the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected to a category neighboring the current one in the feature space; propose an ensemble-based label correction algorithm that exploits the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually taken as soft-labels for training another student model; as illustrated in Fig. 2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using the k-NN splitting scheme, each noisy label, along with its k nearest neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures, so that such singular structures can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample's nearest neighbors in the individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived from the graphs accessed by the ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of the source data while representing a coarse global approximation; this design limits the influence of mislabeled instances to a minority of ensemble branches; the design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of the label correction result, especially for datasets primarily containing confusing label noise; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantage of both nearest neighbors and manifold structures with the aid of graphs; Section II.D of Page 3: label propagation is a graph-based semi-supervised method for generating pseudo-labels of unlabeled data based on given anchors; FIG. 2 in Page 4: framework of the proposed ensemble-based label correction scheme; the proposed method has primarily three branches, namely i) a noisy label branch controlled by L_n, ii) a corrected label branch controlled by L_pseudo, and iii) ensemble branches controlled by L_G; the design focuses on the ensemble branches implemented in three phases; the noisy label branch performs feature embedding and is also used to regulate the certainty weight jointly with the corrected label branch; only the corrected label branch is used at deployment; the three phases of the ensemble branches are detailed in Sections III-B–III-D; note that y^(n) denotes the original label given by the noisy dataset; Section III with FIGS. 2-3 in Pages 3-6 and Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: see the same passages quoted for Claim 4 above; in addition, adopt the weighted cross-entropy in eqn. (9) to measure the loss of the corrected (pseudo) label branch).
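Section II.D of Shao, as quoted above, describes label propagation as a graph-based semi-supervised method that spreads anchor labels to unlabeled samples over a feature-similarity graph. A minimal single-graph sketch of that mechanism is below, using the classic F ← αSF + (1 − α)Y iteration; the kernel width, the propagation weight α, and the toy data are illustrative assumptions, and Shao's multi-graph scheme of eqns. (4)-(5) is not reproduced here.

```python
import numpy as np

def label_propagation(features, anchor_idx, anchor_labels, num_classes,
                      sigma=1.0, alpha=0.99, iters=50):
    """Propagate one-hot anchor labels over a normalized similarity graph:
    F <- alpha * S @ F + (1 - alpha) * Y (the classic Zhou et al. update)."""
    n = len(features)
    # Gaussian-kernel affinity with zero self-similarity
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalization S = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # seed matrix: one-hot rows for anchors, zeros elsewhere
    Y = np.zeros((n, num_classes))
    Y[anchor_idx, anchor_labels] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)  # pseudo-labels for every sample

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
anchors = np.array([0, 30])                         # one labeled anchor per cluster
pseudo = label_propagation(X, anchors, np.array([0, 1]), num_classes=2)
```

In the multi-graph setting the same propagation would be run once per ensemble branch, each branch seeding only the labels of its own subset, which is what produces the 2M² suggestions referenced in the quoted passage.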
Claim 7 Li in view of Gao and Shao discloses all the elements as stated in Claim 1 and further discloses wherein, for the training of the second neural network, output a second soft label by correcting the second training data set; and train the second neural network using the semi-supervised learning scheme based on the second training data set, the second soft label, and the unlabeled first training data set (Shao, Section I in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4, Section III with FIGS. 2-3 in Pages 3-6, and Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: see the same passages quoted for Claims 4 and 6 above).
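Claim 5 above (and Claims 8-9 below), together with the Shao and Li passages mapped to them, turn on forming a soft label as a convex combination of the given, possibly noisy, label and a model prediction, optionally followed by sharpening. A minimal sketch of that operation is below; the mixing weight w and the temperature T are illustrative parameters, not values taken from either reference.

```python
import numpy as np

def refine_label(one_hot_label, prediction, w, T=0.5):
    """Convex combination of the (possibly noisy) label and the model prediction,
    followed by temperature sharpening so the soft label stays close to a distribution peak."""
    soft = w * one_hot_label + (1.0 - w) * prediction  # convex combination, w in [0, 1]
    sharpened = soft ** (1.0 / T)                      # lower T -> more confident
    return sharpened / sharpened.sum()

label = np.array([0.0, 1.0, 0.0])              # annotated class 1
pred = np.array([0.1, 0.3, 0.6])               # model believes class 2
soft_label = refine_label(label, pred, w=0.7)  # w: estimated probability the label is clean
```

When w is read as a per-sample certainty (Shao's eqn. (6) weight, or Li's clean probability from the other network's GMM), this single line is the "convex combination" limitation the examiner maps in Claims 5 and 9.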
Claim 8 Li in view of Gao and Shao discloses all the elements as stated in Claim 7 (see 112 rejection to Claim 8) and further discloses wherein, for the outputting of the second soft label, control the first neural network to estimate a second prediction label for the second training data set based on the second training data set; and correct the second label and the second prediction label to output the second soft label (Shao, Section I in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4, Section III with FIGS. 2-3 in Pages 3-6, and Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: see the same passages quoted for Claims 4 and 6 above).
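The Shao passages cited above guide training with a weighted cross-entropy loss in which each corrected label contributes according to a normalized confidence (certainty) weight, per eqns. (6) and (9). The sketch below is a generic weighted cross entropy written under that reading; it does not reproduce Shao's exact formulas, and the example numbers are arbitrary.

```python
import numpy as np

def weighted_cross_entropy(probs, targets, weights, eps=1e-12):
    """Per-sample cross entropy against (soft) target labels, scaled by
    normalized confidence weights and summed over the batch."""
    weights = weights / (weights.sum() + eps)            # normalize certainty weights
    ce = -np.sum(targets * np.log(probs + eps), axis=1)  # per-sample cross entropy
    return np.sum(weights * ce)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.5, 0.3]])    # model outputs (already softmaxed)
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.8, 0.2]])  # hard and soft corrected labels
conf = np.array([0.9, 0.4])            # per-sample certainty weights
loss = weighted_cross_entropy(probs, targets, conf)
```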
Claim 9 Li in view of Gao and Shao discloses all the elements as stated in Claim 8 (see 112 rejection to Claim 9) and further discloses wherein, for the correcting of the second label and the second prediction label, perform a convex combination on the second label and the second prediction label to output the second soft label (Shao, Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss).
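The "Corrected label branch" passage quoted for Claims 4, 5, and 9 computes both an MAE loss and a cross-entropy loss against the same convex-combination soft label. A small sketch of computing the two terms is below; the equal weighting of the terms in the usage line is an assumption for illustration only.

```python
import numpy as np

def soft_label_losses(probs, soft_targets, eps=1e-12):
    """Mean absolute error and cross entropy measured against the same soft label."""
    mae = np.mean(np.sum(np.abs(probs - soft_targets), axis=1))
    ce = np.mean(-np.sum(soft_targets * np.log(probs + eps), axis=1))
    return mae, ce

probs = np.array([[0.6, 0.3, 0.1]])
soft = np.array([[0.7, 0.2, 0.1]])   # e.g. the output of the convex combination sketched above
mae, ce = soft_label_losses(probs, soft)
total = 0.5 * mae + 0.5 * ce         # illustrative equal weighting of the two terms
```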
Claim 10 Li in view of Gao and Shao discloses all the elements as stated in Claim 7 (see 112 rejection to Claim 10) and further discloses wherein, for the training of the second neural network, control the second neural network to output a first pseudo label for the unlabeled first training data set; and train the second neural network using the semi-supervised learning scheme based on the second training data set, the second soft label, the unlabeled first training data set, and the first pseudo label (Shao, Section I in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4, Section II.D of Page 3, FIG. 2 in Page 4, Section III with FIGS. 2-3 in Pages 3-6, and Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: see the same passages quoted for Claims 4 and 6 above, including the weighted cross-entropy of eqn. (9) used to measure the loss of the corrected (pseudo) label branch).
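Claims 6 and 10 recite semi-supervised training that combines labeled data carrying soft labels with unlabeled data carrying network-generated pseudo labels. The fragment below sketches one generic way to assemble such a training batch; the confidence threshold and the predict_fn interface are hypothetical stand-ins and are not drawn from Li, Gao, or Shao.

```python
import numpy as np

def make_pseudo_labels(predict_fn, unlabeled_x, threshold=0.95):
    """Keep only confidently predicted unlabeled samples as pseudo-labeled data."""
    probs = predict_fn(unlabeled_x)            # (n, num_classes) class probabilities
    conf = probs.max(axis=1)
    keep = conf >= threshold
    one_hot = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    return unlabeled_x[keep], one_hot[keep]

def semi_supervised_batch(labeled_x, soft_labels, unlabeled_x, predict_fn):
    """Concatenate labeled samples (with soft labels) and confidently pseudo-labeled
    unlabeled samples into one training batch."""
    px, py = make_pseudo_labels(predict_fn, unlabeled_x)
    batch_x = np.concatenate([labeled_x, px], axis=0)
    batch_y = np.concatenate([soft_labels, py], axis=0)
    return batch_x, batch_y

# toy usage with a hypothetical constant predictor standing in for the first/second network
predict_fn = lambda x: np.tile(np.array([0.96, 0.02, 0.02]), (len(x), 1))
bx, by = semi_supervised_batch(np.zeros((4, 8)), np.eye(3)[[0, 1, 2, 0]],
                               np.ones((6, 8)), predict_fn)
```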
Claim 11 Li in view of Gao and Shao discloses all the elements as stated in Claim 1 and further discloses control a machine learning model to estimate a prediction label for input data, wherein the machine learning model comprises the trained first neural network and the trained second neural network (Li, ABSTRACT in Page 1: during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: during the SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network's predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; FIG. 1 in Page 3: at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.2 in Pages 5-6: to account for label noise, make two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label y_b with the network's prediction p_b (averaged across multiple augmentations of x_b), guided by the clean probability w_b produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (Algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X̂ (and Û), which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X̂ and Û).
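The Li passages cited for Claim 11 describe co-guessing labels for unlabeled samples by averaging and sharpening the predictions of both networks, followed by MixUp-style interpolation over the combined batch. A simplified sketch of those two operations follows; the Beta parameter, the temperature, and the helper names are illustrative and do not reproduce Li's Algorithm 1.

```python
import numpy as np

def co_guess(probs_net_a, probs_net_b, T=0.5):
    """Average the two networks' predictions for unlabeled samples and sharpen the result."""
    avg = 0.5 * (probs_net_a + probs_net_b)
    sharp = avg ** (1.0 / T)
    return sharp / sharp.sum(axis=1, keepdims=True)

def mixup(x, y, alpha=4.0, rng=None):
    """Interpolate each sample/label with a randomly chosen partner from the same batch."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)          # keep the result closer to the original sample
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

pa = np.array([[0.6, 0.4], [0.3, 0.7]])
pb = np.array([[0.8, 0.2], [0.4, 0.6]])
guessed = co_guess(pa, pb)             # guessed labels for two unlabeled samples
mixed_x, mixed_y = mixup(np.arange(8.0).reshape(2, 4), guessed)
```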
Claim 12 Li in view of Gao and Shao discloses all the elements as stated in Claim 1 and further discloses a memory storing instructions that, when executed by the one or more processors, configure the one or more processors (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU) (Gao, Section IV.B.1 in Page 50934: the configuration of the computer is listed as follows: the Windows 10 operating system, an Intel i5-7400 CPU @ 3.00GHz, and 8GB main memory) to perform the randomly splitting of the training data set, the training of the first neural network, and the training of the second neural network (Li, ABSTRACT in Page 1: propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques; DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner; to avoid confirmation bias, simultaneously train two diverged networks where each network uses the dataset division from the other network; during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: propose DivideMix, which addresses learning with label noise in a semi-supervised manner; different from most existing LNL (learning with noisy labels) approaches, DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; propose co-divide, which trains two networks simultaneously; for each network, dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; co-divide keeps the two networks diverged, so that they can filter different types of error and avoid confirmation bias in self-training; during the SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network's predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; Section 2.1 in Page 2: the method discards the labels that are highly likely to be noisy, and utilizes the noisy samples as unlabeled data to regularize training in an SSL manner; the method can avoid the confirmation bias problem by training two networks to filter errors for each other; compared to Co-teaching and Co-teaching+, the method is more robust to noise by enabling the two networks to teach each other implicitly at each epoch (co-divide) and explicitly at each mini-batch (label co-refinement and co-guessing); Section 3 in Page 3: to avoid the confirmation bias of self-training, where the model would accumulate its errors, simultaneously train two networks to filter errors for each other through epoch-level implicit teaching and batch-level explicit teaching; at each epoch, perform co-divide, where one network divides the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U), which are then used by the other network; at each mini-batch, one network utilizes both labeled and unlabeled samples to perform semi-supervised learning guided by the other network; FIG. 1 in Page 3: DivideMix trains two networks (A and B) simultaneously; at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide); at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.1 in Pages 3-5: training a model using the data divided by itself could lead to confirmation bias (i.e. the model is prone to confirm its mistakes), as noisy samples that are wrongly grouped into the labeled set would keep having lower loss due to the model overfitting to their labels; therefore, propose co-divide to avoid error accumulation; in co-divide, the GMM for one network is used to divide the training data for the other network; the two networks are kept diverged from each other due to different (random) parameter initialization, different training data division, different (random) mini-batch sequence, and different training targets; being diverged offers the two networks distinct abilities to filter different types of error, making the model more robust to noise; Section 3.2 in Pages 5-6: at each epoch, having divided the training data, train the two networks one at a time while keeping the other one fixed; MixMatch utilizes unlabeled data by merging consistency regularization (i.e. encourage the model to output the same predictions on perturbed unlabeled data) and entropy minimization (i.e. encourage the model to output confident predictions on unlabeled data) with the MixUp augmentation (i.e. encourage the model to have linear behavior between samples); to account for label noise, make two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label y_b with the network's prediction p_b (averaged across multiple augmentations of x_b), guided by the clean probability w_b produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (Algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X̂ (and Û), which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X̂ and Û; to prevent assigning all samples to a single class, apply the regularization term which uses a uniform prior distribution π (i.e. π_c = 1/C) to regularize the model's average output across all samples in the mini-batch)
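Li's co-divide, as quoted above and elaborated in the Claim 13 discussion that follows, fits a two-component Gaussian mixture to each network's per-sample training losses and thresholds the posterior of the low-loss component to split the data into a (mostly clean) labeled set and a (mostly noisy) unlabeled set for the other network. The sketch below uses scikit-learn's GaussianMixture for the EM fit; the loss normalization, the reg_covar value, and the threshold are illustrative choices, not Li's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, tau=0.5):
    """Fit a 2-component GMM to per-sample losses; the posterior of the low-mean
    component is taken as the probability that each sample's label is clean."""
    losses = per_sample_loss.reshape(-1, 1)
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)  # scale to [0, 1]
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4).fit(losses)
    clean_component = gmm.means_.argmin()
    w = gmm.predict_proba(losses)[:, clean_component]   # clean probability per sample
    labeled_idx = np.where(w >= tau)[0]                  # kept with their labels
    unlabeled_idx = np.where(w < tau)[0]                 # labels discarded, used as unlabeled
    return labeled_idx, unlabeled_idx, w

rng = np.random.default_rng(2)
losses = np.concatenate([rng.normal(0.2, 0.05, 80), rng.normal(0.8, 0.1, 20)])  # mostly clean
labeled_idx, unlabeled_idx, clean_prob = co_divide(losses, tau=0.5)
```

In the two-network setting, the split computed from network A's losses would be handed to network B for training, and vice versa, which is the cross-teaching the cited passages emphasize.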
(Gao, Section IV.A in Pages 50933-50934: from the training dataset, randomly select 2000 examples as the labeled data S_l, and the remaining examples are used as the unlabeled data S_u; Section II.A.2) in Page 50929: co-training is also a well-known semi-supervised learning method; unlike self-training, it splits the features into two disjoint views and then separately trains two classifiers in an iterative manner; for successful training, one hypothesis should be satisfied, that is, the two views should be conditionally independent given the categorical attributes; as a result, a better feature-splitting method seems to be more important; however, this is not an easy task; Feger and Koprinska have tried to find the optimal splitting by using conditional mutual information; unfortunately, they failed to improve the performance, as the random splitting performed better; Nigam and Ghani also showed that the random splitting method appears to perform better given sufficient redundancy in the data; Salaheldin and E. Gayar proposed a new feature-splitting method in which the best splitting point is obtained by using GA; they finally found that their method was competitive with the random splitting; despite the difficulty of finding the best splitting, co-training is still a popular approach to implementing semi-supervised learning). Independent Claim 13 Li discloses a processor-implemented method (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU), the method comprising: (Li, ABSTRACT in Page 1: dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples; Section 1 in Pages 1-2: DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; Section 3 in Page 3: divide the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U); FIG. 1 in Page 3: at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy); Section 3.1 in Pages 3-5: aim to find the probability of a sample being clean by fitting a mixture model to the per-sample loss distribution; a Gaussian Mixture Model (GMM) can better distinguish clean and noisy samples due to its flexibility in the sharpness of the distribution; therefore, fit a two-component GMM to the per-sample loss ℓ using the Expectation-Maximization algorithm; divide the training data into a labeled set and an unlabeled set by setting a threshold τ on w_i); training a first neural network using a semi-supervised learning scheme based on the first training data set comprising the first label, and an unlabeled second training data set generated by removing the second label from the second training data set; and training a second neural network using the semi-supervised learning scheme based on the first training data set comprising the first label, and an unlabeled second training data set generated by removing the second label from the second training data set (Li, ABSTRACT in Page 1: propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques; DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner; to avoid confirmation bias, simultaneously train two diverged networks where each network uses the dataset division from the other network; during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: propose DivideMix, which addresses learning with label noise in a semi-supervised manner; different from most existing LNL (learning with noisy labels) approaches, DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; propose co-divide, which trains two networks simultaneously; for each network, dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; co-divide keeps the two networks diverged, so that they can filter different types of error and avoid confirmation bias in self-training; during the SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network's predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; Section 2.1 in Page 2: the method discards the labels that are highly likely to be noisy, and utilizes the noisy samples as unlabeled data to regularize training in an SSL manner; the method can avoid the confirmation bias problem by training two networks to filter errors for each other; compared to Co-teaching and Co-teaching+, the method is more robust to noise by enabling the two networks to teach each other implicitly at each epoch (co-divide) and explicitly at each mini-batch (label co-refinement and
co-guessing); Section 3 in Page 3: to avoid confirmation bias of self-training where the model would accumulate its errors, simultaneously train two networks to filter errors for each other through epoch-level implicit teaching and batch-level explicit teaching; at each epoch, perform co-divide, where one network divides the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U), which are then used by the other network; at each mini-batch, one network utilizes both labeled and unlabeled samples to perform semi-supervised learning guided by the other network; FIG. 1 in Page 3: DivideMix trains two networks (A and B) simultaneously; at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide); at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.1 in Pages 3-5: training a model using the data divided by itself could lead to confirmation bias (i.e. the model is prone to confirm its mistakes), as noisy samples that are wrongly grouped into the labeled set would keep having lower loss due to the model overfitting to their labels; therefore, propose co-divide to avoid error accumulation; in co-divide, the GMM for one network is used to divide training data for the other network; the two networks are kept diverged from each other due to different (random) parameter initialization, different training data division, different (random) mini-batch sequence, and different training targets; being diverged offers the two networks distinct abilities to filter different types of error, making the model more robust to noise; Section 3.2 in Pages 5-6: at each epoch, having divided the training data, train the two networks one at a time while keeping the other one fixed; MixMatch utilizes unlabeled data by merging consistency regularization (i.e. encourage the model to output same predictions on perturbed unlabeled data) and entropy minimization (i.e. encourage the model to output confident predictions on unlabeled data) with the MixUp augmentation (i.e. encourage the model to have linear behavior between samples); to account for label noise, take two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label yb with the network’s prediction pb (averaged across multiple augmentations of xb), guided by the clean probability wb produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X ^ (and U ^ ) which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X ^ and U ^ ; to prevent assigning all samples to a single class, apply the regularization term which uses a uniform prior distribution π (i.e. πc = 1/C) to regularize the model’s average output across all samples in the mini-batch). 
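To make the cited co-divide step concrete, the following is a minimal illustrative sketch, not code from the Li reference: it fits a two-component GMM to normalized per-sample losses and splits the data by a clean-probability threshold τ. scikit-learn's GaussianMixture is assumed, and all function and variable names are chosen here for illustration; in the cited scheme, the division produced from one network's losses would be used to train the other network.

```python
# Illustrative sketch (not the reference's code) of GMM-based loss modeling:
# fit a two-component GMM to per-sample losses and split the data by clean probability.
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, tau=0.5):
    """Return (labeled_mask, unlabeled_mask, clean_probability) from per-sample losses."""
    losses = np.asarray(per_sample_loss, dtype=float).reshape(-1, 1)
    # Normalize losses to [0, 1]; min-max scaling is one simple option.
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    # The component with the smaller mean models the "clean" (low-loss) samples.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    w = gmm.predict_proba(losses)[:, clean_component]  # clean probability per sample
    labeled_mask = w > tau           # mostly clean -> keep labels
    unlabeled_mask = ~labeled_mask   # mostly noisy -> labels discarded
    return labeled_mask, unlabeled_mask, w
```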
Li fails to explicitly disclose (1) randomly splitting a training data set into a first training data set and a second training data set; (2) training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set. Gao teaches a system and a method relating to Semi-Supervised Learning (Gao, Title), wherein randomly splitting a training data set into a first training data set and a second training data set (Gao, Section IV.A in Pages 50933-50934: from the training dataset, randomly select 2000 examples as the labeled data Sl, and the remaining examples are used as the unlabeled data Su; Section II.A.2) in Page 50929: co-training is also a well-known semi-supervised learning method; unlike self-training, it splits the features into two disjoint views and then separately trains two classifiers in an iterative manner; for successful training, one hypothesis should be satisfied, that is, the two views should be conditionally independent given the categorical attributes; as a result, a better feature splitting method seems to be more important; however, this is not easy; Feger and Koprinska have tried to find the optimal splitting by using conditional mutual information; unfortunately, they failed to improve the performance, as random splitting performed better; Nigam and Ghani also showed that the random splitting method appears to be better in performance with sufficient redundancy in data; Salaheldin and El Gayar proposed a new feature splitting method in which the best splitting point is obtained by using a GA; they finally found that their method was competitive with random splitting; despite the difficulty of finding the best splitting, co-training is still a popular approach to implementing semi-supervised learning). Li and Gao are analogous art because they are from the same field of endeavor, a system and a method relating to Semi-Supervised Learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Gao to Li. Motivation for doing so would be to improve performance. Li in view of Gao fails to explicitly disclose training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set.
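As a hedged illustration of the claimed random split that the rejection maps to Gao, the sketch below partitions one labeled dataset into a first and second training set and produces unlabeled views by dropping each set's labels. The function and variable names are hypothetical and not drawn from the claims or the references.

```python
# Minimal, illustrative sketch of randomly splitting one labeled dataset into a
# "first" and "second" training set, plus unlabeled views created by removing labels.
import numpy as np

def random_split(x, y, first_fraction=0.5, seed=0):
    """x, y: numpy arrays of samples and labels with matching first dimension."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    cut = int(len(x) * first_fraction)
    first_idx, second_idx = idx[:cut], idx[cut:]
    first_set = (x[first_idx], y[first_idx])      # first training data set with first labels
    second_set = (x[second_idx], y[second_idx])   # second training data set with second labels
    unlabeled_first = x[first_idx]                # first set with the first labels removed
    unlabeled_second = x[second_idx]              # second set with the second labels removed
    return first_set, second_set, unlabeled_first, unlabeled_second
```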
Shao teaches a system and a method relating to machine learning on noisy data (Shao, Abstract) , wherein training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set (Shao, Abstract in Page 1: focus on the problem that noisy labels are primarily mislabeled samples, which tend to be concentrated near decision boundaries, rather than uniformly distributed, and whose features should be equivocal; propose an ensemble learning method to correct noisy labels by exploiting the local structures of feature manifolds; different from typical ensemble strategies that increase the prediction diversity among sub-models via certain loss terms, the method trains sub-models on disjoint subsets, each being a union of the nearest-neighbors of randomly selected seed samples on the data manifold; as a result, each sub-model can learn a coarse representation of the data manifold along with a corresponding graph; moreover, only a limited number of sub-models will be affected by locally-concentrated noisy labels; the constructed graphs are used to suggest a series of label correction candidates, and accordingly, the method derives label correction results by voting down inconsistent suggestions; Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; as illustrated in Fig. 
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract a total of M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for the given m-th partial-label set (x_j^m, y_j^m) ∈ Sm where only labels y_j^m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, use the M partial-label sets to predict a total of M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets (x_l^m, ŷ_l^m) ∀ x_l^m ∈ Sm obtained in the l-th training epoch, generate another M×M label propagation suggestions; finally, derive the most likely correction result Ȳ = {ȳ_i} from the total 2×M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to as good a performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M•B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ_m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ Sm and ii) the sample-correction pair (x_j^m, ŷ_j^(m,l)) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^(m,l) denotes the j-th sample’s label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset Sm and ii) sample-correction pairs (x_j^m, ŷ_j^(m,l)) in turn through each of the M graphs to obtain a total of 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch). Li and Shao are analogous art because they are from the same field of endeavor, a system and a method relating to machine learning on noisy data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Shao to Li. Motivation for doing so would be to improve performance.
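For reference, the following is a minimal sketch of the majority-decision correction described in the Shao passages above, assuming the per-branch label suggestions (however generated, e.g., by label propagation) are already available as class indices; the function and variable names are illustrative only, not the reference's implementation.

```python
# Hedged sketch of a majority-decision label correction: collect label suggestions
# from the ensemble branches and keep the most frequent class per sample.
import numpy as np

def majority_correction(suggestions):
    """suggestions: int array of shape (num_suggestions, num_samples) holding class indices."""
    suggestions = np.asarray(suggestions)
    num_classes = int(suggestions.max()) + 1
    corrected = np.empty(suggestions.shape[1], dtype=int)
    for i in range(suggestions.shape[1]):
        votes = np.bincount(suggestions[:, i], minlength=num_classes)
        corrected[i] = int(np.argmax(votes))  # most frequent suggested class wins
    return corrected
```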
Claim 14 Li in view of Gao and Shao discloses all the elements as stated in Claim 13 and further discloses wherein the training of the first neural network using the semi-supervised learning scheme comprises: outputting a first soft label by correcting the first training data set; and training the first neural network using the semi-supervised learning scheme based on the first training data set, the first soft label, and the unlabeled second training data set, and the training of the second neural network using the semi-supervised learning scheme comprises: outputting a second soft label by correcting the second training data set; and training the second neural network using the semi-supervised learning scheme based on the second training data set, the second soft label, and the unlabeled first training data set (Shao, Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually considered as soft-labels for training another student model; as illustrated in Fig. 
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract a total of M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for the given m-th partial-label set (x_j^m, y_j^m) ∈ Sm where only labels y_j^m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, use the M partial-label sets to predict a total of M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets (x_l^m, ŷ_l^m) ∀ x_l^m ∈ Sm obtained in the l-th training epoch, generate another M×M label propagation suggestions; finally, derive the most likely correction result Ȳ = {ȳ_i} from the total 2×M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to as good a performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M•B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ_m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ Sm and ii) the sample-correction pair (x_j^m, ŷ_j^(m,l)) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^(m,l) denotes the j-th sample’s label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset Sm and ii) sample-correction pairs (x_j^m, ŷ_j^(m,l)) in turn through each of the M graphs to obtain a total of 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch; Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss).
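The soft-label construction cited for this claim can be illustrated as a convex combination of the given label and a model prediction, weighted by a per-sample certainty/clean weight. The sketch below is a generic rendering under that assumption, not a reproduction of the reference's eqn. (6); names are ours.

```python
# Illustrative soft-label construction: a convex combination of the given (possibly noisy)
# one-hot label and a predicted class distribution, weighted by a per-sample weight w in [0, 1].
import numpy as np

def soft_label(one_hot_label, predicted_probs, w):
    """Return w * label + (1 - w) * prediction for each sample; rows still sum to 1."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return w * one_hot_label + (1.0 - w) * predicted_probs
```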
Claim 15 Li in view of Gao and Shao discloses all the elements as stated in Claim 14 and further discloses wherein the outputting of the first soft label comprises: estimating, by the second neural network, a first prediction label for the first training data set based on the first training data set; and correcting the first label and the first prediction label to output the first soft label, and the outputting of the second soft label comprises: estimating, by the first neural network, a second prediction label for the second training data set based on the second training data set; and correcting the second label and the second prediction label to output the second soft label (Shao, Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually considered as soft-labels for training another student model; as illustrated in Fig. 
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract a total of M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for the given m-th partial-label set (x_j^m, y_j^m) ∈ Sm where only labels y_j^m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, use the M partial-label sets to predict a total of M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets (x_l^m, ŷ_l^m) ∀ x_l^m ∈ Sm obtained in the l-th training epoch, generate another M×M label propagation suggestions; finally, derive the most likely correction result Ȳ = {ȳ_i} from the total 2×M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to as good a performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M•B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ_m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ Sm and ii) the sample-correction pair (x_j^m, ŷ_j^(m,l)) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^(m,l) denotes the j-th sample’s label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset Sm and ii) sample-correction pairs (x_j^m, ŷ_j^(m,l)) in turn through each of the M graphs to obtain a total of 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch; Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss).
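As background for the multi-graph label propagation relied on above, the following is a generic, hedged sketch of graph label propagation over a normalized affinity matrix; it does not reproduce the reference's eqn. (4), and the feature inputs, parameters, and names are illustrative assumptions.

```python
# Hedged sketch of standard graph-based label propagation: labels of anchor samples are
# diffused over a symmetrically normalized affinity graph built from feature vectors.
import numpy as np

def propagate_labels(features, anchor_labels, anchor_mask, num_classes,
                     alpha=0.99, iters=20, sigma=1.0):
    """features: (N, d) array; anchor_labels: (N,) int array (ignored where anchor_mask is False)."""
    # Build a dense RBF affinity matrix and symmetrically normalize it.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Initialize the label distribution: one-hot for anchors, zeros elsewhere.
    Y0 = np.zeros((len(features), num_classes))
    Y0[anchor_mask, anchor_labels[anchor_mask]] = 1.0
    F = Y0.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y0  # iterative propagation step
    return F.argmax(1)  # suggested label per sample
```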
Claim 16 Li in view of Gao and Shao discloses all the elements as stated in Claim 15 and further discloses wherein the correcting of the first label and the first prediction label comprises performing a convex combination based on the first label and the first prediction label to output the first soft label, and the correcting of the second label and the second prediction label comprises performing a convex combination based on the second label and the second prediction label to output the second soft label (Shao, Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived the ensemble branches, to compute both the MAE loss and the cross entropy loss). Claim 17 Li in view of Gao and Shao discloses all the elements as stated in Claim 14 and further discloses wherein the training of the first neural network using the semi-supervised learning scheme comprises: outputting a second pseudo label for the unlabeled second training data set using the first neural network; and training the first neural network using the semi-supervised learning scheme based on the first training data set, the first soft label, the unlabeled second training data set, and the second pseudo label, and the training of the second neural network using the semi-supervised learning scheme comprises: outputting a first pseudo label for the unlabeled first training data set using the second neural network; and training the second neural network using the semi-supervised learning scheme based on the second training data set, the second soft label, the unlabeled first training data set, and the first pseudo label (Shao, Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig, 1; the former typically involves mismatched descriptions or tags usually due to the negligence of an annotator; for this type of error, not only a label error may occur randomly in the sample space, but also the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected as one neighboring category to the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; corrections suggested by ensemble learning-based approaches are usually considered as soft-labels for training another student model; as illustrated in Fig. 
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using our k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singulars can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample’s nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of label correction result, especially for datasets primarily containing confusing label noises; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantages of both nearest neighbors and manifold structures with the aid of graphs; Section II.D of Page 3: label propagation is a graph-based semi-supervised method for generating pseudo-labels of unlabeled data based on given anchors; FIG. 2 in Page 4: framework of proposed ensemble-based label correction scheme; the proposed method has primarily three branches, namely i) noisy label branch controlled by L n , ii) corrected label branch controlled by L p s e u d o , and iii) ensemble branches controlled by L G ; the design focuses on the ensemble branches implemented in three phases; the noisy label branch is used to perform feature embedding and also used to regulate certainty weight jointly with the corrected label branch; only the corrected label branch is used while deploying; the three phases of the ensemble branches are detailed in Sections III-B–III-D; note that y(n) denotes the original label given by the noisy dataset; Section III with FIGS. 2-3 in Pages 3-6: Fig. 
2 illustrates the framework of the proposed method for label correction; ensemble learning scheme aims to split D into M disjoint subsets, namely Sm for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φm with parameter θm through the M ensemble branches, independently on the M disjoint subsets Sm derived by the data splitting strategy; second, by feeding all training samples xi into each classifier, extract a total of M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of training dataset X = {xi}; third, for the given m-th partial-label set (x_j^m, y_j^m) ∈ Sm where only labels y_j^m belonging to Sm are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, use the M partial-label sets to predict a total of M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets (x_l^m, ŷ_l^m) ∀ x_l^m ∈ Sm obtained in the l-th training epoch, generate another M×M label propagation suggestions; finally, derive the most likely correction result Ȳ = {ȳ_i} from the total 2×M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets Sm, each being the training set for one ensemble branch; in this way, Sm is expected to retain the shapes of k randomly-selected local neighborhoods, no matter noisy or not, as well as ensemble a random subset of xi ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all xi; meanwhile, the corrected label branch is trained based on corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of subnetworks via our local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution so that the ensemble model leads to as good a performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M•B packages into M disjoint subsets; each subset Sm therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ_m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ Sm and ii) the sample-correction pair (x_j^m, ŷ_j^(m,l)) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^(m,l) denotes the j-th sample’s label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset Sm and ii) sample-correction pairs (x_j^m, ŷ_j^(m,l)) in turn through each of the M graphs to obtain a total of 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of Sj becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch; adopt the weighted cross-entropy in eqn. (9) to measure the loss of the corrected (pseudo) labels branch; Section "Corrected label branch" of Section IV.E in Page 9 with Tables VII and VIII in Page 12 and Table VI in Page 10: use the soft-label, which is estimated as a convex combination, weighted by the certainty weight depicted in eqn. (6), of the prediction result derived by the corrected label branch and that derived by the ensemble branches, to compute both the MAE loss and the cross entropy loss).
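The pseudo-label ("co-guessing") mechanism cited from Li for this claim can be sketched as averaging both networks' predictions over augmentations of an unlabeled sample and sharpening the result with a temperature T. The code below is an illustrative reading under those assumptions, with hypothetical function and variable names.

```python
# Illustrative pseudo-label generation: average two networks' predicted distributions
# over several augmentations of an unlabeled sample, then sharpen with temperature T.
import numpy as np

def sharpen(p, T=0.5):
    """Lower the temperature of a probability vector (smaller T -> more confident)."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def co_guess(probs_net_a, probs_net_b, T=0.5):
    """probs_net_*: arrays of shape (num_augmentations, num_classes) for one unlabeled sample."""
    avg = (probs_net_a.mean(axis=0) + probs_net_b.mean(axis=0)) / 2.0
    return sharpen(avg, T)  # guessed (pseudo) label distribution
```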
Claim 18 Li in view of Gao and Shao discloses all the elements as stated in Claim 13 and further discloses controlling a machine learning model to estimate a prediction label for input data, wherein the machine learning model comprises the trained first neural network and the trained second neural network (Li, ABSTRACT in Page 1: during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: during SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network’s predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; FIG. 1 in Page 3: at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.2 in Pages 5-6: to account for label noise, take two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label yb with the network’s prediction pb (averaged across multiple augmentations of xb), guided by the clean probability wb produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X ^ (and U ^ ) which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X ^ and U ^ ). Claim 19 Li in view of Gao and Shao discloses all the elements as stated in Claim 13 and further discloses a non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, configure the one or more processors (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU) (Gao, Section IV.B.1) in Page 50934: the configurations of the computer are listed as follows: the Window 10 operation system, Intel i5-7400 CPU @ 3.00GHz and 8GB main memory) to perform the method of Claim 13. Independent Claim 20 Li discloses an apparatus, the apparatus comprising: one or more processors (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU) configured to control a machine learning model to estimate a prediction label for input data (Li, ABSTRACT in Page 1: during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: during SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network’s predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; FIG. 
Independent Claim 20

Li discloses an apparatus, the apparatus comprising: one or more processors (Li, APPENDIX D of Page 14: training using a single Nvidia V100 GPU) configured to control a machine learning model to estimate a prediction label for input data (Li, ABSTRACT in Page 1: during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: during SSL phase, improve MixMatch with label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network's predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; FIG. 1 in Page 3: at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.2 in Pages 5-6: to account for label noise, take two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label y_b with the network's prediction p_b (averaged across multiple augmentations of x_b), guided by the clean probability w_b produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X̂ (and Û) which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X̂ and Û), wherein the machine learning model comprises a trained first neural network and a trained second neural network, wherein the trained first neural network is generated by training a first neural network using a semi-supervised learning scheme based on a first training data set comprising a first label assigned to first data, and an unlabeled second training data set generated by removing a second label from a second training data set comprising the second label assigned to second data, wherein the trained second neural network is generated by training a second neural network using the semi-supervised learning scheme based on the first training data set comprising the first label, and an unlabeled second training data set generated by removing the second label from the second training data set (Li, ABSTRACT in Page 1: propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques; DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner; to avoid confirmation bias, simultaneously train two diverged networks where each network uses the dataset division from the other network; during the semi-supervised training phase, improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively; Section 1 in Pages 1-2: propose DivideMix, which addresses learning with label noise in a semi-supervised manner; different from most existing LNL (learning with noisy labels) approaches, DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; propose co-divide, which trains two networks simultaneously; for each network, dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; co-divide keeps the two networks diverged, so that they can filter different types of error and avoid confirmation bias in self-training; during SSL phase, improve MixMatch with
label co-refinement and co-guessing to account for label noise; for labeled samples, refine their ground-truth labels using the network's predictions guided by the GMM for the other network; for unlabeled samples, use the ensemble of both networks to make reliable guesses for their labels; Section 2.1 in Page 2: the method discards the labels that are highly likely to be noisy, and utilizes the noisy samples as unlabeled data to regularize training in a SSL manner; the method can avoid the confirmation bias problem by training two networks to filter errors for each other; compared to Co-teaching and Co-teaching+, the method is more robust to noise by enabling the two networks to teach each other implicitly at each epoch (co-divide) and explicitly at each mini-batch (label co-refinement and co-guessing); Section 3 in Page 3: to avoid confirmation bias of self-training where the model would accumulate its errors, simultaneously train two networks to filter errors for each other through epoch-level implicit teaching and batch-level explicit teaching; at each epoch, perform co-divide, where one network divides the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U), which are then used by the other network; at each mini-batch, one network utilizes both labeled and unlabeled samples to perform semi-supervised learning guided by the other network; FIG. 1 in Page 3: DivideMix trains two networks (A and B) simultaneously; at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide); at each mini-batch, a network performs semi-supervised training using an improved MixMatch method; perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples; Section 3.1 in Pages 3-5: training a model using the data divided by itself could lead to confirmation bias (i.e. the model is prone to confirm its mistakes), as noisy samples that are wrongly grouped into the labeled set would keep having lower loss due to the model overfitting to their labels; therefore, propose co-divide to avoid error accumulation; in co-divide, the GMM for one network is used to divide training data for the other network; the two networks are kept diverged from each other due to different (random) parameter initialization, different training data division, different (random) mini-batch sequence, and different training targets; being diverged offers the two networks distinct abilities to filter different types of error, making the model more robust to noise; Section 3.2 in Pages 5-6: at each epoch, having divided the training data, train the two networks one at a time while keeping the other one fixed; MixMatch utilizes unlabeled data by merging consistency regularization (i.e. encourage the model to output the same predictions on perturbed unlabeled data) and entropy minimization (i.e. encourage the model to output confident predictions on unlabeled data) with the MixUp augmentation (i.e.
encourage the model to have linear behavior between samples); to account for label noise, take two improvements to MixMatch which enable the two networks to teach each other; first, perform label co-refinement for labeled samples by linearly combining the ground-truth label y_b with the network's prediction p_b (averaged across multiple augmentations of x_b), guided by the clean probability w_b produced by the other network; then apply a sharpening function on the refined label to reduce its temperature; second, use the ensemble of predictions from both networks to "co-guess" the labels for unlabeled samples (algorithm 1, line 20), which can produce more reliable guessed labels; having acquired X̂ (and Û) which consists of multiple augmentations of labeled (unlabeled) samples and their refined (guessed) labels, follow MixMatch to "mix" the data, where each sample is interpolated with another sample randomly chosen from the combined mini-batch of X̂ and Û; to prevent assigning all samples to a single class, apply the regularization term which uses a uniform prior distribution π (i.e. π_c = 1/C) to regularize the model's average output across all samples in the mini-batch), and wherein the first training data set and the second training data set are generated by (Li, ABSTRACT in Page 1: dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples; Section 1 in Pages 1-2: DivideMix discards the sample labels that are highly likely to be noisy, and leverages the noisy samples as unlabeled data to regularize the model from overfitting and improve generalization performance; dynamically fit a Gaussian Mixture Model (GMM) on its per-sample loss distribution to divide the training samples into a labeled set and an unlabeled set; the divided data is then used to train the other network; Section 3 in Page 3: divide the noisy training dataset into a clean labeled set (X) and a noisy unlabeled set (U); FIG. 1 in Page 3: at each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy); Section 3.1 in Pages 3-5: aim to find the probability of a sample being clean by fitting a mixture model to the per-sample loss distribution; a Gaussian Mixture Model (GMM) can better distinguish clean and noisy samples due to its flexibility in the sharpness of distribution; therefore, fit a two-component GMM to the per-sample loss ℓ using the Expectation-Maximization algorithm; divide the training data into a labeled set and an unlabeled set by setting a threshold τ on w_i)). Li fails to explicitly disclose (1) randomly splitting a training data set into a first training data set and a second training data set; (2) training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set.
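The co-divide step cited throughout this mapping (fitting a two-component GMM to the per-sample loss and thresholding the clean-component posterior w_i at τ) can be sketched as follows, assuming scikit-learn is available. The function name and the reg_covar value are illustrative choices, not taken from the reference.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, tau=0.5):
    """Fit a two-component GMM to the per-sample loss distribution, take the
    posterior of the lower-mean (clean) component as w_i, and split samples
    into a labeled (mostly clean) set and an unlabeled (mostly noisy) set by
    thresholding w_i at tau; the split is then used to train the *other* network."""
    losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_component = int(np.argmin(gmm.means_.ravel()))
    w = gmm.predict_proba(losses)[:, clean_component]
    labeled_idx = np.where(w >= tau)[0]
    unlabeled_idx = np.where(w < tau)[0]
    return labeled_idx, unlabeled_idx, w
```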
Gao teaches a system and a method relating to Semi-Supervised Learning (Gao, Title), wherein randomly splitting a training data set into a first training data set and a second training data set (Gao, Section IV.A in Pages 50933-50934: from the training dataset, randomly select 2000 examples as the labeled data S_l, and the remaining examples are used as the unlabeled data S_u; Section II.A.2 in Page 50929: co-training is also a well-known semi-supervised learning method; unlike self-training, it splits the features into two disjoint views and then separately trains two classifiers in an iterative manner; for successful training, one hypothesis should be satisfied, that is, the two views should be conditionally independent given the categorical attributes; as a result, a better splitting method for the features seems to be more important; however, this is not an easy task; Feger and Koprinska have tried to find the optimal splitting by using conditional mutual information; unfortunately, they failed to improve the performance, as the random splitting performed better; Nigam and Ghani also showed that the random splitting method appears to be better in performance with sufficient redundancy in the data; Salaheldin and E. Gayar proposed a new feature splitting method in which the best splitting point is obtained by using a GA; they finally found that their method was competitive with the random splitting; despite the difficulty of finding the best splitting, co-training is still a popular approach to implement semi-supervised learning). Li and Gao are analogous art because they are from the same field of endeavor, a system and a method relating to Semi-Supervised Learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Gao to Li. Motivation for doing so would be to improve performance. Li in view of Gao fails to explicitly disclose training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set.
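Gao's random selection of a labeled subset, with the remainder treated as unlabeled, amounts to a simple random partition of sample indices. A minimal sketch under assumptions (the function name and subset size are hypothetical):

```python
import numpy as np

def random_split(n_samples, n_labeled=2000, seed=0):
    """Randomly partition sample indices into a labeled subset and an
    unlabeled remainder, in the spirit of selecting examples for S_l and
    leaving the rest as S_u."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return perm[:n_labeled], perm[n_labeled:]

# Illustrative use: keep labels only for the first subset; the second subset's
# labels are dropped so it can serve as unlabeled data in a semi-supervised round.
# labeled_idx, unlabeled_idx = random_split(n_samples=len(y))
```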
Shao teaches a system and a method relating to machine learning on noisy data (Shao, Abstract), wherein training a second neural network using the semi-supervised learning scheme based on the second training data set comprising the second label, and an unlabeled first training data set generated by removing the first label from the first training data set (Shao, Abstract in Page 1: focus on the problem that noisy labels are primarily mislabeled samples, which tend to be concentrated near decision boundaries, rather than uniformly distributed, and whose features should be equivocal; propose an ensemble learning method to correct noisy labels by exploiting the local structures of feature manifolds; different from typical ensemble strategies that increase the prediction diversity among sub-models via certain loss terms, the method trains sub-models on disjoint subsets, each being a union of the nearest-neighbors of randomly selected seed samples on the data manifold; as a result, each sub-model can learn a coarse representation of the data manifold along with a corresponding graph; moreover, only a limited number of sub-models will be affected by locally-concentrated noisy labels; the constructed graphs are used to suggest a series of label correction candidates, and accordingly, the method derives label correction results by voting down inconsistent suggestions; Section 1 in Pages 1-2 with FIG. 1 in Page 1 and FIG. 2 in Page 4: label noise can be roughly divided into two types: random label noise and confusing label noise, as illustrated in Fig. 1; the former typically involves mismatched descriptions or tags, usually due to the negligence of an annotator; for this type of error, not only may a label error occur randomly in the sample space, but the erroneous label is often of another irrelevant random class; in contrast, the latter usually occurs when a to-be-labeled sample contains confusing content or equivocal features and is the main cause of noisy labels in real-world applications; confusing label noise often occurs on data samples lying near the decision boundaries, and such noisy labels should be corrected to a neighboring category of the current one in the feature space; propose an ensemble-based label correction algorithm by exploiting the local structures of data manifolds; as illustrated in Fig.
2, the noisy label correction scheme involves three iterative phases: i) k-NN based data splitting, ii) multi-graph label propagation, and iii) confidence-guided label correction; by partitioning the source noisy dataset into disjoint subsets using the k-NN splitting scheme, each noisy label, along with its k-nearest-neighbors, will usually affect only a minority of the ensemble branches; as a result, each ensemble branch generates a graph that holds its own noisy local manifold structures so that such singular cases can be treated as outliers during the majority decision process; train the ensemble branches on the corresponding disjoint subsets independently; through this design, each sub-network can learn not only a coarse global representation of the data manifold, but also different local manifold structures; then derive label correction suggestions for each sample based on the predictions of the sample's nearest-neighbors in individual disjoint subsets via the corresponding ensemble branches; finally, the method suggests final label corrections by ruling out inconsistent suggestions derived according to graphs accessed by ensemble branches; propose a novel iterative data splitting method to split training samples into disjoint subsets, each preserving some local manifold structure of the source data while representing a coarse global approximation; this design allows the influence of mislabeled instances to be limited to a minority of ensemble branches; the design contains a novel noisy-label branch that can stably provide a correct suggestion for within-class clean labels; hence, this branch can boost the accuracy of the label correction result, especially for datasets primarily containing confusing label noise; adopt multi-graph label propagation, rather than a simple nearest-neighbor strategy, to derive label correction suggestions via multiple graph representations characterizing similar data manifolds; hence, the method can take advantage of both nearest neighbors and manifold structures with the aid of graphs; Section III with FIGS. 2-3 in Pages 3-6: Fig.
2 illustrates the framework of the proposed method for label correction; the ensemble learning scheme aims to split D into M disjoint subsets, namely S_m for m = 1, …, M, for training M ensemble branches; during the training, the method first trains M classifiers Φ_m with parameters θ_m through the M ensemble branches, independently on the M disjoint subsets S_m derived by the data splitting strategy; second, by feeding all training samples x_i into each classifier, extract totally M different feature vector sets, each of which can be used to derive one graph representation of the data manifold of the training dataset X = {x_i}; third, given the m-th partial-label set (x_j^m, y_j^m) ∈ S_m, where only the labels y_j^m belonging to S_m are available, construct its graph representation and use the graph to predict the label correction suggestions through label propagation for all data samples in the remaining M-1 partial-label sets (i.e., the remaining M-1 partial-label sets are unlabeled for the m-th ensemble branch); as a result, use the M partial-label sets to predict totally M×M label correction suggestions based on the M graphs derived in the second step; fourth, based on the corrected labels ŷ_l^m ∈ Ŷ_l^m and the M sample-correction sets (x_l^m, ŷ_l^m) ∀ x_l^m ∈ S_m obtained in the l-th training epoch, generate another M×M label propagation suggestions; finally, derive the most likely correction result Ȳ = {ȳ_i} from the total 2M² label correction suggestions via a majority decision; devise a data splitting strategy to partition the noisy source dataset D into M disjoint subsets S_m, each being the training set for one ensemble branch; in this way, S_m is expected to retain the shapes of k randomly-selected local neighborhoods, noisy or not, as well as assemble a random subset of x_i ∈ X; therefore, by taking a majority decision on the approximate manifolds learned by the M sub-networks, the local influences brought by noisy samples can be mitigated; finally, the noisy label branch in the design is trained on the whole original dataset D to derive the feature vectors of all x_i; meanwhile, the corrected label branch is trained based on the corrected labels, i.e., (X, Ȳ), obtained after the l-th training epoch, and it is the only branch used for deployment; the key idea is to restrict the influence of any group of locally-concentrated noisy labels to only a minority of sub-networks via the local-patch-based data splitting strategy; as a result, most ensemble branches can learn their own relatively correct approximations of the data manifold around the noisy local patch, so that a suitable label correction result can be yielded by majority decision, accordingly; the first phase of each training epoch is to randomly scatter each local neighborhood of data points on the source data manifold, including noisy labels, into disjoint subsets; when a noisy local neighborhood only contaminates a minority of ensemble branches, the method can vote down the negative influence of noisy labels by a majority decision; this case leads to a performance leap with the method: the performance upper-bound; on the other hand, in the extreme case that all noisy labels are uniformly distributed globally, the data splitting strategy will not alter the noise distribution, so that the ensemble model leads to as good a performance as typical random-selection schemes: the baseline performance; in sum, the method can be expected to provide a performance in between the upper-bound and
the baseline; since real-world label noise distribution tends to be concentrated on decision boundaries, the method can usually lead to performance improvements; splitting: for each class, randomly assign the M·B packages into M disjoint subsets; each subset S_m therefore stands for a coarse global approximation of the source data manifold but holds fine local manifold structures of different places independently; initialize each training epoch with the data splitting process, and name this strategy re-splitting; re-splitting enables each ensemble branch to learn a different coarse approximation of the data manifold to prevent the ensemble branches from overfitting and being biased to the same training data; initialize the model parameter θ_l^m for each ensemble branch to guarantee the fast convergence of the l-th training epoch by eqn. (1); construct the m-th graph representing the data manifold based on the features f_i^m = f(x_i, θ^m) extracted by the m-th model; then, for each graph, produce M+M label correction suggestions by taking i) the original sample-label pair (x_j^m, y_j^m) ∈ S_m and ii) the sample-correction pair (x_j^m, ŷ_j^{m,l}) obtained in the l-th training epoch as different starting partial-labels, where ȳ_j^{m,l} denotes the j-th sample's label correction suggestion given by the m-th ensemble branch in the l-th training epoch; the multi-graph label propagation strategy works based on the features extracted by the classification models of the M ensemble branches; the first step is to build M normalized weighted undirected adjacency matrices, each representing the m-th graph constructed based on f_i^m, for the label propagation process described in eqn. (4); next, propagate the label information of i) the sample-label pairs (x_j^m, y_j^m) in the m-th disjoint subset S_m and ii) the sample-correction pairs (x_j^m, ŷ_j^{m,l}) in turn through each of the M graphs to obtain totally 2M² label correction suggestions; finally, the partial-label suggested by the m-th ensemble branch based on the label information of S_j becomes eqn. (5); select the most frequent category among all label suggestions via a majority decision as the final label correction result; to guide the label correction, propose a normalized average confidence level to devise the loss function; after normalizing the final weights described in eqn. (6), use i) the corrected labels and ii) the weighted cross entropy loss to train the next-epoch ensemble branches and the next-epoch corrected label branch). Li and Shao are analogous art because they are from the same field of endeavor, a system and a method relating to machine learning on noisy data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Shao to Li. Motivation for doing so would be to improve performance.
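The multi-graph label propagation and majority decision described in the Shao mapping can be illustrated generically. This is a minimal NumPy sketch under assumptions: the Gaussian-kernel graph construction, the iterative propagation rule, and all parameter values are generic stand-ins and do not reproduce the reference's eqns (4) through (6).

```python
import numpy as np

def normalized_adjacency(features, sigma=1.0):
    """Build a symmetrically normalized weighted adjacency matrix from one
    branch's feature vectors (Gaussian kernel as a generic graph construction)."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    return (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def propagate(S, Y_partial, alpha=0.99, n_iter=20):
    """Generic label propagation: iterate F <- alpha * S @ F + (1 - alpha) * Y,
    starting from a partial one-hot label matrix, and return per-sample classes."""
    F_mat = Y_partial.astype(float).copy()
    for _ in range(n_iter):
        F_mat = alpha * S @ F_mat + (1.0 - alpha) * Y_partial
    return F_mat.argmax(axis=1)

def majority_vote(suggestions):
    """Final correction: most frequent suggested class per sample across all
    propagation runs (rows: runs, columns: samples)."""
    suggestions = np.asarray(suggestions, dtype=int)
    return np.array([np.bincount(col).argmax() for col in suggestions.T])
```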
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ding et al. ("A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels", 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar 12-15, 2018, pp. 1215-1224) discloses in the ABSTRACT that (1) the recent success of deep neural networks is powered in part by large-scale well-labeled training data; (2) however, it is a daunting task to laboriously annotate an ImageNet-like dataset; (3) on the contrary, it is fairly convenient, fast, and cheap to collect training images from the Web along with their noisy labels; (4) this signifies the need of alternative approaches to training deep neural networks using such noisy labels; (5) existing methods tackling this problem either try to identify and correct the wrong labels or reweigh the data terms in the loss function according to the inferred noisy rates; (6) both strategies inevitably incur errors for some of the data points; (7) in this paper, contend that it is actually better to ignore the labels of some of the data points than to keep them if the labels are incorrect, especially when the noisy rate is high; (8) after all, the wrong labels could mislead a neural network to a bad local optimum; (9) suggest a two-stage framework for learning from noisy labels; (10) in the first stage, identify a small portion of images from the noisy training set of which the labels are correct with a high probability; (11) the noisy labels of the other images are ignored; (12) in the second stage, train a deep neural network in a semi-supervised manner; (13) this framework effectively takes advantage of the whole training set and yet only a portion of its labels that are most likely correct; and (14) experiments on three datasets verify the effectiveness of the approach, especially when the noisy rate is high.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HWEI-MIN LU, whose telephone number is (313) 446-4913. The examiner can normally be reached Mon - Fri: 9:00 AM - 6:00 PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mariela D. Reyes, can be reached at (571) 270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HWEI-MIN LU/
Primary Examiner, Art Unit 2142

Prosecution Timeline

Jun 27, 2023
Application Filed
Mar 07, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602578
LIGHT SOURCE COLOR COORDINATE ESTIMATION SYSTEM AND DEEP LEARNING METHOD THEREOF
2y 5m to grant Granted Apr 14, 2026
Patent 12596954
MACHINE LEARNING FOR MANAGEMENT OF POSITIONING TECHNIQUES AND RADIO FREQUENCY USAGE
2y 5m to grant Granted Apr 07, 2026
Patent 12591770
PREDICTING A STATE OF A COMPUTER-CONTROLLED ENTITY
2y 5m to grant Granted Mar 31, 2026
Patent 12579466
DYNAMIC USER-INTERFACE COMPARISON BETWEEN MACHINE LEARNING OUTPUT AND TRAINING DATA
2y 5m to grant Granted Mar 17, 2026
Patent 12561222
REDUCING BIAS IN MACHINE LEARNING MODELS UTILIZING A FAIRNESS DEVIATION CONSTRAINT AND DECISION MATRIX
2y 5m to grant Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
62%
Grant Probability
99%
With Interview (+39.5%)
3y 1m
Median Time to Grant
Low
PTA Risk
Based on 217 resolved cases by this examiner. Grant probability derived from career allow rate.
