DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is responsive to the following communication(s): original application filed on 11/30/2022; said application claims a priority filing date of 07/14/2020. Claims 1-18 are pending. Claims 1, 7, and 13 are independent.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: S4 in FIGS. 1 and 3. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Figure 2 should be designated by a legend such as --Prior Art-- because only that which is old is illustrated. See MPEP § 608.02(g). Corrected drawings in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. The replacement sheet(s) should be labeled “Replacement Sheet” in the page header (as per 37 CFR 1.84(c)) so as not to obstruct any portion of the drawing figures. If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to because it is unclear whether K2 shows the scores before or after conversion into probability values: the values shown in K2 of FIG. 11B appear to be probability values (i.e., all probabilities add up to 1.0); however, the specification in ¶ [0113] describes '"propyl" is determined to be "I-Molecular" based on the magnitude of the score, but is determined to be “B-Molecular” by probabilistic selection', which indicates that K2 shows the scores before conversion into probability values. Clarification is required. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
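For clarity of the distinction at issue, the following minimal Python sketch (illustrative only; the score values and the softmax-style conversion are assumptions, not taken from the disclosure) shows estimation scores converted into probability values that add up to 1.0, and a label selected by score magnitude versus by probabilistic selection:

import math
import random

def to_probabilities(scores):
    # Convert raw estimation scores into probability values (softmax-style),
    # so that all values add up to 1.0.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the token "propyl"; the labels follow ¶ [0113],
# but the numeric values are illustrative only.
labels = ["B-Molecular", "I-Molecular", "O"]
scores = [1.4, 1.6, 0.2]

probs = to_probabilities(scores)

# Selection by magnitude of the score (argmax) yields "I-Molecular".
by_magnitude = labels[scores.index(max(scores))]

# Probabilistic selection samples a label according to the probability
# values, so "B-Molecular" may be chosen even though its score is lower.
by_sampling = random.choices(labels, weights=probs, k=1)[0]

print(by_magnitude, by_sampling, sum(probs))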
Specification
The disclosure is objected to because of the following informalities:
in ¶ [0113], "As illustrated in FIG. 10B, in the result data K2, the label is probabilistically determined (selected) based on the estimation score converted into the probability value" appears to be "As illustrated in FIG. 11B, in the result data K2, the label is probabilistically determined (selected) based on the estimation score converted into the probability value".
Appropriate correction is required.
Claim Objections
Claims 1, 6-7, 12-13, and 18 are objected to because of the following informalities:
in Claim 1, lines 7-8; Claim 7, lines 6-7; and Claim 13, lines 8-9, "… generating a first machine learning model by training by the plurality of data …" appears to be "… generating a first machine learning model by training using the plurality of data …";
in Claims 6, 12, and 18, lines 3-4, "… generating/generate a second machine learning model by training by the generated second training data group" appears to be "… generating/generate a second machine learning model by training using the generated second training data group".
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-5, 7-11, and 13-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim(s) recite(s) "selecting/select a plurality of data from a first training data group based on an appearance frequency of first data attached with a first label, the first data being included in the first training data group" (Claims 1-18), "generating/generate a second training data group obtained by combining the first training data group and an output by the first machine learning model when the first data is input" (Claims 1-18), "excluding/exclude second data whose appearance frequency is less than a first threshold from a selection target, the second data being included in the first training data group" (Claims 2, 8, and 14), "(obtaining/determining) entropy and self-information amount of the first data based on the appearance frequency" (Claims 3, 9, and 15), "excluding/exclude third data whose self-information amount is larger than a second threshold and whose entropy is less than a third threshold from a selection target, the third data being included in the first training data group" (Claims 3, 9, and 15), "generating/generate the second training data group by combining the first training data group and a first result output by the first machine learning model when fourth data generated by changing content of fifth data included in the first training data group is input" (Claims 4, 10, and 16), and "generating the second training data group by combining the first training data group and a second result generated by changing content of a first result output by the first machine learning model when the first data is input" (Claims 5, 11, and 17) which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/algorithms/calculations.
This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of "non-transitory computer-readable storage medium storing machine learning program" (Claims 1-6), "computer" (Claims 1-12), "information processing device" (Claims 13-18), "one or more memories" (Claims 13-18), "one or more processors" (Claims 13-18), "generating/generate a first machine learning model by training by the plurality of data" (Claims 1-18), and "acquiring/acquire entropy and self-information amount of the first data" (Claims 3, 9, and 15), which only amount to "apply it" with the use of generic computer components or insignificant extra-solution activity. None of the additional elements/limitations, taken alone or in combination, integrates the abstract idea into a practical application, except in Claims 6, 12, and 18, which include the additional element/limitation "generating a second machine learning model by training by the generated second training data group" that integrates with the other limitations as a whole to reflect an improvement in machine learning technology described in ¶¶ [0122]-[0126] of the specification.
The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above, the additional elements amount to no more than generic computer components used to apply the abstract idea, or insignificant extra-solution activity, and therefore do not provide an inventive concept.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-18 are rejected under 35 U.S.C. 103 as being unpatentable over Radosavovic ("Data Distillation: Towards Omni-Supervised Learning", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-23, 2018, pp. 4119-4128), hereinafter Radosavovic, in view of Mizobuchi (US 2018/0082215 A1, pub. date: 03/22/2018), hereinafter Mizobuchi.
Independent Claims 1, 7, and 13
Radosavovic discloses a non-transitory computer-readable storage medium storing machine learning program that causes a computer to execute a process (Radosavovic, Abstract of Page 4119: inherent in a computer for performing omni-supervised learning with data distillation and self-training), the process comprising:
selecting a plurality of data from a first training data group based on (Radosavovic, Section 1 of Pages 4119-4120: most research on semi-supervised learning has simulated labeled/unlabeled data by splitting a fully annotated dataset and is therefore likely to be upper-bounded by fully supervised learning with all annotations; on the contrary, omni-supervised learning is lower-bounded by the accuracy of training on all annotated data, and its success can be evaluated by how much it surpasses the fully supervised baseline; propose to perform knowledge distillation from data, inspired by [3, 18] which performed knowledge distillation from models; generate annotations on unlabeled data using a model trained on large amounts of labeled data; Section 3 of Pages 4120-4121: propose data distillation, a general method for omni-supervised learning that distills knowledge from unlabeled data without the requirement of training a large set of models; data distillation involves four steps: (1) training a model on manually labeled data (just as in normal supervised learning) …; i.e., only data which are manually labeled are selected to train a model for generating annotations on unlabeled data; Section 4 of Pages 4121-4122: expect the predicted boxes and keypoints to be reliable enough for generating good training labels; nevertheless, the predictions will contain false positives that we hope to identify and discard; use the predicted detection score as a proxy for prediction quality and generate annotations only from the predictions that are above a certain score threshold; a score threshold works well if it makes "the average number of annotated instances per unlabeled image" roughly equal to "the average number of instances per labeled image"; although this heuristic assumes that the unlabeled and labeled images follow similar distributions, it is robust and works well even in cases where the assumption does not hold); and
generating a second training data group obtained by combining the first training data group and an output by the first machine learning model when the first data is input (Radosavovic, Abstract of Page4119: to exploit the omni-supervised setting, propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations; Section 1 with FIG. 1 of Pages 4119-4120: propose to perform knowledge distillation from data, inspired by [3, 18] which performed knowledge distillation from models; generate annotations on unlabeled data using a model trained on large amounts of labeled data, and then retrain the model using the extra generated annotations; ensembling the results of a single model run on different transformations (e.g., flipping and scaling) of an unlabeled image; such transformations are widely known to improve single-model accuracy [20] when applied at test time, indicating that they can provide nontrivial knowledge that is not captured by a single prediction; in other words, in comparison with [18], which distills knowledge from the predictions of multiple models, distill the knowledge of a single model run on multiple transformed copies of unlabeled data (see Figure 1); data distillation is a simple and natural approach based on “self-training” (i.e., making predictions on unlabeled data and using them to update the model); we are now equipped with accurate models that may make fewer errors than correct predictions; this allows us to trust their predictions on unseen data and reduces the requirement for developing data cleaning heuristics; as a result, data distillation does not require one to change the underlying recognition model (e.g., no modification on the loss definitions), and is a scalable solution for processing large-scale unlabeled data sources; Section 2 of Page 4120: ensembling [14] multiple models has been a successful method for improving accuracy; model compression [3] is proposed to improve test-time efficiency of ensembling by compressing an ensemble of models into a single student model; our approach distills knowledge from a lightweight ensemble formed by multiple data transformations; among semi-supervised methods, our method is most related to self-training, a strategy in which a model’s predictions on unlabeled data are used to train itself; once the predicted annotations are generated, our method leverages them as if they were true labels; it does not require any modifications to the optimization problem or model structure; multiple views or perturbations of the data can provide useful signal for semi-supervised learning; our method is also based on multiple geometric transformations, but it does not require to modify network structures or impose consistency by adding any extra loss terms; Section 3 with FIG. 
2 of Pages 4120-4121: propose data distillation, a general method for omni-supervised learning that distills knowledge from unlabeled data without the requirement of training a large set of models; data distillation involves four steps: …; (2) applying the trained model to multiple transformations of unlabeled data; (3) converting the predictions on the unlabeled data into labels by ensembling the multiple predictions; and (4) retraining the model on the union of the manually labeled data and automatically labeled data; a common strategy for boosting the accuracy of a visual recognition model is to apply the same model to multiple transformations of the input and then to aggregate the results; refer to the general application of inference to multiple transformations of a data point with a single model as multi-transform inference. In data distillation, apply multi-transform inference to a potentially massive set of unlabeled data; by aggregating the results of multi-transform inference, it is often possible to obtain a single prediction that is superior to any of the model’s predictions under a single transform (e.g., see Figure 2); the aggregated prediction generates new knowledge and in principle the model can use this information to learn from itself by generating labels; simply ensemble (or aggregate) the predictions from multi-transform inference in a way that generates “hard” labels of the same structure and type of those found in the manually annotated data; once such labels are generated, they can be used to retrain the model in a simple plug-and-play fashion, as if they were authentic ground-truth labels; the new knowledge generated from unlabeled data can be used to improve the model; to do this, a student model (which can be the same as the original model or different) is trained on the union set of the original supervised data and the unlabeled data with automatically generated labels; training on the union set is straightforward and requires no change to the loss function; ensure that each training minibatch contains a mixture of manually labeled data and automatically labeled data which ensures that every minibatch has a certain percentage of ground-truth labels, which results in better gradient estimates; since more data is available, the training schedule must be lengthened to take full advantage of it; Section 4 of Pages 4121-4222: opt for geometric transformations for multi-transform inference, though other transformations such as color jittering [20] are possible; the only requirement is that it must be possible to ensemble the resulting predictions; for geometric transformations, if the prediction is a geometric quantity (e.g., coordinates of a keypoint), then the inverse transformation must be applied to each prediction before they are merged; use two popular transformations: scaling and horizontal flipping; one could ensemble the multi-transform inference results from each stage and each head of Mask R-CNN; for simplicity, only apply multi-transform inference to the keypoint head; the outputs from the other stage (i.e., RPN) and heads (i.e., bounding box classification and regression) are from a single-scale without any transformations; train a student model on the union set of the original supervised images and the images with automatically generated annotations).
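For reference, the four-step data distillation procedure cited above may be summarized in the following minimal Python sketch; the helper callables (train_model, apply_model, ensemble_predictions), the transforms, and the score_threshold parameter are placeholders assumed for illustration and are not taken from Radosavovic:

def data_distillation(labeled_data, unlabeled_data, transforms,
                      train_model, apply_model, ensemble_predictions,
                      score_threshold):
    # (1) Train a model on the manually labeled data.
    model = train_model(labeled_data)

    auto_labeled = []
    for x in unlabeled_data:
        # (2) Apply the trained model to multiple transformations of the
        #     unlabeled data (e.g., scaling and horizontal flipping).
        predictions = [apply_model(model, t(x)) for t in transforms]

        # (3) Ensemble the multiple predictions into a single "hard" label,
        #     keeping only predictions above a score threshold so that
        #     likely false positives are discarded.
        label, score = ensemble_predictions(predictions)
        if score >= score_threshold:
            auto_labeled.append((x, label))

    # (4) Retrain the (student) model on the union of the manually labeled
    #     data and the automatically labeled data.
    return train_model(labeled_data + auto_labeled)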
Radosavovic further discloses a machine learning method for a computer to execute the process described above (Radosavovic, Abstract of Page 4119: inherent in a computer for performing omni-supervised learning with data distillation and self-training).
Radosavovic further discloses an information processing device comprising: one or more memories; and one or more processors coupled to the one or more memories and the one or more processors configured to perform the process described above (Radosavovic, Abstract of Page 4119: inherent in a computer for performing omni-supervised learning with data distillation and self-training).
Radosavovic fails to explicitly disclose selecting a plurality of data from a first training data group based on an appearance frequency of first data in the first training data group for generating a first machine learning model.
Mizobuchi teaches a system and a method relating to machine learning (Mizobuchi, ¶ [0003]), wherein selecting a plurality of data from a first training data group based on an appearance frequency of first data in the first training data group for generating a first machine learning model (Mizobuchi, ¶ [0007]: extracting, from the plurality of teacher data elements, a plurality of potential features each included in at least one of the plurality of teacher data elements; calculating, based on a frequency of occurrence of each of the plurality of potential features in the plurality of teacher data elements, a degree of importance of said each potential feature in machine learning; calculating an information amount of each of the plurality of teacher data elements, using degrees of importance calculated respectively for a plurality of potential features included in said each teacher data element; and selecting a teacher data element for use in the machine learning from the plurality of teacher data elements, based on information amounts of respective ones of the plurality of teacher data elements; ¶¶ [0025]-[0052] with FIG. 1: reads the teacher data elements 20a1 to 20an from the storage unit 11, and extracts, from the teacher data elements 20a1 to 20an, a plurality of potential features each of which is included in at least one of the teacher data elements 20a1 to 20an; what are extracted as the potential features A to C from the teacher data elements 20a1 to 20an is determined according to what is learned in the machine learning; e.g., in the case of creating a learning model for determining whether two documents are similar, take words and sequences of words as features to be extracted; in the case of creating a learning model for determining whether two images are similar, the control unit 12 takes pixel values and sequences of pixel values as features to be extracted; calculate the degree of importance of each potential feature A to C in the machine learning, on the basis of the frequency of occurrence of the potential feature A to C in the teacher data elements 20a1 to 20an; e.g., a potential feature has a higher degree of importance as its frequency of occurrence in all the teacher data elements 20a1 to 20an is lower; in the case where the potential features A to C are words or sequences of words, an inverse document frequency (idf) or another may be used as the degree of importance; even if a potential feature is not useful for sorting-out, its frequency of occurrence becomes lower as the potential feature consists of more words; normalize the idf value by dividing by the length of the potential feature (the number of words) and use the resultant as the degree of importance; the normalization by dividing the idf value by the number of words prevents obtaining a high degree of importance for a potential feature that just consists of many words and is not useful for sorting-out; calculate the information amount (hereinafter, may be referred to as potential information amount) of each of the teacher data elements 20a1 to 20an, using the degrees of importance calculated for the potential features included in the teacher data element 20a1 to 20an; e.g., the information amount of each teacher data element 20a1 to 20an is a sum of the degrees of importance calculated for the potential features included in the teacher data element 20a1 to 20an; select teacher data elements for use in the machine learning, from the teacher data elements 20a1 to 20an on the basis of the information amounts of 
the respective teacher data elements 20a1 to 20an; select teacher data elements with information amounts larger than or equal to a threshold, from the teacher data elements 20a1 to 20an, to thereby generate a teacher data set; generates a plurality of teacher data sets by sequentially adding a teacher data element to the teacher data set in descending order of information amount; e.g., the teacher data set 21a of FIG. 1 includes teacher data elements from the teacher data elements 20a2 with the largest information amount to the teacher data element 20an with the k-th largest information amount; "k" is the minimum number of teacher data elements to be used for calculating the evaluation value of a learning model; create a plurality of learning models by performing the machine learning on the individual teacher data sets; calculate an evaluation value regarding the performance of each of the learning models 22a, 22b, and 22c created by the machine learning; an F value is used as the valuation value; the F value is a harmonic mean of recall and precision; search for a teacher data set that produces a learning model with the highest evaluation value; ¶¶ [0072]-[0113] with FIGS. 3-10: a learning model for sorting out similar documents is created using documents at least partly written in natural language as teacher data elements; the documents 20b1 to 20bn are reports on bugs, which includes a title 30 and a body 31 that includes, e.g., descriptions 31a, 31b, and 31c, a source code 31d, and a log 31e; each of the document 20b1 to 20bn is tagged with identification information indicating whether the document 20b1 to 20bn belongs to a similarity group; extract a plurality of potential features from the documents 20b1 to 20bn with natural language processing, which are words or sequences of words; counts the frequency of occurrence of each potential feature in all the documents 20b1 to 20bn; calculate the degree of importance of each potential feature in the machine learning, on the basis of the frequency of occurrence of the potential feature in all the documents 20b1 to 20bn; e.g., as the degree of importance, an idf value or a mutual information amount may be used; here, idf(t) that is an idf value for a word or a sequence of words is calculated by the following equation (1): idf(t)=log(n/df(t)); where "n" denotes the number of all documents, and "df(t)" denotes the number of documents including the word or the sequence of words; the mutual information amount represents a measurement of interdependence between two random variables (X, Y), where a random variable X indicating a probability of occurrence of a word or a sequence of words in all documents and a random variable Y indicating a probability of occurrence of a document belonging to a similarity group in all the documents, the mutual information amount I(X; Y) is calculated by the following equation (2); normalize the idf value of each potential feature by dividing by the number of words in the potential feature, so as to prevent a high degree of importance for a potential feature that merely consists of a large number of words and is not useful for sorting-out; add up the degrees of importance of one or a plurality of potential features included in the document 20b1 to 20bn to calculate a potential information amount; sort the documents 20b1 to 20bn in descending order of potential information amount; generate a plurality of teacher data sets on the basis of the sorting result; perform the machine learning on each of generated teacher 
data sets; ¶¶ [0127]-[0137] with FIG. 12: (S11) calculate, for each of the plurality of potential features extracted at step S10, the degree of importance in the machine learning on the basis of the frequency of occurrence of the potential feature in all the teacher data elements; (S12) add up the degrees of importance of one or a plurality of potential features included in each of the plurality of teacher data elements, calculated at step S11, to thereby calculate a potential information amount; (S14) generate a plurality of teacher data sets by sequentially adding the teacher data elements sorted at step S13, one by one in descending order of potential information amount; (S15) select the teacher data sets one by one in ascending order of the number of teacher data elements from the plurality of teacher data sets; (S16) perform the machine learning on the selected teacher data set to thereby create a learning model; (S17) calculate an evaluation value for the performance of the learning model created by the machine learning; (S18) determine whether the evaluation value for the learning model created based on the teacher data set currently selected is lower than that for the learning model created based on the teacher data set selected last time; if the current evaluation value is not lower, step S15 and subsequent steps are repeated; if the current evaluation value is lower, the process proceeds to step S19 which outputs the learning model created based on the teacher data set selected last time, as a learning model with the highest evaluation value, and then completes the machine learning process).
Radosavovic and Mizobuchi are analogous art because they are from the same field of endeavor, a system and a method relating to machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of Mizobuchi to Radosavovic. The motivation for doing so would be that it becomes possible to exclude inappropriate teacher data elements with little features (small information amount), and thus to improve the learning accuracy.
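For reference, the frequency-based selection taught by Mizobuchi and cited above may be summarized in the following minimal Python sketch; whitespace tokenization, single-word potential features, and the parameter k are simplifying assumptions for illustration:

import math
from collections import Counter

def select_teacher_data(documents, k):
    n = len(documents)
    # df(t): number of documents containing the potential feature t.
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))

    def information_amount(doc):
        # idf(t) = log(n / df(t)) per eqn. (1); with single-word features
        # the normalization by feature length divides by 1, so it is
        # omitted here.
        return sum(math.log(n / df[t]) for t in set(doc.split()))

    # Sort the teacher data elements in descending order of information
    # amount and keep the k elements with the largest amounts.
    ranked = sorted(documents, key=information_amount, reverse=True)
    return ranked[:k]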
Claims 2, 8, and 14
Radosavovic in view of Mizobuchi discloses all the elements as stated in Claims 1, 7, and 13 respectively and further discloses wherein the selecting includes excluding/to exclude second data whose appearance frequency is less than a first threshold from a selection target, the second data being included in the first training data group (Mizobuchi, ¶ [0007]: extracting, from the plurality of teacher data elements, a plurality of potential features each included in at least one of the plurality of teacher data elements; calculating, based on a frequency of occurrence of each of the plurality of potential features in the plurality of teacher data elements, a degree of importance of said each potential feature in machine learning; calculating an information amount of each of the plurality of teacher data elements, using degrees of importance calculated respectively for a plurality of potential features included in said each teacher data element; and selecting a teacher data element for use in the machine learning from the plurality of teacher data elements, based on information amounts of respective ones of the plurality of teacher data elements; ¶¶ [0025]-[0052] with FIG. 1: reads the teacher data elements 20a1 to 20an from the storage unit 11, and extracts, from the teacher data elements 20a1 to 20an, a plurality of potential features each of which is included in at least one of the teacher data elements 20a1 to 20an; what are extracted as the potential features A to C from the teacher data elements 20a1 to 20an is determined according to what is learned in the machine learning; e.g., in the case of creating a learning model for determining whether two documents are similar, take words and sequences of words as features to be extracted; in the case of creating a learning model for determining whether two images are similar, the control unit 12 takes pixel values and sequences of pixel values as features to be extracted; calculate the degree of importance of each potential feature A to C in the machine learning, on the basis of the frequency of occurrence of the potential feature A to C in the teacher data elements 20a1 to 20an; e.g., a potential feature has a higher degree of importance as its frequency of occurrence in all the teacher data elements 20a1 to 20an is lower; in the case where the potential features A to C are words or sequences of words, an inverse document frequency (idf) or another may be used as the degree of importance; even if a potential feature is not useful for sorting-out, its frequency of occurrence becomes lower as the potential feature consists of more words; normalize the idf value by dividing by the length of the potential feature (the number of words) and use the resultant as the degree of importance; the normalization by dividing the idf value by the number of words prevents obtaining a high degree of importance for a potential feature that just consists of many words and is not useful for sorting-out; calculate the information amount (hereinafter, may be referred to as potential information amount) of each of the teacher data elements 20a1 to 20an, using the degrees of importance calculated for the potential features included in the teacher data element 20a1 to 20an; e.g., the information amount of each teacher data element 20a1 to 20an is a sum of the degrees of importance calculated for the potential features included in the teacher data element 20a1 to 20an; select teacher data elements for use in the machine learning, from the teacher data elements 20a1 to 
20an on the basis of the information amounts of the respective teacher data elements 20a1 to 20an; select teacher data elements with information amounts larger than or equal to a threshold, from the teacher data elements 20a1 to 20an, to thereby generate a teacher data set; generates a plurality of teacher data sets by sequentially adding a teacher data element to the teacher data set in descending order of information amount; e.g., the teacher data set 21a of FIG. 1 includes teacher data elements from the teacher data elements 20a2 with the largest information amount to the teacher data element 20an with the k-th largest information amount; "k" is the minimum number of teacher data elements to be used for calculating the evaluation value of a learning model; create a plurality of learning models by performing the machine learning on the individual teacher data sets; calculate an evaluation value regarding the performance of each of the learning models 22a, 22b, and 22c created by the machine learning; an F value is used as the valuation value; the F value is a harmonic mean of recall and precision; search for a teacher data set that produces a learning model with the highest evaluation value; ¶¶ [0072]-[0113] with FIGS. 3-10: a learning model for sorting out similar documents is created using documents at least partly written in natural language as teacher data elements; the documents 20b1 to 20bn are reports on bugs, which includes a title 30 and a body 31 that includes, e.g., descriptions 31a, 31b, and 31c, a source code 31d, and a log 31e; each of the document 20b1 to 20bn is tagged with identification information indicating whether the document 20b1 to 20bn belongs to a similarity group; extract a plurality of potential features from the documents 20b1 to 20bn with natural language processing, which are words or sequences of words; counts the frequency of occurrence of each potential feature in all the documents 20b1 to 20bn; calculate the degree of importance of each potential feature in the machine learning, on the basis of the frequency of occurrence of the potential feature in all the documents 20b1 to 20bn; e.g., as the degree of importance, an idf value or a mutual information amount may be used; here, idf(t) that is an idf value for a word or a sequence of words is calculated by the following equation (1): idf(t)=log(n/df(t)); where "n" denotes the number of all documents, and "df(t)" denotes the number of documents including the word or the sequence of words; the mutual information amount represents a measurement of interdependence between two random variables (X, Y), where a random variable X indicating a probability of occurrence of a word or a sequence of words in all documents and a random variable Y indicating a probability of occurrence of a document belonging to a similarity group in all the documents, the mutual information amount I(X; Y) is calculated by the following equation (2); normalize the idf value of each potential feature by dividing by the number of words in the potential feature, so as to prevent a high degree of importance for a potential feature that merely consists of a large number of words and is not useful for sorting-out; add up the degrees of importance of one or a plurality of potential features included in the document 20b1 to 20bn to calculate a potential information amount; sort the documents 20b1 to 20bn in descending order of potential information amount; generate a plurality of teacher data sets on the basis of the sorting result; perform the 
machine learning on each of generated teacher data sets; ¶¶ [0127]-[0137] with FIG. 12: (S11) calculate, for each of the plurality of potential features extracted at step S10, the degree of importance in the machine learning on the basis of the frequency of occurrence of the potential feature in all the teacher data elements; (S12) add up the degrees of importance of one or a plurality of potential features included in each of the plurality of teacher data elements, calculated at step S11, to thereby calculate a potential information amount; (S14) generate a plurality of teacher data sets by sequentially adding the teacher data elements sorted at step S13, one by one in descending order of potential information amount; (S15) select the teacher data sets one by one in ascending order of the number of teacher data elements from the plurality of teacher data sets; (S16) perform the machine learning on the selected teacher data set to thereby create a learning model; (S17) calculate an evaluation value for the performance of the learning model created by the machine learning; (S18) determine whether the evaluation value for the learning model created based on the teacher data set currently selected is lower than that for the learning model created based on the teacher data set selected last time; if the current evaluation value is not lower, step S15 and subsequent steps are repeated; if the current evaluation value is lower, the process proceeds to step S19 which outputs the learning model created based on the teacher data set selected last time, as a learning model with the highest evaluation value, and then completes the machine learning process) (Radosavovic, Section 4 of Pages 4121-4122: expect the predicted boxes and keypoints to be reliable enough for generating good training labels; nevertheless, the predictions will contain false positives that we hope to identify and discard; use the predicted detection score as a proxy for prediction quality and generate annotations only from the predictions that are above a certain score threshold; a score threshold works well if it makes "the average number of annotated instances per unlabeled image" roughly equal to "the average number of instances per labeled image"; although this heuristic assumes that the unlabeled and labeled images follow similar distributions, it is robust and works well even in cases where the assumption does not hold).
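For reference, the claimed exclusion of second data whose appearance frequency is less than a first threshold may be illustrated by the following minimal Python sketch; representing each training example as a (data, label) pair and the threshold value itself are assumptions for illustration, not features of the cited references:

from collections import Counter

def exclude_low_frequency(first_training_data_group, first_threshold):
    # Count the appearance frequency of each data item in the first
    # training data group.
    freq = Counter(data for data, _label in first_training_data_group)
    # Exclude (remove from the selection target) any data whose appearance
    # frequency is less than the first threshold.
    return [(data, label) for data, label in first_training_data_group
            if freq[data] >= first_threshold]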
Claims 3, 9, and 15
Radosavovic in view of Mizobuchi discloses all the elements as stated in Claims 1, 7, and 13 respectively and further discloses wherein the selecting includes/to: acquiring/acquire entropy and self-information amount of the first data based on the appearance frequency; and excluding/exclude third data whose self-information amount is larger than a second threshold and whose entropy is less than a third threshold from a selection target, the third data being included in the first training data group (Mizobuchi, ¶ [0007]: extracting, from the plurality of teacher data elements, a plurality of potential features each included in at least one of the plurality of teacher data elements; calculating, based on a frequency of occurrence of each of the plurality of potential features in the plurality of teacher data elements, a degree of importance of said each potential feature in machine learning; calculating an information amount of each of the plurality of teacher data elements, using degrees of importance calculated respectively for a plurality of potential features included in said each teacher data element; and selecting a teacher data element for use in the machine learning from the plurality of teacher data elements, based on information amounts of respective ones of the plurality of teacher data elements; ¶¶ [0025]-[0052] with FIG. 1: reads the teacher data elements 20a1 to 20an from the storage unit 11, and extracts, from the teacher data elements 20a1 to 20an, a plurality of potential features each of which is included in at least one of the teacher data elements 20a1 to 20an; what are extracted as the potential features A to C from the teacher data elements 20a1 to 20an is determined according to what is learned in the machine learning; e.g., in the case of creating a learning model for determining whether two documents are similar, take words and sequences of words as features to be extracted; in the case of creating a learning model for determining whether two images are similar, the control unit 12 takes pixel values and sequences of pixel values as features to be extracted; calculate the degree of importance of each potential feature A to C in the machine learning, on the basis of the frequency of occurrence of the potential feature A to C in the teacher data elements 20a1 to 20an; e.g., a potential feature has a higher degree of importance as its frequency of occurrence in all the teacher data elements 20a1 to 20an is lower; in the case where the potential features A to C are words or sequences of words, an inverse document frequency (idf) or another may be used as the degree of importance; even if a potential feature is not useful for sorting-out, its frequency of occurrence becomes lower as the potential feature consists of more words; normalize the idf value by dividing by the length of the potential feature (the number of words) and use the resultant as the degree of importance; the normalization by dividing the idf value by the number of words prevents obtaining a high degree of importance for a potential feature that just consists of many words and is not useful for sorting-out; calculate the information amount (hereinafter, may be referred to as potential information amount) of each of the teacher data elements 20a1 to 20an, using the degrees of importance calculated for the potential features included in the teacher data element 20a1 to 20an; e.g., the information amount of each teacher data element 20a1 to 20an is a sum of the degrees of importance calculated for the 
potential features included in the teacher data element 20a1 to 20an; select teacher data elements for use in the machine learning, from the teacher data elements 20a1 to 20an on the basis of the information amounts of the respective teacher data elements 20a1 to 20an; select teacher data elements with information amounts larger than or equal to a threshold, from the teacher data elements 20a1 to 20an, to thereby generate a teacher data set; generates a plurality of teacher data sets by sequentially adding a teacher data element to the teacher data set in descending order of information amount; e.g., the teacher data set 21a of FIG. 1 includes teacher data elements from the teacher data elements 20a2 with the largest information amount to the teacher data element 20an with the k-th largest information amount; "k" is the minimum number of teacher data elements to be used for calculating the evaluation value of a learning model; create a plurality of learning models by performing the machine learning on the individual teacher data sets; calculate an evaluation value regarding the performance of each of the learning models 22a, 22b, and 22c created by the machine learning; an F value is used as the valuation value; the F value is a harmonic mean of recall and precision; search for a teacher data set that produces a learning model with the highest evaluation value; ¶¶ [0072]-[0113] with FIGS. 3-10: a learning model for sorting out similar documents is created using documents at least partly written in natural language as teacher data elements; the documents 20b1 to 20bn are reports on bugs, which includes a title 30 and a body 31 that includes, e.g., descriptions 31a, 31b, and 31c, a source code 31d, and a log 31e; each of the document 20b1 to 20bn is tagged with identification information indicating whether the document 20b1 to 20bn belongs to a similarity group; extract a plurality of potential features from the documents 20b1 to 20bn with natural language processing, which are words or sequences of words; counts the frequency of occurrence of each potential feature in all the documents 20b1 to 20bn; calculate the degree of importance of each potential feature in the machine learning, on the basis of the frequency of occurrence of the potential feature in all the documents 20b1 to 20bn; e.g., as the degree of importance, an idf value or a mutual information amount may be used; here, idf(t) that is an idf value for a word or a sequence of words is calculated by the following equation (1): idf(t)=log(n/df(t)); where "n" denotes the number of all documents, and "df(t)" denotes the number of documents including the word or the sequence of words; the mutual information amount represents a measurement of interdependence between two random variables (X, Y), where a random variable X indicating a probability of occurrence of a word or a sequence of words in all documents and a random variable Y indicating a probability of occurrence of a document belonging to a similarity group in all the documents, the mutual information amount I(X; Y) is calculated by the following equation (2); normalize the idf value of each potential feature by dividing by the number of words in the potential feature, so as to prevent a high degree of importance for a potential feature that merely consists of a large number of words and is not useful for sorting-out; add up the degrees of importance of one or a plurality of potential features included in the document 20b1 to 20bn to calculate a potential information amount; 
sort the documents 20b1 to 20bn in descending order of potential information amount; generate a plurality of teacher data sets on the basis of the sorting result; perform the machine learning on each of generated teacher data sets; ¶¶ [0127]-[0137] with FIG. 12: (S11) calculate, for each of the plurality of potential features extracted at step S10, the degree of importance in the machine learning on the basis of the frequency of occurrence of the potential feature in all the teacher data elements; (S12) add up the degrees of importance of one or a plurality of potential features included in each of the plurality of teacher data elements, calculated at step S11, to thereby calculate a potential information amount; (S14) generate a plurality of teacher data sets by sequentially adding the teacher data elements sorted at step S13, one by one in descending order of potential information amount; (S15) select the teacher data sets one by one in ascending order of the number of teacher data elements from the plurality of teacher data sets; (S16) perform the machine learning on the selected teacher data set to thereby create a learning model; (S17) calculate an evaluation value for the performance of the learning model created by the machine learning; (S18) determine whether the evaluation value for the learning model created based on the teacher data set currently selected is lower than that for the learning model created based on the teacher data set selected last time; if the current evaluation value is not lower, step S15 and subsequent steps are repeated; if the current evaluation value is lower, the process proceeds to step S19 which outputs the learning model created based on the teacher data set selected last time, as a learning model with the highest evaluation value, and then completes the machine learning process; it is also well-known in the art (see en.wikipedia.org/wiki/Information_content and en.wikipedia.org/wiki/Entropy_(information_theory)) that entropy is equivalent to mutual information amount in eqn. (2) and self-information is equivalent to idf value in eqn. (1) except for the negative sign).
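For reference, the relationship between appearance frequency, self-information, and entropy noted above may be illustrated by the following minimal Python sketch; treating the first data as labeled (word, label) pairs and the two threshold values are assumptions for illustration, not features of the cited references:

import math
from collections import Counter, defaultdict

def self_info_and_entropy(labeled_tokens):
    # Self-information of a word is -log2 of its relative appearance
    # frequency; its entropy is that of the label distribution observed
    # for the word (see the Wikipedia articles cited above).
    word_counts = Counter(word for word, _ in labeled_tokens)
    total = sum(word_counts.values())
    label_counts = defaultdict(Counter)
    for word, label in labeled_tokens:
        label_counts[word][label] += 1

    stats = {}
    for word, count in word_counts.items():
        self_information = -math.log2(count / total)
        probs = [c / count for c in label_counts[word].values()]
        entropy = -sum(p * math.log2(p) for p in probs)
        stats[word] = (self_information, entropy)
    return stats

def exclude_third_data(labeled_tokens, second_threshold, third_threshold):
    # Exclude data whose self-information is larger than the second
    # threshold and whose entropy is less than the third threshold.
    stats = self_info_and_entropy(labeled_tokens)
    return [(w, l) for w, l in labeled_tokens
            if not (stats[w][0] > second_threshold
                    and stats[w][1] < third_threshold)]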
Claims 4, 10, and 16
Radosavovic in view of Mizobuchi discloses all the elements as stated in Claims 1, 7, and 13 respectively and further discloses wherein the generating the second training data group includes generating/to generate the second training data group by combining the first training data group and a first result output by the first machine learning model when fourth data generated by changing content of fifth data included in the first training data group is input (Radosavovic, Abstract of Page4119: to exploit the omni-supervised setting, propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations; Section 1 with FIG. 1 of Pages 4119-4120: propose to perform knowledge distillation from data, inspired by [3, 18] which performed knowledge distillation from models; generate annotations on unlabeled data using a model trained on large amounts of labeled data, and then retrain the model using the extra generated annotations; ensembling the results of a single model run on different transformations (e.g., flipping and scaling) of an unlabeled image; such transformations are widely known to improve single-model accuracy [20] when applied at test time, indicating that they can provide nontrivial knowledge that is not captured by a single prediction; in other words, in comparison with [18], which distills knowledge from the predictions of multiple models, distill the knowledge of a single model run on multiple transformed copies of unlabeled data (see Figure 1); data distillation is a simple and natural approach based on “self-training” (i.e., making predictions on unlabeled data and using them to update the model); we are now equipped with accurate models that may make fewer errors than correct predictions; this allows us to trust their predictions on unseen data and reduces the requirement for developing data cleaning heuristics; as a result, data distillation does not require one to change the underlying recognition model (e.g., no modification on the loss definitions), and is a scalable solution for processing large-scale unlabeled data sources; Section 2 of Page 4120: ensembling [14] multiple models has been a successful method for improving accuracy; model compression [3] is proposed to improve test-time efficiency of ensembling by compressing an ensemble of models into a single student model; our approach distills knowledge from a ligh