Prosecution Insights
Last updated: April 19, 2026
Application No. 17/350,034

FEATURE SELECTION USING TESTING DATA

Non-Final OA §103
Filed: Jun 17, 2021
Examiner: TRAN, AMY NMN
Art Unit: 2126
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 3 (Non-Final)
Grant Probability: 36% (At Risk)
OA Rounds: 3-4
To Grant: 5y 2m
With Interview: 84%

Examiner Intelligence

Career Allow Rate: 36% (10 granted / 28 resolved; -19.3% vs TC avg)
Interview Lift: +47.9% (strong; based on resolved cases with interview)
Avg Prosecution: 5y 2m
Currently Pending: 24
Total Applications: 52 (across all art units)

Statute-Specific Performance

§101: 32.5% (-7.5% vs TC avg)
§103: 44.2% (+4.2% vs TC avg)
§102: 6.0% (-34.0% vs TC avg)
§112: 15.6% (-24.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 28 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12-01-2025 has been entered.

Response to Amendment

The amendments filed on 12-01-2025 have been entered. Applicant's amendments to the claims overcome the claim rejection under 35 U.S.C. 112(b) previously set forth in the Final Office Action mailed on 10-01-2025. The status of the claims is as follows: Claims 1-5, 7-13, 15-21, and 23-28 remain pending in the application. Claims 6, 14, and 22 are cancelled. Claims 1, 7, 9, 15, 17, 23, and 25 are amended. Claims 26-28 are new.

Response to Arguments

In reference to the claim rejections under 35 U.S.C. 101: Applicant's arguments, see Remarks pg. 12-21, filed 12-01-2025, with respect to the claim rejections under 35 U.S.C. 101 have been fully considered and are persuasive. The rejection of the claims under 35 U.S.C. 101 has been withdrawn.

In reference to the claim rejection under 35 U.S.C. 103: Applicant's arguments, see Remarks pg. 21-25, filed , with respect to the rejection(s) of claim(s) under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Pan et al. ("Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber").

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C.
102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-2, 4-5, 7-10, 12-13, 15-18, 20-21, 23-24 and 26-28 are rejected under 35 U.S.C. 103 as being unpatentable over Hosseini & Mahdavi ("F Plus KS: A new Feature Selection Strategy for Steganalysis") (hereafter referred to as "Hosseini") in view of Hassani & Silva ("A Kolmogorov-Smirnov Based Test for Comparing the Predictive Accuracy of Two Sets of Forecasts") (hereafter referred to as "Hassani") and further in view of Haws et al. (US 9,483,739 B2) (hereafter referred to as "Haws"), Rajagopal et al.
("Towards Effective Network Intrusion Detection: From Concept to Creation on Azure Cloud") (hereafter referred to as "Rajagopal") and Pan et al. ("Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber") (hereafter referred to as "Pan").

Regarding claim 1, Hosseini explicitly discloses: A computer-implemented method comprising: performing, by one or more processors, on a plurality of datasets, a plurality of distribution tests, (Hosseini, Page 1, Col. 1, Section I, ¶[2]: "In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.", Page 2, Col. 1, Section II.A.1), ¶[1-2]: "Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variable M and N can be calculated from the following steps. Random variables outcomes x are discretized, separately, into k bins [x_i, x_{i+1}], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. KS statistic can be calculated from (1). [Equation (1) image omitted] In (1), n_M and n_N are the number of elements in M and N. Also M_k and N_k refer to cumulative probability distributions of M and N in kth bin, respectively.
In each statistical significance level of α, if KS (M,N) < λα, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset. Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] each dataset of the plurality of datasets comprises: (i) a set of training entries corresponding to a single predictive feature of the plurality of predictive features, (ii) a set of testing entries corresponding to the single predictive feature, and (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) generating, [by one or more processors], a final feature set based on the differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section. B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them. In order to evaluate the similarity of the features, KS-test can be used and the evaluation can be made through a forward comparison”, Page 3, Col. 1, Section. 
B, ¶[3-4]: “To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section… The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions. In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) [Examiner’s note: Hosseini discloses evaluating the similarity of the features by using KS test then sorting them to form the optimal final feature set, which aligns with the concept of partitioning the features into differential set (e.g., set of features which do not share similarity) and consistent set (e.g., set of features which share similarity)] building, by one or more processors, a machine learning model using the final feature set; (Hosseini, Pg. 1, Col. 1, Section I, ¶[1]: “The statistical measures which model these relations are known as features”, Col. 2, ¶[1]: “Kodovský [6] introduced rich models which were combination of smaller sub-models made to capture inter-block and intra-block relationships, form both the image and calibrated one. 
The final feature set included 22,510 features.")

Hosseini fails to disclose: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; the set of testing entries are devoid of a target; and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries partitioning, [by one or more processors], the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries

However, Pan explicitly discloses: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (Pan, Abstract: "With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data.", Pg. 2, Figure 2: [Figure 2 image omitted], Pg. 2, Col. 1, ¶[1]: "In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset. Classification performance better than random guess indicates the different feature distributions between the training and test datasets.") (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; (Pan, Pg. 2, Col.
1, ¶[1]: "In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset", Pg. 3, Col. 1, ¶[1]: "We start with a labeled training dataset {(y_train, X_train)} ∈ R × R^d, and an unlabeled test dataset {X_test} ∈ R^d with an unknown conditional probability P_{y|X}. Then, we train an adversarial classifier that predicts P({train, test}|X) to separate train and test, and generate the propensity score p_propensity = P(test|X) on both X_test and X_train.") [Examiner's note: Pan's adversarial classifier is trained to predict whether a sample is from the training or testing dataset, which is effectively a source label.] and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries (Pan, Pg. 3, Col. 1, ¶[2]: "The feature importance and propensity score from the adversarial classifier can be used to detect concept drift between the training and test data, and provide insights on the cause of the concept drift such as which features and subsamples in the training data are most different from ones in the test data.")

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Pan. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems. One of ordinary skill would have had motivation to combine Hosseini and Pan in order to select, retain, or remove features based not only on predictive usefulness but also on whether those features remain stable across datasets, thereby reducing selection bias, improving robustness to drift, and producing a model that generalizes better to unseen test data.
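The adversarial-validation idea the rejection draws from Pan is simple enough to sketch. The following is a hypothetical illustration, not code from Pan or the application: label each training row 0 and each testing row 1 (the source label), fit any classifier to separate them, and treat better-than-chance discrimination (AUC well above 0.5) as evidence that the two value distributions differ. All function names here are invented for illustration; a plain gradient-descent logistic regression stands in for the classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=500):
    # Plain gradient-descent logistic regression; stands in for any classifier.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def auc(scores, labels):
    # Rank-based AUC: chance that a random test row outscores a random train row.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def adversarial_auc(X_train, X_test):
    # Source label: 0 = training entry, 1 = testing entry.
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    w, b = fit_logistic(X, y)
    return auc(sigmoid(X @ w + b), y)  # ~0.5: same distribution; >>0.5: drift
```

An AUC near 0.5 means the classifier cannot tell the two sources apart (consistent distributions); an AUC well above 0.5 signals drift between the training and testing entries.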
However, Rajagopal explicitly discloses: and (iii) a data source variable indicating a type of each entry of the set of training entries and the set of testing entries; (Rajagopal, Pg. 19729, Table 2: [Table 2 image omitted]) wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries (Rajagopal, Pg. 19732, Col. 1, ¶[5]: "While building machine learning models, it often becomes imperative to compare the performance of classifiers and the best way to achieve this is to perform statistical significance tests… In this context, the null hypothesis (H0) suggests that there is no performance difference among classifiers whereas an alternate hypothesis (H1) indicates that at least one classifier performs differently. Suppose, 'd' refers to the number of datasets and 'k' signifies the number of classifiers, Friedman test statistic can be calculated as shown in equation (7). [Equation (7) image omitted]")

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Rajagopal. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification.
One of ordinary skill would have had motivation to combine Hosseini and Rajagopal because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" — choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

However, Haws explicitly discloses: the set of testing entries are devoid of a target; (Haws, Col. 1, Lines 53-58: "The feature selection module is configured to perform a method. The method includes receiving a set of training samples and a set of test samples. The set of training samples includes a first set of features and a class value. The set of test samples includes the set of features absent the class value.") [Examiner's note: the testing dataset is devoid of the target, e.g., the set of test samples includes the set of features absent the class value]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Haws. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. One of ordinary skill would have had motivation to combine Hosseini and Haws because, by excluding the target outcome from the testing dataset during the feature selection process, the features are chosen based purely on the underlying data patterns, not on knowledge of the final outcome.
This leads to a more honest evaluation of the model's true predictive power and helps create models that are reliable and robust when applied to unseen data.

However, Hassani explicitly discloses: partitioning, [by one or more processors], the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, (Hassani, Page 594, ¶[2]: "Next, we introduce the hypothesis which are relevant for the proposed KSPA test. Let us begin by presenting the hypothesis for the two-sided KS test. Let X and Y be two random variables with c.d.f.'s F_X and F_Y, respectively. Then, a two sample, two-sided KS test will test the hypothesis that both c.d.f.'s have an identical distribution, and the resulting null and alternate hypothesis can be expressed as: [Equation (6) image omitted] In simple terms, the null hypothesis in Equation (6) states that both X and Y share an identical distribution whilst the alternate hypothesis states that X and Y do not share the same distribution.")

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Hassani. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. One of ordinary skill would have had motivation to combine Hosseini and Hassani in order to measure how different the distributions of a feature are across groups. If a feature behaves very differently between two classes, that is a good sign it is useful for making predictions. This is especially helpful in real-world situations where data does not follow neat patterns: since KS does not assume any particular distribution, it works well no matter how messy the data gets.
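The claimed partitioning step, as mapped onto the KS discussion in Hosseini and Hassani above, can be illustrated with a short sketch. This is a hypothetical example, not an implementation from the application or the cited references; the function names and the dict-of-arrays layout are invented for illustration. It computes the two-sample KS statistic per feature and splits features into a differential set (train/test distributions differ beyond the critical value λα) and a consistent set (they do not).

```python
import numpy as np

def ks_statistic(a, b):
    # Two-sample KS statistic: maximum gap between the empirical CDFs of a and b.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def partition_features(train, test, c_alpha=1.36):
    # train/test: dicts mapping feature name -> 1-D array of values.
    # c_alpha = 1.36 is the asymptotic two-sample critical coefficient at alpha = 0.05,
    # so lambda_alpha = c_alpha * sqrt((n + m) / (n * m)).
    differential, consistent = [], []
    for name, tr in train.items():
        te = test[name]
        lam = c_alpha * np.sqrt((len(tr) + len(te)) / (len(tr) * len(te)))
        (differential if ks_statistic(tr, te) > lam else consistent).append(name)
    return differential, consistent
```

A feature whose testing values have drifted exceeds λα and lands in the differential set; a feature whose distribution is unchanged stays in the consistent set, matching the claimed partition.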
Regarding claim 2, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 1 (as shown in the rejection above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: identifying, by one or more processors, a lead feature from the consistent feature set; (Hosseini, Page 2, Col. 1, ¶[3]: "in the proposed method, F is used just for ranking features. Afterwards, a strategy based on Kolmogorov-Smirnov (KS) test [16] is used in order to remove redundant features.", Page 3, Col. 1, Section B, ¶[4]: "The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability.", Col. 2, Fig. 2: [Figure 2 image omitted]) [Examiner's note: Fig. 2 discloses the features are ranked based on the F statistical measure in descending order, with the most important feature placed first, which aligns with the concept of identifying the lead feature.] adjusting, by one or more processors, the differential feature set based on a correlation of each feature in the differential feature set to the lead feature; and (Hosseini, Page 3, Col. 1, Section B, ¶[3]: "Removing redundant features during forward comparison leads to sensitiveness of the algorithm to the ordering of the input features. To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section. In KS-CBF1 algorithm [19] which also uses KS for removing redundant features and is similar to the suggested algorithm, the features are reordered using a ranking system based on SUC2.", Col. 2, Fig. 2: [Figure 2 image omitted]) [Examiner's note: The highlight indicates that the features that pass the KS test (i.e., have significantly different distribution) remain in the differential set, while features that fail the KS test (i.e., have nearly identical distribution to a higher-ranked feature) are removed. This means the algorithm in Fig. 2 is dynamically adjusting the differential set based on how similar each feature is to the lead feature] generating, by one or more processors, the final feature set based on the adjusted differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section B, ¶[1]: "The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.", Page 3, Col. 2, ¶[2]: "The result of comparing features is reduction of redundant features, which can lead to reduction of features that should be extracted from the image. Furthermore, the dimensionality of feature vectors will be decreased and as the result, classification complexity will be reduced.") [Examiner's note: the final feature set is generated after removing the redundant feature (i.e., adjusting the differential feature set), which results in a reduction of features.]

Regarding claim 4, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 1 (as shown in the rejection above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: wherein [[the]] partitioning the plurality of features further comprises: adding, by one or more processors, a first [[one]] predictive feature of the plurality of predictive features to the differential feature set in response to determining that the first predictive feature produces a first distribution corresponding to the set of training entries that is different from a second distribution corresponding to the set testing entries.
(Hassani, Page 595, Section 2.2, ¶[3]: "The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: [equation image omitted] where F_{ε_{i+h}^{m1}} and F_{ε_{i+h}^{m2}} denote the empirical c.d.f.'s for the forecast errors from two different models.", Page 596, ¶[2]: "Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f.'s, the two-sided KSPA test will test the hypothesis: [hypothesis equations image omitted] Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.") [Examiner's note: The null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set). The "two forecasting models m1 and m2" is being interpreted as the "training dataset" and "testing dataset"]

Regarding claim 5, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 4 (as shown in the rejection above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: adding, by one or more processors, a second [[one]] predictive feature of the plurality of predictive features to the consistent feature set in response to determining that the second predictive feature produces a third distribution corresponding to the set of training entries that is equivalent to a fourth distribution corresponding to the set testing entries. (Hassani, Page 595, Section 2.2, ¶[3]: "The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: [equation image omitted] where F_{ε_{i+h}^{m1}} and F_{ε_{i+h}^{m2}} denote the empirical c.d.f.'s for the forecast errors from two different models.", Page 596, ¶[2]: "Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f.'s, the two-sided KSPA test will test the hypothesis: [hypothesis equations image omitted] Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.") [Examiner's note: The null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set).
The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 7, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 1 (as shown in the rejection above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: selecting, by one or more processors, a training dataset comprising a plurality of features and a target, wherein the training dataset comprises the set of training entries; (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) for each of the plurality of features: selecting, by one or more processors, one of the plurality of features; (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1) performing, by one or more processors, a statistical test on the selected feature to determine whether the selected feature is statistically important to the target; and (Hosseini, Page 1, Col. 
1, Section I, ¶[2]: "In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.", Page 2, Col. 1, Section II.A.1), ¶[1-2]: "Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variable M and N can be calculated from the following steps. Random variables outcomes x are discretized, separately, into k bins [x_i, x_{i+1}], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. KS statistic can be calculated from (1). [Equation (1) image omitted] In (1), n_M and n_N are the number of elements in M and N. Also M_k and N_k refer to cumulative probability distributions of M and N in kth bin, respectively. In each statistical significance level of α, if KS (M,N) < λα, the two random variables M and N are related to a unique distribution.") [Examiner's note: the "plurality of datasets" is being interpreted as the "cover" dataset and the "stegos" dataset. Because the features in these datasets are used to train a classifier to "predict suspect objects", they are being interpreted as the "predictive features".
The "plurality of distribution tests" is being interpreted as the "cumulative probability distributions" and the "Kolmogorov-Smirnov statistical test"] adding, by one or more processors, the selected feature to the plurality of predictive features based on the statistical test. (Hassani, Page 595, Section 2.2, ¶[3]: "The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: [equation image omitted] where F_{ε_{i+h}^{m1}} and F_{ε_{i+h}^{m2}} denote the empirical c.d.f.'s for the forecast errors from two different models.", Page 596, ¶[2]: "Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f.'s, the two-sided KSPA test will test the hypothesis: [hypothesis equations image omitted] Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.") [Examiner's note: The null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set).
The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 8, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 7 (as shown in the rejections above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: selecting, by one or more processors, a testing dataset comprising the plurality of features and which is devoid of the target, wherein the testing dataset comprises the set of testing entries; (Haws, Col. 1, Lines 53-58: “The feature selection module is configured to perform a method. The method includes receiving a set of training samples and a set of test samples. The set of training samples includes a first set of features and a class value. The set of test samples includes the set of features absent the class value.”) [Examiner’s note: the testing dataset is devoid of the target e.g., the set of test samples includes the set of features absent the class value] selecting, by one or more processors, one of the plurality of predictive features; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”) combining, by one or more processors, one or more of the set of training entries corresponding to the selected predictive feature with one or more of the set of testing entries corresponding to the selected feature into a selected one of the plurality of datasets; and (Hosseini, Page 3, Col. 2, Section III, ¶[2]: “In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. 
Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain. Then, a classification was performed where all the features were considered and the training and testing data were scaled according to the training data.”) [Examiner’s note: the highlights indicate that the features from both training set and testing set are extracted] adding, by one or more processors, a data source variable to the selected dataset that indicates a dataset source of each of the entries in the selected dataset. (Hassani, Page 595, Section 2.2, ¶[1]: “Let us begin by defining forecast errors. Suppose we have a real valued, non zero time series YN = (y1, … yt, … yN) of sufficient length N. YN is divided into two parts, i.e., training set and test set such that Y1 = (y1, …, yt) represents the training set and Y2 = (yt+1,…, yN) represents the test set”) [Examiner’s note: “data source variable” is being interpreted as Y1 = (y1, …, yt) and Y2 = (yt+1, …, yN)] Regarding claim 9, Hosseini explicitly discloses: performing, on a plurality of datasets, a plurality of distribution tests, wherein (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1), ¶[1-2]: “Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. 
If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variable M and N can be calculated from the following steps. Random variables outcomes x are discretized, separately, into k bins [xi, xi+1], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. KS statistic can be calculated from (1): KS(M, N) = sqrt(nM·nN/(nM + nN)) · max_k |Mk − Nk| (1). In (1), nM and nN are the number of elements in M and N. Also Mk and Nk refer to cumulative probability distributions of M and N in kth bin, respectively. In each statistical significance level of α, if KS (M,N) < λα, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset. Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] each dataset of the plurality of datasets comprises: (i) a set of training entries corresponding to a single predictive feature of the plurality of predictive features, (ii) a set of testing entries corresponding to the single predictive feature, and (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel).
Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) generating a final feature set based on the differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section. B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them. In order to evaluate the similarity of the features, KS-test can be used and the evaluation can be made through a forward comparison”, Page 3, Col. 1, Section. B, ¶[3-4]: “To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section… The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions. In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) [Examiner’s note: Hosseini discloses evaluating the similarity of the features by using KS test then sorting them to form the optimal final feature set, which aligns with the concept of partitioning the features into differential set (e.g., set of features which do not share similarity) and consistent set (e.g., set of features which share similarity)] building, by one or more processors, a machine learning model using the final feature set; (Hosseini, Pg. 1, Col. 1, Section I, ¶[1]: “The statistical measures which model these relations are known as features”, Col. 
2, ¶[1]: “Kodovský [6] introduced rich models which were combination of smaller sub-models made to capture inter-block and intra-block relationships, from both the image and calibrated one. The final feature set included 22,510 features.”) Hosseini fails to disclose: a processor set; one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media to cause the processor set to perform operations comprising: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; the set of testing entries are devoid of a target; and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries partitioning the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on the corresponding distribution test, wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries However, Pan explicitly discloses: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (Pan, Abstract: “With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data.”, Pg. 2, Figure 2 [figure image omitted], Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset.
Classification performance better than random guess indicates the different feature distributions between the training and test datasets.”) (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; (Pan, Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset”, Pg. 3, Col. 1, ¶[1]: “We start with a labeled training dataset {(y_train, X_train)} ∈ R × R^d, and an unlabeled test dataset {X_test} ∈ R^d with an unknown conditional probability P_{y|X}. Then, we train an adversarial classifier that predicts P({train, test}|X) to separate train and test, and generate the propensity score p_propensity = P(test|X) on both X_test and X_train.”) [Examiner’s note: Pan’s adversarial classifier is trained to predict whether a sample is from training or testing dataset, which is effectively a source label.] and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries (Pan, Pg. 3, Col. 1, ¶[2]: “The feature importance and propensity score from the adversarial classifier can be used to detect concept drift between the training and test data, and provide insights on the cause of the concept drift such as which features and subsamples in the training data are most different from ones in the test data.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Pan. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from Kolmogorov-Smirnov test and F from F-to-remove). Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems.
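To illustrate the adversarial-validation technique Pan describes (train a classifier to tell training rows from test rows; better-than-chance performance signals a difference in feature distributions), the following is a minimal Python sketch. It is illustrative only and not material from the record: the decision-stump classifier, the `adversarial_validation` name, and the synthetic Gaussian data are all invented for this example.

```python
import random

def adversarial_validation(train_rows, test_rows):
    """Label every row by its source (0 = train, 1 = test), fit the best
    one-feature decision stump, and return its accuracy. Accuracy near
    0.5 suggests the two sets share the same feature distribution;
    accuracy well above 0.5 indicates drift."""
    data = [(row, 0) for row in train_rows] + [(row, 1) for row in test_rows]
    best_acc = 0.5
    for j in range(len(data[0][0])):              # each feature
        for row, _ in data:                       # each candidate threshold
            thr = row[j]
            for side in (0, 1):                   # which class lies above thr
                correct = sum(1 for r, label in data
                              if (side if r[j] > thr else 1 - side) == label)
                best_acc = max(best_acc, correct / len(data))
    return best_acc

rng = random.Random(1)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
same_dist = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
drifted = [(rng.gauss(3, 1), rng.gauss(0, 1)) for _ in range(60)]
```

Here a high stump accuracy on `drifted` flags the mean shift in the first feature, mirroring Pan's use of the adversarial classifier's feature importance to localize the cause of drift.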
One of ordinary skill would have motivation to combine Hosseini and Pan in order to select, retain or remove features based not only on predictive usefulness but also on whether those features remain stable across datasets, thereby reducing selection bias, improving robustness to drift, and producing a model that generalizes better to unseen test data. However, Rajagopal explicitly discloses: and (iii) a data source variable indicating a type of each entry of the set of training entries and the set of testing entries; (Rajagopal, Pg. 19729, Table 2 [table image omitted]) wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries (Rajagopal, Pg. 19732, Col. 1, ¶[5]: “While building machine learning models, it often becomes imperative to compare the performance of classifiers and the best way to achieve this is to perform statistical significance tests… In this context, the null hypothesis (H0) suggests that there is no performance difference among classifiers whereas an alternate hypothesis (H1) indicates that at least one classifier performs differently. Suppose, ‘d’ refers to the number of datasets and ‘k’ signifies the number of classifiers, Friedman test statistic can be calculated as shown in equation (7): χ²_F = (12d/(k(k+1))) · (Σ_j R_j² − k(k+1)²/4)”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Rajagopal. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from Kolmogorov-Smirnov test and F from F-to-remove). Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification.
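As a worked illustration of the Friedman statistic cited from Rajagopal (d datasets, k classifiers), the sketch below computes the standard mean-rank form χ²_F = (12d/(k(k+1)))·(Σ_j R_j² − k(k+1)²/4). The function name, the no-ties assumption, and the toy score tables are assumptions made for this example, not material from the reference.

```python
def friedman_statistic(scores):
    """Friedman chi-square for a d x k score table (d datasets, k
    classifiers); higher score = better, rank 1 = best, ties ignored.
    chi2_F = 12d / (k(k+1)) * (sum_j Rbar_j^2 - k(k+1)^2 / 4),
    where Rbar_j is classifier j's mean rank over the d datasets."""
    d, k = len(scores), len(scores[0])
    mean_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])  # rank within dataset
        for rank, j in enumerate(order, start=1):
            mean_ranks[j] += rank / d
    return 12 * d / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)

# One classifier dominating every dataset yields the largest statistic;
# perfectly rotated ranks (every classifier averages the same rank) give 0.
chi_max = friedman_statistic([[3, 2, 1]] * 4)
chi_zero = friedman_statistic([[3, 2, 1], [1, 3, 2], [2, 1, 3]])
```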
One of ordinary skill would have motivation to combine Hosseini and Rajagopal because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art. However, Haws explicitly discloses: a processor set; (Haws, Col. 1, Lines 49-51: “The information processing system includes a memory and a processor that is communicatively coupled to the memory”) one or more computer-readable storage media; and (Haws, Col. 1, Lines 49-51: “The information processing system includes a memory and a processor that is communicatively coupled to the memory”) program instructions stored [[in]] on the one or more computer-readable storage media to cause the processor set to perform operations comprising: (Haws, Col. 7, Lines 33-37: “In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Haws. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from Kolmogorov-Smirnov test and F from F-to-remove). Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria.
One of ordinary skill would have motivation to combine Hosseini and Haws because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art. However, Hassani explicitly discloses: partitioning the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, (Hassani, Page 594, ¶[2]: “Next, we introduce the hypothesis which are relevant for the proposed KSPA test. Let us begin by presenting the hypothesis for the two-sided KS test. Let X and Y be two random variables with c.d.f.’s FX and FY, respectively. Then, a two sample, two-sided KS test will test the hypothesis that both c.d.f.’s have an identical distribution, and the resulting null and alternate hypothesis can be expressed as (6): H0: FX(x) = FY(x) for all x, versus H1: FX(x) ≠ FY(x) for some x. In simple terms, the null hypothesis in Equation (6) states that both X and Y share an identical distribution whilst the alternate hypothesis states that X and Y do not share the same distribution.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Hassani. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using Kolmogorov-Smirnov technique.
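The two-sample, two-sided KS test relied on by Hassani can be sketched directly from its definition, D = sup_x |FX(x) − FY(x)|, compared against a large-sample critical value. The sketch below is illustrative only; the function names and the standard approximation c(α) = sqrt(−ln(α/2)/2) for the critical value are assumptions for this example, not drawn from the record.

```python
import math

def ecdf(sample, x):
    """Empirical c.d.f. of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_two_sample(xs, ys, alpha=0.05):
    """Two-sided two-sample KS test: D = sup_x |F_X(x) - F_Y(x)|.
    Rejects H0 (identical distributions) when D exceeds the large-sample
    critical value c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(xs), len(ys)
    d = max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in sorted(set(xs) | set(ys)))
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)   # ~1.358 at alpha = 0.05
    return d, d > c_alpha * math.sqrt((n + m) / (n * m))

d_same, reject_same = ks_two_sample(list(range(100)), list(range(100)))
d_diff, reject_diff = ks_two_sample(list(range(100)), [v + 50 for v in range(100)])
```

A rejected null (differing distributions) corresponds to the "differential" side of the claimed partition, and a retained null to the "consistent" side.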
One of ordinary skill would have motivation to combine Hosseini and Hassani in order to measure how different the distributions of a feature are across groups. If a feature behaves very differently between two classes, that is a good sign it is useful for making predictions. This is especially helpful in real-world situations where data does not follow neat patterns: because the KS test does not assume any particular distribution, it works well no matter how messy the data gets. Regarding claim 10, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of Claim 9 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: identifying a lead feature from the consistent feature set; (Hosseini, Page 2, Col. 1, ¶[3]: “in the proposed method, F is used just for ranking features. Afterwards, a strategy based on Kolmogorov-Smirnov (KS) test [16] is used in order to remove redundant features.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability.”, Col. 2, Fig. 2 [figure image omitted]) [Examiner’s note: Fig. 2 discloses the features are ranked based on F statistical measure in descending order, with the most important feature placed first, which aligns with the concept of identifying the lead feature.] adjusting the differential feature set based on a correlation of each feature in the differential feature set to the lead feature; and (Hosseini, Page 3, Col. 1, Section B, ¶[3]: “Removing redundant features during forward comparison leads to sensitiveness of the algorithm to the ordering of the input features. To resolve this problem and make the results consistent, a sorting on the input features is required.
In this paper it is suggested to sort features according to F statistic described in previous section. In KS-CBF1 algorithm [19] which also uses KS for removing redundant features and is similar to the suggested algorithm, the features are reordered using a ranking system based on SUC2.”, Col. 2, Fig. 2 [figure image omitted]) [Examiner’s note: The highlight indicates that the features that pass the KS test (i.e., have significantly different distribution) remain in the differential set, while features that fail the KS test (i.e., have nearly identical distribution to a higher-ranked feature) are removed. This means the algorithm in Fig. 2 is dynamically adjusting the differential set based on how similar each feature is to the lead feature] generating the final feature set based on the adjusted differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col. 2, ¶[2]: “The result of comparing features is reduction of redundant features, which can lead to reduction of features that should be extracted from the image. Furthermore, the dimensionality of feature vectors will be decreased and as the result, classification complexity will be reduced.”) [Examiner’s note: the final feature set is generated after removing the redundant feature (i.e., adjusting the differential feature set), which results in a reduction of features.] Regarding claim 12, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of Claim 9 (as shown in the rejections above).
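The Fig. 2 procedure attributed to Hosseini (rank features by F descending, then drop any feature whose distribution matches an already-kept, higher-ranked feature) can be sketched as follows. This is a paraphrase for illustration only: the `select_features` name, the fixed sup-distance threshold standing in for the λα cutoff, and the toy data are all assumptions.

```python
def select_features(feature_samples, f_scores, ks_threshold=0.2):
    """Greedy redundancy removal: visit features in descending F order and
    keep one only if its empirical distribution differs (sup-distance over
    a pooled grid) from every feature kept so far."""
    def ks_distance(xs, ys):
        cdf = lambda s, t: sum(1 for v in s if v <= t) / len(s)
        return max(abs(cdf(xs, t) - cdf(ys, t)) for t in sorted(set(xs) | set(ys)))

    kept = []
    for name in sorted(feature_samples, key=lambda n: -f_scores[n]):
        if all(ks_distance(feature_samples[name], feature_samples[k]) > ks_threshold
               for k in kept):
            kept.append(name)
    return kept

features = {
    "a": [0.0, 0.1, 0.2, 0.3, 0.4],
    "b": [0.0, 0.1, 0.2, 0.3, 0.4],   # redundant: same distribution as "a"
    "c": [5.0, 5.1, 5.2, 5.3, 5.4],   # clearly different distribution
}
f_scores = {"a": 3.0, "b": 1.0, "c": 2.0}
```

Feature "b" is discarded because it duplicates the higher-F feature "a", matching the quoted rationale that when distributions are equal "the second feature is always discarded because it has a smaller F value."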
Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: adding a first [[one]] predictive feature of the plurality of predictive features to the differential feature set in response to determining that the first predictive feature produces a first distribution corresponding to the set of training entries that is different from a second distribution corresponding to the set testing entries. (Hassani, Page 595, Section 2.2, ¶[3]: “The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: D_{i,i+h} = sup_x |F_{ε,i+h}^{m1}(x) − F_{ε,i+h}^{m2}(x)|, where F_{ε,i+h}^{m1} and F_{ε,i+h}^{m2} denote the empirical c.d.f.’s for the forecast errors from two different models.”, Page 596, ¶[2]: “Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f’s, the two-sided KSPA test will test the hypothesis: H0: F_{ε,i+h}^{m1}(x) = F_{ε,i+h}^{m2}(x) for all x, versus H1: F_{ε,i+h}^{m1}(x) ≠ F_{ε,i+h}^{m2}(x) for some x. Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.”) [Examiner’s note: The null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set), while the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set).
The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 13, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of Claim 12 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: adding a second [[one]] predictive feature of the plurality of predictive features to the consistent feature set in response to determining that the second predictive feature produces a third distribution corresponding to the set of training entries that is equivalent to a fourth distribution corresponding to the set testing entries. (Hassani, Page 595, Section 2.2, ¶[3]: “The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: D_{i,i+h} = sup_x |F_{ε,i+h}^{m1}(x) − F_{ε,i+h}^{m2}(x)|, where F_{ε,i+h}^{m1} and F_{ε,i+h}^{m2} denote the empirical c.d.f.’s for the forecast errors from two different models.”, Page 596, ¶[2]: “Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f’s, the two-sided KSPA test will test the hypothesis: H0: F_{ε,i+h}^{m1}(x) = F_{ε,i+h}^{m2}(x) for all x, versus H1: F_{ε,i+h}^{m1}(x) ≠ F_{ε,i+h}^{m2}(x) for some x. Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.”) [Examiner’s note: The
null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set), while the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set). The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 15, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of Claim 9 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: selecting a training dataset comprising a plurality of features and a target, wherein the training dataset comprises the set of training entries; (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) for each of the plurality of features: selecting one of the plurality of features; (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col.
1, Section II.A.1) performing a statistical test on the selected feature to determine whether the selected feature is statistically important to the target; and (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1), ¶[1-2]: “Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variable M and N can be calculated from the following steps. Random variables outcomes x are discretized, separately, into k bins [xi, xi+1], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. KS statistic can be calculated from (1): KS(M, N) = sqrt(nM·nN/(nM + nN)) · max_k |Mk − Nk| (1). In (1), nM and nN are the number of elements in M and N. Also Mk and Nk refer to cumulative probability distributions of M and N in kth bin, respectively. In each statistical significance level of α, if KS (M,N) < λα, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset.
Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] adding the selected feature to the plurality of predictive features based on the statistical test. (Hassani, Page 595, Section 2.2, ¶[3]: “The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: D_{i,i+h} = sup_x |F_{ε,i+h}^{m1}(x) − F_{ε,i+h}^{m2}(x)|, where F_{ε,i+h}^{m1} and F_{ε,i+h}^{m2} denote the empirical c.d.f.’s for the forecast errors from two different models.”, Page 596, ¶[2]: “Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f’s, the two-sided KSPA test will test the hypothesis: H0: F_{ε,i+h}^{m1}(x) = F_{ε,i+h}^{m2}(x) for all x, versus H1: F_{ε,i+h}^{m1}(x) ≠ F_{ε,i+h}^{m2}(x) for some x. Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.”) [Examiner’s note: The null hypothesis H0 states that both model sets share an identical distribution (i.e., consistent set), while the alternate hypothesis H1 states that they do not share the same distribution (i.e., differential set).
The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 16, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of Claim 15 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: wherein the operations further comprises: selecting a testing dataset comprising the plurality of features and is devoid of the target, wherein the testing dataset comprises the set of testing entries; (Haws, Col. 1, Lines 53-58: “The feature selection module is configured to perform a method. The method includes receiving a set of training samples and a set of test samples. The set of training samples includes a first set of features and a class value. The set of test samples includes the set of features absent the class value.”) [Examiner’s note: the testing dataset is devoid of the target e.g., the set of test samples includes the set of features absent the class value] selecting one of the plurality of predictive features; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”) combining at least one training entry of the set of training entries corresponding to the selected predictive feature with at least one testing entry of the set of testing entries corresponding to the selected feature into a selected dataset [[one]] of the plurality of datasets; and (Hosseini, Page 3, Col. 2, Section III, ¶[2]: “In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. 
Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain. Then, a classification was performed where all the features were considered and the training and testing data were scaled according to the training data.”) [Examiner’s note: the highlights indicate that the features from both training set and testing set are extracted] adding a data source variable to the selected dataset that indicates a dataset source of each of the entries in the selected dataset. (Hassani, Page 595, Section 2.2, ¶[1]: “Let us begin by defining forecast errors. Suppose we have a real valued, non zero time series YN = (y1, … yt, … yN) of sufficient length N. YN is divided into two parts, i.e., training set and test set such that Y1 = (y1, …, yt) represents the training set and Y2 = (yt+1,…, yN) represents the test set”) [Examiner’s note: “data source variable” is being interpreted as Y1 = (y1, …, yt) and Y2 = (yt+1, …, yN)] Regarding claim 17, Hosseini explicitly discloses: performing, on a plurality of datasets, a plurality of distribution tests, wherein (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1), ¶[1-2]: “Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. 
If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variables M and N can be calculated from the following steps. Random variable outcomes x are discretized, separately, into k bins [x_i, x_i+1], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. The KS statistic can be calculated from (1): KS(M, N) = sqrt(n_M · n_N / (n_M + n_N)) · max_k | M_k − N_k |  (1). In (1), n_M and n_N are the number of elements in M and N. Also M_k and N_k refer to the cumulative probability distributions of M and N in the kth bin, respectively. In each statistical significance level of α, if KS(M, N) < λ_α, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset. Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] each dataset of the plurality of datasets comprises: (i) a set of training entries corresponding to a single predictive feature of the plurality of predictive features, (ii) a set of testing entries corresponding to the single predictive feature, and (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). 
Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) generating a final feature set based on the differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them. In order to evaluate the similarity of the features, KS-test can be used and the evaluation can be made through a forward comparison”, Page 3, Col. 1, Section B, ¶[3-4]: “To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section… The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions. In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) [Examiner’s note: Hosseini discloses evaluating the similarity of the features by using the KS test then sorting them to form the optimal final feature set, which aligns with the concept of partitioning the features into a differential set (e.g., a set of features which do not share similarity) and a consistent set (e.g., a set of features which share similarity)] building, by one or more processors, a machine learning model using the final feature set; (Hosseini, Pg. 1, Col. 1, Section I, ¶[1]: “The statistical measures which model these relations are known as features”, Col. 
2, ¶[1]: “Kodovský [6] introduced rich models which were combination of smaller sub-models made to capture inter-block and intra-block relationships, from both the image and calibrated one. The final feature set included 22,510 features.” Hosseini fails to disclose: a processor set; one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media to cause the processor set to perform operations comprising: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; the set of testing entries are devoid of a target; and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries partitioning the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries However, Pan explicitly discloses: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (Pan, Abstract: “With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data.”, Pg. 2, Figure 2 [figure image], Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset. Classification performance better than random guess indicates the different feature distributions between the training and test datasets.”) (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; (Pan, Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset”, Pg. 3, Col. 1, ¶[1]: “We start with a labeled training dataset {(y_train, X_train)} ∈ R × R^d, and an unlabeled test dataset {X_test} ∈ R^d with an unknown conditional probability P_{y|X}. Then, we train an adversarial classifier that predicts P({train, test}|X) to separate train and test, and generate the propensity score p_propensity = P(test|X) on both X_test and X_train.”) [Examiner’s note: Pan’s adversarial classifier is trained to predict whether a sample is from the training or testing dataset, which is effectively a source label.] and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries (Pan, Pg. 3, Col. 1, ¶[2]: “The feature importance and propensity score from the adversarial classifier can be used to detect concept drift between the training and test data, and provide insights on the cause of the concept drift such as which features and subsamples in the training data are most different from ones in the test data.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Pan. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems. 
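Pan’s adversarial-validation mechanism can be illustrated with a brief sketch. This is a generic illustration under stated assumptions (synthetic Gaussian data, a scikit-learn random forest); it is not code from Pan, from any other cited reference, or from the application.

```python
# Sketch of adversarial validation: label each row with its dataset of
# origin, train a binary classifier to separate train from test, and read
# a held-out AUC well above 0.5 as evidence of distribution drift.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 3))  # original training features
X_test = rng.normal(0.8, 1.0, size=(500, 3))   # shifted "new" data (drift)

# Data-source variable: 0 = training entry, 1 = testing entry
X = np.vstack([X_train, X_test])
source = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

Xa, Xb, ya, yb = train_test_split(X, source, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xa, ya)
auc = roc_auc_score(yb, clf.predict_proba(Xb)[:, 1])
print(auc)
```

An AUC near 0.5 would instead indicate that training and testing entries are statistically indistinguishable, i.e., no detectable drift.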
One of ordinary skill would have motivation to combine Hosseini and Pan in order to select, retain or remove features based not only on predictive usefulness but also on whether those features remain stable across datasets, thereby reducing selection bias, improving robustness to drift, and producing a model that generalizes better to unseen test data. However, Rajagopal explicitly discloses: and (iii) a data source variable indicating a type of each entry of the set of training entries and the set of testing entries; (Rajagopal, Pg. 19729, Table 2 [table image]) wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries (Rajagopal, Pg. 19732, Col. 1, ¶[5]: “While building machine learning models, it often becomes imperative to compare the performance of classifiers and the best way to achieve this is to perform statistical significance tests… In this context, the null hypothesis (H0) suggests that there is no performance difference among classifiers whereas an alternate hypothesis (H1) indicates that at least one classifier performs differently. Suppose, ‘d’ refers to the number of datasets and ‘k’ signifies the number of classifiers, the Friedman test statistic can be calculated as shown in equation (7): χ²_F = (12d / (k(k+1))) · [ Σ_{j=1}^{k} R_j² − k(k+1)² / 4 ]  (7)”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Rajagopal. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification. One of ordinary skill would have motivation to combine Hosseini and Rajagopal because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” – choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. However, Haws explicitly discloses: a processor set; (Haws, Col. 1, Lines 49-51: “The information processing system includes a memory and a processor that is communicatively coupled to the memory”) one or more computer-readable storage media; and (Haws, Col. 1, Lines 49-51: “The information processing system includes a memory and a processor that is communicatively coupled to the memory”) program instructions stored [[in]] on the one or more computer-readable storage media to cause the processor set to perform operations comprising: (Haws, Col. 7, Lines 33-37: “In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Haws. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. 
One of ordinary skill would have motivation to combine Hosseini and Haws because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” – choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. However, Hassani explicitly discloses: partitioning the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, (Hassani, Page 594, ¶[2]: “Next, we introduce the hypotheses which are relevant for the proposed KSPA test. Let us begin by presenting the hypothesis for the two-sided KS test. Let X and Y be two random variables with c.d.f.’s F_X and F_Y, respectively. Then, a two sample, two-sided KS test will test the hypothesis that both c.d.f.’s have an identical distribution, and the resulting null and alternate hypothesis can be expressed as: H0: F_X(x) = F_Y(x) for all x; H1: F_X(x) ≠ F_Y(x) for at least one x  (6). In simple terms, the null hypothesis in Equation (6) states that both X and Y share an identical distribution whilst the alternate hypothesis states that X and Y do not share the same distribution.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Hassani. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. 
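Hassani’s two-sided hypothesis in Equation (6) is the standard two-sample Kolmogorov-Smirnov test. A minimal sketch using SciPy’s `ks_2samp` on synthetic samples (the data and significance threshold here are hypothetical, for illustration only):

```python
# Two-sided, two-sample KS test: H0 says F_X and F_Y are identical;
# H1 says they differ at some point. Synthetic normal samples, one shifted.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=1000)  # draws from F_X
y = rng.normal(loc=1.0, scale=1.0, size=1000)  # draws from a shifted F_Y

stat, p_value = ks_2samp(x, y)   # stat = sup over x of |F_X(x) - F_Y(x)|
alpha = 0.05
reject_h0 = p_value < alpha      # True here: the distributions differ
print(stat, reject_h0)
```

Failing to reject H0 at the chosen α corresponds to the outcome the examiner maps to the “consistent” set; rejecting it corresponds to the “differential” set.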
One of ordinary skill would have motivation to combine Hosseini and Hassani in order to measure how different the distributions of a feature are across groups. A feature that behaves very differently between two classes is likely to be useful for prediction. This is especially helpful in real-world situations where data does not follow neat patterns: because the KS test does not assume any particular distribution, it remains effective regardless of how irregular the data is. Regarding claim 18, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 17 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: wherein the operations further comprise: identifying a lead feature from the consistent feature set; (Hosseini, Page 2, Col. 1, ¶[3]: “in the proposed method, F is used just for ranking features. Afterwards, a strategy based on Kolmogorov-Smirnov (KS) test [16] is used in order to remove redundant features.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability.”, Col. 2, Fig. 2 [figure image]) [Examiner’s note: Fig. 2 discloses the features are ranked based on the F statistical measure in descending order, with the most important feature placed first, which aligns with the concept of identifying the lead feature.] adjusting the differential feature set based on a correlation of each feature in the differential feature set to the lead feature; and (Hosseini, Page 3, Col. 1, Section B, ¶[3]: “Removing redundant features during forward comparison leads to sensitiveness of the algorithm to the ordering of the input features. To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section. In KS-CBF1 algorithm [19] which also uses KS for removing redundant features and is similar to the suggested algorithm, the features are reordered using a ranking system based on SUC2.”, Col. 2, Fig. 2 [figure image]) [Examiner’s note: The highlight indicates that the features that pass the KS test (i.e., have a significantly different distribution) remain in the differential set, while features that fail the KS test (i.e., have a nearly identical distribution to a higher-ranked feature) are removed. This means the algorithm in Fig. 2 dynamically adjusts the differential set based on how similar each feature is to the lead feature] generating the final feature set based on the adjusted differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col. 2, ¶[2]: “The result of comparing features is reduction of redundant features, which can lead to reduction of features that should be extracted from the image. Furthermore, the dimensionality of feature vectors will be decreased and as the result, classification complexity will be reduced.”) [Examiner’s note: the final feature set is generated after removing the redundant features (i.e., adjusting the differential feature set), which results in a reduction of features.] 
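Read generically, the partitioning mapped onto these references amounts to one distribution test per predictive feature, comparing that feature’s training entries against its testing entries. A hedged sketch of that reading (synthetic data, hypothetical feature names, with SciPy’s two-sample KS test standing in for the claimed distribution test):

```python
# Illustrative sketch: run one two-sample KS test per predictive feature,
# placing each feature in a differential set (train/test distributions
# differ) or a consistent set (statistically consistent across datasets).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_entries = {"f_stable": rng.normal(0, 1, 400),
                 "f_drifted": rng.normal(0, 1, 400)}
test_entries = {"f_stable": rng.normal(0, 1, 400),
                "f_drifted": rng.normal(1.5, 1, 400)}  # shifted in test data

differential, consistent = [], []
for feature in train_entries:
    stat, p_value = ks_2samp(train_entries[feature], test_entries[feature])
    if p_value < 0.05:   # reject H0: the feature's distribution shifted
        differential.append(feature)
    else:                # fail to reject H0: feature is consistent
        consistent.append(feature)

print(sorted(differential), sorted(consistent))
```

Features landing in the differential set are drift candidates; per the claim language, the final feature set can then be generated from the adjusted differential set together with the consistent set.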
Regarding claim 20, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 17 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: adding a first [[one]] predictive feature of the plurality of predictive features to the differential feature set in response to determining that the first predictive feature produces a first distribution corresponding to the set of training entries that is different from a second distribution corresponding to the set testing entries. (Hassani, Page 595, Section 2.2, ¶[3]: “The forecast errors in (13) or (14) are inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of forecasts from models m1 and m2. As the requirement is to test the distribution between two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as: D_{i,i+h} = sup_x | F_{ε_{i+h}^{m1}}(x) − F_{ε_{i+h}^{m2}}(x) |, where F_{ε_{i+h}^{m1}} and F_{ε_{i+h}^{m2}} denote the empirical c.d.f.’s for the forecast errors from two different models.”, Page 596, ¶[2]: “Accordingly, in terms of forecast errors, the two-sided KSPA test hypothesis can be approximately represented as follows, where ε_{i+h}^{m1} and ε_{i+h}^{m2} are the absolute or squared forecast errors from two forecasting models m1 and m2 with unknown continuous empirical c.d.f.’s, the two-sided KSPA test will test the hypothesis: H0: F_{ε_{i+h}^{m1}}(x) = F_{ε_{i+h}^{m2}}(x) for all x; H1: F_{ε_{i+h}^{m1}}(x) ≠ F_{ε_{i+h}^{m2}}(x) for at least one x. Then, if the observed significance value of the two-sample two-sided KSPA test statistic D_{i,i+h} is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate, which is that the forecast errors ε_{i+h}^{m1} and ε_{i+h}^{m2} do not share the same distribution.”) [Examiner’s note: The null hypothesis H0 states that both models’ forecast errors share an identical distribution (i.e., the consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., the differential set). The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 21, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 20 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: wherein the operations further comprise: adding a second [[one]] predictive feature of the plurality of predictive features to the consistent feature set in response to determining that the second predictive feature produces a third distribution corresponding to the set of training entries that is equivalent to a fourth distribution corresponding to the set testing entries. (Hassani, Page 595, Section 2.2, ¶[3], and Page 596, ¶[2], as reproduced in the rejection of claim 20 above) [Examiner’s note: The null hypothesis H0 states that both models’ forecast errors share an identical distribution (i.e., the consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., the differential set). The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 23, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 17 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: selecting a training dataset comprising a plurality of features and a target, wherein the training dataset comprises the set of training entries; (Hosseini, Page 3, Col. 
2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) for each of the plurality of features: selecting one of the plurality of features; (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1) performing a statistical test on the selected feature to determine whether the selected feature is statistically important to the target; and (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1), ¶[1-2]: “Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. 
In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variables M and N can be calculated from the following steps. Random variable outcomes x are discretized, separately, into k bins [x_i, x_i+1], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated. The KS statistic can be calculated from (1): KS(M, N) = sqrt(n_M · n_N / (n_M + n_N)) · max_k | M_k − N_k |  (1). In (1), n_M and n_N are the number of elements in M and N. Also M_k and N_k refer to the cumulative probability distributions of M and N in the kth bin, respectively. In each statistical significance level of α, if KS(M, N) < λ_α, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset. Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] adding the selected feature to the plurality of predictive features based on the statistical test. (Hassani, Page 595, Section 2.2, ¶[3], and Page 596, ¶[2], as reproduced in the rejection of claim 20 above) [Examiner’s note: The null hypothesis H0 states that both models’ forecast errors share an identical distribution (i.e., the consistent set); the alternate hypothesis H1 states that they do not share the same distribution (i.e., the differential set). The “two forecasting models m1 and m2” is being interpreted as the “training dataset” and “testing dataset”] Regarding claim 24, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 23 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: selecting a testing dataset comprising the plurality of features and is devoid of the target, wherein the testing dataset comprises the set of testing entries; (Haws, Col. 1, Lines 53-58: “The feature selection module is configured to perform a method. The method includes receiving a set of training samples and a set of test samples. 
The set of training samples includes a first set of features and a class value. The set of test samples includes the set of features absent the class value.”) [Examiner’s note: the testing dataset is devoid of the target e.g., the set of test samples includes the set of features absent the class value] selecting one of the plurality of predictive features; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”) combining at least one training entry of the set of training entries corresponding to the selected predictive feature with at least one testing entry of the set of testing entries corresponding to the selected feature into a selected dataset [[one]] of the plurality of datasets; and (Hosseini, Page 3, Col. 2, Section III, ¶[2]: “In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain. Then, a classification was performed where all the features were considered and the training and testing data were scaled according to the training data.”) [Examiner’s note: the highlights indicate that the features from both training set and testing set are extracted] adding a data source variable to the selected dataset that indicates a dataset source of each of the entries in the selected dataset. (Hassani, Page 595, Section 2.2, ¶[1]: “Let us begin by defining forecast errors. Suppose we have a real valued, non zero time series YN = (y1, … yt, … yN) of sufficient length N. 
YN is divided into two parts, i.e., training set and test set such that Y1 = (y1, …, yt) represents the training set and Y2 = (yt+1,…, yN) represents the test set”) [Examiner’s note: “data source variable” is being interpreted as Y1 = (y1, …, yt) and Y2 = (yt+1, …, yN)] Regarding claim 26, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 1 (as shown in the rejections above) Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: (New) The computer-implemented method of claim 1, further comprising: generating, by one or more processors, the plurality of predictive features by performing filter-based feature selection on a training dataset. (Pan, Pg. 3, Section 3.1, ¶[1-2]: “However, if the adversarial classifier can distinguish between training and test data well (i.e. AUC score ≫ 50%), the top features from the adversarial classifier are potential candidates exhibiting concept drift between the train and test data. We can then exclude these features from model training. Such feature selection can be automated by determining the number of features to exclude based on the performance of adversarial classifier (e.g. AUC score) and raw feature importance values (e.g. mean decrease impurity (MDI) in Decision Trees) as follows:”) Regarding claim 27, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 1 (as shown in the rejections above) Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: (New) The computer-implemented method of claim 1, wherein the variable indicates whether the respective individual entry is of a selection from the group consisting of: the set of testing entries and the set of training entries. (Pan, Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset”, Pg. 3, Col. 
1, ¶[1]: “We start with a labeled training dataset {(ytrain, Xtrain)} ∈ R × Rd , and an unlabeled test dataset {Xtest } ∈ Rd with an unknown conditional probability Py |X . Then, we train an adversarial classifier that predicts P({train, test }|X) to separate train and test, and generate the propensity score ppropensity = P(test |X) on both Xtest and Xtrain.”) [Examiner’s note: Pan’s adversarial classifier is trained to predict whether a sample is from training or testing dataset, which is effectively a source label.] Regarding claim 28, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of Claim 1 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: (New) The computer-implemented method of claim 1, wherein the final feature set is devoid of classification model dependencies. (Pan, Pg. 3, Fig. 3) [Examiner’s note: Fig. 3 of Pan discloses using adversarial classifier to select a final feature set for actual training, so the final selected feature set is not dependent on the actual model training’s performance] Claim(s) 3 is rejected under 35 U.S.C. 103 as being unpatentable over Hosseini & Mahdavi (“F Plus KS: A new Feature Selection Strategy for Steganalysis”) (hereafter referred to as “Hosseini”) in view of Hassani & Silva (“A Kolmogorov-Smirnov Based Test for Comparing the Predictive Accuracy of Two Sets of Forecasts”) (hereafter referred to as “Hassani”), Haws et al. (US 9,483,739 B2) (hereafter referred to as “Haws”), Rajagopal et al. (“Towards Effective Network Intrusion Detection: From Concept to Creation on Azure Cloud”) (hereafter referred to as “Rajagopal”), Pan et al.
(“Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber”) (hereafter referred to as “Pan”) and further in view of Yu & Liu (“Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution”) (hereafter referred to as “Yu”). Regarding claim 3, the combination of Hosseini, Haws, Rajagopal, Hassani and Pan discloses all the limitations of claim 2 (as shown in the rejection above). Hosseini in view of Hassani, Haws, Rajagopal and Pan further discloses: wherein the adjusting the differential feature set further comprises: selecting, by one or more processors, one of the features in the differential feature set; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions.
In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) Hosseini in view of Hassani, Haws and Rajagopal fails to disclose: computing, by one or more processors, a correlation range of the set of training entities based on a correlation between a set of first data values of the set of training entities and the lead feature; computing, by one or more processors, a correlation value of the set of testing entities based on a correlation between a set of second data values of the set of testing entities and the lead feature; and removing, by one or more processors, the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. However, Yu explicitly discloses: computing, by one or more processors, a correlation range of the set of training entities based on a correlation between a set of first data values of the set of training entities and the lead feature; (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi.
Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of training entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] computing, by one or more processors, a correlation value of the set of testing entities based on a correlation between a set of second data values of the set of testing entities and the lead feature; and (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of testing entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] removing, by one or more processors, the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. (Yu, Page 4, Col.
1, Section 4.1, ¶[2]: “Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀ Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, Definition 2: “A feature is predominant to the class, iff its correlation to the class is predominant or can become predominant after removing its redundant peers.”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: “the correlation range” is being interpreted as the “threshold SU value δ”. Feature Fj will be removed when its SU value is larger than (i.e., outside the correlation range) the SU value of feature Fi] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini, Hassani, Haws, Rajagopal and Yu. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification. Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. Yu teaches a correlation-based feature selection method.
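As context for the Yu passages quoted above, the fast correlation-based filter they describe works in two steps: keep only features whose symmetric uncertainty (SU) with the class meets a threshold δ, sorted by SU descending, then drop any candidate that is more strongly correlated with an already-kept feature than with the class (a "redundant peer"). The following is an illustrative sketch on synthetic data, not code from any cited reference; the feature names, sample size, and δ value are assumptions:

```python
import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy (base 2) of a discrete sequence.
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def sym_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), per Yu's definition.
    hx, hy = entropy(x), entropy(y)
    ig = hx + hy - entropy(list(zip(x, y)))  # information gain = mutual information
    return 0.0 if hx + hy == 0 else 2 * ig / (hx + hy)

def fcbf(features, cls, delta):
    # 1) keep features with SU(F, C) >= delta, sorted by SU descending;
    # 2) drop any candidate Fj that is a "redundant peer" of a kept feature,
    #    i.e. SU(Fj, Fi) >= SU(Fj, C) for some already-kept Fi.
    su_c = {name: sym_uncertainty(col, cls) for name, col in features.items()}
    candidates = sorted((n for n in su_c if su_c[n] >= delta),
                        key=su_c.get, reverse=True)
    kept = []
    for name in candidates:
        if all(sym_uncertainty(features[name], features[k]) < su_c[name]
               for k in kept):
            kept.append(name)
    return kept

rng = np.random.default_rng(0)
cls = rng.integers(0, 2, 200)
features = {
    "f1": cls.copy(),                # perfectly predictive
    "f2": cls.copy(),                # exact duplicate of f1 (redundant peer)
    "f3": rng.integers(0, 2, 200),   # independent noise
}
selected = fcbf(features, cls, delta=0.2)
print(selected)
```

Here f2, an exact duplicate of f1, is removed as a redundant peer, while the uninformative f3 falls below δ and never becomes a candidate.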
One of ordinary skill would have motivation to combine Hosseini, Hassani, Haws, Rajagopal and Yu in order to identify redundancy and keep the model focused on the most valuable information. If two features are highly correlated, they convey essentially the same information to the model. Including both does not add value; it merely clutters the model, increases complexity, and can even lead to overfitting. By computing the correlation range, one can spot these overlaps early and remove redundant features. Claim(s) 11, 19 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Hosseini & Mahdavi (“F Plus KS: A new Feature Selection Strategy for Steganalysis”) (hereafter referred to as “Hosseini”) in view of Hassani & Silva (“A Kolmogorov-Smirnov Based Test for Comparing the Predictive Accuracy of Two Sets of Forecasts”) (hereafter referred to as “Hassani”), Haws et al. (US 9,483,739 B2) (hereafter referred to as “Haws”), Rajagopal et al. (“Towards Effective Network Intrusion Detection: From Concept to Creation on Azure Cloud”) (hereafter referred to as “Rajagopal”), Pan et al. (“Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber”) (hereafter referred to as “Pan”) and further in view of Yu & Liu (“Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution”) (hereafter referred to as “Yu”). Regarding claim 11, the combination of Hosseini, Hassani, Haws and Pan discloses all the limitations of claim 10 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: selecting one of the features in the differential feature set; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col.
1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions. In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) Hosseini in view of Hassani, Rajagopal, Haws and Pan fails to disclose: computing a correlation range of the set of training entries based on a correlation between a set of first data values of the set of training entries and the lead feature; computing a correlation value of the set of testing entries based on a correlation between a set of second data values of the set of testing entries and the lead feature; and removing the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. However, Yu explicitly discloses: computing a correlation range of the set of training entries based on a correlation between a set of first data values of the set of training entries and the lead feature; (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi.
Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of training entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] computing a correlation value of the set of testing entries based on a correlation between a set of second data values of the set of testing entries and the lead feature; and (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of testing entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] removing the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. (Yu, Page 4, Col.
1, Section 4.1, ¶[2]: “Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀ Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, Definition 2: “A feature is predominant to the class, iff its correlation to the class is predominant or can become predominant after removing its redundant peers.”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: “the correlation range” is being interpreted as the “threshold SU value δ”. Feature Fj will be removed when its SU value is larger than (i.e., outside the correlation range) the SU value of feature Fi] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini, Hassani, Rajagopal, Haws, Pan and Yu. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification. Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems.
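The F-plus-KS strategy Hosseini is cited for above (rank features by their F statistic in descending order, then use a two-sample KS test to discard the lower-ranked member of any pair whose distributions cannot be distinguished) can be sketched as follows. This is a minimal illustration on synthetic data, not code from the reference; the asymptotic critical-value constant 1.358 (α = 0.05) and the toy features are assumptions:

```python
import numpy as np

def f_score(a, b):
    # One-way ANOVA F statistic for two groups; used only to rank features.
    grand = np.concatenate([a, b]).mean()
    between = len(a) * (a.mean() - grand) ** 2 + len(b) * (b.mean() - grand) ** 2
    within = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    return between / (within / (len(a) + len(b) - 2))

def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def f_plus_ks_select(X, y, c_alpha=1.358):
    # Rank features by F (descending); keep a feature only if the KS test
    # finds its distribution different from every feature kept so far,
    # otherwise discard it as redundant (the lower-F member of the pair).
    n = X.shape[0]
    crit = c_alpha * np.sqrt(2 * n / (n * n))  # asymptotic KS critical value
    f_vals = [f_score(X[y == 0, j], X[y == 1, j]) for j in range(X.shape[1])]
    kept = []
    for j in np.argsort(f_vals)[::-1]:
        if all(ks_stat(X[:, j], X[:, k]) > crit for k in kept):
            kept.append(int(j))
    return kept

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
base = np.where(y == 0, 0.0, 2.0) + rng.normal(0, 1, 200)
X = np.column_stack([
    base,                              # discriminative feature
    base + rng.normal(0, 0.01, 200),   # near-duplicate: same distribution
    rng.normal(0, 1, 200),             # uninformative but distributionally distinct
])
kept = f_plus_ks_select(X, y)
print(kept)
```

The near-duplicate of the first feature fails the KS difference test against it and is discarded, while the distributionally distinct third feature survives despite its low F rank.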
Yu teaches a correlation-based feature selection method. One of ordinary skill would have motivation to combine Hosseini, Hassani, Rajagopal, Haws, Pan and Yu in order to identify redundancy and keep the model focused on the most valuable information. If two features are highly correlated, they convey essentially the same information to the model. Including both does not add value; it merely clutters the model, increases complexity, and can even lead to overfitting. By computing the correlation range, one can spot these overlaps early and remove redundant features. Regarding claim 19, the combination of Hosseini, Hassani, Rajagopal, Haws and Pan discloses all the limitations of claim 18 (as shown in the rejections above). Hosseini in view of Hassani, Rajagopal, Haws and Pan further discloses: selecting one of the features in the differential feature set; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions.
In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) Hosseini in view of Hassani, Rajagopal, Haws and Pan fails to disclose: computing a correlation range of the set of training entries based on a correlation between a set of first data values of the set of training entries and the lead feature; computing a correlation value of the set of testing entries based on a correlation between a set of second data values of the set of testing entries and the lead feature; and removing the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. However, Yu explicitly discloses: computing a correlation range of the set of training entries based on a correlation between a set of first data values of the set of training entries and the lead feature; (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi.
Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of training entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] computing a correlation value of the set of testing entries based on a correlation between a set of second data values of the set of testing entries and the lead feature; and (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of testing entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict] removing the selected differential feature from the differential feature set in response to determining that the correlation value is outside the correlation range. (Yu, Page 4, Col.
1, Section 4.1, ¶[2]: “Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀ Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, Definition 2: “A feature is predominant to the class, iff its correlation to the class is predominant or can become predominant after removing its redundant peers.”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj|Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj|Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: “the correlation range” is being interpreted as the “threshold SU value δ”. Feature Fj will be removed when its SU value is larger than (i.e., outside the correlation range) the SU value of feature Fi] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini, Hassani, Rajagopal, Haws, Pan and Yu. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification. Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems.
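Pan's adversarial-validation approach, relied on throughout these rejections, amounts to labeling every row with its data source (0 = train, 1 = test), training a classifier to predict that label, and reading drift off the classifier's AUC and feature weights. The following is a minimal sketch, assuming a plain logistic regression trained by gradient descent rather than Pan's actual model, with synthetic drifted data as an illustration:

```python
import numpy as np

def auc(scores, labels):
    # Rank-based AUC: probability a random positive outranks a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def adversarial_validation(X_train, X_test, steps=500, lr=0.1):
    # Stack both sets and attach a source label: 0 = train, 1 = test.
    X = np.vstack([X_train, X_test])
    src = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # standardize
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):  # plain logistic regression via gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - src
        w -= lr * X.T @ grad / len(src)
        b -= lr * grad.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # AUC near 0.5: train and test are indistinguishable (no drift).
    # AUC well above 0.5: drift; the largest |w| point at the drifting features.
    return auc(p, src), np.argsort(np.abs(w))[::-1]

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 2))
X_test = rng.normal(0, 1, (200, 2))
X_test[:, 0] += 2.0                     # concept drift injected into feature 0
score, drift_ranking = adversarial_validation(X_train, X_test)
print(round(score, 3), drift_ranking)
```

An AUC near 0.5 would mean the two sets look alike; here the shifted first feature pushes the AUC well above 0.5 and dominates the weight ranking, which is the signal Pan uses to decide which features to exclude.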
Yu teaches a correlation-based feature selection method. One of ordinary skill would have motivation to combine Hosseini, Hassani, Rajagopal, Haws, Pan and Yu in order to identify redundancy and keep the model focused on the most valuable information. If two features are highly correlated, they convey essentially the same information to the model. Including both does not add value; it merely clutters the model, increases complexity, and can even lead to overfitting. By computing the correlation range, one can spot these overlaps early and remove redundant features. Regarding claim 25, Hosseini explicitly discloses: performing, by one or more processors, on a plurality of datasets, a plurality of distribution tests, wherein (Hosseini, Page 1, Col. 1, Section I, ¶[2]: “In steganalysis, the digital object that does not contain embedded data is named cover and the one that contains embedded data is named stego. In order to make a decision about suspect objects, a classifier is trained with the features extracted from sufficient number of covers and stegos [1]; Then, the classifier predicts on the suspect objects and decides whether each of them contains hidden data or not.”, Page 2, Col. 1, Section II.A.1), ¶[1-2]: “Kolmogorov-Smirnov (KS) is a statistical test evaluating the hypothesis that whether two independent sample sets represent two different population or not [16]. In KS-test, the maximum difference between cumulative probability distributions related to the two random variables is calculated. If there is a significant difference at any point along the two cumulative probability distributions, it can be concluded that the two sample sets belong to two different populations. The KS value between the two random variable M and N can be calculated from the following steps. Random variables outcomes x are discretized, separately, into k bins [xi, xi+1], i = 1 … k. Occurrence probabilities in each bin are estimated. Cumulative probability distributions are calculated.
KS statistic can be calculated from (1). [Equation (1) image not reproduced.] In (1), nM and nN are the number of elements in M and N. Also Mk and Nk refer to cumulative probability distributions of M and N in kth bin, respectively. In each statistical significance level of α, if KS (M,N) < λα, the two random variables M and N are related to a unique distribution.”) [Examiner’s note: the “plurality of datasets” is being interpreted as the “cover” dataset and the “stegos” dataset. Because the features in these datasets are used to train a classifier to “predict suspect objects”, they are being interpreted as the “predictive features”. The “plurality of distribution tests” is being interpreted as the “cumulative probability distributions” and the “Kolmogorov-Smirnov statistical test”] each dataset of the plurality of datasets comprises: (i) a set of training entries corresponding to a single predictive feature of the plurality of predictive features, (ii) a set of testing entries corresponding to the single predictive feature, and (Hosseini, Page 3, Col. 2, Section III, ¶[1-2]: “Quantity of 8,000 images were used as the cover images of training set and 2,000 remaining images were used as the cover images of testing set… In the first experiment, all the images in both training and testing sets were embedded by our LSB Matching simulator, with relative payload of 0.3bpp (bit per pixel). Hence, the training set included 8,000 covers and 8,000 stegos and the testing set included 2,000 covers and 2,000 stegos. Afterwards, 686 features of SPAM [3] were extracted from the covers and stegos of both sets, specially designed for detecting steganography in the spatial domain.”) identifying, by one or more processors, a lead feature from the consistent feature set; (Hosseini, Page 2, Col. 1, ¶[3]: “in the proposed method, F is used just for ranking features.
Afterwards, a strategy based on Kolmogorov-Smirnov (KS) test [16] is used in order to remove redundant features.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability.”, Col. 2, Fig. 2) [Examiner’s note: Fig. 2 discloses the features are ranked based on F statistical measure in descending order, with the most important feature placed first, which aligns with the concept of identifying the lead feature.] selecting, by one or more processors, one of the features in the differential feature set; (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them.”, Page 3, Col. 1, Section B, ¶[4]: “The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions. In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) in response to determining that the correlation value is outside the correlation range, adjusting, by one or more processors, the differential feature set by removing the selected differential feature from the differential feature set; and (Hosseini, Page 3, Col. 1, Section B, ¶[3]: “Removing redundant features during forward comparison leads to sensitiveness of the algorithm to the ordering of the input features. To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section.
In KS-CBF1 algorithm [19] which also uses KS for removing redundant features and is similar to the suggested algorithm, the features are reordered using a ranking system based on SUC2.”, Col. 2, Fig. 2) [Examiner’s note: The highlight indicates that the features that pass the KS test (i.e., have significantly different distribution) remain in the differential set, while features that fail the KS test (i.e., have nearly identical distribution to a higher-ranked feature) are removed. This means the algorithm in Fig. 2 is dynamically adjusting the differential set based on how similar each feature is to the lead feature] generating, by one or more processors, a final feature set based on the adjusted differential feature set and the consistent feature set. (Hosseini, Page 3, Col. 1, Section B, ¶[1]: “The underlying idea of the proposed method is that some features have equal distributions and to form the final feature set, it is sufficient to select just one of them. In order to evaluate the similarity of the features, KS-test can be used and the evaluation can be made through a forward comparison”, Page 3, Col. 1, Section B, ¶[3-4]: “To resolve this problem and make the results consistent, a sorting on the input features is required. In this paper it is suggested to sort features according to F statistic described in previous section… The proposed algorithm starts by sorting the features according to their F values in descending order; that is because the features having greater F can lead to more separability. In the next step, KS statistical measure is employed to find the features with equal distributions.
In the cases where distributions are equal, the second feature is always discarded because it has a smaller F value.”) [Examiner’s note: Hosseini discloses evaluating the similarity of the features by using KS test then sorting them to form the optimal final feature set, which aligns with the concept of partitioning the features into differential set (e.g., set of features which do not share similarity) and consistent set (e.g., set of features which share similarity)] building, by one or more processors, a machine learning model using the final feature set; (Hosseini, Pg. 1, Col. 1, Section I, ¶[1]: “The statistical measures which model these relations are known as features”, Col. 2, ¶[1]: “Kodovský [6] introduced rich models which were combination of smaller sub-models made to capture inter-block and intra-block relationships, form both the image and calibrated one. The final feature set included 22,510 features.”) Hosseini fails to disclose: each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; the set of testing entries are devoid of a target; and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries partitioning, [by one or more processors], the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test, wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries computing, [by one or more processors], a correlation range of the set of training entities based on a correlation between data values of the set of training entities 
and the lead feature;

computing, [by one or more processors], a correlation value of the set of testing entities based on a correlation between data values of the set of testing entities and the lead feature;

However, Pan explicitly discloses:

each distribution test of the plurality of distribution tests corresponds to a different predictive feature of a plurality of predictive features; (Pan, Abstract: “With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data.”, Pg. 2, Figure 2 [image omitted], Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset. Classification performance better than random guess indicates the different feature distributions between the training and test datasets.”)

(iii) for each individual entry, of the set of training entries and the set of testing entries, a variable indicating a data source of the respective individual entry; (Pan, Pg. 2, Col. 1, ¶[1]: “In adversarial validation, a binary classifier, adversarial classifier, is trained to predict if a sample belongs to the test dataset”, Pg. 3, Col. 1, ¶[1]: “We start with a labeled training dataset {(ytrain, Xtrain)} ∈ R × Rd, and an unlabeled test dataset {Xtest} ∈ Rd with an unknown conditional probability Py|X. Then, we train an adversarial classifier that predicts P({train, test}|X) to separate train and test, and generate the propensity score ppropensity = P(test|X) on both Xtest and Xtrain.”) [Examiner’s note: Pan’s adversarial classifier is trained to predict whether a sample is from the training or testing dataset, which is effectively a source label.]

and each distribution test tests whether a value distribution of the set of training entries is different from that of the set of testing entries (Pan, Pg. 3, Col.
1, ¶[2]: “The feature importance and propensity score from the adversarial classifier can be used to detect concept drift between the training and test data, and provide insights on the cause of the concept drift such as which features and subsamples in the training data are most different from ones in the test data.”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Pan. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Pan teaches an adversarial validation approach to concept drift problems in user targeting automation systems. One of ordinary skill would have been motivated to combine Hosseini and Pan in order to select, retain, or remove features based not only on predictive usefulness but also on whether those features remain stable across datasets, thereby reducing selection bias, improving robustness to drift, and producing a model that generalizes better to unseen test data.

However, Rajagopal explicitly discloses:

and (iii) a data source variable indicating a type of each entry of the set of training entries and the set of testing entries; (Rajagopal, Pg. 19729, Table 2 [image omitted])

wherein the consistent feature set includes features that have statistical consistency between the set of training entries and the set of testing entries (Rajagopal, Pg. 19732, Col. 1, ¶[5]: “While building machine learning models, it often becomes imperative to compare the performance of classifiers and the best way to achieve this is to perform statistical significance tests… In this context, the null hypothesis (H0) suggests that there is no performance difference among classifiers whereas an alternate hypothesis (H1) indicates that at least one classifier performs differently.
Suppose, ‘d’ refers to the number of datasets and ‘k’ signifies the number of classifiers, Friedman test statistic can be calculated as shown in equation (7) [equation image omitted]”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Rajagopal. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Rajagopal teaches a meta-classification approach using decision jungle to perform both binary and multiclass classification. One of ordinary skill would have been motivated to combine Hosseini and Rajagopal because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

However, Hassani explicitly discloses:

partitioning, [by one or more processors], the plurality of predictive features into: (i) a differential feature set and (ii) a consistent feature set, based on [[their]] the corresponding distribution test (Hassani, Page 594, ¶[2]: “Next, we introduce the hypotheses which are relevant for the proposed KSPA test. Let us begin by presenting the hypothesis for the two-sided KS test. Let X and Y be two random variables with c.d.f.’s FX and FY, respectively.
Then, a two sample, two-sided KS test will test the hypothesis that both c.d.f.’s have an identical distribution, and the resulting null and alternate hypothesis can be expressed as: [equation (6) image omitted] In simple terms, the null hypothesis in Equation (6) states that both X and Y share an identical distribution whilst the alternate hypothesis states that X and Y do not share the same distribution.”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Hassani. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Hassani teaches performing a similarity statistical test between two different datasets using the Kolmogorov-Smirnov technique. One of ordinary skill would have been motivated to combine Hosseini and Hassani in order to measure how different the distributions of a feature are across groups. If a feature behaves very differently between two classes, that is a good sign it is useful for making predictions. This is especially helpful in real-world situations where data does not follow neat patterns: since the KS test does not assume any particular distribution, it works well no matter how messy the data gets.

However, Haws explicitly discloses:

the set of testing entries are devoid of a target; (Haws, Col. 1, Lines 53-58: “The feature selection module is configured to perform a method. The method includes receiving a set of training samples and a set of test samples. The set of training samples includes a first set of features and a class value.
The set of test samples includes the set of features absent the class value.”) [Examiner’s note: the testing dataset is devoid of the target, e.g., the set of test samples includes the set of features absent the class value.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Haws. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove). Haws teaches a transductive feature selection method with maximum-relevancy and minimum-redundancy criteria. One of ordinary skill would have been motivated to combine Hosseini and Haws because, by excluding the target outcome from the testing dataset during the feature selection process, the features are chosen based purely on the underlying data patterns, not on knowledge of the final outcome. This leads to a more honest evaluation of the model’s true predictive power and helps create models that are reliable and robust when applied to unseen data.

However, Yu explicitly discloses:

computing, [by one or more processors], a correlation range of the set of training entities based on a correlation between data values of the set of training entities and the lead feature; (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi.
Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj | Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj | Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of training entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict.]

computing, [by one or more processors], a correlation value of the set of testing entities based on a correlation between data values of the set of testing entities and the lead feature; (Yu, Page 4, Col. 1, Section 4.1, ¶[2]: “More specifically, suppose a data set S contains N features and a class C. Let SUi,c denote the SU value that measures the correlation between a feature Fi and the class C (named C-correlation), then a subset S’ of relevant features can be decided by a threshold SU value δ, such that ∀Fi ∈ S’, 1 ≤ i ≤ N, SUi,c ≥ δ”, Page 4, Col. 2, ¶[2]: “The correlation between a feature Fi (Fi ∈ S) and the class C is predominant iff SUi,c ≥ δ, and ∀Fj ∈ S’ (j ≠ i), there exists no Fj such that SUi,j ≥ SUi,c. If there exists such Fj to a feature Fi, we call it a redundant peer to Fi and use SPi to denote the set of all redundant peers for Fi. Given Fi ∈ S’ and SPi (SPi ≠ ∅), we divide SPi into two parts, SPi+ and SPi-, where SPi+ = {Fj | Fj ∈ SPi, SUj,c > SUi,c} and SPi- = {Fj | Fj ∈ SPi, SUj,c ≤ SUi,c}”) [Examiner’s note: the “lead feature” is being interpreted as the predominant feature Fi, and the “set of testing entities” is being interpreted as “class C” because class C represents the target values that we are trying to predict.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosseini and Yu. Hosseini teaches a new feature selection algorithm which utilizes two statistical measures (i.e., KS from the Kolmogorov-Smirnov test and F from F-to-remove).
Yu teaches a correlation-based feature selection method. One of ordinary skill would have been motivated to combine Hosseini and Yu in order to identify redundancy and keep the model focused on the most valuable information. If two features are highly correlated, they are basically telling the model the same thing. Including both does not add value: it just clutters the model, increases complexity, and can even lead to overfitting. By computing the correlation range, you can spot these overlaps early and remove redundant features.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571) 270-0693. The examiner can normally be reached Monday - Friday, 7:30 am - 5:00 pm EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AMY TRAN/
Examiner, Art Unit 2126

/DAVID YI/
Supervisory Patent Examiner, Art Unit 2126

Prosecution Timeline

Jun 17, 2021
Application Filed
Feb 20, 2025
Non-Final Rejection — §103
May 08, 2025
Applicant Interview (Telephonic)
May 08, 2025
Examiner Interview Summary
May 27, 2025
Response Filed
Sep 26, 2025
Final Rejection — §103
Nov 20, 2025
Examiner Interview Summary
Nov 20, 2025
Applicant Interview (Telephonic)
Dec 01, 2025
Response after Non-Final Action
Jan 02, 2026
Request for Continued Examination
Jan 17, 2026
Response after Non-Final Action
Mar 10, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602582
DYNAMIC DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS
2y 5m to grant Granted Apr 14, 2026
Patent 12468932
IDENTIFYING RELATED MESSAGES IN A NATURAL LANGUAGE INTERACTION
2y 5m to grant Granted Nov 11, 2025
Patent 12462185
SCENE GRAMMAR BASED REINFORCEMENT LEARNING IN AGENT TRAINING
2y 5m to grant Granted Nov 04, 2025
Patent 12423589
TRAINING DECISION TREE-BASED PREDICTIVE MODELS
2y 5m to grant Granted Sep 23, 2025
Patent 12288074
GENERATING AND PROVIDING PROPOSED DIGITAL ACTIONS IN HIGH-DIMENSIONAL ACTION SPACES USING REINFORCEMENT LEARNING MODELS
2y 5m to grant Granted Apr 29, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
36%
Grant Probability
84%
With Interview (+47.9%)
5y 2m
Median Time to Grant
High
PTA Risk
Based on 28 resolved cases by this examiner. Grant probability derived from career allow rate.
