autoCV Module
- Description :
- This module is used for model selection:
- Automate model training with cross-validation
- Grid-search the best hyperparameters
- Export the optimized models as pkl files and save them in the ./pkl folder
- Validate the optimized models and select the best one
- Class
- dynaClassifier : Focus on classification problems
- fit() : fit method for classifier
- dynaRegressor : Focus on regression problems
- fit() : fit method for regressor
- Currently available estimators
- clf_cv : Class focusing on classification estimators
- lgr : Logistic Regression (aka logit, MaxEnt) classifier - LogisticRegression()
- svm : C-Support Vector Classification - SVM.SVC()
- mlp : Multi-layer Perceptron classifier - MLPClassifier()
- ada : An AdaBoost classifier - AdaBoostClassifier()
- rf : Random Forest classifier - RandomForestClassifier()
- gb : Gradient Boost classifier - GradientBoostingClassifier()
- xgb : XGBoost classifier - xgb.XGBClassifier()
- reg_cv : Class focusing on regression estimators
- lr : Linear Regression - LinearRegression()
- knn : Regression based on k-nearest neighbors - KNeighborsRegressor()
- svr : Epsilon-Support Vector Regression - SVM.SVR()
- rf : Random Forest Regression - RandomForestRegressor()
- ada : An AdaBoost regressor - AdaBoostRegressor()
- gb : Gradient Boosting for regression - GradientBoostingRegressor()
- tree : A decision tree regressor - DecisionTreeRegressor()
- mlp : Multi-layer Perceptron regressor - MLPRegressor()
- xgb : XGBoost regression - XGBRegressor()
- hgboost : Histogram-based Gradient Boosting regression - HistGradientBoostingRegressor(); added on 8/7/2020
- huber : Huber regression - HuberRegressor(); added on 8/7/2020
- rgcv : Ridge regression with built-in cross-validation - RidgeCV(); added on 8/7/2020
- cvlasso : Lasso regression with built-in cross-validation - LassoCV(); added on 8/7/2020
- sgd : Stochastic Gradient Descent regression - SGDRegressor(); added on 8/7/2020
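The workflow described above (train with cross-validation, grid-search the parameters, keep the best candidate) can be sketched with the standard library alone. This is an illustration of the technique, not dynapipe's implementation: the toy one-parameter estimator `fit_ridge_1d` and the helper names are assumptions made for the example, and the defaults `cv_num=5` and `random_state=13` are borrowed from the dynaClassifier signature.

```python
import random
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, cv_num=5, random_state=13):
    """Yield (train_idx, val_idx) pairs for cv_num shuffled folds."""
    idx = list(range(n_samples))
    random.Random(random_state).shuffle(idx)
    folds = [idx[i::cv_num] for i in range(cv_num)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def fit_ridge_1d(xs, ys, alpha):
    """Toy stand-in estimator: slope of y = w * x with an L2 penalty alpha."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def neg_mse(w, xs, ys):
    """Score where higher is better: negative mean squared error."""
    return -mean((y - w * x) ** 2 for x, y in zip(xs, ys))


def grid_search_cv(xs, ys, param_grid, cv_num=5, random_state=13):
    """Return (best_params, best_score) by mean validation score over the folds."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for tr, va in k_fold_indices(len(xs), cv_num, random_state):
            w = fit_ridge_1d([xs[i] for i in tr], [ys[i] for i in tr], **params)
            scores.append(neg_mse(w, [xs[i] for i in va], [ys[i] for i in va]))
        if mean(scores) > best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noiseless line, so the smallest penalty should win
best_params, best_score = grid_search_cv(xs, ys, {"alpha": [0.001, 0.1, 10.0]})
print(best_params, best_score)
```

Every candidate is scored on the same folds, so the comparison between parameter values is not affected by how the data happened to be split.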
dynaClassifier
- class dynapipe.autoCV.dynaClassifier(custom_estimators=None, random_state=13, cv_num=5, in_pipeline=False, input_from_file=True)
This class implements classification model selection with hyperparameter grid search and cross-validation.
Parameters:
- custom_estimators (list, default = None) – Estimators to use in the autoCV classification module (if None, all available estimators are used). The default available estimators in the current version are ['lgr', 'svm', 'mlp', 'rf', 'ada', 'gb', 'xgb'].
- random_state (int, default = 13) – Random state value.
- cv_num (int, default = 5) – Number of folds for cross-validation.
- in_pipeline (bool, default = False) – Set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.
- input_from_file (bool, default = True) – Set to True when the input dataset is a DataFrame (df); set to False otherwise, e.g. for an array.
Example
https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#model-selection-for-a-classification-problem-using-autocv
- fit(tr_features=None, tr_labels=None)
Fit and train datasets with classification hyperparameter grid search and cross-validation across multiple estimators. The module automatically saves each trained model as {estimator_name}_clf_model.pkl in the ./pkl folder.
Parameters:
- tr_features (df, default = None) – Training feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
- tr_labels (default = None) – Training label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns:
- cv_num (int) – Number of folds for cross-validation.
- DICT_EST (dictionary) – Keys are estimator names; values are the corresponding trained models.
- NOTE - Trained models are saved automatically only when in_pipeline = False.
- NOTE - Log records are generated and saved to the ./logs folder automatically.
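A minimal usage sketch, built only from the signatures documented above and not verified against a specific dynapipe release. The wrapper name `select_classifier` and the chosen estimator subset are assumptions for illustration; the import is placed inside the function so that the sketch can be loaded without dynapipe installed.

```python
def select_classifier(tr_features, tr_labels):
    """Run autoCV classification model selection on a training split.

    Assumption: dynapipe is installed and behaves as documented above.
    """
    # Imported lazily so this sketch can be read and loaded without dynapipe.
    from dynapipe.autoCV import dynaClassifier

    clf_cv = dynaClassifier(
        custom_estimators=["lgr", "svm", "rf"],  # subset of the documented names
        random_state=13,
        cv_num=5,
        in_pipeline=False,     # standalone use, so models are saved to ./pkl
        input_from_file=True,  # inputs are DataFrames
    )
    # Documented return values: the number of folds and a dictionary
    # mapping each estimator name to its trained model.
    return clf_cv.fit(tr_features, tr_labels)
```

Passing a subset through `custom_estimators` keeps the search short while experimenting; leaving it as None is documented to run every available estimator.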
dynaRegressor
- class dynapipe.autoCV.dynaRegressor(custom_estimators=None, random_state=25, cv_num=5, in_pipeline=False, input_from_file=True)
This class implements regression model selection with hyperparameter grid search and cross-validation. The module automatically saves each trained model as {estimator_name}_reg_model.pkl in the ./pkl folder.
Parameters:
- custom_estimators (list, default = None) – Estimators to use in the autoCV regression module (if None, all available estimators are used). The default available estimators in the current version are ['lr', 'knn', 'tree', 'svm', 'mlp', 'rf', 'gb', 'ada', 'xgb', 'hgboost', 'huber', 'rgcv', 'cvlasso', 'sgd'].
- random_state (int, default = 25) – Random state value.
- cv_num (int, default = 5) – Number of folds for cross-validation.
- in_pipeline (bool, default = False) – Set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.
- input_from_file (bool, default = True) – Set to True when the input dataset is a DataFrame (df); set to False otherwise, e.g. for an array.
Example
https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#model-selection-for-a-regression-problem-using-autocv
- fit(tr_features=None, tr_labels=None)
Fit and train datasets with regression hyperparameter grid search and cross-validation across multiple estimators.
Parameters:
- tr_features (df, default = None) – Training feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
- tr_labels (df, default = None) – Training label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns:
- cv_num (int) – Number of folds for cross-validation.
- DICT_EST (dictionary) – Keys are estimator names; values are the corresponding trained models.
- NOTE - Trained models are saved automatically only when in_pipeline = False.
- NOTE - Log records are generated and saved to the ./logs folder automatically.
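The saved models follow the documented naming pattern {estimator_name}_reg_model.pkl (or _clf_model.pkl) under ./pkl, so they can be located by name later. The sketch below assumes the files were written with Python's standard pickle module; if the library writes them with joblib instead, `joblib.load` would be the matching loader. The helper names `model_path` and `load_model` are hypothetical.

```python
import os
import pickle


def model_path(estimator_name, kind="reg", folder="./pkl"):
    """Build the documented file name, e.g. ./pkl/rf_reg_model.pkl."""
    return os.path.join(folder, f"{estimator_name}_{kind}_model.pkl")


def load_model(estimator_name, kind="reg", folder="./pkl"):
    """Unpickle a saved model (assumes a plain pickle file).

    Only unpickle files you created yourself: loading a pickle runs code.
    """
    with open(model_path(estimator_name, kind, folder), "rb") as fh:
        return pickle.load(fh)
```

For example, `load_model("rf")` would read ./pkl/rf_reg_model.pkl, and `load_model("svm", kind="clf")` would read ./pkl/svm_clf_model.pkl.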
evaluate_model
- class dynapipe.autoCV.evaluate_model(model_type=None, in_pipeline=False)
This class implements model evaluation and returns key score results.
Parameters:
- model_type (str, default = None) – One of ["reg", "cls"]: "reg" for regression problems, "cls" for classification problems.
- in_pipeline (bool, default = False) – Set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.
Example
https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#build-pipeline-cluster-traveral-experiments-using-autopipe
- fit(name=None, model=None, features=None, labels=None)
Evaluate models on the validation datasets.
Parameters:
- name (str, default = None) – Estimator name.
- model (pkl, default = None) – Model to evaluate. Pass a pkl file when in_pipeline = False; otherwise, pass DICT_EST[estimator name].
- features (df, default = None) – Validation feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
- labels (df, default = None) – Validation label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns: optimal_scores – When model_type = "cls", returns [name, accuracy, precision, recall, latency] for the model validation results. When model_type = "reg", returns [name, R-squared, MAE, MSE, RMSE, latency] for the model validation results.
Return type: list
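To make the regression result list concrete, here is a standard-library sketch that computes the same six fields from a prediction function. It is an illustration of the metrics, not the library's code: the function name `regression_scores` is hypothetical, and treating latency as the wall-clock time of the prediction call is an assumption.

```python
import math
import time


def regression_scores(name, predict, features, labels):
    """Return [name, R-squared, MAE, MSE, RMSE, latency] for one model."""
    start = time.perf_counter()
    preds = predict(features)
    latency = time.perf_counter() - start  # assumed meaning: prediction time in seconds

    n = len(labels)
    errors = [y - p for y, p in zip(labels, preds)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(labels) / n
    ss_tot = sum((y - y_mean) ** 2 for y in labels)  # zero if all labels are equal
    r2 = 1.0 - (mse * n) / ss_tot
    return [name, r2, mae, mse, rmse, latency]


# Toy model y = 2x evaluated on four validation rows.
scores = regression_scores("toy", lambda xs: [2 * x for x in xs], [1, 2, 3, 4], [2, 4, 6, 9])
print(scores)
```

RMSE is the square root of MSE, so it is reported in the same units as the label, which often makes it the easier of the two to interpret.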
clf_cv
reg_cv
data_splitting_tool
- dynapipe.utilis_func.data_splitting_tool(feature_cols=None, label_col=None, val_size=0.2, test_size=0.2, random_state=13)
Split each pipeline's dataset into train, validate, and test parts for Pipeline Cluster Traversal Experiments.
NOTE: When in_pipeline = True, this function is built into the autoPipe module, so use pipeline_splitting_rule() to set up the splitting rule.
Parameters:
- label_col (array/df, default = None) – Label column.
- feature_cols (df, default = None) – Feature columns.
- val_size (float, default = 0.2) – Value in [0, 1]. Fraction of the data used for validation. NOTE - When no val_size value is given, X_val and y_val are not returned.
- test_size (float, default = 0.2) – Value in [0, 1]. Fraction of the data used for testing.
- random_state (int, default = 13) – Random state value.
Returns:
- X_train (array) – Train features dataset
- y_train (array) – Train label dataset
- X_val (array) – Validate features dataset
- y_val (array) – Validate label dataset
- X_test (array) – Test features dataset
- y_test (array) – Test label dataset
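The three-way split can be sketched as follows. This is a simplified stand-in, not the library's code: it assumes that val_size and test_size are both fractions of the full dataset, which the documentation above does not state explicitly, and the function name `split_rows` is hypothetical.

```python
import random


def split_rows(rows, val_size=0.2, test_size=0.2, random_state=13):
    """Shuffle rows and cut them into train / validate / test lists.

    Assumption: both sizes are fractions of the full dataset.
    """
    if not 0 <= val_size + test_size < 1:
        raise ValueError("val_size + test_size must be in [0, 1)")
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

With the defaults, 100 rows are divided into 60 training, 20 validation, and 20 test rows, and the same random_state always reproduces the same partition.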
reset_parameters
update_parameters
- dynapipe.utilis_func.update_parameters(mode='None', estimator_name='None', **kwargs)
Update the hyperparameters and search ranges of autoCV estimators to custom values.
NOTE: Each call can update only one estimator.
Parameters:
- mode (str, default = None) – One of ["cls", "reg"]. "cls" modifies classification estimators; "reg" modifies regression estimators.
- estimator_name (str, default = None) – Name of the estimator.
- **kwargs (list, default = None) – Comma-separated keyword arguments whose values are lists, e.g. C=[0.1, 0.2], kernel=["linear"].
Returns: None
Example
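The sketch below mimics the documented call shape with an in-memory stand-in, to show how keyword arguments map onto one estimator's search ranges. `PARAM_GRIDS` and `update_grid` are hypothetical names invented for this illustration; how dynapipe actually stores its parameters is not described here.

```python
# Hypothetical in-memory grids; the values mirror the svm rows of the default tables.
PARAM_GRIDS = {
    "cls": {"svm": {"C": [0.1, 1, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]}},
    "reg": {"svm": {"C": [0.1, 1, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]}},
}


def update_grid(grids, mode, estimator_name, **kwargs):
    """Overwrite the listed hyperparameter ranges for one estimator."""
    if mode not in grids:
        raise ValueError("mode must be 'cls' or 'reg'")
    if estimator_name not in grids[mode]:
        raise KeyError(estimator_name)
    for param, values in kwargs.items():
        if not isinstance(values, list):
            raise TypeError(f"{param} must be given as a list of candidate values")
        grids[mode][estimator_name][param] = values


# Same call shape as the documented
# update_parameters(mode="cls", estimator_name="svm", C=[0.1, 0.2], kernel=["linear"])
update_grid(PARAM_GRIDS, "cls", "svm", C=[0.1, 0.2], kernel=["linear"])
print(PARAM_GRIDS["cls"]["svm"])
```

Because each call targets a single estimator, changing the ranges of several estimators takes one call per estimator.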
export_parameters
Default Parameters for Classifiers/Regressors
Default parameter settings for the classification estimators:
| Estimator | Parameter | Value range |
|---|---|---|
| lgr | 'C' | [0.001, 0.01, 0.1, 1, 10, 100, 1000] |
| svm | 'C' | [0.1, 1, 10] |
| | 'kernel' | ['linear', 'poly', 'rbf', 'sigmoid'] |
| mlp | 'activation' | ['identity', 'relu', 'tanh', 'logistic'] |
| | 'hidden_layer_sizes' | [10, 50, 100] |
| | 'learning_rate' | ['constant', 'invscaling', 'adaptive'] |
| | 'solver' | ['lbfgs', 'sgd', 'adam'] |
| ada | 'n_estimators' | [50, 100, 150] |
| | 'learning_rate' | [0.01, 0.1, 1, 5, 10] |
| rf | 'max_depth' | [2, 4, 8, 16, 32] |
| | 'n_estimators' | [5, 50, 250] |
| gb | 'n_estimators' | [50, 100, 150, 200, 250, 300] |
| | 'max_depth' | [1, 3, 5, 7, 9] |
| | 'learning_rate' | [0.01, 0.1, 1, 10, 100] |
| xgb | 'n_estimators' | [50, 100, 150, 200, 250, 300] |
| | 'max_depth' | [3, 5, 7, 9] |
| | 'learning_rate' | [0.01, 0.1, 0.2, 0.3, 0.4] |
Default parameter settings for the regression estimators:
| Estimator | Parameter | Value range |
|---|---|---|
| lr | 'normalize' | [True, False] |
| svm | 'C' | [0.1, 1, 10] |
| | 'kernel' | ['linear', 'poly', 'rbf', 'sigmoid'] |
| mlp | 'activation' | ['identity', 'relu', 'tanh', 'logistic'] |
| | 'hidden_layer_sizes' | [10, 50, 100] |
| | 'learning_rate' | ['constant', 'invscaling', 'adaptive'] |
| | 'solver' | ['lbfgs', 'adam'] |
| ada | 'n_estimators' | [50, 100, 150, 200, 250, 300] |
| | 'loss' | ['linear', 'square', 'exponential'] |
| | 'learning_rate' | [0.01, 0.1, 0.2, 0.3, 0.4] |
| tree | 'splitter' | ['best', 'random'] |
| | 'max_depth' | [1, 3, 5, 7, 9] |
| | 'min_samples_leaf' | [1, 3, 5] |
| rf | 'max_depth' | [2, 4, 8, 16, 32] |
| | 'n_estimators' | [5, 50, 250] |
| gb | 'n_estimators' | [50, 100, 150, 200, 250, 300] |
| | 'max_depth' | [3, 5, 7, 9] |
| | 'learning_rate' | [0.01, 0.1, 0.2, 0.3, 0.4] |
| xgb | 'n_estimators' | [50, 100, 150, 200, 250, 300] |
| | 'max_depth' | [3, 5, 7, 9] |
| | 'learning_rate' | [0.01, 0.1, 0.2, 0.3, 0.4] |
| sgd | 'shuffle' | [True, False] |
| | 'penalty' | ['l2', 'l1', 'elasticnet'] |
| | 'learning_rate' | ['constant', 'optimal', 'invscaling'] |
| cvlasso | 'fit_intercept' | [True, False] |
| rgcv | 'fit_intercept' | [True, False] |
| huber | 'fit_intercept' | [True, False] |
| hgboost | 'max_depth' | [3, 5, 7, 9] |
| | 'learning_rate' | [0.1, 0.2, 0.3, 0.4] |
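The size of each search follows directly from these ranges. As a rough sketch, assuming an exhaustive grid search in which every combination is fitted once per fold, the svm grid listed above with the default cv_num of 5 expands like this:

```python
from itertools import product

# svm ranges copied from the default tables; cv_num is the documented default.
svm_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]}
cv_num = 5

# Every combination of one value per parameter is a candidate model.
candidates = [dict(zip(svm_grid, combo)) for combo in product(*svm_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_num  # one fit per candidate per fold (excludes any final refit)
print(n_candidates, n_fits)
```

The count grows multiplicatively with each added parameter, which is worth keeping in mind before widening several ranges for the same estimator.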