autoCV Module

Description:
  • This module is used for model selection:
    • Automates model training with cross-validation
    • Grid-searches for the best hyperparameters
    • Exports the optimized models as pkl files and saves them in the ./pkl folder
    • Validates the optimized models and selects the best model
  • Classes
    • dynaClassifier : Focuses on classification problems
      • fit() : fit method for the classifier
    • dynaRegressor : Focuses on regression problems
      • fit() : fit method for the regressor
  • Currently available estimators
    • clf_cv : Class focusing on classification estimators
      • lgr : Logistic Regression (aka logit, MaxEnt) classifier - LogisticRegression()
      • svm : C-Support Vector Classification - svm.SVC()
      • mlp : Multi-layer Perceptron classifier - MLPClassifier()
      • ada : AdaBoost classifier - AdaBoostClassifier()
      • rf : Random Forest classifier - RandomForestClassifier()
      • gb : Gradient Boosting classifier - GradientBoostingClassifier()
      • xgb : XGBoost classifier - xgb.XGBClassifier()
    • reg_cv : Class focusing on regression estimators
      • lr : Linear Regression - LinearRegression()
      • knn : Regression based on k-nearest neighbors - KNeighborsRegressor()
      • svr : Epsilon-Support Vector Regression - svm.SVR()
      • rf : Random Forest regressor - RandomForestRegressor()
      • ada : AdaBoost regressor - AdaBoostRegressor()
      • gb : Gradient Boosting regressor - GradientBoostingRegressor()
      • tree : Decision tree regressor - DecisionTreeRegressor()
      • mlp : Multi-layer Perceptron regressor - MLPRegressor()
      • xgb : XGBoost regressor - XGBRegressor()
      • hgboost : Histogram-based Gradient Boosting regressor - HistGradientBoostingRegressor(); added on 8/7/2020
      • huber : Huber regressor - HuberRegressor(); added on 8/7/2020
      • rgcv : Ridge regression with built-in cross-validation - RidgeCV(); added on 8/7/2020
      • cvlasso : Lasso regression with built-in cross-validation - LassoCV(); added on 8/7/2020
      • sgd : Stochastic Gradient Descent regressor - SGDRegressor(); added on 8/7/2020

dynaClassifier

class dynapipe.autoCV.dynaClassifier(custom_estimators=None, random_state=13, cv_num=5, in_pipeline=False, input_from_file=True)[source]

This class implements classification model selection with hyperparameter grid search and cross-validation.

Parameters:
  • custom_estimators (list, default = None) – Custom set of estimators to use in the autoCV classification module (if None, all available estimators are used). The current version's default available estimators are ['lgr', 'svm', 'mlp', 'rf', 'ada', 'gb', 'xgb'].
  • random_state (int, default = 13) – Random state value.
  • cv_num (int, default = 5) – Number of folds for cross-validation.
  • in_pipeline (bool, default = False) – Should be set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.
  • input_from_file (bool, default = True) – Set to True when the input dataset is a df; otherwise (e.g. an array), set to False.
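
To make the role of cv_num concrete, the following is a minimal, standard-library sketch of how a dataset can be partitioned into cv_num folds, each fold serving once as the validation part. It is an illustration only: the helper name k_fold_indices is hypothetical, and this page does not show which splitter the module uses internally.

```python
def k_fold_indices(n_samples, cv_num=5):
    """Return cv_num (train_indices, validation_indices) pairs.

    Hypothetical helper for illustration; not part of dynapipe.
    """
    indices = list(range(n_samples))
    # When n_samples is not divisible by cv_num, the first `extra`
    # folds receive one additional sample.
    base, extra = divmod(n_samples, cv_num)
    folds = []
    start = 0
    for i in range(cv_num):
        size = base + (1 if i < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, cv_num=5)
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "rows")
```

Each candidate hyperparameter setting is trained and scored once per fold, so a larger cv_num gives a steadier score estimate at a proportionally higher training cost.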

Example

https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#model-selection-for-a-classification-problem-using-autocv


fit(tr_features=None, tr_labels=None)[source]

Fit and train datasets with classification hyperparameter grid search and cross-validation across multiple estimators. The module automatically saves each trained model as a {estimator_name}_clf_model.pkl file in the ./pkl folder.

Parameters:
  • tr_features (df, default = None) – Training feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
  • tr_labels (df, default = None) – Training label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns:
  • cv_num (int) – Number of folds for cross-validation.
  • DICT_EST (dictionary) – Keys are the estimator names; values are the related trained models.
  • NOTE - The trained-model auto-save function is only available when in_pipeline = False.
  • NOTE - Log records are generated and saved to the ./logs folder automatically.

dynaRegressor

class dynapipe.autoCV.dynaRegressor(custom_estimators=None, random_state=25, cv_num=5, in_pipeline=False, input_from_file=True)[source]

This class implements regression model selection with hyperparameter grid search and cross-validation. The module automatically saves each trained model as a {estimator_name}_reg_model.pkl file in the ./pkl folder.

Parameters:
  • custom_estimators (list, default = None) – Custom set of estimators to use in the autoCV regression module (if None, all available estimators are used). The current version's default available estimators are ['lr', 'knn', 'tree', 'svm', 'mlp', 'rf', 'gb', 'ada', 'xgb', 'hgboost', 'huber', 'rgcv', 'cvlasso', 'sgd'].
  • random_state (int, default = 25) – Random state value.
  • cv_num (int, default = 5) – Number of folds for cross-validation.
  • in_pipeline (bool, default = False) – Should be set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.
  • input_from_file (bool, default = True) – Set to True when the input dataset is a df; otherwise (e.g. an array), set to False.

Example

https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#model-selection-for-a-regression-problem-using-autocv


fit(tr_features=None, tr_labels=None)[source]

Fit and train datasets with regression hyperparameters GridSearch and CV across multiple estimators.

Parameters:
  • tr_features (df, default = None) – Training feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
  • tr_labels (df, default = None) – Training label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns:

  • cv_num (int) – Number of folds for cross-validation.
  • DICT_EST (dictionary) – Keys are the estimator names; values are the related trained models.
  • NOTE - The trained-model auto-save function is only available when in_pipeline = False.
  • NOTE - Log records are generated and saved to the ./logs folder automatically.

evaluate_model

class dynapipe.autoCV.evaluate_model(model_type=None, in_pipeline=False)[source]

This class implements model evaluation and returns the key score results.

Parameters:
  • model_type (str, default = None) – Value in ["reg", "cls"]: "reg" for a regression problem, "cls" for a classification problem.
  • in_pipeline (bool, default = False) – Should be set to True when using the autoPipe module to build Pipeline Cluster Traversal Experiments.

Example

https://dynamic-pipeline.readthedocs.io/en/latest/demos.html#build-pipeline-cluster-traveral-experiments-using-autopipe


fit(name=None, model=None, features=None, labels=None)[source]

Evaluate a trained model on the validation dataset.

Parameters:
  • name (str, default = None) – Estimator name.
  • model (pkl, default = None) – Model to evaluate. When in_pipeline = False, a pkl file is required as input; otherwise, use DICT_EST[estimator name] as the input.
  • features (df, default = None) – Validation feature columns. (NOTE: In the Pipeline Cluster Traversal Experiments, the feature columns should come from the same pipeline dataset.)
  • labels (df, default = None) – Validation label column. (NOTE: In the Pipeline Cluster Traversal Experiments, the label column should come from the same pipeline dataset.)
Returns:

optimal_scores – When model_type = "cls", returns the validation results as [name, accuracy, precision, recall, latency]. When model_type = "reg", returns the validation results as [name, R-squared, MAE, MSE, RMSE, latency].

Return type:

list
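
For readers unfamiliar with the scores listed above, the sketch below computes them by hand with the standard library. It is not the module's implementation (this page does not show how the scores are computed). The function names and toy values are hypothetical, the classification part assumes a binary problem with positive label 1, and latency is left out because it depends on run time.

```python
import math


def classification_scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def regression_scores(y_true, y_pred):
    """Return (R-squared, MAE, MSE, RMSE); y_true must not be constant."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r_squared = 1 - ss_res / ss_tot
    return r_squared, mae, mse, rmse


# Toy values, made up purely for illustration.
cls_result = ["lgr", *classification_scores([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])]
reg_result = ["lr", *regression_scores([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])]
print(cls_result)
print(reg_result)
```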

clf_cv

class dynapipe.estimatorCV.clf_cv(cv_val=None, random_state=None)[source]

This class stores classification estimators.

Parameters:
  • random_state (int, default = None) – Random state value.
  • cv_val (int, default = None) – # of folds for cross-validation.


reg_cv

class dynapipe.estimatorCV.reg_cv(cv_val=None, random_state=None)[source]

This class stores regression estimators.

Parameters:
  • random_state (int, default = None) – Random state value.
  • cv_val (int, default = None) – # of folds for cross-validation.


data_splitting_tool

dynapipe.utilis_func.data_splitting_tool(feature_cols=None, label_col=None, val_size=0.2, test_size=0.2, random_state=13)[source]

Splitting each pipeline’s dataset into train, validate, and test parts for Pipeline Cluster Traversal Experiments.

NOTE: When in_pipeline = True, this function is built into the autoPipe module, so the splitting rule must be set up with pipeline_splitting_rule() instead.

Parameters:
  • label_col (array/df, default = None) – Label column.
  • feature_cols (df, default = None) – Feature columns.
  • val_size (float, default = 0.2) – Value in the range [0, 1]. Fraction of the data used for validation. NOTE - When val_size is given no value, X_val and y_val are not returned.
  • test_size (float, default = 0.2) – Value in the range [0, 1]. Fraction of the data used for testing.
  • random_state (int, default = 13) – Random state value.
Returns:

  • X_train (array) – Train features dataset
  • y_train (array) – Train label dataset
  • X_val (array) – Validate features dataset
  • y_val (array) – Validate label dataset
  • X_test (array) – Test features dataset
  • y_test (array) – Test label dataset
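
The three-way split described above can be sketched with the standard library as follows. This is only an illustration of the idea: split_rows is a hypothetical helper, and it assumes that val_size and test_size are both fractions of the full dataset, which this page does not state explicitly. The return order mirrors the list above.

```python
import random


def split_rows(features, labels, val_size=0.2, test_size=0.2, random_state=13):
    """Shuffle row indices once, then slice them into test, validate, train."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    order = list(range(n))
    # A seeded generator keeps the split reproducible for a given random_state.
    random.Random(random_state).shuffle(order)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, val_idx), pick(labels, val_idx),
            pick(features, test_idx), pick(labels, test_idx))


# Ten toy rows; each label equals the row number so pairs are easy to check.
toy_features = [[i, i * 2] for i in range(10)]
toy_labels = list(range(10))
X_train, y_train, X_val, y_val, X_test, y_test = split_rows(toy_features, toy_labels)
print(len(X_train), len(X_val), len(X_test))
```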

reset_parameters

dynapipe.utilis_func.reset_parameters()[source]

Reset autoCV estimators hyperparameters and searching range to default values.

Parameters: None
Returns: None

update_parameters

dynapipe.utilis_func.update_parameters(mode='None', estimator_name='None', **kwargs)[source]

Update autoCV estimators hyperparameters and searching range to custom values.

NOTE: Each call can update only one estimator.

Parameters:
  • mode (str, default = None) – Value in ["cls", "reg"]. "cls" modifies classification estimators; "reg" modifies regression estimators.
  • estimator_name (str, default = None) – Name of the estimator.
  • **kwargs (list, default = None) – Comma-separated keyword arguments whose values are lists, e.g. C=[0.1, 0.2], kernel=["linear"].
Returns: None

Example
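
The sketch below imitates the effect of such an update on a small in-memory copy of the default search ranges. It is a stand-in, not the library's code: update_grid and DEFAULT_CLS_GRID are hypothetical names, and it assumes that a supplied list replaces the existing range for that parameter rather than extending it. With the library itself, the corresponding call would take the form update_parameters(mode="cls", estimator_name="svm", C=[0.1, 0.2], kernel=["linear"]).

```python
import copy

# Hypothetical in-memory copy of two classifier search ranges.
DEFAULT_CLS_GRID = {
    "lgr": {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    "svm": {"C": [0.1, 1, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]},
}


def update_grid(grids, estimator_name, **kwargs):
    """Return a copy of grids with one estimator's ranges replaced.

    Each keyword argument names a hyperparameter and supplies a list of
    candidate values, mirroring the **kwargs convention described above.
    """
    if estimator_name not in grids:
        raise KeyError(f"unknown estimator: {estimator_name}")
    updated = copy.deepcopy(grids)
    for param, values in kwargs.items():
        if not isinstance(values, list):
            raise TypeError(f"{param} must be given as a list of values")
        updated[estimator_name][param] = values
    return updated


custom_grid = update_grid(DEFAULT_CLS_GRID, "svm", C=[0.1, 0.2], kernel=["linear"])
print(custom_grid["svm"])
```

Only one estimator is touched per call, in line with the note above; the other entries keep their default ranges.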

export_parameters

dynapipe.utilis_func.export_parameters()[source]

Export the current autoCV estimators' hyperparameters and search ranges to the current working directory.

Parameters: None
Returns: None

Default Parameters for Classifiers/Regressors

Default parameter settings of the estimators:

Classifier Estimators: Default Parameter Search Ranges

Estimator  Parameter             Value Range
lgr        'C'                   [0.001, 0.01, 0.1, 1, 10, 100, 1000]
svm        'C'                   [0.1, 1, 10]
           'kernel'              ['linear', 'poly', 'rbf', 'sigmoid']
mlp        'activation'          ['identity', 'relu', 'tanh', 'logistic']
           'hidden_layer_sizes'  [10, 50, 100]
           'learning_rate'       ['constant', 'invscaling', 'adaptive']
           'solver'              ['lbfgs', 'sgd', 'adam']
ada        'n_estimators'        [50, 100, 150]
           'learning_rate'       [0.01, 0.1, 1, 5, 10]
rf         'max_depth'           [2, 4, 8, 16, 32]
           'n_estimators'        [5, 50, 250]
gb         'n_estimators'        [50, 100, 150, 200, 250, 300]
           'max_depth'           [1, 3, 5, 7, 9]
           'learning_rate'       [0.01, 0.1, 1, 10, 100]
xgb        'n_estimators'        [50, 100, 150, 200, 250, 300]
           'max_depth'           [3, 5, 7, 9]
           'learning_rate'       [0.01, 0.1, 0.2, 0.3, 0.4]

Regressor Estimators: Default Parameter Search Ranges

Estimator  Parameter             Value Range
lr         'normalize'           [True, False]
svm        'C'                   [0.1, 1, 10]
           'kernel'              ['linear', 'poly', 'rbf', 'sigmoid']
mlp        'activation'          ['identity', 'relu', 'tanh', 'logistic']
           'hidden_layer_sizes'  [10, 50, 100]
           'learning_rate'       ['constant', 'invscaling', 'adaptive']
           'solver'              ['lbfgs', 'adam']
ada        'n_estimators'        [50, 100, 150, 200, 250, 300]
           'loss'                ['linear', 'square', 'exponential']
           'learning_rate'       [0.01, 0.1, 0.2, 0.3, 0.4]
tree       'splitter'            ['best', 'random']
           'max_depth'           [1, 3, 5, 7, 9]
           'min_samples_leaf'    [1, 3, 5]
rf         'max_depth'           [2, 4, 8, 16, 32]
           'n_estimators'        [5, 50, 250]
gb         'n_estimators'        [50, 100, 150, 200, 250, 300]
           'max_depth'           [3, 5, 7, 9]
           'learning_rate'       [0.01, 0.1, 0.2, 0.3, 0.4]
xgb        'n_estimators'        [50, 100, 150, 200, 250, 300]
           'max_depth'           [3, 5, 7, 9]
           'learning_rate'       [0.01, 0.1, 0.2, 0.3, 0.4]
sgd        'shuffle'             [True, False]
           'penalty'             ['l2', 'l1', 'elasticnet']
           'learning_rate'       ['constant', 'optimal', 'invscaling']
cvlasso    'fit_intercept'       [True, False]
rgcv       'fit_intercept'       [True, False]
huber      'fit_intercept'       [True, False]
hgboost    'max_depth'           [3, 5, 7, 9]
           'learning_rate'       [0.1, 0.2, 0.3, 0.4]
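
To give a feel for the cost of these search ranges, the sketch below expands the classifier ranges listed above into individual candidates and counts the cross-validation fits for cv_num = 5. It is a back-of-the-envelope illustration, assuming an exhaustive grid search that tries every combination; the helper names are hypothetical, and any final refit on the full training set is not counted.

```python
from itertools import product
from math import prod

# Classifier search ranges copied from the defaults listed above.
CLS_GRID = {
    "lgr": {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    "svm": {"C": [0.1, 1, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]},
    "mlp": {
        "activation": ["identity", "relu", "tanh", "logistic"],
        "hidden_layer_sizes": [10, 50, 100],
        "learning_rate": ["constant", "invscaling", "adaptive"],
        "solver": ["lbfgs", "sgd", "adam"],
    },
    "ada": {"n_estimators": [50, 100, 150], "learning_rate": [0.01, 0.1, 1, 5, 10]},
    "rf": {"max_depth": [2, 4, 8, 16, 32], "n_estimators": [5, 50, 250]},
    "gb": {
        "n_estimators": [50, 100, 150, 200, 250, 300],
        "max_depth": [1, 3, 5, 7, 9],
        "learning_rate": [0.01, 0.1, 1, 10, 100],
    },
    "xgb": {
        "n_estimators": [50, 100, 150, 200, 250, 300],
        "max_depth": [3, 5, 7, 9],
        "learning_rate": [0.01, 0.1, 0.2, 0.3, 0.4],
    },
}


def expand(grid):
    """List every hyperparameter combination in one estimator's grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def count_candidates(grid):
    """Number of combinations, without materializing them."""
    return prod(len(values) for values in grid.values())


cv_num = 5
cv_fits = {name: count_candidates(grid) * cv_num for name, grid in CLS_GRID.items()}
svm_candidates = expand(CLS_GRID["svm"])

for name, n_fits in cv_fits.items():
    print(f"{name}: {n_fits // cv_num} candidates, {n_fits} cross-validation fits")
```

Under these assumptions the gradient boosting ranges alone produce 150 candidates, so restricting custom_estimators or narrowing ranges with update_parameters() can shorten a run considerably.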