Preprocessing

Reading

class mlbox.preprocessing.Reader(sep=None, header=0, to_hdf5=False, to_path='save', verbose=True)[source]

Reads and cleans data

Parameters:
  • sep (str, default = None) – Delimiter to use when reading a csv file.
  • header (int or None, default = 0) – If header=0, the first line is considered the header. Otherwise, there is no header. Useful for csv and xls files.
  • to_hdf5 (bool, default = False) – If True, dumps each file to hdf5 format.
  • to_path (str, default = "save") – Name of the folder where files and encoders are saved.
  • verbose (bool, default = True) – Verbose mode
clean(path, drop_duplicate=False)[source]

Reads and cleans data (accepted formats: csv, xls, json and h5):

  • deletes unnamed columns
  • casts lists into variables
  • tries to cast variables into float
  • cleans dates and extracts features: timestamp (from 01/01/2017), year, month, day, day_of_week and hour
  • drops duplicates (if drop_duplicate=True)
Parameters:
  • path (str) – The path to the dataset.
  • drop_duplicate (bool, default = False) – If True, drop duplicates when reading each file.
Returns:

Cleaned dataset.

Return type:

pandas dataframe

train_test_split(Lpath, target_name)[source]

Creates train and test datasets

Given a list of several paths and a target name, automatically creates and cleans train and test datasets. IMPORTANT: a dataset is considered a test set if it does not contain the target value. Otherwise it is considered part of the train set. Also determines the task and encodes the target (classification problems only).

Finally, dumps the datasets to hdf5 and, where applicable, the target encoder.

Parameters:
  • Lpath (list, default = None) – List of str paths to the data files to load.
  • target_name (str, default = None) – The name of the target. Works for both classification (multiclass or not) and regression.
Returns:

Dictionary containing:

  • ’train’ : pandas dataframe for train dataset
  • ’test’ : pandas dataframe for test dataset
  • ’target’ : encoded pandas Series for the target on train set (with dtype=’float’ for a regression or dtype=’int’ for a classification)

Return type:

dict
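
Examples

A minimal sketch; the file names "train.csv" and "test.csv" and the target name "target" are hypothetical:

>>> from mlbox.preprocessing import Reader
>>> rd = Reader(sep=",")
>>> df = rd.train_test_split(["train.csv", "test.csv"], "target")  # hypothetical paths
>>> df["train"].shape  # cleaned train set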

Drift thresholding

class mlbox.preprocessing.Drift_thresholder(threshold=0.6, inplace=False, verbose=True, to_path='save')[source]

Automatically drops ids and drifting variables between train and test datasets.

Dropping is applied to both the train and test datasets. The list of drift coefficients is available and saved as “drifts.txt”. To get familiar with drift: https://github.com/AxeldeRomblay/MLBox/blob/master/docs/webinars/features.pdf

Parameters:
  • threshold (float, default = 0.6) – Drift threshold under which features are kept. Must be between 0. and 1. The lower the threshold, the more you restrict the kept features to non-drifting/stable variables: a feature with a drift measure of 0. is very stable and one with a measure of 1. is highly unstable.
  • inplace (bool, default = False) – If True, train and test datasets are transformed and self is returned. Otherwise, train and test datasets are not transformed and a new dictionary with cleaned datasets is returned.
  • verbose (bool, default = True) – Verbose mode
  • to_path (str, default = "save") – Name of the folder where the list of drift coefficients is saved.
drifts()[source]

Returns the univariate drifts for all variables.

Returns:Dictionary containing the drifts for each feature
Return type:dict
fit_transform(df)[source]

Fits and transforms train and test datasets

Automatically drops ids and drifting variables between train and test datasets. The list of drift coefficients is available and saved as “drifts.txt”

Parameters:df (dict, default = None) –

Dictionary containing:

  • ’train’ : pandas dataframe for train dataset
  • ’test’ : pandas dataframe for test dataset
  • ’target’ : pandas Series for the target on train set
Returns:Dictionary containing:
  • ’train’ : transformed pandas dataframe for train dataset
  • ’test’ : transformed pandas dataframe for test dataset
  • ’target’ : pandas Series for the target on train set
Return type:dict
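
Examples

A minimal sketch, assuming df is a dataset dictionary produced by Reader.train_test_split:

>>> from mlbox.preprocessing import Drift_thresholder
>>> dft = Drift_thresholder(threshold=0.6)
>>> df = dft.fit_transform(df)  # drops ids and drifting variables
>>> dft.drifts()  # drift coefficient for each feature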

Encoding

Missing values

class mlbox.encoding.NA_encoder(numerical_strategy='mean', categorical_strategy='<NULL>')[source]

Encodes missing values for both numerical and categorical features.

Several strategies are possible in each case.

Parameters:
  • numerical_strategy (str or float or int. default = "mean") – The strategy to encode NA for numerical features. Available strategies = “mean”, “median”, “most_frequent” or a float/int value
  • categorical_strategy (str, default = '<NULL>') – The strategy to encode NA for categorical features. Available strategies = a string or “most_frequent”
fit(df_train, y_train=None)[source]

Fits NA Encoder.

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical and categorical features.
  • y_train (pandas series of shape = (n_train, ), default = None) – The target for classification or regression tasks.
Returns:

self

Return type:

object

fit_transform(df_train, y_train=None)[source]

Fits NA Encoder and transforms the dataset.

Parameters:
  • df_train (pandas.Dataframe of shape = (n_train, n_features)) – The train dataset with numerical and categorical features.
  • y_train (pandas.Series of shape = (n_train, ), default = None) – The target for classification or regression tasks.
Returns:

The train dataset with no missing values.

Return type:

pandas.Dataframe of shape = (n_train, n_features)

get_params(deep=True)[source]

Get parameters of a NA_encoder object.

set_params(**params)[source]

Set parameters for a NA_encoder object.

Set numerical strategy and categorical strategy.

Parameters:
  • numerical_strategy (str or float or int. default = "mean") – The strategy to encode NA for numerical features.
  • categorical_strategy (str, default = '<NULL>') – The strategy to encode NA for categorical features.
transform(df)[source]

Transform the dataset.

Parameters:df (pandas.Dataframe of shape = (n, n_features)) – The dataset with numerical and categorical features.
Returns:The dataset with no missing values.
Return type:pandas.Dataframe of shape = (n, n_features)
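
Examples

A minimal sketch on a toy dataframe; the column names are hypothetical:

>>> import numpy as np
>>> import pandas as pd
>>> from mlbox.encoding import NA_encoder
>>> df_train = pd.DataFrame({"age": [22., np.nan, 35.],
...                          "city": ["Paris", np.nan, "Lyon"]})  # hypothetical columns
>>> na = NA_encoder(numerical_strategy="median", categorical_strategy="<NULL>")
>>> na.fit_transform(df_train)  # "age" NA -> 28.5, "city" NA -> "<NULL>"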

Categorical features

class mlbox.encoding.Categorical_encoder(strategy='label_encoding', verbose=False)[source]

Encodes categorical features.

Several strategies are possible (supervised or not). Works for both classification and regression tasks.

Parameters:
  • strategy (str, default = "label_encoding") – The strategy to encode categorical features. Available strategies = {“label_encoding”, “dummification”, “random_projection”, “entity_embedding”}
  • verbose (bool, default = False) – Verbose mode. Useful for entity embedding strategy.
fit(df_train, y_train)[source]

Fit Categorical Encoder.

Encode categorical variable of a dataframe following strategy parameters.

Parameters:
  • df_train (pandas.Dataframe of shape = (n_train, n_features)) – The training dataset with numerical and categorical features. NA values are allowed.
  • y_train (pandas.Series of shape = (n_train, )) – The target for classification or regression tasks.
Returns:

self

Return type:

object

fit_transform(df_train, y_train)[source]

Fits Categorical Encoder and transforms the dataset.

Fit categorical encoder following strategy parameter and transform the dataset df_train.

Parameters:
  • df_train (pandas.Dataframe of shape = (n_train, n_features)) – The training dataset with numerical and categorical features. NA values are allowed.
  • y_train (pandas.Series of shape = (n_train, )) – The target for classification or regression tasks.
Returns:

Training dataset with numerical and encoded categorical features.

Return type:

pandas.Dataframe of shape = (n_train, n_features)

get_params(deep=True)[source]

Get parameters that can be defined by the user.

Get strategy and verbose parameters.

Parameters:
  • strategy (str, default = "label_encoding") – The strategy to encode categorical features. Available strategies = {“label_encoding”, “dummification”, “random_projection”, “entity_embedding”}
  • verbose (bool, default = False) – Verbose mode. Useful for entity embedding strategy.
Returns:

dict – Dictionary that contains strategy and verbose parameters.

Return type:

dict

set_params(**params)[source]

Set parameters of a Categorical_encoder object.

Set strategy and verbose parameters.

Parameters:
  • strategy (str, default = "label_encoding") – The strategy to encode categorical features. Available strategies = {“label_encoding”, “dummification”, “random_projection”, “entity_embedding”}
  • verbose (bool, default = False) – Verbose mode. Useful for entity embedding strategy.
transform(df)[source]

Transform categorical variable of df dataset.

Transforms df by encoding its categorical features following the strategy parameter. The encoder must be fitted first (self.__fitOK set to True).

Parameters:df (pandas.Dataframe of shape = (n, n_features)) – The dataset with numerical and categorical features. NA values are allowed.
Returns:The dataset with numerical and encoded categorical features.
Return type:pandas.Dataframe of shape = (n, n_features)
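
Examples

A minimal sketch on a toy dataframe; the column and target names are hypothetical:

>>> import pandas as pd
>>> from mlbox.encoding import Categorical_encoder
>>> df_train = pd.DataFrame({"color": ["red", "blue", "red"],  # hypothetical columns
...                          "size": [1., 2., 3.]})
>>> y_train = pd.Series([0, 1, 0])
>>> ce = Categorical_encoder(strategy="label_encoding")
>>> ce.fit_transform(df_train, y_train)  # "color" is replaced by integer labels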

Model

Classification

Feature selection

class mlbox.model.classification.Clf_feature_selector(strategy='l1', threshold=0.3)[source]

Selects useful features.

Several strategies are possible (filter and wrapper methods). Works for classification problems only (multiclass or binary).

Parameters:
  • strategy (str, default = "l1") – The strategy to select features. Available strategies = {“variance”, “l1”, “rf_feature_importance”}
  • threshold (float, default = 0.3) – The percentage of variables to discard according to the strategy. Must be between 0. and 1.
fit(df_train, y_train)[source]

Fits Clf_feature_selector

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features and no NA
  • y_train (pandas series of shape = (n_train, )) – The target for classification task. Must be encoded.
Returns:

self

Return type:

object

fit_transform(df_train, y_train)[source]

Fits Clf_feature_selector and transforms the dataset

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features and no NA
  • y_train (pandas series of shape = (n_train, )) – The target for classification task. Must be encoded.
Returns:

The train dataset with relevant features

Return type:

pandas dataframe of shape = (n_train, n_features*(1-threshold))

transform(df)[source]

Transforms the dataset

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features and no NA
Returns:The dataset with relevant features
Return type:pandas dataframe of shape = (n, n_features*(1-threshold))
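
Examples

A minimal sketch, assuming df_train is a numerical train dataframe with no NA and y_train an encoded target:

>>> from mlbox.model.classification import Clf_feature_selector
>>> fs = Clf_feature_selector(strategy="rf_feature_importance", threshold=0.3)
>>> df_train = fs.fit_transform(df_train, y_train)  # keeps the 70% most relevant features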

Classification

class mlbox.model.classification.Classifier(**params)[source]

Wraps scikit-learn classifiers.

Parameters:
  • strategy (str, default = "LightGBM") – The choice for the classifier. Available strategies = {“LightGBM”, “RandomForest”, “ExtraTrees”, “Tree”, “Bagging”, “AdaBoost” or “Linear”}.
  • **params (default = None) – Parameters of the corresponding classifier. Examples: n_estimators, max_depth…
feature_importances()[source]

Compute feature importances.

Classifier must be fitted before.

Returns:Dictionary containing a measure of feature importance (value) for each feature (key).
Return type:dict
fit(df_train, y_train)[source]

Fits Classifier.

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features.
  • y_train (pandas series of shape = (n_train,)) – The numerical encoded target for classification tasks.
Returns:

self

Return type:

object

get_estimator()[source]

Return the classifier.

get_params(deep=True)[source]

Get strategy parameters of Classifier object.

predict(df)[source]

Predicts the target.

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
Returns:The encoded classes to be predicted.
Return type:array of shape = (n, )
predict_log_proba(df)[source]

Predicts class log-probabilities for df.

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
Returns:y – The log-probabilities for each class
Return type:array of shape = (n, n_classes)
predict_proba(df)[source]

Predicts class probabilities for df.

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
Returns:The probabilities for each class
Return type:array of shape = (n, n_classes)
score(df, y, sample_weight=None)[source]

Return the mean accuracy.

Parameters:
  • df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
  • y (pandas series of shape = (n,)) – The numerical encoded target for classification tasks.
Returns:

Mean accuracy of self.predict(df) wrt. y.

Return type:

float

set_params(**params)[source]

Set strategy parameters of Classifier object.
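
Examples

A minimal sketch, assuming df_train and df_test are numerical dataframes and y_train an encoded target:

>>> from mlbox.model.classification import Classifier
>>> clf = Classifier(strategy="RandomForest", n_estimators=100)
>>> clf.fit(df_train, y_train)
>>> y_pred = clf.predict(df_test)
>>> probas = clf.predict_proba(df_test)
>>> clf.feature_importances()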

Stacking

class mlbox.model.classification.StackingClassifier(base_estimators=[Classifier(strategy="LightGBM"), Classifier(strategy="RandomForest"), Classifier(strategy="ExtraTrees")], level_estimator=LogisticRegression(), n_folds=5, copy=False, drop_first=True, random_state=1, verbose=True)[source]

A stacking classifier.

A stacking classifier is a classifier that uses the predictions of several first layer estimators (generated with a cross validation method) for a second layer estimator.

Parameters:
  • base_estimators (list, default = [Classifier(strategy="LightGBM"), Classifier(strategy="RandomForest"), Classifier(strategy="ExtraTrees")]) – List of estimators to fit in the first level using a cross validation.
  • level_estimator (object, default = LogisticRegression()) – The estimator used in second and last level.
  • n_folds (int, default = 5) – Number of folds used to generate the meta features for the training set
  • copy (bool, default = False) – If true, meta features are added to the original dataset
  • drop_first (bool, default = True) – If True, each estimator outputs n_classes-1 probabilities
  • random_state (None or int or RandomState. default = 1) – Pseudo-random number generator state used for shuffling. If None, use default numpy RNG for shuffling.
  • verbose (bool, default = True) – Verbose mode.
fit(df_train, y_train)[source]

Fits the first level estimators and the second level estimator on df_train.

Parameters:
  • df_train (pandas dataframe of shape (n_samples, n_features)) – Input data
  • y_train (pandas series of shape = (n_samples, )) – The target
Returns:

self.

Return type:

object

fit_transform(df_train, y_train)[source]

Creates meta-features for the training dataset.

Parameters:
  • df_train (pandas dataframe of shape = (n_samples, n_features)) – The training dataset.
  • y_train (pandas series of shape = (n_samples, )) – The target.
Returns:

The transformed training dataset.

Return type:

pandas dataframe of shape = (n_samples, n_features*int(copy)+n_metafeatures)

predict(df_test)[source]

Predicts class for the test set using the meta-features.

Parameters:df_test (pandas DataFrame of shape = (n_samples_test, n_features)) – The testing samples
Returns:The predicted classes.
Return type:array of shape = (n_samples_test,)
predict_proba(df_test)[source]

Predicts class probabilities for the test set using the meta-features.

Parameters:df_test (pandas DataFrame of shape = (n_samples_test, n_features)) – The testing samples
Returns:The class probabilities of the testing samples.
Return type:array of shape = (n_samples_test, n_classes)
transform(df_test)[source]

Creates meta-features for the test dataset.

Parameters:df_test (pandas dataframe of shape = (n_samples_test, n_features)) – The test dataset.
Returns:The transformed test dataset.
Return type:pandas dataframe of shape = (n_samples_test, n_features*int(copy)+n_metafeatures)
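
Examples

A minimal sketch, assuming df_train and df_test are numerical dataframes and y_train an encoded target:

>>> from mlbox.model.classification import Classifier, StackingClassifier
>>> stck = StackingClassifier(base_estimators=[Classifier(strategy="LightGBM"),
...                                            Classifier(strategy="Linear")],
...                           n_folds=5)
>>> stck.fit(df_train, y_train)
>>> y_pred = stck.predict(df_test)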

Regression

Feature selection

class mlbox.model.regression.Reg_feature_selector(strategy='l1', threshold=0.3)[source]

Selects useful features.

Several strategies are possible (filter and wrapper methods). Works for regression problems only.

Parameters:
  • strategy (str, default = "l1") – The strategy to select features. Available strategies = {“variance”, “l1”, “rf_feature_importance”}
  • threshold (float, default = 0.3) – The percentage of variables to discard according to the strategy. Must be between 0. and 1.
fit(df_train, y_train)[source]

Fits Reg_feature_selector.

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features and no NA
  • y_train (pandas series of shape = (n_train, )) – The target for regression task.
Returns:

self

Return type:

object

fit_transform(df_train, y_train)[source]

Fits Reg_feature_selector and transforms the dataset

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features and no NA
  • y_train (pandas series of shape = (n_train, )) – The target for regression task.
Returns:

The train dataset with relevant features

Return type:

pandas dataframe of shape = (n_train, n_features*(1-threshold))

transform(df)[source]

Transforms the dataset

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features and no NA
Returns:The dataset with relevant features
Return type:pandas dataframe of shape = (n, n_features*(1-threshold))
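
Examples

A minimal sketch, assuming df_train is a numerical train dataframe with no NA and y_train a regression target:

>>> from mlbox.model.regression import Reg_feature_selector
>>> fs = Reg_feature_selector(strategy="variance", threshold=0.3)
>>> df_train = fs.fit_transform(df_train, y_train)  # keeps the 70% most relevant features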

Regression

class mlbox.model.regression.Regressor(**params)[source]

Wraps scikit-learn regressors.

Parameters:
  • strategy (str, default = "LightGBM") – The choice for the regressor. Available strategies = {“LightGBM”, “RandomForest”, “ExtraTrees”, “Tree”, “Bagging”, “AdaBoost” or “Linear”}
  • **params (default = None) – Parameters of the corresponding regressor. Examples: n_estimators, max_depth…
feature_importances()[source]

Computes feature importances.

Regressor must be fitted before.

Returns:Dictionary containing a measure of feature importance (value) for each feature (key).
Return type:dict
fit(df_train, y_train)[source]

Fits Regressor.

Parameters:
  • df_train (pandas dataframe of shape = (n_train, n_features)) – The train dataset with numerical features.
  • y_train (pandas series of shape = (n_train, )) – The target for regression tasks.
Returns:

self

Return type:

object

get_estimator()[source]

Return the regressor.

get_params(deep=True)[source]

Get parameters of Regressor object.

predict(df)[source]

Predicts the target.

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
Returns:The target to be predicted.
Return type:array of shape = (n, )
score(df, y, sample_weight=None)[source]

Return the R^2 coefficient of determination of the prediction.

Parameters:
  • df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
  • y (pandas series of shape = (n,)) – The target for regression tasks.
Returns:

R^2 of self.predict(df) wrt. y.

Return type:

float

set_params(**params)[source]

Set parameters of Regressor object.

transform(df)[source]

Transform dataframe df.

Parameters:df (pandas dataframe of shape = (n, n_features)) – The dataset with numerical features.
Returns:The transformed dataset with its most important features.
Return type:pandas dataframe of shape = (n, n_selected_features)
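
Examples

A minimal sketch, assuming df_train and df_test are numerical dataframes and y_train a regression target:

>>> from mlbox.model.regression import Regressor
>>> reg = Regressor(strategy="LightGBM", n_estimators=500)
>>> reg.fit(df_train, y_train)
>>> y_pred = reg.predict(df_test)
>>> reg.score(df_train, y_train)  # R^2 on the train set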

Stacking

class mlbox.model.regression.StackingRegressor(base_estimators=[Regressor(strategy="LightGBM"), Regressor(strategy="RandomForest"), Regressor(strategy="ExtraTrees")], level_estimator=LinearRegression(), n_folds=5, copy=False, random_state=1, verbose=True)[source]

A Stacking regressor.

A stacking regressor is a regressor that uses the predictions of several first layer estimators (generated with a cross validation method) for a second layer estimator.

Parameters:
  • base_estimators (list, default = [Regressor(strategy="LightGBM"), Regressor(strategy="RandomForest"), Regressor(strategy="ExtraTrees")]) – List of estimators to fit in the first level using a cross validation.
  • level_estimator (object, default = LinearRegression()) – The estimator used in second and last level
  • n_folds (int, default = 5) – Number of folds used to generate the meta features for the training set
  • copy (bool, default = False) – If true, meta features are added to the original dataset
  • random_state (None, int or RandomState. default = 1) – Pseudo-random number generator state used for shuffling. If None, use default numpy RNG for shuffling.
  • verbose (bool, default = True) – Verbose mode.
fit(df_train, y_train)[source]

Fits the first level estimators and the second level estimator on df_train.

Parameters:
  • df_train (pandas DataFrame of shape (n_samples, n_features)) – Input data
  • y_train (pandas series of shape = (n_samples, )) – The target
Returns:

self

Return type:

object

fit_transform(df_train, y_train)[source]

Create meta-features for the training dataset.

Parameters:
  • df_train (pandas DataFrame of shape = (n_samples, n_features)) – The training dataset.
  • y_train (pandas series of shape = (n_samples, )) – The target
Returns:

The transformed training dataset.

Return type:

pandas DataFrame of shape = (n_samples, n_features*int(copy)+n_metafeatures)

get_params(deep=True)[source]

Get parameters of a StackingRegressor object.

predict(df_test)[source]

Predicts the regression target for df_test using the meta-features.

Parameters:df_test (pandas DataFrame of shape = (n_samples_test, n_features)) – The testing samples
Returns:The predicted values.
Return type:array of shape = (n_samples_test, )
set_params(**params)[source]

Set parameters of a StackingRegressor object.

transform(df_test)[source]

Create meta-features for the test dataset.

Parameters:df_test (pandas DataFrame of shape = (n_samples_test, n_features)) – The test dataset.
Returns:The transformed test dataset.
Return type:pandas DataFrame of shape = (n_samples_test, n_features*int(copy)+n_metafeatures)
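
Examples

A minimal sketch, assuming df_train and df_test are numerical dataframes and y_train a regression target:

>>> from mlbox.model.regression import Regressor, StackingRegressor
>>> stck = StackingRegressor(base_estimators=[Regressor(strategy="LightGBM"),
...                                           Regressor(strategy="Linear")],
...                          n_folds=5)
>>> stck.fit(df_train, y_train)
>>> y_pred = stck.predict(df_test)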

Optimisation

class mlbox.optimisation.Optimiser(scoring=None, n_folds=2, random_state=1, to_path='save', verbose=True)[source]

Optimises hyper-parameters of the whole Pipeline.

  • NA encoder (missing values encoder)
  • CA encoder (categorical features encoder)
  • Feature selector (OPTIONAL)
  • Stacking estimator - feature engineer (OPTIONAL)
  • Estimator (classifier or regressor)

Works for both regression and classification (multiclass or binary) tasks.

Parameters:
  • scoring (str, callable or None. default: None) –

    A string or a scorer callable object.

    If None, “neg_log_loss” is used for classification and “neg_mean_squared_error” for regression

    Available scorings can be found in the module sklearn.metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

  • n_folds (int, default = 2) – The number of folds for cross validation (stratified for classification)
  • random_state (int, default = 1) – Pseudo-random number generator state used for shuffling
  • to_path (str, default = "save") – Name of the folder where models are saved
  • verbose (bool, default = True) – Verbose mode
evaluate(params, df)[source]

Evaluates the data.

Evaluates the data with a given scoring function and given hyper-parameters of the whole pipeline. If no parameters are set, the default configuration for each step is evaluated: no feature selection is applied and no meta-features are created.

Parameters:
  • params (dict, default = None.) –

    Hyper-parameters dictionary for the whole pipeline.

    • The keys must respect the following syntax : “enc__param”.
      • ”enc” = “ne” for na encoder
      • ”enc” = “ce” for categorical encoder
      • ”enc” = “fs” for feature selector [OPTIONAL]
      • ”enc” = “stck”+str(i) to add layer n°i of meta-features [OPTIONAL]
      • ”enc” = “est” for the final estimator
      • ”param” : a correct associated parameter for each step. Ex: “max_depth” for “enc”=”est”, …
    • The values are those of the parameters. Ex: 4 for key = “est__max_depth”, …
  • df (dict, default = None) –

    Dataset dictionary. Must contain keys and values:

    • ”train”: pandas DataFrame for the train set.
    • ”target” : encoded pandas Series for the target on train set (with dtype=’float’ for a regression or dtype=’int’ for a classification). Indexes should match the train set.
Returns:

The score. The higher the better. Positive for a score and negative for a loss.

Return type:

float.

Examples

>>> import pandas as pd
>>> from mlbox.optimisation import *
>>> from mlbox.model.regression import Regressor
>>> from sklearn.datasets import load_boston
>>> #load data
>>> dataset = load_boston()
>>> #evaluating the pipeline
>>> opt = Optimiser()
>>> params = {
...     "ne__numerical_strategy" : 0,
...     "ce__strategy" : "label_encoding",
...     "fs__threshold" : 0.1,
...     "stck__base_estimators" : [Regressor(strategy="RandomForest"), Regressor(strategy="ExtraTrees")],
...     "est__strategy" : "Linear"
... }
>>> df = {"train" : pd.DataFrame(dataset.data), "target" : pd.Series(dataset.target)}
>>> opt.evaluate(params, df)
optimise(space, df, max_evals=40)[source]

Optimises the Pipeline.

Optimises hyper-parameters of the whole Pipeline with a given scoring function. Algorithm used to optimize: Tree Parzen Estimator.

IMPORTANT: Try to avoid dependent parameters and set one feature selection strategy and one estimator strategy at a time.

Parameters:
  • space (dict, default = None.) –

    Hyper-parameters space:

    • The keys must respect the following syntax : “enc__param”.
      • ”enc” = “ne” for na encoder
      • ”enc” = “ce” for categorical encoder
      • ”enc” = “fs” for feature selector [OPTIONAL]
      • ”enc” = “stck”+str(i) to add layer n°i of meta-features [OPTIONAL]
      • ”enc” = “est” for the final estimator
      • ”param” : a correct associated parameter for each step. Ex: “max_depth” for “enc”=”est”, …
    • The values must respect the syntax: {“search”:strategy,”space”:list}
      • ”strategy” = “choice” or “uniform”. Default = “choice”
      • list : a list of values to be tested if strategy=”choice”. Else, list = [value_min, value_max].
  • df (dict, default = None) –

    Dataset dictionary. Must contain keys and values:

    • ”train”: pandas DataFrame for the train set.
    • ”target” : encoded pandas Series for the target on train set (with dtype=’float’ for a regression or dtype=’int’ for a classification). Indexes should match the train set.
  • max_evals (int, default = 40) – Number of iterations. A value of around 40 is usually needed to reach accurate optimal hyper-parameters.
Returns:

The optimal hyper-parameter dictionary.

Return type:

dict.

Examples

>>> import pandas as pd
>>> from mlbox.optimisation import *
>>> from sklearn.datasets import load_boston
>>> #loading data
>>> dataset = load_boston()
>>> #optimising the pipeline
>>> opt = Optimiser()
>>> space = {
...     'fs__strategy':{"search":"choice","space":["variance","rf_feature_importance"]},
...     'est__colsample_bytree':{"search":"uniform", "space":[0.3,0.7]}
... }
>>> df = {"train" : pd.DataFrame(dataset.data), "target" : pd.Series(dataset.target)}
>>> best = opt.optimise(space, df, 3)

Prediction

class mlbox.prediction.Predictor(to_path='save', verbose=True)[source]

Fits and predicts the target on the test dataset.

The test dataset must not contain the target values.

Parameters:
  • to_path (str, default = "save") – Name of the folder where feature importances and predictions are saved (.png and .csv formats). Must contain target encoder object (for classification task only).
  • verbose (bool, default = True) – Verbose mode
fit_predict(params, df)[source]

Fits the model and predicts on the test set.

Also outputs feature importances and the submission file (.png and .csv format).

Parameters:
  • params (dict, default = None.) –

    Hyper-parameters dictionary for the whole pipeline.

    • The keys must respect the following syntax : “enc__param”.
      • ”enc” = “ne” for na encoder
      • ”enc” = “ce” for categorical encoder
      • ”enc” = “fs” for feature selector [OPTIONAL]
      • ”enc” = “stck”+str(i) to add layer n°i of meta-features [OPTIONAL]
      • ”enc” = “est” for the final estimator
      • ”param” : a correct associated parameter for each step. Ex: “max_depth” for “enc”=”est”, …
    • The values are those of the parameters. Ex: 4 for key = “est__max_depth”, …
  • df (dict, default = None) –

    Dataset dictionary. Must contain keys and values:

    • ”train”: pandas DataFrame for the train set.
    • ”test” : pandas DataFrame for the test set.
    • ”target” : encoded pandas Series for the target on train set (with dtype=’float’ for a regression or dtype=’int’ for a classification). Indexes should match the train set.
Returns:

self.

Return type:

object
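
Examples

A minimal end-to-end sketch; the file names and target name are hypothetical, and the params dictionary would typically come from Optimiser.optimise:

>>> from mlbox.preprocessing import Reader, Drift_thresholder
>>> from mlbox.optimisation import Optimiser
>>> from mlbox.prediction import Predictor
>>> df = Reader(sep=",").train_test_split(["train.csv", "test.csv"], "target")  # hypothetical paths
>>> df = Drift_thresholder().fit_transform(df)
>>> space = {"est__strategy": {"search": "choice", "space": ["LightGBM", "RandomForest"]}}
>>> best = Optimiser().optimise(space, df, max_evals=10)
>>> Predictor().fit_predict(best, df)  # writes predictions and feature importances to "save"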