Automatically Build Variant Interpretable ML models fast! Auto_ViML is pronounced "auto vimal" (autovimal logo created by Sanket Ghanmare)
NEW FEATURES in this version are:
1. SMOTE: we now use SMOTE for imbalanced data. Just set Imbalanced_Flag = True in the input below
2. Auto_NLP: It automatically detects Text variables and does NLP processing on those columns
3. Date Time Variables: It automatically detects date time variables and adds extra features
4. Feature Engineering: you can now perform feature engineering using the featuretools library.
To upgrade to the best, most stable and full-featured version (anything above 0.1.600), do one of the following:
```
pip install autoviml --upgrade --ignore-installed
```
or
```
pip install git+https://github.com/AutoViML/Auto_ViML.git
```
Read this Medium article to learn how to use Auto_ViML.
Auto_ViML was designed for building High Performance Interpretable Models with the fewest variables.
The "V" in Auto_ViML stands for Variable because it tries multiple models with multiple features to find the best-performing model for your dataset. The "i" in Auto_ViML stands for "interpretable" since Auto_ViML selects the fewest features necessary to build a simpler, more interpretable model. In most cases, Auto_ViML builds models with 20-99% fewer features than a similarly performing model that uses all available features (this is based on my trials; your experience may vary).
Auto_ViML is every Data Scientist's model assistant. To try the built-in feature engineering module:

```python
from autoviml.feature_engineering import feature_engineering

print(df[preds].shape)
dfmod = feature_engineering(df[preds], ['add'], 'ID')
print(dfmod.shape)
```
To clone Auto_ViML, it is better to create a new environment, and install the required dependencies:
To install from PyPi:
```
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name>   # ON WINDOWS: `source activate <your_env_name>`
pip install autoviml
```
or
```
pip install git+https://github.com/AutoViML/Auto_ViML.git
```
To install from source:
```
cd <AutoVIML_Destination>
git clone git@github.com:AutoViML/Auto_ViML.git
# or download and unzip https://github.com/AutoViML/Auto_ViML/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name>   # ON WINDOWS: `source activate <your_env_name>`
cd Auto_ViML
pip install -r requirements.txt
```
In the same directory, open a Jupyter Notebook and use this line to import the .py file:
from autoviml.Auto_ViML import Auto_ViML
Load a data set (any CSV or text file) into a pandas dataframe and split it into train and test dataframes. If you don't have a test dataframe, simply assign the test variable below to '' (an empty string):
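The load-and-split step can be sketched as follows. The dataframe, column names, and split ratio here are placeholder assumptions, not part of Auto_ViML:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in dataset; in practice you would load your own file,
# e.g. df = pd.read_csv("yourfile.csv", sep=",")
df = pd.DataFrame({"x1": range(100), "x2": range(100, 200),
                   "label": [0, 1] * 50})

# Hold out 20% of the rows as the test dataframe; use "" if you have none
train, test = train_test_split(df, test_size=0.2, random_state=42)
target = "label"           # name of the target column
sample_submission = ""     # no sample submission file in this sketch
```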
```python
model, features, trainm, testm = Auto_ViML(
    train,
    target,
    test,
    sample_submission,
    hyper_param="GS",
    feature_reduction=True,
    scoring_parameter="weighted-f1",
    KMeans_Featurizer=False,
    Boosting_Flag=False,
    Binning_Flag=False,
    Add_Poly=False,
    Stacking_Flag=False,
    Imbalanced_Flag=False,
    verbose=0,
)
```
Finally, it writes a submission file to disk in the current directory. This submission file is ready for you to show to clients or submit to competitions. Even if you don't provide a submission file, as long as you give it a test file, it will create a submission file for you.
Auto_ViML works on any Multi-Class, Multi-Label Data Set. So you can have many target labels.
You don't have to tell Auto_ViML whether it is a Regression or Classification problem.
train: could be a datapath+filename or a dataframe. It will detect which is which and load it.
test: could be a datapath+filename or a dataframe. If you don't have any, just leave it as "".
submission: must be a datapath+filename. If you don't have any, just leave it as empty string.
target: name of the target variable in the data set.
sep: if you have a separator in the file such as "," or "\t", mention it here. Default is ",".
scoring_parameter: if you want your own scoring parameter such as "f1", provide it here. If not, Auto_ViML will assume the appropriate scoring parameter for the problem and build the model.
hyper_param: Tuning options are GridSearch ('GS') and RandomizedSearch ('RS'). Default is 'RS'.
feature_reduction: Default is True, but it can be set to False if you don't want automatic feature reduction; in image data sets like digits and MNIST, you get better results when you don't reduce features automatically. You can always try both and see.
KMeans_Featurizer: has two settings:
True: adds a cluster label to the features based on KMeans clustering. Use for Linear models.
False (default): for Random Forest or XGBoost models, leave it False since it may overfit.
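A rough sketch of what a KMeans cluster-label feature looks like, built with scikit-learn directly on a synthetic matrix (Auto_ViML's internal implementation may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # stand-in numeric feature matrix

# Fit KMeans on the training features and append each row's cluster id
# as one extra column -- the kind of label this flag would add
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_aug = np.column_stack([X, km.labels_])
```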
Boosting_Flag: you have four possible choices (default is False):
None: builds a Linear Model
False: builds a Random Forest or Extra Trees model (also known as Bagging)
True: builds an XGBoost model
CatBoost: builds a CatBoost model (provided you have CatBoost installed)
Add_Poly: Default is 0, which means do nothing. But it has three interesting settings:
1: adds interaction variables only, such as x1*x2, x2*x3, ..., x9*x10, etc.
2: adds squared variables only, such as x1**2, x2**2, etc.
3: adds both interaction and squared variables, such as x1*x2, x1**2, x2*x3, x2**2, etc.
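These settings behave much like scikit-learn's PolynomialFeatures; the sketch below illustrates the idea on a toy matrix and is not Auto_ViML's internal code:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])            # two features: x1, x2

# Like setting 1: interactions only -> columns x1, x2, x1*x2
inter = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

# Like setting 3: interactions plus squares -> x1, x2, x1**2, x1*x2, x2**2
full = PolynomialFeatures(degree=2,
                          include_bias=False).fit_transform(X)
```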
Stacking_Flag: Default is False. If set to True, it will add an additional feature which is derived from predictions of another model. This is used in some cases but may result in overfitting. So be careful turning this flag "on".
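A common way such a stacked feature is built is from out-of-fold predictions of a first-level model; this is a sketch of the general technique on synthetic data, not Auto_ViML's exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Out-of-fold predictions from a first-level model become one extra
# column for the next model, so no row ever sees its own prediction
oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                        cv=5, method="predict_proba")[:, 1]
X_stacked = np.column_stack([X, oof])
```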
Binning_Flag: Default is False. If set to True, it will convert the top numeric variables into binned variables through a technique known as "Entropy" binning. This is very helpful for certain datasets (especially those where models are hard to build).
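Entropy binning is often implemented by letting a shallow decision tree choose the cut points, since tree splits minimize entropy; this is a sketch of that general idea on toy data, not necessarily Auto_ViML's exact method:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = (x > 0.3).astype(int)             # toy target related to x

# A shallow tree trained on a single variable picks entropy-minimizing
# split points; its thresholds then serve as bin edges
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=4,
                              random_state=0).fit(x.reshape(-1, 1), y)
edges = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaves
binned = np.digitize(x, edges)
```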
Imbalanced_Flag: Default is False. If set to True, it will use SMOTE from Imbalanced-Learn to oversample the "Rare Class" in an imbalanced dataset and balance the classes (for example, 50-50 in a binary classification). This also works for regression problems where the target variable has a highly skewed distribution.
verbose: This has 3 possible states:
0: limited output. Great for running silently and getting fast results.
1: more charts. Great for understanding the results and adjusting the input flags.
2: lots of charts and output. Great for reproducing what Auto_ViML does on your own.
model: It will return your trained model
features: the minimal list of features your model needs to perform well
train_modified: this is the modified train dataframe after removing and adding features
test_modified: this is the modified test dataframe with the same transformations as train
Apache License 2.0 © 2020 Ram Seshadri
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.