Pandas categorical variables encoding for regression (one-hot encoding vs dummy encoding)

2017-03-20 23:29:48

Pandas has a function called get_dummies() that creates indicator columns for a categorical variable. Scikit-learn also has a OneHotEncoder that needs to be used along with a LabelEncoder. What are the pros and cons of each? Also, both yield one-hot encoding (k indicator variables for k levels of a categorical variable) and not dummy encoding (k-1 indicator variables); how can one get rid of the extra category? How much of a problem does the full k-column encoding create in regression models (collinearity issues, a.k.a. the dummy variable trap)?

  • One advantage of get_dummies is that it can operate on values other than integers (so you don't need the LabelEncoder) and returns a DataFrame with the categories as column names. Also, you can conveniently drop one redundant category using drop_first=True.

    One advantage of scikit-learn's OneHotEncoder lies in the scikit-learn API. OneHotEncoder gives you a transformer which you can fit once and then apply to your training and test sets separately, provided you specify the total number of categories. This doesn't work with get_dummies if, for example, the training set is missing categories that are present in the test set.
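A minimal sketch of the difference described above, under some assumptions: the `color` column and its values are made up for illustration, and the `categories` argument requires scikit-learn 0.20 or later (where OneHotEncoder also accepts strings directly, so no LabelEncoder is needed).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "blue" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

# get_dummies encodes each frame on its own, so the columns do not line up.
train_dummies = pd.get_dummies(train, columns=["color"])
test_dummies = pd.get_dummies(test, columns=["color"])
print(list(train_dummies.columns))  # ['color_green', 'color_red']
print(list(test_dummies.columns))   # ['color_blue', 'color_green']

# drop_first=True removes the first level, leaving k-1 columns.
train_k_minus_1 = pd.get_dummies(train, columns=["color"], drop_first=True)
print(list(train_k_minus_1.columns))  # ['color_red']

# OneHotEncoder with the full list of categories stated up front
# produces the same three columns for both sets.
encoder = OneHotEncoder(categories=[["blue", "green", "red"]])
encoder.fit(train[["color"]])
train_encoded = encoder.transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()
print(train_encoded.shape, test_encoded.shape)  # (3, 3) (2, 3)
```

If the full set of categories is not known in advance, `handle_unknown="ignore"` is another option: unseen test categories are then encoded as all zeros instead of raising an error.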

    You can still drop a redundant category by simply deleting the corresponding column from the resulting NumPy array (e.g. using n_values_ or feature_indices_ to see which columns correspond to the same feature). Some models work regardless of the redundant column, for example tree-based models. Also, L1 regularization can often set the coefficients of redundant features to zero (see Lasso regression).
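To illustrate why the extra column matters for a regression with an intercept, here is a small sketch (the data is hypothetical) that checks the rank of the design matrix before and after deleting one encoded column from the NumPy array. It assumes scikit-learn 0.20 or later so that string categories can be encoded directly.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature with k = 3 levels.
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"]])
encoded = OneHotEncoder().fit_transform(colors).toarray()  # shape (5, 3)

# With an intercept, the k indicator columns sum to the intercept column,
# so the design matrix has 4 columns but only rank 3.
intercept = np.ones((encoded.shape[0], 1))
full_design = np.hstack([intercept, encoded])
print(full_design.shape[1], np.linalg.matrix_rank(full_design))  # 4 3

# Deleting one indicator column (here the first) restores full column rank.
reduced = np.delete(encoded, 0, axis=1)
reduced_design = np.hstack([intercept, reduced])
print(reduced_design.shape[1], np.linalg.matrix_rank(reduced_design))  # 3 3
```

In scikit-learn 0.21 and later, passing `drop="first"` to OneHotEncoder should achieve the same result without manual column deletion.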

    2017-03-20 23:43:29