Why does PCA feature reduction make accuracy worse?

2017-09-18 01:53:52

I'm trying to estimate how much feature reduction with PCA can improve classification accuracy for different ML methods. I'm using the digits dataset available in scikit-learn. I first measure accuracy with all 64 available features, then use PCA to reduce them to 63 features, and the accuracy drops drastically:

ANN:

featureNum | accuracy
64 | 0.966 +- 0.008
63 | 0.132 +- 0.0116619037897

SVM:

featureNum | accuracy
64 | 0.96 +- 0.0
63 | 0.54 +- 0.0

RandomForest:

featureNum | accuracy
64 | 0.974 +- 0.008
63 | 0.12 +- 0.022803508502

DecisionTree:

featureNum | accuracy
64 | 0.802 +- 0.0172046505341
63 | 0.11 +- 0.0126491106407

All calculations were repeated 5 times to get statistics. Before using PCA (64 features), the scores were quite good in all cases. Afterwards, for every tested method apart from SVM, accuracy was practically random (there are 10 classes). I would understand if accuracy dropped a little because we lose some

  • I've created a notebook that almost replicates your drop in accuracy.

    I think the most likely error is refitting PCA. If you fit PCA on the train set, fit the classifier on the resulting components, and then run it on principal components computed separately from the test set, you are feeding the classifier an incorrect parameter space: it learned the train-set principal components as its coordinates, but the test-set PCs are a different set of axes. The PCA fitted on the train set should be reused to transform the test set.

    2017-09-18 02:02:50
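To illustrate the hypothesis in the answer above, here is a minimal sketch that compares reusing the train-fitted PCA on the test set with fitting a second PCA on the test set. It is not the asker's code or the answerer's notebook: the split ratio, the random seeds, and the choice of `RandomForestClassifier` are assumptions made only for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Digits dataset: 64 pixel features, 10 classes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit PCA on the train set only and train the classifier on its components.
pca = PCA(n_components=63).fit(X_train)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(pca.transform(X_train), y_train)

# Correct: project the test set with the SAME fitted PCA.
acc_shared = clf.score(pca.transform(X_test), y_test)

# Suspected mistake: fit a second PCA on the test set, so the
# classifier sees coordinates along different axes than it was trained on.
pca_test = PCA(n_components=63).fit(X_test)
acc_refit = clf.score(pca_test.transform(X_test), y_test)

print(f"shared PCA:  {acc_shared:.3f}")
print(f"refit PCA:   {acc_refit:.3f}")
```

Wrapping both steps in a scikit-learn `Pipeline` avoids this class of mistake, because the pipeline fits the PCA once during `fit` and only calls `transform` at prediction time.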
  • I think the hypothesis you are trying to verify is wrong.

    In general, applying PCA before building a model will NOT help the model perform better (in terms of accuracy)!

    This is because PCA does not take the response variable / prediction target into account. PCA treats features with large variance as important, but a feature with large variance can have nothing to do with the prediction target.

    This means that PCA can produce a lot of useless features and eliminate useful ones.

    Please see my answer to the following question for details and a demo:

    How to decide between PCA and logistic regression?

    2017-09-18 02:03:29
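The point in the second answer, that variance and usefulness for prediction are unrelated, can be sketched on synthetic data. This is not the demo from the linked answer: the two features, their scales, and the use of `LogisticRegression` are assumptions chosen only to make the effect visible. One feature has large variance and no relation to the label, the other has small variance and fully determines it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 1000

# High-variance feature that carries no information about the label.
noise = rng.normal(0.0, 10.0, size=n)
# Low-variance feature that fully determines the label.
signal = rng.normal(0.0, 1.0, size=n)

X = np.column_stack([noise, signal])
y = (signal > 0).astype(int)

# Baseline: classifier on both raw features.
acc_full = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# PCA keeps the direction of largest variance, which is almost
# exactly the noise axis, so the informative feature is discarded.
pca_clf = make_pipeline(PCA(n_components=1), LogisticRegression())
acc_pc1 = cross_val_score(pca_clf, X, y, cv=5).mean()

print(f"both features:        {acc_full:.3f}")
print(f"first PC only:        {acc_pc1:.3f}")
```

In this constructed case the first principal component is close to useless for classification even though it explains nearly all of the variance.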