Low classification accuracy, what to do next?

16

So, I'm a novice in the field of ML and I'm trying to do some classification. My goal is to predict the outcome of a sporting event. I've gathered some historical data and am now trying to train a classifier. I have around 1200 samples; I split off 0.2 of them for testing purposes and put the rest into a grid search (with cross-validation) over different classifiers. So far I've tried SVM with linear, rbf and polynomial kernels, and random forests. Unfortunately, I can't get an accuracy significantly higher than 0.5 (the same as picking a class at random). Does that mean I can't predict the outcome of such a complex event? Or can I get at least 0.7-0.8 accuracy? If that's feasible, what should I look at next?

  • Get more data? (I can expand the dataset up to 5 times)
  • Try different classifiers? (Logistic regression, kNN, etc.)
  • Re-evaluate my feature set? Are there any ML tools to analyze which features make sense and which don't? Maybe I should reduce my feature set (currently I have 12 features)?
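
For reference, here's a rough sketch of the setup I described above (assuming scikit-learn; the synthetic data just stands in for my real ~1200-sample, 12-feature dataset):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC

    # Synthetic stand-in for the real data: ~1200 samples, 12 features, 2 classes.
    X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

    # Hold out 20% of the samples for final testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Grid search (with cross-validation) over kernels and regularization strength.
    param_grid = {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10, 100]}
    search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    print("best cross-validation accuracy:", search.best_score_)
    print("held-out test accuracy:", search.score(X_test, y_test))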
fspirit
What is your training accuracy? And how many samples do you have in each class?
Leo
1
What sport is this and what do you consider a "correct" classification? If you're simply trying to predict a win/loss outcome in virtually any major sport it's almost inconceivable that even the simplest of classifiers wouldn't predict better than 0.5. If you are, say, trying to predict win/loss against a spread or some other handicapped outcome, then much better than 0.5 may be difficult.
cardinal
@Leo Training accuracy is around 0.5. Classes are evenly distributed, I have classes 0 and 1.
fspirit
@cardinal Yes, I'm trying to predict the win/loss outcome, no handicaps. Is it feasible to reach, say, 0.8 accuracy on the test set?
fspirit
1
@fspirit: That depends on the sport and the inequity in ability between the participants, for one thing. Just knowing who is participating in each contest can often be a strong predictor. Here and here are a couple of related posts.
cardinal

Answers:

17

First of all, if your classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class. A good question to ask yourself in such a situation is whether you or a domain expert could infer the class (with an accuracy greater than a random classifier) based on the given features. If not, then getting more data rows or changing the classifier won't help. What you need to do is get more data using different features.

If, on the other hand, you think the information needed to infer the class is already in the features, you should check whether your classifier suffers from a high-bias or a high-variance problem.

To do this, plot the validation set error and the training set error as a function of the number of training examples.

If the lines seem to converge to the same value and are close together at the end, then your classifier has high bias and adding more data won't help. A good idea in this case is to either change the classifier for one that has higher variance, or simply lower the regularization parameter of your current one.

If, on the other hand, the lines are quite far apart, and you have a low training set error but a high validation error, then your classifier has too high a variance. In this case getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization parameter.
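
As a minimal sketch (assuming scikit-learn and matplotlib; the SVC and the synthetic data below are only placeholders for your own model and dataset), you could draw such a plot like this:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.svm import SVC

    # Placeholder data and model; substitute your own.
    X, y = make_classification(n_samples=1200, n_features=12, random_state=0)
    clf = SVC(kernel="rbf", C=1.0)

    # Training and validation scores for increasing training set sizes.
    sizes, train_scores, val_scores = learning_curve(
        clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8))

    # Plot error (1 - accuracy) as a function of the training set size.
    plt.plot(sizes, 1 - train_scores.mean(axis=1), "o-", label="training error")
    plt.plot(sizes, 1 - val_scores.mean(axis=1), "o-", label="validation error")
    plt.xlabel("number of training examples")
    plt.ylabel("error")
    plt.legend()
    plt.show()

Curves that converge to the same high value suggest high bias; a large, persistent gap between them suggests high variance.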

These are the general rules I would use when faced with a problem like yours.

Cheers.

sjm.majewski
Did you mean validation set error and TEST set error? Otherwise, I'm confused. I don't even know the training set error, because I use the validation set error to choose the model and then check the selected model on the test set.
fspirit
No, I mean training set error where it is written. The training error is the number of misclassified examples in the training set divided by the training set size. Similarly, the test set error is the number of misclassified examples in the test set divided by the test set size. Also, you may want to check Coursera's Machine Learning class (class.coursera.org/ml-2012-002/lecture/index), especially the videos for "Advice for Applying Machine Learning". That advice is quite relevant to your situation.
sjm.majewski
I've completed the course, when it was run for the first time. As for the training set error, I now output it too; for SVM it's quite high, 0.5, but for random forests it's 0.
fspirit
5

I would suggest taking a step back and doing some exploratory data analysis prior to attempting classification. It is worth examining your features on an individual basis to see if there is any relationship with the outcome of interest - it may be that the features you have do not have any association with the class labels. How do you know if the features you have will be of any use?

You could start by doing hypothesis testing or correlation analysis to test for relationships. Generating class-specific histograms for features (i.e. plotting histograms of the data for each class, for a given feature, on the same axes) can also be a good way to show whether a feature discriminates well between the two classes.
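
As a minimal sketch (assuming matplotlib; the synthetic data below only stands in for your real features and 0/1 labels), a class-specific histogram for one feature could look like this:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Placeholder data; substitute your own 12 features and 0/1 labels.
    X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

    feature = 0  # index of the feature to inspect
    plt.hist(X[y == 0, feature], bins=30, alpha=0.5, density=True, label="class 0")
    plt.hist(X[y == 1, feature], bins=30, alpha=0.5, density=True, label="class 1")
    plt.xlabel("feature %d" % feature)
    plt.ylabel("density")
    plt.legend()
    plt.show()

If the two histograms overlap almost completely, that feature on its own discriminates poorly between the classes.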

It is important, though, not to let the results of your exploratory analysis influence your choices for classification. Choosing features for classification based on a prior exploratory analysis of the same data can lead to overfitting and biased performance estimates (see the discussion here), but an exploratory analysis will at least give you an idea of whether the task you are trying to do is even possible.

BGreene
I'll try to draw the histograms and see what they look like.
fspirit
@BGreene - your third paragraph is a tough one for me. If exploratory analysis shows predictor x1 to be highly correlated with the outcome, wouldn't it defeat the purpose of checking that correlation if one didn't use x1 as at least a candidate predictor in a multivariate model?
rolando2
@rolando2 - I'm not suggesting that you don't include the feature as a candidate as part of a feature selection routine, but you should not choose features based on such an exploratory analysis, as this will overfit. However, for the purposes of evaluating the generalized performance of a classifier model, feature selection should be done within the model selection routine (i.e. within each fold of cross-validation). What I am suggesting is that exploratory analysis and classification should be treated as separate activities - each tells you different things about your data.
BGreene
3

It's good that you separated your data into training data and test data.

Did your training error go down when you trained? If not, then you may have a bug in your training algorithm. You expect the error on your test set to be greater than the error on your training set, so if you have an unacceptably high error on your training set there is little hope of success.

Getting rid of features can avoid some types of overfitting. However, it should not improve the error on your training set. A low error on your training set and a high error on your test set might be an indication that you overfit using an overly flexible feature set. However, it is safer to check this through cross-validation than on your test set. Once you select your feature set based on your test set, it is no longer valid as a test set.
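
As a rough illustration (assuming scikit-learn; the synthetic data below is only a placeholder for your own), you can compare the training accuracy with a cross-validated accuracy without touching the test set at all:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data; substitute your own.
    X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)

    train_acc = clf.score(X, y)                       # accuracy on the training data
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # cross-validated accuracy

    print("training accuracy:", train_acc)
    print("cross-validated accuracy:", cv_acc)

A training accuracy near 1.0 combined with a much lower cross-validated accuracy points to overfitting.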

Douglas Zare
I use separate training, validation and test sets. I select hyper-parameters based on the validation set error and then apply the selected model to the test set. I doubt there is a bug in the training algorithm, because I use an off-the-shelf library.
fspirit
You had to connect that library to your data somehow. Always check that you are training correctly. If you are getting a training error rate of 50%, this either means your features are terrible, or else you are not training correctly.
Douglas Zare
In the "features are terrible" possibility, I include the case that there is no solution possible. However, I doubt that very much. There is no sport I know where there aren't ways to see that one competitor is a favorite over another. It is even possible in rock-paper-scissors.
Douglas Zare
1

Why not follow the principle "look at plots of the data first"? One thing you can do is a 2D scatterplot of the two class-conditional densities for two covariates. If you look at these and see practically no separation, that could indicate a lack of predictability, and you can do this with all the covariates. That gives you some idea of the ability to use these covariates to predict. If you see some hope that these variables can separate a little, then start thinking about linear discriminants, quadratic discriminants, kernel discrimination, regularization, tree classification, SVM, etc.
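
A minimal sketch of such a plot for one pair of covariates (assuming matplotlib; the synthetic data is only a placeholder for your own):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Placeholder data; substitute your own covariates and 0/1 class labels.
    X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

    i, j = 0, 1  # indices of the two covariates to plot against each other
    plt.scatter(X[y == 0, i], X[y == 0, j], s=10, alpha=0.5, label="class 0")
    plt.scatter(X[y == 1, i], X[y == 1, j], s=10, alpha=0.5, label="class 1")
    plt.xlabel("covariate %d" % i)
    plt.ylabel("covariate %d" % j)
    plt.legend()
    plt.show()

Little visible separation between the two clouds of points hints at weak predictability from that pair of covariates.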

Michael R. Chernick
Sorry, um, is covariate == feature?
fspirit