So, I'm a novice in the ML field and I'm trying to do some classification. My goal is to predict the outcome of a sporting event. I've gathered some historical data and am now trying to train a classifier. I have around 1200 samples; 0.2 of them I split off for testing purposes, and the rest went into a grid search (cross-validation included) over different classifiers, roughly the setup sketched after the list below. So far I have tried SVMs with linear, rbf, and polynomial kernels, as well as random forests. Unfortunately, I cannot get an accuracy significantly better than 0.5 (the same as picking a class at random). Does that mean I simply cannot predict the outcome of such a complex event? Or could I get at least 0.7-0.8 accuracy? If it is feasible, what should I look at next?
- Get more data? (I can enlarge the dataset up to 5 times.)
- Try different classifiers? (Logistic regression, kNN, etc.)
- Reevaluate my feature set? Are there any ML tools to analyze which features make sense and which don't? Maybe I should reduce my feature set (currently I have 12 features)?
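For reference, here is roughly what my setup looks like (a minimal sketch assuming scikit-learn; the data, parameter grids, and exact settings are placeholders, not my real pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Placeholder data: ~1200 samples, 12 features, binary outcome labels
X, y = np.random.rand(1200, 12), np.random.randint(0, 2, 1200)

# Hold out 20% for final testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Grid search with cross-validation over an SVM (scaling matters for SVMs)
svm_grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("svc", SVC())]),
    param_grid={"svc__kernel": ["linear", "rbf", "poly"],
                "svc__C": [0.1, 1, 10],
                "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5, scoring="accuracy")
svm_grid.fit(X_train, y_train)

# Same idea for a random forest
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5, scoring="accuracy")
rf_grid.fit(X_train, y_train)

print("SVM CV accuracy:", svm_grid.best_score_)
print("RF  CV accuracy:", rf_grid.best_score_)
print("SVM test accuracy:", svm_grid.score(X_test, y_test))
print("RF  test accuracy:", rf_grid.score(X_test, y_test))
```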
Answers:
First of all, if your classifier doesn't do better than a random choice, there is a risk that there is simply no connection between the features and the class. A good question to ask yourself in such a situation is whether you or a domain expert could infer the class (with an accuracy greater than that of a random classifier) based on the given features. If not, then getting more data rows or changing the classifier won't help; what you need is to get more data with different features.
If, on the other hand, you think the information needed to infer the class is already present in the features, you should check whether your classifier suffers from a high-bias or a high-variance problem.
To do this, plot the validation error and the training error as functions of the number of training examples.

If the two curves seem to converge to the same value and are close together at the end, then your classifier has high bias and adding more data won't help. A good idea in this case is either to switch to a classifier with higher variance, or simply to lower the regularization parameter of your current one.

If, on the other hand, the curves are quite far apart and you have a low training error but a high validation error, then your classifier has too much variance. In this case getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization parameter.
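A minimal sketch of this diagnostic, assuming scikit-learn is available (the estimator and the data here are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Placeholder data; substitute your own X and y
X, y = np.random.rand(1200, 12), np.random.randint(0, 2, 1200)

# Cross-validated training and validation scores at increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf", C=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

# Convert accuracy scores to errors so the plot matches the description above
plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("number of training examples")
plt.ylabel("error")
plt.legend()
plt.show()
```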
These are the general rules I would use when faced with a problem like yours.
Cheers.
I would suggest taking a step back and doing some exploratory data analysis before attempting classification. It is worth examining your features on an individual basis to see if there is any relationship with the outcome of interest; it may be that the features you have do not have any association with the class labels. How do you know whether the features you have will be of any use?
You could start by doing hypothesis testing or correlation analysis to test for relationships. Generating class-specific histograms for your features (i.e. plotting histograms of the data for each class, for a given feature, on the same axes) can also be a good way to show whether a feature discriminates well between the two classes.
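A minimal sketch of such class-specific histograms, assuming plain matplotlib and the 12-feature setup described in the question (the data here is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: 1200 samples, 12 features, binary labels; substitute your own
X, y = np.random.rand(1200, 12), np.random.randint(0, 2, 1200)

# One subplot per feature, with both class histograms overlaid on the same axes
fig, axes = plt.subplots(3, 4, figsize=(14, 9))
for j, ax in enumerate(axes.ravel()):
    ax.hist(X[y == 0, j], bins=20, alpha=0.5, label="class 0")
    ax.hist(X[y == 1, j], bins=20, alpha=0.5, label="class 1")
    ax.set_title(f"feature {j}")
axes[0, 0].legend()
plt.tight_layout()
plt.show()
```

Features whose two histograms overlap almost completely are unlikely to discriminate between the classes on their own.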
It is important to remember, though, not to let the results of your exploratory analysis dictate your choices for classification. Choosing features for classification based on a prior exploratory analysis of the same data can lead to overfitting and biased performance estimates (see the discussion here), but an exploratory analysis will at least give you an idea of whether the task you are trying to do is even possible.
It's good that you separated your data into the training data and test data.
Did your training error go down when you trained? If not, then you may have a bug in your training algorithm. You expect the error on your test set to be greater than the error on your training set, so if you have an unacceptably high error on your training set there is little hope of success.
Getting rid of features can avoid some types of overfitting. However, it should not improve the error on your training set. A low error on your training set and a high error on your test set might be an indication that you overfit using an overly flexible feature set. However, it is safer to check this through cross-validation than on your test set. Once you select your feature set based on your test set, it is no longer valid as a test set.
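As a sketch of what checking a reduced feature set through cross-validation could look like, assuming scikit-learn (SelectKBest is just one illustrative selector; the estimator and data are placeholders):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

# Placeholder training data; substitute your own X_train, y_train
X_train, y_train = np.random.rand(960, 12), np.random.randint(0, 2, 960)

# Feature selection lives inside the pipeline so it is refit on each CV fold,
# keeping the held-out fold (and the real test set) untouched
for k in (4, 8, 12):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=k)),
        ("svc", SVC(kernel="rbf")),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    print(f"k={k:2d} features: CV accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```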
Why not follow the principle of looking at plots of the data first? One thing you can do is a 2-D scatterplot of the two class-conditional densities for a pair of covariates. If you look at these and see practically no separation, that could indicate a lack of predictability, and you can do this with all the covariates. That gives you some idea of the ability to use these covariates to predict. If you see some hope that these variables can separate the classes a little, then start thinking about linear discriminants, quadratic discriminants, kernel discrimination, regularization, tree classifiers, SVMs, etc.
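A minimal sketch of such a pairwise scatterplot, colored by class, assuming plain matplotlib (the data and the chosen pair of covariates are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; substitute your own feature matrix X and labels y
X, y = np.random.rand(1200, 12), np.random.randint(0, 2, 1200)

# Scatterplot of two chosen covariates, one marker/color per class
i, j = 0, 1  # indices of the pair of covariates to inspect
for label, marker in ((0, "o"), (1, "x")):
    mask = y == label
    plt.scatter(X[mask, i], X[mask, j], marker=marker, alpha=0.5,
                label=f"class {label}")
plt.xlabel(f"feature {i}")
plt.ylabel(f"feature {j}")
plt.legend()
plt.show()
```

Repeating this over different pairs of covariates gives a quick visual check of whether any separation exists at all before reaching for more elaborate classifiers.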