Machine Learning: Notes and Advice

Am I overfitting/underfitting?
  • Training loss is high and stays close to validation loss: the model can't fit even the training data. That is underfitting.
  • Training loss much less than validation loss: the model memorizes the training data but doesn't generalize. That is overfitting. (A quick check for both is sketched after this list.)
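A minimal sketch of that check, assuming scikit-learn; the dataset, model, and threshold values are illustrative assumptions, not fixed rules:

    # Minimal sketch (assumes scikit-learn): compare train vs. validation
    # loss to diagnose fit. Thresholds below are rough illustrative picks.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    train_loss = log_loss(y_tr, model.predict_proba(X_tr))
    val_loss = log_loss(y_val, model.predict_proba(X_val))

    if train_loss < 0.5 * val_loss:   # train << val: memorizing
        print("likely overfitting")
    elif train_loss > 0.69:           # ~log(2), the loss of coin-flipping
        print("likely underfitting")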
Bagging vs boosting
  • bagging = independent models trained in parallel on bootstrap samples, then averaged (e.g. random forest)
  • boosting = models trained sequentially, each one correcting the errors of the ensemble so far (e.g. AdaBoost, gradient boosting; the Haar cascade detector is trained with AdaBoost). See the sketch below.
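A side-by-side sketch, assuming scikit-learn; the synthetic data and estimator counts are illustrative:

    # Sketch (assumes scikit-learn): bagging averages independent trees fit
    # on bootstrap samples; boosting fits trees one after another, each
    # reweighting the examples the previous ones got wrong.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)
    models = {
        "bagging (random forest)": RandomForestClassifier(n_estimators=100,
                                                          random_state=0),
        "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100,
                                                  random_state=0),
    }
    for name, clf in models.items():
        print(name, cross_val_score(clf, X, y, cv=5).mean())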
Good ideas to improve a model
  • Cluster the unlabeled training data (e.g. with k-means), then add each point's cluster assignment or distances to the cluster centers as extra, essentially free features (first sketch below)
  • If you have multiple labels per data point and the labelers disagree a lot on one, that sample is probably noisy. Sort the data by disagreement and drop the noisiest samples first; this usually gives a small boost in model performance. Even smarter is to keep them but penalize the model less when it misclassifies a noisier data point, i.e. weight the loss by labeler agreement (second sketch below).
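A sketch of the cluster-features idea, assuming scikit-learn; the helper name and k=8 are illustrative assumptions:

    # Sketch (assumes scikit-learn): fit k-means on the training features,
    # then append each point's distances to the cluster centers as new
    # columns. k=8 and the helper name are arbitrary illustrative choices.
    import numpy as np
    from sklearn.cluster import KMeans

    def add_cluster_features(X_train, X_test, k=8):
        km = KMeans(n_clusters=k, random_state=0).fit(X_train)
        # KMeans.transform() returns distance to each center: k new columns
        return (np.hstack([X_train, km.transform(X_train)]),
                np.hstack([X_test, km.transform(X_test)]))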
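And a sketch of the label-noise idea. It assumes binary labels stored as a `labels` array of shape (n_samples, n_labelers); the drop fraction and model choice are also illustrative:

    # Sketch: use labeler agreement as a per-sample quality score. Drop the
    # noisiest samples, then pass agreement as a loss weight so the model
    # is penalized less for misclassifying noisy points. Assumes binary
    # labels with shape (n_samples, n_labelers); drop_frac is arbitrary.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def agreement(labels):
        # fraction of labelers who voted for the majority label, per sample
        majority = (labels.mean(axis=1) >= 0.5).astype(int)
        return (labels == majority[:, None]).mean(axis=1)

    def fit_noise_aware(X, labels, drop_frac=0.10):
        agree = agreement(labels)
        keep = agree >= np.quantile(agree, drop_frac)  # drop noisiest 10%
        y = (labels.mean(axis=1) >= 0.5).astype(int)   # majority-vote label
        return LogisticRegression().fit(X[keep], y[keep],
                                        sample_weight=agree[keep])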
Precision vs Recall (w/ statistics translation)
  • precision = TP / (TP + FP): of everything predicted positive, the fraction that really is positive. Statistics translation: positive predictive value, i.e. 1 − false discovery rate.
  • recall = TP / (TP + FN): of everything actually positive, the fraction found. Statistics translation: sensitivity / true positive rate / statistical power.
Type I vs Type II errors
  • Type I error = false positive: rejecting a true null hypothesis (flagging a negative as positive).
  • Type II error = false negative: failing to reject a false null hypothesis (missing a true positive). Recall = 1 − Type II error rate, so low recall means many Type II errors. A worked example follows.
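A worked sketch of those definitions, assuming scikit-learn; the toy labels are made up for illustration:

    # Sketch (assumes scikit-learn): compute precision/recall and count
    # Type I / Type II errors from a toy confusion matrix.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("precision = TP/(TP+FP) =", tp / (tp + fp))  # 0.75
    print("recall    = TP/(TP+FN) =", tp / (tp + fn))  # 0.75 (sensitivity)
    print("Type I errors  (false positives):", fp)     # 1
    print("Type II errors (false negatives):", fn)     # 1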