The ultimate goal of machine learning is to create a model that generalizes, that is, one that performs well on data it has never seen.
1. Understanding the Problem
Ask questions early and validate your understanding with domain experts, peers, and end-users. If the answers you receive align with your understanding, then you are on the right track.
2. Know Your Data
Knowing what your data means helps you judge which models are likely to be effective and which features to use. The nature of the data largely determines which model will be most successful, and the computational time a model requires affects the project’s cost.
You can improve on, or mimic, human decision-making by creating and using meaningful features. It is crucial to understand the meaning of each field, especially in regulated industries where data may be anonymized and hard to interpret. If you’re unsure what something means, consult a domain expert.
3. Split Your Data
Keeping part of your data hidden from the model during training lets you validate its performance on unseen data. This is essential for choosing the right model architecture and tuning parameters for the best performance.
Splitting your data into multiple parts is necessary for supervised learning. The training data is the data the model uses to learn; it typically consists of 75-80% of the original data, chosen at random. The remaining data is held out to evaluate your model. Depending on what type of model you are creating, you may also need a third set, called the validation set.
In that case you separate the non-training data into validation and test sets: the validation data is used to compare different iterations of a model (and different candidate models) while you tune them, and the test data is reserved for evaluating the final versions.
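A rough sketch of a three-way split with scikit-learn (the 60/20/20 proportions and the synthetic dataset are illustrative assumptions, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; in practice X and y come from your own dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First carve off the final test set, then split the remainder into
# training and validation sets (1000 -> 600 train / 200 val / 200 test).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```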
4. Don’t Leak Test Data
If you normalize your data before splitting, the model will gain information about the test set, since the global minimum and maximum might be in the held-out data.
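A minimal sketch of the leak-free approach with scikit-learn: fit the scaler on the training data only, then apply that same transformation to the test data (MinMaxScaler and the synthetic data are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training data only; fitting on the full dataset would let the
# test set's minimum and maximum leak into training.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```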
5. Use the Right Evaluation Metrics
Every problem is unique, so the evaluation metric must be chosen with that context in mind. Accuracy is the most naive classification metric, and it can be dangerously misleading. Take cancer detection as an example: if only a small fraction of patients actually have cancer, a model that always predicts “healthy” will score very high accuracy while never detecting a single case.
That is clearly not the model we want, since the whole point is to detect the disease. Be equally careful when deciding which evaluation metrics to use for your regression and classification problems.
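To make the point concrete, here is a small hypothetical sketch (the 1%-positive class ratio and the always-“healthy” predictions are made up for illustration): accuracy looks excellent while recall exposes that the model catches nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical screening results: 1 = cancer, 0 = healthy, only 1% positives.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)  # a "model" that always predicts healthy

print(accuracy_score(y_true, y_pred))                    # 0.99 -- looks great
print(recall_score(y_true, y_pred))                      # 0.0  -- misses every case
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
```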
6. Keep It Simple
It is important to select the best solution for your problem and not the most complex. Management, customers, and even you might want to use the “latest-and-greatest.” You need to use the simplest model that meets your needs, a principle called Occam’s Razor.
This will not only make the model easier to interpret and reduce training time, but it can also improve performance. Don’t try to kill Godzilla with a flyswatter, and don’t shoot a fly with a bazooka.
7. Do Not Overfit or Underfit Your Model
Bias, also known as underfitting, is when the model captures too little detail to accurately represent the problem. Variance, also known as overfitting, is when the model captures so much detail that it fits the noise in the training data and fails to generalize. The tension between the two is referred to as the “bias-variance trade-off”, and each problem requires a different balance.
Let’s use a simple image classification tool as an example. It is responsible for identifying whether a dog is present in an image.
If the model underfits, it never really learns what makes a dog a dog and will fail to recognize one even in images much like those it trained on. If the model overfits, it effectively memorizes the training images and may fail to recognize a dog in an image it has not seen before.
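One way to watch the trade-off happen is to vary a model’s capacity and compare training and test scores; this sketch uses a decision tree on synthetic data, with the depths chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for depth in (1, 5, None):  # very shallow, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 3), round(tree.score(X_test, y_test), 3))

# A depth of 1 tends to underfit (low score on both sets); an unrestricted
# tree tends to overfit (near-perfect training score, noticeably lower test score).
```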
8. Try Different Model Architectures
You can mix simple and complex algorithms. If you are creating a classification model, for example, try something as simple as a random forest alongside something as complex as a neural network.
Interestingly, extreme gradient boosting is often superior to a neural network classifier, and simple problems are often solved best by simple models.
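A sketch of what trying several architectures side by side might look like in scikit-learn (the three candidate models and the synthetic data are illustrative assumptions, and GradientBoostingClassifier stands in here for extreme gradient boosting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=7),
    "gradient boosting": GradientBoostingClassifier(random_state=7),
}
# Score every candidate with the same cross-validation so they are comparable.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```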
9. Tune Your Hyperparameters
Hyperparameters are values that control how the model learns; they are set before training rather than learned from the data. One example of a hyperparameter in a decision tree is its depth.
This is how many questions the tree can ask before it decides on an answer. A model’s default hyperparameters are chosen to give good performance on average, but they are rarely the best choice for your specific problem, so tuning them is usually worth the effort.
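A minimal tuning sketch using scikit-learn’s GridSearchCV to search over the tree depth from the example above (the candidate depths and the synthetic data are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)

# Try every candidate depth with 5-fold cross-validation and keep the best.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=3),
    param_grid={"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```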
10. Compare Models Correctly
You will need a different holdout set than the one you used to tune your hyperparameters. You will also need to use appropriate statistical tests to evaluate the results.
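One common, if imperfect, sketch of such a comparison: score both models on the same cross-validation folds and run a paired t-test on the fold scores (the two models, the synthetic data, and the fold count are illustrative; fold scores are not fully independent, so treat the p-value as a rough guide rather than a verdict):

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
cv = KFold(n_splits=10, shuffle=True, random_state=5)  # identical folds for both models

scores_a = cross_val_score(RandomForestClassifier(random_state=5), X, y, cv=cv)
scores_b = cross_val_score(GradientBoostingClassifier(random_state=5), X, y, cv=cv)

# Paired t-test over the per-fold scores: a small p-value suggests the
# difference is unlikely to come from the particular folds alone.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(round(scores_a.mean(), 3), round(scores_b.mean(), 3), round(p_value, 3))
```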