The imbalanced dataset problem is an invisible problem for beginners. It seems strange that it happens in intelligent systems, but there are mathematical reasons under the carpet, and of course there are solutions.
Most machine learning algorithms suffer from the imbalanced dataset problem. Although the solution methods are intuitive and very simple to apply, the problem can still cause serious trouble even when you try to solve it. If you have very limited data and an imbalanced dataset, you can only cry and pray for luck.
This post will define the problem, not solve it, for now. If you're not the kind of person who likes to read, I can at least recommend my video about it. However, the post contains more detail.
At the most basic level, we can say that the problem occurs when the classes in your dataset contain unequal numbers of elements. Let's say you're working on a classification problem. When you plot a histogram of your data, you see something like this:
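You don't even need a plot to spot this. As a quick sketch (the labels below are made up for illustration, not from a real dataset), counting the labels reveals the same imbalance the histogram would show:

```python
from collections import Counter

# A hypothetical label array for a binary classification problem:
# an exaggerated 90/10 split between Class 0 and Class 1.
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
print(counts[0], counts[1])  # 90 10
```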
Although it's not a real histogram, you can observe that the classes do not have nearly equal numbers of elements. When you train your algorithm, you observe that your machine learning model mostly (sometimes always) predicts the class that has the highest frequency in the dataset.
This is the point where you realize you're doing something wrong, even though everything seemed fine until you tested your model on the test data.
Before going into the mathematical background of the problem, let's look at an example. Fraud detection is a concern of machine learning (and not only of machine learning). Since fraud is a very rare(!) event, it is hard to detect. There are plenty of fraud examples, but the ratio of fraudulent incidents to legal incidents is very low.
You see the histogram representation of 2% fraud events and 98% legal events. When you train your machine learning model, what you observe is 98% accuracy. Looks good, huh?
The problem is that your machine learning model distinguishes legal events very well; it classifies all legal events correctly. All of them! This is where the 98% accuracy comes from. It predicts legal events very well, but it cannot predict the remaining 2%, which gives you an accuracy of 100% - 2% = 98%. Is that a valid way to detect fraud, then? You cannot predict even one of the fraud events.
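A tiny sketch makes the trap concrete. The numbers below are the hypothetical 98/2 split from the example, and the "model" is simply a constant predictor that always answers "legal":

```python
# Hypothetical fraud dataset: 98 legal events (0) and 2 fraud events (1).
y_true = [0] * 98 + [1] * 2

# A degenerate model that always predicts "legal".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)      # 0.98 -- looks great
print(fraud_caught)  # 0    -- not a single fraud detected
```

This is why accuracy alone is a misleading metric on imbalanced data; metrics such as recall on the minority class expose the failure immediately.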
The problem is not limited to huge differences between classes. Splits like 60%/40%, 60%/20%/20%, or 50%/15%/15%/20% can also lead to the imbalanced dataset problem.
Why does it happen? Let's check a simple loss function to see.
We have something like a mean squared error; we just don't normalize the loss by the number of elements in our data.
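As a sketch of what such a loss could look like (the formula itself is not shown in the post, so this is my reconstruction of "mean squared error without the normalization"), summing the squared errors over all N data points gives:

```latex
L = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```

where \(y_i\) is the true label and \(\hat{y}_i\) the model's prediction. Note there is no \(1/N\) in front: every data point contributes one term, so a class with more points contributes more terms.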
I will try to be as intuitive as I can. Let's say you have 10 data points belonging to Class 0 and 90 data points belonging to Class 1. Assume the model predicts just two classes, 0 and 1; it doesn't output a probability.
The loss function is used to correct the model parameters. It does so by checking whether the classes are predicted correctly. When there are 90 data points for Class 1, your machine learning model corrects itself 90 times for Class 1 and only 10 times for Class 0. That is why it tends to predict the class with the highest frequency.
We will continue investigating; stay tuned. If you're lazy, check out my free course on YouTube.