When performing an exploratory data analysis, the results may be biased if outliers are not dealt with at the outset. Such bias can have a variety of negative effects, leading to erroneous business decisions and, eventually, losses for the company.
Hariharan Kolam, CEO and founder of Findem, stated in his lecture that “avoiding bias starts by acknowledging that a data problem exists, both within the data itself as well as in individuals studying or using it.” In other words, bias can be introduced by the person working with the data as well as by the data itself. Even before modelling the data, we need to make sure such biases are addressed so that they do not threaten the end results. These biases may be introduced unconsciously, but they will still be there.
Various Algorithms to Handle Outliers
Below are some of the most widely used techniques for treating outliers; let’s examine each in turn:
Z score test: One of the most widely used techniques for finding outliers is the Z score test. It measures how far an observation deviates from the mean in units of standard deviation. An observation with a z score of 1.5 lies 1.5 standard deviations above the mean, whereas an observation with a z score of -1.5 lies 1.5 standard deviations below the mean. Observations whose absolute z score exceeds a chosen threshold (commonly 3) are treated as outliers, as sketched below.
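A minimal sketch of z-score based outlier detection with NumPy; the sample values and the threshold of 3 are illustrative assumptions, not figures from the article.

```python
import numpy as np

# Synthetic one-dimensional sample; the extreme value 95 plays the outlier.
data = np.array([10, 12, 11, 13, 12, 11, 12, 10, 13, 11,
                 12, 13, 10, 11, 12, 13, 11, 12, 10, 95])

# z score: how many standard deviations each observation lies from the mean
z_scores = (data - data.mean()) / data.std()

# Flag observations whose absolute z score exceeds the chosen threshold
threshold = 3
outliers = data[np.abs(z_scores) > threshold]
print(outliers)  # the extreme value 95 is flagged
```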
Box plot: A box plot illustrates how the data points are distributed by grouping them into quartiles. It marks the dataset’s minimum, maximum, median, and first and third quartiles; the first and third quartiles are also called the lower and upper quartiles. This is a visual technique for spotting anomalies: points that fall far outside the whiskers of the plot (conventionally more than 1.5 times the interquartile range beyond the quartiles) can be treated as outliers, as in the sketch below.
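A minimal sketch of the whisker (IQR) rule that underlies the box plot; the sample data and the conventional 1.5 multiplier are assumptions for illustration.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 12, 10, 13, 11, 95])

# First and third quartiles and the interquartile range (IQR)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional whisker rule: points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # the extreme value 95 falls outside the upper whisker
```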
Isolation Forest: The isolation forest method is a simple yet effective option for detecting outliers. Building on the decision tree method, it isolates observations by repeatedly choosing a random feature and a random split value between that feature’s minimum and maximum; outliers are separated from the rest of the dataset in fewer splits and are therefore easy to isolate.
When the dataset is large and has many features, the isolation forest method is often preferred over other methods because it consumes less memory. A sketch using scikit-learn follows.
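A minimal sketch using scikit-learn’s IsolationForest; the synthetic data and the contamination value are assumptions chosen for illustration, not prescriptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense bulk of the data
X_outliers = rng.uniform(low=-8, high=8, size=(10, 2))     # scattered anomalies
X = np.vstack([X_inliers, X_outliers])

# contamination is the expected fraction of outliers in the dataset
forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X)  # +1 for inliers, -1 for outliers

print("Flagged outliers:", np.sum(labels == -1))
```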
DBSCAN: DBSCAN, short for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm. Like other clustering algorithms, DBSCAN partitions the dataset into groups by examining how closely observations can be grouped with other data points; observations that cannot be assigned to any cluster are labelled as noise and treated as outliers.
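A minimal sketch using scikit-learn’s DBSCAN; the synthetic clusters and the eps and min_samples values are assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
cluster_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
noise = rng.uniform(low=-10, high=15, size=(10, 2))  # random points scattered over a wide range
X = np.vstack([cluster_a, cluster_b, noise])

# eps and min_samples control how dense a neighbourhood must be to form a cluster
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points that could not be assigned to any cluster receive the label -1
outliers = X[db.labels_ == -1]
print("Points labelled as noise:", len(outliers))
```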