What are outliers in machine learning?


In machine learning, outliers are data points that differ significantly from the rest of the dataset. They can be caused by inconsistent data entry, erroneous observations, or measurement or execution errors. Outliers can skew the data distribution and lead to less effective and less useful models, so it's important to detect them, and remove them where appropriate, during the data cleaning and preprocessing step.

There are various statistical techniques that are widely used for outlier detection and removal, including:

  • Standard Deviation: When the data, or certain features in the dataset, follow a normal distribution, the standard deviation or the equivalent z-score can be used to detect outliers. A common rule of thumb flags values that lie more than three standard deviations from the mean.

  • Clustering-based Outlier Detection: In the K-Means clustering technique, each cluster has a mean value, and each object belongs to the cluster whose mean is closest to it. To decide whether a test point is an outlier, its distance to each cluster mean is calculated. If the distance to the closest cluster mean is greater than a chosen threshold, the test point is classified as an outlier.

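  • Box Plots: A box plot summarizes a single variable by its quartiles. Values that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1 is the interquartile range, are drawn as individual points and are commonly treated as outliers.

  • Distance-based z-scores: This method considers multiple variables in a dataset to detect outliers. It calculates the Euclidean distance of the data points from their mean and converts the distances into absolute z-scores. Any z-score greater than the pre-specified cut-off is considered to be an outlier.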
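As a rough illustration of the single-variable checks, the sketch below applies a z-score cut-off and the box-plot rule as it is usually stated (values more than 1.5 × IQR beyond the quartiles). The function names, the default cut-offs, and the sample readings are assumptions made up for this example, and the quartiles use the "inclusive" method from Python's statistics module; other quartile conventions can give slightly different fences.

```python
import statistics

def zscore_outliers(values, cutoff=3.0):
    """Return values whose absolute z-score exceeds `cutoff`."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be flagged
    return [v for v in values if abs(v - mean) / stdev > cutoff]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot whisker rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample: eleven ordinary readings and one suspicious one.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print(zscore_outliers(readings))
print(iqr_outliers(readings))
```

Both checks flag only the value 95 in this sample. Note that an extreme value also inflates the mean and standard deviation it is measured against, so the z-score check can miss outliers in small samples; the quartile-based rule is less affected by this.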

There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not. Even with a thorough understanding of the data, outliers can be hard to define.
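Because the cut-off is a judgment call, the clustering-based check described above is easiest to reason about with the threshold as an explicit parameter. The following is a minimal sketch, not a full K-Means implementation: it assumes the cluster means have already been computed elsewhere, and the centroid coordinates, the threshold of 3.0, and the function names are hypothetical values chosen for illustration.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from `point` to the closest cluster mean."""
    return min(math.dist(point, c) for c in centroids)

def is_outlier(point, centroids, threshold):
    """Flag `point` if even its closest cluster mean is farther than `threshold`."""
    return nearest_centroid_distance(point, centroids) > threshold

# Hypothetical cluster means, as if produced by an earlier K-Means fit.
centroids = [(0.0, 0.0), (10.0, 10.0)]
threshold = 3.0  # assumed cut-off; in practice chosen per dataset

print(is_outlier((1.0, 1.0), centroids, threshold))  # close to the first mean
print(is_outlier((5.0, 5.0), centroids, threshold))  # far from both means
```

The first point is about 1.41 from its nearest mean and is not flagged; the second is about 7.07 from either mean and is. Changing the threshold changes which points count as outliers, which is where domain knowledge comes in.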