Imagine you look at the thermometer in your apartment in the summer and it shows -20 degrees. Does that make sense? Maybe the thermometer is damaged, or maybe a reflection of the sun just makes the display look as if it shows a minus sign. Either way, you will not rely on the measurement. Reading errors, sensor errors and other values that do not reflect reality are called outliers. They occur not only on a thermometer in the summer, but also in manufacturing processes. This post is therefore the first in a series about outliers in manufacturing.
Let us first look at the abstract question: “What is an outlier?”
Outliers are anomalous observations in a dataset (e.g. the -20°C in the summer, when we would not usually expect negative temperatures). An observation is often classified as an outlier if it lies at an abnormally large distance from the majority of the observations in the dataset. The distance limit used to classify outliers varies between applications, as it depends on the transparency of the data acquisition process and on expert knowledge in the particular domain. Let us assume it’s December and we are located in Trondheim, Norway. -20°C does not sound so unrealistic anymore, right? Setting this context is quite hard, but a common general definition treats an observation as an outlier if it falls outside the range between a lower bound (i.e. the first quartile minus 1.5 times the interquartile range) and an upper bound (i.e. the third quartile plus 1.5 times the interquartile range) of the dataset (Chaudhary, 2019). The sources of outliers in datasets are often errors in measurement (e.g. data captured by a faulty sensor), experimentation, data processing or sampling. The presence of outliers compromises the integrity of the dataset in several ways. One of their most critical effects is to mask or misrepresent the information in the dataset (Osborne and Overbay, 2004). They can distort the results of any algorithm or analytics system that is built on the dataset and can bias the findings of the analysis.
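To make the quartile-based bounds concrete, here is a minimal sketch in Python. The temperature readings are made up for illustration, the function names are our own, and the quartiles are computed with the "inclusive" method of the standard library, which is only one of several common quartile conventions. Other conventions can give slightly different bounds.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the observations that fall outside the two bounds."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical summer readings in °C, including one faulty value.
summer_temps = [21.5, 22.0, 23.1, 22.4, 24.0, 23.5, -20.0, 22.8, 21.9, 23.0]

lower, upper = iqr_fences(summer_temps)
print(f"bounds: {lower:.1f} to {upper:.1f}")
print(find_outliers(summer_temps))  # [-20.0]
```

The factor 1.5 is exposed as the parameter k, because the right limit depends on the context: the same check applied to December readings from Trondheim would be based on very different quartiles and would not flag -20°C in the same way.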
Showing -20°C is not that bad, is it? How does it affect machine learning models?
The key aim of machine learning models in industry is to correctly predict or classify processes. In any quantitative field, including machine learning, data quality (i.e. the range of the values and their distribution) is as important as the quality of the outcome. A machine learning model is only as good as the dataset it is trained on. Let us assume a model is trained to predict the temperature in the afternoon based on the temperature in the morning. If the training data contains our -20°C in the morning together with a very high temperature in the afternoon, the model will give questionable predictions in the winter. The presence of outliers in data can therefore corrupt the learning process of the model, resulting in poor accuracy.
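The morning-to-afternoon example can be sketched with a very simple model. The snippet below assumes a straight-line model fitted by ordinary least squares, and all temperature values are invented for illustration. It is not meant to represent a production model, only to show how a single faulty reading can change what the model learns.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, morning_temp):
    """Predict the afternoon temperature from the morning temperature."""
    slope, intercept = model
    return slope * morning_temp + intercept


# Hypothetical summer data in °C: afternoons are 8 degrees warmer than mornings.
morning = [14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
afternoon = [22.0, 23.0, 24.0, 25.0, 26.0, 27.0]

clean_model = fit_line(morning, afternoon)

# The same data plus one faulty pair: -20 in the morning, 30 in the afternoon.
corrupted_model = fit_line(morning + [-20.0], afternoon + [30.0])

winter_morning = -5.0
print(f"clean model:     {predict(clean_model, winter_morning):.1f} °C")
print(f"corrupted model: {predict(corrupted_model, winter_morning):.1f} °C")
```

With the clean data the fitted slope is positive, so a cold morning leads to a cold afternoon prediction. The single faulty pair is far away from all other points and pulls the line so strongly that the slope turns negative, which is why the corrupted model predicts a warm afternoon after a winter morning.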
Okay, outliers are bad. But what should we do with them? How can data preparation help?
Data preparation is the first step before drawing insights from a dataset in many quantitative fields (e.g. machine learning and economics). It is considered a crucial step, as it removes bias from the measurements and thus improves the outcome of the data analysis. It not only reveals the structure of the dataset but also indicates the presence of outliers, which are then removed or treated. Visualising data points in a graph remains the easiest method to identify outliers. In fact, histograms, scatterplots and boxplots (the latter drawing fences based on the interquartile range) can effectively flag the presence of outliers in the dataset.
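As a rough illustration of "removed or treated", the sketch below applies the same interquartile fences that a boxplot draws and then handles the flagged values in two ways: dropping them, or clipping them to the nearest fence. Both options, the function names and the sample readings are assumptions made for this example. Which treatment is appropriate depends on the process and on domain knowledge.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Option 1: drop every observation outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]


def clip_outliers(values, k=1.5):
    """Option 2: keep every observation, but pull it back to the nearest fence."""
    lower, upper = iqr_fences(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sensor readings in °C with one faulty value.
readings = [21.5, 22.0, 23.1, 22.4, 24.0, 23.5, -20.0, 22.8, 21.9, 23.0]

cleaned = remove_outliers(readings)
clipped = clip_outliers(readings)
print(len(readings), len(cleaned), len(clipped))  # 10 9 10
```

Removing keeps only trusted values but shortens the series, which matters when every time step is needed. Clipping preserves the length of the series at the cost of replacing the faulty reading with an artificial one.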
This was our blog post about the nature of outliers. In our next blog post we aim to highlight some of our approaches to dealing with outliers. Interested in knowing how outliers might impact your manufacturing site? Talk to an expert and get in touch with us.
S. Chaudhary, “Why ‘1.5’ in IQR Method of Outlier Detection?”, Towards Data Science, 2019.
J. Osborne and A. Overbay, “The Power of Outliers (and Why Researchers Should Always Check for Them)”, Pract. Assess. Res. Eval., vol. 9, 2004.
Dr. Farhan Santo works as a data scientist at DatenBerg. He received his PhD in acoustic emission analysis at The Welding Institute (TWI Ltd) in the UK. Reach out with any questions about data analytics in manufacturing. E-Mail: [email protected]