SDK 2's Outlier Alert Report
We recently released a feature in SDK 2 called General Reports. These are widely applicable reports that any SDK customer can use by simply activating them on the project dashboard. Today, we'd like to detail the Outlier Alert Report and how it works.
An outlier is an observation that lies an abnormal distance from other values in a population. To identify abnormal observations, normal must first be defined. In reports that identify outliers, normal is defined for dataset fields that contain ordinal numeric data or categorical data.
Ordinal Numeric Data
A field is considered to have ordinal numeric data if all responses for the field are numeric and multiple choice options are not used on the intake form. In these cases, by default, the lower limit for outliers is found by multiplying the interquartile range by 3 and subtracting the result from the 1st quartile. For the upper limit, the same number is added to the 3rd quartile.
To see how that works, let's consider a dataset with a single column named "heads". The value in each row represents the number of times a coin landed on heads in 100 tosses, so the possible values range from 0 to 100. However, it is highly unlikely that all 100 tosses would come up heads. As we move from either end of the range toward the middle, in this case 50, each value becomes more and more likely to be observed. So, at what point would a value be considered a normal observation?
Let's look at a sample dataset with 1,000 observations for heads.
A histogram of the values shows an approximately normal distribution. Most values are clustered around the middle of the range.
To establish the upper and lower limits that define outliers, we use the interquartile range. It is the difference between the 3rd quartile and the 1st quartile (53 - 46), which in this case is 7. We then apply an interquartile multiplier of 3 to arrive at 21 as the distance from the quartiles that sets the upper and lower limits. Boxplots commonly use a multiplier of 1.5 to identify mild outliers. A multiplier of 3 is typically used to identify extreme outliers.
Since our minimum (36) and maximum (65) fall between the lower limit (46 - 21 = 25) and the upper limit (53 + 21 = 74), no outliers were detected in this dataset. To answer the original question, 74 is the point at which the number of heads would be considered a normal observation when moving down from the top of the range, and 25 is the corresponding point when moving up from the bottom.
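For readers who want to try the calculation themselves, here is a minimal Python sketch of the rule described above. It is an illustration only, not the SDK's actual implementation: the function names `limits_from_quartiles` and `find_outliers` are hypothetical, the quartile method is an assumption (different quartile conventions can shift the limits slightly), and treating a value that sits exactly on a limit as normal is also an assumption.

```python
import statistics


def limits_from_quartiles(q1, q3, multiplier=3):
    """Return (lower_limit, upper_limit) from the 1st and 3rd quartiles."""
    # Distance from each quartile: interquartile range times the multiplier.
    distance = multiplier * (q3 - q1)
    return q1 - distance, q3 + distance


def find_outliers(values, multiplier=3):
    """Return the values that fall outside the lower and upper limits.

    Assumption: quartiles are computed with statistics.quantiles using the
    "inclusive" method; the report may use a different convention.
    Assumption: a value exactly equal to a limit is treated as normal.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower, upper = limits_from_quartiles(q1, q3, multiplier)
    return [v for v in values if v < lower or v > upper]


# Quartiles from the coin toss example: Q1 = 46, Q3 = 53.
lower, upper = limits_from_quartiles(46, 53)
print(lower, upper)
```

With the quartiles from the example, this prints 25 74, matching the limits above. Passing `multiplier=1.5` instead would apply the milder boxplot rule.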
Categorical Data
Fields in datasets that are submitted using multiple choice options are considered categorical data. These values are assumed to be non-ordinal, so a more basic calculation is used. In these cases, any response that makes up less than 1% of total responses for the field is considered an outlier.
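The categorical rule can be sketched in a few lines of Python as well. Again, this is an illustration under assumptions rather than the SDK's code: the function name `rare_categories` and the `threshold` parameter are hypothetical, and we assume the 1% share is measured per distinct response value against all responses for the field.

```python
from collections import Counter


def rare_categories(responses, threshold=0.01):
    """Return the response values whose share of all responses is below threshold.

    Assumption: the share is counted per distinct value, and a value that
    makes up exactly the threshold share is not flagged.
    """
    total = len(responses)
    if total == 0:
        return []
    counts = Counter(responses)
    # Sort the flagged values so the result is deterministic.
    return sorted(value for value, n in counts.items() if n / total < threshold)


# Hypothetical field with 1,000 responses; "green" appears 5 times (0.5%).
responses = ["red"] * 600 + ["blue"] * 395 + ["green"] * 5
print(rare_categories(responses))
```

In this made-up example, only "green" falls below the 1% threshold, so it is the only value flagged.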
That's how we identify outliers. Have suggestions or questions? Reach out to us.
Nov. 21 2017 note