Box and Whisker (boxplot) in PowerBI
Updated: Jan 30, 2021
The Box and Whisker chart is becoming one of my favorite visualizations to use in PowerBI. It conveys a lot of information in a compact manner, telling you a great deal about the underlying data set at a glance. Let's have a look at one in action. I, unfortunately, have an issue with my ISP at the moment, resulting in terrible internet speed. As part of my conversations with the ISP, I have captured the results of a speed test every 15 minutes so I can demonstrate to them how bad my speed is. One result of this is the following set of Box and Whisker charts. (Yep, my internet is that bad at the moment.)
Looking at my download chart, I can tell at a glance that:
My internet speed varies between close to 0 and 4 Mbps (approx.)
The Mean (average) is approx. 1.5 (this is the big white dot)
The Median (the line at the edge of the dark grey and light grey in the box) is less than 1
The typical range (the box, which holds the middle half of the measurements) is between approx. 0.5 and 2.2
Speeds (if you can call them that) of close to 4 are outliers
There are a significant number of speed measurements below the box, so my outliers tend to be poor rather than good.
In PowerBI with this particular visual, if you hover over the "box", you get additional information as shown below. Note the actual value of the Median is 0.92, which is less than the Mean of 1.5. When your Median is less than your Mean, you know that most of your data is bunched at the lower end, with a smaller number of higher values pulling the Mean up (statisticians call this right-skewed, or positively skewed). However, we could read that visually just by looking at where the dots are concentrated relative to the box and the whiskers.
So as you can see, there is a significant amount of information about your dataset conveyed in a Box and Whisker. Often you are just given an Average in a report, or maybe a line chart of measurements, but an Average on its own can hide the impact of outliers, and a line chart forces the user to interpret the data for themselves. If I were to convey this information via a line chart and an average, it would not, in my opinion, have the sophistication and authority of a Box and Whisker.
So, let's look a bit closer at the parts of a Box and Whisker. There are five key values you want to see: the Median, the lowest, the highest, the Q1 value, and the Q3 value. The lowest and highest represent the whiskers, the lines that you see at the bottom and top of the chart. In my case these are 0.04 and 3.84. The Median is the value in the middle of my ordered set of numbers and is the small dot between the light and dark grey areas in the box. The Q1 and Q3 values are the bottom and top edges of my box, and it is within this space that you would define the "typical" range of values, i.e. my internet speed typically bounces between 0.5 and 2.18.
For example, if I had a list of numbers as follows:
51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13
Then to get my five values, I first order the list to get:
6, 7, 13, 17, 20, 25, 39, 41, 43, 49, 51, 62.
As I have 12 values, the Median is (25 + 39) ÷ 2 = 32
The lowest is 6 (the low whisker)
The highest is 62 (the high whisker)
The Q1 is the Median of the set of numbers that are less than the full set Median, i.e. the Median of 6, 7, 13, 17, 20, 25 = (13 + 17) ÷ 2 = 15 (the low edge of the box)
The Q3 is the Median of the set of numbers that are greater than the full set Median, i.e. the Median of 39, 41, 43, 49, 51, 62 = (43 + 49) ÷ 2 = 46 (the high edge of the box)
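If you want to check the arithmetic above outside of PowerBI, here is a minimal Python sketch of the same "median of each half" approach. The function name five_number_summary is my own for illustration, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumption, as the example above only covers an even count. Other tools, including the PowerBI visuals, may use a slightly different quartile convention, so their Q1 and Q3 can differ a little from these.

```python
from statistics import median


def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest) using the median-of-halves method.

    Hypothetical helper for illustration; needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Lower half: the values below the median position.
    lower = ordered[:half]
    # Upper half: the values above it. When n is odd, the middle value
    # is skipped (an assumption; conventions vary between tools).
    upper = ordered[half + n % 2:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


numbers = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]
lowest, q1, med, q3, highest = five_number_summary(numbers)
print(lowest, q1, med, q3, highest)
```

For the list above this gives 6, 15, 32, 46 and 62 (the Median and quartiles come back as 15.0, 32.0 and 46.0), which matches the values worked out by hand.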
The Box and Whisker chart in PowerBI can also use the size of the dots in green to show how frequently individual measurements are recorded, but in my case all observations are distinct. From a statistical point of view, the Mean (average) is not actually part of a box and whisker, but the PowerBI visualization shows this by default via the big white dot. You can disable this if you wish.
You have several box and whisker chart options in Power BI. If you search in AppSource (which you can access from the Insert Ribbon -> More Visuals -> From AppSource), you will see the following options.
The top two are direct hits. In my use of Box and Whisker so far, I have used the one from MAQ Software, and I find it simple and effective. However, in a follow-up post, I plan to compare the different visuals on the same dataset. Stay tuned for that.
To wrap up, what I like here is the simplicity of the visual, but also the fuller understanding of the set of numbers that I am looking at. You are bringing a user into a data science mindset in a more friendly manner. The five numbers on their own would be a challenge for a lot of users to interpret, but the chart gives them that insight at a glance. I love the fact that an average does not have to be used. I love averages, but they can be skewed by outliers, whereas the box and whisker chart presents all of my data at a glance.
I hope you find this of interest. Stay safe, and marvel at how crap my internet speed is right now! Hopefully, my speed is an outlier.