It is important to know how values are distributed in your data set or database. If you are using a database, you should be able to get some visual or measure of variance to help you make meaning of your measures of central tendency or the shape of your data in general.
Using Visuals To See Distributions
Visuals can be really helpful here. Making a bar graph for categorical data or a histogram for numeric data can help you get a sense of things.
Categorical Data and Bar Graphs
Calculating the mode will probably be just as informative as seeing it as a bar graph. Still, sometimes visuals can allow us to see patterns better than alphanumeric writing.
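As a rough illustration of how the mode and a bar graph carry the same information, here is a minimal Python sketch. The room types are made up for the example and are not taken from any real data set.

```python
from collections import Counter

# Hypothetical categorical column (values invented for illustration).
room_types = ["Entire home", "Private room", "Entire home", "Shared room",
              "Private room", "Entire home"]

# Count how many times each category appears.
counts = Counter(room_types)

# The mode is the category with the highest count.
mode, mode_count = counts.most_common(1)[0]
print(f"Mode: {mode} ({mode_count} listings)")

# A rough text bar graph: one '#' per listing in each category.
for category, n in counts.most_common():
    print(f"{category:<13} {'#' * n}")
```

The tallest bar and the mode are the same category; the bars simply add a sense of how far ahead it is.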
Click here to learn more about this and how to make a bar chart in Excel.
Numeric Data and Histograms
Seeing the distribution of numerical data can be much more important. It is foundational to both the simplest and the most complex statistical analyses.
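A histogram groups numeric values into equal-width bins and counts how many values fall in each one. The sketch below shows the idea in Python with made-up prices and an assumed bin width of 100; neither comes from a real data set.

```python
from collections import Counter

# Hypothetical nightly prices in dollars (invented for illustration).
prices = [45, 60, 65, 70, 85, 90, 95, 110, 140, 150, 220, 480]
bin_width = 100  # assumed bin width; choose one that suits your data

# Assign each value to the lower edge of its bin, e.g. 140 -> 100.
bins = Counter((p // bin_width) * bin_width for p in prices)

# Print every bin from 0 up to the maximum, including empty ones,
# so gaps in the distribution stay visible.
for lower in range(0, max(prices) + 1, bin_width):
    label = f"{lower}-{lower + bin_width - 1}"
    print(f"{label:>8} | {'#' * bins[lower]}")
```

Most values pile up in the lowest bin with a thin tail stretching toward the high end, which is the kind of shape that is hard to see from a list of numbers alone.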
Variance, Standard Deviation, Interquartile Range
Variance is the average squared distance from the mean. To calculate it, take the difference between each value and the mean, square each difference, add the squared differences together, and then divide that sum by the population size (if you have all values in existence) or by the sample size – 1 (to reflect that you do not have all values in existence).
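The calculation can be sketched step by step in Python. The values below are hypothetical, and the last two lines only cross-check the manual arithmetic against the standard library's `statistics` module.

```python
import statistics

# Hypothetical values (invented for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)

# Square the difference between each value and the mean.
squared_diffs = [(x - mean) ** 2 for x in values]

# Divide by N when the data are the whole population...
population_variance = sum(squared_diffs) / len(values)

# ...or by N - 1 when the data are only a sample.
sample_variance = sum(squared_diffs) / (len(values) - 1)

print(population_variance, sample_variance)

# Cross-check against the standard library.
print(statistics.pvariance(values), statistics.variance(values))
```

The sample variance always comes out a little larger than the population variance because it divides the same sum by a smaller number.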
Standard Deviation
To make variance more manageable, the standard deviation is conventionally used instead. It is just the square root of the variance, which puts it back in the same units as the data.
It can be helpful to use standard deviation as a way to think about any one value compared against other values. For example, if there are a lot of values clustered around the mean, giving a standard deviation of 1.07, a difference of two points between two values will be really far apart in terms of spread (i.e., nearly two standard deviations away from one another).
By contrast, if the distribution is more spread out and, say, has a standard deviation of 2.18, a difference of two points between values would be within one standard deviation and would not be very far apart in terms of spread. (Previous example adapted from Jane Miller, 2015, The Chicago Guide to Writing about Numbers, Second Edition, pp. 79-81.)
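The comparison in that example is a single division: the raw difference divided by the standard deviation. A small Python sketch, where the helper name `gap_in_sd_units` is made up for illustration:

```python
def gap_in_sd_units(difference, standard_deviation):
    """Express a raw difference between two values in standard deviations."""
    return difference / standard_deviation


# The same two-point gap, measured against the two spreads from the example.
tight = gap_in_sd_units(2, 1.07)
spread_out = gap_in_sd_units(2, 2.18)

print(f"{tight:.2f}")       # 1.87 standard deviations apart
print(f"{spread_out:.2f}")  # 0.92 standard deviations apart
```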
Click here on how to calculate standard deviation, quartiles, and interquartile range.
Quartiles and Interquartile Range
Interquartile range is another way to describe the variability in your numeric data. A quartile is a value that is above a certain percentage of other values recorded. There are four quartiles:
- Q1 (25%)
- Q2 (50%–i.e., the median)
- Q3 (75%)
- Q4 (100%–i.e., the maximum).
You could include all of that information, along with the minimum to get a sense of the spread of the data.
You could also calculate what is called the interquartile range, which can be especially helpful if you have outliers. Since standard deviation relies on the mean, if there are outliers, then the standard deviation may not be that useful in explaining the spread (because an outlier sits far from the mean and will greatly inflate the standard deviation).
To calculate the interquartile range, you simply subtract Q1 from Q3 (that is, Q3 minus Q1).
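As a sketch of the arithmetic, the interquartile range is Q3 minus Q1. The Python below uses made-up prices and the standard library's `statistics.quantiles`. Quartile conventions differ slightly between tools; the `"inclusive"` method is chosen here on the assumption that the data are treated as the full set of values, so other software may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical nightly prices in dollars (invented for illustration).
prices = [40, 55, 65, 80, 90, 110, 140, 200, 950]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

print(f"Min {min(prices)}, Q1 {q1}, Q2 (median) {q2}, Q3 {q3}, Max {max(prices)}")
print(f"IQR {iqr}")
```

The single very high price of 950 changes the maximum but has no effect on the interquartile range, which is why the IQR is useful when outliers are present.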
A convention in statistics is to use the interquartile range as a way to help see if you have any outliers. To do so, multiply the interquartile range by 1.5.
Take that number and subtract it from Q1. If any number is below that result, then it is considered an outlier.
Additionally, take the same 1.5 × IQR figure and add it to Q3. If any number is above that result, then it is also considered an outlier.
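The outlier check can be sketched in a few lines of Python. The prices are made up for illustration, and the `"inclusive"` quartile method is an assumption; other tools may compute slightly different quartiles and therefore slightly different cutoffs.

```python
import statistics

# Hypothetical nightly prices in dollars (invented for illustration).
prices = [40, 55, 65, 80, 90, 110, 140, 200, 950]

q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Multiply the interquartile range by 1.5.
step = 1.5 * iqr

# Anything below Q1 - step or above Q3 + step counts as an outlier.
lower_cutoff = q1 - step
upper_cutoff = q3 + step

outliers = [p for p in prices if p < lower_cutoff or p > upper_cutoff]

print(f"Cutoffs: {lower_cutoff} to {upper_cutoff}")
print(f"Outliers: {outliers}")
```

Here only the 950 listing falls outside the cutoffs. The lower cutoff is negative, so no price in this example could count as a low outlier.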
Task
Choose one of the following:
- Choose numeric data from your data set and create a visual that can tell you what kind of distribution it is. Also calculate the standard deviation or interquartile range. In a comment below, tell me which variable you chose, what the distribution looks like, what the standard deviation or interquartile range was, and what that information tells you about which measure of central tendency you could use in a meaningful way.
- You can use the Airbnb data set we used in class to answer #1.
- If you are working with a database, and you can get a sense of the distribution visually or numerically, follow the same instructions for commenting in #1.
I chose the Price variable for the Airbnb data set. I analyzed the data using a histogram. The graph seems to have a skewed-to-the-left distribution. People tend to spend on average $100 to $200 on Airbnb.
Standard Deviation: $215.59
Min: $10.00
Q1: $65.00
Q2: $90.00
Q3: $140.00
Q4: $9,999.00
INTERQUARTILE: $130
When I calculated standard deviation from the Airbnb price data, I got a result of 240.15. This tells me that this data likely has some very large outliers? I believe this means that the average amount of variability in my data is around $240 per night. This is likely due to some very expensive (on a per night basis) rooms to rent. Likely standard deviation is not a good indicator here because of the outliers. This was also verified by the fact that the maximum, or fourth quartile, has a value of $10,000.
I used the Airbnb data set for these calculations and within the data set I used the “Price” column to calculate my findings.
The standard deviation was $215.59, the min was $10.00, Q1 was $65.00, Q2 was $90.00, Q3 was $140.00, Q4 was $9999.00, and finally, the interquartile range was $75.00. There was a left skew in the distribution of the data set.
I used my database, which is called “securities from the bulk of cross-border wealth”, and I used the percent of world GDP. The standard deviation was 45.12. This information tells us whether or not the economy is inflating by producing more goods and services.
I decided to work on the AirBnB example because it was a bit difficult to do it with my data set. I chose the minimum night of staying at the airbnb. The calculations are the following: SD 23.01 nights; minimum of 1 night; Q1 is one night; Q2 a minimum of 2 nights; Q3 a minimum of 4 nights; and the maximum of nights are 99 nights.
I used the Airbnb set to see how many listings a lister will typically have. The standard deviation was 7, min was 1, Q1 was 1, Q2 was 1, Q3 was 2, and max was 103. With a mean of 2.2, each of these data points is, on average, 7 deviations away from the mean. If the max is 103, but the min and the other quartiles are either 1 or 2, that tells me that 103 is probably an outlier and that most listers are listing either 1 or 2 listings at a time. The interquartile range is 1 as well. You could use this data to see how many people have fewer listings in one area or another to approximate the value of property in the area.
I also created a bar graph which showed a visual representation of the majority of listers having around 1-2 listings and that the outliers were the listers with 103 listings.
I picked the “number of reviews” category in the Airbnb data set. The standard deviation for this is 51.38837, minimum is 0, Q1 is 1, Q2 as median is 8, Q3 is 35, and maximum is 607. This distribution tells me that at least one customer left one review and there are 607 reviews maximum in this dataset.
I chose to do calculations on the Airbnb data set. I got $10 as the minimum price, $215.62 as the standard deviation, $65 as quartile 1, $90 as quartile 2, $140 as quartile 3, $9999 as the maximum price, and $75 as interquartile range.
The data is positively skewed since most of the data points are on the right side. Moreover, since the mean, median, and mode of the data are different, it indicates that the boundary results are low relative to the data. The standard deviation of the chosen variable, gross domestic product, is 24,408.52, which is very high. The high value indicates that the gross domestic product values are highly spread. Lastly, the most suitable measure of central tendency to use is the mean since it is the average gross domestic product value.
The standard deviation in my data set for female dropout was 0.023105228, whereas the male was 0.25034728. Male had a slightly higher dropout rate than females. The percentage of graduates in males was 0.05986715 vs. female grads was 0.056442095. This also showed a 10% lead in the female outcomes.
I had trouble understanding what exactly is being required of me right here.
Hi Joey, feel free to reach out so we can talk it over.
For my data set, there is a line graph already created to visually show how the data can be interpreted. The mean is on the x-axis and the year is on the y-axis. So for my data set, I calculated the standard deviation of the mean column and it was 0.320068702.
q0= -0.47
q1= -0.20435
q2= -0.0534
q3= 0.229025
q4= 0.99