3. Descriptive statistics
3.2 Display data
- statistical graphs help you learn about the shape of the distribution of a sample and give an overall picture of the data.
- graph types:
- histogram
- box plots
- stem-and-leaf plot
- frequency polygon
- pie charts
3.2.1 Histograms
- used for displaying the distribution of continuous numeric data
- advantage: can display large data sets; histograms are a good choice when the data set has 100 values or more
- in R:
hist(numericSequence)
- A histogram consists of contiguous boxes that map intervals of values (x-axis) to their corresponding frequencies (y-axis).
- A histogram divides the range of the data (the x-axis) into equal intervals, which form the bases of the boxes.
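The equal-interval idea can be sketched in R as follows. This is only an illustration: the `heights` vector and the interval width of 5 are made-up assumptions, not data from these notes. `hist()` accepts explicit `breaks`, and with `plot = FALSE` it returns the counts per interval instead of drawing.

```r
# Hypothetical sample (an assumption for illustration only)
heights <- c(61, 62, 64, 65, 65, 66, 67, 67, 68, 70, 71, 73)

# Equal-width intervals of 5 that cover the range of the data
breaks <- seq(60, 75, by = 5)

# plot = FALSE returns the frequency of each interval without drawing
h <- hist(heights, breaks = breaks, plot = FALSE)
h$counts   # 5 5 2 (intervals are right-closed by default)

# Draw the histogram using the same intervals
hist(heights, breaks = breaks, main = "Heights", xlab = "height")
```

Each count is the height of one box, and each pair of neighbouring break points is the base of that box.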
3.2.2 Box Plots
- The box plot, or box-whisker plot, shows the concentration of the data, and how far the extreme values are from the rest of the data
- box plot constructed from:
- the smallest value,
- the first quartile,
- the median,
- the third quartile,
- the largest value.
- median:
- is a number that measures the center of the data (middle value),
- it does not need to be part of the observed values.
- a number that separates an ordered set of values into halves.
- median is the second quartile.
- Quartiles:
- numbers that separate data into quarters.
- might or might not be part of the observed values.
- to find quartiles:
- find the median (second quartile).
- find the median of the first half (first quartile).
- find the median of the second half (third quartile)
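The median-of-halves procedure above can be sketched in R. The data vector is a made-up assumption, and the variable names (`lower`, `upper`, `q1`, `q2`, `q3`) are hypothetical. When the number of values is odd, this sketch leaves the median itself out of both halves, which is one common convention.

```r
# Hypothetical data (an assumption for illustration only)
x <- sort(c(1, 2, 4, 5, 7, 8, 10, 12))
n <- length(x)

q2 <- median(x)                      # second quartile (the median)
lower <- x[1:floor(n / 2)]           # first half of the ordered values
upper <- x[(ceiling(n / 2) + 1):n]   # second half of the ordered values
q1 <- median(lower)                  # first quartile
q3 <- median(upper)                  # third quartile

c(q1, q2, q3)   # 3 6 9
```

Note that none of the three quartiles here is an observed value. R's built-in `quantile()` uses a different interpolation rule by default, so its results can differ slightly from this hand method.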
- Outliers:
- values that do not fit with the rest of the data
- values that lie outside of the normal range.
- data values that are much too large or much too small in comparison to the vast majority of the observed values.
- IQR:
- stands for interquartile range.
- the distance between the third quartile and the first quartile.
IQR = Q3 - Q1
- any data value that is larger than Q3 + ( 1.5 * IQR ) is a potential outlier.
- any data value that is smaller than Q1 - ( 1.5 * IQR ) is a potential outlier.
- outliers can greatly affect the outcome of a statistical analysis, so identifying them is important.
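A minimal sketch of the 1.5 * IQR rule in R. The data are made up (the value 40 is deliberately extreme), and the names `lower_fence` and `upper_fence` are hypothetical. The quartiles come from `quantile()` with its default method, which may differ slightly from quartiles found by hand.

```r
# Hypothetical data; 40 is deliberately extreme (an assumption for illustration)
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 19, 40)

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                           # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers   # 40
```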
- example box plot: (figure not included)
- data represented by the box plot: (not included)
- reading the box plot:
- the lower whisker, the horizontal line at the bottom, is at 124. This is roughly the lower fence Q1 - (IQR * 1.5), which works out to 158 - 33 = 125 with the quartiles quoted here.
- the minimum value, 117, is below Q1 - (IQR * 1.5), so it is an outlier, represented by the circle at the bottom of the box plot.
- Q1 is 158, represented by the lower edge of the grey box.
- Q3 is 180, represented by the upper edge of the grey box.
- the median is 170, represented by the thick horizontal line inside the grey box.
- the upper fence is Q3 + (IQR * 1.5) = 180 + 33 = 213. No data value is larger than that, so the upper whisker, the horizontal line at the top, is at the maximum value, 208.
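The pieces of a box plot can also be inspected numerically in R. The sketch below uses made-up data, not the data behind the plot described above. `boxplot.stats()` returns the five values that are drawn (lower whisker, lower edge of the box, median, upper edge of the box, upper whisker) in `stats`, and the points drawn as circles in `out`. It computes the box edges as "hinges", which can differ slightly from quartiles given by `quantile()`.

```r
# Hypothetical data; 40 is deliberately extreme (an assumption for illustration)
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 19, 40)

b <- boxplot.stats(x)   # default whisker coefficient is 1.5
b$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                   # values beyond the whiskers, drawn as individual points

# Draw the box plot itself
boxplot(x, main = "Example box plot")
```

In this sketch the upper whisker stops at 19, the largest value that is still within 1.5 * IQR of the box, and 40 is reported as an outlier.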
3.3 Measures of the center of the data
- two measures involved: the mean (average) and the median.
- The median is often the better measure of the center when the data contain outliers, because extreme values have little effect on it.
- The mean is the most commonly used measure of the center of the data; the sample mean is denoted x-bar.
- skewness:
- symmetrical: mean == median.
- skewed to the left: mean < median
- skewed to the right: mean > median
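A small sketch of how skew separates the mean from the median. Both vectors are made-up assumptions chosen so that a single extreme value creates the skew.

```r
# Hypothetical right-skewed data: one large value pulls the mean up
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
mean(right_skewed)     # 4.75
median(right_skewed)   # 3

# Hypothetical left-skewed data: one small value pulls the mean down
left_skewed <- c(1, 17, 18, 18, 18, 19, 19, 20)
mean(left_skewed)      # 16.25
median(left_skewed)    # 18
```

In both cases the median stays near the bulk of the values while the mean moves toward the extreme value.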
3.4 Measures of the spread of the data
- two measures involved: IQR (inter-quartile range) and standard deviation
- the standard deviation is the most used measure of the spread of the data.
- the deviation of an observation (subject) is the difference between its value and the average (mean) of the data.
deviation of x[i] = x[i] - mean(x)
sample variance of x = sum( (deviation of x[i])^2 ) / (n - 1)
standard deviation of x = sqrt( sample variance of x )
- calculate the standard deviation using R:
x <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5) # data
x.bar <- mean(x) # average or mean
x - x.bar # individual deviations
(x - x.bar)^2 # individual deviations squared
std.variance <- sum((x - x.bar)^2) / (length(x) - 1) # sum of squared deviations divided by (n - 1) = sample variance
std.deviation <- sqrt(sum((x - x.bar)^2) / (length(x) - 1)) # square root of the sample variance = 0.715891
std.deviation <- sd(x) # built-in sample standard deviation
std.variance <- var(x) # built-in sample variance
std.deviation <- sqrt(std.variance) # same result as sd(x)
- The sample standard deviation, s, is either zero or larger than zero.
- When s = 0, there is no spread and the data values are equal to each other.
- When s is a lot larger than zero, the data values are very spread out about the mean.
- Outliers can make s very large.
- example:
- a data set has mean = 5 and s = 2, and contains the values 7 and 1.
- value 7 is 1 standard deviation to the right of the mean, because 7 = 5 + (1 * 2) = mean + (1 * s) and 7 is larger than the mean.
- value 1 is 2 standard deviations to the left of the mean, because 1 = 5 - (2 * 2) = mean - (2 * s) and 1 is smaller than the mean.
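The arithmetic in the example above can be checked in R by dividing each deviation by s. The variable names `values` and `z` are hypothetical; the numbers are the ones from the example.

```r
# Numbers from the example above: mean = 5, s = 2
x.bar <- 5
s <- 2
values <- c(7, 1)

# Number of standard deviations each value lies from the mean
z <- (values - x.bar) / s
z   # 1 -2
```

A positive result means the value lies to the right of the mean, and a negative result means it lies to the left.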