6. The Normal Random Variable
6.2 Normal Random Variable
- the normal distribution serves as an approximation (generic model) of other distributions, e.g. under certain circumstances the binomial distribution can be approximated by the normal distribution.
- Normal distribution:
- A Normal random variable has a continuous distribution over the sample space of all real numbers, negative or positive.
- denoted X ∼ Normal(µ, σ²), where µ = E(X) = expectation = the mean, σ² = Var(X) = the variance of X.
- the normal distribution is symmetric about the expectation, and the random variable is most likely to take values near the expectation.
- dnorm: calculates the density of the normal distribution
- pnorm: calculates the cumulative probability P(X ≤ x) of the normal distribution: pnorm(x, µ, σ) = pnorm(x, mean, std deviation). Note that the third argument is the standard deviation, not the variance.
- the smaller the variance the more concentrated is the distribution of the random variable about the expectation.
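The points above about density, cumulative probability, symmetry and concentration can be checked with a short sketch. The parameters µ = 2 and σ = 3 and the helper `within_one` are assumptions chosen for illustration, not values taken from these notes.

```r
# Illustrative parameters (assumed for this sketch): X ~ Normal(mu = 2, sigma^2 = 9)
mu <- 2
sigma <- 3   # dnorm/pnorm expect the standard deviation, i.e. sqrt(variance)

# Density at the expectation and at two points equally far from it
d_center <- dnorm(mu, mean = mu, sd = sigma)
d_left   <- dnorm(mu - 1, mean = mu, sd = sigma)
d_right  <- dnorm(mu + 1, mean = mu, sd = sigma)

# Cumulative probability P(X <= mu); by symmetry this is one half
p_center <- pnorm(mu, mean = mu, sd = sigma)

# Hypothetical helper: probability of landing within one unit of the expectation
within_one <- function(s) pnorm(mu + 1, mu, s) - pnorm(mu - 1, mu, s)
p_wide   <- within_one(3)   # larger standard deviation
p_narrow <- within_one(1)   # smaller standard deviation, more concentrated

print(c(d_left = d_left, d_center = d_center, d_right = d_right))
print(c(p_center = p_center, p_wide = p_wide, p_narrow = p_narrow))
```

The density is highest at the expectation and equal at symmetric points, and the narrower distribution puts more probability near the expectation.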
- Standard Normal distribution:
- the normal distribution of the standardized values (z-scores).
- z-score: the original measurement expressed in units of standard deviations from the expectation.
- in a standard Normal distribution the expectation is 0 and the variance is 1: Z ∼ N(0, 1), written as N(expectation, variance).
- example, P(0 < X ≤ 5) for X ∼ N(2, 9):
- the standardized value of x is z = (x - E(X)) / std deviation = (x - 2) / 9^½ = (x - 2) / 3.
- to find the probability of x belongs to [0, 5], we need to find z score for the boundaries of the domain.
- z-score(0) = (0 - 2) / 3 = -0.6667
- z-score(5) = (5 - 2) / 3 = 1
- The probability that the original variable falls in (0, 5] equals the probability that the standard Normal variable falls in (-0.6667, 1].
- find the cumulative probability of the standard distribution up to each boundary: pnorm(-2/3) ≈ 0.2525 for the lower boundary, then pnorm(1) ≈ 0.8413 for the upper boundary.
- probability that the standard distribution falls in (-0.6667, 1] = pnorm(1) - pnorm(-2/3) = 0.5888522
- if the expectation and the standard deviation are not provided to pnorm(x, E(X), std deviation), the function uses the standard normal distribution, where the expectation defaults to 0 and the std deviation defaults to 1.
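The worked example P(0 < X ≤ 5) for X ∼ N(2, 9) can be sketched in R in two ways: through standardized values and directly on the original scale. The variable names are illustrative; only `pnorm` comes from the notes.

```r
# X ~ N(2, 9): expectation 2, variance 9, so the standard deviation is 3
mu <- 2
sigma <- sqrt(9)

# Standardize the interval boundaries
z_low  <- (0 - mu) / sigma   # -2/3
z_high <- (5 - mu) / sigma   #  1

# P(0 < X <= 5) via the standard Normal distribution (default mean 0, sd 1)
p_standard <- pnorm(z_high) - pnorm(z_low)

# The same probability computed directly on the original scale
p_direct <- pnorm(5, mean = mu, sd = sigma) - pnorm(0, mean = mu, sd = sigma)

print(c(p_standard = p_standard, p_direct = p_direct))
```

Both routes should give the same probability, which is why standardizing is optional when the mean and standard deviation are passed to `pnorm`.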
6.2.3 Computing Percentiles
- p-percentile: given a random variable X and a percent p, the value x with the property that the cumulative distribution up to x equals the probability p, i.e. P(X <= x) = p.
- qnorm: calculates the percentiles of the Normal distribution; for the standard Normal distribution (the default) the percentile is a z-score.
- qnorm(0.975) = 1.959964 => a value 1.96 standard deviations to the right of the mean is at the 97.5%-percentile
- qnorm(0.025) = -1.959964 => a value 1.96 standard deviations to the left of the mean is at the 2.5%-percentile
- practically, z0 = 2.5%-percentile = -1.96; z1 = 97.5%-percentile = 1.96 => 95% of the probability of the standard normal distribution is concentrated in the range [-1.96, 1.96].
- for a normal distribution with E(x)=2, std dev =3: 95% of the probability of this normal distribution is concentrated in the range [2 - (1.96 * 3), 2 + (1.96 * 3)]=[-3.88, 7.88] = [E(x) + (qnorm(0.025) * sd), E(x) + (qnorm(0.975) * sd)].
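A short sketch of the percentile computations above, using the same expectation 2 and standard deviation 3. It also checks that `qnorm` and `pnorm` undo each other; the variable names are illustrative.

```r
# Percentiles of the standard Normal distribution
z_low  <- qnorm(0.025)   # about -1.96
z_high <- qnorm(0.975)   # about  1.96

# Probability captured between the two percentiles
coverage <- pnorm(z_high) - pnorm(z_low)

# Central 95% interval for a Normal variable with expectation 2 and sd 3,
# computed by rescaling the standard percentiles ...
mu <- 2
sigma <- 3
interval_scaled <- c(mu + z_low * sigma, mu + z_high * sigma)

# ... and directly by passing the mean and sd to qnorm
interval_direct <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)

print(coverage)
print(rbind(interval_scaled, interval_direct))
```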
6.2.4 Outliers and the normal distribution
- Inter-quartile range is the length of the central interval that contains 50% of the distribution, starts with Q1 (25%) and ends with Q3 (75%).
- for the standard normal distribution, qnorm(0.75) = 0.6745 and qnorm(0.25) = -0.6745, so that IQR = Q3 - Q1 = 0.6745 - (-0.6745) = 1.349.
- Outliers are identified as values that are more than 1.5 times the inter-quartile range (IQR) away from the ends of the central box, i.e. below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
- identifying the upper and lower limits for the outliers in a standard normal distribution:
- upper limit: qnorm(0.75) + 1.5*(qnorm(0.75)-qnorm(0.25)) = 2.697959
- lower limit: qnorm(0.25) - 1.5*(qnorm(0.75)-qnorm(0.25)) = -2.697959
- any value outside of [-2.70, 2.70] is considered an outlier.
- probability of being less than the upper outlier limit = pnorm(2.697959)
- probability of being less than the lower outlier limit = pnorm(-2.697959)
- probability of being greater than the upper outlier limit is 1 - pnorm(2.697959)
- because of symmetry in the normal distribution, the probability of being an outlier = 2 * the probability of being greater than the upper outlier limit.
- the probability of being an outlier = 2 * (1 - pnorm(2.697959)) = 0.006976603
- We get that for the standard Normal distribution the probability of an outlier is approximately 0.7%.
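The outlier computation for the standard Normal distribution can be put together as one sketch. The variable names are illustrative; the two tails are summed explicitly rather than relying on symmetry, so the symmetry shortcut can be checked against it.

```r
# Quartiles and inter-quartile range of the standard Normal distribution
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1

# Outlier limits: 1.5 * IQR beyond the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Probability of an outlier: left tail below `lower` plus right tail above `upper`
p_outlier <- pnorm(lower) + (1 - pnorm(upper))

# Shortcut that uses the symmetry of the Normal distribution
p_outlier_sym <- 2 * (1 - pnorm(upper))

print(c(lower = lower, upper = upper, p_outlier = p_outlier))
```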
6.3 Approximation of the Binomial distribution
- The theorem in probability theory that mathematically establishes this approximation is called the Central Limit Theorem.
- The computation of a probability for a Binomial random variable is replaced by computation of probability for a Normal random variable that has the same expectation and standard deviation as the Binomial random variable.
- example: tossing a fair coin 4000 times, where Head is the success and X is the number of Heads:
- X ~ Binomial(4000, 0.5) where n = 4000 (the number of trials), p = 0.5 (the probability of success)
- E(X) = n * p = 4000 * 0.5 = 2000; Var(X) = n * p * (1 - p) = 4000 * 0.5 * 0.5 = 1000
- sd = sqrt(Var(X)) = sqrt(1000) ≈ 31.6227
- find the probability of getting between 1940 and 2060 successes, call it prob:
- using binomial distribution: prob = P(X <= 2060) - P(X <= 1939) = pbinom(2060,4000,0.5) - pbinom(1939,4000,0.5) = 0.9442883
- using normal distribution: prob = P(X <= 2060) - P(X <= 1939) = pnorm(2060, 2000, 31.6227) - pnorm(1939, 2000, 31.6227) ≈ 0.9442, very close to the exact binomial value
- find the central region that contains 95% of the probability, call it prob1:
- we need to find the boundaries of this region for the normal distribution with same expectation and std deviation.
- lower = qnorm(0.025, 2000, 31.6227) = 1938.020; upper = qnorm(0.975, 2000, 31.6227) = 2061.980 => the central region is approximately [1938, 2062]
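The coin-tossing example can be reproduced with the following sketch, which compares the exact Binomial probability with its Normal approximation and computes the central 95% region. The variable names are illustrative.

```r
# X ~ Binomial(4000, 0.5): number of Heads in 4000 tosses of a fair coin
n <- 4000
p <- 0.5

# Normal distribution with the same expectation and standard deviation
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# P(1940 <= X <= 2060) computed exactly from the Binomial distribution
p_binom <- pbinom(2060, n, p) - pbinom(1939, n, p)

# The same probability from the Normal approximation
p_norm <- pnorm(2060, mu, sigma) - pnorm(1939, mu, sigma)

# Boundaries of the central region that contains 95% of the probability
bounds_norm  <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
bounds_binom <- qbinom(c(0.025, 0.975), size = n, prob = p)

print(c(p_binom = p_binom, p_norm = p_norm))
print(rbind(bounds_norm, bounds_binom))
```

The two probabilities should differ only in the later decimal places, and the Normal boundaries should sit close to the Binomial ones.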
- functions that calculate the percentile for various distributions:

| distribution | function |
|---|---|
| binomial | qbinom |
| normal | qnorm |
| poisson | qpois |
| uniform | qunif |
| exponential | qexp |
6.3.2 Continuity Corrections
- the binomial distribution is discrete and only has integer values, but the normal distribution is continuous and can obtain decimal values
- continuity correction: when the normal distribution approximates a binomial distribution with a small number of trials, each integer value x is represented by the interval [x - 0.5, x + 0.5], so probabilities are computed over that interval.
- example: to approximate P(X <= 6) for the binomial variable, compute P(X <= 6.5) for the normal variable.
- another example: P(X = 6) is approximated by P(X <= 6.5) - P(X <= 5.5) for the normal variable.
- Poisson Approximation: when n is large and p is small, the Poisson approximation is more accurate than the Normal approximation.
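A minimal sketch of both ideas follows. The notes do not give n and p for these examples, so Binomial(30, 0.3) for the continuity correction and Binomial(2000, 0.001) for the Poisson comparison are assumptions chosen only for illustration.

```r
# --- Continuity correction (assumed example: X ~ Binomial(30, 0.3)) ---
n <- 30
p <- 0.3
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

exact_cdf   <- pbinom(6, n, p)         # exact P(X <= 6)
uncorrected <- pnorm(6, mu, sigma)     # Normal approximation without correction
corrected   <- pnorm(6.5, mu, sigma)   # Normal approximation with correction

exact_point  <- dbinom(6, n, p)                               # exact P(X = 6)
approx_point <- pnorm(6.5, mu, sigma) - pnorm(5.5, mu, sigma) # corrected approximation

# --- Poisson approximation (assumed example: X ~ Binomial(2000, 0.001)) ---
n2 <- 2000
p2 <- 0.001
exact2 <- pbinom(3, n2, p2)                                # exact P(X <= 3)
pois2  <- ppois(3, lambda = n2 * p2)                       # Poisson with the same expectation
norm2  <- pnorm(3.5, n2 * p2, sqrt(n2 * p2 * (1 - p2)))    # Normal with continuity correction

print(c(exact = exact_cdf, uncorrected = uncorrected, corrected = corrected))
print(c(exact_point = exact_point, approx_point = approx_point))
print(c(exact = exact2, poisson = pois2, normal = norm2))
```

Comparing each approximation with the exact Binomial value shows how much the correction helps for small n, and how the Poisson approximation behaves when p is small.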