Normal Distributions

1. Normal distributions are bell–shaped – unimodal and symmetric – and continuous. They're only mathematical approximations of Histograms.

Since Normal distributions are smooth, continuous curves, P(X = a) = 0 [area under the line x = a is ZERO!] so that, theoretically, P(X > a) = P(X ≥ a).

That is why the calculator command doesn’t make a distinction between P(X > 72 in.) and P(X ≥ 72 in.)!

2. All Normal Distributions are uniquely determined by their Mean [~ Median, by the way] and s.d.

3. Forward problem: When X is known and the probability, P, is asked.

Use 2nd + VARS [DISTR command] –> OPTION 2: Normalcdf(Left Limit, Right Limit, μ, σ) if

– the X–value [variable] is given / known
– A Proportion / % / Probability / Area / Percentile is asked.
– Use proper Notation: P(X > a) or P(X < b) or P(c < X < d)

– Specific commands:

· For P(X > a), use Normalcdf(a, 9999, μ, σ)

· For P(X < b), use Normalcdf(–9999, b, μ, σ)

· For P(c < X < d), use Normalcdf(c, d, μ, σ)

How do we know if the Normalcdf command is to be used? When X is known and the probability, P, is asked.
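For readers working without the calculator, the forward problem can be sketched in Python. This is only an illustration under stated assumptions: the helper name normalcdf is made up here to mirror the calculator command, it relies on NormalDist from Python's standard statistics module, and the numbers borrow the heights model N(72, 2.5) used in the Z–score example of these notes.

```python
from statistics import NormalDist

def normalcdf(left, right, mu, sigma):
    """Area under the N(mu, sigma) curve between left and right.

    Hypothetical helper that mimics the calculator's Normalcdf command.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(right) - dist.cdf(left)

# Heights X ~ N(72, 2.5); 9999 stands in for "infinity", as on the calculator.
p_above = normalcdf(74.5, 9999, 72, 2.5)      # P(X > 74.5)
p_below = normalcdf(-9999, 68.25, 72, 2.5)    # P(X < 68.25)
p_between = normalcdf(68.25, 74.5, 72, 2.5)   # P(68.25 < X < 74.5)

print(round(p_above, 4), round(p_below, 4), round(p_between, 4))
```

The three areas cover the whole curve, so they should add up to 1, which is a quick sanity check on any forward problem.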

4. Backward problem: When the probability, P, is known, and the X–value is asked.

Use 2nd + VARS [DISTR command] –> OPTION 3: InvNorm (Left Area in decimals, μ, σ) if

– A Proportion / % / Probability / Area / Percentile is given / known
– The X–value [variable] is unknown/ asked.
– Use proper notation: P(X < a) = p%

How do we know if the InvNorm command is to be used? When the probability, P, is known, and the X–value is asked.
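The backward problem can be sketched the same way in Python. Again, this is an illustration rather than the required method: invnorm is a made-up helper name that mirrors the calculator command, built on NormalDist.inv_cdf from the standard statistics module, and the heights model N(72, 2.5) is borrowed from the Z–score example in these notes.

```python
from statistics import NormalDist

def invnorm(left_area, mu, sigma):
    """X-value that has the given area to its LEFT under N(mu, sigma).

    Hypothetical helper that mimics the calculator's InvNorm command.
    """
    return NormalDist(mu, sigma).inv_cdf(left_area)

# Heights X ~ N(72, 2.5): which height marks the 90th percentile?
x90 = invnorm(0.90, 72, 2.5)

# "Top 10%" is a RIGHT area, so convert it to a left area first: 1 - 0.10.
x_top10 = invnorm(1 - 0.10, 72, 2.5)

print(round(x90, 2), round(x_top10, 2))
```

Both calls return the same height, because the top 10% and the 90th percentile describe the same cut-off.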

IMPORTANT! 5. The terms Probability, Proportion, Percentage, Relative Frequency are used interchangeably and refer to the same idea.

6. Understand what Percentile means: If X corresponds to the p–th percentile, then p% of ALL values [incomes, weights, heights, length of pregnancies, etc.] are < X.

P = P(X < X–value)

Percentiles denote the left–area of the distribution.

IMPORTANT! 7. A problem that asks for the Percentile is a forward problem [because a Proportion / Probability / Percentage / left Area is sought, for a given X–value].

A problem that gives a Percentile involves a backward problem [because an X–value is sought for a given Proportion / Probability / Percentage / left Area].

8. The Expectations while solving any problem are:

· Define a variable – say, X – and state its distribution [Centre / Shape / Spread: X ~ N(μ, σ)]. This is NOT optional!

· Use Probability Notation to describe the Q. This is NOT optional!

· Draw a graph, label and shade the appropriate region. This is NOT optional!

· Use a calculator command to solve the problem: it is NOT required to write down the command!

IMPORTANT! Note: At times, you may prefer to visualize / illustrate the problem before you interpret the Q in probability notation. That's OK!

9. Z–scores give the number of s.d. an X–value is from the mean: Z = (X – μ) / σ. Clearly, values above the Mean have positive Z–scores; values below the Mean have negative Z–scores.

10. Power Tip! Students, at times, wonder when to use Z–scores. Here's a good Rule of Thumb: when the Q uses the phrase: s.d. from the mean.

11. Z–scores and the Empirical Rule:

· ~68% of Z–scores lie between –1 and +1. [Why? Because ~68% of all X–values lie within 1 s.d. of the mean!].

· ~95% of Z–scores lie between –2 and +2. [Why? Because ~95% of all X–values lie within 2 s.d. of the mean!].

· ~99.7% of Z–scores lie between –3 and +3. [Why? Because ~99.7% of all X–values lie within 3 s.d. of the mean!].
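The three percentages in the Empirical Rule can be checked numerically. The sketch below is only a verification aid, assuming Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(0, 1)  # standard Normal: mean 0, s.d. 1

# Area between -k and +k for k = 1, 2, 3 s.d.
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} s.d. of the mean: {area:.2%}")
```

The computed areas are slightly more precise than the rounded 68 / 95 / 99.7 figures, which is why the rule is stated with "~".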

E.g. For X ~ r.v. denoting heights of US males (inches) ~ N(72, 2.5) <–––––– Study this example. You need to understand these translations for your HW!

a height that is Z = +1 s.d. above the mean is 72 + 1·2.5 = 74.5
a height Z = 1.5 s.d. below the mean is 72 – 1.5·2.5 = 68.25
a height of 66 is Z = (X – μ) / σ = (66 – 72)/2.5 = –2.4, i.e. 2.4 s.d. below the mean. AP 5 Note! This height is quite rare: it lies beyond 2 s.d. from the Mean!
a height of 80 is Z = (X – μ) / σ = (80 – 72)/2.5 = +3.2, i.e. 3.2 s.d. above the mean. AP 5 Note! This height is very rare: it lies beyond 3 s.d. from the Mean!
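The two translations in this example (Z–score to height, and height to Z–score) are each one line of arithmetic. A minimal sketch, with function names chosen here purely for illustration:

```python
def z_score(x, mu, sigma):
    """How many s.d. the value x lies from the mean (sign gives direction)."""
    return (x - mu) / sigma

def value_at(z, mu, sigma):
    """The X-value that lies z s.d. from the mean."""
    return mu + z * sigma

mu, sigma = 72, 2.5  # heights of US males (inches), from the example

print(value_at(+1, mu, sigma))    # height 1 s.d. above the mean
print(value_at(-1.5, mu, sigma))  # height 1.5 s.d. below the mean
print(z_score(66, mu, sigma))     # Z-score of a height of 66
print(z_score(80, mu, sigma))     # Z-score of a height of 80
```

The two functions undo each other: value_at(z_score(x, mu, sigma), mu, sigma) gives back x.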

12. An outcome is said to be Rare or Unusual or Extreme if the probability of its occurrence is < 5%. Alternately, Rare Events occur below the 5th or above the 95th percentile. For any distribution, Z–scores beyond 2 are regarded as rare; for Normal distributions, Z–scores beyond 1.645 [in a given direction] are regarded as rare...since they'd occur < 5% of the time. On the other hand, in general, Z–scores closer to 0 (Zero) are less rare since they relate to values or outcomes close to the Mean ~ representative value!
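Where the 1.645 and 2 cut-offs come from can be seen by computing tail areas of the standard Normal curve. This is a sketch for the Normal case only, assuming Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(0, 1)  # standard Normal

one_tail_1645 = 1 - z.cdf(1.645)   # P(Z > 1.645): one tail
two_tail_2 = 2 * (1 - z.cdf(2))    # P(Z < -2 or Z > 2): both tails
cutoff_95 = z.inv_cdf(0.95)        # Z-score at the 95th percentile

print(f"{one_tail_1645:.4f} {two_tail_2:.4f} {cutoff_95:.3f}")
```

Both tail areas come out at or just under 5%, and the 95th percentile sits at a Z–score of about 1.645, which matches the rule of thumb above.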

13. Z–scores are affected by extreme values or outliers since they rely on the Mean and s.d., which, in turn, are influenced by extreme observations.

Comparing distributions in terms of Z–scores, and Percentiles

Given 2 data–sets, we can determine the Percentiles corresponding to a given value, a, for both sets: P1 = P(X < a) and P2 = P(Y < a)

Recommended: for ANY data–set. Percentiles only depend on the relative position of the numbers, not on the values themselves...so they aren't affected by extreme values!

Given only the Mean and s.d. of 2 data–sets, we can determine the Z–scores corresponding to a given value, a, for both sets: Zx = (a – Mean1) / s.d.1 and Zy = (a – Mean2) / s.d.2

Recommended: for ANY symmetric data–set or when the entire data–set is unknown or when only the Mean and s.d. are provided.
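A small sketch of such a comparison follows. All numbers in it are made up for illustration (they do not come from these notes), and the percentile step additionally assumes both data–sets are roughly Normal, which is what lets a Mean and s.d. be turned into a left area.

```python
from statistics import NormalDist

a = 80                  # the value being compared (made-up)
mean_x, sd_x = 70, 10   # data-set 1 summary statistics (made-up)
mean_y, sd_y = 75, 2    # data-set 2 summary statistics (made-up)

# Z-scores: how many s.d. the value a lies from each mean.
z_x = (a - mean_x) / sd_x
z_y = (a - mean_y) / sd_y

# Percentiles P(X < a) and P(Y < a); valid only under the Normal assumption.
p_x = NormalDist(mean_x, sd_x).cdf(a)
p_y = NormalDist(mean_y, sd_y).cdf(a)

print(z_x, z_y, round(p_x, 4), round(p_y, 4))
```

Here the same value is far more exceptional in the second data–set: it has the larger Z–score and the higher percentile, even though it is closer to that set's mean in raw units.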

Example 1
On a certain edition of the SAT, the Math scores were approximately normal with a mean of 500 and a s.d. of 75. For that edition:

(i) Find the percentage of scores between 600 and 700.

(ii) Find the probability that an individual scores above 730.

(iii) Find the proportion of scores below 400.

(iv) A student gets a score of 710. Should one be impressed? Determine this using 4 methods: 2 by calculating probabilities, 2 using Z-scores!

(v) Find the percentile that a score of 780 corresponds to.

(vi) Find the score that corresponds to the 65th percentile.

(vii) Find the cut-off score for the bottom 20% of SAT Math scores.

(viii) Find the score that separates the top 10% from the rest.

(ix) Find the 2 scores that constitute the middle 30% of SAT Math scores. What is the spread of the middle 30% of scores? This is the more general version of the IQR problem, which is the spread of the middle 50% of scores!

(x) Between what 2 scores did virtually all Math SAT scores fall?

(xi) Which score lies 1.5s.d. below the mean?

(xii) What scores lie 1.75 s.d. from the Mean?

(xiii) Use the Empirical Rule. 68% of SAT scores lie between what 2 scores? 95% of SAT scores lie between what 2 scores? 99.7% of SAT scores lie between what 2 scores?

(xiv) What scores might be unusually high? Unusually low?

(xv) Calculate the IQR of SAT scores.

(xvi) If 5895 students took the SAT, how many scores were below 600 or exceeded 700?

Before you attempt the Qs, ask yourself if the Q is

a forwards problem [X known => Normalcdf] or

a backwards problem [X unknown => InvNorm]!

Solution.
Note: I am not too particular about getting the inequality < or ≤ correct, since it isn’t a big deal: P(X < a) = P(X ≤ a).

Let X be a r.v. denoting SAT scores. X ~ N(500, 75).
(i) P(600 < X < 700)

Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded between 600 and 700.

P = Normalcdf(600, 700, 500, 75) = 8.74% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

8.74% of SAT Math scores lie between 600 and 700.

(ii) P(X > 730)

Sketch a figure to label the Mean, 500, and s.d., 75, with a right region suitably shaded above 730.

P = Normalcdf(730, 99999, 500, 75) = 0.1%. [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

The probability that an individual scores above 730 on the Math SAT is 0.1%.

(iii) P(X < 400)
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded to the left of 400.

P = Normalcdf(-99999, 400, 500, 75) = 9.12% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

9.12% of SAT Math scores lie below 400.
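Parts (i) to (iii) are all forward problems, so they can be cross-checked with one short script. This is a sketch using Python's standard statistics module as a stand-in for the calculator; the variable names are illustrative, and the printed values should agree with the answers above up to rounding in the last digit.

```python
from statistics import NormalDist

sat = NormalDist(500, 75)  # X ~ N(500, 75)

p_i = sat.cdf(700) - sat.cdf(600)  # (i)   P(600 < X < 700)
p_ii = 1 - sat.cdf(730)            # (ii)  P(X > 730)
p_iii = sat.cdf(400)               # (iii) P(X < 400)

print(f"(i) {p_i:.2%}  (ii) {p_ii:.2%}  (iii) {p_iii:.2%}")
```

Note that no ±9999 limits are needed here: cdf already returns the whole left area, and 1 – cdf gives the right area.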

(iv) Method I P(X > 710) = 0.255%

Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded.

Since P = 0.255% << 5%, we find the score very impressive!

Method II P(X < 710) = 99.75%

Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded.

Since P = 99.75% >> 95%, we find the score very impressive!

Method III Z = (X – μ) / σ = (710 – 500)/75 = 2.8

Since Z = 2.8 >> 1.645, we find the score very impressive!

Method IV A score that is 1.645s.d. above the mean would be impressive since that would happen < 5% of the time:
µ + 1.645σ = 500 + 1.645· 75 = 623.375

Since 710 >> 623.375, it is very impressive.
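The four methods for (iv) can be placed side by side to see that they all point the same way. A minimal sketch, assuming Python's standard statistics module and using the 5% / 1.645 thresholds stated in these notes; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 500, 75
sat = NormalDist(mu, sigma)
score = 710

p_above = 1 - sat.cdf(score)   # Method I:   P(X > 710)
p_below = sat.cdf(score)       # Method II:  P(X < 710)
z = (score - mu) / sigma       # Method III: Z-score of 710
cutoff = mu + 1.645 * sigma    # Method IV:  score that is 1.645 s.d. above the mean

# Each method's verdict on "is this score impressive?"
impressive = [p_above < 0.05, p_below > 0.95, z > 1.645, score > cutoff]
print(impressive)
```

All four comparisons are equivalent statements about the same right-tail area, so they cannot disagree for a Normal model.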

(v) We need to find what % of scores are < 780:
P(X < 780)

Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded below 780.

P = Normalcdf(-9999, 780, 500, 75) = 99.99% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

A score of 780 corresponds to the 99.99th percentile: 99.99% of SAT Math scores lie at 780 or below.

(vi) We need to find a score, a, such that 65% of SAT scores are < a: this is a backwards problem...so we use the InvNorm command!

Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.65 with a on the axis.

P(X < a) = 0.65 => a = InvNorm(0.65, 500, 75) = 528.89 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

(vii) We need a score a such that P(X < a) = 0.2: this is a backwards problem...so we use the InvNorm command!
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.2 with a on the axis.

P(X < a) = 0.2 => a = InvNorm(0.20, 500, 75) = 436.87 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

(viii) We need a score a such that P(X < a) = 0.9 [since the calculator can only process left Area / %]: this is a backwards problem...so we use the InvNorm command!

Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.9 with a on the axis.

P(X < a) = 0.9 => a = InvNorm(0.90, 500, 75) = 596.11 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
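Parts (vi) to (viii) are backward problems and differ only in the left area supplied. A sketch with Python's standard statistics module standing in for InvNorm (variable names are illustrative); results should match the answers above up to rounding in the last digit.

```python
from statistics import NormalDist

sat = NormalDist(500, 75)  # X ~ N(500, 75)

score_65th = sat.inv_cdf(0.65)     # (vi)   65th percentile
bottom_20 = sat.inv_cdf(0.20)      # (vii)  cut-off for the bottom 20%
top_10 = sat.inv_cdf(1 - 0.10)     # (viii) top 10% means a left area of 0.90

print(round(score_65th, 2), round(bottom_20, 2), round(top_10, 2))
```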

(ix) If 30% of scores are in the middle, then 70% are left over and distributed symmetrically on the left and on the right => we need the 35th percentile and (30 + 35) = 65th percentile.

As in (vi), we need to find 2 scores, a and b, such that 35% of SAT scores are < a and 65% of scores are < b: this is a backwards problem...so we use the InvNorm command!

Sketch 2 figures to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.35 with a on the axis, and with a left region suitably shaded for 0.65 with b on the axis.

P(X < a) = 0.35 => a = InvNorm(0.35, 500, 75) = 471.10 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

P(X < b) = 0.65 => b = InvNorm(0.65, 500, 75) = 528.89 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

The 2 scores that constitute the middle 30% of SAT Math scores are about 471.1 and 528.9. The spread of the middle 30% of scores is 528.9 – 471.1 = 57.8.
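The "middle p%" reasoning generalises: split the leftover area equally between the two tails, then do two backward look-ups. A sketch, assuming Python's standard statistics module; middle_interval is a name made up for this illustration.

```python
from statistics import NormalDist

def middle_interval(fraction, mu, sigma):
    """Return (low, high) enclosing the middle `fraction` of N(mu, sigma)."""
    tail = (1 - fraction) / 2          # area left over in EACH tail
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

low30, high30 = middle_interval(0.30, 500, 75)  # 35th and 65th percentiles
low50, high50 = middle_interval(0.50, 500, 75)  # 25th and 75th percentiles (Q1, Q3)

print(round(low30, 2), round(high30, 2), round(high30 - low30, 2))
print(round(low50, 2), round(high50, 2), round(high50 - low50, 2))
```

With a fraction of 0.50 the same helper returns the quartiles, so the IQR is just the special case of the middle 50%.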

(x) In a Normal distribution, 99.7% of observations lie within 3 s.d. of the Mean [µ ± 3σ]. In this case, virtually all Math SAT scores shall lie within 3 s.d. of 500, i.e. between [500 – 3·75, 500 + 3·75] = [275, 725].

(xi) The SAT score that lies 1.5s.d. below the Mean is µ – 1.5σ = 500 – 1.5(75) = 387.5.

(xii) The 2 scores that lie 1.75 s.d. from the Mean are: µ ± 1.75σ i.e. [500 – 1.75·75, 500 + 1.75·75] = [368.75, 631.25]

(xiii) According to the Empirical Rule,

68% of scores lie within 1s.d. of the mean µ ± 1σ i.e. [500 – 1·75, 500 + 1·75] ~ [425, 575]
95% of scores lie within 2s.d. of the mean µ ± 2σ i.e. [500 – 2·75, 500 + 2·75] ~ [350, 650]
99.7% of scores lie within 3s.d. of the mean µ ± 3σ i.e. [500 – 3·75, 500 + 3·75] ~ [275, 725]

(xiv) Since most [~95%] of scores lie between 350 and 650 [see (xiii)], scores beyond those limits might be regarded as unusually low / high, respectively!

Note! You might also use 99.7% limits of 275 and 725...or you may use the µ ± 1.645σ limits of [376.625, 623.375].

(xv) IQR = 75th percentile – 25th percentile. We need to find 2 scores, a and b, such that 25% of SAT scores are < a and 75% of scores are < b: this is a backwards problem...so we use the InvNorm command!

Sketch 2 figures to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.25 with a on the axis, and with a left region suitably shaded for 0.75 with b on the axis.

P(X < a) = 0.25 => a = InvNorm(0.25, 500, 75) = 449.4133 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

P(X < b) = 0.75 => b = InvNorm(0.75, 500, 75) = 550.5867 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]

The IQR of 550.5867 – 449.4133 = 101.1734 is the spread of the middle 50% of scores

(xvi) First, we find: P(X < 600) = 90.88% [Sketch a figure to label the Mean, 500, and s.d., 75 and shade suitably...]

Next, P(X > 700) = 0.38%

Required P = 90.88% + 0.38% = 91.26%

If 5895 students took the SAT, then 91.26% · 5895 ≈ 5380 scored below 600 or above 700.
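Turning a probability into a head-count is one multiplication by the group size. The sketch below uses the 5895 students stated in the question and Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(500, 75)  # X ~ N(500, 75)
n_students = 5895          # group size stated in part (xvi)

# "Below 600 OR above 700": the two regions don't overlap, so the areas add.
p_either = sat.cdf(600) + (1 - sat.cdf(700))
expected_count = p_either * n_students

print(f"{p_either:.2%} of {n_students} is about {round(expected_count)} students")
```

As a cross-check, this probability is the complement of part (i): 1 – P(600 < X < 700).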

Example 2
The blood glucose level (BGL) of 1200 patients being tested for diabetes (under the age of 50) was found (after a 12–hour fast), to be roughly normal with a mean of 85 mg of glucose per deciliter of blood & s.d 25 mg/dl.

1. Between what BGLs would most of the observations lie? Write a simple sentence to clearly explain your choice.

2. How many patients have BGLs below 10mg/dl?

3. A BGL of 141.25 mg/dl is how many s.d. from the mean?
4. What BGL corresponds to the 16th percentile? Interpret this.
5. Calculate and interpret the Z–score corresponding to a BGL of 41.25 mg/dl.
6. How many of the patients had BGLs less than 50 mg/dl or more than 105 mg/dl?
7. Calculate the IQR of the BGLs. Interpret this in simple English.
8a) What interval of BGLs captures all observations lying within 2.75 s.d. of the mean?

Solution.
Let X be a r.v. denoting BGLs (in mg/dl).
X ~ N(85, 25).

1. Since BGLs are normally distributed, most [~95%] of the BGLs shall lie within 2 s.d. of the mean BGL of 85: i.e. between 85 – 2·25 = 35 mg/dl and 85 + 2·25 = 135 mg/dl.

2. Note: we used common–sense to resolve the Q: if we knew what % of individuals, in general, had a BGL of < 10, then observing that there are 1200 of them, we could determine the Number of individuals satisfying the condition!

P(X < 10) = 0.001349

– Sketch, label and shade a figure to illustrate the situation.

Therefore, 0.1349%·1200 = 1.6 ~ About 2 individuals would have a BGL below 10mg/dl.

3. The phrasing reveals it to be Z–score problem!

Z = (X – μ) / σ = (141.25 – 85)/25 = 2.25

A BGL of 141.25mg/dl is 2.25 s.d. above the mean BGL of 85.

4. P(X < a) = 0.16

– Sketch, label and shade a figure to illustrate the situation.

a = 60.14

A BGL of 60.14mg/dl corresponds to the 16th percentile indicating that 16% of the patients had a BGL of [or 16% of the BGLs were] 60.14mg/dl or less.

5. Z = (X – μ) / σ = (41.25 – 85)/25 = –1.75

A BGL of 41.25mg/dl lies 1.75s.d. below the Mean BGL of 85mg/dl.

6. P(X < 50 or X > 105)
= P(X < 50) + P(X > 105)
= 8.076% + 21.186%
= 29.26%

– Sketch, label and shade a figure to illustrate the situation.

Therefore, 29.26%·1200 = 351.12, so about 351 patients would have a BGL below 50 mg/dl or above 105 mg/dl.

7. P(X < Q1) = 0.25 and P(X < Q3) = 0.75

– Sketch and label a figure to illustrate the situation.

Q1 = 68.14
Q3 = 101.86

The IQR, the range of the middle 50% of BGLs, is Q3 – Q1 = 33.72 mg/dl.

8a) An interval of [85 – 2.75·25, 85 + 2.75·25] = [16.25, 153.75] would capture all BGLs within 2.75s.d. from the mean of 85mg/dl.

b) P(16.25 < X < 153.75) = 99.4%

– Sketch, label and shade a figure to illustrate the situation.

Therefore, 99.4%·1200 ~ 1193 individuals have a BGL between 16.25 and 153.75 mg/dl.

c) An interval of [85 – 1.75·25, 85 + 1.75·25] = [41.25, 128.75] would capture all BGLs within 1.75s.d. from the mean of 85mg/dl.

Method I

P(X < 41.25) + P(X > 128.75) = 4% + 4% [can you see why?!] = 8%

– Sketch, label and shade a figure to illustrate the situation.

Method II

P(X < 41.25) + P(X > 128.75)
= 1 – P(41.25 < X < 128.75)
= 8%

– Sketch, label and shade a figure to illustrate the situation.

Therefore, 8% of individuals have a BGL more than 1.75 s.d. from the mean, i.e. outside [41.25, 128.75].
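Methods I and II must give the same answer, since the two tails and the middle region together make up the whole curve. A quick numerical check, assuming Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 85, 25
bgl = NormalDist(mu, sigma)  # X ~ N(85, 25), BGL in mg/dl

low, high = mu - 1.75 * sigma, mu + 1.75 * sigma   # 41.25 and 128.75

tails = bgl.cdf(low) + (1 - bgl.cdf(high))         # Method I: add both tails
complement = 1 - (bgl.cdf(high) - bgl.cdf(low))    # Method II: 1 - middle area

print(f"{tails:.2%} {complement:.2%}")
```

By symmetry each tail holds half of the total, which is why Method I can write the sum as 4% + 4%.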

9. As in 4. Do it yourselves!

– Interpret the Q in Probability Notation.
– Sketch, label and shade a figure to illustrate the situation.