Normal Distributions
1. Normal distributions are bell–shaped (unimodal and symmetric) and continuous. They're only mathematical approximations of Histograms.
Since Normal distributions are smooth, continuous curves, P(X = a) = 0 [the area under the line x = a is ZERO!] so that, theoretically, P(X ≥ a) = P(X > a).
That is why the calculator command doesn’t make a distinction between P(X > 72in) and P(X ≥ 72in)!
2. All Normal Distributions are uniquely determined by
their Mean [~ Median, by the way] and s.d.
3. Forward problem: When X is known and the probability, P, is asked.
Use 2nd + VARS [DISTR command] –> OPTION 2: Normalcdf(Left Limit, Right Limit, μ, σ) if
– the X–value [variable] is given / known
– a Proportion / % / Probability / Area / Percentile is asked.
– Use proper Notation: P(X > a) or P(X < b) or P(c < X < d)
– Specific commands:
· For P(X > a), use Normalcdf(a, 9999, μ, σ)
· For P(X < b), use Normalcdf(–9999, b, μ, σ)
· For P(c < X < d), use Normalcdf(c, d, μ, σ)
How do we know if the Normalcdf command is to be used? When X is known and the probability, P, is asked.
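If you'd like to check a forward problem without the calculator, here is a minimal Python sketch. It is not the calculator's own routine: the helper name normalcdf is made up here to mirror the command, it relies on NormalDist from Python's standard library, and μ = 500, σ = 75 are illustrative values assumed only for this demonstration.

```python
from statistics import NormalDist

def normalcdf(left, right, mu, sigma):
    """Area under the N(mu, sigma) curve between left and right (hypothetical helper)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(right) - dist.cdf(left)

INF = float("inf")  # plays the role of the calculator's 9999 / -9999

mu, sigma = 500, 75  # illustrative values, assumed for this sketch

p_above = normalcdf(600, INF, mu, sigma)    # P(X > 600)
p_below = normalcdf(-INF, 400, mu, sigma)   # P(X < 400)
p_between = normalcdf(400, 600, mu, sigma)  # P(400 < X < 600)

print(f"P(X > 600)       = {p_above:.4f}")
print(f"P(X < 400)       = {p_below:.4f}")
print(f"P(400 < X < 600) = {p_between:.4f}")
```

The three areas cover the whole curve, so they should add up to 1: a handy sanity check.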
4. Backward problem: When the probability, P, is known, and the X–value is asked.
Use 2nd + VARS [DISTR command] –> OPTION 3: InvNorm(Left Area in decimals, μ, σ) if
– a Proportion / % / Probability / Area / Percentile is given / known
– the X–value [variable] is unknown / asked.
– Use proper notation: P(X < a) = p%
How do we know if the InvNorm command is to be used? When the probability, P, is known, and the X–value is asked.
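The backward problem can be sketched the same way in Python. Again, this is only an illustration: the helper name invnorm is invented to mirror the calculator command, it wraps NormalDist.inv_cdf from the standard library, and μ = 500, σ = 75 are assumed values.

```python
from statistics import NormalDist

def invnorm(left_area, mu, sigma):
    """X-value with the given left area (a decimal strictly between 0 and 1) under N(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(left_area)

mu, sigma = 500, 75  # illustrative values, assumed for this sketch

x = invnorm(0.90, mu, sigma)          # value with 90% of the area to its left
check = NormalDist(mu, sigma).cdf(x)  # forward problem on the answer

print(f"x = {x:.2f}, P(X < x) = {check:.2f}")
```

Running the forward problem on the answer should return the left area you started with, which shows that the two commands undo each other.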
IMPORTANT!
5. The terms Probability, Proportion, Percentage, Relative Frequency are used interchangeably and refer to the same idea.
6. Understand what Percentile means:
If an X–value corresponds to the p–th percentile, then p% of ALL values [incomes, weights, heights, length of pregnancies, etc.] are < that X–value.
p% = P(X < X–value)
Percentiles denote the left–area of the distribution.
IMPORTANT!
7. A problem that asks for the Percentile is a forward problem [because a Proportion / Probability / Percentage / left Area is sought, for a given X–value].
A problem that gives a Percentile involves a backward problem [because an X–value is sought for a given Proportion / Probability / Percentage / left Area].
8. The Expectations while solving any problem are:
· Define a variable – say, X – and state its distribution [Centre / Shape / Spread: X ~ N(μ, σ)]. This is NOT optional!
· Use Probability Notation to describe the Q. This is NOT optional!
· Draw a graph, label and shade the appropriate region. This is NOT optional!
· Use a calculator command to solve the problem: it is NOT required to write down the command!
IMPORTANT! Note: At times, you may prefer to visualize / illustrate the problem before you interpret the Q in probability notation. That's OK!
9. Z–scores give the number of s.d. an X–value is from the mean: Z = (X – μ) / σ.
Clearly, values above the Mean have positive Z–scores; values below the Mean have negative Z–scores.
10. Power Tip! Students, at times, wonder when to use Z–scores. Here's a good Rule of Thumb: use them when the Q uses the phrase "s.d. from the mean".
11. Z–scores and the Empirical Rule:
· ~68% of Z–scores lie between –1 and +1. [Why? Because ~68% of all X–values lie within 1 s.d. of the mean!].
· ~95% of Z–scores lie between –2 and +2. [Why? Because ~95% of all X–values lie within 2 s.d. of the mean!].
· ~99.7% of Z–scores lie between –3 and +3. [Why? Because ~99.7% of all X–values lie within 3 s.d. of the mean!].
E.g. For X ~ r.v. denoting heights of US males (inches) ~ N(72, 2.5) <–––––– Study this example. You need to understand these translations for your HW!
a height that is Z = +1 s.d. above the mean is 72 + 1·2.5 = 74.5
a height that is Z = 1.5 s.d. below the mean is 72 – 1.5·2.5 = 68.25
a height of 66 is Z = (X – μ) / σ = (66 – 72)/2.5 = –2.4 ~ 2.4 s.d. below the mean. AP 5 Note! This height is quite rare: it lies beyond 2 s.d. from the Mean!
a height of 80 is Z = (X – μ) / σ = (80 – 72)/2.5 = +3.2 ~ 3.2 s.d. above the mean. AP 5 Note! This height is very rare: it lies beyond 3 s.d. from the Mean!
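The translations above are plain arithmetic, so they are easy to verify. A small Python sketch, with helper names (z_score, x_from_z) chosen here for illustration and the heights model N(72, 2.5) taken from the example:

```python
def z_score(x, mu, sigma):
    """Number of s.d. that x lies from the mean (negative means below the mean)."""
    return (x - mu) / sigma

def x_from_z(z, mu, sigma):
    """X-value that lies z s.d. from the mean."""
    return mu + z * sigma

mu, sigma = 72, 2.5  # heights of US males (inches), from the example

print(x_from_z(1, mu, sigma))     # 74.5   (1 s.d. above the mean)
print(x_from_z(-1.5, mu, sigma))  # 68.25  (1.5 s.d. below the mean)
print(z_score(66, mu, sigma))     # -2.4   (66 - 72) / 2.5
print(z_score(80, mu, sigma))     # 3.2    (80 - 72) / 2.5
```

Going from Z to X and from X to Z are inverse operations: x_from_z(z_score(x, μ, σ), μ, σ) returns x.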
12. An outcome is said to be Rare or Unusual or Extreme if the probability of its occurrence is < 5%. Alternately, Rare Events occur beyond the 5th or 95th percentiles. For any distribution, Z–scores beyond 2 are regarded as rare; for Normal distributions, Z–scores beyond 1.645 (in one direction) are regarded as rare, since they'd occur < 5% of the time. On the other hand, in general, Z–scores closer to 0 (Zero) are less rare, since they relate to values or outcomes close to the Mean ~ representative value!
13. Z–scores are affected by extreme values or outliers, since they rely on the Mean and s.d., which, in turn, are influenced by extreme observations.
Comparing distributions in terms of Z–scores and Percentiles
Given 2 data–sets, we can determine the Percentiles corresponding to a given value, a, for both sets: P1 = P(X < a) and P2 = P(Y < a)
Recommended: for ANY data–set. Percentiles only depend on the relative position of the numbers, not on the values themselves...so they aren't affected by extreme values!
Given only the Mean and s.d. of 2 data–sets, we can determine the Z–scores corresponding to a given value, a, for both sets: Zx = (a – Mean1) / s.d.1 and Zy = (a – Mean2) / s.d.2
Recommended: for ANY symmetric data–set, or when the entire data–set is unknown, or when only the Mean and s.d. are provided.
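To see the difference between the two comparisons, here is a small Python sketch. Both data–sets are made up purely for illustration (set_y contains a deliberate outlier, 120), and the helper names are hypothetical.

```python
from statistics import mean, stdev

def z_score(a, data):
    """Z-score of the value a relative to a data-set (uses the sample s.d.)."""
    return (a - mean(data)) / stdev(data)

def percentile_of(a, data):
    """Percentage of the data-set lying strictly below a."""
    return 100 * sum(v < a for v in data) / len(data)

# Made-up data-sets, assumed only for this sketch
set_x = [52, 55, 58, 60, 61, 63, 65, 70]
set_y = [40, 45, 50, 55, 60, 75, 90, 120]  # 120 is an extreme value
a = 62

for name, data in (("X", set_x), ("Y", set_y)):
    print(f"{name}: percentile = {percentile_of(a, data):.1f}, Z = {z_score(a, data):+.2f}")
```

In this made-up case the value 62 sits at the same percentile in both sets, yet its Z–score is positive in one and negative in the other, because the extreme value pulls the Mean of set_y upwards.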
Example 1
On a certain edition of the SAT, the Math scores were approximately normal with a mean of 500 and a s.d. of 75. For that edition:
(i) Find the percentage of scores between 600 and 700.
(ii) Find the probability that an individual scores above 730.
(iii) Find the proportion of scores below 400.
(iv) A student gets a score of 710. Should one be impressed? Determine this using 4 methods: 2 by calculating probabilities, 2 using Z-scores!
(v) Find the percentile that a score of 780 corresponds to.
(vi) Find the score that corresponds to the 65th percentile.
(vii) Find the cut-off score that marks the bottom 20% of SAT Math scores.
(viii) Find the score that separates the top 10% from the rest.
(ix) Find the 2 scores that constitute the middle 30% of SAT Math scores. What is the spread of the middle 30% of scores? This is the more general version of the IQR problem, which is the spread of the middle 50% of scores!
(x) Between what 2 scores did virtually all Math SAT scores fall?
(xi) Which score lies 1.5s.d. below the mean?
(xii) What scores lie 1.75 s.d. from the Mean?
(xiii) Use the Empirical Rule. 68% of SAT scores lie between what 2 scores? 95% of SAT scores lie between what 2 scores? 99.7% of SAT scores lie between what 2 scores?
(xiv) What scores might be unusually high? Unusually low?
(xv) Calculate the IQR of SAT scores.
(xvi) If 5895 students took the SAT, how many scores were below 600 or exceeded 700?
Before you attempt the Qs, ask yourself if the Q is
a forwards problem [X known => Normalcdf] or
a backwards problem [X unknown => InvNorm]!
Solution.
Note: I am not too particular about getting the inequality < or ≤ correct…it isn’t a big deal, since P(X ≤ a) = P(X < a)
Let X be a r.v. denoting SAT Math scores. X ~ N(500, 75).
(i) P(600 < X < 700)
Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded between 600 and 700.
P = Normalcdf(600, 700, 500, 75) = 8.74% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
8.74% of SAT Math scores lie between 600 and 700.
(ii) P(X > 730)
Sketch a figure to label the Mean, 500, and s.d., 75, with a right region suitably shaded above 730.
P = Normalcdf(730, 99999, 500, 75) = 0.1%. [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
The probability that an individual scores above 730 on the Math SAT is about 0.1%.
(iii) P(X < 400)
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded to the left of 400.
P = Normalcdf(-99999, 400, 500, 75) = 9.12% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
9.12% of SAT Math scores lie below 400.
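Parts (i) to (iii) are all forward problems, so they can be cross–checked in a few lines of Python. This is just a sketch using NormalDist from the standard library in place of the calculator; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=75)  # X ~ N(500, 75), as defined in the solution

p_i = sat.cdf(700) - sat.cdf(600)   # (i)   P(600 < X < 700)
p_ii = 1 - sat.cdf(730)             # (ii)  P(X > 730)
p_iii = sat.cdf(400)                # (iii) P(X < 400)

print(f"(i)   P(600 < X < 700) = {p_i:.2%}")
print(f"(ii)  P(X > 730)       = {p_ii:.2%}")
print(f"(iii) P(X < 400)       = {p_iii:.2%}")
```

A right area is computed as 1 minus the left area, which is what the calculator's large upper limit accomplishes.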
(iv) Method I: P(X > 710) = 0.26%
Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded.
Since P = 0.26% << 5%, we find the score very impressive!
Method II: P(X < 710) = 99.74%
Sketch a figure to label the Mean, 500, and s.d., 75, with a region suitably shaded.
Since P = 99.74% >> 95%, we find the score very impressive!
Method III: Z = (X – μ) / σ = (710 – 500)/75 = 2.8
Since Z = 2.8 >> 1.645, we find the score very impressive!
Method IV: A score that is 1.645 s.d. above the mean would be impressive, since that would happen < 5% of the time: µ + 1.645σ = 500 + 1.645·75 = 623.375
Since 710 >> 623.375, it is very impressive.
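The four methods in (iv) can be laid side by side in a short Python sketch. The 5% and 1.645 thresholds are the ones used in these notes; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, score = 500, 75, 710
sat = NormalDist(mu, sigma)

p_above = 1 - sat.cdf(score)   # Method I:   right area, compare with 5%
p_below = sat.cdf(score)       # Method II:  left area (percentile), compare with 95%
z = (score - mu) / sigma       # Method III: Z-score, compare with 1.645
cutoff = mu + 1.645 * sigma    # Method IV:  score lying 1.645 s.d. above the mean

print(f"Method I:   P(X > 710) = {p_above:.2%}  -> rare? {p_above < 0.05}")
print(f"Method II:  P(X < 710) = {p_below:.2%}  -> rare? {p_below > 0.95}")
print(f"Method III: Z = {z:.2f}  -> rare? {z > 1.645}")
print(f"Method IV:  cut-off = {cutoff:.3f}  -> rare? {score > cutoff}")
```

All four comparisons are the same test in different units (probability vs. s.d. vs. raw score), so they always agree.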
(v) We need to find what % of scores are < 780: P(X < 780)
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded below 780.
P = Normalcdf(-9999, 780, 500, 75) = 99.99% [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
A score of 780 corresponds to the 99.99th percentile: 99.99% of SAT Math scores lie at 780 or below.
(vi) We need to find a score, a, such that 65% of SAT scores are < a: this is a backwards problem...so we use the InvNorm command!
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.65 with a on the axis.
P(X < a) = 0.65 => a = InvNorm(0.65, 500, 75) = 528.90 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
(vii) We need a score a such that P(X < a) = 0.2: this is a backwards problem...so we use the InvNorm command!
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.2 with a on the axis.
P(X < a) = 0.2 => a = InvNorm(0.20, 500, 75) = 436.88 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
(viii) The score that separates the top 10% from the rest has 90% of scores below it, so we need a score a such that P(X < a) = 0.9 [since the calculator can only process left Area / %]: this is a backwards problem...so we use the InvNorm command!
Sketch a figure to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.9 with a on the axis.
P(X < a) = 0.9 => a = InvNorm(0.90, 500, 75) = 596.12 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
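Parts (vi) to (viii) are backward problems. A minimal Python check, using NormalDist.inv_cdf from the standard library as a stand–in for InvNorm (the variable names are illustrative):

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=75)  # X ~ N(500, 75)

p65 = sat.inv_cdf(0.65)        # (vi)   65th percentile
p20 = sat.inv_cdf(0.20)        # (vii)  cut-off for the bottom 20%
top10 = sat.inv_cdf(1 - 0.10)  # (viii) top 10% starts where the bottom 90% ends

print(f"(vi)   65th percentile = {p65:.2f}")
print(f"(vii)  bottom 20% ends = {p20:.2f}")
print(f"(viii) top 10% starts  = {top10:.2f}")
```

Note how (viii) converts the given right area (10%) into a left area (90%) before inverting, exactly as the calculator requires.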
(ix) If 30% of scores are in the middle, then 70% are left over and distributed symmetrically on the left and on the right [35% on each side] => we need the 35th percentile and the (35 + 30) = 65th percentile.
As in (vi), we need to find 2 scores, a and b, such that 35% of SAT scores are < a and 65% of scores are < b: this is a backwards problem...so we use the InvNorm command!
Sketch 2 figures to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.35 with a on the axis, and with a left region suitably shaded for 0.65 with b on the axis.
P(X < a) = 0.35 => a = InvNorm(0.35, 500, 75) = 471.10 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
P(X < b) = 0.65 => b = InvNorm(0.65, 500, 75) = 528.90 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
The 2 scores that bound the middle 30% of SAT Math scores are about 471 and 529. The spread of the middle 30% of scores is 528.90 – 471.10 = 57.80.
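The "middle p%" idea generalises nicely: split the leftover area equally between the two tails, then invert both percentiles. A Python sketch, where the helper name middle_interval is invented for illustration:

```python
from statistics import NormalDist

def middle_interval(fraction, dist):
    """Return (a, b) bounding the middle `fraction` of a Normal distribution."""
    tail = (1 - fraction) / 2  # leftover area, split equally between both tails
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

sat = NormalDist(mu=500, sigma=75)  # X ~ N(500, 75)

a, b = middle_interval(0.30, sat)   # 35th and 65th percentiles
print(f"middle 30%: {a:.2f} to {b:.2f}, spread = {b - a:.2f}")

q1, q3 = middle_interval(0.50, sat)  # 25th and 75th percentiles
print(f"middle 50%: {q1:.2f} to {q3:.2f}, IQR = {q3 - q1:.2f}")
```

With fraction = 0.50 the same helper returns Q1 and Q3, so the IQR is just the special case mentioned in the question.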
(x) In a Normal distribution, since 99.7% of observations lie within 3 s.d. of the Mean [µ ± 3σ], in this case, virtually all Math SAT scores shall lie within 3 s.d. of 500, i.e. between [500 – 3·75, 500 + 3·75] = [275, 725]
(xi) The SAT score that lies 1.5s.d. below the Mean is µ – 1.5σ = 500 – 1.5(75) = 387.5.
(xii) The 2 scores that lie 1.75 s.d. from the Mean are: µ ± 1.75σ i.e. [500 – 1.75·75, 500 + 1.75·75] = [368.75, 631.25]
(xiii) According to the Empirical Rule,
68% of scores lie within 1s.d. of the mean µ ± 1σ i.e. [500 – 1·75, 500 + 1·75] ~ [425, 575]
95% of scores lie within 2s.d. of the mean µ ± 2σ i.e. [500 – 2·75, 500 + 2·75] ~ [350, 650]
99.7% of scores lie within 3s.d. of the mean µ ± 3σ i.e. [500 – 3·75, 500 + 3·75] ~ [275, 725]
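The Empirical Rule percentages are rounded values of exact Normal areas. A short Python sketch that builds the µ ± kσ intervals and computes the area inside each (the dictionary name within is illustrative):

```python
from statistics import NormalDist

mu, sigma = 500, 75
sat = NormalDist(mu, sigma)

within = {}  # k -> area within k s.d. of the mean
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    within[k] = sat.cdf(high) - sat.cdf(low)
    print(f"within {k} s.d.: [{low}, {high}] holds {within[k]:.2%} of scores")
```

The areas come out close to 68%, 95% and 99.7%, which is why the rule is stated with a "~".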
(xiv) Since most [~95%] of scores lie between 350 and 650 [see (xiii)], scores beyond those limits might be regarded as unusually low / high, respectively!
Note! You might also use the 99.7% limits of 275 and 725...or you may use the µ ± 1.645σ limits of [376.625, 623.375].
(xv) IQR = 75th – 25th percentiles. We need to find 2 scores, a and b, such that 25% of SAT scores are < a and 75% of scores are < b: this is a backwards problem...so we use the InvNorm command!
Sketch 2 figures to label the Mean, 500, and s.d., 75, with a left region suitably shaded for 0.25 with a on the axis, and with a left region suitably shaded for 0.75 with b on the axis.
P(X < a) = 0.25 => a = InvNorm(0.25, 500, 75) = 449.4133 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
P(X < b) = 0.75 => b = InvNorm(0.75, 500, 75) = 550.5867 [Note: it is NOT required to write down the command! I have done so for illustrative purposes only!]
The IQR = 550.5867 – 449.4133 = 101.1734 is the spread of the middle 50% of scores.
(xvi) First, we find: P(X < 600) = 90.88% [Sketch a figure to label the Mean, 500, and s.d., 75 and shade suitably...]
Next, P(X > 700) = 0.38%
Required P = 90.88% + 0.38% = 91.26%
If 5895 students took the SAT, then 91.26% · 5895 ≈ 5380 scored below 600 or above 700.
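Turning a probability into a head–count is one multiplication: expected count = P · (number of students). A Python sketch for (xvi), using the 5895 students stated in the question (variable names are illustrative):

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=75)  # X ~ N(500, 75)
n_students = 5895                   # from the question

p_below_600 = sat.cdf(600)
p_above_700 = 1 - sat.cdf(700)
p_either = p_below_600 + p_above_700  # the two regions do not overlap, so we add

count = round(p_either * n_students)
print(f"P = {p_either:.2%}, expected count = {count}")
```

The two areas may be added only because "below 600" and "above 700" cannot happen together.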
Example 2
The blood glucose level (BGL) of 1200 patients being tested for diabetes (under the age of 50) was found (after a 12–hour fast) to be roughly normal with a mean of 85 mg of glucose per deciliter of blood & s.d. 25 mg/dl.
1. Between what BGLs would most of the observations lie? Write a simple sentence to clearly explain your choice.
2. How many patients have BGLs below 10mg/dl?
3. A BGL of 141.25 mg/dl is how many s.d. from the mean?
4. What BGL corresponds to the 16th percentile? Interpret this.
5. Calculate and interpret the Z–score corresponding to a BGL of 41.25 mg/dl.
6. How many of the patients had BGLs less than 50 mg/dl or more than 105 mg/dl?
7. Calculate the IQR of the BGLs. Interpret this in simple English.
8a) What interval of BGLs captures all observations lying within 2.75 s.d. of the mean?
Solution.
Let X be a r.v. denoting BGLs (in mg/dl). X ~ N(85, 25).
1. Since BGLs are normally distributed, most [95%] of the BGLs shall lie within 2 s.d. of the mean BGL of 85: i.e. between 85 – 2·25 = 35mg/dl and 85 + 2·25 = 135mg/dl.
2. Note: we used common–sense to resolve the Q: if we knew what % of individuals, in general, had a BGL of < 10, then, knowing that there are 1200 of them, we could determine the Number of individuals satisfying the condition!
P(X < 10) = 0.001349
– Sketch, label and shade a figure to illustrate the situation.
Therefore, 0.1349%·1200 = 1.6 ~ About 2 individuals would have a BGL below 10mg/dl.
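The "find the %, then multiply by the group size" reasoning can be sketched in Python as follows. NormalDist comes from the standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

bgl = NormalDist(mu=85, sigma=25)  # X ~ N(85, 25), BGLs in mg/dl
n_patients = 1200                  # from the question

p_below_10 = bgl.cdf(10)               # forward problem: P(X < 10)
expected = p_below_10 * n_patients     # expected number of patients

print(f"P(X < 10) = {p_below_10:.6f}")
print(f"expected patients = {expected:.2f}")
```

The expected count is a fraction of a person, so the final answer is reported as a whole number of patients.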
3. The phrasing reveals it to be a Z–score problem!
Z = (X – μ) / σ = (141.25 – 85)/25 = 2.25
A BGL of 141.25mg/dl is 2.25 s.d. above the mean BGL of 85.
4. P(X < a) = 0.16
– Sketch, label and shade a figure to illustrate the situation.
a = 60.14
A BGL of 60.14mg/dl corresponds to the 16th percentile, indicating that 16% of the patients had a BGL of [or 16% of the BGLs were] 60.14mg/dl or less.
5. Z = (X – μ) / σ = (41.25 – 85)/25 = –1.75
A BGL of 41.25mg/dl lies 1.75 s.d. below the Mean BGL of 85mg/dl.
6. P(X < 50 or X > 105) = P(X < 50) + P(X > 105) ≈ 8.08% + 21.19% ≈ 29.26%
– Sketch, label and shade a figure to illustrate the situation.
Therefore, 29.26%·1200 = 351.12. About 351 patients would have a BGL below 50mg/dl or above 105mg/dl.
7. P(X < Q1) = 0.25 and P(X < Q3) = 0.75
– Sketch and label a figure to illustrate the situation.
Q1 = 68.14
Q3 = 101.86
The IQR, the range of the middle 50% of BGLs, is Q3 – Q1 = 33.72 mg/dl.
8a) An interval of [85 – 2.75·25, 85 + 2.75·25] = [16.25, 153.75] would capture all BGLs within 2.75 s.d. of the mean of 85mg/dl.
b) P(16.25 < X < 153.75) = 99.4%
– Sketch, label and shade a figure to illustrate the situation.
Therefore, 99.4%·1200 ~ 1193 individuals have a BGL between 16.25 and 153.75mg/dl.
c) An interval of [85 – 1.75·25, 85 + 1.75·25] = [41.25, 128.75] would capture all BGLs within 1.75 s.d. of the mean of 85mg/dl.
Method I
P(X < 41.25) + P(X > 128.75) = 4% + 4% [can you see why?!] = 8%
– Sketch, label and shade a figure to illustrate the situation.
Method II
P(X < 41.25) + P(X > 128.75) = 1 – P(41.25 < X < 128.75) = 8%
– Sketch, label and shade a figure to illustrate the situation.
Therefore, 8% of individuals have a BGL more than 1.75 s.d. from the mean [i.e. outside the interval]; the remaining 92% lie within it.
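Parts 8a) to c) all revolve around the interval µ ± kσ. A Python sketch, with the helper name within invented for illustration, that returns the interval together with the area inside it:

```python
from statistics import NormalDist

bgl = NormalDist(mu=85, sigma=25)  # X ~ N(85, 25), BGLs in mg/dl
n_patients = 1200

def within(k, dist):
    """Interval mean ± k s.d. and the area of the distribution inside it."""
    low, high = dist.mean - k * dist.stdev, dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)

low, high, p_in = within(2.75, bgl)
print(f"within 2.75 s.d.: [{low}, {high}], P = {p_in:.2%}, "
      f"about {round(p_in * n_patients)} patients")

low2, high2, p_in2 = within(1.75, bgl)
p_out2 = 1 - p_in2  # Method II: area outside = 1 - area inside
print(f"within 1.75 s.d.: [{low2}, {high2}], inside = {p_in2:.2%}, outside = {p_out2:.2%}")
```

By symmetry the outside area splits equally between the two tails, which is why Method I adds 4% + 4%.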
9. As in 4. Do it yourselves!
– Interpret the Q in Probability Notation.
– Sketch, label and shade a figure to illustrate the situation.