NOTES on Correlation, describing a Scatterplot
When "interpreting" the association between 2 variables via a scatterplot, it is critical to employ "weasel words" such as in general, on average and tends to.
Why?
Because if you don't, you're claiming that there are absolutely NO COUNTEREXAMPLES and that the relationship described is strictly / always true...which is very rarely the case! For instance, this is from a UCLA statistics curriculum: there is a moderate linear relationship between writing scores and reading scores with a positive association, suggesting that those with higher reading scores TENDED to have higher writing scores...specifically, as reading scores rose from ~30-75, the writing scores rose from ~30-68.
Why is the phrase "tended to" important?
Because if you didn't include it, you'd be stating that ALWAYS and WITHOUT EXCEPTION individuals with higher reading scores had higher writing scores! But that's clearly not the case from examining the scatterplot: there are numerous instances of individuals with higher reading scores with THE SAME or LOWER writing scores!
Basic expectations for the CSET:
describe bivariate relationships
interpret the correlation coefficient
know and apply the properties of the correlation coefficient
calculate the Line of Best Fit and make predictions
How to describe bivariate relationships?
In terms of Strength, Form, Association and Clusters / Outliers, if any.
Strength: indicates whether the relationship is Strong, Moderate or Weak
Form: indicates whether the relationship is Linear or non-Linear
Association: indicates whether the direction of the relationship is Positive or Negative or Constant over specific domains of the X-variable
MADLIBS:
The relationship between [Y] and [X] is [STRENGTH: moderately? strongly? weakly?] [FORM: linear? nonlinear?] with a [ASSOCIATION: positive? negative?] association, which means that as [X] increases, [Y] [increases? decreases?], on average.
If applicable, the value of ([X], [Y]) is an outlier since it falls outside the overall pattern of the distribution. There are clusters [between X values from #-# and another from #-#].
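The template above can also be filled in mechanically. The following is only a minimal sketch: the function name describe_relationship and its arguments are made up for illustration, and it covers just the first sentence of the template (no outliers or clusters).

```python
def describe_relationship(y, x, strength, form, association):
    """Fill in the first sentence of the template; all arguments are plain strings."""
    # A positive association means Y tends to increase with X; negative, decrease.
    direction = {"positive": "increases", "negative": "decreases"}[association]
    return (f"The relationship between {y} and {x} is {strength} {form} "
            f"with a {association} association, which means that as {x} "
            f"increases, {y} {direction}, on average.")


sentence = describe_relationship(
    y="Calories Consumed", x="Time at the Table",
    strength="moderately", form="linear", association="negative")
print(sentence)
```

Note that the weasel words "on average" are built into the sentence, so they cannot be forgotten.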
Example. The relationship between Number of Calories Consumed and Time at the Table by toddlers is moderately linear with a negative association: as the amount of time spent by toddlers rose from ~20 to ~50min, in general, the calories consumed fell from ~525 to ~400cal.
Example. The relationship between Income and Age is strongly linear with the association being positive from 15-55 years: as individuals grew older from 15 to 55 years, their incomes tended to rise from $12,000 to $62,000. From 55 to 65 years, the relationship was moderately linear with negative association: as individuals grew older from 55 to 65 years, their incomes tended to decline from $62,000 to $58,000. Beyond 65 years, incomes were relatively flat at about $58,000. The income of $55,000 for a 45 year old was an outlier.
Bottom line: you're describing the relationship between Y and X in detail -- you write what you see!
What is the correlation coefficient? How to interpret the correlation coefficient, r?
The correlation coefficient is a measure of how close the points of the scatterplot are to the Line of Best Fit, i.e. it measures the strength (and direction) of the linear relationship between the 2 variables.
There are 2 components to r:
Number: close to ±1 → strong linear relationship; close to 0 → weak linear relationship
Sign [positive or negative association].
In general, both in the positive and negative direction:
0.8 < |r| < 1 ~ VERY STRONG
0.6 < |r| < 0.8 ~ MODERATELY STRONG
0.4 < |r| < 0.6 ~ MODERATELY WEAK
0 < |r| < 0.4 ~ VERY WEAK
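The rule of thumb above can be written as a small lookup. This is only a sketch: the function name describe_r is made up, and the notes do not say which band a value exactly on a cut-off (say |r| = 0.6) belongs to, so assigning it to the stronger band is an assumption.

```python
def describe_r(r: float) -> str:
    """Return a rough verbal label for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Cut-offs follow the rule of thumb in the notes; boundary values go to
    # the stronger band (an assumption, the notes leave this open).
    if size >= 0.8:
        strength = "very strong"
    elif size >= 0.6:
        strength = "moderately strong"
    elif size >= 0.4:
        strength = "moderately weak"
    else:
        strength = "very weak"
    # The sign of r gives the direction of the association.
    if r > 0:
        sign = "positive"
    elif r < 0:
        sign = "negative"
    else:
        sign = "no"
    return f"{strength}, {sign} association"


print(describe_r(0.68))
print(describe_r(-0.14))
```

These labels only describe the LINEAR relationship; the scatterplot still has to look linear for them to mean anything.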
How to make a scatterplot on the calculator?
Enter X-Variable: L1 and Y-variable: L2
Use 2nd + STATPLOT [Y= on the top left] → Select the 1st plot → Xlist: L1 → Ylist: L2
Hit ZOOM [on the top row] → ZOOMSTAT to view the scatterplot.
How to calculate the corr. coeff., r, on the calculator?
After entering the X-values in L1, and Y-values in L2, use STAT → CALC → Option8: LinReg L1, L2.
Note: If your calculator does not display r, do this: 2nd 0 → Scroll Down to Diagnostic On → Hit ENTER twice. Then, follow the above instructions.
What are the properties of the Correlation Coefficient, r?
Definition: r is a measure of the strength [#] and association [sign] of the linear relationship between 2 variables. Alternately, r is a measure of how closely the points are clustered to the Line of Best Fit.
For it to be reasonable to use r, the scatterplot must appear linear.
Caution! It makes no sense to calculate r for a relationship that is non-linear [a u-shaped pattern]...or for one where there is no clear association between Y and X [a haphazard plot].
1. The formula to calculate r is r = (∑ZxZy)/(n – 1) i.e. it is the average of the product of the z-scores of the X and Y variables.
IMPORTANT! r has no units, being derived from Z-scores; Z-scores are pure unit-less numbers [~ number of s.d. a number is from the mean].
2. r has the same sign as the slope of the Line of Best Fit.
Positive association: when X increases, Y tends to increase; Negative association: when X increases, Y tends to decrease
3. -1 ≤ r ≤ 1 [r = ±1 only when all the points lie exactly on a line]
4. An Important Property:
r is NOT affected by changes in SCALE [Translation: multiplying or dividing every X- or Y- value by a positive constant does not affect r; a negative constant only flips its sign!] and
r is NOT affected by changes in ORIGIN [adding a constant to each X- or Y- value does not affect r!].
WHY?
Because
A) r only tells you how closely clustered the data is to the Line of Best Fit, i.e. r measures the strength and association of the linear relationship.
Changing the origin [adding / subtracting a constant to / from each X- or Y-value] does NOT affect the relative cluster of the points to the Line of Best Fit since all it does is shift all the points leftward / rightward [if a constant is added / subtracted to all the X-values], or upwards / downwards [if a constant is added / subtracted to all the Y-values]...but it doesn't affect how close the points are relative to the Line of Best Fit!
Changing the scale [multiplying / dividing each X- or Y-value by a constant] also does NOT affect the relative cluster of the points to the Line of Best Fit since all it does is move the points further apart or closer together (multiplying / dividing). There's simply more or less "empty space" in the plot: it doesn't change the direction of the Line of Best Fit or how tightly the points cluster about it!
B) r is the average of the product of the Z-scores of the X- and Y-variables...and how are Z-scores calculated? By subtracting a constant [the Mean!] from each value and dividing the result by another constant [the s.d.!]. Shifting or rescaling the data shifts or rescales the Mean and s.d. in the same way, so the Z-scores stay the same.
Hence, such linear transformations do not affect r.
5. Correlation does not imply causation.
Translation: simply because the correlation coefficient, r, is close to ±1, say, does not indicate that X caused Y. A [strong] relationship between Y and X does not automatically indicate a causal relationship.
Example. There is a strong correlation between Hair length and Shoe-size [after all, as babies grow, they have longer hair and bigger shoe-sizes]…but it’d be absurd to argue that bigger shoe size causes hair to grow longer!
Example. There is a strong correlation between SAT scores and Family Income [wealthier families tend to be well-educated or well-educated families tend to be wealthier...and those families may invest in education more, leading to higher SAT scores amongst their young]...but it'd be ridiculous to claim that the higher Income caused the higher scores or that to do better on the SATs one would need wealthier parents!
6. r being close to zero indicates a weak linear relationship [for instance, a relatively haphazard scatterplot!] or may suggest a strong non-linear relationship [for instance, a U-shaped scatterplot]; on the other hand, r being close to ±1 indicates a strong linear relationship.
Case I: haphazard scatter, showing NO relationship at all
Case II: U-shaped / upside-down-U shaped curves with strong non-linear relationships; even polynomial curves...however, some curves MAY give a HIGH r.
Observe this! A Haphazard scatter indicates no relationship between X and Y and Zero Correlation.
Observe this! But Zero Correlation Does NOT Mean No Relationship...it MIGHT mean that there is a relationship which is non-linear, for instance curvilinear.
7. r requires the 2 variables to be numeric.
Caution! It does not make sense to calculate r for categorical variables: "Income and Gender are strongly correlated" is an absurd statement from a statistical standpoint!
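Properties 1, 4 and 6 above can be checked numerically. The sketch below is not part of the original notes: the height / weight numbers are made up purely for illustration, and the helper name correlation is an assumption. It computes r as the sum of the products of Z-scores divided by n – 1, then shows that a change of scale and origin leaves r alone, and that a perfect U-shape gives r of about 0.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Property 1: r = sum(Zx * Zy) / (n - 1), using sample s.d. (divide by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)


# Made-up illustrative data: heights (inches) and weights (lb) of 6 people.
height_in = [60, 62, 65, 67, 70, 72]
weight_lb = [115, 120, 140, 138, 165, 170]
r_original = correlation(height_in, weight_lb)

# Property 4: change of SCALE (inches -> cm) and of ORIGIN (add 5 lb to each weight).
height_cm = [2.54 * h for h in height_in]
weight_shifted = [w + 5 for w in weight_lb]
r_transformed = correlation(height_cm, weight_shifted)

# Property 6: a perfect U-shape (y = x^2 on symmetric x-values) is a strong
# non-linear relationship, yet its r is essentially 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_u_shape = correlation(xs, ys)

print(r_original, r_transformed, r_u_shape)
```

The first two printed values agree up to floating-point rounding, and the third is 0 up to rounding, even though Y is completely determined by X in the U-shaped data.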
How is the correlation coefficient, r, interpreted?
Example. The correlation coefficient of r = -0.14 indicates that the relationship between Sodium content in Hot Dogs and Calories is weakly linear [r → 0], with negative association suggesting that as the calories in the hot dogs rose from 280 to 720cal, the Sodium content fell from 420g to 370g, on average.
Yes, interpreting r is just like describing the linear relationship!
Example. The correlation coefficient of r = +0.68 indicates that the relationship between Income and Education Level is moderately strong and linear [r ~ 0.7] with positive association [r, being +]: as the Education Levels of individuals rose from 2years to 8years, so did their incomes in general, from $23,450/year to $73,861.
What is the Regression Line or the Line of Best Fit?
The regression line is: Y^ = a + bX, with slope b [the # next to X] and Y-intercept a [the # all by itself]. It is used to predict Y-values for given X-values.
For the Line of Best Fit, i.e. regression, it matters which is the Independent variable (X), and which, the Dependent variable (Y). So, it is vital to identify the independent (X) and dependent (Y) variables accurately. Y depends on X. Alternately, we use X to predict Y. Read the Q carefully, and examine the given plots to make a judgment.
Observe this!
It is convention to use a ^ after the Y to suggest the idea that the Line of Best Fit is used to predict Y-values given an X-value, so Y^ ~ predicted Y e.g. Y^ = 3.1452 – 4.1562X
It is convention to write the Line of Best Fit "in context" i.e. without using the Y and X letters but with mnemonic variables e.g. Height^ = 3.1452 – 4.1562Age or HT^ = 3.1452 – 4.1562AGE
How to calculate the Regression Line ~ Line of Best Fit on the calculator?
After entering the X-values in L1, and Y-values in L2, use STAT → CALC → Option8: LinReg L1, L2
Problem. Fill in the blanks.
1. r is a measure of the ______ between 2 variables in terms of their strength [#] and association [sign]. Alternately, r is a measure of how close the scatterplot is clustered to the ___.
2. r requires X and Y to be ___ variables.
3. The formula for r is ___.
4. The correlation coefficient, r, has the same sign as the ___ of the LSRL.
5. Suppose we are estimating the Height (cm) of children based on their Age (months).
a) The Explanatory variable is __, and the response variable is __.
b) The unit of the slope of the LSRL is __.
c) The unit of the y-intercept is __.
6. r is not affected by changes in __ or __.
7. Correlation does not imply __.
8. LSRL stands for ___.
9. The Least Squares Regression Line or LSRL describing the linear relationship between Y and X minimizes ___.
10. For the LSRL, Y^ = a + bX, the formula for the slope is __ and the formula for the Y-intercept is __.
11. The Residual is simply the __ [definition in simple language, not the formula].
12. The LSRL of Y on X minimizes the Sum of the Squares of the Residuals in the [vertical / horizontal] __ direction.
13. The letter [or variable] used to describe the slope is __.
14. The letter [or variable] used to describe the Y-intercept is __.
15. The formula for Residual, R = __.
Solution.
1. r is a measure of the linear relationship between 2 variables in terms of its strength [#] and association [sign]. Alternately, r is a measure of how close the data are clustered to the LSRL.
2. r requires X and Y to be numeric / quantitative variables.
3. The formula for r is ∑ZxZy/(n – 1).
4. r has the same sign as the slope of the LSRL.
5. Suppose we are predicting the Height (cm) of children based on their Age (months).
a) The Explanatory variable is Age and the response variable is Height.
b) The unit of the slope of the LSRL is cm/month.
c) The unit of the y-intercept is cm.
6. r is not affected by changes in origin or scale.
7. Correlation does not imply causation.
8. LSRL stands for Least Squares Regression Line.
9. The Least Squares Regression Line or LSRL describing the linear relationship between Y and X minimizes the sum of the squares of residuals OR the sum of the squares of vertical distances OR the sum of the squares of differences between Actual and Predicted Y-values.
10. For the LSRL, Y^ = a + bX, the formula for the slope, b = ∆Y/∆X = r∙Sy/Sx and Y-intercept, a = YB – b∙XB [XB, YB ~ the means of X and Y; Sx, Sy ~ their s.d.].
11. The Residual is simply the prediction error.
12. The LSRL of Y on X minimizes the Sum of the Squares of the Residuals in the vertical direction.
13. The letter [or variable] used to describe the slope is b.
14. The letter [or variable] used to describe the Y-intercept is a.
15. The formula for Residual, R = Actual Y-value – Predicted Y-value
Problem. Researchers suspect that there's a relationship between Per Capita Alcohol Consumption and Heart Disease, specifically, that they could employ Per Capita Alcohol Consumption to predict incidence of Heart Disease. It is found that for a data-set of 19 well-developed countries, the average Per Capita Alcohol Consumption from wine was 3.0263 litres/year with a s.d. of 2.5097 litres/year; and that the average heart disease death rate was 191.0526 (per 100,000) with a s.d. of 68.3963 (per 100,000). Further, the correlation between per capita wine consumption and heart disease was -0.8428.
Country | Per Capita Alcohol Consumption (litres / year) | Heart Disease Death Rate (per 100,000)
Australia | 2.5 | 211
Austria | 3.9 | 167
Belgium | 2.9 | 131
Canada | 2.4 | 191
Denmark | 2.9 | 220
Finland | 0.8 | 297
France | 9.1 | 71
Iceland | 0.8 | 211
Ireland | 0.7 | 300
Italy | 7.9 | 107
Netherlands | 1.8 | 167
New Zealand | 1.9 | 266
Norway | 0.8 | 227
Spain | 6.5 | 86
Sweden | 1.6 | 207
Switzerland | 5.8 | 115
United Kingdom | 1.3 | 285
United States | 1.2 | 199
West Germany | 2.7 | 172
a) Determine the Explanatory and Response variables. What are the units in which each is measured? [Be detailed.]
b) Attach symbols to the statistics above. Then use the statistics to calculate the slope and y-intercept of the LSRL between Heart Disease Death Rate and Per Capita Alcohol Consumption.
c) Write the LSRL in context. Show formulas / work.
d) Interpret the correlation coefficient between Per Capita Alcohol Consumption and Heart Disease Death Rate in context.
e) What are the units of the slope of the LSRL? That for the y-intercept of the LSRL? Tip! The units of slope, b are units of [rise / run ~ Y / X] whereas the unit of the y-intercept, a, is simply that of the Y-variable.
f) Predict the death rate for Sweden. Is this an overestimate / underestimate?
g) Calculate and interpret the Residual for Sweden.
Solution.
a) The Explanatory or X-variable is the Per Capita Wine Consumption ~ PCWC (litres / year) and the Response or Y-variable is the Heart Disease Death Rate ~ HDDR (per 100,000).
b) XB = 3.0263 litres/year, Sx = 2.5097 litres/year; YB = 191.0526 (per 100,000); Sy = 68.3963 (per 100,000); r = -0.8428.
b = r∙Sy/Sx = -0.8428·68.3963/2.5097 = -22.9686 per 100,000 / (litres/year)
a = YB – b∙XB = 191.0526 – (-22.9686)·(3.0263) = 260.5633 per 100,000
c) HDDR^ (per 100,000) = 260.5633 – 22.9687·PCAC (litres per year).
d) The relationship between Heart Disease Death Rate and Per Capita Wine Consumption is strongly linear with a negative association: as the Per Capita Wine Consumption of countries increased [from 0.7 to 9.1 litres per year], in general, the Heart Disease Death Rate fell [from 300 to 71 deaths per 100,000]. WATCH THE UNITS!
e) Units of Slope (Y / X): per 100,000 / (litres/year)
Unit of y-intercept (Y): per 100,000
f) For Sweden, given: PCAC = 1.6 litres / year
Therefore, HDDR^ = 260.5633 – 22.9687·1.6 = 223.8 deaths / 100,000 WATCH THE UNITS!
Since the actual HDDR [207] is lower than the predicted HDDR, it is an over-estimate.
g) Residual Death Rate = Actual Death Rate – Predicted Death Rate [Observe context]
= 207 [from the table!] – 223.8 = -16.8 deaths / 100,000 WATCH THE UNITS!
Interpretation: A residual of -16.8 deaths / 100,000 indicates that our LSRL model overestimates the Actual Death Rate of 207 deaths / 100,000 for Sweden by 16.8 deaths / 100,000.
Alternately: A residual of -16.8 deaths / 100,000 is a measure of the prediction error – an overestimate – when predicting the Death Rate for Sweden using the LSRL.
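The steps of this solution (slope and y-intercept from the summary statistics, then a prediction and a residual) can be reproduced as a quick check. This is a sketch only: the function name lsrl_from_summary is made up, and because it works from the rounded summary statistics, the last decimal places may differ slightly from a calculator's LinReg output.

```python
def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for Y^ = a + bX using b = r*Sy/Sx and a = YB - b*XB."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Summary statistics quoted in the problem (wine consumption vs heart disease).
a, b = lsrl_from_summary(x_bar=3.0263, s_x=2.5097,
                         y_bar=191.0526, s_y=68.3963, r=-0.8428)

# Sweden: consumption of 1.6 litres / year, actual death rate 207 per 100,000.
predicted_sweden = a + b * 1.6
residual_sweden = 207 - predicted_sweden   # Actual - Predicted

print(f"slope b = {b:.4f} per 100,000 / (litres/year)")
print(f"y-intercept a = {a:.4f} per 100,000")
print(f"predicted = {predicted_sweden:.1f}, residual = {residual_sweden:.1f} per 100,000")
```

A negative residual means the actual value lies below the line, i.e. the LSRL overestimates.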
Problem. Obesity is a growing problem around the world. A study sought to shed some light on gaining weight. Some people don't gain weight even when they overeat. Perhaps, fidgeting and other non-exercise activity (NEA) explains why – some people may spontaneously increase non-exercise activity when fed more. Researchers deliberately overfed 16 healthy young adults for 8 weeks. They wished to determine if the change in energy use (in calories) from Non-Exercise Activity (NEA) i.e. activity other than deliberate exercise – fidgeting, daily living, etc. could predict the fat gain (in kg).
a) Identify the explanatory and response variables, and the units they are measured in.
b) The following summary statistics were obtained:
Mean NEA change: 324.8cal, s.d. of NEA change = 257.66cal;
Mean fat gain = 2.388Kg; s.d. of fat gain = 1.1389Kg;
correlation between fat gain and NEA change was -0.7786.
Calculate the slope of the LSRL and the y-intercept of the LSRL. Show formulas / work. State their units.
c) Write the equation of the LSRL, in context. Mention units in ( ) beneath the Y^ and X variables.
d) The actual data-set is:
NEA change (cal) | Fat Gain (Kg)
-94 | 4.2
-57 | 3
-29 | 3.7
135 | 2.7
143 | 3.2
151 | 3.6
245 | 2.4
355 | 1.3
392 | 3.8
473 | 1.7
486 | 1.6
535 | 2.2
571 | 1
580 | 0.4
620 | 2.3
690 | 1.1
Interpret the correlation coefficient r = -0.7786, in context, to describe the relationship between fat gain and NEA change, describing the association between them.
e) Calculate and interpret the residual for (355, 1.3). Show formulas / work.
Solution.
a) Explanatory Variable: NEA change (cal); Response Variable: fat gain (Kg)
b) The following summary statistics were obtained:
Mean NEA change: 324.8cal = XB
s.d. of NEA change = 257.66cal = Sx
Mean fat gain = 2.388Kg = YB
s.d. of fat gain = 1.1389Kg = Sy
correlation between fat gain and NEA change was -0.7786 = r
b = r∙Sy/Sx = -0.7786·1.1389/257.66 = -0.00344 Kg / Cal
a = YB – b∙XB = 2.388 - (-0.00344)·(324.8) = 3.505Kg
c) FG^ (Kg) = 3.505 – 0.00344·NEAC (Cal)
d) The correlation coefficient of -0.7786 indicates that the relationship between Fat Gain and NEA change is strongly linear with a negative association, indicating that as the NEA change increased for the subjects [from -94 to 690cal], the Fat Gain they experienced fell, in general [from 4.2 to 0.4Kg].
e) For NEAC = 355, FG^ (Kg) = 3.505 – 0.00344·355 = 2.2838 Kg
Residual FG = Actual FG – Predicted FG
= 1.3 [Given!] – 2.2838 = -0.9838 Kg WATCH THE UNITS!
A residual of -0.9838 Kg indicates that the LSRL overestimates the actual FG for an NEAC of 355Cal by 0.9838 Kg
OR
the residual of -0.9838 Kg is a measure of the prediction error when using the LSRL for predicting the FG for an NEAC of 355Cal.
Problem: Consider the scatterplot below that displays the distribution of the heights of 53 pairs of parents.
a) What is the smallest height of any mother in the group? How many mothers have that height? What are the heights of the fathers for those mothers?
b) What is the greatest height of any father in the group? How many fathers have that height? What are the heights of the mothers for those fathers?
c) How many couples are of the same height?
d) In how many couples was the father shorter than the mother?
e) Describe the relationship as depicted by the scatterplot, in context.
Solution.
a) 57in; 2; 66in and 67in
b) Do this yourselves.
c) Just look for paired heights: 61-61in: 1; 63-63in: 1; 65-65in: 1; 67-67in: 1 → total 4 couples! Ignore all other values...
d) 1 [F-M: 65-65in]
e) There is no clear relationship between Mother's heights and Father's heights OR The relationship between Mother's heights and Father's heights is very weakly linear with positive association suggesting that fathers that are taller tend to have wives that are taller too...
NOTES on the Regression Line
How are the slope and y-intercept interpreted?
The LSRL is Y^ = a + bX with the slope being b and the y-intercept, a.
Concept of slope, b: The slope tells us how fast Y is changing when X changes by a certain amount.
Mathematically, b = rise / run = ΔY / ΔX where Δ stands for change.
UNDERSTAND THIS WELL → What b = ΔY / ΔX implies is that if ΔX = 1, then clearly, b = ΔY or, what is the same thing, ΔY = b.
That is the essence of the interpretation of the slope: a slope of b (Y / X units) indicates that when X changes by 1 unit, then Y is estimated to change by b units. ← MASTER THIS.
Concept of Y-intercept, a: The y-intercept tells us what the Y-value is when X = 0.
UNDERSTAND THIS WELL → in the LSRL, if you substitute X = 0, then Y^ = a → the y-intercept is the point (X = 0, Y = a).
That is the essence of the interpretation of the Y-intercept: a Y-intercept of a (Y units) ~ (X = 0 units, Y = a units) indicates that when X = 0 units, Y is predicted to be a units. ← MASTER THIS.
Example. If SalesVolume^ ($mn) = -410.2365 + 26.8934Time (years since 1995), then
a slope of [setting up the interpretation: b = 26.8934 $mn / year = ΔSalesVolume^ / ΔTime, with ΔTime = 1 year] indicates that every successive year since 1995, the Sales Volume has been estimated to rise by about $26.8934mn.
A y-intercept of -$410.2365mn ~ [setting up the interpretation: (Time = 0 years since 1995, SalesVolume^ = -$410.2365mn)] indicates that in 1995 (can you see why?!), the Sales Volume was predicted to be -$410.2365mn...which, however, does not make sense in the context of the problem.
Example. If WaterUsage^ ('000 gallons) = 410.2365 – 26.8934RelativeHumidity (%), then
a slope of [setting up the interpretation: b = -26.8934 '000 gallons / % = ΔWaterUsage^ / ΔRelativeHumidity, with ΔRelativeHumidity = 1%] indicates that when the Relative Humidity increases by 1%, the Water Usage is estimated to fall by about 26.8934 '000 gallons.
A y-intercept of 410.2365 '000 gallons ~ [setting up the interpretation: (RelativeHumidity = 0%, WaterUsage^ = 410.2365 '000 gallons)] indicates that when the Relative Humidity is 0%, then the Water Usage is predicted to be 410.2365 '000 gallons.
Note: this does make sense in the context of the problem but you don't have to state that when it does make sense, haha.
Caution!
The slope deals with the idea of change: you must employ terms such as increase, decrease, rise, fall etc. when interpreting the slope.
Only the slope deals with the idea of change, not the Y-intercept. So don't use the terms increase, decrease, rise, fall for the Y-intercept.
Both, slope and y-intercept, relate to the LSRL -> use the terms estimated or predicted.
You need to mention the UNITS of the slope and y-intercept throughout.
Psst!
Read the Q and identify the X-variable and the Y-variable 1st. Remember: Y depends on X; also, we predict Y when X is given!
If the problem refers to “predict” or “estimate”, that refers to predicted Y-value -> use the LSRL!
Example. The LSRL relating the Value of a car, V(t), with time, t, measured in terms of Years since 2002 is V(t)^ = $34,500 – 2,300t.
Set up and interpret the slope of the LSRL in context, using UNITS.
Set up and interpret the y-intercept of the LSRL in context, using UNITS.
[Power Tip! It enormously helps to "set up" the interpretation 1st, as we did in class!]
Solution.
Interpretation Guide for Slope! b = -$2,300 / year = ∆Value of Car ($) / ∆Time (Years since 2002), with ∆Time = 1 year. Mentioning UNITS is vital!
A slope of -$2,300 / year indicates that every year since 2002, the Value of the car is estimated to fall by $2,300, on average.
OR
A slope of -$2,300 / year indicates that as the number of years increases by 1 beyond 2002 [how awkward!], the Value of the car is estimated to fall by about $2,300.
Interpretation Guide for Y-intercept! (Time = 0 years since 2002, Value of Car = $34,500) Mentioning UNITS is vital!
A y-intercept of $34,500 indicates that in 2002, the Value of the car was estimated to be about $34,500.
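The "set-up" for both interpretations can be checked directly from the car-value LSRL: raising t by 1 year changes the prediction by exactly the slope, and substituting t = 0 returns the y-intercept. A minimal sketch (the function name predicted_value is made up for illustration):

```python
def predicted_value(t):
    """Predicted car value in dollars, t = years since 2002: V(t)^ = 34,500 - 2,300t."""
    return 34_500 - 2_300 * t


# Slope: the change in the prediction when t goes up by 1 year (here from 3 to 4).
change_per_year = predicted_value(4) - predicted_value(3)

# Y-intercept: the prediction at t = 0, i.e. in the year 2002.
value_in_2002 = predicted_value(0)

print(change_per_year, "dollars / year")
print(value_in_2002, "dollars in 2002")
```

Any pair of years one apart gives the same change, which is why the slope is read as "per 1 unit of X".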
Example: The LSRL relating the Number of women in the labour force in millions, W, and Years since 1998, t, is W^ = 0.9286t + 63.7
Set up and interpret the slope of the LSRL in context. Mentioning UNITS is vital!
Set up and interpret the y-intercept of the LSRL in context. Mentioning UNITS is vital!
Solution.
Interpretation Guide for slope! b = 0.9286mn/year = ∆Number of women in the labour force (mn) / ∆Time (years since 1998), with ∆Time = 1 year
A slope of 0.9286mn/year indicates that for every (successive) year since '98, the estimated number of women in the labour force in the U.S. rose by about 0.9286mn. Mentioning UNITS is vital!
Interpretation Guide for Y-intercept! (0 years since 1998, 63.7mn women in the US labour force)
A y-intercept of 63.7mn indicates that in 1998 there were an estimated 63.7mn women in the labour force in the U.S. on average...[which makes sense, by the way!]. Mentioning UNITS is vital!
Example. For the distribution of Heart Disease Death Rates and Per Capita Alcohol Consumption, the LSRL was HDDR^ (per 100,000) = 260.5633 – 22.9687·PCAC (litres per year).
Set up and interpret the slope of the LSRL in context. Mentioning UNITS is vital!
Set up and interpret the y-intercept of the LSRL in context. Mentioning UNITS is vital!
Solution.
Interpretation Guide for slope! b = –22.9687 per 100,000 / (litres per year) = ΔHDDR^ (per 100,000) / ΔPCAC (in litres / year), with ΔPCAC = 1 litre / year. Mentioning UNITS is vital!
A slope of –22.9687 per 100,000 / (litres per year) indicates that when the Per Capita Alcohol Consumption of countries rose by 1 litre / year, the estimated Heart Disease Death Rate fell by about 22.9687 deaths / 100,000 [OR...the Heart Disease Death Rate was estimated to fall by about 22.9687 deaths / 100,000]
OR
A slope of –22.9687 per 100,000 / (litres per year) indicates that countries that had a Per Capita Alcohol Consumption of 1 litre / year more than another had, on average, about 22.9687 fewer Heart Disease deaths / 100,000.
Interpretation Guide for Y-intercept! (0 litres/year, 260.5633 deaths / 100,000)
A y-intercept of 260.5633 deaths per 100,000 indicates that when the Per Capita Alcohol Consumption of countries was 0 litres / year, the estimated Heart Disease Death Rate was 260.5633 deaths per 100,000 [or the Heart Disease Death Rate was estimated to be 260.5633 deaths per 100,000]...[which makes sense, by the way!].
Example. A certain teacher wishes to predict the GPA of her students based on their IQ and finds that their Mean IQ was 108.9, the s.d. of IQ was 13.17, the Mean GPA was 7.447, with the s.d. of GPAs being 2.10. If the correlation coefficient between GPA and IQ is r = 0.6337, determine the equation of the LSRL and write it in context. Show relevant formulas and calculations.
Solution.
Since we are predicting GPA based on IQ, Y ~ GPA and X ~ IQ so that:
Given: XB = 108.9, Sx = 13.17, YB = 7.447, Sy = 2.10 with r = 0.6337
Slope, b = r∙Sy/Sx = 0.6337·2.1/13.17 = 0.1010
and Y-intercept: a = YB – b∙XB = 7.447 – (0.1010)·(108.9) = -3.557 [using the unrounded slope, 0.101046...]
and LSRL: GPA^ = -3.557 + 0.101·IQ
EXAMPLE. An electricity utility would like to examine the relationship between daily temperature and electricity consumption and records the following data:
Average Daily Temperature (F) | KiloWatts (KW) of Electricity Consumed (in Million)
77 | 10
84 | 12.1
85 | 13.1
90 | 14.2
92 | 15.6
91 | 14.1
81 | 9.7
88 | 10.7
79 | 8.1
86 | 11.5
78 | 8.4
93 | 9.9
105 | 16.3
95 | 12.7
a) Identify the Explanatory and Response variables [be detailed], and state their units.
b) Use the calculator and calculate the Correlation Coefficient, r, between Electricity Consumption and Average Daily Temperature and interpret it in context.
c) Write the LSRL in context. Mention units for the Y and X variables in ( ) next to them.
d) Calculate the Residual for a temperature of 81F. Show all work. Interpret the Residual. Give adequate context.
Solution.
a) E: Average Daily Temperature (F); R: Electricity Consumption: KiloWatts (KW) (in Million)
b) r = 0.7712 suggests that the relationship between Electricity Consumption and Average Daily Temperature is strongly linear with a positive association, indicating that as the Average Daily Temperature rose from 77 to 105 F, the Electricity Consumption rose, in general, from 8.1 to 16.3 KiloWatts (KW) (in Million). Mentioning UNITS is vital!
c) EC^ (kW mn) = -10.6924 + 0.2582·ADT (F)
d) The LSRL for predicting EC for a given ADT is: EC^ (kW mn) = -10.6924 + 0.2582·ADT (F)
Given: ADT = 81F → Predicted EC = EC^ (kW mn) = -10.6924 + 0.2582·81 = 10.22mn kW
Residual EC = Actual EC – Predicted EC
= 9.7 – 10.22 = -0.52mn kW
A residual EC of -0.52mn kW indicates that the LSRL overestimates the EC for an ADT of 81F by 0.52mn kW
OR
A residual EC of -0.52mn kW is a measure of the prediction error when using the LSRL for an ADT of 81F.
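For readers without the calculator, the LinReg output used in this solution can be approximated from the raw table. The sketch below is an assumption about how one might do it by hand in code (the function name least_squares is made up); it uses the standard least-squares sums rather than the calculator's internals, so the last digits may differ from the displayed values.

```python
from statistics import mean

# Data copied from the table: Average Daily Temperature (F), consumption (mn kW).
temps = [77, 84, 85, 90, 92, 91, 81, 88, 79, 86, 78, 93, 105, 95]
usage = [10, 12.1, 13.1, 14.2, 15.6, 14.1, 9.7, 10.7, 8.1, 11.5, 8.4, 9.9, 16.3, 12.7]


def least_squares(xs, ys):
    """Return (a, b, r) for the line Y^ = a + bX fitted by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx                 # slope; equals r * Sy / Sx
    a = y_bar - b * x_bar         # y-intercept
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return a, b, r


a, b, r = least_squares(temps, usage)
predicted_81 = a + b * 81
residual_81 = 9.7 - predicted_81   # Actual - Predicted for the 81F day

print(f"r = {r:.4f}, EC^ = {a:.4f} + {b:.4f}*ADT")
print(f"predicted at 81F = {predicted_81:.2f}, residual = {residual_81:.2f} (mn kW)")
```

The residual comes out negative, matching the conclusion that the LSRL overestimates consumption on the 81F day.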
Problem. We want to employ Amount of Vegetables and Fruits Consumed (in grams) to estimate the Time to Lose 5lbs (in Months). Suppose R² = 58.96% with the LSRL: TL5LBS^ = 45.5964 – 2.2353AVFC
1. Identify the Explanatory and Response variables and their units.
2. Calculate r. Interpret r in context.
3. Interpret the slope in context.
4. Interpret the y-intercept in context.
Solution.
1. Explanatory variable (X): Amount of Vegetables and Fruits Consumed (g)
Response Variable (Y): Time Taken to Lose 5lbs (months)
2. r = ±√0.5896 = -0.7679 [since (think about it) there's a negative relationship between Time to Lose 5lbs and the Amount of Vegetables and Fruits Consumed: after all, as the Amount of Vegetables and Fruits Consumed, X, increases, the Time to Lose 5lbs, Y, decreases. Also, the slope is negative, so...]
Interpretation: r = -0.7679 suggests that there is a reasonably strong linear relationship between the Time to Lose 5lbs and the Amount of Vegetables and Fruits Consumed [Mentioning detailed CONTEXT is vital!], with a negative association, indicating that as the Amount of Vegetables and Fruits Consumed, X, increases, the Time to Lose 5lbs, Y, tends to decrease. ← some of you are forgetting to explain the association.
3. Interpretation Guide for slope! b = -2.2353 months / gram = ΔTL5LBS^ / ΔAVFC, with ΔAVFC = 1g. Mentioning CONTEXT and UNITS is vital!
A slope of -2.2353 months / gram indicates that when the Amount of Vegetables and Fruits that an individual Consumes rises by 1 gram, the Time to Lose 5lbs is estimated to fall by 2.2353 months
OR
AP 5 students! A slope of -2.2353 months / gram indicates that every additional gram of vegetable and fruit consumed is associated with 2.2353 fewer months, on average, for a person to lose 5lbs.
4. Interpretation Guide for Y-intercept! (AVFC = 0 grams, TL5LBS^ = 45.5964 months) Mentioning CONTEXT and UNITS is vital!
A y-intercept of 45.5964 months indicates that when the Amount of Vegetables and Fruits Consumed is 0g [or when an individual consumes no vegetables or fruits], then the Time to Lose 5lbs is estimated to take about 45.5964 months...[which makes sense, by the way!].
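Step 2 of this solution (recovering r from R² by taking the square root and borrowing the sign of the slope) can be sketched as follows. The function name r_from_r_squared is made up for illustration; R² is entered as a proportion, not a percentage.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Return r = ±sqrt(R^2), with the same sign as the slope of the LSRL."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must be a proportion between 0 and 1")
    # copysign attaches the sign of the slope to the (non-negative) square root.
    return math.copysign(math.sqrt(r_squared), slope)


# R^2 = 58.96% and slope = -2.2353 months / gram, as given in the problem.
r = r_from_r_squared(0.5896, -2.2353)
print(round(r, 4))
```

The sign has to come from the slope (or the scatterplot), because squaring r throws the sign away.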
Example. The relationship between the amount of Nicotine and Tar in cigarettes is given by the regression line: Nicotine^ (mg) = 0.154030 + 0.065052Tar (mg).
a) Predict the Nicotine content for a cigarette with 4mg of Tar.
b) Interpret the slope of the regression line.
c) Interpret the y-intercept of the regression line.
Solutions.
a) For Tar = 4mg:
Nicotine^ (mg) = 0.154030 + 0.065052∙4 (mg) [Showing Substitution is vital!]
= 0.414mg
b) Set-up: b = 0.065052 mg / mg = ∆Nicotine^ (mg) / ∆Tar, with ∆Tar = 1mg. Mentioning CONTEXT and UNITS is vital!
A slope of b = 0.065052 mg / mg [Note: you may omit the units because they cancel or leave them alone...] indicates that as the Tar content in cigarettes increased by 1mg, the Nicotine content was estimated to rise by about 0.065052mg
OR
AP 5 students! In general, cigarettes that had 1mg of tar more were estimated to have about 0.065052mg more of nicotine.
c) Set-up: (Tar = 0mg, Nicotine^ = 0.154030mg) Mentioning CONTEXT and UNITS is vital!
A y-intercept of 0.154030mg indicates that for cigarettes with 0mg of Tar [or no tar ~ 0mg] content, the predicted Nicotine content was about 0.154030mg.
Example. The relationship between Mortality Rate and Calcium Content in the water supply for a group of cities is given by the regression line, MortalityRate^ (deaths / 100,000) = 1676 – 3.23CalciumContent (ppm) with the correlation coefficient, r = -0.6557 [negative, like the slope].
a) Interpret the correlation coefficient in context.
b) Interpret the slope of the regression line. Then, interpret the y-intercept.
Solution.
a) The relationship between Mortality Rate (per 100,000) and Calcium Content (ppm) is moderately linear with a negative association: as the calcium content of the water in the towns rose, the mortality rate declined, in general.
b) Set-up: b = -3.23 deaths per 100,000 / ppm = ∆Mortality Rate^ / ∆Calcium Content, with ∆Calcium Content = 1ppm. Mentioning CONTEXT and UNITS is vital!
A slope of -3.23 deaths per 100,000 / ppm indicates that as the Calcium Content of the water in the towns rose by 1ppm, the Mortality Rate was estimated to fall by about 3.23 deaths / 100,000.
Set-up: (Calcium Content = 0ppm, Mortality Rate^ = 1676 deaths / 100,000) Mentioning CONTEXT and UNITS is vital!
A Y-Intercept of 1676 deaths per 100,000 indicates that for a town with NO Calcium Content in the water, the predicted Mortality Rate is about 1676 deaths / 100,000.