Section 12.1 Scatter Plots and Correlation

Copyright by Hawkes Learning All rights reserved.

Objectives
o Construct and interpret scatter plots.
o Calculate and interpret the correlation between two variables.

A scatter plot is a graph on the coordinate plane that contains one point for each pair of data. The horizontal axis (x-axis) represents one variable and the vertical axis (y-axis) represents the other. Unlike a line graph, the points on a scatter plot are not connected.

The choice of which variable to put on which axis depends on whether we believe that one variable might influence the other. If this type of relationship exists, we say that a change in the value of one variable, called the explanatory variable, influences a change in the value of the other variable, called the response variable. The explanatory variable is placed on the x-axis and the response variable on the y-axis.

The importance of a scatter plot is that it enables us to identify trends in the data. For example, do the points fall in a linear pattern, a curved pattern, or no pattern at all? If the points have a linear shape, how close to straight are they? Does the pattern have a positive slope (the points rise from left to right) or a negative slope (the points fall from left to right)?

Table 12.2: Sample of NFL Quarterbacks (2011-2012 Season)

Quarterback      Number of Passing   2012 Base Salary            Quarterback
                 Touchdowns          (in Millions of Dollars)    Rating
Drew Brees       46                  3.0                         110.6
Michael Vick     18                  12.5                        84.9
Philip Rivers    27                  10.2                        88.7
Tony Romo        31                  0.825                       102.5
Aaron Rodgers    45                  8.0                         122.5
Jay Cutler       13                  7.7                         85.7
Alex Smith       17                  5.0                         90.7
Eli Manning      29                  1.75                        92.9
Tim Tebow        12                  2.1                         72.9
Tom Brady        39                  0.95                        105.6

Source: Yahoo! Sports. NFL - Statistics by Position. http://sports.yahoo.com/nfl/stats/byposition?pos=QB&conference=NFL&year=season_2011&sort=49&timeframe=All (20 May 2012).
Source: Spotrac.com. NFL Player Contracts, Salaries, and Transactions. http://www.spotrac.com/nfl/ (2 Oct. 2012).

Example 12.1: Creating a Scatter Plot to Identify Trends in Data

Use the data from Table 12.2 to produce a scatter plot that shows the relationship between the base salary of an NFL quarterback and the number of touchdowns the quarterback has thrown in one season.

Solution
We might expect the number of touchdowns a quarterback throws in one season to influence his salary. Taking this into consideration, we will place the number of touchdowns on the x-axis and the base salary on the y-axis.

[Scatter plot: 2012 base salary (in millions of dollars) versus number of passing touchdowns]

Looking at this scatter plot, we do not see a linear pattern. In fact, no pattern is evident. This probably indicates that these two variables do not have a relationship after all.

Example 12.2: Creating a Scatter Plot to Identify Trends in Data

Use the data in Table 12.2 to produce a scatter plot that shows the relationship between the number of touchdowns thrown in one season and the corresponding quarterback rating for the given sample of NFL quarterbacks.

Solution
In this case, we would expect that the number of touchdowns thrown by a quarterback does influence that quarterback's rating, since the number of touchdowns is one of many factors used to determine the quarterback rating. Hence, the logical way to label the axes is to place the number of passing touchdowns on the x-axis and the quarterback rating on the y-axis.

[Scatter plot: quarterback rating versus number of passing touchdowns]
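The construction described above (one unconnected point per data pair, explanatory variable on the horizontal axis) can be sketched in a few lines of Python. This is an illustrative sketch that is not part of the original slides: the function name `text_scatter` and the crude text rendering are assumptions made here to avoid any dependency, and a plotting library would normally be used instead.

```python
# Table 12.2 data: passing touchdowns (x) and quarterback rating (y).
touchdowns = [46, 18, 27, 31, 45, 13, 17, 29, 12, 39]
ratings = [110.6, 84.9, 88.7, 102.5, 122.5, 85.7, 90.7, 92.9, 72.9, 105.6]


def text_scatter(xs, ys, height=12):
    """Return the lines of a rough text scatter plot.

    Assumes integer x values: each integer x gets its own column, and
    y values are binned into `height` rows. Points are never connected.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    width = x_max - x_min + 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = x - x_min
        # Row 0 is the top of the plot, so larger y values get smaller rows.
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    # Left border stands in for the y-axis, the last line for the x-axis.
    return ["|" + "".join(cells) for cells in grid] + ["+" + "-" * width]


plot_rows = text_scatter(touchdowns, ratings)
print("\n".join(plot_rows))
```

Reading the printed grid from left to right, the asterisks drift upward, which is the pattern the example goes on to describe.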

Notice that the points tend to go up from left to right and fall close to a straight line. This pattern can be described as a linear pattern with a positive slope.

Two variables have a linear relationship when the points in their scatter plot roughly follow a straight-line pattern. When the points do roughly follow a straight line, the direction of the pattern tells how the variables respond to each other. A positive slope indicates that as the values of one variable increase, so do the values of the other variable. This type of relationship between two variables is called a positive linear relationship.

A negative slope indicates that as the values of one variable increase, the values of the other variable decrease. This type of relationship between two variables is called a negative linear relationship.

Example 12.3: Determining Whether a Scatter Plot Would Have a Positive Slope, Negative Slope, or Not Follow a Straight-Line Pattern

Determine whether the points in a scatter plot for the two variables are likely to have a positive slope, negative slope, or not follow a straight-line pattern.
a. The number of hours you study for an exam and the score you make on that exam
b. The price of a used car and the number of miles on the odometer
c. The pressure on a gas pedal and the speed of the car
d. Shoe size and IQ for adults

Solution
a. As the number of hours you study for an exam increases, the score you receive on that exam is usually higher. Thus, the scatter plot would have a positive slope.
b. As the number of miles on the odometer of a used car increases, the price usually decreases. Thus, the scatter plot would have a negative slope.
c. The more you push on the gas pedal, the faster the car will go. Thus, the scatter plot would have a positive slope.
d. Common sense suggests that there is not a relationship, linear or otherwise, between a person's IQ and his or her shoe size. Thus, the scatter plot would not follow a straight-line pattern.

The Pearson correlation coefficient, $\rho$, is the parameter that measures the strength of a linear relationship between two quantitative variables in a population. The correlation coefficient for a sample is denoted by r. It always takes a value between -1 and 1, inclusive.

$$-1 \le r \le 1$$

If r is positive, then the scatter plot has a positive slope and the variables are said to have a positive linear relationship. If r is negative, then the scatter plot has a negative slope and the variables are said to have a negative linear relationship. If r = 0, then no linear relationship exists between the two variables.

If r = 1, then the data fall in a perfectly straight line with a positive slope. If r = -1, then the data fall in a perfectly straight line with a negative slope.

If two variables have a strong positive or negative relationship, we say that the two variables are correlated. The strength of the correlation is expressed by |r|. The larger |r| is, the stronger the correlation.

Pearson Correlation Coefficient

The Pearson correlation coefficient for paired data from a sample is given by

$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$

where n is the number of data pairs in the sample, $x_i$ is the ith value of the explanatory variable, and $y_i$ is the ith value of the response variable.

Example 12.4: Calculating the Correlation Coefficient Using a TI-83/84 Plus Calculator

Calculate the correlation coefficient, r, for the data from Table 12.2 relating touchdowns thrown and base salaries.

Solution
The data we need from Table 12.2 are reproduced in the following table.

NFL Quarterbacks
Number of Passing Touchdowns   Base Salary (in Millions of Dollars)
46                             3.0
18                             12.5
27                             10.2
31                             0.825
45                             8.0
13                             7.7
17                             5.0
29                             1.75
12                             2.1
39                             0.95
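As a cross-check on the calculator, the sums in the Pearson formula can be accumulated directly. The following is a minimal Python sketch that is not part of the original slides; the function name `pearson_r` and the list names are illustrative assumptions.

```python
from math import sqrt

# Example 12.4 data from Table 12.2.
touchdowns = [46, 18, 27, 31, 45, 13, 17, 29, 12, 39]                     # x
base_salary = [3.0, 12.5, 10.2, 0.825, 8.0, 7.7, 5.0, 1.75, 2.1, 0.95]    # y, millions of dollars


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient via the computational formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = pearson_r(touchdowns, base_salary)
print(f"r = {r:.3f}")  # prints "r = -0.251"
```

Passing the quarterback ratings in place of `base_salary` gives the coefficient for the other pair of variables in Table 12.2.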

Let's enter these data into our calculator.
o Press STAT.
o Select option 1:Edit.
o Enter the values for number of touchdowns (x) in L1 and the values for base salary (y) in L2.
o Press STAT.
o Choose CALC.
o Choose option 4:LinReg(ax+b).
o Press ENTER twice.

From the scatter plot, we would expect r to be close to 0. The calculator confirms that the correlation coefficient for these two variables is $r \approx -0.251$, indicating a weak negative relationship, if any relationship exists at all.

Example 12.5: Calculating the Correlation Coefficient Using a TI-83/84 Plus Calculator

Calculate the correlation coefficient, r, for the data from Table 12.2 relating touchdowns thrown and quarterback ratings.

Solution

The data we need from Table 12.2 are reproduced in the following table.

NFL Quarterbacks
Number of Passing Touchdowns   Quarterback Rating
46                             110.6
18                             84.9
27                             88.7
31                             102.5
45                             122.5
13                             85.7
17                             90.7
29                             92.9
12                             72.9
39                             105.6

Let's enter these data into our calculator.
o Press STAT.
o Select option 1:Edit.
o Enter the values for number of touchdowns (x) in L1 and the values for quarterback rating (y) in L2.
o Press STAT.
o Choose CALC.
o Choose option 4:LinReg(ax+b).
o Press ENTER twice.

From the scatter plot, we would expect r to be close to 1. The calculator confirms that the correlation coefficient for these two variables is $r \approx 0.925$. Since the value is close to 1, this indicates a very strong positive correlation between the variables.

Testing the Correlation Coefficient for Significance

Using Critical Values of the Pearson Correlation Coefficient to Determine the Significance of a Linear Relationship
A sample correlation coefficient, r, is statistically significant if $|r| > r_\alpha$, where $r_\alpha$ is the critical value for the chosen level of significance $\alpha$ and the sample size n.
Note: The term statistically significant for the Pearson correlation coefficient indicates that a significant linear relationship exists between the two variables.

Example 12.6: Using a Table of Critical Values to Determine Significance of a Linear Relationship

Use the critical values in Table I to determine if the correlation between the number of passing touchdowns and base salary from Example 12.4 is statistically significant. Use a 0.05 level of significance.

Solution
Begin by finding the critical value for $\alpha = 0.05$ with n = 10 in Table I. Find the value in the table where the row for n = 10 intersects the column for $\alpha = 0.05$.

n     $\alpha = 0.05$   $\alpha = 0.01$
6     0.811             0.917
7     0.754             0.875
8     0.707             0.834
9     0.666             0.798
10    0.632             0.765
11    0.602             0.735
12    0.576             0.708

Thus, $r_\alpha = 0.632$. Comparing this critical value to the absolute value of the correlation coefficient we found for the data in Example 12.4, we have 0.251 < 0.632, and thus $|r| < r_\alpha$. Therefore, the linear relationship between the variables is not statistically significant at the 0.05 level of significance. Thus, we do not have sufficient evidence, at the 0.05 level of significance, to conclude that a linear relationship exists between the number of passing touchdowns during the 2011-2012 season and the 2012 base salary of an NFL quarterback.
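The table lookup and comparison in Example 12.6 can be sketched in Python as follows. This is an illustration rather than part of the original slides: the dictionary `CRITICAL_R` holds only the excerpt of Table I shown in the example, and the name `is_significant` is an assumption. The second call applies the same rule to the coefficient from Example 12.5, which the slides do not test; it is included only to show the opposite outcome.

```python
# Critical values of the Pearson correlation coefficient, copied from the
# excerpt of Table I in Example 12.6: {n: {alpha: r_alpha}}.
CRITICAL_R = {
    6: {0.05: 0.811, 0.01: 0.917},
    7: {0.05: 0.754, 0.01: 0.875},
    8: {0.05: 0.707, 0.01: 0.834},
    9: {0.05: 0.666, 0.01: 0.798},
    10: {0.05: 0.632, 0.01: 0.765},
    11: {0.05: 0.602, 0.01: 0.735},
    12: {0.05: 0.576, 0.01: 0.708},
}


def is_significant(r, n, alpha):
    """Return True when |r| exceeds the tabulated critical value r_alpha."""
    try:
        r_alpha = CRITICAL_R[n][alpha]
    except KeyError:
        raise ValueError("n or alpha is outside the excerpted table") from None
    return abs(r) > r_alpha


# Example 12.6: touchdowns versus base salary, n = 10.
print(is_significant(-0.251, 10, 0.05))  # False
# Illustration only: touchdowns versus quarterback rating (Example 12.5).
print(is_significant(0.925, 10, 0.05))   # True
```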

Testing the Correlation Coefficient for Significance Using Hypothesis Testing

Testing Linear Relationships for Significance

Significant Linear Relationship (Two-Tailed Test)
$H_0: \rho = 0$ (Implies that there is no significant linear relationship)
$H_a: \rho \ne 0$ (Implies that there is a significant linear relationship)

Significant Negative Linear Relationship (Left-Tailed Test)
$H_0: \rho \ge 0$ (Implies that there is no significant negative linear relationship)
$H_a: \rho < 0$ (Implies that there is a significant negative linear relationship)

Significant Positive Linear Relationship (Right-Tailed Test)
$H_0: \rho \le 0$ (Implies that there is no significant positive linear relationship)
$H_a: \rho > 0$ (Implies that there is a significant positive linear relationship)

Test Statistic for a Hypothesis Test for a Correlation Coefficient
The test statistic for testing the significance of the correlation coefficient is given by

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

where r is the sample correlation coefficient and n is the number of data pairs in the sample. The number of degrees of freedom for the t-distribution of the test statistic is given by n - 2.
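For hand calculation it can help to move n - 2 out of the nested fraction. The rewriting below is an algebraic identity added here for convenience, not a formula from the original slides; it holds whenever n > 2 and |r| < 1.

```latex
t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}
  = \frac{r\,\sqrt{n - 2}}{\sqrt{1 - r^2}},
\qquad n > 2,\ |r| < 1 .
```

Both forms give the same value of t, and its sign always matches the sign of r.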

Rejection Regions for Testing Linear Relationships

Significant Linear Relationship (Two-Tailed Test)
Reject the null hypothesis, $H_0$, if $|t| \ge t_{\alpha/2}$.

Significant Negative Linear Relationship (Left-Tailed Test)
Reject the null hypothesis, $H_0$, if $t \le -t_{\alpha}$.

Significant Positive Linear Relationship (Right-Tailed Test)
Reject the null hypothesis, $H_0$, if $t \ge t_{\alpha}$.

Example 12.7: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant

Use a hypothesis test to determine if the linear relationship between the number of parking tickets a student receives during a semester and his or her GPA during the same semester is statistically significant at the 0.05 level of significance. Refer to the data presented in the following table.

GPA and Number of Parking Tickets
Number of Tickets   0    0    0    0    1    1    1    2    2    2    3    3    5    7    8
GPA                 3.6  3.9  2.4  3.1  3.5  4.0  3.6  2.8  3.0  2.2  3.9  3.1  2.1  2.8  1.7

Solution

Step 1: State the null and alternative hypotheses.
We wish to test the claim that a significant linear relationship exists between the number of parking tickets a student receives during a semester and his or her GPA during the same semester. Thus, the hypotheses are stated as follows.
$H_0: \rho = 0$
$H_a: \rho \ne 0$

Step 2: Determine which distribution to use for the test statistic, and state the level of significance.
We will use the t-test statistic presented previously in this section along with a significance level of $\alpha = 0.05$ to perform this hypothesis test.

Step 3: Gather data and calculate the necessary sample statistics.
We need to begin by calculating the correlation coefficient, r.

Since it is possible to argue for either of these two variables affecting the other, let's assign the number of tickets to be our explanatory variable (x), and thus the GPA as the response variable (y). Using a TI-83/84 Plus calculator, enter the values for the numbers of tickets (x) in L1 and the values for the GPAs (y) in L2. Then press STAT and choose CALC and option 4:LinReg(ax+b). Press ENTER twice. We get $r \approx -0.586619$ from the calculator and we know that n = 15.

Note that we rounded r to six decimal places, rather than three decimal places, to avoid additional rounding error in the following calculation of the test statistic. Substituting these values into the formula for the t-test statistic yields the following.

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.586619}{\sqrt{\dfrac{1 - (-0.586619)^2}{15 - 2}}} \approx -2.612$$

Step 4: Draw a conclusion and interpret the decision.
We will use rejection regions in this example to draw the conclusion. Since the sample size for this example is 15, the number of degrees of freedom is n - 2 = 15 - 2 = 13. Using the t-distribution table or appropriate technology, we find the critical value for this test, $t_{\alpha/2} = t_{0.05/2} = t_{0.025} = 2.160$. So we will reject the null hypothesis, $H_0$, if $|t| \ge 2.160$.

Since $t \approx -2.612$ and $|-2.612| \ge 2.160$, the test statistic falls in the rejection region. Thus, we reject the null hypothesis. Therefore, there is sufficient evidence at the 0.05 level of significance to support the claim that there is a significant linear relationship between the number of parking tickets a student receives during a semester and his or her GPA during the same semester.

Example 12.8: Performing a Hypothesis Test to Determine if the Linear Relationship between Two Variables Is Significant

An online retailer wants to research the effectiveness of its mail-out catalogs. The company collects data from its eight largest markets with respect to the number of catalogs (in thousands) that were mailed out one fiscal year versus sales (in thousands of dollars) for that year. The results are as follows.

Number of Catalogs Mailed and Sales
Number of Catalogs (in Thousands)   2     3     3     3     4     4     5     6
Sales (in Thousands)                $126  $98   $255  $394  $107  $122  $334  $403

Use a hypothesis test to determine if the linear relationship between the number of catalogs mailed out and sales is statistically significant at the 0.01 level of significance.

Solution

Step 1: State the null and alternative hypotheses.

We wish to test the claim that a significant linear relationship exists between the number of catalogs mailed out and the corresponding sales for that area.
$H_0: \rho = 0$
$H_a: \rho \ne 0$

Step 2: Determine which distribution to use for the test statistic and state the level of significance.
We will use the t-test statistic with the given level of significance, $\alpha = 0.01$.

Step 3: Gather data and calculate the necessary sample statistics.
We first need to calculate the correlation coefficient, r. It is reasonable to infer that mailing a larger number of catalogs to a region will influence the number of sales in that region. Thus, the explanatory variable (x) will be the number of catalogs and the response variable (y) will be the sales. Using a TI-83/84 Plus calculator, enter the values for the numbers of catalogs mailed (x) in L1 and the sales values (y) in L2. Then press STAT and choose CALC and option 4:LinReg(ax+b). Press ENTER twice. From the calculator we see that $r \approx 0.504505$, and we know that n = 8.

Note that we rounded r to six decimal places, rather than three decimal places, to avoid additional rounding error in the following calculation of the test statistic. Substituting these values into the equation for the test statistic, we have the following.

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.504505}{\sqrt{\dfrac{1 - (0.504505)^2}{8 - 2}}} \approx 1.431$$

Step 4: Draw a conclusion and interpret the decision.
We will use rejection regions to draw the conclusion. Since the sample size for this example is 8, the number of degrees of freedom is n - 2 = 8 - 2 = 6. Using the t-distribution table or appropriate technology, we find the critical value for this test, $t_{\alpha/2} = t_{0.01/2} = t_{0.005} = 3.707$. So, we will reject the null hypothesis, $H_0$, if $|t| \ge 3.707$.

Since the value of the test statistic, $t \approx 1.431$, is less than the critical value, $t_{\alpha/2} = 3.707$, we fail to reject the null hypothesis. Hence, there is not enough evidence at the 0.01 level of significance to support the claim that there is a significant linear relationship between the number of catalogs distributed in a particular area and the corresponding sales in that area.
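The test-statistic calculation and two-tailed decision used in Examples 12.7 and 12.8 can be sketched in Python. This is a hedged illustration, not part of the original slides: the function names are assumptions, and the critical values 2.160 and 3.707 are taken from the examples rather than computed, since the standard library has no t-distribution quantile function.

```python
from math import sqrt


def t_statistic(r, n):
    """Test statistic t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("requires n > 2 and |r| < 1")
    return r / sqrt((1 - r ** 2) / (n - 2))


def reject_two_tailed(t, t_critical):
    """Two-tailed rejection rule: reject H0 when |t| >= t_(alpha/2)."""
    return abs(t) >= t_critical


# Example 12.7: r = -0.586619, n = 15, t_0.025 = 2.160 (13 degrees of freedom).
t_tickets = t_statistic(-0.586619, 15)
# Example 12.8: r = 0.504505, n = 8, t_0.005 = 3.707 (6 degrees of freedom).
t_catalogs = t_statistic(0.504505, 8)

print(round(t_tickets, 3), reject_two_tailed(t_tickets, 2.160))    # -2.612 True
print(round(t_catalogs, 3), reject_two_tailed(t_catalogs, 3.707))  # 1.431 False
```

Keeping the critical value as an explicit argument mirrors the slides, where it is looked up separately for each level of significance and number of degrees of freedom.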

Interpreting Statistical Significance

Many times we are tempted to say that, because the linear relationship between two variables is statistically significant, a change in one variable must cause a change in the other. However, the world is much more complicated than that. When we do have a correlation between two variables that is statistically significant, one of at least four different things may actually be going on, as illustrated in the figure on the following slide.

Suppose there is a correlation between two variables, x and y. The natural assumption is that x causes y. However, the direction may be reversed: in reality, y may be causing x to change. On the other hand, we might have a combination of factors influencing the response and/or explanatory variables. The x-variable may indeed be influencing the y-variable, but a third variable, z, might also be influencing them both.

In addition, it could be the case that, even though a correlation exists between the two variables, one is not influencing the other at all.

Coefficient of Determination

The coefficient of determination, $r^2$, is a measure of the proportion of the variation in the response variable (y) that can be associated with the variation in the explanatory variable (x).

Example 12.9: Calculating and Interpreting the Coefficient of Determination

If the correlation coefficient for the relationship between the numbers of rooms in houses and their prices is r = 0.65, how much of the variation in house prices can be associated with the variation in the numbers of rooms in the houses?

Solution
Recall that the coefficient of determination tells us the proportion of the variation in the response variable (house price) that is associated with the variation in the explanatory variable (number of rooms).

Thus, the coefficient of determination for the relationship between the numbers of rooms in houses and their prices will tell us the proportion or percentage of the variation in house prices that can be associated with the variation in the numbers of rooms in the houses. Also, recall that the coefficient of determination is equal to the square of the correlation coefficient.

Since we know that the correlation coefficient for these data is r = 0.65, we can calculate the coefficient of determination as $r^2 = (0.65)^2 = 0.4225$. Thus, approximately 42.3% of the variation in house prices can be associated with the variation in the numbers of rooms in the houses.
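The arithmetic in Example 12.9 can be reproduced with a short Python sketch, which is not part of the original slides. The use of `decimal.Decimal` is a choice made here: the percentage 42.25 sits exactly on a rounding tie, and decimal arithmetic with half-up rounding avoids the binary floating-point ambiguity that could otherwise tip it either way.

```python
from decimal import Decimal, ROUND_HALF_UP

# Correlation coefficient from Example 12.9, entered as an exact decimal.
r = Decimal("0.65")

# The coefficient of determination is the square of r.
r_squared = r * r

# Express it as a percentage rounded to one decimal place (ties round up).
percent = (r_squared * 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(r_squared)  # 0.4225
print(percent)    # 42.3
```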