DA6. Scatter Plot and Correlation

Statement

Identify two variables from your field of interest. Find the data associated with those two variables or make up some data.

  1. Briefly explain the variables you have selected and the reason of your selection.
  2. Explain the significance of using a scatter plot for your data.
  3. How is a scatter plot interrelated to correlation? Provide relevant examples.

Answer

1. Briefly explain the variables you have selected and the reason of your selection

The dataset, borrowed from Ludgerus Darell (2019), contains two variables: SAT and GPA. The SAT variable is the student's score on the SAT exam, while the GPA variable is the student's grade point average in college. The dataset contains 84 observations.

I expect the two variables to be related; that is, if we know only a student's GPA, we can predict their SAT score, and vice versa. This relationship matters in education because it can help SAT tutors group students more effectively, prepare better exam questions, and plan better for the future.

2. Explain the significance of using a scatter plot for your data

Scatter plots provide a basic visual representation of the relationship between two variables. They are useful for identifying patterns and trends, checking the conditions or assumptions of linear regression, and spotting outliers and influential points.

A quick glance at the scatter plot can encourage or discourage further regression analysis before resources are wasted on a model that may not suit the data. If the scatter plot shows a linear trend, we can proceed with linear regression. However, if it shows a non-linear trend or repetitive patterns, or if the residuals are not normally distributed or their variability is not constant, we may need to consider other regression models (OpenIntroOrg, 2014).
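As a rough illustration of why a non-linear trend is a warning sign, the sketch below fits a straight line to deliberately curved, made-up data and inspects the residuals. The data and the helper names `fit_line` and `residuals` are hypothetical and are not part of the SAT/GPA analysis.

```python
def fit_line(xs, ys):
    """Return the least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed value minus fitted value for each observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical curved data: y = x squared (made up for illustration only).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)

# The residuals are positive at both ends and negative in the middle.
print([round(r, 3) for r in res])
```

A U-shaped run of residuals like this one suggests that a straight line is missing curvature in the data, which is the kind of pattern a scatter plot usually reveals before any model is fitted.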

Scatter plots also make it easy to identify outliers and their influence on the regression line. It is important to examine all outliers carefully before doing any analysis, as they can skew the results greatly. Outliers should be investigated, then corrected if they are due to entry errors or removed if they are due to measurement errors. If we cannot explain them and think they may distort the results, we should stop the analysis altogether.
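To give a sense of how much a single outlier can pull the regression line, here is a minimal sketch that fits a line with and without one suspicious point. All values are made up for illustration (they are not taken from the dataset), and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Return the least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical GPA/SAT pairs lying exactly on SAT = 1000 + 250 * GPA.
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
sat = [1500, 1625, 1750, 1875, 2000]
clean_b0, clean_b1 = fit_line(gpa, sat)

# Add one hypothetical entry error: a high GPA paired with a very low SAT.
gpa_with_outlier = gpa + [3.9]
sat_with_outlier = sat + [1000]
out_b0, out_b1 = fit_line(gpa_with_outlier, sat_with_outlier)

print("slope without outlier:", round(clean_b1, 1))
print("slope with outlier:   ", round(out_b1, 1))
```

In this toy example one bad point out of six flattens the slope dramatically, which is why each outlier deserves a look before the model is trusted.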

3. How is a scatter plot interrelated to correlation? Provide relevant examples

Image 1: Scatter plot of SAT and GPA scores
Image 2: Scatter plot of SAT and GPA scores with the density of each variable included

Image 1 shows the scatter plot of the SAT and GPA scores, and Image 2 shows the same scatter plot with the density of each variable included. As GPA increases, SAT also tends to increase. The scatter plot shows a positive linear relationship between the SAT and GPA scores, and the points lie reasonably close to the line of best fit with no obvious outliers. The density curves suggest that both variables are approximately normally distributed.

Image 3: Linear regression summary

Image 3 captures the summary of the regression analysis. The intercept \(b_0 = 1028.641\) and the slope \(b_1 = 245.218\) are the estimates of the population parameters \(\beta_0\) and \(\beta_1\). The correlation coefficient \(r = 0.637\) indicates a moderately strong positive linear relationship between the SAT and GPA scores, and the final model SAT = 1028.641 + 245.218 * GPA can be used to predict a student's SAT score from their GPA.
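The fitted model can be applied directly, as in the sketch below. The coefficients and r are the values reported above; the function name `predict_sat` and the example GPA of 3.0 are assumptions made for illustration. For a simple linear regression with one predictor, the squared correlation equals the share of variance explained, so it is computed as well.

```python
# Values reported in the regression summary.
B0 = 1028.641  # intercept
B1 = 245.218   # slope (change in predicted SAT per one-point change in GPA)
R = 0.637      # correlation coefficient


def predict_sat(gpa):
    """Predicted SAT score for a given GPA under the fitted linear model."""
    return B0 + B1 * gpa


# Share of SAT variance explained by GPA (R squared for one predictor).
r_squared = R ** 2

example_gpa = 3.0  # hypothetical student, for illustration only
print("predicted SAT:", round(predict_sat(example_gpa), 3))
print("r squared:", round(r_squared, 3))
```

An r squared of roughly 0.41 means GPA accounts for about 41% of the variation in SAT scores in this sample, so individual predictions can still miss by a wide margin.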

Generally speaking, a scatter plot is related to correlation in the following ways (Diez, Barr, & Çetinkaya-Rundel, 2019):

  • Correlation is always between -1 and 1.
  • Points that follow an upward-sloping line have a positive correlation, and points that follow a downward-sloping line have a negative correlation.
  • As observations move closer to the line, the residuals shrink and the correlation becomes stronger; that is, it moves toward 1 or -1.
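The properties listed above can be checked numerically. The sketch below computes Pearson's correlation coefficient from its definition for three small, made-up datasets; the data and the helper name `pearson_r` are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]

# Made-up data for illustration only.
upward_exact = [2, 4, 6, 8, 10]    # every point on an upward-sloping line
downward_exact = [10, 8, 6, 4, 2]  # every point on a downward-sloping line
upward_noisy = [2, 5, 5, 9, 9]     # upward trend with some scatter

r_up = pearson_r(x, upward_exact)
r_down = pearson_r(x, downward_exact)
r_noisy = pearson_r(x, upward_noisy)

print(round(r_up, 3), round(r_down, 3), round(r_noisy, 3))
```

The two exact lines sit at the extremes of the allowed range, with the sign matching the direction of the slope, while the noisier upward trend gives a positive value strictly below 1.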

References