DA6. Scatter Plot and Correlation
Statement
Identify two variables from your field of interest. Find the data associated with those two variables or make up some data.
- Briefly explain the variables you have selected and the reason of your selection.
- Explain the significance of using a scatter plot for your data.
- How is a scatter plot interrelated to correlation? Provide relevant examples.
Answer
1. Briefly explain the variables you have selected and the reason of your selection
The dataset, borrowed from Ludgerus Darell (2019), contains two variables: SAT and GPA. The SAT variable represents the student's score on the SAT exam, while the GPA variable represents the student's grade point average in college. The dataset contains 84 observations.
I expect a relationship between the two variables; that is, knowing a student's GPA should help predict their SAT score, and vice versa. This relationship matters in the field of education because it can help SAT tutors group students more effectively, prepare better exam questions, and plan better for the future.
2. Explain the significance of using a scatter plot for your data
Scatter plots provide a basic visual representation of the relationship between two variables. They are useful for identifying patterns and trends, checking the conditions or assumptions of linear regression, and spotting outliers and influential points.
A quick glance at the scatter plot can encourage or discourage further regression analysis before resources are spent on a model that may not suit the data. If the scatter plot shows a linear trend, we can proceed with linear regression. However, if it shows a non-linear trend or repetitive patterns, or if the residuals are not nearly normal or their variability is not constant, we may need to consider other regression models (OpenIntroOrg, 2014).
Scatter plots also make it easy to spot outliers and judge their influence on the regression line. It is important to examine all outliers carefully before doing any analysis, as they can skew the results considerably. Outliers should be investigated: corrected if they are due to data-entry errors, and removed if they are due to measurement errors. If genuine outliers are likely to distort the results, the analysis itself may need to be reconsidered or stopped.
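The outlier check described above can also be sketched numerically. The following is a minimal illustration, not the method used for the SAT/GPA analysis: it fits a least-squares line and flags points whose residual is larger than two residual standard deviations. The function names `fit_line` and `flag_outliers`, the threshold of 2, and the data values are all assumptions made up for this example.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def flag_outliers(xs, ys, threshold=2.0):
    """Indices of points whose |residual| exceeds threshold * residual SD.

    The threshold of 2 is a rule of thumb assumed for this sketch.
    """
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = sqrt(sum(r ** 2 for r in residuals) / (len(residuals) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

# Made-up values for illustration only (not the Kaggle dataset).
# The last point is deliberately placed far below the trend.
gpa = [2.4, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 3.1]
sat = [1620, 1720, 1770, 1820, 1870, 1920, 1970, 1300]

print(flag_outliers(gpa, sat))
```

A flagged point is only a candidate for investigation; whether it is corrected, removed, or kept still depends on why it is unusual.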
3. How is a scatter plot interrelated to correlation? Provide relevant examples
Image 1: Scatter plot of SAT and GPA scores
Image 2: Scatter plot of SAT and GPA scores with the density of each variable included
Image 1 shows the scatter plot of the SAT and GPA scores, and Image 2 shows the same scatter plot with the density of each variable included. We notice that as GPA increases, SAT also tends to increase. The scatter plot shows a positive linear relationship between the SAT and GPA scores, and the points cluster reasonably around the line of best fit with no obvious outliers. The density curves suggest that both variables are approximately normally distributed.
Image 3: Linear regression summary
After doing the analysis, the summary is captured in Image 3. The intercept \(b_0 = 1028.641\) and the slope \(b_1 = 245.218\) are the estimates of the population parameters \(\beta_0\) and \(\beta_1\). The correlation coefficient \(r = 0.637\) indicates a moderately strong positive linear relationship between the SAT and GPA scores, and the final model SAT = 1028.641 + 245.218 * GPA can be used to predict the SAT score of a student given their GPA score.
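As a small worked example, the fitted model can be applied directly. The sketch below only reuses the coefficients reported in the regression summary; the function name `predict_sat` and the GPA value of 3.0 are hypothetical choices for illustration. It also squares the correlation coefficient, which in simple linear regression gives the share of variability in SAT that the model accounts for.

```python
# Values taken from the regression summary described in the text.
B0 = 1028.641  # intercept
B1 = 245.218   # slope: expected change in SAT per one-point change in GPA
R = 0.637      # correlation coefficient

def predict_sat(gpa):
    """Predicted SAT score from the fitted line SAT = B0 + B1 * GPA."""
    return B0 + B1 * gpa

# In simple linear regression, R squared equals the square of r.
r_squared = R ** 2

print(round(predict_sat(3.0), 3))  # 1764.295
print(round(r_squared, 3))         # 0.406
```

Under this model, GPA accounts for roughly 41% of the variability in SAT scores, which is consistent with points that follow the line but still show noticeable scatter around it.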
Generally speaking, the scatter plot is interrelated to correlation in the following ways (Diez, Barr, & Çetinkaya-Rundel, 2019):
- Correlation is always between -1 and 1.
- Points that follow an upward-sloping line have a positive correlation, and points that follow a downward-sloping line have a negative correlation.
- As observations move closer to the line, the residuals shrink and the correlation strengthens; that is, it moves closer to 1 or -1.
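These three properties can be checked with a short sketch. The function `pearson_r` below is a plain implementation of the standard Pearson correlation formula, and the three small data sets are made up for illustration: one tightly follows an upward line, one follows the same trend with more scatter, and one is the tight set mirrored downward.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up values for illustration only.
x = [1, 2, 3, 4, 5, 6]
tight_up = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # close to the line y = 2x
loose_up = [4.0, 1.0, 8.0, 5.0, 12.0, 9.0]   # same trend, more scatter
tight_down = [-y for y in tight_up]          # mirrored: downward slope

r_tight = pearson_r(x, tight_up)
r_loose = pearson_r(x, loose_up)
r_down = pearson_r(x, tight_down)

print(r_tight, r_loose, r_down)
```

The tight upward set gives a value close to 1, the scattered set gives a smaller positive value, and the mirrored set gives the same magnitude as the tight set with a negative sign.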
References
- Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). Openintro statistics - Fourth edition. Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf
- Ludgerus Darell. (2019). 1.01. Simple linear regression.csv. Kaggle.com. https://www.kaggle.com/datasets/luddarell/101-simple-linear-regressioncsv/data.
- OpenIntroOrg. (2014, January 27). Fitting a line with least squares regression [Video]. YouTube. https://youtu.be/z8DmwG2G4Qc