
Introduction to R for social scientists
R basics: statistical graphs and analysis
Dr. Elze G. Ufkes
Benjamin Ziepert
Instructions
This handout shows and explains the code used in the lecture. You can use it to run the code on your own computer. Copying the code from the gray blocks and running it should produce the same results.
1 Installing and activating packages
Install a package before you use it for the first time. You only have to install each package once.
install.packages("tidyverse")
Then activate the package each time you start a new session.
library("tidyverse")
Tip: you can also use library() to check whether a package is installed correctly. For instance, run library("tidyverse") twice. If no message is displayed the second time, the package is installed correctly.
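The install-once, load-every-session routine can also be scripted. The following is a minimal sketch and not part of the lecture code: the helper name missing_packages is hypothetical, and it relies on base R's requireNamespace(), which returns FALSE instead of stopping with an error when a package is not installed.

```r
# Hypothetical helper (not from the lecture): return the names of the
# packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this handout
to_install <- missing_packages(c("tidyverse", "Hmisc"))
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left commented out so that running the block only reports what is missing and does not change your R library.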
2 Graphics
2.1 Activate the graphics package ggplot2
ggplot2 is a graphics package and is part of the tidyverse. ggplot2 was downloaded together with tidyverse, so we don't need to download it again.
library("ggplot2")
Open the data frame mpg. This is part of the ggplot2 package you just installed and activated.
mpg
# # A tibble: 234 x 11
# manufacturer model displ year cyl trans drv cty hwy fl class
# <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
# 1 audi a4 1.80 1999 4 auto(l5) f 18 29 p compact
# 2 audi a4 1.80 1999 4 manual(m5) f 21 29 p compact
# 3 audi a4 2.00 2008 4 manual(m6) f 20 31 p compact
# 4 audi a4 2.00 2008 4 auto(av) f 21 30 p compact
# 5 audi a4 2.80 1999 6 auto(l5) f 16 26 p compact
# 6 audi a4 2.80 1999 6 manual(m5) f 18 26 p compact
# 7 audi a4 3.10 2008 6 auto(av) f 18 27 p compact
# 8 audi a4 quattro 1.80 1999 4 manual(m5) 4 18 26 p compact
# 9 audi a4 quattro 1.80 1999 4 auto(l5) 4 16 25 p compact
# 10 audi a4 quattro 2.00 2008 4 manual(m6) 4 20 28 p compact
# # ... with 224 more rows
mpg is a data set with fuel economy data from 1999 and 2008 for 38 popular car models.
2.2 Histogram
With ggplot2 we can visualize the data. For instance we can assess the frequency of engine sizes with a histogram. The column displ is the engine displacement in liters, a common indicator for engine size.
ggplot(data = mpg) +
aes(displ) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
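The message above appears because geom_histogram() falls back to 30 bins when no bin width is given. A minimal sketch of setting the width explicitly follows. The value 0.5 (half a liter per bar) is an assumption chosen for illustration, not a value from the lecture.

```r
library("ggplot2")

# binwidth = 0.5 is an illustrative choice: each bar spans half a liter
p <- ggplot(data = mpg) +
  aes(x = displ) +
  geom_histogram(binwidth = 0.5)

p  # draws the histogram without the `stat_bin()` message
```

Try a few different values: a bin width that is too large hides the shape of the distribution, while one that is too small makes the plot noisy.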
2.3 Update labels and color
To make the histogram more readable we can change labels and colors.
ggplot(data = mpg) +
aes(x = displ) +
geom_histogram(fill = 'darkgreen') +
labs(title = "Histogram of engine displacement",
x = "Engine displacement in liters",
y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
2.4 Create a scatter plot
With a scatter plot we can get an idea of whether larger engines are more or less efficient on the highway. The column hwy gives the miles per gallon achieved on the highway.
ggplot(data = mpg) +
aes(x = displ, y = hwy) +
geom_point()
2.5 Adding more aesthetic mappings
To see how the different car classes perform, we can give each class its own color.
ggplot(data = mpg) +
aes(x = displ, y = hwy, color = class) +
geom_point()
2.6 Adding a regression line
In statistics we are often interested in linear relationships. To show one, we can add a regression line to the scatter plot.
ggplot(data = mpg) +
aes(x = displ, y = hwy) +
geom_point() +
geom_smooth(method=lm)
What does this graph tell us?
2.7 More graphics
You can find more info about graphics at
3 Statistics
3.1 Descriptive statistics
You can get descriptive statistics with summary().
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
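summary() reports quartiles and means but no standard deviations, and it describes the data set as a whole. As a sketch using base R functions only, the following computes the mean and standard deviation of hwy and then the mean of hwy per car class. The variable names hwy_mean, hwy_sd and hwy_by_class are chosen for this example.

```r
library("ggplot2")  # provides the mpg data set

# Mean and standard deviation of highway miles per gallon
hwy_mean <- mean(mpg$hwy)
hwy_sd <- sd(mpg$hwy)
hwy_mean
hwy_sd

# Mean highway miles per gallon for each car class
hwy_by_class <- aggregate(hwy ~ class, data = mpg, FUN = mean)
hwy_by_class
```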
3.2 Correlation
For a correlation analysis that also reports the number of observations and p-values, we use the package Hmisc.
install.packages("Hmisc")
library("Hmisc")
We can correlate engine size with highway fuel economy using rcorr() from the Hmisc package.
rcorr(x = mpg$displ, y = mpg$hwy)
## x y
## x 1.00 -0.77
## y -0.77 1.00
##
## n= 234
##
##
## P
## x y
## x 0
## y 0
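The P matrix above shows 0 because the printed p-values are rounded, not because the p-value is exactly zero. As a complementary sketch, base R's cor.test() reports the same Pearson correlation together with a confidence interval and an unrounded p-value. It is shown here as an addition to the lecture's approach, not as a replacement.

```r
library("ggplot2")  # provides the mpg data set

# Pearson correlation between engine displacement and highway mpg
ct <- cor.test(mpg$displ, mpg$hwy)
ct

ct$estimate  # the correlation coefficient
ct$conf.int  # 95 percent confidence interval
ct$p.value   # unrounded p-value
```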
3.3 Independent T-Test
To run a t-test you can use t.test(x, y).
t.test(x = mpg$displ, y = mpg$hwy)
##
## Welch Two Sample t-test
##
## data: mpg$displ and mpg$hwy
## t = -50.131, df = 254.89, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -20.75280 -19.18395
## sample estimates:
## mean of x mean of y
## 3.471795 23.440171
Tip: the t-test, ANOVA and regression are all general linear models. Therefore you could also use t.test(y ~ x), where x is a factor with two levels giving the corresponding groups.
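The formula interface from the tip can be sketched with the two model years in mpg as the grouping variable. Using year as the grouping factor is an assumption made for illustration. It is not an analysis from the lecture.

```r
library("ggplot2")  # provides the mpg data set

# factor(year) has two levels (1999 and 2008), so hwy is compared
# between cars from the two model years
tt <- t.test(hwy ~ factor(year), data = mpg)
tt

tt$estimate  # mean hwy in each of the two groups
```

With this form both groups are measured on the same variable, which is the usual setting for an independent t-test.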
3.4 One Way Anova
The code for an ANOVA is fit <- aov(y ~ x, data = mydata). For our example that would be ...
fit <- aov(hwy ~ displ, data = mpg)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## displ 1 4848 4848 329.5 <2e-16 ***
## Residuals 232 3414 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
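aov() treats a numeric predictor such as displ as a single continuous term, which is why the table above shows 1 degree of freedom for displ. When the goal is to compare group means, the predictor is usually a categorical variable. A sketch with class as the grouping variable follows. This is an illustrative choice and not from the lecture.

```r
library("ggplot2")  # provides the mpg data set

# class is a character column with 7 car classes; aov() treats it as a factor
fit_class <- aov(hwy ~ class, data = mpg)
summary(fit_class)
```

The Df column now shows the number of groups minus one for class, instead of 1.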
3.5 Multiple linear regression
The code for a regression is fit <- lm(y ~ x1 + x2 + x3, data = mydata). For our example that would be ...
fit <- lm(hwy ~ displ, data = mpg)
summary(fit)
##
## Call:
## lm(formula = hwy ~ displ, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1039 -2.1646 -0.2242 2.0589 15.0105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
## displ -3.5306 0.1945 -18.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
## F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
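The example above uses a single predictor. A regression with two predictors follows the same lm(y ~ x1 + x2, data = mydata) pattern. In the sketch below, adding cyl (the number of cylinders) as a second predictor is an assumption made for illustration. It is not a model from the lecture.

```r
library("ggplot2")  # provides the mpg data set

# Highway mpg predicted from engine displacement and number of cylinders
fit_multi <- lm(hwy ~ displ + cyl, data = mpg)
summary(fit_multi)

coef(fit_multi)  # intercept plus one slope per predictor
```

Each slope is now interpreted as the expected change in hwy for a one-unit change in that predictor while the other predictor is held constant.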
3.6 More statistics
You can find more information about statistics in R here:
- https://www.statmethods.net/stats/index.html
- Discovering Statistics Using R by Andy Field.