Introduction Into R for Social Scientists

1 Instructions

In this session you will practice common statistical methods such as correlation and regression. You will need the material from the website https://benjaminziepert.com/teaching and your knowledge from the DataCamp assignments. During the session you will create an R script file, which is required to pass this course.

You have to submit the working R script file before 5th December 2018 to pass the R lectures. To do so, please reply to my last e-mail. All steps that are in bold in the current handout have to be included in the script.

2 Research

For our analysis we will use data from experiments that studied the relationship between movement and cognition. Students were tasked with transporting supposed cocaine or flour while avoiding border guards. You can see the movement data, measured with GPS, below. The students walked three or four rounds, and after each round they filled in a questionnaire about their feelings and thoughts. For instance, we checked whether participants would walk closer together when they felt they were doing something illegal.

https://analyse-gps.com/wp-content/uploads/2018/02/smuggle1.gif

The following animation shows a participant walking four rounds from the start (left) to the finish (right). A white path means the participant transported flour; a black path means the participant transported cocaine. After passing the finish on the right, the participants answered questionnaires.

https://analyse-gps.com/wp-content/uploads/2016/10/black-and-white.gif

3 Loading the data

Please download all files for today’s session from benjaminziepert.com/teaching.

Afterwards, load the data from the file “S02D02-SPSS-Data.sav” into R and save it in the variable “data”. To do this, please follow the steps from the DataCamp course.

Remember, you have to load the package “haven” first. To load a package, you can use the library() function.
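As a minimal sketch of the loading pattern, the code below writes a small made-up data frame to a temporary .sav file and reads it back with read_sav() from “haven”. The data frame toy and the temporary file are assumptions for illustration only; for the session you would pass the path of the downloaded file instead.

```r
library(haven)

# Hypothetical toy data, only used to create a small .sav file for this sketch
toy <- data.frame(age = c(18, 21, 37), round = c(1, 2, 3))
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)

# read_sav() reads an SPSS file and returns a data frame (tibble)
data <- read_sav(tmp)
names(data)
```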

4 Exploring the data

You can check the “data” variable in the Environment view on the top right. To see more details, click the arrow next to the “data” variable. Moreover, if you click on the variable itself you can see the actual data. Alternatively, you can run the code below to view the data.

View(data)

To see all columns in the “data” data frame you can use the function names() as illustrated below.

names(data)
##  [1] "team"                    "exp"                    
##  [3] "ppn"                     "round"                  
##  [5] "sexe"                    "age"                    
##  [7] "nationality"             "cocaine"                
##  [9] "id"                      "ah_mean_kmh"            
## [11] "ah_sd_kmh"               "ah_mean_team_distance"  
## [13] "ah_mean_deviation_route" "ah_sd_deviation_route"  
## [15] "com_sat"                 "com_awc"                
## [17] "com_ssas"                "com_fright"             
## [19] "com_impulse"             "com_dt"                 
## [21] "com_awp"                 "com_hi_check"           
## [23] "sat1"                    "sat2"                   
## [25] "sat3"                    "sat4"                   
## [27] "sat5"                    "dt1"                    
## [29] "dt2"                     "dt3"

The columns starting with “com_” are the measurements (components) from the questionnaire. All items were measured on a Likert scale from 1 to 7. The columns sat1 - sat5 are the questions for the component com_sat and the columns dt1 - dt3 are the questions for com_dt.

Further, the columns starting with “ah_” are the measurements from the movement (GPS) sensors.
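Because the prefixes carry meaning, they can also be used to pick out groups of columns. The sketch below uses a made-up data frame toy with a few of the column names listed above; the values are invented for illustration.

```r
# Hypothetical toy data frame with a few of the column names from above
toy <- data.frame(
  age = c(18, 21),
  ah_mean_kmh = c(4.3, 4.9),
  ah_sd_kmh = c(0.5, 0.7),
  com_sat = c(2.4, 5.0),
  com_dt = c(1.0, 3.3)
)

# startsWith() returns TRUE for every name that begins with the prefix
questionnaire_cols <- names(toy)[startsWith(names(toy), "com_")]
movement_cols <- names(toy)[startsWith(names(toy), "ah_")]

questionnaire_cols
movement_cols

# Use the names to select only those columns
toy[, movement_cols]
```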

To see the labels of the columns you can run the following code. Please load the “Hmisc” package to make the code work.

as.matrix(label(data))
##                         [,1]                                                                        
## team                    "Team"                                                                      
## exp                     "Experiment"                                                                
## ppn                     "Participant"                                                               
## round                   "Round"                                                                     
## sexe                    "Sexe"                                                                      
## age                     "Age"                                                                       
## nationality             "Nationality"                                                               
## cocaine                 "Illegal Card Selection"                                                    
## id                      "Track ID"                                                                  
## ah_mean_kmh             "Speed"                                                                     
## ah_sd_kmh               "Speed Variation"                                                           
## ah_mean_team_distance   "Intra-Team Distance"                                                       
## ah_mean_deviation_route "Route Deviation"                                                           
## ah_sd_deviation_route   "Variation Route Deviation"                                                 
## com_sat                 "Alertness to Being Target of Guards"                                       
## com_awc                 "Cognitive Self-Regulation"                                                 
## com_ssas                "Situational Self Awareness"                                                
## com_fright              "Frightened by Presence of Guards"                                          
## com_impulse             "Suppressed Impulses to Change Movement"                                    
## com_dt                  "Contemplation of Hostile Intent"                                           
## com_awp                 "Awareness Movement Change in Presence of Guard"                            
## com_hi_check            "Hostile Intent"                                                            
## sat1                    "I had the feeling the border guard(s) targeted me"                         
## sat2                    "I thought I had attracted the border guards’ attention"                    
## sat3                    "I had a feeling that I was going to be stopped"                            
## sat4                    "I felt like I was the one being addressed by the border guard(s)"          
## sat5                    "I had the idea that the others were paying attention to me"                
## dt1                     "I was wondering whether I looked suspicious to the border guards"          
## dt2                     "I was thinking about what I had to hide from the border guards"            
## dt3                     "I was wondering whether I was doing something that I was not allowed to do"

Feel free to also inspect the file in SPSS and compare if you can find the same information in R.

5 Descriptives

To get a first understanding of the variables, please generate the descriptives of all columns in data in the same way you did in the first session.

The example below shows for instance the descriptives of age, average walking speed and “Suppressed Impulses to Change Movement”.

##       age         ah_mean_kmh      com_impulse   
##  Min.   :18.00   Min.   :0.4554   Min.   :1.000  
##  1st Qu.:20.00   1st Qu.:4.3013   1st Qu.:1.600  
##  Median :21.00   Median :4.6458   Median :2.600  
##  Mean   :21.61   Mean   :4.6010   Mean   :2.756  
##  3rd Qu.:22.00   3rd Qu.:4.9592   3rd Qu.:3.800  
##  Max.   :37.00   Max.   :6.3679   Max.   :6.600  
##                  NA's   :8        NA's   :63
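The NA's row in the output counts the missing values per column. A small sketch with made-up numbers (the vector speed is hypothetical) shows how single statistics behave when values are missing.

```r
# Hypothetical speeds with two missing values
speed <- c(4.2, 5.1, NA, 4.8, NA, 3.9)

summary(speed)              # includes the count of NA's
mean(speed)                 # NA, because missing values are not ignored by default
mean(speed, na.rm = TRUE)   # mean of the four observed values
max(speed, na.rm = TRUE)    # largest observed value
sum(is.na(speed))           # number of missing values
```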

Please answer the following questions:

  • What is the average distance of a participant to the team members?
  • How frightened was the most frightened participant?

To write an answer in your script you can use the comment indicator #. If you start a line with a #, this line will not be executed by R. Please check the example below.

# How frightened was the most frightened participant?
# The most frightened participant reported a score of ...

6 Reliability

To measure the internal reliability of our questions we can calculate Cronbach’s alpha with the function alpha() from the package “psych”.

Be aware that we also use the package “ggplot2”, and both “ggplot2” and “psych” have a function alpha(). To tell R that we want to use alpha() from “psych”, we can add “psych::” before the function name: psych::alpha().
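The same masking problem can be reproduced without any extra packages. In the sketch below a self-written function named sd() (hypothetical, for illustration only) hides the function of the same name from the “stats” package, while the prefix stats:: still reaches the original.

```r
# A self-written function that masks stats::sd()
sd <- function(x) "my own sd"

values <- c(1, 2, 3)

sd(values)          # calls the self-written function
stats::sd(values)   # calls sd() from the "stats" package

# Remove the self-written function so that sd() works as usual again
rm(sd)
```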

Since we want to measure the internal reliability of all questions from the component “Alertness to Being Target of Guards” we select the columns with the questions sat1 - sat5.

library("psych")
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
psych::alpha(data[,c("sat1", "sat2", "sat3", "sat4", "sat5")])
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("sat1", "sat2", "sat3", "sat4", "sat5")])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
##       0.89      0.88    0.89       0.6 7.6 0.011  3.7 1.7     0.64
## 
##  lower alpha upper     95% confidence boundaries
## 0.87 0.89 0.91 
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## sat1      0.84      0.84    0.83      0.56  5.2   0.0154 0.0311  0.53
## sat2      0.84      0.83    0.80      0.56  5.0   0.0161 0.0149  0.56
## sat3      0.87      0.86    0.87      0.61  6.2   0.0130 0.0392  0.59
## sat4      0.85      0.84    0.83      0.57  5.4   0.0150 0.0253  0.56
## sat5      0.91      0.91    0.90      0.72 10.1   0.0093 0.0081  0.68
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## sat1 192  0.89  0.88  0.86   0.81  4.0 2.2
## sat2 192  0.91  0.90  0.90   0.84  3.7 2.2
## sat3 192  0.82  0.82  0.74   0.71  4.0 2.0
## sat4 192  0.88  0.87  0.86   0.80  3.7 2.2
## sat5 192  0.64  0.66  0.52   0.49  3.1 1.7
## 
## Non missing response frequency for each item
##         1    2    3    4    5    6    7 miss
## sat1 0.19 0.16 0.11 0.07 0.14 0.16 0.17 0.24
## sat2 0.25 0.16 0.08 0.09 0.10 0.19 0.12 0.24
## sat3 0.17 0.10 0.15 0.13 0.12 0.21 0.12 0.24
## sat4 0.22 0.16 0.12 0.08 0.10 0.18 0.12 0.24
## sat5 0.21 0.21 0.24 0.11 0.09 0.10 0.04 0.24

We are interested in the standardized alphas “std.alpha” if an item is dropped. We do so to check how the items affect the overall reliability of a component.
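As a rough check on the output above, the standardized alpha can be computed from the number of items k and the average inter-item correlation (average_r) as k * r / (1 + (k - 1) * r). The sketch below plugs in the rounded values from the table, so the results only match up to rounding. The helper std_alpha() is a made-up name and not part of “psych”.

```r
# Standardized alpha from the number of items and the average inter-item correlation
std_alpha <- function(k, r_bar) {
  (k * r_bar) / (1 + (k - 1) * r_bar)
}

# All five items, average_r = 0.6
round(std_alpha(5, 0.60), 2)   # 0.88

# sat5 dropped: four items remain, average_r = 0.72
round(std_alpha(4, 0.72), 2)   # 0.91
```

Dropping sat5 raises the average correlation among the remaining items, which is why the standardized alpha goes up even though there are fewer items.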

When you are satisfied with the result and you don’t want to perform a factor analysis / principal component analysis, you can create the “com_sat” component with the function rowMeans(). So that we don’t overwrite the existing component, we will create a new component called “com_sat2”.

data[,"com_sat2"] <- rowMeans(data[,c("sat1", "sat2", "sat3", "sat4", "sat5")])
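It is worth knowing how rowMeans() treats missing answers. In the sketch below (the items q1 to q3 are made up), a row with one missing answer gets NA by default, whereas with na.rm = TRUE the mean of the remaining answers is used. Which behaviour is appropriate depends on your analysis plan.

```r
# Hypothetical questionnaire items; the second participant skipped q2
toy <- data.frame(
  q1 = c(1, 4, 7),
  q2 = c(2, NA, 6),
  q3 = c(3, 5, 5)
)

# Default: a row with any missing answer results in NA
toy[, "score"] <- rowMeans(toy[, c("q1", "q2", "q3")])

# Alternative: average the answers that are available
toy[, "score_available"] <- rowMeans(toy[, c("q1", "q2", "q3")], na.rm = TRUE)

toy
```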

Please perform an internal reliability analysis for the questions of “Contemplation of Hostile Intent” and create the component “com_dt2”.

Please report the standardized alpha levels of “Contemplation of Hostile Intent” for each question if dropped.

If you had to remove an item from “Contemplation of Hostile Intent” which would you choose? What are the pros and cons of removing the item and would you do it?

7 Histogram

Please create a histogram for the speed of the participants.

You can look in the handout of session 1 for an example and the result should look like the image below.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (stat_bin).
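The first message above appears because no bin width was chosen, and the warning reports rows with missing values. A minimal ggplot2 sketch on simulated data (the data frame toy, its column speed and the bin width of 0.25 are made up) shows how setting binwidth avoids the message.

```r
library(ggplot2)

set.seed(1)
# Simulated walking speeds in km/h with two missing values
toy <- data.frame(speed = c(rnorm(100, mean = 4.6, sd = 0.6), NA, NA))

p <- ggplot(toy, aes(x = speed)) +
  geom_histogram(binwidth = 0.25, na.rm = TRUE) +  # na.rm = TRUE drops NAs silently
  labs(x = "Speed (km/h)", y = "Count")

print(p)
```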

8 Scatter plot

Please create a scatter plot of “Alertness to Being Target of Guards” and “Speed Variation” with a regression line.

The “Alertness to Being Target of Guards” measured how much participants had the feeling that the guards would stop them to check their supposed contraband of cocaine or flour.

The “Speed Variation” indicates how much the participants changed their walking speed.

## Warning: Removed 69 rows containing non-finite values (stat_smooth).
## Warning: Removed 69 rows containing missing values (geom_point).
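A scatter plot with a regression line combines a point layer with a smoothing layer. The sketch below uses simulated data with hypothetical columns x and y, so it only illustrates the layering and not the actual result.

```r
library(ggplot2)

set.seed(2)
# Simulated data: y increases with x plus noise
toy <- data.frame(x = runif(50, min = 1, max = 7))
toy$y <- 0.5 + 0.2 * toy$x + rnorm(50, sd = 0.3)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # method = "lm" adds a linear regression line with a confidence band
  geom_smooth(method = "lm", formula = y ~ x)

print(p)
```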

9 Correlation

Please correlate “Alertness to Being Target of Guards” and “Speed Variation”.

What can you say about the relationship between “Alertness to Being Target of Guards” and “Speed Variation”?

##      x    y
## x 1.00 0.32
## y 0.32 1.00
## 
## n
##     x   y
## x 192 184
## y 184 245
## 
## P
##   x  y 
## x     0
## y  0

Tip: you can run ?rcorr to open the help page for this function. Under the heading “Value” you can read what information the function returns.
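The three parts of the output can also be accessed separately. The sketch below runs rcorr() from “Hmisc” on two simulated vectors (x and y are made up) and shows that the function uses only the pairs where both values are present.

```r
library(Hmisc)

set.seed(3)
# Simulated vectors; two values of x are set to missing
x <- rnorm(30)
y <- 0.4 * x + rnorm(30)
x[c(2, 5)] <- NA

result <- rcorr(x, y)

result$r   # correlation coefficients
result$n   # number of observations used for each pair
result$P   # p-values
```

This is also why the n table above shows different numbers for x, y and the pair of both.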

10 Regression

To perform a regression analysis you have to specify the regression formula. You start with the dependent variable and separate it from the predictor with ~. The generalized form of the formula is y ~ x.

To fit the model and save the result we use fit <- lm(y ~ x, data = data) where data is a data frame or matrix with the columns x and y. Finally, we run summary(fit) to show the results of the regression analysis.

For instance, if you want to know whether “Alertness to Being Target of Guards” is a significant predictor for “Speed” then you can run the following code.

fit <- lm(ah_mean_kmh ~ com_sat, data = data)
summary(fit)
## 
## Call:
## lm(formula = ah_mean_kmh ~ com_sat, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2105 -0.3027  0.0068  0.2978  1.6845 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.71264    0.10419  45.229   <2e-16 ***
## com_sat     -0.02920    0.02557  -1.142    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5975 on 182 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.007118,   Adjusted R-squared:  0.001663 
## F-statistic: 1.305 on 1 and 182 DF,  p-value: 0.2548

As you can see in the coefficients table above, the predictor com_sat is not a significant predictor for ah_mean_kmh with p = .255.
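If you need single numbers from the output, for example for reporting, you can extract them from the summary object. The sketch below uses simulated data (the data frame toy with the columns x and y is made up) rather than the experiment data.

```r
set.seed(4)
# Simulated data: y depends on x with a true slope of 0.5
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 0.5 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)

# The coefficients table is a matrix with one row per term
coefs <- summary(fit)$coefficients
coefs["x", "Estimate"]    # estimated slope
coefs["x", "Pr(>|t|)"]    # p-value of the predictor

summary(fit)$r.squared    # proportion of explained variance
```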

For the next step you will perform your own regression analysis with “Suppressed Impulses to Change Movement” as predictor and “Variation Route Deviation” as dependent variable.

“Suppressed Impulses to Change Movement” is the attempt of participants to walk normally in the presence of the guards in order not to give themselves away as smugglers with supposed cocaine. Further, “Variation Route Deviation” is an indicator of how often participants changed their route.

Please perform the regression analysis with “Suppressed Impulses to Change Movement” and “Variation Route Deviation”.

Is “Suppressed Impulses to Change Movement” a significant predictor for “Variation Route Deviation”? Please write your answer in the same manner as you would in the results section of your thesis. However, formatting such as italics is not necessary.