# Datacamp course – introduction to R⤴

from @cullaloe | Tech, tales and imagery

Having abandoned the data visualisation course run by Edinburgh University, and wanting to gain some further competence in R, I took the DataCamp “Introduction to R” course. This course is written by Jonathan Cornelissen, one of the founders of DataCamp and a man with seriously good credentials in R.

## Basics

### Assignment and operators

a <- 4		# assignment 3 ways
4 -> a
a = 4

1 + 2		# mathematical operators
4 - 3
6 * 5
(7 + 9) / 2
8^2		# exponentiation
10 %% 4		# modulo

x < y		# less than
a > c		# greater than
a <= b
j >= k
one == two	# equal to
up != down	# not equal to


### Data types

12.5 / 2.5	# numerics
7 + 123		# integers are also numerics
7 == 3		# Booleans (TRUE or FALSE) are logicals
"Hello world"	# characters

class(x)	# what data type is x?


## Vectors

A vector is a one-dimensional array (think of a row in a spreadsheet). In research, this is a single observation.

# using the combine function to create a vector
a_numeric_vector <- c(1, 2, 3, 4, 5)

# vector elements can be named
names(a_numeric_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# printing the vector outputs the element names:
> a_numeric_vector
Monday   Tuesday Wednesday  Thursday    Friday
1         2         3         4         5
>

# using a vector to hold the element names
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
week_values <- c(1, 2, 3, 4, 5)
names(week_values) <- days_of_week


You can do some quick and easy arithmetic with vectors.

low_nums <- c(1, 2, 3, 4, 5)
hi_nums <- c(6, 7, 8, 9, 10)

total_nums <- low_nums + hi_nums

> total_nums
[1]  7  9 11 13 15

sum(low_nums) 	# adds up the elements in the vector
mean(low_nums)	# average of elements in the vector

low_nums[3]	# print the third low number (note 1-index)
hi_nums[2:4] 	# just get the middle values


The selection of elements can be conditional using boolean values in another vector.

> c(49, 50, 51) > 50
[1] FALSE FALSE TRUE

> nums <- c(1:99)	# vector of the first 99 integers
> fives <- nums %% 5
> nums[fives == 0]	# all of those divisible by 5
[1]  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95


In the last example above, fives == 0 is a vector of boolean values. Used as a selector in the nums vector, only the TRUE elements are selected.

## Matrices

A matrix in R is a collection of elements, all of the same data type, arranged in 2 dimensions of rows and columns.

> # A matrix that contains the numbers 1 up to 9 in 3 rows
> matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
>


The access indicators are shown in the row labels and column headers above. So, element [2,3] of the matrix contains the value 6. The first row of my_matrix is the vector my_matrix[1,]. Row and column names can be set for matrices, as they can be for vectors. This can be done by calling rownames() and colnames(), or at the time the matrix is set up.
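A quick sketch of naming after creation with rownames() and colnames() (the names here are invented for illustration):

```r
# name rows and columns after the matrix is created
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
rownames(my_matrix) <- c("r1", "r2", "r3")
colnames(my_matrix) <- c("c1", "c2", "c3")
my_matrix["r2", "c3"]   # elements can now be selected by name: 6
```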

# Construct star_wars_matrix
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
c("US", "non-US")))


The function cbind() binds columns to an existing matrix. rbind() does the same thing for adding row vectors to a matrix. rowSums() and colSums() do what they sound like - making new vectors ready to be bound into the source matrix if required.
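A minimal sketch of these binding and summing functions together (the matrix values are illustrative):

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: 1 2 3 / 4 5 6
rowSums(m)                             # 6 15
colSums(m)                             # 5 7 9
m2 <- cbind(m, total = rowSums(m))     # bind the row sums on as a new column
m3 <- rbind(m, colSums(m))             # bind the column sums on as a new row
```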

Arithmetic operators work element-wise on matrices.
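For example, a small illustration:

```r
m <- matrix(1:4, nrow = 2)   # columns: 1 2 / 3 4
m * 2    # every element doubled
m + m    # element-wise addition
m * m    # element-wise multiplication (use %*% for true matrix multiplication)
```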

## Factors

A factor is a data type used to store categorical variables. These are discrete variables, which can take only a finite number of values (cf. continuous variables, which can take any of an infinite set of values, like real numbers). R can make a vector of the categories from a vector of categorical values:

> birthdates <- c(12,4,13,23,31,16,1,9,12,4,8,24,27,25,24,25)
> birthdates
[1] 12  4 13 23 31 16  1  9 12  4  8 24 27 25 24 25
> bd_factors <- factor(birthdates)
> bd_factors
[1] 12 4  13 23 31 16 1  9  12 4  8  24 27 25 24 25
Levels: 1 4 8 9 12 13 16 23 24 25 27 31
>


Such variables are nominal or ordinal according to whether they are just names, or can be ranked in some meaningful way. Ordinal factors are created with additional parameters, e.g. ordered = TRUE and levels = c("low", "high"), and can then be compared easily.
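A short sketch of an ordered factor, using made-up category labels:

```r
# an ordinal factor: levels supplies the ranking, ordered enables comparison
temperature <- factor(c("low", "high", "medium"),
                      ordered = TRUE,
                      levels = c("low", "medium", "high"))
temperature[1] < temperature[2]   # TRUE: "low" ranks below "high"
```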

## Data Frames

A data frame has the variables of a data set as columns and the observations as rows. A quick peek at the structure of a data frame is provided by head() and tail() functions, e.g.:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>


mtcars is one of the many data sets built into R. A list of them is obtained by calling data(). str() provides a look at the structure of a data set:

> str(mtcars)
'data.frame':	32 obs. of  11 variables:
$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
$ disp: num  160 160 108 258 360 ...
$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num  16.5 17 18.6 19.4 17 ...
$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
$ carb: num  4 4 1 1 2 1 4 2 2 4 ...
>


Columns of a data frame are added one column vector at a time as a list of parameters in the function call data.frame(). Selecting a data point from row 32, column 2 is a matter of calling df_bears[32,2]. Note the order: observation (row) first, then variable (column). A whole observation (e.g. the tenth) is obtained by df_bears[10,]. The first 4 data points from the paw_size column are df_bears[1:4,"paw_size"]. The whole column vector is df_bears$paw_size (notice the dollar-sign notation). Subsets can be made by calling subset(df_bears, paw_size < 4). Sorting can be achieved by making a vector of the data frame order, based upon the columns you are interested in:

> a <- order(df_bears$claw_size)
> df_bears[a,]
>
> mtcars[order(mtcars$disp),]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
...


## Lists

Lists can contain arbitrary data and data types. They are constructed by calling list() with optional names for each component, e.g. list(top_dogs = df_dogs[1:10,], top_cats = df_cats[1:10,]).
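Components can then be pulled out by name or by position; a quick sketch with invented data:

```r
# a list mixing a character component and a numeric vector
my_list <- list(title = "scores", values = c(7, 9, 4))
my_list$title         # by name, dollar-sign notation
my_list[["values"]]   # by name, double-bracket notation
my_list[[2]]          # by position
```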

## Evaluation and next steps

I’ve found this introduction differently paced from the earlier introduction course run by the university. Because there is no instructor, attention has been paid to very small details: every aspect of the course works because it is programmatic. Learners have to take the right (small) steps to complete the exercises successfully. Errors are picked up and RTFQ-type prompts are given. This was less challenging than the earlier, demonstrator-led course, but I completed this one instead of bailing out feeling frustrated and weak. I also learned considerably more that is useful and gained a more secure foundation for further study.

I am working with RStudio on a daily basis now as I am producing documentation and course materials with Bookdown. My intention is to further develop competence with R and R-markdown.

# Deploying a Bookdown site securely⤴


I have been writing documentation for a project in markdown using RStudio, which provides a nice way of packaging it all as a static (html) website. I wanted to share this work with colleagues securely.

## Writing workflow

The documents exist within an RStudio project and are built to a folder containing static files. That folder is by default _book, but I change this to docs to make it easy to deploy as a github site if I wish1. Configuration management is a crucial element to proper productivity, not just in software but also in all walks of life where documentation is important. Because of this, I use github to store my work safely, should I lose a laptop or suffer some other first-world calamity. It’s one of the reasons I use markdown when writing: configuration management is well-suited to text-based documents because it is easy to track and manage changes.

Although I keep the source files on github, I haven’t published this project to github pages because it should not be publicly available: instead, I deploy to a VPS (Centos/Apache/Plesk), putting it all behind a login.

## The domain

I set up a specific domain static.cullaloe.net for this project, and secured it with an SSL certificate.

## The files

Clone the GitHub repository into a new folder somewhere behind the web-facing directory (i.e. not in httpdocs). In this example, both the repository and the local folder are called “foobar”:

$ git clone https://github.com/githubuser/foobar.git /var/www/vhosts/[domain]

It is not necessary to specify the target directory: you’ll get the repository name by default. It is not possible2 to selectively clone a github project: it’s all or nothing. /var/www/vhosts/[domain]/foobar now contains all of the source files of the project.

## Permissions

You need to create a .htpasswd file on the server somewhere, containing the username and password of each user you wish to grant access to your files:

$ /path/to/htpasswd -c /var/www/vhosts/[domain]/.htpasswd user1


This prompts you for the password you wish to set for this user. Adding another user is the same command without the -c option.

## The server

You need to tell Apache, using Alias, where to find the files and, with <Location>, control who can access files at the URL you are trying to protect. In the Plesk control panel, this goes in Apache & nginx Settings for static.cullaloe.net:

Alias /foobar /var/www/vhosts/[domain]/foobar/docs
<Location /foobar>
AuthType Basic
AuthName "Restricted access"
AuthUserFile /var/www/vhosts/[domain]/.htpasswd
Require user user1
</Location>


## Outcome

I can easily continue to work on my project documentation, updating it from time to time for colleagues who are interested in seeing what I’m doing. I make (neurotic) use of github for configuration management and safekeeping of all my hard work anyway, so updating the site just requires $ git pull from the repository folder on the web server. Colleagues can then view the documentation in a browser, or download a pdf or docx that is up-to-date with my current progress.

## Notes

1. In bookdown.yml, add the line output_dir: "docs"

2. As far as I know, anyway.

# Statistics and Visualisation with R #3⤴

Post number three of my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh in June and July.

Third week

## Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 16 June 13:00-14:30 Hypothesis testings and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

## Class 5 - Data Collection Bias, Probability, and Distribution

After a quick review of the challenge from last week, an agenda for today was shared which wasn’t really followed1.

### Data Collection and bias

“Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. The goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the questions that have been posed.” (Wikipedia)

The sample is taken from the population, but a bias is introduced when the sample is taken. One example of this is the survivorship bias.

### Probability

A basic definition was given to the class, with an example from an earlier class that helped us. If we make the statement if **A** is true then **B** is true, can we infer from this that if **B** is true then **A** is true? Clearly not, but we might be able to say if **B** is true then **A** is more plausible. Plausibility here relates to how probable something can be.

“Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true.” - Wikipedia

### Distribution

In introducing this topic, we were shown the really neat rnorm() function that generates random data sets with a specified mean and standard deviation. In the class, we were given a challenge to go through the steps needed with a new data set to examine it. These steps seem to be just to plot it and have a look. All of the curves looked to be distributed around a central mean, which makes them normal distributions. There was a problem with the challenge files which meant that some of us didn’t have the correct files and so did a different (trivial) task. The key message is to examine your data as soon as you have it, to look for central values and distribution, so as to understand it and how you expect it to behave in analysis. Other distributions include uniform, logarithmic, left- and right-skewed, and bimodal.

## Class 6 - Hypothesis testing and the main tests

Just half a dozen students today. We started with a review of the mini-challenge from last session, which included a consideration of colours: HTML colour and a ggplot colour reference were shared.
One emphasis today is that there are multiple ways of doing the same thing in R, with an illustration of data importation and summarisation. Although some of the few remaining students seemed to be following the narrative, I became increasingly frustrated and lost trying to follow the class: the thread was along the lines of hypothesis testing, the null hypothesis and corrections.

## Playtime discoveries

Between this week’s two classes, I had a quick look at the Comprehensive R Archive Network (CRAN) and discovered a neat thing. Remember the funky assignment operator <- from the first class, which had its funky keyboard shortcut alt -? Most programming languages I’ve used have a simple equals sign as the assignment operator and I was a bit baffled as to why super-sexy R should do something so weird as <-. Well, it turns out that in most contexts, = works in exactly the way you’d expect, so these are equivalent:

> x <- 99
> x = 99

What’s interesting about the R syntax, though, is that it can be reversed and it still works:

> 99 -> x

I’m sure that the case for this will become apparent eventually. One more way to do assignments is to call the assignment function:

> assign('x', 99) # notice the weird need for quotes around x in the function call

## Next steps: other ways to learn R

I got totally overwhelmed by the rate of new concepts in this class, or perhaps the pace and style of delivery. Given the time I have to spend after each class going over the material in order to make sense of it, it is clear that this course is not right for me. I have decided to abandon it and follow other materials more appropriate to my purpose, which is to develop skills in data visualisation using R. Despite the title, this course is not doing that for me. Having decided2 to seek out other ways to learn how to R, I found that there are lots of self-directed learning resources out there, of course.
An introduction to R course at DataCamp will be the new direction of travel down this rabbit hole for me.

## Notes and references

1. I seem to be not the only one amongst the diminishing cohort (we’re down to a dozen today (Monday), having started closer to 50) who is finding it hard to follow this course. Yes, it’s interesting and cool, the different things you can do with R and geometries, but I am missing clarity, signposting, structure and other basic pedagogical features.

2. Taking time after class 5 to review the presentation slides, it was possible to reconstruct perhaps what the instructor was intending, for example, in the two slides on probability and more generally with this class today. I might have got very little from today’s session without spending as long again going over the materials. I am perhaps going to seek out other resources.

# Statistics and Visualisation with R #2⤴

This is the second in a series of posts containing my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh. The course is running two days a week in June and July.
Second week

## Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 18 June 13:00-14:30 Hypothesis testings and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

## A bit of homework

### Errors

I had the chance to play with the IDE and learn a few things by invoking errors.

> ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()
Error in ggplot(college, aes(x = sat_avg, y = admission_rate)) :
could not find function "ggplot"

The above occurs if you just run that line of code without first loading the library in which ggplot lives, i.e. by calling library("tidyverse") first.

> ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()
Error in ggplot(college, aes(x = sat_avg, y = admission_rate)) :
object 'college' not found

This error is thrown because the data object ‘college’ has not been created. Do this by loading the data first, i.e. call college <- read_csv('http://672258.youcanlearnit.net/college.csv').

### Mutate and as.factor

Last week’s class included the creation of the dataset using this call:

college <- college %>%
mutate(state=as.factor(state), region=as.factor(region),
highest_degree=as.factor(highest_degree),
control=as.factor(control), gender=as.factor(gender),
loan_default_rate=as.numeric(loan_default_rate))

The mutate() function adds new variable columns to the dataset, or replaces existing ones if the same name is used. You can see that state is replaced by a different version of itself: the as.factor() function takes data and turns it into a factor if it isn’t already. “Factor” is a term that indicates that the data is a category or enumerated type, rather than just a set of strings. To see this, consider:

> var=letters[1:5]
> var
[1] "a" "b" "c" "d" "e"
> var=as.factor(var)
> var
[1] a b c d e
Levels: a b c d e

var is created as a vector, then converted to a factor in the as.factor call.

## Class 3 - Intro to statistics and descriptive stats

• Discuss the results of the challenge
• Intro to Statistics
• Descriptive Statistics
• Summarising Statistics
• Exploring 1 variable: Plotting distribution Histograms and Density plot

### The results of the challenge

Last week’s challenge was discussed, focusing on the meaning of the plots obtained, and how to add a best-fit line using the +geom_smooth() geometry in the ggplot command. We played a bit with a tool to help us get better at “seeing” the correlation of plotted data: guessthecorrelation.

### Intro to Statistics

This part of the class was further levelling by going over the basics of statistics and how they may be used to summarise or infer information. Central tendency includes arithmetic mean, median and mode values. Measures of dispersion like variance and standard deviation were also discussed.

Standard deviation: $\sigma = \sqrt {\frac{1}{N} \sum\limits ^N _{i=1}(x_i - \mu)^2}$

These functions were illustrated in the R IDE, including accessing columns within datasets using the dollar sign like this:

mean(iris$Petal.Width)	# using mean formula


Similar functions offer median, variance and sd (standard deviation) calculations.
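Using the same built-in iris column as above, a quick sketch:

```r
median(iris$Petal.Width)   # middle value
var(iris$Petal.Width)      # variance
sd(iris$Petal.Width)       # standard deviation, i.e. sqrt(var(...))
```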

### Visualise the distribution: Histogram and Density plots

There is a nice cheat sheet for ggplot data visualisation tips.

Subsets of the data based upon some category can be easily made:

virginica <- subset(iris, Species=="virginica")


A table of data around these subsets can be quickly constructed and plotted:

#Values of Virginica
mean(virginica$Petal.Width)
median(virginica$Petal.Width)
var(virginica$Petal.Width)
sd(virginica$Petal.Width)

#... etc for the other two

# Make a new table...
Species <- c("setosa","versicolor", "virginica")

Mean <- c(mean(setosa$Petal.Width), mean(versicolor$Petal.Width), mean(virginica$Petal.Width))
Median <- c(median(setosa$Petal.Width), median(versicolor$Petal.Width), median(virginica$Petal.Width))
Variance <- c(round(var(setosa$Petal.Width), digits = 2), round(var(versicolor$Petal.Width), digits = 2), round(var(virginica$Petal.Width), digits = 2))
SD <- c(round(sd(setosa$Petal.Width), digits = 2), round(sd(versicolor$Petal.Width), digits = 2), round(sd(virginica$Petal.Width), digits = 2))

# Make the data frame for plotting...
FullPlot <- as.data.frame(cbind(Species, Mean, Median, Variance, SD))


Notice that you can double-click on the FullPlot table in the environment box (equivalent to the command View(FullPlot)); it will display the table in a new tab for inspection.

# ... and now plot it in a nice histogram

ggplot(iris, aes(x=Petal.Width, fill=Species))+
geom_histogram(alpha=0.8,color="black", binwidth=0.08)+
geom_vline(aes(xintercept = mean(Petal.Width)),col='red',size=2)+
theme_bw()+facet_wrap(~Species, ncol = 1)



Here’s how that looks:

### The next challenge, number 2

Once again, I got nowhere with the challenge in the 10 minutes we had to do it. I spent most of that time, after loading the dataset, trying to figure out how to make a scatter plot. When we reviewed the problem at the end of the class, it became apparent that the question was looking for a histogram, not a scatter plot. Getting stuck here meant that I didn’t know how to approach the second part of the challenge, and didn’t progress to reading the third.

## Class 4 - Boxplot and playing with Colours

• Discuss Challenge 2
• Boxplots
• Playing with colours
• Exporting graphs

### Discussing Challenge 2 and feeling lost

No time to spend on this before the next class, so I just sat back and tried to follow the discussion1.

### Boxplots

• The thick black line is the median.
• The boxes represent 50% of the sample closest to the median
• The whiskers correspond to 95% of the sample closest to the median aka 2SD from the median
• The dots represent the outliers

In R, boxplots are another geometry2:

ggplot(iris, aes(x=Species, y=Petal.Width, fill=Species)) + geom_boxplot()


### Playing with colours

There are a number of colour themes within R to allow you to make beautiful and readable plots and graphs. What is important, as with all open source, is that when using different packages and utilities, you must RTFM3.

These are invoked with colour commands and palettes, like scale_color_manual here:

ggplot(iris, aes(x=Species, y=Petal.Width, color=Species)) +
geom_boxplot(outlier.alpha = 0)+
scale_color_manual(values = wes_palette("GrandBudapest2", n=4))

# other color methods:
scale_color_grey(start = 0.6, end = 0.1) # grayscale
scale_color_manual(values=c("#80ec65", "#145200", "#700015")) # manually defined


### Exporting graphs

Pretty self-explanatory from the IDE: exporting graphs to pdf or png is available from the dropdown in the plots tab.

### Challenge number 3

Well, today I managed to get a result during the time allocated for us to work on the problem, which made up for how I had been feeling earlier. With a little tweaking after class, I settled on this:

ggplot(college, aes(x=region, y=faculty_salary_avg, fill = region)) +
geom_boxplot(outlier.alpha = 0.0, alpha = 0.8) +
geom_jitter(alpha =0.2)+
theme_bw()+
facet_wrap(~control, ncol = 2)+
labs(title = "USA University",
subtitle = "Faculty salary by region",
x = "Region",
y = "Average salary")


which yielded this:

## Next class

• Data Collection
• Datasets biases
• Inferential statistics
• Probability

## Notes and references

1. This shook my confidence a little, which took me back to my school days, the last time that I was so lost in a class that I wasn’t even aware when the teacher was asking a question. All you can do in this situation is wait for the next section and try to be invisible. Part of the difficulty I am having is the heavy accent of the instructor, which causes me to miss signposts and words. It made me think of all the EAL kids in Scottish schools.

2. There is a complete catalogue of the graphs available in R at r-graph-gallery.com

# Statistics and Visualisation with R⤴


This is the first in a series of posts containing my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh. The course is running two days a week in June and July.

First week

## The learning environment

The course was presented in a blended form using Collaborate, Blackboard Learn and Slack. These represent three key elements in a remote learning environment:

• Synchronous, whole-class activity, usually instructor-led, but may include “break-out” sessions for students to work together in groups.
• A resources repository for files, course outlines, presenter’s notes and assignments.
• A back-channel for students to interact in, and for interaction with the course leader. This may be synchronous and asynchronous, but Slack is only used in this course for offline use. Collaborate chat is used in the live sessions.

The offline chatter and forum space is useful in courses like this, which is why Slack is such a useful tool here. Microsoft teams has been suggested as an equivalent to Slack but I really don’t find it intuitive or useful in the same way as Slack. The UI, as with all Microsoft products I have ever used, is difficult and intrusive.

## Preparation

The pre-course task is to install R and the R-Studio IDE. I painlessly installed on my Mac R 4.0.0 “Arbor Day” and R-Studio 1.3.959, and downloaded the data files for the first week.

## Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 18 June 13:00-14:30 Hypothesis testings and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

## Class 1 - Intro to R and R studio

This began with an introduction on The Edinburgh Centre for Data, Culture & Society (CDCS) and their other courses and research projects, followed by an introduction to our course leader, Lucia, and a walk around the VLE folders and files. Then, an overview of the first session:

• Intro to Quantitative methods in Research
• The R and R studio Interface
• How to organise your work in R efficiently
• How to install packages

This is fairly self-explanatory but an essential levelling for all delegates on the course, who come from a range of backgrounds. Most seemed to be researchers.

### R and R-Studio

R is the language, and R-Studio is a graphical IDE for working with projects and data using R. Lucia said that the key skill in acquiring a new language is knowing how to Google: R is widely used and has a large user base and the forums are very helpful in quickly overcoming problems.

### Scientific method

We heard about the scientific method and in particular the issue of correlation vs. causation. This, in context of remaining close to the process, and the meaning of the data, whilst engaging deeply with the IDE.

• Define a research question
• Explore the evidence
• Define working hypotheses
• Translate hypotheses → statistical models
• Compare models to evidence
• Interpret results

### Getting our hands dirty

Finally we got to open R-Studio and follow Lucia through opening the script for today and take a walk around the IDE. We started with creating a new project and setting up a folder organisation for the course, then played with settings to make the IDE more comfortable for us.

Comments and header syntax were explained, then an immediate mode of running commands in the console (by pressing cmd-enter) was demonstrated. Outputs are presented as vectors, indexed from 1.

# THIS IS A LEVEL 1 HEADER #################################

## This Is a Level 2 Header ================================

### This is a level 3 header. ------------------------------

print('Hello World!')

## Define Variables ================
x <- 1:5 #Put the numbers between 1-5 in the variable x (alt- is a short code for <- in the IDE)
x #Displays the values we have set in x


Getting help is a matter of issuing a query command: e.g. ?datasets displays help on the Datasets package.

### Packages

Packages are installed and activated thus:

install.packages("ggplot2")
?install.packages  # Get help on installing packages

## Activate packages ==========================
library(ggplot2)


We ended the session by installing something called the “Tidyverse”, which seems to be a collection of useful packages. It calls itself an “opinionated collection of R packages designed for data science”.

## Class 2 - Types of data and Grammar of graphs

• Type of Data
• How to choose the right Graph
• Graph types
• The Grammar of Graphs
• Ggplot2 structure
• Graph settings

### Data types

Type Description
Vector One dimensional collection of data, like a column or row, all of the same type
Matrix Two-dimensional collection of data, all of the same type
Array Multi-dimensional collection of data, all of the same type
Data frame Mixed type variables (similar to a table in spreadsheet)

Reading a table by convention has variables in columns, and the observations or data values are in the rows. Vectors, arrays and matrices in R all must contain the same data type, and a data frame is the only construct that allows mixed types.
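A minimal illustration of a mixed-type data frame (the column names here are invented):

```r
# three variables, three different types, in one data frame
df <- data.frame(
  site  = c("A", "B", "C"),        # character
  count = c(12L, 7L, 31L),         # integer
  seen  = c(TRUE, FALSE, TRUE)     # logical
)
str(df)   # shows one row per variable with its type
```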

### Keeping the environment clean

# Clear environment
rm(list = ls())

# Clear console
cat("\014")  # ctrl+L


df <- read_csv("data/RodentSimplified.csv")

# ... same, with selection and mutation of month column name:

df <- read_csv("data/RodentSimplified.csv") %>%
select(mo,dy,yr,period,species) %>%
mutate(period = as.factor(period)) %>%
rename(month = mo) %>%
print()


What looks like a continuation line in the above code block is called a “pipe”, written %>%. A pipe is much more than just a way of making the code look pretty: it really does work as a pipe, transferring a value between subsequent function calls.
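The equivalence can be sketched like this (the pipe passes its left-hand value in as the first argument of the next call):

```r
library(magrittr)   # provides %>%; it is also loaded as part of the tidyverse

x <- c(1, 4, 9)
sqrt(x)             # the ordinary call
x %>% sqrt()        # the same call written as a pipe
c(1, 4, 9) %>% sqrt() %>% sum()   # chains read left to right: 1 + 2 + 3 = 6
```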

### Graph types

After some warnings about selecting data presentation that is consistent with your intention to actually communicate something, and not to show off, Lucia introduced us to some of the graph types available in R. She gave us a really nice demonstration of why many journals prefer data presentation (e.g. ratios) as a bar chart rather than a pie chart: differences are sometimes clearer in the bar chart. Apparently, Florence Nightingale invented the pie chart!

### The Grammar of graphics

The ggplot function allows the creation of graphs according to the framework outlined by Wilkinson1 in which the separate aspects of data presentation are dealt with separately. A basic template is described here with examples for ggplot. My take on the framework is:

1. a data set
2. aesthetics = a set of scales and variables to be plotted
3. geometry = shapes to represent the data and the type of chart
4. a grid of subplots, or facets
5. statistics = summaries of the data
6. coordinates to define the plotting space
7. global themes and labels to make it look pretty

Some of these map into the ggplot function, from the most basic (using examples from the weekly challenge, below):

ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()


Here, the aes part is the aesthetics definition of the variables for the plot; geom_point defines the type of chart (scatter plot), and college is the data set.

### The weekly challenge

I completely failed to get anything at all out of the challenge in the 5 minutes we were given for this task. My code just didn’t produce any kind of plot, just barfed errors at me. This was entirely down to my inability to type or read carefully what I had typed. Later, I managed to make some progress:

# Import CSV files from online repo
college <- college %>%
mutate(state=as.factor(state), region=as.factor(region),
highest_degree=as.factor(highest_degree),
control=as.factor(control), gender=as.factor(gender),
loan_default_rate=as.numeric(loan_default_rate))
# Create a chart that would show the relationship between the SAT average and the admission rate


Time is not in abundance at the moment, so I did eventually just look at the solution and run it, seeing how each element of the code contributed to the final result. Here’s my result:

## Next class

• Discuss the results of the challenge
• Intro to Statistics
• Descriptive Statistics
• Summarising Statistics
• Exploring 1 variable: Plotting distribution Histograms and Density plot

## Notes and references

1. Wilkinson, L. (2005) The Grammar of Graphics, The Grammar of Graphics. Springer-Verlag. doi: 10.1007/0-387-28695-0.