Tag Archives: data

wwwd – John's World Wide Wall Display 2020-10-12 11:50:36⤴

from @ wwwd – John's World Wide Wall Display

Reposted a tweet (Twitter)
DOING DATA DIFFERENTLY We're launching the virtual exhibition from this research project between 16:30 & 17:30 on Wed 11th Nov. online. Of interest to colleagues (esp. sr. leaders) interested in literacy in primary schools Registration (free) http://bit.ly/DDDExhibitionLaunch… - join us!

I’ve registered. Really interesting way of gathering information about primary teaching.


Merge join data files on 2 columns with python⤴

from @ @cullaloe | Tech, tales and imagery

This was posted on a forum:

I have two enormous data sets - 2 million rows in each one. I have them in ASCII format. Each set has three columns. The first two columns are identical for both sets - essentially, coordinates. The third column in each set gives the temperature at that location for two different substances.

I am trying to find a way to create a single table with 4 columns, the first two being the coordinates and the third and fourth being the different temperatures.

Excel gives up after 1,000,000 rows.

Can anyone suggest a (free) tool that can do this - and then preferably allow for some analysis - plotting temp 1 against temp 2, for instance.

I thought about this on and off during one of those challenging COVID days, and finally sat down in the evening and knocked up a quick solution to the first part of the problem, using python.

Files 1.csv and 2.csv contain the data:

$ cat 1.csv 
"x","y","t1"
42,35,122
39,44,242
12,43,188

$ cat 2.csv 
"x","y","t2"
53,22,192
39,44,122
22,56,238

Launching python3 [1]:

$ python3
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> csv1 = pd.read_csv("1.csv")
>>> csv2 = pd.read_csv("2.csv")

(check it looks ok)

>>> csv1.head()
    x   y   t1
0  42  35  122
1  39  44  242
2  12  43  188
>>> csv2.head()
    x   y   t2
0  53  22  192
1  39  44  122
2  22  56  238

Make an outer join of these two tables on the two coordinate columns:

>>> merged_data = csv1.merge(csv2,on=["x","y"],how='outer')

>>> merged_data.head()
    x   y     t1     t2
0  42  35  122.0    NaN
1  39  44  242.0  122.0
2  12  43  188.0    NaN
3  53  22    NaN  192.0
4  22  56    NaN  238.0

“NaN” is not your Nan, it’s just “not a number”, or “no data here”.

>>> merged_data.to_csv("out.csv",index=False)

… and back in the shell, we can see the result…

$ cat out.csv 
x,y,t1,t2
42,35,122.0,
39,44,242.0,122.0
12,43,188.0,
53,22,,192.0
22,56,,238.0

… your merged file, sir.

It ought to work with a few million rows. I didn’t answer the final part of the OP’s question, but plotting one temperature against the other ought to be easy enough starting from this example; at a million or more rows, that might be a different problem.

Footnotes

  1. I am using python3 because OSX has a legacy version 2 nobody dares touch. On other systems you might want to use just “python” and “pip” in the above examples. You might have to install pandas first, using $ pip3 install pandas

Datacamp course – intermediate R⤴

from @ @cullaloe | Tech, tales and imagery

Continuing my journey into R, the next course in the R programming track at DataCamp is Intermediate R. This course is presented by Filip Schouwenaars. It teaches language syntax and programming conventions, building on the last course.

Conditionals and Control Flow

Relational operators

This section begins with a talk-through the main relational operators in R, with simple examples, followed by exercises in the virtual lab.

> TRUE == TRUE			# Equality
[1] TRUE
> 'oranges' != 'apples'	# Inequality
[1] TRUE
> 'oranges' > 'apples'	# Strings compare alphabetically
[1] TRUE
> 'oranges' < 'apples'
[1] FALSE
> vec <- c('apples', 'bananas', 'dragon fruit', 'tomato')
> vec > 'oranges'		# Works on vectors (and matrices)
[1] FALSE FALSE FALSE  TRUE

TRUE coerces to the value 1, and FALSE to 0, so truth is greater!
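
A quick check in the console confirms the coercion:

> TRUE + TRUE		# coerced to 1 + 1
[1] 2
> TRUE > FALSE		# i.e. 1 > 0
[1] TRUE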

Logical operators

Syntax for these familiar operators is &, | and !, for logical AND, OR and NOT, respectively. The relational operators bind more tightly than & and |, so comparisons do not need brackets around them:

> 4 > 3 & 8 <= 9
[1] TRUE

Logical operators may be used on matrices and vectors:

> !c(TRUE, FALSE, 1 > 0)
[1] FALSE  TRUE FALSE

Note that double-signed operators like && work only on the first element of a vector.
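
A sketch of the difference (at the time of writing && quietly took just the first elements; recent versions of R treat longer vectors as an error):

> c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)	# element-wise
[1]  TRUE FALSE FALSE
> c(TRUE, FALSE, TRUE) && c(TRUE, TRUE, FALSE)	# first elements only
[1] TRUE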

Conditional statements

Again, familiar syntax here, with the conditional test in brackets; code blocks in curly braces; and two statement words, if and else:

x <- 0
if (x < 0) {
  print('x is negative')
} else if (x == 0) {
  print('x is zero')
} else {
  print('x is positive')
}

Notice that the else and else if statements come on the same line as the closing curly brace of the associated if block. Once a conditional test evaluates TRUE, the corresponding code block is executed and the rest of the if structure is skipped. Conditional statements may be nested, as in the sketch below.
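
For instance, a nested if inside the first branch (a made-up example):

x <- 6
if (x %% 2 == 0) {
  if (x > 4) {
    print('x is even and greater than 4')
  }
} else {
  print('x is odd')
}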

Evaluation and next steps

There is a greater teacher presence in this course than in the previous one, through the use of video presentations to support the hands-on interactive labs.

Thus far into the R Programming Track with Datacamp, I have stopped because I have hit an unexpected paywall. Continuing requires a commitment of at least $25 per month, which would be good value if I were doing courses several hours per day, but not for my current ad-hoc engagement. The day job takes priority, which means it takes most of the available time. I’ll be switching to other resources from now on, probably starting with R for Data Science [1], or at least the online version.

References

  1. Grolemund, G. and Wickham, H. (2016) R for Data Science. O’Reilly Media.

Datacamp course – introduction to R⤴

from @ @cullaloe | Tech, tales and imagery

Having abandoned the data visualisation course run by Edinburgh University, and wanting to gain some further competence in R, I took the DataCamp “Introduction to R” course. This course is written by Jonathan Cornelissen, one of the founders of DataCamp and a man with seriously good credentials in R.

Basics

Assignment and operators

a <- 4		# assignment 3 ways
4 -> a
a = 4

1 + 2		# mathematical operators
4 - 3 
6 * 5
(7 + 9) / 2 
8^2		# exponentiation
10 %% 4		# modulo

x < y		# less than
a > c		# greater than
a <= b 
j >= k 
one == two	# equal to
up != down	# not equal to

Data types

12.5 / 2.5	# numerics
7 + 123		# integers are also numerics
7 == 3		# Booleans (TRUE or FALSE) are logicals
"Hello world"	# characters

class(x)	# what data type is x?

Vectors

A vector is a one-dimensional array (think of a row in a spreadsheet). In research, this is a single observation.

# using the combine function to create a vector
a_numeric_vector <- c(1, 2, 3, 4, 5)

# vectors can have column names
names(a_numeric_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# printing the vector outputs the element names:
> a_numeric_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
        1         2         3         4         5
> 

# using a vector to hold the column names
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(a_numeric_vector) <- days_of_week

You can do some quick and easy arithmetic with vectors.

low_nums <- c(1, 2, 3, 4, 5)
hi_nums <- c(6, 7, 8, 9, 10)

total_nums <- low_nums + hi_nums

> total_nums
[1]  7  9 11 13 15

sum(low_nums) 	# adds up the elements in the vector
mean(low_nums)	# average of elements in the vector

low_nums[3]	# print the third low number (note 1-index)
hi_nums[2:4] 	# just get the middle values

The selection of elements can be conditional using boolean values in another vector.

> c(49, 50, 51) > 50
[1] FALSE FALSE TRUE

> nums <- c(1:99)	# vector of the first 99 integers
> fives <- nums %% 5
> nums[fives == 0]	# all of those divisible by 5
 [1]  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95

In the last example above, fives == 0 is a vector of boolean values. Used as a selector on the nums vector, it picks out only the elements where the value is TRUE.

Matrices

A matrix in R is a collection of elements, all of the same data type, arranged in 2 dimensions of rows and columns.

> # A matrix that contains the numbers 1 up to 9 in 3 rows
> matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> 

The access indicators are shown in the row labels and column headers above. So, element [2,3] of the matrix contains the value 6. The first row of my_matrix is the vector my_matrix[1,]. Row and column names can be set for matrices, as they can be for vectors. This can be done by calling rownames() and colnames(), or at the time the matrix is set up.

# Construct star_wars_matrix
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                           c("US", "non-US")))

The function cbind() binds columns to an existing matrix. rbind() does the same thing for adding row vectors to a matrix. rowSums() and colSums() do what they sound like - making new vectors ready to be bound into the source matrix if required.

Arithmetic operators work element-wise on matrices.
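
A short sketch pulling these together, building on star_wars_matrix above (the ticket price is an invented figure):

worldwide <- rowSums(star_wars_matrix)			# one total per film
all_wars_matrix <- cbind(star_wars_matrix, worldwide)	# bound on as a new column
star_wars_matrix / 5					# element-wise: rough visitor numbers at $5 a ticket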

Factors

A factor is a data type used to store categorical variables. These are discrete variables, which can take only a finite number of values (cf. continuous variables, which can take any of an infinite set of values, like the real numbers). R can make a vector of the categories from a vector of categorical values:

> birthdates <- c(12,4,13,23,31,16,1,9,12,4,8,24,27,25,24,25)
> birthdates
 [1] 12  4 13 23 31 16  1  9 12  4  8 24 27 25 24 25
> bd_factors <- factor(birthdates)
> bd_factors
 [1] 12 4  13 23 31 16 1  9  12 4  8  24 27 25 24 25
Levels: 1 4 8 9 12 13 16 23 24 25 27 31
> 

Such variables are nominal or ordinal according to whether they are just names, or whether they can be ranked in some meaningful way. Ordinal factors are created with additional parameters, e.g. ordered = TRUE and levels = c("low", "high"), and can then be compared easily.
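
For example (values invented):

speed <- c("low", "high", "low", "high")
speed_f <- factor(speed, ordered = TRUE, levels = c("low", "high"))
speed_f[2] > speed_f[1]		# TRUE: "high" ranks above "low"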

Data Frames

A data frame has the variables of a data set as columns and the observations as rows. A quick peek at the structure of a data frame is provided by head() and tail() functions, e.g.:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>

mtcars is one of the many data sets built into R. A list of them is obtained by calling data(). str() provides a look at the structure of a data set:

> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> 

A data frame is built by passing its column vectors as a list of parameters to the function data.frame(). Selecting the data point at row 32, column 2 is a matter of calling df_bears[32,2]. Note the order - observation (row) first, then variable (column). A whole observation (e.g. the tenth) is obtained by df_bears[10,]. The first 4 data points from the paw_size column are df_bears[1:4,"paw_size"]. The whole column vector is df_bears$paw_size (notice the dollar sign notation). Subsets can be made by calling subset(df_bears, paw_size < 4). Sorting can be achieved by making a vector of the data frame’s order, based upon the columns you are interested in; a runnable sketch with made-up values follows the example below:

> a <- order(df_bears$claw_size)
> df_bears[a,]
> 
> mtcars[order(mtcars$disp),]
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
...
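
df_bears itself is hypothetical; a small made-up version lets the selections above be tried out:

# invented data for illustration
df_bears <- data.frame(paw_size  = c(3.1, 4.6, 2.8, 5.0),
                       claw_size = c(1.2, 2.1, 0.9, 2.4))
df_bears[2, 1]				# observation 2, variable 1
df_bears$paw_size			# a whole column as a vector
subset(df_bears, paw_size < 4)		# observations matching a condition
df_bears[order(df_bears$claw_size), ]	# sorted by claw size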

Lists

Lists can contain arbitrary data and data types. They are constructed by calling list() with optional names for each component, e.g. list(top_dogs = df_dogs[1:10,], top_cats = df_cats[1:10,]).
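
A minimal sketch, with invented values, showing construction and the two common ways of pulling a component out:

shopping <- list(store = "corner shop",
                 items = c("milk", "bread"),
                 total = 3.50)
shopping$store		# select a component by name
shopping[["items"]]	# ... or by name (or index) with double brackets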

Evaluation and next steps

I’ve found this introduction paced quite differently from the earlier introductory course run by the university. Because there is no instructor, attention has been paid to very small details: every aspect of the course works because it is programmatic. Learners have to take the right (small) steps to complete the exercises successfully. Errors are picked up and RTFQ-type prompts are given. This was less challenging than the earlier, demonstrator-led course, but I completed this one instead of bailing out feeling frustrated and weak. I also learned considerably more that is useful and gained a more secure foundation for further study.

I am working with RStudio on a daily basis now as I am producing documentation and course materials with Bookdown. My intention is to further develop competence with R and R-markdown.

Statistics and Visualisation with R #3⤴

from @ @cullaloe | Tech, tales and imagery

Post number three of my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh in June and July.

Third week

Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 16 June 13:00-14:30 Hypothesis testing and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

Class 5 - Data Collection Bias, Probability, and Distribution

After a quick review of the challenge from last week, an agenda for today was shared, which wasn’t really followed [1].

Data Collection and bias

“Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. The goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the questions that have been posed.” (Wikipedia)

The sample is taken from the population, but a bias is introduced when the sample is taken. One example of this is the survivorship bias.

Probability

A basic definition was given to the class, with an example from an earlier class that helped us. If we make the statement “if A is true then B is true”, can we infer from it that if B is true then A is true? Clearly not, but we might be able to say that if B is true then A is more plausible. Plausibility here relates to how probable something is.

“Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true.” - Wikipedia

Distribution

In introducing this topic, we were shown the really neat rnorm() function that generates random data sets with specified mean and standard deviation.
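
For example (figures invented), a thousand simulated values with mean 100 and standard deviation 15:

scores <- rnorm(1000, mean = 100, sd = 15)
mean(scores)	# close to 100
sd(scores)	# close to 15
hist(scores)	# roughly the familiar bell curve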

In the class, we were given a challenge to go through the steps needed to examine a new data set. These steps amount to plotting it and having a look. All of the curves looked to be distributed symmetrically around a central mean, suggesting normal distributions. There was a problem with the challenge files, which meant that some of us didn’t have the correct files and so did a different (trivial) task. The key message is to examine your data as soon as you have it, looking for central values and distribution, so as to understand it and how you expect it to behave in analysis.

Other distributions include uniform, logarithmic, left-, right-skewed and bimodal.

Class 6 - Hypothesis testing and the main tests

Just half a dozen students today. We started with a review of the mini-challenge from last session, which included a consideration of colours: HTML colour and a ggplot colour reference were shared.

One emphasis today was that there are multiple ways of doing the same thing in R, illustrated with data importation and summarisation. Although some of the few remaining students seemed to be following the narrative, I became increasingly frustrated and lost trying to follow the class: the thread was along the lines of hypothesis testing, the null hypothesis and corrections.

Playtime discoveries

Between this week’s two classes, I had a quick look at the Comprehensive R Archive Network (CRAN) and discovered a neat thing. Remember the funky assignment operator <- from the first class, with its keyboard shortcut Alt+-? Most programming languages I’ve used have a simple equals sign as the assignment operator, and I was a bit baffled as to why super-sexy R should do something so weird as <-. Well, it turns out that in most contexts = works in exactly the way you’d expect, so these are equivalent:

> x <- 99
> x = 99

What’s interesting about the R syntax, though, is that it can be reversed and still works:

> 99 -> x

I’m sure that the case for this will become apparent eventually. One more way to do assignments is to call the assignment function:

> assign('x', 99) # notice the weird need for quotes around x in the function call

Next steps: other ways to learn R

I got totally overwhelmed by the rate of new concepts in this class, or perhaps by the pace and style of delivery. Given the time I have to spend after each class going over the material in order to make sense of it, it is clear that this course is not right for me. I have decided to abandon it and follow other materials more appropriate to my purpose, which is to develop skills in data visualisation using R. Despite the title, this course is not doing that for me.

Having decided [2] to seek out other ways to learn R, I found that there are lots of self-directed learning resources out there, of course. An introduction to R course at DataCamp will be the new direction of travel down this rabbit hole for me.

Notes and references

  1. I seem not to be the only one amongst the diminishing cohort (down to a dozen today, Monday, from closer to 50 at the start) finding it hard to follow this course. Yes, it’s interesting and cool, the different things you can do with R and geometries, but I am missing clarity, signposting, structure and other basic pedagogical features. 

  2. Taking time after class 5 to review the presentation slides, it was possible to reconstruct what the instructor was perhaps intending, for example in the two slides on probability, and more generally with this class today. I would have got very little from today’s session without spending as long again going over the materials. I am perhaps going to seek out other resources.

Statistics and Visualisation with R #2⤴

from @ @cullaloe | Tech, tales and imagery

This is the second in a series of posts containing my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh. The course is running two days a week in June and July.

Second week

Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 18 June 13:00-14:30 Hypothesis testing and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

A bit of homework

Errors

I had the chance to play with the IDE and learn a few things by invoking errors.

> ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()
Error in ggplot(college, aes(x = sat_avg, y = admission_rate)) : 
  could not find function "ggplot"

The above occurs if you just run that line of code without first loading the library in which ggplot lives, i.e. by calling library("tidyverse") first.

> ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()
Error in ggplot(college, aes(x = sat_avg, y = admission_rate)) : 
  object 'college' not found

This error is thrown because the data object ‘college’ has not been created. Create it by loading the data first, i.e. call college <- read_csv('http://672258.youcanlearnit.net/college.csv').

Mutate and as.factor

Last week’s class included the creation of the dataset using this call:

college <- college %>%
  mutate(state=as.factor(state), region=as.factor(region),
         highest_degree=as.factor(highest_degree),
         control=as.factor(control), gender=as.factor(gender),
         loan_default_rate=as.numeric(loan_default_rate))

The mutate() function adds new variable columns to the dataset, or replaces existing ones if the same name is used. You can see that state is replaced by a different version of itself: the as.factor() function takes data and turns it into a factor if it isn’t already. “Factor” is a term indicating that the data is a category or enumerated type, rather than just a set of strings. To see this, consider the following:

> var=letters[1:5]
> var
[1] "a" "b" "c" "d" "e"
> var=as.factor(var)
> var
[1] a b c d e
Levels: a b c d e

var is created as a vector, then converted to a factor by the as.factor call.

Class 3 - Intro to statistics and descriptive stats

  • Discuss the results of the challenge
  • Intro to Statistics
  • Descriptive Statistics
  • Summarising Statistics
  • Exploring 1 variable: Plotting distribution Histograms and Density plot

The results of the challenge

Last week’s challenge was discussed, focusing on the meaning of the plots obtained, and how to add a best-fit line using the +geom_smooth() geometry in the ggplot call. We played a bit with a tool to help us get better at “seeing” the correlation of plotted data: guessthecorrelation.

Intro to Statistics

This part of the class was further levelling, going over the basics of statistics and how they may be used to summarise or infer information. Measures of central tendency include the arithmetic mean, median and mode. Measures of dispersion like variance and standard deviation were also discussed.

Standard deviation: \[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]

These functions were illustrated in the R IDE, including accessing columns within datasets using the dollar sign like this:

mean(iris$Petal.Width)	# using the mean() function

Similar functions offer median, var (variance) and sd (standard deviation) calculations.
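
One wrinkle worth noting: the formula above is the population standard deviation, whereas R’s sd() computes the sample version, dividing by N - 1. A quick sketch of the difference:

x <- iris$Petal.Width
sqrt(mean((x - mean(x))^2))			# population sd, as in the formula above
sd(x)						# sample sd
sqrt(sum((x - mean(x))^2) / (length(x) - 1))	# identical to sd(x)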

Visualise the distribution: Histogram and Density plots

There is a nice cheat sheet for ggplot data visualisation tips.

Subsets of the data based upon some category can be easily made:

virginica <- subset(iris, Species=="virginica")

A table of data around these subsets can be quickly constructed and plotted:

#Values of Virginica
mean(virginica$Petal.Width)
median(virginica$Petal.Width)
var(virginica$Petal.Width)
sd (virginica$Petal.Width)

#... etc for the other two

# Make a new table...
Species <- c("setosa","versicolor", "virginica")

# add columns...
Mean <- c(mean(setosa$Petal.Width), mean(versicolor$Petal.Width), mean(virginica$Petal.Width))
Median <- c(median(setosa$Petal.Width), median(versicolor$Petal.Width), median(virginica$Petal.Width))
Variance <- c(round(var(setosa$Petal.Width), digits = 2), round(var(versicolor$Petal.Width), digits = 2), round(var(virginica$Petal.Width), digits = 2))
SD <- c(round(sd(setosa$Petal.Width), digits = 2), round(sd(versicolor$Petal.Width), digits = 2),round(sd(virginica$Petal.Width), digits = 2))

# Make the data frame for plotting...
FullPlot <- as.data.frame(cbind(Species, Mean, Median, Variance, SD))

Notice that you can double-click on the FullPlot table in the environment box (equivalent to the command View(FullPlot)); it will display the table in a new tab for inspection.

# ... and now plot it in a nice histogram

ggplot(iris, aes(x=Petal.Width, fill=Species))+ 
  geom_histogram(alpha=0.8,color="black", binwidth=0.08)+
  geom_vline(aes(xintercept = mean(Petal.Width)),col='red',size=2)+
  theme_bw()+facet_wrap(~Species, ncol = 1) 

The result is a faceted histogram of Petal.Width for each species, with the overall mean marked by a red vertical line.

The next challenge, number 2

Once again, I got nowhere with the challenge in the 10 minutes we had to do it. I spent most of that time, after loading the dataset, trying to figure out how to make a scatter plot. When we reviewed the problem at the end of the class, it became apparent that the question was looking for a histogram, not a scatter plot. Getting stuck here meant that I didn’t know how to approach the second part of the challenge, and didn’t progress to reading the third.

Class 4 - Boxplot and playing with Colours

  • Discuss Challenge 2
  • Boxplots
  • Playing with colours
  • Exporting graphs

Discussing Challenge 2 and feeling lost

No time to spend on this before the next class, so I just sat back and tried to follow the discussion [1].

Boxplots

Boxplot

  • The thick black line is the median.
  • The boxes represent the 50% of the sample closest to the median.
  • The whiskers correspond to the 95% of the sample closest to the median, i.e. about 2 SD from the median.
  • The dots represent the outliers.

In R, boxplots are another geometry [2]:

ggplot(iris, aes(x=Species, y=Petal.Width, fill=Species)) + geom_boxplot()

Playing with colours

There are a number of colour themes within R to allow you to make beautiful and readable plots and graphs. What is important, as with all open source, is that when using different packages and utilities, you must RTFM [3].

These are invoked with colour commands and palettes, like scale_color_manual here:

library(wesanderson)	# wes_palette() comes from the wesanderson package

ggplot(iris, aes(x=Species, y=Petal.Width, color=Species)) + 
  geom_boxplot(outlier.alpha = 0)+
  scale_color_manual(values = wes_palette("GrandBudapest2", n=4))

# other color methods:
scale_color_grey(start = 0.6, end = 0.1) # grayscale
scale_color_manual(values=c("#80ec65", "#145200", "#700015")) # manually defined
scale_color_gradientn(colours = rainbow(7)) # gradient

Exporting graphs

Pretty self-explanatory from the IDE: exporting graphs to pdf or png is available from the dropdown in the plots tab.

Challenge number 3

Well, today I managed to get a result during the time allocated for us to work on the problem, which made up for how I had been feeling earlier. With a little tweaking after class, I settled on this:

ggplot(college, aes(x=region, y=faculty_salary_avg, fill = region)) + 
  geom_boxplot(outlier.alpha = 0.0, alpha = 0.8) +
  geom_jitter(alpha =0.2)+
  theme_bw()+ 
  facet_wrap(~control, ncol = 2)+
  labs(title = "USA University", 
       subtitle = "Faculty salary by region", 
       x = "Region", 
       y = "Average salary")

which yielded this:

Challenge 3 plot

Next class

  • Data Collection
  • Datasets biases
  • Inferential statistics
  • Probability

Notes and references

  1. This shook my confidence a little, which took me back to my school days, the last time that I was so lost in a class that I wasn’t even aware when the teacher was asking a question. All you can do in this situation is wait for the next section and try to be invisible. Part of the difficulty I am having is the heavy accent of the instructor, which causes me to miss signposts and words. It made me think of all the EAL kids in Scottish schools. 

  2. There is a complete catalogue of the graphs available in R at r-graph-gallery.com

  3. Read the manual. 

Statistics and Visualisation with R⤴

from @ @cullaloe | Tech, tales and imagery

This is the first in a series of posts containing my notes from a course on Statistics and (data) Visualisation with R, presented by Lucia Michielin of the University of Edinburgh. The course is running two days a week in June and July.

First week

The learning environment

The course was presented in a blended form using Collaborate, Blackboard Learn and Slack. These represent three key elements in a remote learning environment:

  • Synchronous, whole-class activity, usually instructor-led, but may include “break-out” sessions for students to work together in groups.
  • A resources repository for files, course outlines, presenter’s notes and assignments.
  • A back-channel for students to interact in, and for interaction with the course leader. This may be synchronous or asynchronous; in this course Slack is used only asynchronously, with Collaborate chat used in the live sessions.

The offline chatter and forum space is useful in courses like this, which is why Slack is such a useful tool here. Microsoft Teams has been suggested as an equivalent to Slack, but I really don’t find it intuitive or useful in the same way. The UI, as with all Microsoft products I have ever used, is difficult and intrusive.

Preparation

The pre-course task is to install R and the R-Studio IDE. I painlessly installed R 4.0.0 “Arbor Day” and R-Studio 1.3.959 on my Mac, and downloaded the data files for the first week.

Programme overview

Class Date and time Title
Class 1 01 June 13:00-14:30 Intro to R and R studio
Class 2 02 June 13:00-14:30 Types of data and Grammar of graphs
Class 3 08 June 13:00-14:30 Intro to statistics and descriptive stats
Class 4 09 June 13:00-14:30 Boxplot and playing with Colours
Class 5 15 June 13:00-14:30 Data Collection Bias, Probability, and Distribution
Class 6 18 June 13:00-14:30 Hypothesis testing and the main tests
Class 7 22 June 13:00-14:30 Barcharts and cleaning the sample
Class 8 23 June 13:00-14:30 PCA and Cluster analysis
Class 9 29 June 13:00-14:30 Covariance, Regression, Similarity and Difference coefficients
Class 10 30 June 13:00-14:30 Recap and Bring your dataset class

Class 1 - Intro to R and R studio

This began with an introduction to The Edinburgh Centre for Data, Culture & Society (CDCS) and their other courses and research projects, followed by an introduction to our course leader, Lucia, and a walk around the VLE folders and files. Then, an overview of the first session:

  • Intro to Quantitative methods in Research
  • The R and R studio Interface
  • How to organise your work in R efficiently
  • How to install packages

This is fairly self-explanatory but an essential levelling for all delegates on the course, who come from a range of backgrounds. Most seemed to be researchers.

R and R-Studio

R is the language, and R-Studio is a graphical IDE for working with projects and data using R. Lucia said that the key skill in acquiring a new language is knowing how to Google: R is widely used, with a large user base, and the forums are very helpful for quickly overcoming problems.

Scientific method

We heard about the scientific method, and in particular the issue of correlation vs. causation. This was in the context of remaining close to the process and the meaning of the data whilst engaging deeply with the IDE.

  • Define a research question
  • Explore the evidence
  • Define working hypotheses
  • Translate hypotheses → statistical models
  • Compare models to evidence
  • Interpret results

Getting our hands dirty

Finally we got to open R-Studio and follow Lucia through opening the script for today and take a walk around the IDE. We started with creating a new project and setting up a folder organisation for the course, then played with settings to make the IDE more comfortable for us.

Comment and header syntax was explained, then an immediate mode of running commands in the console (by pressing cmd-enter). Outputs are printed as vectors, with the index of the first element of each line shown in brackets (R counts from 1).

# THIS IS A LEVEL 1 HEADER #################################

## This Is a Level 2 Header ================================

### This is a level 3 header. ------------------------------

print('Hello World!')

## Define Variables ================
x <- 1:5 #Put the numbers 1 to 5 in the variable x (Alt+- is the IDE shortcut for <-)
x #Displays the values we have set in x

Getting help is a matter of issuing a query command: e.g. ?datasets displays help on the Datasets package.

Packages

Packages are installed and activated thus:

install.packages("ggplot2")
?install.packages  # Get help on installing packages

## Activate packages ==========================
library(ggplot2)

We ended the session by installing something called the “Tidyverse”, which seems to be a collection of useful packages. It calls itself an “opinionated collection of R packages designed for data science”.

Class 2 - Types of data and Grammar of graphs

  • Type of Data
  • How to load data
  • How to choose the right Graph
  • Graph types
  • The Grammar of Graphs
  • Ggplot2 structure
  • Graph settings

Data types

Type Description
Vector One dimensional collection of data, like a column or row, all of the same type
Matrix Two-dimensional collection of data, all of the same type
Array Multi-dimensional collection of data, all of the same type
Data frame Mixed type variables (similar to a table in spreadsheet)

By convention, a table has variables in columns and the observations or data values in rows. Vectors, arrays and matrices in R must all contain the same data type; a data frame is the only construct that allows mixed types.
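
A quick illustration of that constraint (values invented):

m <- cbind(c(1, 2), c("a", "b"))	# a matrix coerces everything to one type
class(m[1, 1])				# "character" - the numbers became strings
df <- data.frame(n = c(1, 2), s = c("a", "b"))
str(df)					# n stays numeric, s stays character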

Keeping the environment clean

# Clear environment
rm(list = ls())

# Clear console
cat("\014")  # ctrl+L

Loading data

df <- read_csv("data/RodentSimplified.csv")

# ... same, with selection and mutation of month column name:

df2 <- read_csv("data/RodentSimplified.csv") %>%
  select(mo,dy,yr,period,species) %>% 
  mutate(period = as.factor(period)) %>%
  rename(month = mo) %>%
  print()

What looks like a continuation line in the above code block is called a “pipe”, written %>%. A pipe is much more than a way of making the code look pretty: it really does work as a pipe, feeding the value on its left into the function call on its right as its first argument.
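
A minimal illustration using the df2 built above (assuming the tidyverse is loaded): the pipe passes its left-hand side in as the first argument of the call on its right, so these two lines do the same thing:

filter(df2, month == 7)		# ordinary nested call
df2 %>% filter(month == 7)	# piped: df2 becomes filter()’s first argument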

Graph types

After some warnings about choosing a data presentation that is consistent with your intention to actually communicate something, and not to show off, Lucia introduced us to some of the graph types available in R. She gave us a really nice demonstration of why many journals prefer data presentation (e.g. ratios) as a bar chart rather than a pie chart: differences are sometimes clearer in the bar chart. Apparently, Florence Nightingale invented the pie chart!

The Grammar of graphics

The ggplot function allows the creation of graphs according to the framework outlined by Wilkinson [1], in which the separate aspects of data presentation are dealt with separately. A basic template, with examples for ggplot, is easy to find online. My take on the framework is:

  1. a data set
  2. aesthetics = a set of scales and variables to be plotted
  3. geometry = shapes to represent the data and the type of chart
  4. a grid of subplots, or facets
  5. statistics = summaries of the data
  6. coordinates to define the plotting space
  7. global themes and labels to make it look pretty

Some of these map into the ggplot function, from the most basic (using examples from the weekly challenge, below):

ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()

Here, the aes part is the aesthetics definition of the variables for the plot; geom_point defines the type of chart (scatter plot), and college is the data set.

The weekly challenge

I completely failed to get anything at all out of the challenge in the 5 minutes we were given for this task. My code just didn’t produce any kind of plot, barfing errors at me instead. This was entirely down to my inability to type, or to read carefully what I had typed. Later, I managed to make some progress:

# Import CSV files from online repo
college <- read_csv('http://672258.youcanlearnit.net/college.csv')
college <- college %>%
  mutate(state=as.factor(state), region=as.factor(region),
         highest_degree=as.factor(highest_degree),
         control=as.factor(control), gender=as.factor(gender),
         loan_default_rate=as.numeric(loan_default_rate))
# Create a chart that would show the relationship between the SAT average and the admission rate 
ggplot(college, aes(x=sat_avg, y=admission_rate)) + geom_point()

Time is not in abundance at the moment, so I did eventually just look at the solution and run it, seeing how each element of the code contributed to the final result. Here’s my result: a scatter plot of SAT average against admission rate.

Next class

  • Discuss the results of the challenge
  • Intro to Statistics
  • Descriptive Statistics
  • Summarising Statistics
  • Exploring 1 variable: Plotting distribution Histograms and Density plot

Notes and references

  1. Wilkinson, L. (2005) The Grammar of Graphics. Springer-Verlag. doi: 10.1007/0-387-28695-0. 

Some witchy history and a very smart woman in data science⤴

from @ education

So yesterday and today the Twitter has been on fire about some work that was done over the summer by Emma Carroll, a recent graduate of Edinburgh and working with us as an Equate Scotland Careerwise intern. It culminated today in a really nice news … Continue reading Some witchy history and a very smart woman in data science

Some thoughts on Doing Data Right⤴

from @ education

I'm playing catch up on the blog - my Drafts is a scary place to look and a lot isn't going to see the light of day. I'm also not going to be shackled to trying to get stuff done in any sort of chronological … Continue reading Some thoughts on Doing Data Right

ALT-C 2019: Ethical EdTech⤴

from @ education

I'm taking part in 2 sessions at ALT-C this year and whilst they might at first glance look totally different, they are in fact underpinned by the same critical thinking and ethical approaches that guide a lot of my (and our) work at Edinburgh. We … Continue reading ALT-C 2019: Ethical EdTech