Data Analysis with R (Part 1): Introduction to R Programming
Source: https://campus.datacamp.com/
Selection by comparison
Select the values in numeric_vector that are larger than 10. The comparison returns a logical vector (TRUE or FALSE for each element); assign it to the variable larger_than_ten.
# A numeric vector containing 3 elements
numeric_vector <- c(1, 10, 49)
larger_than_ten <- numeric_vector > 10
print(larger_than_ten)
Print the selected elements
numeric_vector <- c(1, 10, 49)
larger_than_ten <- numeric_vector > 10
# use the logical vector as an index to keep only the TRUE positions
numeric_vector[larger_than_ten]
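The two steps above can also be combined. The following is a minimal sketch, reusing the same numeric_vector; the names selected and positions are introduced here only for illustration.

```r
numeric_vector <- c(1, 10, 49)

# Filter in one step: the comparison result is used directly as the index
selected <- numeric_vector[numeric_vector > 10]

# which() returns the positions of the TRUE values rather than the values
positions <- which(numeric_vector > 10)

print(selected)
print(positions)
```

Note that 10 itself is not selected, because > is a strict comparison; >= would include it.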
Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.
You can construct a matrix in R with the matrix() function. Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
In the matrix() function:
- The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which constructs the vector c(1, 2, 3, 4, 5, 6, 7, 8, 9).
- The argument byrow indicates that the matrix is filled by rows. This means that the matrix is filled from left to right, and when the first row is completed, the filling continues on the second row. If we want the matrix to be filled by columns instead, we set byrow = FALSE.
- The third argument nrow indicates that the matrix should have three rows.
- The fourth argument ncol indicates the number of columns that the matrix should have.
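To make the effect of byrow concrete, here is a small sketch that builds the same 1:9 matrix both ways. The variable names by_row and by_col are chosen for illustration only.

```r
# Same elements, filled row by row versus column by column
by_row <- matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
by_col <- matrix(1:9, byrow = FALSE, nrow = 3, ncol = 3)

print(by_row)
print(by_col)

# The first row differs depending on the fill direction
print(by_row[1, ])  # 1 2 3
print(by_col[1, ])  # 1 4 7
```

The two results are transposes of each other, so choosing the fill direction matters whenever the input vector has a meaningful order.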
Another example
# Construct a matrix with 5 rows that contains the numbers 1 up to 20 and assign it to mymatrix
mymatrix <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
# print mymatrix to the console
print(mymatrix)
Factors
The term factor refers to a statistical data type used to store categorical variables.
# a vector called student_status
student_status <- c("student", "not student", "student", "not student")
# turn student_status into a factor and save it in the variable categorical_student
categorical_student <- factor(student_status)
# print categorical_student to the console
print(categorical_student)
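As a brief sketch of what a factor stores beyond the original strings, the block below inspects the categories, their counts, and the underlying integer codes. The names status_levels, status_counts, and status_codes are illustrative.

```r
student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)

# The distinct categories; by default they are sorted alphabetically
status_levels <- levels(categorical_student)

# How often each category occurs
status_counts <- table(categorical_student)

# Internally, each value is stored as an integer code pointing at a level
status_codes <- as.integer(categorical_student)

print(status_levels)
print(status_counts)
print(status_codes)
```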
Dataframes
A dataframe is a collection of elements arranged into a fixed number of rows and columns, where each column can hold a different data type (numeric, character, or logical).
All elements of a matrix must be of the same type, but a dataframe doesn’t have this restriction.
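The difference can be seen in a small sketch with made-up values: combining a character vector and a numeric vector into a matrix coerces everything to character, while a dataframe keeps each column's type. The names m and df and the example values are assumptions for illustration.

```r
# A matrix can hold only one type, so the numbers are coerced to character
m <- cbind(name = c("a", "b"), value = c(1, 2))
matrix_class <- class(m[, "value"])

# A dataframe keeps the numeric column numeric
df <- data.frame(name = c("a", "b"), value = c(1, 2))
df_class <- class(df$value)

print(matrix_class)  # "character"
print(df_class)      # "numeric"
```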
How to inspect a dataframe?
There are several functions you can use to inspect your dataframe. To name a few:
- head: by default prints the first 6 rows of the dataframe
- tail: by default prints the last 6 rows of the dataframe
- str: prints the structure of your dataframe
- dim: returns the dimensions, that is, the number of rows and columns of your dataframe
- colnames: returns the names of the columns of your dataframe
For example,
# print the first 6 rows of mtcars
print(head(mtcars))
# print the structure of mtcars (str() prints by itself; wrapping it in print() would also print NULL)
str(mtcars)
# print the dimensions of mtcars
print(dim(mtcars))
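The remaining functions from the list, tail and colnames, work the same way. A short sketch on the built-in mtcars dataset; the variable names are illustrative.

```r
# Last 6 rows of mtcars (the default for tail)
last_rows <- tail(mtcars)
print(last_rows)

# Both head() and tail() accept n to change how many rows are shown
print(tail(mtcars, n = 3))

# Column names of mtcars
column_names <- colnames(mtcars)
print(column_names)
```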
How to construct a dataframe?
Suppose you want to construct a data frame that describes the main characteristics of eight planets in our solar system. The main features of a planet are:
- The type of planet (Terrestrial or Gas Giant).
- The planet’s diameter relative to the diameter of the Earth.
- The planet’s rotation period relative to that of the Earth.
- If the planet has rings or not (TRUE or FALSE).
You construct a data frame with the data.frame() function. As arguments, you pass the vectors described above; each one becomes a column of the data frame. Therefore, it is important that every vector used to construct a data frame has the same length. But do not forget that it is possible (and likely) that they contain different types of data.
# planets vector
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
# type vector
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
# diameter vector
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
# rotation vector
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
# rings vector
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# construct a dataframe planet_df from all the above variables
planet_df <- data.frame(planets, type, diameter, rotation, rings)
Indexing and selecting columns from a dataframe
In the same way as you indexed your vectors, you can select elements from your dataframe using square brackets. Unlike vectors, however, a dataframe has two dimensions: rows and columns. That’s why you use a comma inside the brackets to separate the row index from the column index. For instance, planet_df[1,2] selects the element in the first row and the second column of the dataframe planet_df.
You can also use the $ operator to select an entire column from a dataframe. For instance, planet_df$planets would select the entire planets column from the dataframe planet_df.
Example
# select the values in the first row and second and third columns
planet_df[1, 2:3]
# select the entire third column
planet_df[, 3]
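Square brackets also accept column names and logical conditions. The sketch below uses an abbreviated copy of planet_df (first five planets, three columns) so that it runs on its own; the name small_planets is illustrative.

```r
# Abbreviated copy of planet_df so this block is self-contained
planet_df <- data.frame(
  planets  = c("Mercury", "Venus", "Earth", "Mars", "Jupiter"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Select columns by name instead of by position
print(planet_df[1:3, "diameter"])
print(planet_df[, c("planets", "rings")])

# Select rows with a logical condition: planets smaller than the Earth
small_planets <- planet_df[planet_df$diameter < 1, ]
print(small_planets)
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if the column order changes.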
Lists
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristics, and the type of activity that has to be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other.
You can easily construct a list using the list() function. In this function you can wrap the different elements like so: list(item1, item2, item3).
# Vector with numerics from 1 up to 10
my_vector <- 1:10
# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)
# First 10 rows of the built-in data frame 'mtcars'
my_df <- mtcars[1:10, ]
# Construct my_list with these different elements:
my_list <- list(my_vector, my_matrix, my_df)
# print my_list to the console
print(my_list)
Selecting elements from a list
Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward. One way to select a component is using the numbered position of that component. For example, to “grab” the first component of my_list you type my_list[[1]].
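Another way is to refer to the names of the components. This only works if the components were named when the list was built, for example with list(my_vector = my_vector, my_matrix = my_matrix, my_df = my_df). In that case, my_list[["my_vector"]] selects the my_vector vector.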
A last way to grab a component from a named list is using the $ sign. The following code would select my_df from my_list: my_list$my_df.
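Selection by name and with $ requires a named list; on an unnamed list both return NULL. The following sketch builds the list with names (an assumption added here, since list(my_vector, my_matrix, my_df) alone does not name its components) and shows that the three selection styles agree.

```r
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10, ]

# Name each component explicitly so it can be selected by name
my_list <- list(my_vector = my_vector, my_matrix = my_matrix, my_df = my_df)

by_position <- my_list[[1]]            # by numbered position
by_name     <- my_list[["my_vector"]]  # by name with double brackets
by_dollar   <- my_list$my_vector       # by name with $

print(names(my_list))
print(identical(by_position, by_name))
```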
Besides selecting components, you often need to select specific elements out of these components. For example, with my_list[[1]][1] you select the first element of the first component of my_list. This would select the number 1.
# Vector with numerics from 1 up to 10
my_vector <- 1:10
# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)
# First 10 rows of the built-in data frame 'mtcars'
my_df <- mtcars[1:10, ]
# Construct list with these different elements:
my_list <- list(my_vector, my_matrix, my_df)
# Grab the second component of my_list and print it to the console
print(my_list[[2]])
# Grab the first column of the third component of my_list and print it to the console
print(my_list[[3]][, 1])
Getting help
help(mean)
?mean
args(mean)
How to Create Functions
# make a function called multiply_a_b
multiply_a_b <- function(a, b) {
  return(a * b)
}
# call the function multiply_a_b and store the result into a variable result
result <- multiply_a_b(a = 3, b = 7)
Getting your data into R
Before you can analyze your data, you need to get it into R. R and its packages contain many functions to read in data from different formats. To name only a few:
- read.table: reads in tabular data such as txt files
- read.csv: reads in data from a comma-separated file format
- readWorksheetFromFile: reads in an Excel worksheet (from the XLConnect package)
- read.spss: reads in data from the .sav SPSS format (from the foreign package)
For the current exercise, we have saved R's built-in mtcars dataset as a CSV file and put it online. The sample data can be found at the following link:
http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv
# load in the data and store it in the variable cars
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv")
# print the first 6 rows of the dataset using the head() function
print(head(cars))
Define the separator
# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")
# print the first 6 rows of the dataset
print(head(cars))
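To see why the separator matters without depending on a download, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back twice. The file contents and the names tmp, wrong_sep, and right_sep are made up for illustration.

```r
# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("model;mpg;cyl",
             "Mazda RX4;21.0;6",
             "Datsun 710;22.8;4"), tmp)

# With the default separator (a comma), each line ends up in a single column
wrong_sep <- read.csv(tmp)
print(ncol(wrong_sep))

# With sep = ";" the three columns are recognized
right_sep <- read.csv(tmp, sep = ";")
print(right_sep)

# Clean up the temporary file
unlink(tmp)
```

If a freshly loaded dataset has only one column, a wrong separator is a likely cause.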
Working directories in R
The working directory is the folder where R looks for files by default. On Windows this is typically something like C:/Users/Username/Documents.
Of course, this working directory is not static and can be changed by the user.
In R there are three important functions for this:
- getwd(): retrieves the current working directory
- setwd(): allows the user to set their own working directory
- list.files(): lists all the files that exist in your working directory
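A minimal sketch of these functions working together, assuming it is acceptable to switch temporarily to R's session temp directory. The file name cars_example.csv is hypothetical.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to the session's temporary directory
setwd(tempdir())

# Write a small file there and check that list.files() sees it
write.csv(head(mtcars), "cars_example.csv")
found <- "cars_example.csv" %in% list.files()
print(found)

# Clean up and restore the original working directory
unlink("cars_example.csv")
setwd(old_wd)
```

Restoring the original directory at the end keeps the rest of a script from silently reading or writing files in the wrong place.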
Example code
# list all the files in the working directory
list.files(getwd())
# read in the cars dataset and store it in a variable called cars
cars <- read.csv("cars.csv", sep = ";")
# print the first 6 rows of cars
print(head(cars))
Importing R packages
Imagine we want to do some great plotting and we want to use ggplot2 for it. If we want to do so, we need to take 2 steps:
- Install the package ggplot2 using install.packages("ggplot2")
- Load the package ggplot2 using library(ggplot2) or require(ggplot2)
Example code
# install the ggplot2 package (only needed once per R installation)
install.packages("ggplot2")
# load the ggplot2 package using the require function
require(ggplot2)
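library() and require() differ in how they react to a missing package. The sketch below illustrates this with a deliberately made-up package name (assumed not to be installed), so it runs without downloading anything.

```r
# Hypothetical package name, assumed not to exist on this machine
pkg <- "aPackageThatDoesNotExist"

# require() returns FALSE (and gives a warning) when the package is missing
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() stops with an error instead, caught here with tryCatch()
library_failed <- tryCatch({
  library(pkg, character.only = TRUE)
  FALSE
}, error = function(e) TRUE)
print(library_failed)

# A common pattern is to install only when the package is not yet available:
# if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
```

Because library() fails loudly, it is often the safer choice at the top of a script, while the return value of require() is handy inside conditional logic.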