R Basics

R Intro

Reading Data files

The ability to read datasets into R is an essential skill. For this class, most of the files will be on the course webpage and can be downloaded directly using read_csv from the readr package (part of the tidyverse). Consider the Seattle Housing dataset from the previous lecture.

Seattle <- read_csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv')
## Rows: 869 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Reading Data files

Recall that we can also use the read.csv() function from base R.

Seattle2 <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv')

Viewing Data files

A common function that we will use is head, which shows the first few rows of a data frame.

head(Seattle)
## # A tibble: 6 × 14
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
##     <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 1350000        3      2.5         2753    65005    1            1       2165
## 2  228000        3      1           1190     9199    1            0       1190
## 3  289000        3      1.75        1260     8400    1            0       1260
## 4  720000        4      2.5         3450    39683    2            0       3450
## 5  247500        3      1.75        1960    15681    1            0       1960
## 6  850830        3      2.5         2070    13241    1.5          0       1270
## # ℹ 6 more variables: sqft_basement <dbl>, zipcode <dbl>, lat <dbl>,
## #   long <dbl>, yr_sold <dbl>, mn_sold <dbl>

Viewing Data files

Another useful function is glimpse, available through the tidyverse, which lists every column along with its type and first few values.

glimpse(Seattle)
## Rows: 869
## Columns: 14
## $ price         <dbl> 1350000, 228000, 289000, 720000, 247500, 850830, 890000,…
## $ bedrooms      <dbl> 3, 3, 3, 4, 3, 3, 4, 5, 3, 2, 3, 3, 1, 4, 4, 1, 2, 4, 5,…
## $ bathrooms     <dbl> 2.50, 1.00, 1.75, 2.50, 1.75, 2.50, 1.00, 2.00, 2.50, 1.…
## $ sqft_living   <dbl> 2753, 1190, 1260, 3450, 1960, 2070, 2550, 2260, 1910, 10…
## $ sqft_lot      <dbl> 65005, 9199, 8400, 39683, 15681, 13241, 4000, 12500, 662…
## $ floors        <dbl> 1.0, 1.0, 1.0, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0, 1.0, 1.0, 1…
## $ waterfront    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sqft_above    <dbl> 2165, 1190, 1260, 3450, 1960, 1270, 2370, 1130, 1910, 10…
## $ sqft_basement <dbl> 588, 0, 0, 0, 0, 800, 180, 1130, 0, 0, 580, 570, 0, 0, 0…
## $ zipcode       <dbl> 98070, 98148, 98148, 98010, 98032, 98102, 98109, 98032, …
## $ lat           <dbl> 47.4041, 47.4258, 47.4366, 47.3420, 47.3576, 47.6415, 47…
## $ long          <dbl> -122.451, -122.322, -122.335, -122.025, -122.277, -122.3…
## $ yr_sold       <dbl> 2015, 2014, 2014, 2015, 2015, 2014, 2014, 2014, 2015, 20…
## $ mn_sold       <dbl> 3, 9, 8, 3, 3, 6, 6, 10, 1, 11, 4, 9, 10, 9, 10, 6, 7, 6…

Viewing Data files

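The head function works the same way on the data frame created by read.csv. The result prints as a plain data frame rather than a tibble, so every column is shown and no column types are listed.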

head(Seattle2)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
## 1 1350000        3      2.50        2753    65005    1.0          1       2165
## 2  228000        3      1.00        1190     9199    1.0          0       1190
## 3  289000        3      1.75        1260     8400    1.0          0       1260
## 4  720000        4      2.50        3450    39683    2.0          0       3450
## 5  247500        3      1.75        1960    15681    1.0          0       1960
## 6  850830        3      2.50        2070    13241    1.5          0       1270
##   sqft_basement zipcode     lat     long yr_sold mn_sold
## 1           588   98070 47.4041 -122.451    2015       3
## 2             0   98148 47.4258 -122.322    2014       9
## 3             0   98148 47.4366 -122.335    2014       8
## 4             0   98010 47.3420 -122.025    2015       3
## 5             0   98032 47.3576 -122.277    2015       3
## 6           800   98102 47.6415 -122.315    2014       6

Viewing Data files

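The glimpse function works on ordinary data frames too. Notice that read.csv stores whole-number columns such as bedrooms as integers (<int>), whereas read_csv stored every column as a double (<dbl>).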

glimpse(Seattle2)
## Rows: 869
## Columns: 14
## $ price         <dbl> 1350000, 228000, 289000, 720000, 247500, 850830, 890000,…
## $ bedrooms      <int> 3, 3, 3, 4, 3, 3, 4, 5, 3, 2, 3, 3, 1, 4, 4, 1, 2, 4, 5,…
## $ bathrooms     <dbl> 2.50, 1.00, 1.75, 2.50, 1.75, 2.50, 1.00, 2.00, 2.50, 1.…
## $ sqft_living   <int> 2753, 1190, 1260, 3450, 1960, 2070, 2550, 2260, 1910, 10…
## $ sqft_lot      <int> 65005, 9199, 8400, 39683, 15681, 13241, 4000, 12500, 662…
## $ floors        <dbl> 1.0, 1.0, 1.0, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0, 1.0, 1.0, 1…
## $ waterfront    <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sqft_above    <int> 2165, 1190, 1260, 3450, 1960, 1270, 2370, 1130, 1910, 10…
## $ sqft_basement <int> 588, 0, 0, 0, 0, 800, 180, 1130, 0, 0, 580, 570, 0, 0, 0…
## $ zipcode       <int> 98070, 98148, 98148, 98010, 98032, 98102, 98109, 98032, …
## $ lat           <dbl> 47.4041, 47.4258, 47.4366, 47.3420, 47.3576, 47.6415, 47…
## $ long          <dbl> -122.451, -122.322, -122.335, -122.025, -122.277, -122.3…
## $ yr_sold       <int> 2015, 2014, 2014, 2015, 2015, 2014, 2014, 2014, 2015, 20…
## $ mn_sold       <int> 3, 9, 8, 3, 3, 6, 6, 10, 1, 11, 4, 9, 10, 9, 10, 6, 7, 6…
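
The <int> columns in the glimpse(Seattle2) output come from how read.csv guesses column types. The sketch below illustrates this on a small, made-up CSV supplied inline (the csv_text values and the toy object are hypothetical, not part of the Seattle data), so it runs without a download.

```r
# Hypothetical two-row CSV, supplied inline so no download is needed.
csv_text <- "bedrooms,bathrooms\n3,2.5\n4,1"

# read.csv() is part of base R; the text argument reads from a string.
toy <- read.csv(text = csv_text)

# Whole-number columns are stored as integer, columns with decimals as double.
col_types <- sapply(toy, typeof)
print(col_types)

# If double storage is preferred, convert the integer column explicitly.
toy$bedrooms <- as.numeric(toy$bedrooms)
print(sapply(toy, typeof))
```

For most analyses the integer versus double distinction does not change results, but it can matter when combining data frames read with different functions.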

Activity: Reading Excel files

The readxl package makes importing Excel data files easy.
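
As a starting point for this activity, here is a minimal sketch. It assumes the readxl package is installed and uses the sample workbook bundled with the package rather than a course file, so the file name is an assumption for illustration only.

```r
# Assumes readxl is installed: install.packages("readxl")
library(readxl)

# readxl_example() returns the path to a sample workbook shipped with readxl.
path <- readxl_example("datasets.xlsx")

# List the worksheet names, then read the first sheet into a tibble.
sheets <- excel_sheets(path)
first_sheet <- read_excel(path, sheet = sheets[1])
head(first_sheet)
```

To read your own file, replace path with the location of the workbook on your computer.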

R Data Structures

Data structure Overview

R has four common types of data structures:

  • Vectors
  • Matrices (and Arrays)
  • Lists
  • Data Frames (including tibbles)

Data structure Overview

The base data structures in R can be organized by dimensionality and whether they are homogeneous.

Dimension   Homogeneous   Heterogeneous
1d          Vector        List
2d          Matrix        Data Frame
nd          Array
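
To make the table concrete, the sketch below builds one small object of each kind and checks its dimensions. The object names and values are made up for illustration.

```r
vec <- c(1.5, 2, 3)                           # 1d, one type
lst <- list(1.5, "a", TRUE)                   # 1d, mixed types
mat <- matrix(1:6, nrow = 2)                  # 2d, one type
dfr <- data.frame(x = 1:2, y = c("a", "b"))   # 2d, mixed column types
arr <- array(1:24, dim = c(2, 3, 4))          # 3d, one type

# Plain vectors and lists have no dim attribute.
is.null(dim(vec))   # TRUE
is.null(dim(lst))   # TRUE

# Matrices, data frames, and arrays report one size per dimension.
dim(mat)            # 2 3
dim(dfr)            # 2 2
dim(arr)            # 2 3 4
```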

Vector Types

There are four common types of vectors: logical, integer, double (often called numeric), and character. The c() function combines elements into a vector.

dbl <- c(1,2.5,pi)
int <- c(1L,4L,10L)
log <- c(TRUE,FALSE,F,T)
char <- c('this is','a character string')

Vector Types

The type of a vector can be identified using the typeof() function. Note that a vector can hold only a single data type, so combining types coerces every element to the most flexible one (here, character).

  typeof(dbl)
## [1] "double"
  comb <- c(char,dbl)
  typeof(comb)
## [1] "character"
  comb
## [1] "this is"            "a character string" "1"                 
## [4] "2.5"                "3.14159265358979"
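
The coercion seen here follows a fixed order: logical, then integer, then double, then character. A short sketch with made-up values shows each step.

```r
# Mixing types promotes every element to the most flexible type present.
t1 <- typeof(c(TRUE, 2L))     # "integer"
t2 <- typeof(c(2L, 2.5))      # "double"
t3 <- typeof(c(2.5, "a"))     # "character"
print(c(t1, t2, t3))

# Coercion of logicals is what makes counting TRUE values possible.
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
print(n_true)

# Going the other way requires an explicit conversion.
as.integer("7") + 1L                  # 8
```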

Exercise: Vectors

Create a vector with your first, middle, and last names.

Solution: Vectors

  1. Create a vector with your first, middle, and last names.
andy.names <- c("Andrew","Blake","Hoegh")
andy.names
## [1] "Andrew" "Blake"  "Hoegh"

Logical Values in R (Conditions)

We have touched on many of these before, but here are some examples of expressions (conditions) in R. Evaluate these expressions:

TRUE
## [1] TRUE
pi > 3
## [1] TRUE
pi == 3.14
## [1] FALSE

Exercise: Conditions in R

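Note that & is the element-wise "and" operator, and %in% tests whether each element on the left appears in the vector on the right. Evaluate these expressions: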

pi > 3 & pi < 3.5
c(1,3,5,7) %in% 1:3
1:3 %in% c(1,3,5,7)

Solutions: Conditions in R: Evaluated

pi > 3 & pi < 3.5
## [1] TRUE
c(1,3,5,7) %in% 1:3
## [1]  TRUE  TRUE FALSE FALSE
1:3 %in% c(1,3,5,7)
## [1]  TRUE FALSE  TRUE
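
Logical vectors like these are usually summarized or turned into positions. The sketch below reuses the values 1, 3, 5, 7 from the exercise; the variable names are made up.

```r
x <- c(1, 3, 5, 7)

# & and | work element by element and return a logical vector.
in_range <- x > 2 & x < 6    # FALSE TRUE TRUE FALSE
outside  <- x < 2 | x > 6    # TRUE FALSE FALSE TRUE

# Summaries of a logical vector.
any(in_range)                # TRUE
all(in_range)                # FALSE
sum(in_range)                # 2, the number of TRUE values

# which() converts TRUE values to their positions.
pos <- which(x %in% 1:3)     # 1 2
print(pos)
```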

Data Frame Overview

A data frame:

  • is the most common way of storing data in R
  • is like a matrix with a row-and-column structure; however, unlike a matrix, each column may have a different type
  • is, in a technical sense, a list of equal-length vectors.
df <- data.frame(x = 1:3, y = c('a','b','c'))
kable(df)
x y
1 a
2 b
3 c
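
The "list of equal-length vectors" description can be checked directly. This is a small sketch using a separate object, df_demo, so the df above is left untouched.

```r
df_demo <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Underneath, a data frame is stored as a list with one element per column.
typeof(df_demo)              # "list"
length(df_demo)              # 2, the number of columns

# Each column keeps its own type.
col_types <- sapply(df_demo, typeof)
print(col_types)

# Columns whose lengths cannot be matched are rejected.
bad <- try(data.frame(x = 1:3, y = c("a", "b")), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```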

tibble()

A modern data frame can be constructed using the tibble() command.

  • The read_csv command creates a tibble rather than a data.frame.
  • The tibble includes the type of each vector, and only prints a certain number of rows/columns.
tibble1 <- tibble(x = 1:3, y = c('a','b','c')); tibble1
## # A tibble: 3 × 2
##       x y    
##   <int> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c

Subsetting

Vector Subsetting: I

Subsetting allows you to extract elements from an object.

num.vec <- seq(from = 1, to = 9, by = 1); num.vec
## [1] 1 2 3 4 5 6 7 8 9
num.vec[1:3]
## [1] 1 2 3
num.vec[c(1,5,8)]
## [1] 1 5 8

Vector Subsetting: II

Subsetting also works with negative indices, which drop elements, and with logical expressions, which keep the elements where the condition is TRUE.

num.vec[-5]
## [1] 1 2 3 4 6 7 8 9
num.vec[num.vec != 6]
## [1] 1 2 3 4 5 7 8 9
num.vec[num.vec > 5]
## [1] 6 7 8 9

Vector Subsetting: III

Another possibility is to use logical values directly.

num.vec > 5
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
num.vec[num.vec > 5]
## [1] 6 7 8 9
num.vec[rep(c(TRUE,FALSE,TRUE),each=3)]
## [1] 1 2 3 7 8 9
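
A few more variations on the same idea are sketched below: positions from which(), dropping a range, combining conditions, and subsetting by name. The named vector is made up for illustration.

```r
num.vec <- seq(from = 1, to = 9, by = 1)

# which() returns the positions where the condition holds.
which(num.vec > 5)                          # 6 7 8 9

# A negative range drops several elements at once.
tail_part <- num.vec[-(1:3)]                # 4 5 6 7 8 9

# Conditions can be combined with &.
even_big <- num.vec[num.vec > 2 & num.vec %% 2 == 0]   # 4 6 8
print(even_big)

# Named vectors can also be subset by name.
named <- c(a = 10, b = 20, c = 30)
picked <- named[c("a", "c")]
print(picked)
```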

Data Frame Subsetting: I

The same ideas apply to data frames, but the indices now refer to rows and columns, written as df[rows, columns]. Leaving one position blank selects everything along that dimension.

df <- data.frame(x=1:3, y=3:1, z=c('a','b','c'))
df[,1]
## [1] 1 2 3
df[-1,c(2:3)]
##   y z
## 2 2 b
## 3 1 c
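
Note that df[,1] returned a plain vector rather than a data frame. The sketch below, which rebuilds the same df, shows how the drop argument controls this and how a logical condition can select rows.

```r
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"), stringsAsFactors = FALSE)

# Selecting a single column simplifies the result to a vector by default.
col_vec <- df[, 1]
is.data.frame(col_vec)       # FALSE

# drop = FALSE keeps the result as a one-column data frame.
col_df <- df[, 1, drop = FALSE]
is.data.frame(col_df)        # TRUE

# Rows can be chosen with a logical condition and columns by name.
z_vals <- df[df$y > 1, "z"]  # "a" "b"
print(z_vals)
```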

Data Frame Subsetting: II

There are also a couple of built-in tools in R for subsetting data frames: the $ operator and the subset() function.

df$x
## [1] 1 2 3
new.df <- subset(df, x >1); new.df
##   x y z
## 2 2 2 b
## 3 3 1 c

Data Frame Subsetting: III

The filter() and select() functions in the dplyr package (in the tidyverse) can also be used for subsetting.

df %>% select(x)
##   x
## 1 1
## 2 2
## 3 3
filter(df, x > 1)
##   x y z
## 1 2 2 b
## 2 3 1 c
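
For comparison, the same selections can be written in base R without dplyr. This is a sketch on the same toy df; note in the output above that filter() renumbers the rows, while bracket subsetting keeps the original row names.

```r
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"), stringsAsFactors = FALSE)

# Similar to select(df, x): single brackets with a name keep a data frame.
base_select <- df["x"]

# Similar to filter(df, x > 1): keep the rows where the condition is TRUE.
base_filter <- df[df$x > 1, ]
rownames(base_filter)            # "2" "3", the original row names

# Resetting the row names matches the numbering that filter() printed.
rownames(base_filter) <- NULL
print(base_filter)
```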

Exercise: Subsetting

  1. Create a new data frame that only includes houses worth more than $1,000,000.

  2. From this new data frame, what is the average living square footage of the houses?

Exercise: Subsetting - Solutions1

  1. Create a new data frame that only includes houses worth more than $1,000,000.
expensive_houses <- subset(Seattle, price > 1000000)
expensive_houses2 <- Seattle %>% filter(price > 1000000)

Exercise: Subsetting - Solutions2

  2. From this new data frame, what is the average living square footage of the houses? Hint: a column in a data.frame can be extracted with $, as in Seattle$sqft_living.
mean(expensive_houses$sqft_living)
## [1] 3890.065
Seattle %>% filter(price > 1000000) %>% summarize(ave_size = mean(sqft_living))
## # A tibble: 1 × 1
##   ave_size
##      <dbl>
## 1    3890.

Lists

Consider the following two lists, which store the same information in different layouts:

msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'), 
         degree.from = c('University of Texas at Arlington','Virginia Tech'),
         job.title = c('President', 'Assistant Professor of Statistics'))

msu.info2 <- list(c('Waded Cruzado','University of Texas at Arlington',
                     'President'), c('Andy Hoegh',
                  'Virginia Tech','Assistant Professor of Statistics'))

List Output

msu.info
## $name
## [1] "Waded Cruzado" "Andy Hoegh"   
## 
## $degree.from
## [1] "University of Texas at Arlington" "Virginia Tech"                   
## 
## $job.title
## [1] "President"                         "Assistant Professor of Statistics"
msu.info2
## [[1]]
## [1] "Waded Cruzado"                    "University of Texas at Arlington"
## [3] "President"                       
## 
## [[2]]
## [1] "Andy Hoegh"                        "Virginia Tech"                    
## [3] "Assistant Professor of Statistics"

Lists - indexing

With the current lists, we can index elements using the double bracket [[ ]] notation or, if the elements have been named, using the $ operator with the name.

So the first element of msu.info can be indexed either way:

msu.info[[1]]
## [1] "Waded Cruzado" "Andy Hoegh"
msu.info$name
## [1] "Waded Cruzado" "Andy Hoegh"

Exercise: Lists

Explore the indexing with these commands.

msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'), 
         degree.from = c('University of Texas at Arlington','Virginia Tech'),
         job.title = c('President', 'Assistant Professor of Statistics'))
msu.info[1]
msu.info[[1]]
msu.info$name[2]
msu.info[1:2]
unlist(msu.info)

Solution: Lists 1

msu.info[1]
## $name
## [1] "Waded Cruzado" "Andy Hoegh"
msu.info[[1]]
## [1] "Waded Cruzado" "Andy Hoegh"
msu.info$name[2]
## [1] "Andy Hoegh"

Solution: Lists 2

msu.info[1:2]
## $name
## [1] "Waded Cruzado" "Andy Hoegh"   
## 
## $degree.from
## [1] "University of Texas at Arlington" "Virginia Tech"
unlist(msu.info)
##                               name1                               name2 
##                     "Waded Cruzado"                        "Andy Hoegh" 
##                        degree.from1                        degree.from2 
##  "University of Texas at Arlington"                     "Virginia Tech" 
##                          job.title1                          job.title2 
##                         "President" "Assistant Professor of Statistics"
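
The key difference in these solutions is that single brackets return a smaller list, while double brackets return the element itself. A short sketch with a made-up list (info is hypothetical) makes this explicit.

```r
info <- list(name = c("A", "B"), role = c("X", "Y"))

# Single brackets keep the list wrapper.
class(info[1])        # "list"
length(info[1])       # 1

# Double brackets (or $) extract the element itself.
class(info[[1]])      # "character"
length(info[[1]])     # 2

# Once extracted, the element can be indexed like any vector.
second_role <- info[["role"]][2]   # "Y"

# unlist() flattens the list and builds names from element name plus position.
flat_names <- names(unlist(info))
print(flat_names)
```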

Lists - nested lists

list(list('a','b'),list('c','d'))
## [[1]]
## [[1]][[1]]
## [1] "a"
## 
## [[1]][[2]]
## [1] "b"
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "c"
## 
## [[2]][[2]]
## [1] "d"
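
Elements of a nested list are reached by chaining double brackets, one pair per level. The sketch below stores the same list in a variable (the name nested is an assumption) and pulls pieces out of it.

```r
nested <- list(list("a", "b"), list("c", "d"))

# First inner list, then its second element.
item <- nested[[1]][[2]]     # "b"
print(item)

# A single level of indexing returns the whole inner list.
inner <- nested[[2]]
length(inner)                # 2

# sapply() can pull the first element out of every inner list.
firsts <- sapply(nested, function(inner_list) inner_list[[1]])   # "a" "c"

# unlist() flattens all levels into one vector.
flat <- unlist(nested)       # "a" "b" "c" "d"
print(flat)
```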

Arrays

Arrays generalize matrices to more than two dimensions.

array.1 <- array(1:8, dim=c(2,2,2)); array.1
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
array.1[2,2,1]
## [1] 4
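
Leaving an index blank selects everything along that dimension, just as with data frames. The sketch below rebuilds array.1 and extracts a full 2 x 2 slice.

```r
array.1 <- array(1:8, dim = c(2, 2, 2))

# dim() reports the size of each dimension.
dim(array.1)                 # 2 2 2

# Blank row and column indices return the whole second slice as a matrix.
slice2 <- array.1[, , 2]
is.matrix(slice2)            # TRUE
print(slice2)

# apply() over the third dimension computes one summary per slice.
slice_sums <- apply(array.1, 3, sum)   # 10 26
print(slice_sums)
```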

Exercise: Arrays

Create an array of dimension 2 x 2 x 3, where each of the three 2 x 2 subarrays (matrices) is the identity matrix.

Solution: Arrays

Create an array of dimension 2 x 2 x 3, where each of the three 2 x 2 subarrays (matrices) is the identity matrix. The solution works because array() recycles the four supplied values to fill all twelve cells.

array(c(1,0,0,1), dim = c(2,2,3))
## , , 1
## 
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
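
An alternative sketch uses diag(2) to build the identity matrix and then verifies every slice. This is one possible variation, not the only way to solve the exercise, and the name identity_stack is made up.

```r
# diag(2) is the 2 x 2 identity matrix; array() recycles it across three slices.
identity_stack <- array(diag(2), dim = c(2, 2, 3))

# Check that each 2 x 2 slice equals the identity matrix.
slice_ok <- apply(identity_stack, 3, function(m) all(m == diag(2)))
print(slice_ok)

# The result matches the solution built from c(1,0,0,1).
same <- all(identity_stack == array(c(1, 0, 0, 1), dim = c(2, 2, 3)))
print(same)
```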