The ability to import datasets into R is an essential skill. For this class,
most of the files will be on the course webpage and can be directly
downloaded using `read_csv()`. Consider the Seattle Housing
dataset from the previous lecture.
## Rows: 869 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Recall we can also use the `read.csv()` function from base R;
`read_csv()` comes from the readr package in the tidyverse.
A common function that we will use is `head()`, which shows
the first few rows of a data frame.
## # A tibble: 6 × 14
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1350000 3 2.5 2753 65005 1 1 2165
## 2 228000 3 1 1190 9199 1 0 1190
## 3 289000 3 1.75 1260 8400 1 0 1260
## 4 720000 4 2.5 3450 39683 2 0 3450
## 5 247500 3 1.75 1960 15681 1 0 1960
## 6 850830 3 2.5 2070 13241 1.5 0 1270
## # ℹ 6 more variables: sqft_basement <dbl>, zipcode <dbl>, lat <dbl>,
## # long <dbl>, yr_sold <dbl>, mn_sold <dbl>
Another useful function is `glimpse()`, which prints one line per column
showing its type and first few values.
## Rows: 869
## Columns: 14
## $ price <dbl> 1350000, 228000, 289000, 720000, 247500, 850830, 890000,…
## $ bedrooms <dbl> 3, 3, 3, 4, 3, 3, 4, 5, 3, 2, 3, 3, 1, 4, 4, 1, 2, 4, 5,…
## $ bathrooms <dbl> 2.50, 1.00, 1.75, 2.50, 1.75, 2.50, 1.00, 2.00, 2.50, 1.…
## $ sqft_living <dbl> 2753, 1190, 1260, 3450, 1960, 2070, 2550, 2260, 1910, 10…
## $ sqft_lot <dbl> 65005, 9199, 8400, 39683, 15681, 13241, 4000, 12500, 662…
## $ floors <dbl> 1.0, 1.0, 1.0, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0, 1.0, 1.0, 1…
## $ waterfront <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sqft_above <dbl> 2165, 1190, 1260, 3450, 1960, 1270, 2370, 1130, 1910, 10…
## $ sqft_basement <dbl> 588, 0, 0, 0, 0, 800, 180, 1130, 0, 0, 580, 570, 0, 0, 0…
## $ zipcode <dbl> 98070, 98148, 98148, 98010, 98032, 98102, 98109, 98032, …
## $ lat <dbl> 47.4041, 47.4258, 47.4366, 47.3420, 47.3576, 47.6415, 47…
## $ long <dbl> -122.451, -122.322, -122.335, -122.025, -122.277, -122.3…
## $ yr_sold <dbl> 2015, 2014, 2014, 2015, 2015, 2014, 2014, 2014, 2015, 20…
## $ mn_sold <dbl> 3, 9, 8, 3, 3, 6, 6, 10, 1, 11, 4, 9, 10, 9, 10, 6, 7, 6…
Applying `head()` to a data frame read with `read.csv()` prints a base
data frame rather than a tibble.
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
## 1 1350000 3 2.50 2753 65005 1.0 1 2165
## 2 228000 3 1.00 1190 9199 1.0 0 1190
## 3 289000 3 1.75 1260 8400 1.0 0 1260
## 4 720000 4 2.50 3450 39683 2.0 0 3450
## 5 247500 3 1.75 1960 15681 1.0 0 1960
## 6 850830 3 2.50 2070 13241 1.5 0 1270
## sqft_basement zipcode lat long yr_sold mn_sold
## 1 588 98070 47.4041 -122.451 2015 3
## 2 0 98148 47.4258 -122.322 2014 9
## 3 0 98148 47.4366 -122.335 2014 8
## 4 0 98010 47.3420 -122.025 2015 3
## 5 0 98032 47.3576 -122.277 2015 3
## 6 800 98102 47.6415 -122.315 2014 6
For a data frame read with `read.csv()`, `glimpse()` shows that whole-number
columns are stored as integers (`<int>`) rather than doubles (`<dbl>`).
## Rows: 869
## Columns: 14
## $ price <dbl> 1350000, 228000, 289000, 720000, 247500, 850830, 890000,…
## $ bedrooms <int> 3, 3, 3, 4, 3, 3, 4, 5, 3, 2, 3, 3, 1, 4, 4, 1, 2, 4, 5,…
## $ bathrooms <dbl> 2.50, 1.00, 1.75, 2.50, 1.75, 2.50, 1.00, 2.00, 2.50, 1.…
## $ sqft_living <int> 2753, 1190, 1260, 3450, 1960, 2070, 2550, 2260, 1910, 10…
## $ sqft_lot <int> 65005, 9199, 8400, 39683, 15681, 13241, 4000, 12500, 662…
## $ floors <dbl> 1.0, 1.0, 1.0, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0, 1.0, 1.0, 1…
## $ waterfront <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sqft_above <int> 2165, 1190, 1260, 3450, 1960, 1270, 2370, 1130, 1910, 10…
## $ sqft_basement <int> 588, 0, 0, 0, 0, 800, 180, 1130, 0, 0, 580, 570, 0, 0, 0…
## $ zipcode <int> 98070, 98148, 98148, 98010, 98032, 98102, 98109, 98032, …
## $ lat <dbl> 47.4041, 47.4258, 47.4366, 47.3420, 47.3576, 47.6415, 47…
## $ long <dbl> -122.451, -122.322, -122.335, -122.025, -122.277, -122.3…
## $ yr_sold <int> 2015, 2014, 2014, 2015, 2015, 2014, 2014, 2014, 2015, 20…
## $ mn_sold <int> 3, 9, 8, 3, 3, 6, 6, 10, 1, 11, 4, 9, 10, 9, 10, 6, 7, 6…
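The difference between the two readers can be sketched on a small stand-in file. The course URL is not reproduced here, so this example writes a three-row CSV to a temporary file; the object names `toy_path`, `seattle_tbl`, and `seattle_df` are hypothetical.

```r
library(readr)

# Write a tiny stand-in CSV (NOT the real Seattle file) to a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("price,bedrooms,bathrooms",
             "1350000,3,2.5",
             "228000,3,1",
             "289000,3,1.75"), toy_path)

# readr::read_csv() returns a tibble and reads numbers as doubles
seattle_tbl <- read_csv(toy_path, show_col_types = FALSE)

# base R read.csv() returns a data.frame and reads whole numbers as integers
seattle_df <- read.csv(toy_path)

class(seattle_tbl)
class(seattle_df)             # "data.frame"
typeof(seattle_tbl$bedrooms)  # "double"
typeof(seattle_df$bedrooms)   # "integer"

head(seattle_tbl, 2)          # first two rows only
```

With a real file, `toy_path` would be replaced by the path or URL of the dataset.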
The `readxl` package makes importing Excel data files
easy.
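As a minimal sketch, the example workbook that ships with `readxl` can stand in for a course file. The object names `xl_path` and `iris_xl` are hypothetical; for your own data, replace `xl_path` with the path to your `.xlsx` file.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
xl_path <- readxl_example("datasets.xlsx")

# List the sheets available in the workbook
sheets <- excel_sheets(xl_path)
sheets

# Read one sheet by name; the result is a tibble
iris_xl <- read_excel(xl_path, sheet = "iris")
head(iris_xl)
```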
R has four common types of data structures:
The base data structures in R can be organized by dimensionality and whether they are homogeneous (all elements of the same type) or heterogeneous.
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | Vector | List |
2d | Matrix | Data Frame |
nd | Array | |
There are four common types of vectors: logical, integer, double (or
numeric), and character. The `c()` function is used for
combining elements into a vector.
The type of a vector can be identified using the `typeof()`
function. Note that only a single data type is allowed within a vector; mixed elements are coerced to a common type.
## [1] "double"
## [1] "character"
## [1] "this is" "a character string" "1"
## [4] "2.5" "3.14159265358979"
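A short sketch of `c()`, `typeof()`, and coercion follows. The vector names are hypothetical, and the original chunk may have used different values.

```r
# A numeric vector is stored as double by default
num_vec <- c(1, 2.5, pi)
typeof(num_vec)        # "double"

# A character vector
chr_vec <- c("this is", "a character string")
typeof(chr_vec)        # "character"

# Mixing characters and numbers coerces everything to character
mixed <- c(chr_vec, 1, 2.5)
typeof(mixed)          # "character"

# The colon operator creates integers; logicals are promoted to integer here
typeof(1:3)            # "integer"
typeof(c(TRUE, 2L))    # "integer"
```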
Create a vector with your first, middle, and last names.
## [1] "Andrew" "Blake" "Hoegh"
We have touched on many of these before, but here are some examples of expressions (conditions) in R. Evaluate these expressions:
## [1] TRUE
## [1] TRUE
## [1] FALSE
Note that `&` is the "and" operator, and it works element by element on vectors.
## [1] TRUE
## [1] TRUE TRUE FALSE FALSE
## [1] TRUE FALSE TRUE
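The expressions below are illustrative stand-ins, not necessarily the ones from the original chunk. They show that comparisons return logical values and that `&` and `|` operate element by element.

```r
# Single comparisons return one logical value
cmp1 <- pi > 3                 # TRUE
cmp2 <- "cat" == "cat"         # TRUE
cmp3 <- 3 != 3                 # FALSE

# & is "and": TRUE only when both sides are TRUE
both <- (pi > 3) & (2 < 5)     # TRUE

# Comparisons are vectorized: one result per element
less3 <- c(1, 2, 3, 4) < 3     # TRUE TRUE FALSE FALSE

# | is "or", also applied element by element
either <- c(TRUE, FALSE, TRUE) | c(FALSE, FALSE, TRUE)   # TRUE FALSE TRUE

cmp1; cmp2; cmp3; both; less3; either
```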
A data frame:
x | y |
---|---|
1 | a |
2 | b |
3 | c |
A modern data frame can be constructed using the `tibble()` command.
The `read_csv()` command also creates a tibble rather than a data.frame.
## # A tibble: 3 × 2
## x y
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
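A minimal sketch of building the same small table both ways is shown here; the names `df_tbl` and `df_base` are hypothetical.

```r
library(tibble)

# Tibble version: prints with column types and dimensions
df_tbl <- tibble(x = 1:3, y = c("a", "b", "c"))
df_tbl

# Base R version of the same data
df_base <- data.frame(x = 1:3, y = c("a", "b", "c"))
df_base

# A tibble is still a data frame, with extra classes on top
class(df_tbl)    # "tbl_df" "tbl" "data.frame"
class(df_base)   # "data.frame"
```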
Subsetting allows you to extract elements from an object.
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1 2 3
## [1] 1 5 8
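Commands along these lines would produce the output above, assuming the vector is named `x` and holds the integers 1 through 9.

```r
x <- 1:9          # assumed name for the vector

x                 # 1 2 3 4 5 6 7 8 9
first3 <- x[1:3]  # positions 1 through 3
first3            # 1 2 3

picked <- x[c(1, 5, 8)]  # an arbitrary set of positions
picked            # 1 5 8
```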
Subsetting also works with negative values or expressions.
## [1] 1 2 3 4 6 7 8 9
## [1] 1 2 3 4 5 7 8 9
## [1] 6 7 8 9
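A sketch of negative indexing, again assuming a vector `x` holding 1 through 9; the exact commands in the original chunk may differ. A negative index drops the given positions instead of keeping them.

```r
x <- 1:9

drop5 <- x[-5]         # everything except the 5th element
drop5                  # 1 2 3 4 6 7 8 9

drop6 <- x[-6]         # everything except the 6th element
drop6                  # 1 2 3 4 5 7 8 9

drop1to5 <- x[-(1:5)]  # drop the first five positions
drop1to5               # 6 7 8 9
```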
Another possibility is to use logical values directly.
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [1] 6 7 8 9
## [1] 1 2 3 7 8 9
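Logical subsetting can be sketched as follows, assuming the same vector `x` of 1 through 9. Elements are kept wherever the logical vector is `TRUE`.

```r
x <- 1:9

keep <- x > 5                 # one TRUE/FALSE per element
keep                          # FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

big <- x[keep]                # same as x[x > 5]
big                           # 6 7 8 9

tails <- x[x < 4 | x > 6]     # combine conditions with | ("or")
tails                         # 1 2 3 7 8 9
```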
The same ideas apply to data frames, but the indices now constitute rows and columns of the data frame.
## [1] 1 2 3
## y z
## 2 2 b
## 3 1 c
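The data frame itself is not shown in the text, so this sketch assumes a three-column `df` that is consistent with the printed output (the first value of `y` is a guess).

```r
# Assumed data frame; only rows 2 and 3 of y and z are visible in the output
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))

# A single column, returned as a vector
col_x <- df$x
col_x                          # 1 2 3

# Rows 2 to 3 and columns y and z: df[rows, columns]
sub_df <- df[2:3, c("y", "z")]
sub_df
```

Leaving the row or column index empty, as in `df[, 1]`, keeps everything along that dimension.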
There are also a couple of built-in functions in R for subsetting data frames.
## [1] 1 2 3
## x y z
## 2 2 2 b
## 3 3 1 c
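One such function is base R's `subset()`. The sketch below assumes the same hypothetical `df` as before; the original chunk may have used other functions as well.

```r
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))  # assumed data

# Keep rows satisfying a condition; original row names are preserved
rows_kept <- subset(df, x > 1)
rows_kept

# Rows and columns can be chosen in the same call
rows_cols <- subset(df, x > 1, select = c(y, z))
rows_cols
```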
The `filter()` and `select()` functions in the `dplyr`
package (in the tidyverse) can also be used for
subsetting: `filter()` chooses rows and `select()` chooses columns.
## x
## 1 1
## 2 2
## 3 3
## x y z
## 1 2 2 b
## 2 3 1 c
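A sketch of the two `dplyr` verbs on the same assumed `df`; the conditions used in the original chunk are not shown, so `x > 1` is a guess that matches the printed rows.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))  # assumed data

# select() picks columns and always returns a data frame
only_x <- select(df, x)
only_x

# filter() picks rows; note the row names are renumbered from 1
x_gt_1 <- filter(df, x > 1)
x_gt_1

# The verbs can be chained with the pipe
piped <- df %>% filter(x > 1) %>% select(y, z)
piped
```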
Create a new data frame that only includes houses worth more than $1,000,000.
From this new data frame, what is the average living square footage of the houses?
Seattle$sqft_living
## [1] 3890.065
## # A tibble: 1 × 1
## ave_size
## <dbl>
## 1 3890.
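One possible approach is sketched here on a six-row stand-in built from the first rows printed earlier, since the full dataset is not reproduced. The names `seattle_toy` and `expensive` are hypothetical, and the toy result differs from the 3890.065 obtained on the full data.

```r
library(dplyr)

# Stand-in for the Seattle data: first six rows of two columns only
seattle_toy <- data.frame(
  price       = c(1350000, 228000, 289000, 720000, 247500, 850830),
  sqft_living = c(2753, 1190, 1260, 3450, 1960, 2070)
)

# Step 1: keep only houses worth more than $1,000,000
expensive <- filter(seattle_toy, price > 1000000)

# Step 2a: base R, average of one column
base_mean <- mean(expensive$sqft_living)
base_mean

# Step 2b: dplyr, the same average as a one-row summary table
tidy_mean <- expensive %>% summarize(ave_size = mean(sqft_living))
tidy_mean
```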
Consider the two lists
msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'),
degree.from = c('University of Texas at Arlington','Virginia Tech'),
job.title = c('President', 'Assistant Professor of Statistics'))
msu.info2 <- list(c('Waded Cruzado','University of Texas at Arlington',
'President'), c('Andy Hoegh',
'Virginia Tech','Assistant Professor of Statistics'))
## $name
## [1] "Waded Cruzado" "Andy Hoegh"
##
## $degree.from
## [1] "University of Texas at Arlington" "Virginia Tech"
##
## $job.title
## [1] "President" "Assistant Professor of Statistics"
## [[1]]
## [1] "Waded Cruzado" "University of Texas at Arlington"
## [3] "President"
##
## [[2]]
## [1] "Andy Hoegh" "Virginia Tech"
## [3] "Assistant Professor of Statistics"
With the current lists we can index elements using the double bracket
`[[ ]]` notation or, if names have been initialized, those can
be used too.
So the first element of `msu.info` can be indexed either by position or by name:
## [1] "Waded Cruzado" "Andy Hoegh"
## [1] "Waded Cruzado" "Andy Hoegh"
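A sketch of the equivalent ways to reach the first element of the named list defined above:

```r
msu.info <- list(name = c('Waded Cruzado', 'Andy Hoegh'),
                 degree.from = c('University of Texas at Arlington', 'Virginia Tech'),
                 job.title = c('President', 'Assistant Professor of Statistics'))

by_position <- msu.info[[1]]       # double brackets with a position
by_dollar   <- msu.info$name       # $ with the element name
by_name     <- msu.info[["name"]]  # double brackets with the name as a string

by_position
identical(by_position, by_dollar)  # TRUE
```

The unnamed list `msu.info2` can only be indexed by position, for example `msu.info2[[1]]`.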
Explore the indexing with these commands.
## $name
## [1] "Waded Cruzado" "Andy Hoegh"
## [1] "Waded Cruzado" "Andy Hoegh"
## [1] "Andy Hoegh"
## $name
## [1] "Waded Cruzado" "Andy Hoegh"
##
## $degree.from
## [1] "University of Texas at Arlington" "Virginia Tech"
## name1 name2
## "Waded Cruzado" "Andy Hoegh"
## degree.from1 degree.from2
## "University of Texas at Arlington" "Virginia Tech"
## job.title1 job.title2
## "President" "Assistant Professor of Statistics"
## [[1]]
## [[1]][[1]]
## [1] "a"
##
## [[1]][[2]]
## [1] "b"
##
##
## [[2]]
## [[2]][[1]]
## [1] "c"
##
## [[2]][[2]]
## [1] "d"
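Commands along the following lines would give output like that shown above; the exact commands in the original chunk are assumed, and the name `nested` is hypothetical. The key distinction is that single brackets return a list, while double brackets return the element itself.

```r
msu.info <- list(name = c('Waded Cruzado', 'Andy Hoegh'),
                 degree.from = c('University of Texas at Arlington', 'Virginia Tech'),
                 job.title = c('President', 'Assistant Professor of Statistics'))

sub_list <- msu.info[1]        # a list of length 1 (still has $name)
element  <- msu.info[[1]]      # the character vector itself
second   <- msu.info[[1]][2]   # "Andy Hoegh"
first2   <- msu.info[1:2]      # a list with the first two elements
flat     <- unlist(msu.info)   # flattened to a named character vector

# Lists can contain other lists
nested <- list(list("a", "b"), list("c", "d"))
inner  <- nested[[2]][[1]]     # "c"

sub_list; second; names(flat); inner
```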
Arrays generalize matrices to more than two dimensions.
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
## [1] 4
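A sketch consistent with the output above, assuming the array is named `my.array` and is filled with 1 through 8. Arrays are filled column by column within each 2 x 2 slice.

```r
# 2 rows x 2 columns x 2 slices, filled with 1..8
my.array <- array(1:8, dim = c(2, 2, 2))
my.array

# Index with [row, column, slice]
elem <- my.array[2, 2, 1]
elem                      # 4

# Leaving an index empty keeps that whole dimension
slice2 <- my.array[, , 2] # the second 2 x 2 matrix
slice2
```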
Create an array of dimension 2 x 2 x 3, where each of the three 2 x 2 subarrays (matrices) is the identity matrix.
## , , 1
##
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## , , 2
##
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## , , 3
##
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
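One possible solution, shown as a sketch (the name `id.array` is hypothetical): `diag(2)` is the 2 x 2 identity matrix, and `array()` recycles its four values to fill all three slices.

```r
# diag(2) gives the 2 x 2 identity; its values are recycled across 3 slices
id.array <- array(diag(2), dim = c(2, 2, 3))
id.array

# Check that every slice equals the identity matrix
all_identity <- all(sapply(1:3, function(k) all(id.array[, , k] == diag(2))))
all_identity   # TRUE
```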