With this class, we cannot cover every possible situation that you will encounter. The goals are to:
When writing code (and conducting statistical analyses), an iterative approach is a good strategy.
Finding your bug is a process of confirming the many things that you believe are true – until you find one which is not true. - Norm Matloff
We will first focus on debugging when an error or warning is tripped.
R will flag a problem, by printing out a message, in two cases: warnings and errors.

- Errors are raised by `stop()` and force all execution of code to stop.
- Warnings are raised by `warning()` and display potential problems. Warnings do not stop code from executing.
- Messages are produced by `message()` and pass along information.

In other cases, we will have bugs in our code that don’t necessarily give a warning or an error.
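The three kinds of conditions can be sketched in base R as follows. The function `safe_log()` is a hypothetical helper written only for this illustration; `tryCatch()` is used so that each condition is captured rather than printed.

```r
# Hypothetical helper, not part of the original notes: takes the log of x
# and signals each kind of condition along the way.
safe_log <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")             # error: execution halts here
  }
  if (any(x <= 0, na.rm = TRUE)) {
    warning("non-positive values found")  # warning: execution would continue
  }
  message("taking logs of ", length(x), " value(s)")  # informational only
  # Our own warning above already covers the NaN case, so silence log()'s.
  suppressWarnings(log(x))
}

# Capture the text of each condition instead of letting it print.
err <- tryCatch(safe_log("a"), error   = function(e) conditionMessage(e))
wrn <- tryCatch(safe_log(-1),  warning = function(w) conditionMessage(w))
msg <- tryCatch(safe_log(10),  message = function(m) conditionMessage(m))
```

Called without `tryCatch()`, `safe_log(-1)` would still return a value after printing its warning, whereas `safe_log("a")` would return nothing.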
Note: `NA` values often return a warning message, but not always.
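A small sketch of both behaviours, using a made-up vector `x`: an `NA` passes silently through `mean()`, while coercing a non-numeric string to a number produces an `NA` together with a warning.

```r
x <- c(2, NA, 5)

# No warning: the NA silently propagates through the calculation.
mean_x  <- mean(x)                 # result is NA
mean_ok <- mean(x, na.rm = TRUE)   # NA removed first, result is 3.5

# Warning: coercion that creates an NA is flagged.
coerced <- tryCatch(
  as.numeric("abc"),
  warning = function(w) conditionMessage(w)
)
```

Because the first case gives no signal at all, it is worth checking for missing values explicitly, for example with `anyNA(x)`.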
- `==` will test for equality. For example, to determine if pi equals three, evaluate `pi == 3` in R, which will return FALSE. Note this operator returns a logical value.
- `&` is the AND operator, so `TRUE & FALSE` will return FALSE.
- `|` is the OR operator, so `TRUE | FALSE` will return TRUE.
- `!` is the NOT operator, so `! TRUE` will return FALSE.
- `^` permits power terms, so `4 ^ 2` returns 16 and `4 ^ .5` returns 2.

Note that order of operations is important in writing R code.
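As an illustration of why order of operations matters, the sketch below evaluates a few made-up expressions. In R, `^` is applied before unary minus and subtraction, comparisons are applied before `!`, and `!` is applied before `&`, which in turn is applied before `|`.

```r
# Exponentiation binds more tightly than subtraction or unary minus.
a <- 6 - 3 ^ 2      # 6 - 9 = -3
b <- (6 - 3) ^ 2    # 3 ^ 2 = 9
c <- -2 ^ 2         # -(2 ^ 2) = -4

# ! is applied before &, and & before |.
d <- ! FALSE & FALSE      # (!FALSE) & FALSE -> FALSE
e <- ! (FALSE & FALSE)    # !(FALSE)         -> TRUE
f <- TRUE | FALSE & FALSE # TRUE | (FALSE & FALSE) -> TRUE
```

When in doubt, adding parentheses makes the intended order explicit.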
Evaluate all expressions. Note `!` is R’s not operator.
The results of the R code are:
## [1] 0
## [1] 4
## [1] 1
The results of the R code are:
## [1] FALSE
## [1] FALSE
## [1] FALSE
As a statistician, or more generally a data scientist, the ability to manipulate, process, clean, and merge datasets is an essential skill.
The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. Tidy Data Reference
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
`dplyr` and `ggplot2` require tidy data.

An important skill is merging or combining data sets.
Consider the two data frames below. How can we merge them, and what should the dimensions of the merged data frame be?
school | state |
---|---|
MSU | MT |
VT | VA |
Mines | CO |

school | enrollment |
---|---|
Mines | 5794 |
MSU | 15688 |
VT | 30598 |
One possibility is to arrange the data frames first and then combine them. Note: this is not generally advised.
## school state
## 3 Mines CO
## 1 MSU MT
## 2 VT VA
## school enrollment
## 1 Mines 5794
## 2 MSU 15688
## 3 VT 30598
Now, given that the data frames are both sorted the same way, we can bind the columns together.
## school state school enrollment
## 3 Mines CO Mines 5794
## 1 MSU MT MSU 15688
## 2 VT VA VT 30598
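The sort-then-bind approach above can be sketched in base R as follows. The object names are assumptions made for this illustration, and `order()` stands in for whichever sorting function was used originally. The `stopifnot()` check is an added safeguard: binding by position silently pairs the wrong rows if the two key columns do not line up, which is why this approach is not generally advised.

```r
state_df <- data.frame(
  school = c("MSU", "VT", "Mines"),
  state  = c("MT", "VA", "CO")
)
enroll_df <- data.frame(
  school     = c("Mines", "MSU", "VT"),
  enrollment = c(5794, 15688, 30598)
)

# Sort both data frames by school so the rows line up.
state_sorted  <- state_df[order(state_df$school), ]
enroll_sorted <- enroll_df[order(enroll_df$school), ]

# Guard against silently pairing the wrong rows before binding columns.
stopifnot(identical(state_sorted$school, enroll_sorted$school))

# Bind only the new column to avoid a duplicated school column.
combined <- cbind(state_sorted, enrollment = enroll_sorted$enrollment)
```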
Now assume we want to add another school to the data frame.
## school state enrollment
## 3 Mines CO 5794
## 1 MSU MT 15688
## 2 VT VA 30598
## 4 Luther IA 2337
Note: if your strings are saved as factors, this chunk of code will give you an error.
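One way the extra row could be added is sketched below; this is an assumption about the approach, not necessarily the code used to produce the output above. The new school is built as a one-row data frame so that `enrollment` stays numeric, and `stringsAsFactors = FALSE` keeps the strings as characters.

```r
schools <- data.frame(
  school     = c("Mines", "MSU", "VT"),
  state      = c("CO", "MT", "VA"),
  enrollment = c(5794, 15688, 30598),
  stringsAsFactors = FALSE   # keep strings as character, not factor
)

# Build the new row as a one-row data frame so each column keeps its type.
new_school <- data.frame(
  school = "Luther", state = "IA", enrollment = 2337,
  stringsAsFactors = FALSE
)

# rbind() stacks the rows; column names must match.
schools_plus <- rbind(schools, new_school)
```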
`dplyr` also contains functions for binding rows and columns.
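A minimal sketch of the two `dplyr` binding functions, `bind_cols()` and `bind_rows()`, on a cut-down version of the school data (the object names are assumptions for this illustration):

```r
library(dplyr)

schools <- tibble(school = c("Mines", "MSU"), state = c("CO", "MT"))
enroll  <- tibble(enrollment = c(5794, 15688))

# bind_cols() pastes columns side by side; rows are matched by position only.
wide <- bind_cols(schools, enroll)

# bind_rows() stacks rows and matches columns by name,
# filling columns that are absent from one input with NA.
taller <- bind_rows(wide, tibble(school = "Luther", state = "IA"))
```

Here the added row has no `enrollment` column, so `bind_rows()` fills that cell with `NA` rather than failing.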
Now combine the following information into a single table sorted alphabetically by the name of the ski hill.
ski.hill | skiable.acres |
---|---|
Big Sky | 5800 |
Bridger Bowl | 2000 |
Jackson | 2500+ |
Steamboat | 2965 |
## ski.resort ticket.cost
## 1 Bridger Bowl 60
## 2 Big Sky depends
## 3 Steamboat 145
## 4 Jackson 130
## ski.hill skiable.acres ticket.cost
## 1 Discovery 2200 20
The combined table, sorted alphabetically by the name of the ski hill, is:
ski.hill | skiable.acres | ticket.cost |
---|---|---|
Big Sky | 5800 | depends |
Bridger Bowl | 2000 | 60 |
Discovery | 2200 | 20 |
Jackson | 2500+ | 130 |
Steamboat | 2965 | 145 |
We could have also used some of the more advanced merge (join) features from dplyr.
## school state enrollment
## 1 Mines CO 5794
## 2 MSU MT 15688
## 3 VT VA 30598
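A join producing a table of this shape could be sketched as follows, assuming the two school data frames from earlier (the object names are hypothetical). Rows are matched on the `school` key, so the original row order of each input does not matter.

```r
library(dplyr)

state_df <- tibble(
  school = c("MSU", "VT", "Mines"),
  state  = c("MT", "VA", "CO")
)
enroll_df <- tibble(
  school     = c("Mines", "MSU", "VT"),
  enrollment = c(5794, 15688, 30598)
)

# Match rows on the key column, then sort for display.
schools <- full_join(state_df, enroll_df, by = "school") |>
  arrange(school)
```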
The dplyr join command includes the following options:
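Among the joins that `dplyr` provides are `inner_join()`, `left_join()`, `right_join()`, `full_join()`, and the filtering join `anti_join()`. The sketch below uses two made-up tables, `x` and `y`, whose keys only partly overlap, to show how the choice of join changes which rows are kept.

```r
library(dplyr)

# Made-up tables: keys "b" and "c" appear in both, "a" only in x, "d" only in y.
x <- tibble(id = c("a", "b", "c"), x_val = 1:3)
y <- tibble(id = c("b", "c", "d"), y_val = c(20, 30, 40))

inner <- inner_join(x, y, by = "id")  # only keys present in both tables
left  <- left_join(x, y, by = "id")   # every row of x, NA where y has no match
right <- right_join(x, y, by = "id")  # every row of y, NA where x has no match
full  <- full_join(x, y, by = "id")   # every key from either table
only_x <- anti_join(x, y, by = "id")  # rows of x with no match in y
```

Comparing `nrow()` across these results is a quick way to check whether keys failed to match.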
Combine the data sets
## ski.hill skiable.acres
## 1 Big Sky 5800
## 2 Bridger Bowl 2000
## 3 Jackson 2500+
## 4 Steamboat 2965
## ski.resort ticket.cost
## 1 Bridger Bowl 60
## 2 Big Sky depends
## 3 Steamboat 145
## 4 Jackson 130
## # A tibble: 1 × 3
## ski.hill skiable.acres ticket.cost
## <chr> <chr> <chr>
## 1 Discovery 2200 20
full_join(ski_acres, df_cost, by = c("ski.hill" = "ski.resort")) |>
  full_join(disco, by = join_by(ski.hill, skiable.acres, ticket.cost)) |>
  arrange(ski.hill)
## ski.hill skiable.acres ticket.cost
## 1 Big Sky 5800 depends
## 2 Bridger Bowl 2000 60
## 3 Discovery 2200 20
## 4 Jackson 2500+ 130
## 5 Steamboat 2965 145
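The tournament results below are stored as a wide dataset, with separate columns for the winning (W) and losing (L) team in each game.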
hoops <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/TourneyDetailedResults.csv')
head(hoops)
## # A tibble: 6 × 34
## Season Daynum Wteam Wscore Lteam Lscore Wloc Numot Wfgm Wfga Wfgm3 Wfga3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003 134 1421 92 1411 84 N 1 32 69 11 29
## 2 2003 136 1112 80 1436 51 N 0 31 66 7 23
## 3 2003 136 1113 84 1272 71 N 0 31 59 6 14
## 4 2003 136 1141 79 1166 73 N 0 29 53 3 7
## 5 2003 136 1143 76 1301 74 N 1 27 64 7 20
## 6 2003 136 1163 58 1140 53 N 0 17 52 4 14
## # ℹ 22 more variables: Wftm <dbl>, Wfta <dbl>, Wor <dbl>, Wdr <dbl>,
## # Wast <dbl>, Wto <dbl>, Wstl <dbl>, Wblk <dbl>, Wpf <dbl>, Lfgm <dbl>,
## # Lfga <dbl>, Lfgm3 <dbl>, Lfga3 <dbl>, Lftm <dbl>, Lfta <dbl>, Lor <dbl>,
## # Ldr <dbl>, Last <dbl>, Lto <dbl>, Lstl <dbl>, Lblk <dbl>, Lpf <dbl>
Consider a plot where the number of points is colored or faceted by winning or losing.
This requires a thin dataset that looks something like:
Game Id | Points | win/lose |
---|---|---|
1 | 92 | win |
1 | 84 | lose |
2 | 80 | win |
2 | 51 | lose |
lose_score <- hoops |>
  select(score = Lscore) |>
  mutate(outcome = 'lose', game_id = row_number())
win_score <- hoops |>
  select(score = Wscore) |>
  mutate(outcome = 'win', game_id = row_number())
thin_scores <- bind_rows(win_score, lose_score) |>
  arrange(game_id)
thin_scores
## # A tibble: 1,828 × 3
## score outcome game_id
## <dbl> <chr> <int>
## 1 92 win 1
## 2 84 lose 1
## 3 80 win 2
## 4 51 lose 2
## 5 84 win 3
## 6 71 lose 3
## 7 79 win 4
## 8 73 lose 4
## 9 76 win 5
## 10 74 lose 5
## # ℹ 1,818 more rows
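With the data in thin form, the plot described earlier takes one `ggplot2` call. The sketch below uses a four-row stand-in, `toy_scores`, built from the first two games shown above; the choice of a histogram and the number of bins are assumptions made only for illustration.

```r
library(ggplot2)

# Toy stand-in for thin_scores: the first two games from the output above.
toy_scores <- data.frame(
  game_id = c(1, 1, 2, 2),
  score   = c(92, 84, 80, 51),
  outcome = c("win", "lose", "win", "lose")
)

# One row per team per game lets outcome map directly to fill and facets.
p <- ggplot(toy_scores, aes(x = score, fill = outcome)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ outcome)
```

Replacing `toy_scores` with `thin_scores` would draw the same plot for the full set of games.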
The `tidyr` package contains additional tools for converting data between wide and long formats.
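As a sketch of what those tools look like, the score reshaping above could also be written with `tidyr::pivot_longer()`. The tibble `toy_hoops` is a two-row stand-in for `hoops`, keeping only the two score columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the first two rows of hoops (score columns only).
toy_hoops <- tibble(Wscore = c(92, 80), Lscore = c(84, 51))

thin_toy <- toy_hoops |>
  mutate(game_id = row_number()) |>
  pivot_longer(
    cols = c(Wscore, Lscore),   # columns to stack
    names_to = "outcome",       # former column names go here
    values_to = "score"         # former cell values go here
  ) |>
  # Recode the former column names into readable labels.
  mutate(outcome = if_else(outcome == "Wscore", "win", "lose"))
```

Each game now occupies two rows, one per team, which matches the thin layout shown earlier.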
Consider the `billboard` dataset, which contains the rank of the song (in 2000) for each week after it first entered the list.
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
If we want to identify songs that reach number 1 quickly, the data needs to be manipulated.
billboard_thin <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
billboard_thin
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
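To rank songs by how quickly they reach number 1, it helps to store `week` as a number, because the strings sort alphabetically ("wk10" comes before "wk2"). The sketch below shows one possible approach on made-up chart data, `toy_chart`, using the `names_prefix` and `names_transform` arguments of `pivot_longer()`.

```r
library(dplyr)
library(tidyr)

# Made-up chart data in the same wide layout as billboard.
toy_chart <- tibble(
  track = c("Song A", "Song B", "Song C"),
  wk1 = c(10, 5, 40),
  wk2 = c(3, 1, 35),
  wk3 = c(1, 1, NA)
)

fastest <- toy_chart |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",                        # strip the "wk" prefix
    names_transform = list(week = as.integer),  # store week as an integer
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  filter(rank == 1) |>                  # keep only weeks spent at number 1
  group_by(track) |>
  summarise(weeks_to_top = min(week)) |> # first week at number 1
  arrange(weeks_to_top)
```

Songs that never reach number 1, such as "Song C" here, drop out at the `filter()` step.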
billboard_thin |>
  filter(rank == 1) |>
  group_by(artist, track) |>
  tally() |>
  ungroup() |>
  arrange(desc(n)) |>
  slice(1) |>
  kable()
artist | track | n |
---|---|---|
Destiny’s Child | Independent Women Pa… | 11 |
## # A tibble: 1 × 1
## n
## <int>
## 1 228
billboard_thin |>
  group_by(artist, track) |>
  slice(1) |>
  ungroup() |>
  group_by(artist) |>
  tally() |>
  arrange(desc(n)) |>
  slice(1)
## # A tibble: 1 × 2
## artist n
## <chr> <int>
## 1 Jay-Z 5
## # A tibble: 317 × 3
## # Groups: track [316]
## track artist n
## <chr> <chr> <int>
## 1 Higher Creed 57
## 2 Amazed Lonestar 55
## 3 Breathe Hill, Faith 53
## 4 Kryptonite 3 Doors Down 53
## 5 With Arms Wide Open Creed 47
## 6 I Wanna Know Joe 44
## 7 Everything You Want Vertical Horizon 41
## 8 Bent matchbox twenty 39
## 9 He Wasn't Man Enough Braxton, Toni 37
## 10 (Hot S**t) Country G... Nelly 34
## # ℹ 307 more rows
The `pivot_wider()` function allows transformation in the other direction, from thin to wide.
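A minimal sketch of `pivot_wider()` on made-up thin data: `names_from` selects the column whose values become the new column names, and `values_from` selects the column that fills the cells.

```r
library(tidyr)

# Made-up thin data: one row per track per week.
thin <- data.frame(
  track = c("Song A", "Song A", "Song B", "Song B"),
  week  = c("wk1", "wk2", "wk1", "wk2"),
  rank  = c(10, 3, 5, 1)
)

# Spread the weeks back out into one column per week.
wide <- pivot_wider(thin, names_from = week, values_from = rank)
```

The result has one row per track and one column per week, which reverses what `pivot_longer()` did to this layout.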