Debugging + Data Wrangling

Course Goals (Debugging)

With this class, we cannot cover every possible situation that you will encounter. The goals are to:

Give you a broad range of tools that can be employed to manipulate, visualize, and analyze data, and
teach you to find help when you or your code “gets stuck”.

Process for writing code

When writing code (and conducting statistical analyses) an iterative approach is a good strategy.

Test each line of code as you write it and if necessary confirm that nested functions are giving the desired results.
Start simple and then add more complexity.

Debugging Overview

Finding your bug is a process of confirming the many things that you believe are true – until you find one which is not true. - Norm Matloff

Debugging Guide

We will first focus on debugging when an error, or warning is tripped.

Realize you have a bug (if error or warning, read the message)
Make it repeatable
Identify the problematic line (using print statements can be helpful)
Fix it and test it (evaluate nested functions if necessary)

Warnings vs. Errors

R will flag, print out a message, in two cases: warnings and errors.

What is the difference between the two?
Is the R process treated differently for errors and warnings?

Warnings vs. Errors

Fatal errors are signaled with stop() and force all execution of code to stop triggering an error.
Warnings are generated with warning() and display potential problems. Warnings do not stop code from executing.
Messages can also be passed using message(), which pass along information.

Bugs without warning/error

In other cases, we will have bugs in our code that don’t necessarily give a warning or an error.

How do we identify these bugs?
How can we exit a case where:
- R is running and may be stuck?
- the code won’t execute because of misaligned parenthesis, braces, brackets?

Note: NA values often return a warning message, but not always.

Operators in R

Most mathematical operators are self explanatory, but here are a few more important operators.
- == will test for equality. For example to determine if pi equals three, this can be evaluated with pi == 3 in R and will return FALSE. Note this operator returns a logical value.
- & is the AND operator, so TRUE & FALSE will return FALSE.
- | is the OR operator, so TRUE | FALSE will return TRUE.
- ! is the NOT operator, so ! TRUE will return FALSE.
- ^ permits power terms, so 4 ^ 2 returns 16 and 4 ^ .5 returns 2.

Exercise: Order of operations

Note that order of operations is important in writing R code.

4 - 2 ^ 2
(4 - 2) ^ 2
5 * 2 - 3 ^ 2
pi == 3
! TRUE & pi == 3
! (TRUE | FALSE)

Evaluate all expressions. Note ! is R’s not operator.

Solution: Order of operations

The results of the R code are:

4 - 2 ^ 2

## [1] 0

(4 - 2) ^ 2

## [1] 4

5 * 2 - 3 ^ 2

## [1] 1

Solution: Order of operations

The results of the R code are:

pi == 3

## [1] FALSE

! TRUE & pi == 3

## [1] FALSE

! (TRUE | FALSE)

## [1] FALSE

Data Wrangling

As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill.

These skills are generally referred to as data wrangling.
In a data analysis or visualization setting, they will undoubtedly require a majority of your time.
Wrangling data can be a painful process.
This lecture will provide some tools and example of organizing data.

Data Wrangling Concepts

Wide and thin datasets
Merging and joining relational data
Dealing with strings (later)
Dealing with date/time objects (later)

Tidy Data

Rules for Tidy Data

The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. Tidy Data Reference

Each variable must have its own column,
Each observation must have its own row, and
Each value must have its own cell.

Tidy Data Visual

Visual Representation of Tidy Data. Source: R4DS

Why use Tidy Data

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham

Storing data in a consistent way gives familiarity with methods for manipulating data.
Tidy data structure takes advantage of vectorised operations in R
Many useful packages: such as dplyr and ggplot2 require tidy data.

Merge / Join

Merge

An important skill is merging or combining data sets.

Consider the two data frames, how can we merge them and what should be the dimensions of the merged data frame.

school	state
MSU	MT
VT	VA
Mines	CO

school	enrollment
Mines	5794
MSU	15688
VT	30598

pre-sort

One possibility is to use the arrange the data frames first and combine. Note: not generally advised.

school_locations <- school_locations[order(school_locations$school),]
school_pop <- school_pop[order(school_pop$school),]

pre-sort

One possibility is to use the arrange the data frames first and combine. Note: not generally advised.

school_locations

##   school state
## 3  Mines    CO
## 1    MSU    MT
## 2     VT    VA

school_pop

##   school enrollment
## 1  Mines       5794
## 2    MSU      15688
## 3     VT      30598

rbind() and cbind()

Now, given that the data frames are both sorted the same way, we can bind the rows together.

comb_df <- cbind(school_locations, school_pop)
comb_df

##   school state school enrollment
## 3  Mines    CO  Mines       5794
## 1    MSU    MT    MSU      15688
## 2     VT    VA     VT      30598

comb_df <- comb_df[,-3]

rbind() and cbind()

Now assume we want to add another school to the data frame.

new_school <- c('Luther', 'IA',2337)
rbind(comb_df, new_school)

##   school state enrollment
## 3  Mines    CO       5794
## 1    MSU    MT      15688
## 2     VT    VA      30598
## 4 Luther    IA       2337

Note: if your strings are saved as factors, this chunk of code will give you an error.

bind_rows() / bind_cols()

dplyr also contains functions for binding rows and columns.

Exercise

Now combine the following information into a single table sorted alphabetically by the name of the ski hill.

ski.hill	skiable.acres
Big Sky	5800
Bridger Bowl	2000
Jackson	2500+
Steamboat	2965

##     ski.resort ticket.cost
## 1 Bridger Bowl          60
## 2      Big Sky     depends
## 3    Steamboat         145
## 4      Jackson         130

##    ski.hill skiable.acres ticket.cost
## 1 Discovery          2200          20

Solution 1: bind_cols()

Now combine the following information into a single table sorted alphabetically by the name of the ski hill.

ski.hill	skiable.acres	ticket.cost
Big Sky	5800	60
Bridger Bowl	2000	depends
Discovery	2200	20
Jackson	2500+	145
Steamboat	2965	130

Solution 1: bind_cols()

ski_acres |>
  bind_cols(df_cost) |> 
  select(-ski.resort) |>
  bind_rows(disco) |>
  arrange(ski.hill) |>
  kable()

Better Option: join()

We could have also used some of the more advanced merge (join) features from dplyr.

school_df <- full_join(school_locations, school_pop, by='school')
school_df

##   school state enrollment
## 1  Mines    CO       5794
## 2    MSU    MT      15688
## 3     VT    VA      30598

more about join

The dplyr join command includes the following options:

left join
right join
inner join
full join
anti join

Exercise

Combine the data sets

##       ski.hill skiable.acres
## 1      Big Sky          5800
## 2 Bridger Bowl          2000
## 3      Jackson         2500+
## 4    Steamboat          2965

##     ski.resort ticket.cost
## 1 Bridger Bowl          60
## 2      Big Sky     depends
## 3    Steamboat         145
## 4      Jackson         130

## # A tibble: 1 × 3
##   ski.hill  skiable.acres ticket.cost
##   <chr>     <chr>         <chr>      
## 1 Discovery 2200          20

Better Solution

full_join(ski_acres, df_cost, by = c("ski.hill"= "ski.resort")) |>
  full_join(disco, by = join_by(ski.hill, skiable.acres, ticket.cost)) |>
  arrange(ski.hill)

##       ski.hill skiable.acres ticket.cost
## 1      Big Sky          5800     depends
## 2 Bridger Bowl          2000          60
## 3    Discovery          2200          20
## 4      Jackson         2500+         130
## 5    Steamboat          2965         145

wide / long data

Hoops Data

This is a wide dataset

hoops <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/TourneyDetailedResults.csv')
head(hoops)

## # A tibble: 6 × 34
##   Season Daynum Wteam Wscore Lteam Lscore Wloc  Numot  Wfgm  Wfga Wfgm3 Wfga3
##    <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   2003    134  1421     92  1411     84 N         1    32    69    11    29
## 2   2003    136  1112     80  1436     51 N         0    31    66     7    23
## 3   2003    136  1113     84  1272     71 N         0    31    59     6    14
## 4   2003    136  1141     79  1166     73 N         0    29    53     3     7
## 5   2003    136  1143     76  1301     74 N         1    27    64     7    20
## 6   2003    136  1163     58  1140     53 N         0    17    52     4    14
## # ℹ 22 more variables: Wftm <dbl>, Wfta <dbl>, Wor <dbl>, Wdr <dbl>,
## #   Wast <dbl>, Wto <dbl>, Wstl <dbl>, Wblk <dbl>, Wpf <dbl>, Lfgm <dbl>,
## #   Lfga <dbl>, Lfgm3 <dbl>, Lfga3 <dbl>, Lftm <dbl>, Lfta <dbl>, Lor <dbl>,
## #   Ldr <dbl>, Last <dbl>, Lto <dbl>, Lstl <dbl>, Lblk <dbl>, Lpf <dbl>

Plot Creation

Consider a plot where the number of points is colored or faceted by winning or losing.

This requires a dataset that is thin and looks something like

Game Id	Points	win/lose
1	92	win
1	84	lose
2	80	win
2	51	lose

Data Wrangling

lose_score <- hoops |> select(score = Lscore) |>
  mutate(outcome = 'lose', game_id = row_number()) 
win_score <- hoops |> select(score = Wscore) |> 
  mutate(outcome = 'win', game_id = row_number())
thin_scores <- bind_rows(win_score, lose_score) |> 
  arrange(game_id)
thin_scores

## # A tibble: 1,828 × 3
##    score outcome game_id
##    <dbl> <chr>     <int>
##  1    92 win           1
##  2    84 lose          1
##  3    80 win           2
##  4    51 lose          2
##  5    84 win           3
##  6    71 lose          3
##  7    79 win           4
##  8    73 lose          4
##  9    76 win           5
## 10    74 lose          5
## # ℹ 1,818 more rows

Plot Creation

tidyr package

The tidyr package contains additional tools for converting files from long to wide formats.

Consider the billboard dataset which contains the rank of the song (in 2000) for each week after it first entered the list.

billboard

## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

billboard data

If we want to identify songs that reach number 1 quickly, the data needs to manipulated.

billboard_thin <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
  values_drop_na = T
  )
billboard_thin

## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # ℹ 5,297 more rows

Billboard Data Exercise

Determine which song in this dataset spent the most time at #1
How many different artists showed up on the billboard list?
Which artists had the most different songs on the billboard list?
Which track was on the most lists?

1. Solution

billboard_thin |>
  filter(rank == 1) |>
  group_by(artist, track) |> 
  tally() |>
  ungroup() |>
  arrange(desc(n)) |>
  slice(1) |>
  kable()

artist	track	n
Destiny’s Child	Independent Women Pa…	11

2. Solution

billboard_thin |>
  group_by(artist) |>
  tally() |>
  tally()

## # A tibble: 1 × 1
##       n
##   <int>
## 1   228

3. Solution

billboard_thin |>
  group_by(artist, track) |>
  slice(1) |>
  ungroup() |>
  group_by(artist) |>
  tally() |>
  arrange(desc(n)) |>
  slice(1)

## # A tibble: 1 × 2
##   artist     n
##   <chr>  <int>
## 1 Jay-Z      5

4. Solution

billboard_thin |>
  group_by(track, artist) |>
  tally() |> arrange(desc(n))

## # A tibble: 317 × 3
## # Groups:   track [316]
##    track                   artist               n
##    <chr>                   <chr>            <int>
##  1 Higher                  Creed               57
##  2 Amazed                  Lonestar            55
##  3 Breathe                 Hill, Faith         53
##  4 Kryptonite              3 Doors Down        53
##  5 With Arms Wide Open     Creed               47
##  6 I Wanna Know            Joe                 44
##  7 Everything You Want     Vertical Horizon    41
##  8 Bent                    matchbox twenty     39
##  9 He Wasn't Man Enough    Braxton, Toni       37
## 10 (Hot S**t) Country G... Nelly               34
## # ℹ 307 more rows

pivot_wider()

The pivot_wider() function allows transformation in the other direction from thin to wide.