Strings and Dates

Published

October 1, 2001

Baltimore Towing Data

An Overview

The first example focuses on a subset of a dataset that contains information on vehicles towed in Baltimore, MD:

  • A larger version of this dataset along with additional descriptions can be found at: https://data.baltimorecity.gov/.

  • The full version of the dataset contains 61,000 rows and 36 columns, where each row corresponds to a vehicle and the columns are information pertaining to the vehicle.

  • We will be working with a smaller dataset with approximately 30,000 rows and 5 columns.

The dataset

First read in the data set which is available at: https://math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv.

baltimore_tow <- read_csv('https://math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv')
Rows: 30263 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): vehicleType, vehicleMake, vehicleModel, receivingDateTime, totalPaid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(baltimore_tow)
Rows: 30,263
Columns: 5
$ vehicleType       <chr> "Van", "Car", "Car", "Car", "Car", "SUV", "Car", "Ca…
$ vehicleMake       <chr> "LEXUS", "Mercedes", "Chysler", "Chevrolet", "Hyunda…
$ vehicleModel      <chr> NA, NA, "Cirrus", "Cavalier", "Tiburon", "RAV4", "32…
$ receivingDateTime <chr> "10/24/2010 12:41:00 PM", "04/28/2015 09:27:00 AM", …
$ totalPaid         <chr> "$322.00", "$130.00", "$280.00", "$1057.00", "$469.0…

Information for a few vehicles

kable(head(baltimore_tow,15))
vehicleType vehicleMake vehicleModel receivingDateTime totalPaid
Van LEXUS NA 10/24/2010 12:41:00 PM $322.00
Car Mercedes NA 04/28/2015 09:27:00 AM $130.00
Car Chysler Cirrus 07/23/2015 07:55:00 AM $280.00
Car Chevrolet Cavalier 10/23/2010 11:35:00 AM $1057.00
Car Hyundai Tiburon 10/25/2010 02:49:00 PM $469.00
SUV Toyota RAV4 10/25/2010 11:12:00 AM $305.00
Car Bmw 325 10/23/2012 07:50:00 PM $220.00
Car Honda Accord 10/25/2010 02:53:00 PM $327.00
Car Ford Taurus 12/23/2010 04:09:00 AM $290.00
SUV Ford Expo 12/23/2010 02:51:00 PM $230.00
SUV Lincoln Mkx 12/23/2010 01:40:00 PM $230.00
Car Geo Prizm 12/23/2010 06:45:00 AM $570.00
Car Kia Spectra 12/23/2010 03:57:00 AM $280.00
Car Nissan NA 12/23/2010 05:08:00 AM $280.00
Pick-up Truck Dodge Dakota 12/23/2010 02:05:00 PM $275.00

Goal 1: Plot the number of Vehicles Towed by Year

The first goal is to determine how many vehicles were towed for each year in the data set.

  • Given that the we don’t have a column for year and the first observation for receiving date is “10/24/2010 12:41:00 PM”.

Option 1: str_sub()

baltimore_tow |>
  mutate(year = str_sub(receivingDateTime, 7, 10)) |>
  ggplot(aes(x = year)) +
  geom_bar() +
  theme_bw() +
  labs(title = 'Number of Vehicles Towed in Baltimore by Year') +
  xlab('') +
  ylab("number of vehicles towed") +
  annotate('text', x = 3, y = 10000, label ='Most data is from 2015 & 2016') +
  annotate('segment', x = 3, y = 9000, xend = 6, yend = 9000, arrow = arrow())

Option 2: Create date-time object

baltimore_tow |>
  mutate(date = parse_date_time(receivingDateTime, "%m/%d/%y %I:%M:%S %p"),
         year = year(date)) |>
  ggplot(aes(x = year)) +
  geom_bar() +
  theme_bw() +
  labs(title = 'Number of Vehicles Towed in Baltimore by Year') +
  xlab('') +
  ylab("number of vehicles towed") +
  annotate('text', x = 2012, y = 10000, label ='Most data is from 2015 & 2016') +
  annotate('segment', x = 2012, y = 9000, xend = 2015, yend = 9000, arrow = arrow())

Goal 2. Type of Vehicles Towed by Morning / Afternoon

Next we wish to compute how many vehicles were towed in the AM and PM for each type of variables.

unique(baltimore_tow$vehicleType)
 [1] "Van"                        "Car"                       
 [3] "SUV"                        "Pick-up Truck"             
 [5] "Motor Cycle (Street Bike)"  "Dirt Bike"                 
 [7] "Commercial Truck"           "Trailer"                   
 [9] "Station Wagon"              "Truck"                     
[11] "Taxi"                       "Pickup Truck"              
[13] "Convertible"                "Tractor Trailer"           
[15] "Tow Truck"                  "All terrain - 4 wheel bike"
[17] "Mini-Bike"                  "Golf Cart"                 
[19] "Boat"                       "Tractor"                   
[21] "Construction Equipment"     "Sport Utility Vehicle"     

However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups.

Suggested Grouping

First consider reasonable groups for vehicle types.

  1. Cars - (Car, Convertible)
  2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
  3. Trucks - (Pick-up Truck, Pickup Truck)
  4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
  5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)
  6. Misc (delete) - (Boat, Golf Cart, Trailer)
baltimore_tow |>
  filter(!vehicleType %in% c('Boat','Golf Cart', 'Trailer')) |>
  mutate(vehicleGroup = case_when(
    vehicleType %in% c('Car','Convertible') ~ 'Cars',
    vehicleType %in% c('SUV', 'Station Wagon','Sport Utility Vehicle', 'Van', 'Taxi') ~ 'Large Cars',
    vehicleType %in% c("Pick-up Truck", "Pickup Truck") ~ 'Trucks',
    vehicleType %in% c("Truck", 'Tractor Trailer', "Tow Truck", "Tractor", "Construction Equipment",'Commercial Truck') ~ 'Large Trucks',
    vehicleType %in% c('Motor Cycle (Street Bike)', "Dirt Bike", "All terrain - 4 wheel bike", "Mini-Bike") ~ 'Bikes',
    TRUE ~ 'Other'
  )) |>
  mutate(time_of_day = str_sub(receivingDateTime, start = -2)) |>
  group_by(vehicleGroup, time_of_day) |>
  tally() |>
  ggplot(aes(x = vehicleGroup, fill = vehicleGroup, y = n)) +
  geom_bar(stat = 'identity') +
  facet_wrap(.~time_of_day, nrow = 2) +
  theme_bw() +
  labs(title = "Number of vehicles towed by car type") +
  xlab('') +
  ylab('') +
  theme(legend.position = 'none') +
  scale_fill_viridis(discrete = TRUE, option = 'D')

Figure 1: Vehicles towed in Baltimore, MD. Groups are defined as Cars - (Car, Convertible), Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi), Trucks - (Pick-up Truck, Pickup Truck), Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck), Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)

Goal 3: Calculate Average Towing fees by vehicleMake

baltimore_tow |>
  mutate(fees = parse_number(totalPaid)) |>
  group_by(vehicleMake) |>
  summarize(`Average Towing Fee` = mean(fees)) |>
  arrange(vehicleMake) |>
  slice(1:30) |>
  kable()
Table 1: First 30 rows – Uh Oh, major spelling errors
vehicleMake Average Towing Fee
,itsubishi 235.0000
2bp6329 250.0000
3gmc 415.0000
6an8561 220.0000
8bg7541 130.0000
? 1260.0000
ACURA 288.6667
AUDI 277.0000
Aacura 220.0000
Accura 243.0000
Acira 220.0000
Aciura 280.0000
Acrua 289.5000
Acua 242.5000
Acur 130.0000
Acura 342.0937
Acute 220.0000
Apollo 190.0000
Aqcura 162.0000
Aston Martin 355.0000
Aucra 250.0000
Audi 361.2941
Audio 385.0000
Audu 1065.0000
Aura 220.0000
Avp1 130.0000
BMW 297.0000
BUICK 272.0000
Bendron 313.0000
Bentley 595.0000

Goal 4. Data Cleaning

Spelling errors can be addressed, by reassigning vehicle makes to the correct spelling, but is tedious..