(Builds on: Visualization basics, Manipulation basics)
(Leads to: Exploratory data analysis (2D), Function basics, Essentials of relational data, String basics)
Exploratory data analysis is partly a set of techniques, but is mostly a mindset: you want to remain open to what the data is telling you.
library(tidyverse)
library(nycflights13)
Whenever you start working with a new variable, it’s a good idea to
look at the variable by itself before you start combining it with
other variables. Besides the visual techniques you’ll learn in the
readings, a quick and dirty way to do this is with the function
count().
df %>% count(grp) is shorthand for
df %>% group_by(grp) %>% summarize(n = n()).
flights %>% 
  count(carrier)
#> # A tibble: 16 × 2
#>    carrier     n
#>    <chr>   <int>
#>  1 9E      18460
#>  2 AA      32729
#>  3 AS        714
#>  4 B6      54635
#>  5 DL      48110
#>  6 EV      54173
#>  7 F9        685
#>  8 FL       3260
#>  9 HA        342
#> 10 MQ      26397
#> 11 OO         32
#> 12 UA      58665
#> 13 US      20536
#> 14 VX       5162
#> 15 WN      12275
#> 16 YV        601
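The equivalence between the two forms can be checked directly. The following is a minimal sketch; the names by_count and by_summarize are made up for illustration, and the tibbles are converted to plain data frames only so that base R's all.equal() can compare them.

```r
library(tidyverse)
library(nycflights13)

# Count rows per carrier with the shorthand
by_count <- flights %>% 
  count(carrier)

# The same computation written out with group_by() and summarize()
by_summarize <- flights %>% 
  group_by(carrier) %>% 
  summarize(n = n())

# Compare the two results as plain data frames
all.equal(as.data.frame(by_count), as.data.frame(by_summarize))
```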
count() has two convenient arguments:
sort = TRUE automatically arranges the result so the most common
values are at the top
flights %>% 
  count(dest, sort = TRUE)
#> # A tibble: 105 × 2
#>    dest      n
#>    <chr> <int>
#>  1 ORD   17283
#>  2 ATL   17215
#>  3 LAX   16174
#>  4 BOS   15508
#>  5 MCO   14082
#>  6 CLT   14064
#>  7 SFO   13331
#>  8 FLL   12055
#>  9 MIA   11728
#> 10 DCA    9705
#> # … with 95 more rows
wt = my_variable switches from counting rows to summing my_variable
within each group. For example, the following code gives the total
distance (in miles) flown by each carrier. This argument is
particularly useful if you have data that has already been aggregated.
flights %>% 
  count(carrier, wt = distance)
#> # A tibble: 16 × 2
#>    carrier        n
#>    <chr>      <dbl>
#>  1 9E       9788152
#>  2 AA      43864584
#>  3 AS       1715028
#>  4 B6      58384137
#>  5 DL      59507317
#>  6 EV      30498951
#>  7 F9       1109700
#>  8 FL       2167344
#>  9 HA       1704186
#> 10 MQ      15033955
#> 11 OO         16026
#> 12 UA      89705524
#> 13 US      11365778
#> 14 VX      12902327
#> 15 WN      12229203
#> 16 YV        225395
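To illustrate the point about already-aggregated data, here is a small sketch. The table monthly is a hypothetical intermediate result with one row per carrier and month, and its column is named flights through count()'s name argument purely for this example. Summing that column with wt should recover the per-carrier counts from the raw data.

```r
library(tidyverse)
library(nycflights13)

# Hypothetical pre-aggregated table: one row per carrier and month,
# with the number of flights stored in a column called `flights`
monthly <- flights %>% 
  count(carrier, month, name = "flights")

# wt = flights sums the stored counts within each carrier
from_monthly <- monthly %>% 
  count(carrier, wt = flights)

# Counting the raw rows directly, for comparison
from_raw <- flights %>% 
  count(carrier)

# Compare the two results as plain data frames
all.equal(as.data.frame(from_monthly), as.data.frame(from_raw))
```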
You can also count() the value of an expression. This is a useful
technique to get a quick count of how many missing values there are.
Naming the expression, as in the second example, gives a tidier
column name:
flights %>% 
  count(is.na(dep_delay))
#> # A tibble: 2 × 2
#>   `is.na(dep_delay)`      n
#>   <lgl>               <int>
#> 1 FALSE              328521
#> 2 TRUE                 8255
flights %>% 
  count(
    dep_missing = is.na(dep_time), 
    arr_missing = is.na(arr_time)
  )
#> # A tibble: 3 × 3
#>   dep_missing arr_missing      n
#>   <lgl>       <lgl>        <int>
#> 1 FALSE       FALSE       328063
#> 2 FALSE       TRUE           458
#> 3 TRUE        TRUE          8255
You can combine count() with the cut_* functions from ggplot2 to
compute histograms “by hand”:
# five bins of equal width
flights %>% 
  count(cut_interval(arr_delay, 5))
#> # A tibble: 6 × 2
#>   `cut_interval(arr_delay, 5)`      n
#>   <fct>                         <int>
#> 1 [-86,186]                    323807
#> 2 (186,457]                      3465
#> 3 (457,729]                        45
#> 4 (729,1e+03]                      25
#> 5 (1e+03,1.27e+03]                  4
#> 6 <NA>                           9430
# five bins with approximately equal numbers of points
flights %>% 
  count(cut_number(arr_delay, 5))
#> # A tibble: 6 × 2
#>   `cut_number(arr_delay, 5)`     n
#>   <fct>                      <int>
#> 1 [-86,-19]                  70875
#> 2 (-19,-10]                  61570
#> 3 (-10,1]                    66972
#> 4 (1,21]                     62970
#> 5 (21,1.27e+03]              64959
#> 6 <NA>                        9430
# hourly bins
flights %>% 
  count(cut_width(arr_delay, 60, boundary = 0))
#> # A tibble: 22 × 2
#>    `cut_width(arr_delay, 60, boundary = 0)`      n
#>    <fct>                                     <int>
#>  1 [-120,-60]                                  240
#>  2 (-60,0]                                  194102
#>  3 (0,60]                                   105215
#>  4 (60,120]                                  17755
#>  5 (120,180]                                  6191
#>  6 (180,240]                                  2291
#>  7 (240,300]                                   941
#>  8 (300,360]                                   365
#>  9 (360,420]                                   144
#> 10 (420,480]                                    37
#> # … with 12 more rows
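Once the bins have been counted by hand, they can be drawn as bars. This is only a sketch of one way to do it: the names hourly and bin are made up for illustration, missing arrival delays are filtered out first so that no NA bin appears, and coord_flip() is used just to keep the bin labels readable.

```r
library(tidyverse)
library(nycflights13)

# Count flights in hourly bins of arrival delay, dropping missing delays
hourly <- flights %>% 
  filter(!is.na(arr_delay)) %>% 
  count(bin = cut_width(arr_delay, 60, boundary = 0))

# Draw the hand-computed counts as bars, one bar per bin
ggplot(hourly, aes(x = bin, y = n)) +
  geom_col() +
  coord_flip()
```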
Introduction [r4ds-7.1]
Questions [r4ds-7.2]
Variation [r4ds-7.3]