Row-oriented workflows in R with the tidyverse

Slides for RStudio webinar
Jenny Bryan
Code and more resources at:
https://rstd.io/row-work

Jennifer (Jenny) Bryan

April 11, 2018

Transcript

  1. Jennifer Bryan
    RStudio, University of British Columbia
    @JennyBryan  @jennybc
    Row-oriented workflows in R + tidyverse

  2. rstd.io/row-work
    GitHub repo has all code.
    Link to slides on SpeakerDeck.
    Get the .R files to play along.
    Or follow via rendered .md.

  3. This work is licensed under a
    Creative Commons
    Attribution-ShareAlike 4.0
    International License.
    To view a copy of this license, visit
    http://creativecommons.org/licenses/by-sa/4.0/

  4. download materials: rstd.io/row-work
    I assume you know or want to know:
    the tidyverse packages
    the pipe operator, %>%
    list = core data structure
    "apply" or "map" functions,
    e.g. base::lapply() and purrr::map()

  5. tidyverse.org

  6. r4ds.had.co.nz

  7. https://twitter.com/daattali/status/761058049859518464
    https://twitter.com/daattali/status/761233607822221312

  8. > str(i_want)
    List of 2
     $ :List of 2
      ..$ x: num 1
      ..$ y: chr "one"
     $ :List of 2
      ..$ x: num 2
      ..$ y: chr "two"
    > i_have
    # A tibble: 2 x 2
          x y
      <dbl> <chr>
    1    1. one
    2    2. two
    How to do this?
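
To play along, the two objects on this slide can be built by hand. This is a minimal sketch that assumes the tibble package is installed; the values are taken from the printed output above, and the construction itself is an illustration rather than the speaker's code.

```r
library(tibble)

# The data frame we start with: one row per record.
i_have <- tibble(x = c(1, 2), y = c("one", "two"))

# The target: one list element per row, each holding that row's values.
i_want <- list(
  list(x = 1, y = "one"),
  list(x = 2, y = "two")
)

str(i_want)
```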

  9. https://rpubs.com/wch/200398
    Winston compiled,
    I updated.

  10. df <- SOME DATA FRAME
    out <- vector(mode = "list", length = nrow(df))
    for (i in seq_along(out)) {
      out[[i]] <- as.list(df[i, , drop = FALSE])
    }
    out
    for loop
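
A runnable version of the for loop, as a sketch in base R only. The two-row `df` is a hypothetical stand-in for SOME DATA FRAME, chosen to match the `i_have` example.

```r
# Hypothetical stand-in for SOME DATA FRAME.
df <- data.frame(x = c(1, 2), y = c("one", "two"), stringsAsFactors = FALSE)

# Pre-allocate the result, then fill one element per row.
out <- vector(mode = "list", length = nrow(df))
for (i in seq_along(out)) {
  # drop = FALSE keeps row i as a one-row data frame before conversion.
  out[[i]] <- as.list(df[i, , drop = FALSE])
}

str(out)
```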

  11. df <- SOME DATA FRAME
    df <- split(df, seq_len(nrow(df)))
    lapply(df, function(row) as.list(row))
    split by row then lapply
    df <- SOME DATA FRAME
    lapply(
      seq_len(nrow(df)),
      function(i) as.list(df[i, , drop = FALSE])
    )
    lapply over row numbers
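
Both lapply() approaches can be tried side by side. This sketch uses base R and a hypothetical two-row `df`; it stores the split pieces in `rows` rather than overwriting `df`, which is a small departure from the slide.

```r
# Hypothetical stand-in for SOME DATA FRAME.
df <- data.frame(x = c(1, 2), y = c("one", "two"), stringsAsFactors = FALSE)

# Approach 1: split into one-row data frames, then convert each to a list.
rows <- split(df, seq_len(nrow(df)))
out_split <- lapply(rows, function(row) as.list(row))

# Approach 2: iterate over row numbers directly.
out_index <- lapply(
  seq_len(nrow(df)),
  function(i) as.list(df[i, , drop = FALSE])
)

# split() names its pieces after the grouping values; the row-number
# version returns an unnamed list.
names(out_split)
names(out_index)
```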

  12. df <- SOME DATA FRAME
    transpose(df)
    purrr::transpose()*
    df <- SOME DATA FRAME
    pmap(df, list)
    purrr::pmap()
    * Happens to be exactly what's needed in this specific example.
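
The two purrr one-liners, as a sketch on a hypothetical two-row tibble (assumes purrr and tibble are installed). Note that newer purrr releases mark transpose() as superseded; to my knowledge it still works, but treat that as an assumption about your installed version.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for SOME DATA FRAME.
df <- tibble(x = c(1, 2), y = c("one", "two"))

# pmap() calls list() once per row, passing the columns as named arguments.
out_pmap <- pmap(df, list)

# transpose() turns a list of columns into a list of rows.
out_transpose <- transpose(df)

str(out_pmap)
str(out_transpose)
```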

  13. Why so many ways to do
    THING for each row?
    Because there is no way.

  14. Why so many ways to do
    THING for each row?
    Columns are very special in R.
    This is fantastic for data analysis.
    Tradeoff: row-oriented work is harder.

  15. How to choose?
    Speed and ease of:
    • Writing the code
    • Reading the code
    • Executing the code

  16. Of course someone has
    to write loops
    It doesn't have to be you

  17. Pro tip #1
    Use vectorized functions.
    Let other people write loop-y
    code for you.

  18. paste() example
    ex03_row-wise-iteration-are-you-sure.R
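
A small sketch of the idea behind this example, not a copy of the referenced file. The data frame, its column names, and the sentence template are assumptions made for illustration; the point is that paste() is already vectorized, so no row-wise loop is needed.

```r
# Hypothetical data; column names are illustrative only.
df <- data.frame(
  name = c("Reed", "Wesley"),
  age = c(14, 12),
  stringsAsFactors = FALSE
)

# paste() works on whole columns at once, element by element.
df$label <- paste(df$name, "is", df$age, "years old")
df$label
```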

  19. Pro tip #2
    Use purrr::map()* and friends.
    Let other people write loop-y
    code for you.
    * Like base::lapply(), but anchors a large, coherent family of map functions.

  20. purrr::map(.x, .f, ...)

  21. map(.x, .f, ...)
    for every element of .x
    apply .f

  22. .x = minis

  23. map(minis, antennate)

  24. map(.x, .f, ...)
    .x <- SOME VECTOR OR LIST
    out <- vector(mode = "list", length = length(.x))
    for (i in seq_along(out)) {
      out[[i]] <- .f(.x[[i]])
    }
    out
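
To check the loop against map() directly, here is a sketch with a hypothetical input list and sum() standing in for `.f` (assumes purrr is installed).

```r
library(purrr)

.x <- list(1:3, 4:6, 7:9)  # hypothetical input
.f <- sum                  # hypothetical function to apply

# The explicit loop from the slide.
out <- vector(mode = "list", length = length(.x))
for (i in seq_along(out)) {
  out[[i]] <- .f(.x[[i]])
}

# map() should produce the same list with less bookkeeping.
identical(out, map(.x, .f))
```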

  25. map(.x, .f, ...)
    purrr::map() implements a for loop!
    But with less code clutter.

  26. purrr::map() example
    ex04_map-example.R

  27. No, I really do
    need to do THING
    for each row.

  28. > str(i_want)
    List of 2
     $ :List of 2
      ..$ x: num 1
      ..$ y: chr "one"
     $ :List of 2
      ..$ x: num 2
      ..$ y: chr "two"
    > i_have
    # A tibble: 2 x 2
          x y
      <dbl> <chr>
    1    1. one
    2    2. two
    How to do this?

  29. pmap(.l, .f, ...)
    for every tuple in .l
    apply .f

  30. pmap(.l, embody)

  31. pmap(.l, embody)

  32. pmap(.l, .f, ...)
    .l <- LIST OF LENGTH-N VECTORS
    out <- vector(mode = "list", length = N)
    for (i in seq_along(out)) {
      out[[i]] <- .f(.l[[1]][[i]], .l[[2]][[i]], ...)
    }
    out

  33. pmap(.l, .f, ...)
    .l <- LIST OF LENGTH-N VECTORS
    out <- vector(mode = "list", length = N)
    for (i in seq_along(out)) {
      out[[i]] <- .f(.l[[1]][[i]], .l[[2]][[i]], ...)
    }
    out
    A data frame works!
    Then .l[[1]][[i]], .l[[2]][[i]], ... is row i.

  34. pmap(.l, .f, ...)
    .l <- LIST OF LENGTH-N VECTORS
    out <- vector(mode = "list", length = N)
    for (i in seq_along(out)) {
      out[[i]] <- .f(.l[[1]][[i]], .l[[2]][[i]], ...)
    }
    out
    pmap() is a for loop!
    it applies .f to each row
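
The same comparison for pmap(), as a sketch. The tibble and the function `f` are hypothetical; `f` takes arguments named after the columns because pmap() passes each row's values by column name.

```r
library(purrr)
library(tibble)

df <- tibble(x = c(1, 2), y = c("one", "two"))  # hypothetical data
f <- function(x, y) paste(x, y)                 # hypothetical .f

# The explicit loop: element i of every column goes to one call of f.
out <- vector(mode = "list", length = nrow(df))
for (i in seq_along(out)) {
  out[[i]] <- f(df[[1]][[i]], df[[2]][[i]])
}

# pmap() does the same walk over rows.
out_pmap <- pmap(df, f)
str(out)
str(out_pmap)
```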

  35. purrr::pmap() example
    ex06_runif-via-pmap.R
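
A sketch in the spirit of the file name, not a copy of it: each row of a data frame supplies one set of arguments to runif(). The specific values are assumptions for illustration. The column names must match runif()'s argument names (`n`, `min`, `max`) for this to work.

```r
library(purrr)
library(tibble)

# Each row holds one set of arguments for runif(n, min, max).
df <- tribble(
  ~n, ~min, ~max,
  1L,    0,    1,
  2L,   10,  100,
  3L,  100, 1000
)

set.seed(123)
out <- pmap(df, runif)

# One result per row, with as many draws as that row's n.
lengths(out)
```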

  36. How to choose?
    Speed and ease of:
    • Writing the code
    • Reading the code
    • Executing the code

  37. map()
    map_lgl(), map_int(), map_dbl(), map_chr()
    map_if(), map_at()
    map_dfr(), map_dfc()
    map2()
    map2_lgl(), map2_int(), map2_dbl(), map2_chr()
    map2_dfr(), map2_dfc()
    pmap()
    pmap_lgl(), pmap_int(), pmap_dbl(), pmap_chr()
    pmap_dfr(), pmap_dfc()
    imap()
    imap_lgl(), imap_chr(), imap_int(), imap_dbl()
    imap_dfr(), imap_dfc()

  38. map()
    map_lgl(), map_int(), map_dbl(), map_chr()
    map_if(), map_at()
    map_dfr(), map_dfc()
    map2()
    map2_lgl(), map2_int(), map2_dbl(), map2_chr()
    map2_dfr(), map2_dfc()
    pmap()
    pmap_lgl(), pmap_int(), pmap_dbl(), pmap_chr()
    pmap_dfr(), pmap_dfc()
    imap()
    imap_lgl(), imap_chr(), imap_int(), imap_dbl()
    imap_dfr(), imap_dfc()
    purrr's map functions have
    a common interface
    learn it once,
    use it everywhere
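
A few members of the family in action, to show the shared (.x, .f) shape and the typed suffixes. This is a sketch with made-up inputs and helper functions (assumes purrr is installed).

```r
library(purrr)

# Suffix picks the output type: _dbl returns a double vector.
means <- map_dbl(list(1:3, 4:6), mean)

# _chr returns a character vector.
upper <- map_chr(c("a", "b"), toupper)

# map2_*() walks two inputs in parallel.
sums <- map2_dbl(c(1, 2), c(10, 20), function(a, b) a + b)

# pmap_*() walks any number of inputs, matched by name.
labels <- pmap_chr(
  list(x = c(1, 2), y = c("one", "two")),
  function(x, y) paste(x, y)
)

# imap_*() passes each element together with its name (or index).
tagged <- imap_chr(c(a = "x", b = "y"), function(value, name) paste(name, value))

means
upper
sums
labels
tagged
```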

  39. df <- SOME DATA FRAME
    out <- vector(mode = "list", length = nrow(df))
    for (i in seq_along(out)) {
      out[[i]] <- as.list(df[i, , drop = FALSE])
    }
    out
    for loop
    df <- SOME DATA FRAME
    df <- split(df, seq_len(nrow(df)))
    lapply(df, function(row) as.list(row))
    split by row then lapply
    df <- SOME DATA FRAME
    lapply(
      seq_len(nrow(df)),
      function(i) as.list(df[i, , drop = FALSE])
    )
    lapply over row numbers
    df <- SOME DATA FRAME
    pmap(df, list)
    purrr::pmap()
    df <- SOME DATA FRAME
    transpose(df)
    purrr::transpose()

  40. (no text on this slide)

  41. code for that study:
    iterate-over-rows.R

  42. purrr::pmap(df, .f)
    for each row of df
    do this

  43. What if I need to work
    on groups of rows?

  44. Pro tip #3
    Use dplyr::group_by() +
    summarize().
    Let other people write loop-y
    code for you.

  45. group_by() + summarize() example
    ex07_group-by-summarise.R
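
A minimal sketch of the pattern, not a copy of the referenced file. The toy data and column names (`g`, `x`) are assumptions; it only needs dplyr.

```r
library(dplyr)

# Hypothetical data: a grouping column and a numeric column.
df <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 10, 20, 30)
)

# group_by() marks the groups; summarize() collapses each to one row.
res <- df %>%
  group_by(g) %>%
  summarize(n = n(), mean_x = mean(x))

res
```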

  46. No, I really must work
    on groups of rows.

  47. Use nesting
    to restate as
    "do THING for each row"

  48. Use nesting
    to restate as
    "do THING for each row"
    DONE*
    * See everything up 'til now in this talk.

  49. dplyr::group_by() + tidyr::nest()
    ex08_nesting-is-good.R
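
A sketch of how nesting restates "work on groups of rows" as "do THING for each row". The toy data, the column names, and the per-group summaries are assumptions for illustration, not the contents of the referenced file.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data: a grouping column and a numeric column.
df <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 10, 20, 30)
)

# After nesting there is one row per group; `data` is a list-column
# holding each group's rows as its own tibble.
nested <- df %>%
  group_by(g) %>%
  nest() %>%
  ungroup()

# Now any per-group THING is a map() over the list-column.
res <- nested %>%
  mutate(
    n_rows = map_int(data, nrow),
    mean_x = map_dbl(data, function(d) mean(d$x))
  )

res
```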

  50. Tips for row-oriented workflows
    embrace the data frame
    esp. the tibble = tidyverse data frame
    embrace lists
    embrace lists as variables in a tibble
    "list-columns", may come from nesting
    embrace purrr::map() & friends