Robert Myles McDonnell

3 minute read

I really don’t like splitting data into ‘train’ and ‘test’. I don’t mean that I’m against the idea of it, though you could say it’s a waste of data that could be used to better your model, but I mean that actual assignment in R of ‘train’ and ‘test’. I always liked destructuring in Python, and I like it a lot in 2018 JavaScript, so when I remembered that the zeallot package has it, I thought it would be a good opportunity to see how that could fit in a tidy modelling pipeline. Much to my delight, one little helper function later, it works perfectly.

Loading packages and data:

library(recipes); library(rsample); library(tidyverse); library(zeallot)

# load some data:
data("credit_data")

Next comes our little helper function:

m_bake <- function(recipe_object, data){
  cd <- initial_split(credit_data)
  tr <- training(cd)
  te <- testing(cd)
  x1 <- bake(recipe_object, tr)
  x2 <- bake(recipe_object, te)
  return(list(x1, x2))
}

Now you get a nice clean pipeline with a train/test split as a result, using zeallot’s great %->% operator:

recipe(Status ~ ., data = credit_data) %>% 
  step_knnimpute(all_predictors()) %>% 
  step_center(all_numeric()) %>% 
  step_dummy(-all_numeric()) %>% 
  prep() %>% 
  m_bake(data = credit_data) %->% c(train, test)

ls()
## [1] "credit_data" "m_bake"      "test"        "train"
head(train); head(test)
## # A tibble: 6 x 23
##   Seniority  Time    Age Expenses Income Assets  Debt Amount  Price
##       <dbl> <dbl>  <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
## 1      1.01  13.6  -7.08    17.4   -12.4 -5435. -342. -239.   -617.
## 2      9.01  13.6  20.9     -7.57  -10.4 -5435. -342.  -38.9   195.
## 3      2.01 -10.4   8.92    34.4    58.6 -2435. -342.  961.   1522.
## 4     -7.99 -10.4 -11.1     -9.57  -34.4 -5435. -342. -729.   -553.
## 5     -6.99  13.6  -1.08    19.4    72.6 -1935. -342. -389.    182.
## 6     21.0   13.6   6.92    19.4   -16.4  4565. -342.  561.    337.
## # ... with 14 more variables: Home_other <dbl>, Home_owner <dbl>,
## #   Home_parents <dbl>, Home_priv <dbl>, Home_rent <dbl>,
## #   Marital_married <dbl>, Marital_separated <dbl>, Marital_single <dbl>,
## #   Marital_widow <dbl>, Records_yes <dbl>, Job_freelance <dbl>,
## #   Job_others <dbl>, Job_partime <dbl>, Status_good <dbl>
## # A tibble: 6 x 23
##   Seniority   Time      Age Expenses Income Assets   Debt Amount  Price
##       <dbl>  <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     -7.99  13.6  -13.1        7.43   40.6 -2935. -342.  -139.  -138. 
## 2     -7.99  13.6   -5.08      34.4   -34.4  9565. -342.   161.   494. 
## 3     11.0  -10.4   -0.0804    19.4    28.6 -1935.  -82.4 -439.  -523. 
## 4     -7.99 -22.4   30.9       19.4   -10.4 -1273. -342.  -139.  -277. 
## 5     -7.99   1.56  -1.08     -10.6   -11.4 -4685. -342.    61.1   48.2
## 6     -2.99  13.6  -15.1      -10.6   183.   4565. -342.    61.1 -304. 
## # ... with 14 more variables: Home_other <dbl>, Home_owner <dbl>,
## #   Home_parents <dbl>, Home_priv <dbl>, Home_rent <dbl>,
## #   Marital_married <dbl>, Marital_separated <dbl>, Marital_single <dbl>,
## #   Marital_widow <dbl>, Records_yes <dbl>, Job_freelance <dbl>,
## #   Job_others <dbl>, Job_partime <dbl>, Status_good <dbl>

Lovely.

(Ok, I still had to split it (!!), but once you write this function, you can just call this.)

comments powered by Disqus