Converting textual data into a numerical representation can be done in
many different ways. The steps included in textrecipes
should give you the flexibility to perform most of your
desired text preprocessing tasks. This vignette showcases examples
that combine multiple steps.
This vignette will not do any modeling with the processed text, as its
purpose is to showcase the flexibility and modularity. Therefore the
only packages needed will be recipes and
textrecipes. Examples will be performed on the
tate_text data set, which is packaged with
modeldata.
library(recipes)
library(textrecipes)
library(modeldata)
data("tate_text")

Sometimes it is enough to know the counts of a handful of specific
words. This can easily be achieved by using the arguments
custom_stopword_source and keep = TRUE in
step_stopwords.
words <- c("or", "and", "on")
tate_rec <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_stopwords(medium, custom_stopword_source = words, keep = TRUE) %>%
  step_tf(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, tate_text) %>%
select(starts_with("tf_medium"))
#> # A tibble: 4,284 × 3
#> tf_medium_and tf_medium_on tf_medium_or
#> <int> <int> <int>
#> 1 1 0 1
#> 2 0 1 0
#> 3 0 1 0
#> 4 0 1 0
#> 5 0 1 0
#> 6 0 1 0
#> 7 0 1 0
#> 8 0 1 0
#> 9 1 1 0
#> 10 0 1 0
#> # … with 4,274 more rows

You might know of certain words you don’t want included that aren’t
part of your chosen stop word list. Such words can easily be removed by
applying the step_stopwords step twice: once for the stop
words and once for your special words.
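As a side note, the two lists could also be concatenated and removed with a single step_stopwords call, as in the sketch here; the two-step version used below simply keeps each removal explicit and easy to modify independently.

```r
# Sketch of an equivalent single-step variant: combine the general stop
# word list and the custom words into one vector and remove them together.
combined <- c("was", "which", "with", "sad", "happy")

rec <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_stopwords(medium, custom_stopword_source = combined) %>%
  step_tfidf(medium)
```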
stopwords_list <- c("was", "she's", "who", "had", "some", "same", "you", "most",
                    "it's", "they", "for", "i'll", "which", "shan't", "we're",
                    "such", "more", "with", "there's", "each")
words <- c("sad", "happy")
tate_rec <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_stopwords(medium, custom_stopword_source = stopwords_list) %>%
  step_stopwords(medium, custom_stopword_source = words) %>%
  step_tfidf(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, tate_text) %>%
select(starts_with("tfidf_medium"))
#> # A tibble: 4,284 × 951
#> tfidf_mediu…¹ tfidf…² tfidf…³ tfidf…⁴ tfidf…⁵ tfidf…⁶ tfidf…⁷ tfidf…⁸ tfidf…⁹
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0 0 0 0 0 0
#> 2 0 0 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 0 0 0
#> 4 0 0 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0 0 0
#> 6 0 0 0 0 0 0 0 0 0
#> 7 0 0 0 0 0 0 0 0 0
#> 8 0 0 0 0 0 0 0 0 0
#> 9 0 0 0 0 0 0 0 0 0
#> 10 0 0 0 0 0 0 0 0 0
#> # … with 4,274 more rows, 942 more variables: tfidf_medium_151 <dbl>,
#> # tfidf_medium_16 <dbl>, tfidf_medium_160 <dbl>, tfidf_medium_16mm <dbl>,
#> # tfidf_medium_18 <dbl>, tfidf_medium_19 <dbl>, tfidf_medium_2 <dbl>,
#> # tfidf_medium_20 <dbl>, tfidf_medium_2000 <dbl>, tfidf_medium_201 <dbl>,
#> # tfidf_medium_21 <dbl>, tfidf_medium_22 <dbl>, tfidf_medium_220 <dbl>,
#> # tfidf_medium_23 <dbl>, tfidf_medium_24 <dbl>, tfidf_medium_25 <dbl>,
#> # tfidf_medium_26 <dbl>, tfidf_medium_27 <dbl>, tfidf_medium_28 <dbl>, …

Another thing one might want to look at is the use of different
letters in a certain text. For this we can use the built-in character
tokenizer and keep only the letters using the
step_stopwords step.
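Keeping only letters matters because the character tokenizer also emits digits: by default, tokenizers::tokenize_characters() lowercases and strips non-alphanumeric characters, but digits survive. A quick illustration, assuming the tokenizers package is installed:

```r
library(tokenizers)

# By default tokenize_characters() lowercases and strips punctuation and
# whitespace, but keeps digits, which would otherwise show up as columns
# like tf_medium_2 without the keep = TRUE filter on `letters`.
tokenize_characters("Oil paint on 2 canvases")[[1]]
```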
tate_rec <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium, token = "characters") %>%
  step_stopwords(medium, custom_stopword_source = letters, keep = TRUE) %>%
  step_tf(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, tate_text) %>%
select(starts_with("tf_medium"))
#> # A tibble: 4,284 × 26
#> tf_medium_a tf_medi…¹ tf_me…² tf_me…³ tf_me…⁴ tf_me…⁵ tf_me…⁶ tf_me…⁷ tf_me…⁸
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 0 2 3 4 0 0 0 3
#> 2 1 0 1 0 2 0 1 1 1
#> 3 1 0 1 0 2 0 1 1 1
#> 4 1 0 1 0 2 0 1 1 1
#> 5 3 0 1 0 0 0 0 0 2
#> 6 3 0 1 0 0 0 0 0 2
#> 7 3 0 2 0 1 0 0 0 2
#> 8 1 0 1 1 1 0 0 0 0
#> 9 5 0 1 1 0 0 0 0 2
#> 10 1 0 0 0 1 0 0 0 1
#> # … with 4,274 more rows, 17 more variables: tf_medium_j <int>,
#> # tf_medium_k <int>, tf_medium_l <int>, tf_medium_m <int>, tf_medium_n <int>,
#> # tf_medium_o <int>, tf_medium_p <int>, tf_medium_q <int>, tf_medium_r <int>,
#> # tf_medium_s <int>, tf_medium_t <int>, tf_medium_u <int>, tf_medium_v <int>,
#> # tf_medium_w <int>, tf_medium_x <int>, tf_medium_y <int>, tf_medium_z <int>,
#> # and abbreviated variable names ¹tf_medium_b, ²tf_medium_c, ³tf_medium_d,
#> # ⁴tf_medium_e, ⁵tf_medium_f, ⁶tf_medium_g, ⁷tf_medium_h, ⁸tf_medium_i

Sometimes fairly complicated computations are needed. Here we would like the
term frequency inverse document frequency (TF-IDF) of the 500 most
common ngrams, computed on stemmed tokens. That is quite a handful and
would seldom be included as an option in most other libraries, but the
modularity of textrecipes makes this task fairly easy.
First we tokenize into words, then stem those words.
We then paste the stemmed tokens back together using
step_untokenize so we are back to a single string, which we
tokenize again, this time using the ngram tokenizer. Lastly we
filter to the most common tokens and compute TF-IDF as usual.
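For intuition, the chain of steps can be sketched outside the recipe on a single string, using the tokenizers and SnowballC packages that power these steps (assuming both are installed):

```r
library(tokenizers)
library(SnowballC)

x <- "Printed papers and photographs on paper"

tokens <- tokenize_words(x)[[1]]        # step_tokenize: split into word tokens
stems  <- wordStem(tokens)              # step_stem: stem each token
string <- paste(stems, collapse = " ")  # step_untokenize: back to one string
ngrams <- tokenize_ngrams(string)[[1]]  # step_tokenize: ngrams (trigrams by default)
```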
tate_rec <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium, token = "words") %>%
  step_stem(medium) %>%
  step_untokenize(medium) %>%
  step_tokenize(medium, token = "ngrams") %>%
  step_tokenfilter(medium, max_tokens = 500) %>%
  step_tfidf(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, tate_text) %>%
select(starts_with("tfidf_medium"))
#> # A tibble: 4,284 × 499
#> tfidf_mediu…¹ tfidf…² tfidf…³ tfidf…⁴ tfidf…⁵ tfidf…⁶ tfidf…⁷ tfidf…⁸ tfidf…⁹
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0 0 0 0 0 0
#> 2 0 0 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 0 0 0
#> 4 0 0 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0 0 0
#> 6 0 0 0 0 0 0 0 0 0
#> 7 0 0 0 0 0 0 0 0 0
#> 8 0 0 0 0 0 0 0 0 0
#> 9 0 0 0 0 0 0 0 0 0
#> 10 0 0 0 0 0 0 0 0 0
#> # … with 4,274 more rows, 490 more variables:
#> # `tfidf_medium_2 photograph black` <dbl>,
#> # `tfidf_medium_2 photograph colour` <dbl>,
#> # `tfidf_medium_2 project black` <dbl>,
#> # `tfidf_medium_2 project colour` <dbl>, `tfidf_medium_2 work on` <dbl>,
#> # `tfidf_medium_3 flat screen` <dbl>, `tfidf_medium_3 monitor colour` <dbl>,
#> # `tfidf_medium_3 photograph colour` <dbl>, …
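Finally, note that the ngram tokenizer above runs with its defaults (trigrams). The options argument of step_tokenize passes arguments along to the underlying tokenizer, so, as a sketch, bigrams could be requested like this (see ?tokenizers::tokenize_ngrams for the available arguments):

```r
# Same pipeline as above, but asking the ngram tokenizer for bigrams.
rec_bigram <- recipe(~ ., data = tate_text) %>%
  step_tokenize(medium, token = "words") %>%
  step_stem(medium) %>%
  step_untokenize(medium) %>%
  step_tokenize(medium, token = "ngrams", options = list(n = 2)) %>%
  step_tokenfilter(medium, max_tokens = 500) %>%
  step_tfidf(medium)
```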