Add feature columns to a (sento_)corpus object — add_features

Adds new feature columns, either user-supplied or based on keyword(s)/regex pattern search, to a provided sento_corpus or a quanteda corpus object.

add_features(
  corpus,
  featuresdf = NULL,
  keywords = NULL,
  do.binary = TRUE,
  do.regex = FALSE
)

Arguments

corpus: a sento_corpus object created with sento_corpus, or a quanteda corpus object.
featuresdf: a named data.frame of type numeric where each column is a new feature to be added to the input corpus object. If the number of rows in featuresdf is not equal to the number of documents in corpus, recycling will occur. The numeric values should be between 0 and 1 (inclusive).
keywords: a named list. For every element, a new feature column is added. With do.binary = TRUE, its value is 1 for the texts in which at least one of the keywords appears and 0 otherwise; with do.binary = FALSE, its value is the normalized number of times the keywords occur in the text. If no texts match a keyword, no column is added. The list names are used as the names of the new features. For more complex searches, a single regular expression can be supplied instead of keywords to define a new feature (see the Details section).
do.binary: a logical. If do.binary = FALSE, the number of occurrences is normalized between 0 and 1 (see argument keywords).
do.regex: a logical vector equal in length to the number of elements in the keywords argument list, or a single value if it applies to all elements. It should be set to TRUE at those positions where a single regular expression is used to identify the particular feature.
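
The scaling applied for do.binary = FALSE is described in the Details section as per-column min-max normalization. The following is a minimal sketch of that idea in base R, not the package's own code: the counts data.frame holds made-up values, and min_max is a hypothetical helper introduced only for illustration.

```r
# Hypothetical raw keyword counts for three documents (made-up values).
counts <- data.frame(pres = c(1, 0, 4), war = c(0, 2, 4))

# Min-max normalization: (x - min(x)) / (max(x) - min(x)).
# Note: a constant column would give NaN here; how add_features()
# treats that case is not covered by this sketch.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the scaling to each column separately.
scaled <- as.data.frame(lapply(counts, min_max))
scaled
```

Each column is scaled on its own, so the document with the most occurrences of a given feature gets the value 1 for that feature, regardless of the counts in other columns.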

Value

An updated corpus object.

Details

If a provided feature name is already part of the corpus, it will be replaced. The featuresdf and keywords arguments can be provided at the same time, or only one of them, leaving the other at NULL.

The keywords are searched with the stringi package. The do.regex argument points to the corresponding elements in keywords. For elements associated with FALSE, the keywords are transformed into a simple regular expression, using "\b" for exact word boundary matching and (if there are multiple keywords) | as the OR operator. The elements associated with TRUE do not undergo this transformation and are evaluated as given, provided the corresponding keywords vector consists of only one expression.

For a large corpus and/or complex regex patterns, this function may take a while to run. Scaling between 0 and 1 is performed via min-max normalization, per column.
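
The keyword-to-regex transformation described above can be sketched as follows. This is an illustration under assumptions rather than the package's implementation: keywords_to_regex is a hypothetical helper, the texts are made up, and whether add_features() escapes special characters in keywords is not addressed here. Only the stringi functions stri_detect_regex() and stri_count_regex() are real library calls.

```r
library(stringi)

# Hypothetical helper: wrap each keyword in word boundaries and
# join multiple keywords with | (OR), as described for do.regex = FALSE.
keywords_to_regex <- function(kws) {
  paste0("\\b", kws, "\\b", collapse = "|")
}

# Made-up example texts.
texts <- c("Obama met the US president.",
           "No match here.",
           "Obama, Obama and Obama.")

pattern <- keywords_to_regex(c("Obama", "US president"))

# Binary feature: 1 if at least one keyword appears, 0 otherwise.
binary <- as.numeric(stri_detect_regex(texts, pattern))

# Count feature: number of matches per text (before any normalization).
counts <- stri_count_regex(texts, pattern)
```

A pattern built this way is what one would pass directly with do.regex = TRUE to obtain the same matches, which is the equivalence the pres and pres2 features in the Examples section check.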

Author

Samuel Borms

Examples

set.seed(505)

# construct a corpus and add (a) feature(s) to it
corpus <- quanteda::corpus_sample(
  sento_corpus(corpusdf = sentometrics::usnews), 500
)
corpus1 <- add_features(corpus,
                        featuresdf = data.frame(random = runif(quanteda::ndoc(corpus))))
corpus2 <- add_features(corpus,
                        keywords = list(pres = "president", war = "war"),
                        do.binary = FALSE)
corpus3 <- add_features(corpus,
                        keywords = list(pres = c("Obama", "US president")))
corpus4 <- add_features(corpus,
                        featuresdf = data.frame(all = 1),
                        keywords = list(pres1 = "Obama|US [pP]resident",
                                        pres2 = "\\bObama\\b|\\bUS president\\b",
                                        war = "war"),
                        do.regex = c(TRUE, TRUE, FALSE))

sum(quanteda::docvars(corpus3, "pres")) ==
  sum(quanteda::docvars(corpus4, "pres2")) # TRUE
#> [1] TRUE

# adding a complementary feature
nonpres <- data.frame(nonpres = as.numeric(!quanteda::docvars(corpus3, "pres")))
corpus3 <- add_features(corpus3, featuresdf = nonpres)