Set up lexicons (and valence word list) for use in sentiment analysis

Structures the provided lexicon(s) and, optionally, a valence word list. One can, for example, combine (part of) the built-in lexicons from data("list_lexicons") with other lexicons, and add one of the built-in valence word lists from data("list_valence_shifters"). The function makes the output coherent by converting all words to lowercase and checking for duplicates. Entries consisting of more than one word are discarded, as required for bag-of-words sentiment analysis.
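The clean-up described above can be illustrated in plain R. This is a minimal sketch of the idea, not the package's implementation: the helper name clean_lexicon is hypothetical, and keeping the first occurrence of a duplicated word is an assumption, since the description only states that duplicates are checked.

```r
# Hypothetical helper illustrating the clean-up steps; not part of sentometrics.
clean_lexicon <- function(lex) {
  # Use the column names x (words) and y (scores), as in the returned object.
  out <- data.frame(x = tolower(as.character(lex[[1]])),
                    y = as.numeric(lex[[2]]),
                    stringsAsFactors = FALSE)
  # Discard entries made of more than one word (anything containing whitespace).
  out <- out[!grepl("\\s", out$x), , drop = FALSE]
  # Assumption: keep the first occurrence of each duplicated word.
  out <- out[!duplicated(out$x), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Made-up input with mixed case, a duplicate, and a two-word entry.
raw <- data.frame(w = c("Nice", "nice", "not good", "boring"),
                  s = c(2, 1, -1, -1))
cleaned <- clean_lexicon(raw)
print(cleaned)
```

In this toy input, "Nice" and "nice" collapse into one entry after lowercasing, and "not good" is dropped because it consists of two words.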

sento_lexicons(lexiconsIn, valenceIn = NULL, do.split = FALSE)

Arguments

lexiconsIn: a named list of (raw) lexicons, each element a data.table or a data.frame with, in that order, a character column (the words) and a numeric column (the polarity scores). This argument can be (a subset of) the built-in lexicons accessible via sentometrics::list_lexicons.
valenceIn: a single valence word list as a data.table or a data.frame with an "x" column and either a "y" or a "t" column. The first column has the words, "y" has the values for bigram shifting, and "t" has the types of the valence shifters for a clustered approach to sentiment calculation (supported types: 1 = negators, 2 = amplifiers, 3 = deamplifiers, 4 = adversative conjunctions). Type 4 is only used in a clusters-based sentence-level sentiment calculation. If three columns are provided, only the first two will be considered. This argument can be one of the built-in valence word lists accessible via sentometrics::list_valence_shifters. A word that appears in both a lexicon and the valence word list is treated as a lexicon entry during sentiment calculation. If NULL, valence shifting is not applied in the sentiment analysis.
do.split: a logical; if TRUE, every lexicon is split into a separate positive-polarity lexicon and a separate negative-polarity lexicon.
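As a rough illustration of the argument formats above, the following base R sketch builds a custom lexicon and a typed valence word list, then mimics what do.split = TRUE is described to do. The words and scores are made up, the helper split_by_polarity is hypothetical, the _POS and _NEG suffixes follow the element names used in the examples on this page, and dropping zero scores is an assumption.

```r
# A self-made lexicon: first column holds the words, second the polarity scores.
myLexicon <- data.frame(w = c("nice", "great", "boring", "awful"),
                        s = c(1, 2, -1, -2),
                        stringsAsFactors = FALSE)

# A valence word list for the clustered approach: "x" holds the words, "t" the types
# (1 = negators, 2 = amplifiers, 3 = deamplifiers, 4 = adversative conjunctions).
myValence <- data.frame(x = c("not", "very", "hardly", "but"),
                        t = c(1, 2, 3, 4),
                        stringsAsFactors = FALSE)

# Hypothetical helper mimicking do.split = TRUE; not the package's implementation.
split_by_polarity <- function(lexicons) {
  out <- list()
  for (nm in names(lexicons)) {
    lex <- lexicons[[nm]]
    # Assumption: zero scores end up in neither part.
    out[[paste0(nm, "_POS")]] <- lex[lex[[2]] > 0, , drop = FALSE]
    out[[paste0(nm, "_NEG")]] <- lex[lex[[2]] < 0, , drop = FALSE]
  }
  out
}

parts <- split_by_polarity(list(myLexicon = myLexicon))
print(names(parts))

# With sentometrics installed, the same inputs can be passed to the function itself.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  lex <- sentometrics::sento_lexicons(list(myLexicon = myLexicon),
                                      valenceIn = myValence, do.split = TRUE)
  print(names(lex))
}
```

The guarded call at the end only runs when the package is available, so the sketch stays usable as a standalone illustration of the input shapes.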

Value

A list of class sento_lexicons with each lexicon as a separate element according to its name, as a data.table, and optionally an element named valence that comprises the valence words. Every "x" column contains the words, every "y" column contains the scores. The "t" column for valence shifters contains the different types.

Author

Samuel Borms

Examples

data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# lexicons straight from built-in word lists
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])

# including a self-made lexicon, with and without valence shifters
lexIn <- c(list(myLexicon = data.table::data.table(w = c("nice", "boring"), s = c(2, -1))),
           list_lexicons[c("GI_en")])
valIn <- list_valence_shifters[["en"]]
l2 <- sento_lexicons(lexIn)
l3 <- sento_lexicons(lexIn, valIn)
l4 <- sento_lexicons(lexIn, valIn[, c("x", "y")], do.split = TRUE)
l5 <- sento_lexicons(lexIn, valIn[, c("x", "t")], do.split = TRUE)
l6 <- l5[c("GI_en_POS", "valence")] # preserves sento_lexicons class

if (FALSE) { # \dontrun{
# include lexicons from the lexicon package
lexIn2 <- list(hul = lexicon::hash_sentiment_huliu, joc = lexicon::hash_sentiment_jockers)
l7 <- sento_lexicons(c(lexIn, lexIn2), valIn)
} # }

if (FALSE) { # \dontrun{
# faulty extractions (first three calls) and replacements,
# which are not allowed (last three calls)
l5["valence"]
l2[0]
l3[22]
l4[1] <- l2[1]
l4[[1]] <- l2[[1]]
l4$GI_en_NEG <- l2$myLexicon
} # }