Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
compute_sentiment(
  x,
  lexicons,
  how = "proportional",
  tokens = NULL,
  do.sentence = FALSE,
  nCore = 1
)
If x is a sento_corpus object: a sentiment object, i.e., a data.table containing the sentiment scores, with an
"id", a "date" and a "word_count" column,
and all lexicon-feature sentiment scores columns. The tokenized sentences are not provided but can be
obtained as stringi::stri_split_boundaries(texts, type = "sentence"). A sentiment object can
be aggregated (into time series) with the aggregate.sentiment function.
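The sentence splitting mentioned above can be reproduced directly; a quick sketch on the package's built-in usnews data:

data("usnews", package = "sentometrics")
texts <- usnews[["texts"]][1:2]
stringi::stri_split_boundaries(texts, type = "sentence") # a list of sentence vectors, one per text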
If x is a quanteda corpus object: a sentiment scores
data.table with an
"id" and a "word_count" column, and all lexicon-feature
sentiment scores columns.
If x is a tm SimpleCorpus object, a tm
VCorpus object, or a character
vector: a sentiment scores
data.table with an auto-created
"id" column, a "word_count"
column, and all lexicon sentiment scores columns.
When do.sentence = TRUE, an additional
"sentence_id" column is added alongside the
"id" column.
For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp.
negative) lexicons (see the
do.split option in the
sento_lexicons function). All NAs
are converted to 0, under the assumption that this is equivalent to no sentiment. By default
tokens = NULL,
meaning the corpus is internally tokenized as unigrams, with punctuation and numbers (but not stopwords) removed.
All tokens are converted to lowercase, in line with what the
sento_lexicons function does for the
lexicons and valence shifters. Word counts are based on that same tokenization.
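A minimal sketch of the do.split option (the exact names of the resulting split lexicons are an assumption here):

data("list_lexicons", package = "sentometrics")
lSplit <- sento_lexicons(list_lexicons["LM_en"], do.split = TRUE)
names(lSplit) # one positive and one negative lexicon (naming assumed, e.g. "LM_en_POS"/"LM_en_NEG")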
If the lexicons argument has no
"valence" element, the sentiment computed corresponds to simple unigram
matching with the lexicons [unigrams approach]. If valence shifters are included in
lexicons with a corresponding
"y" column, the polarity of a word detected from a lexicon gets multiplied with the associated
value of a valence shifter if it appears right before the detected word (examples: not good or can't defend) [bigrams
approach]. If the valence table contains a
"t" column, valence shifters are searched for in a cluster centered around
a detected polarity word [clusters approach]. The latter approach is a simplified version of the one utilized by the
sentimentr package. A cluster amounts to four words before and two words after a polarity word. A cluster never overlaps
with a preceding one. Roughly speaking, the polarity of a cluster is calculated as \(n(1 + 0.80d)S + \sum s\). The polarity
score of the detected word is \(S\), \(s\) represents the polarities of eventual other sentiment words, and \(d\) is
the difference between the number of amplifiers (
t = 2) and the number of deamplifiers (
t = 3). If there
is an odd number of negators (
t = 1), then \(n = -1\) and amplifiers are counted as deamplifiers; else \(n = 1\).
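As a worked instance of this formula (illustrative: it assumes the polarity word "good" scores \(S = 1\) in the lexicon, and that "not" and "very" sit in the valence table with t = 1 and t = 2 respectively):

## phrase: "this is not very good", detected polarity word "good" (S = 1)
## one negator "not" (t = 1), an odd count, so n = -1 and the amplifier
## "very" (t = 2) is counted as a deamplifier instead
n <- -1
d <- 0 - 1 # number of amplifiers minus number of deamplifiers
S <- 1
s <- 0 # no other sentiment words in the cluster
n * (1 + 0.80 * d) * S + sum(s) # cluster polarity: -0.2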
The sentence-level sentiment calculation approaches each sentence as if it were a document. Depending on the input, either
the unigrams, bigrams or clusters approach is used. We enhanced the latter approach, following more closely the default
sentimentr settings. These settings use a cluster of five words before and two words after a polarized word. The cluster
is limited to the words after a preceding comma and before the next comma. Adversative conjunctions (
t = 4) are
accounted for here. The cluster is reweighted based on the value \(1 + 0.25adv\), where \(adv\) is the difference
between the number of adversative conjunctions found before and after the polarized word.
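The reweighting can be made concrete with a small numerical sketch (assuming one adversative conjunction, say "but" with t = 4, occurs before the polarized word and none after):

adv <- 1 - 0 # adversative conjunctions before minus after the polarized word
1 + 0.25 * adv # reweighting value: 1.25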
data("usnews", package = "sentometrics") txt <- system.file("texts", "txt", package = "tm") reuters <- system.file("texts", "crude", package = "tm") data("list_lexicons", package = "sentometrics") data("list_valence_shifters", package = "sentometrics") l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")]) l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]]) l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]][, c("x", "t")]) # from a sento_corpus object - unigrams approach corpus <- sento_corpus(corpusdf = usnews) corpusSample <- quanteda::corpus_sample(corpus, size = 200) sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol") # from a character vector - bigrams approach sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts") # from a corpus object - clusters approach corpusQ <- quanteda::corpus(usnews, text_field = "texts") corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200) sent3 <- compute_sentiment(corpusQSample, l3, how = "counts") # from an already tokenized corpus - using the 'tokens' argument toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword")) sent4 <- compute_sentiment(corpusQSample, l1, how = "counts", tokens = toks) # from a SimpleCorpus object - unigrams approach scorp <- tm::SimpleCorpus(tm::DirSource(txt)) sent5 <- compute_sentiment(scorp, l1, how = "proportional") # from a VCorpus object - unigrams approach ## in contrast to what as.sento_corpus(vcorp) would do, the ## sentiment calculator handles multiple character vectors within ## a single corpus element as separate documents vcorp <- tm::VCorpus(tm::DirSource(reuters)) sent6 <- compute_sentiment(vcorp, l1) # from a sento_corpus object - unigrams approach with tf-idf weighting sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF") # sentence-by-sentence computation sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot", do.sentence = TRUE) # from a (fake) multilingual corpus usnews[["language"]] <- "en" # add language column usnews$language[1:100] <- "fr" lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr, "HENRY" = list_lexicons$HENRY_en), list_valence_shifters$en) lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr), list_valence_shifters$fr) lexicons <- list(en = lEn, fr = lFr) corpusLang <- sento_corpus(corpusdf = usnews[1:250, ]) sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")