R/sentiment_engines.R
compute_sentiment.Rd
Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
compute_sentiment(
  x,
  lexicons,
  how = "proportional",
  tokens = NULL,
  do.sentence = FALSE,
  nCore = 1
)
x: either a sento_corpus object created with sento_corpus, a quanteda corpus object, a tm SimpleCorpus object, a tm VCorpus object, or a character vector. Only a sento_corpus object incorporates a date dimension. In case of a corpus object, the numeric columns from the docvars are considered as features over which sentiment will be computed. In case of a character vector, sentiment is only computed across lexicons.
lexicons: a sento_lexicons object created using sento_lexicons.
how: a single character vector defining how to perform aggregation within documents or sentences. For the available options, see get_hows()$words.
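The available aggregation schemes can be listed directly; a minimal sketch (get_hows() is exported by sentometrics, but the exact set of options may vary across package versions):
library("sentometrics")
get_hows()$words  # within-document (or within-sentence) aggregation options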
tokens: a list of tokenized documents, or if do.sentence = TRUE a list of lists of tokenized sentences. This allows you to specify your own tokenization scheme. The tokens can, for instance, result from quanteda's tokens function, from the tokenizers package, or from other tools (see examples). Make sure the tokens are constructed from (the texts from) the x argument, are unigrams, and are preferably set to lowercase; otherwise, results may be spurious and errors could occur. By default set to NULL.
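A minimal sketch of passing a custom tokenization, here built with the tokenizers package (an assumption for illustration; any scheme producing lowercase unigrams works):
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
lex <- sento_lexicons(list_lexicons["LM_en"])
# tokenize the texts yourself and hand the result to compute_sentiment()
toks <- tokenizers::tokenize_words(usnews[["texts"]][1:100], lowercase = TRUE)
sentToks <- compute_sentiment(usnews[["texts"]][1:100], lex, how = "counts", tokens = toks)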
do.sentence: a logical to indicate whether the sentiment computation should be done at the sentence level rather than the document level. By default do.sentence = FALSE.
nCore: a positive numeric that will be passed on to the numThreads argument of the setThreadOptions function, to parallelize the sentiment computation across texts. A value of 1 (default) implies no parallelization. Parallelization improves the speed of the sentiment computation only for a sufficiently large corpus.
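As a sketch of the parallel option (assuming the corpusSample and l1 objects created in the examples below; the chosen number of cores is arbitrary):
# spread the per-text sentiment computation over two threads
sentPar <- compute_sentiment(corpusSample, l1, how = "proportional", nCore = 2)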
If x is a sento_corpus object: a sentiment object, i.e., a data.table with the sentiment scores, containing an "id", a "date" and a "word_count" column, and all lexicon-feature sentiment scores columns. The tokenized sentences are not provided but can be obtained as stringi::stri_split_boundaries(texts, type = "sentence"). A sentiment object can be aggregated (into time series) with the aggregate.sentiment function.
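For instance, a minimal aggregation sketch (assuming the sent1 object from the examples below; the ctr_agg() settings are illustrative only):
ctr <- ctr_agg(howDocs = "proportional", howTime = "equal_weight", by = "month", lag = 3)
measures <- aggregate(sent1, ctr)  # sentiment time series across lexicons and features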
If x is a quanteda corpus object: a sentiment scores data.table with an "id" and a "word_count" column, and all lexicon-feature sentiment scores columns.
If x is a tm SimpleCorpus object, a tm VCorpus object, or a character vector: a sentiment scores data.table with an auto-created "id" column, a "word_count" column, and all lexicon sentiment scores columns.
When do.sentence = TRUE, an additional "sentence_id" column alongside the "id" column is added.
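As noted above, the underlying sentences are not returned but can be recovered separately; a small sketch:
data("usnews", package = "sentometrics")
# split a few documents into their sentences
sentences <- stringi::stri_split_boundaries(usnews[["texts"]][1:5], type = "sentence")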
For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp. negative) lexicons (see the do.split option in the sento_lexicons function). All NAs are converted to 0, under the assumption that this is equivalent to no sentiment. By default tokens = NULL, meaning the corpus is internally tokenized as unigrams, with punctuation and numbers (but not stopwords) removed. All tokens are converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence shifters. Word counts are based on that same tokenization.
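A minimal sketch of such a split (relying on the do.split argument of sento_lexicons(); the resulting lexicon names are indicative):
data("list_lexicons", package = "sentometrics")
lexSplit <- sento_lexicons(list_lexicons["LM_en"], do.split = TRUE)
names(lexSplit)  # separate positive and negative versions, e.g. "LM_en_POS" and "LM_en_NEG"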
If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a corresponding "y" column, the polarity of a word detected from a lexicon gets multiplied with the associated value of a valence shifter if it appears right before the detected word (examples: not good, or can't defend) [bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a cluster centered around a detected polarity word [clusters approach]. The latter approach is a simplified version of the one used by the sentimentr package. A cluster amounts to four words before and two words after a polarity word, and a cluster never overlaps with a preceding one. Roughly speaking, the polarity of a cluster is calculated as \(n(1 + 0.80d)S + \sum s\). The polarity score of the detected word is \(S\), \(s\) represents the polarities of any other sentiment words in the cluster, and \(d\) is the difference between the number of amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1), \(n = -1\) and amplifiers are counted as deamplifiers, else \(n = 1\).
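As a worked illustration of this formula (plain arithmetic, not a package call):
# "very good": amplifier "very" gives d = 1, no negators so n = 1, S = 1, no
# other sentiment words so the sum of s is 0
1 * (1 + 0.80 * 1) * 1 + 0      # 1.8
# "not very good": one negator, so n = -1 and the amplifier counts as a
# deamplifier, giving d = -1
-1 * (1 + 0.80 * (-1)) * 1 + 0  # -0.2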
The sentence-level sentiment calculation approaches each sentence as if it were a document. Depending on the input, either the unigrams, bigrams or clusters approach is used. We enhanced the latter approach to follow the default sentimentr settings more closely. These use a cluster of five words before and two words after a polarized word. The cluster is limited to the words after the previous comma and before the next comma. Adversative conjunctions (t = 4) are accounted for here. The cluster is reweighted based on the value \(1 + 0.25adv\), where \(adv\) is the difference between the number of adversative conjunctions found before and after the polarized word.
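A similar back-of-the-envelope illustration of the adversative reweighting (plain arithmetic; the 1.8 is the cluster score from the sketch above):
adv <- 1 - 0              # one adversative conjunction before the polarized word, none after
(1 + 0.25 * adv) * 1.8    # 2.25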
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]][, c("x", "t")])
# from a sento_corpus object - unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")
# from a character vector - bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")
# from a corpus object - clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")
# from an already tokenized corpus - using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)
# from a SimpleCorpus object - unigrams approach
scorp <- tm::SimpleCorpus(tm::DirSource(txt))
sent5 <- compute_sentiment(scorp, l1, how = "proportional")
# from a VCorpus object - unigrams approach
## in contrast to what as.sento_corpus(vcorp) would do, the
## sentiment calculator handles multiple character vectors within
## a single corpus element as separate documents
vcorp <- tm::VCorpus(tm::DirSource(reuters))
sent6 <- compute_sentiment(vcorp, l1)
# from a sento_corpus object - unigrams approach with tf-idf weighting
sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF")
# sentence-by-sentence computation
sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot",
do.sentence = TRUE)
# from a (fake) multilingual corpus
usnews[["language"]] <- "en" # add language column
usnews$language[1:100] <- "fr"
lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr,
"HENRY" = list_lexicons$HENRY_en),
list_valence_shifters$en)
lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr),
list_valence_shifters$fr)
lexicons <- list(en = lEn, fr = lFr)
corpusLang <- sento_corpus(corpusdf = usnews[1:250, ])
sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")