Formalizes a collection of texts into a sento_corpus object derived from the quanteda corpus object.
The quanteda package provides a robust text mining infrastructure (see their website), including a handy corpus manipulation toolset.
This function performs a set of checks on the input data and prepares the corpus for further analysis by structurally integrating a date dimension and numeric metadata features.
sento_corpus(corpusdf, do.clean = FALSE)
corpusdf: a data.frame (or a data.table, or a tbl) with the following named columns: a document "id" column (coercible to character mode), a "date" column (as "yyyy-mm-dd"), a "texts" column (in character mode), an optional "language" column (in character mode), and a series of feature columns of type numeric, with values between 0 and 1 to specify the degree of connectedness of a feature to a document.
Features could be, for instance, topics (e.g., legal or economic) or article sources (e.g., online or print).
When no feature column is provided, a feature named "dummyFeature" is added.
All spaces in the names of the features are replaced by '_'.
Feature columns with values not between 0 and 1 are rescaled column-wise.
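The column layout described above can be sketched with a toy input. The documents, the feature names, and the min-max formula below are illustrative assumptions; the source does not state which rescaling formula the package applies, so the manual rescaling only mimics the documented effect.

```r
# Hypothetical toy input; column names follow the corpusdf description.
corpusdf <- data.frame(
  id = c("doc1", "doc2", "doc3"),
  date = c("2020-01-15", "2020-02-01", "2020-02-20"),
  texts = c("Markets rallied after the ruling.",
            "The court upheld the new law.",
            "Online sales grew strongly."),
  economy = c(1, 0, 1),          # binary topic feature, already in [0, 1]
  `legal score` = c(2, 10, 0),   # outside [0, 1], and the name has a space
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace spaces in feature names by '_', as the function is documented to do.
names(corpusdf) <- gsub(" ", "_", names(corpusdf))

# Min-max rescaling to [0, 1] (an assumption, not necessarily the package's
# formula). Note: a constant column would divide by zero here.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
corpusdf$legal_score <- rescale01(corpusdf$legal_score)

# Build the corpus only if sentometrics is installed.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  corp <- sentometrics::sento_corpus(corpusdf = corpusdf)
}
```

Preparing names and ranges beforehand is optional, since the function handles both, but it makes the resulting feature values explicit rather than implicit.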
do.clean: a logical; if TRUE, all texts undergo a cleaning routine to eliminate common textual garbage.
This includes a brute force replacement of HTML tags and non-alphanumeric characters by an empty string.
Use with care if the text is meant to have non-alphanumeric characters! Preferably, cleaning is done outside of this function call.
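Cleaning outside the function call could look like the sketch below. The helper name `clean_texts` and its regular expressions are hypothetical; they are a gentler alternative that keeps punctuation, not a reproduction of the routine behind `do.clean = TRUE`.

```r
# Hypothetical helper: strip HTML tags and tidy whitespace, keep punctuation.
clean_texts <- function(x) {
  x <- gsub("<[^>]+>", "", x)       # drop anything that looks like an HTML tag
  x <- gsub("[[:space:]]+", " ", x) # collapse runs of whitespace
  trimws(x)                         # remove leading and trailing whitespace
}

raw <- c("<p>Sales rose 5%,  says <b>CEO</b>!</p>",
         "  Plain text stays as it is.  ")
cleaned <- clean_texts(raw)
```

The cleaned vector can then be stored in the "texts" column before the corpus is built with `do.clean = FALSE`.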
A sento_corpus object, derived from a quanteda corpus object. The corpus is ordered by date.
A sento_corpus object is a specialized instance of a quanteda corpus. Any quanteda function applicable to its corpus object can also be applied to a sento_corpus object.
However, changing a given sento_corpus object too drastically using some of quanteda's functions might alter the structure the corpus needs (as defined in the corpusdf argument) to serve as an input to other functions of the sentometrics package.
Some functions, including corpus_sample or corpus_subset, do not change the actual corpus structure and may come in handy.
To add further features, use add_features. Binary features are useful as a mechanism to select the texts which have to be integrated in the respective feature-based sentiment measure(s), but this selection applies only when do.ignoreZeros = TRUE.
Because of this (implicit) selection, having complementary features (e.g., "economy" and "noneconomy") makes sense.
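The selection effect of binary features can be illustrated in base R. The sentiment scores and the plain averaging below are made-up assumptions that only convey the idea of ignoring zeros; they are not the package's aggregation code.

```r
# Hypothetical document-level sentiment scores and a binary feature.
sentiment <- c(0.4, -0.2, 0.1, 0.3, -0.5)
economy <- c(1, 0, 1, 1, 0)
noneconomy <- 1 - economy   # complementary feature

# Feature-weighted sentiment: zero wherever the feature is absent.
s_economy <- sentiment * economy
s_noneconomy <- sentiment * noneconomy

# Averaging over all documents dilutes the measure with zeros...
mean_all <- mean(s_economy)
# ...whereas ignoring zeros averages only over the selected documents.
mean_selected <- mean(s_economy[economy != 0])
mean_selected_non <- mean(s_noneconomy[noneconomy != 0])
```

With complementary features, every document contributes to exactly one of the two measures, so the pair splits the corpus cleanly.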
It is also possible to add one non-numerical feature, that is, "language", to designate the language of the corpus texts.
When this feature is provided, a list of lexicons for different languages is expected in the compute_sentiment function.
data("usnews", package = "sentometrics")
# corpus construction
corp <- sento_corpus(corpusdf = usnews)
# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)
# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL
# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL
#> Warning: No remaining features. A 'dummyFeature' feature valued at 1 throughout is added.
if (FALSE) { # not run
  # to add or replace features, use the add_features() function...
  quanteda::docvars(corp, field = c("wsj", "new")) <- 1
}
# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])
#> We detected no features, so we added a dummy feature 'dummyFeature'.
# corpus creation with a qualitative language feature
usnews[["language"]] <- "en"
usnews[["language"]][c(200:400)] <- "nl"
corpusLang <- sento_corpus(corpusdf = usnews)