In this application, we create a regular and a sentiment-adjusted version of the well-known news-based Economic Policy Uncertainty (EPU) index, an exercise that fits naturally within the package’s framework. The actual U.S. EPU index from January 1985 up to July 2018 is itself part of the package, see sentometrics::epu.
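For reference, this built-in series can be inspected directly:
head(sentometrics::epu)  # first rows of the actual U.S. EPU index shipped with the package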
By a regular EPU index, we mean an index constructed by largely following the core methodology behind the original news-based EPU index.
We load the built-in corpus of 4145 U.S. news articles published between 1995 and 2014 and transform it into a sento_corpus object. We only include the journal features, which indicate the article source (either The Wall Street Journal or The Washington Post).
library("sentometrics")
library("quanteda")
library("data.table")

data("usnews")
corpus <- sento_corpus(usnews[, c("id", "date", "texts", "wsj", "wapo")])
We take the original keywords used to pinpoint coverage about the economy (E), policy concerns (P) and uncertainty (U), and organize them into a sento_lexicons object.
keywords <- list(
E = c("economy", "economic"),
P = c("congress", "legislation", "white house", "regulation", "deficit", "federal reserve"),
U = c("uncertainty", "uncertain")
)
keywords_dt <- lapply(keywords, function(kw) data.table(x = kw, y = 1))  # a score of 1 per keyword, so sentiment scores equal keyword counts
lex <- sento_lexicons(keywords_dt)
These keyword lexicons are passed to the compute_sentiment() function. For each EPU dimension, the obtained scores represent the number of keywords present in a given news article.
s <- compute_sentiment(corpus, lex, "counts")
s[, -c("date", "word_count")]
## id E--wsj E--wapo P--wsj P--wapo U--wsj U--wapo
## <char> <num> <num> <num> <num> <num> <num>
## 1: 830981846 0 0 0 0 0 0
## 2: 842617067 0 0 0 0 0 0
## 3: 830982165 0 1 0 0 0 0
## 4: 830982389 0 0 0 0 0 0
## 5: 842615996 0 0 0 0 0 0
## ---
## 4141: 842613758 2 0 0 0 0 0
## 4142: 842615135 2 0 0 0 0 0
## 4143: 842617266 2 0 0 0 0 0
## 4144: 842614354 1 0 0 0 0 0
## 4145: 842616130 1 0 0 0 0 0
As we are not interested in the number of keywords but only in whether a keyword was present, some adjustments are needed. In particular, we split the sentiment object and compute an EPU column per newspaper, equal to 1 if keywords from at least two of the three categories show up in an article, and 0 otherwise.
sA <- s[, 1:3]  # id, date and word_count columns
sB <- s[, -c(1:3)]  # E, P and U keyword counts per newspaper
to_epu <- function(x) as.numeric(rowSums(x > 0) >= 2) # >= 3 is too strict for this corpus
sB[, "EPU--wsj" := to_epu(.SD), .SDcols = endsWith(colnames(sB), "wsj")]
sB[, "EPU--wapo" := to_epu(.SD), .SDcols = endsWith(colnames(sB), "wapo")]
s2 <- as.sentiment(cbind(sA, sB[, c("EPU--wsj", "EPU--wapo")]))
This new sentiment object is aggregated, per newspaper, into a monthly time series and a 12-month moving average of that series. Setting howDocs = "equal_weight" together with do.ignoreZeros = FALSE normalizes the monthly values by the combined number of news articles in that month.
# "simple": all weight on the current month; "linear": (near-)equal weights over the past 12 months, i.e., a 12-month moving average
w <- data.frame("simple" = c(rep(0, 11), 1), "linear" = weights_exponential(12, alphas = 10^-10)[, 1])
ctr <- ctr_agg(howDocs = "equal_weight", do.ignoreZeros = FALSE,
               howTime = "own", by = "month", lag = 12, weights = w)
sm <- aggregate(s2, ctr)
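The resulting sento_measures object holds one time series per lexicon, feature and time-weighting combination, named following the lexicon--feature--time pattern (for instance, EPU--wsj--simple). A quick way to verify this, reusing the as.data.table() conversion also applied further below:
colnames(as.data.table(sm))[-1]  # drop the date column, keep the measure names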
The next step is to scale each newspaper-level EPU time series by its standard deviation over a fixed reference period (here, the months before 2005). Instead of scaling to unit standard deviation as in the original methodology, we standardize to a standard deviation of 100.
dt <- as.data.table(subset(sm, date < "2005-01-01"))  # measures over the pre-2005 reference period
sds <- apply(dt[, -1], 2, sd)  # standard deviation of each series before 2005
sm2 <- scale(sm, center = FALSE, scale = sds/100)  # rescale to a pre-2005 standard deviation of 100
subset(sm2, date < "2005-01-01")[["stats"]]
## EPU--wsj--simple EPU--wapo--simple EPU--wsj--linear EPU--wapo--linear
## mean 96.1926353 76.9955241 319.1703100 271.85568619
## sd 100.0000000 100.0000000 100.0000000 100.00000000
## max 380.3967953 389.0856498 644.9521817 526.74844061
## min 0.0000000 0.0000000 179.4097421 93.69645949
## meanCorr 0.1047356 0.1347121 0.1217423 0.08630789
To then obtain the actual EPU index, the newspaper-level series are
averaged by reapplying the aggregate()
function. We are
left with two series, one a moving average of the other.
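A minimal sketch of this final averaging step, assuming the two newspaper features are merged with equal weights into one new feature (the feature name journals is our own choice):
epu <- aggregate(sm2, features = list(journals = c("wsj", "wapo")))
nmeasures(epu)  # 2 series remain: the monthly EPU index and its 12-month moving average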
A sentiment-adjusted EPU index adds a layer of news sentiment analysis to the typical creation process. The resulting index fluctuates between negative and positive values depending on how the newspapers write about topics related to economic policy uncertainty, rather than only measuring whether they write about them.
We start from scratch by reinitializing the corpus.
corpus <- sento_corpus(usnews[, c("id", "date", "texts", "wsj", "wapo")])
We move forward by adding binary features to the corpus for the E, P and U keywords defined earlier. Next, we compute an EPU feature based on the same mechanism, reusing the self-created to_epu() function.
corpus <- add_features(corpus, keywords = keywords, do.binary = TRUE)
dv <- as.data.table(docvars(corpus))  # document variables, now including the binary E, P and U features
dv[, EPU := to_epu(.SD), .SDcols = c("E", "P", "U")]
Having identified the news articles to count in the EPU index, we need to normalize these counts. We do so per newspaper using some data.table magic, and then add the appropriate features to the corpus using add_features().
# compute the total number of articles per journal and month
totArticles <- dv[, date := format(date, "%Y-%m")][,
  lapply(.SD, sum), by = date, .SDcols = c("wsj", "wapo")]
setnames(totArticles, c("wsj", "wapo"), c("wsjT", "wapoT"))
dv <- merge(dv, totArticles, by = "date")
dv[, c("wsj", "wapo") := list((wsj * EPU) / wsjT, (wapo * EPU) / wapoT)]  # normalized per-newspaper counts
for (j in which(colnames(dv) %in% c("wsj", "wapo"))) # replace NaN and Inf values due to zero division
  set(dv, which(is.na(dv[[j]]) | is.infinite(dv[[j]])), j, 0)
corpus <- add_features(corpus, featuresdf = dv[, c("wsj", "wapo", "EPU")])
We continue with a subset of the corpus carrying only those articles that discuss enough EPU material, and clean up the features, keeping only the normalized newspaper-level ones.
corpus <- corpus_subset(corpus, EPU == 1)
docvars(corpus, c("E", "P", "U", "EPU")) <- NULL
The news sentiment layer is added by applying the all-at-once sentiment computation and aggregation function sento_measures(). The popular Harvard General Inquirer lexicon delivers the sentiment scores. Averaging across the newspaper series gives the final sentiment-adjusted EPU index. The specific scaling shown before is left aside.
sentLex <- sento_lexicons(sentometrics::list_lexicons[c("GI_en")])  # Harvard General Inquirer lexicon
ctr <- ctr_agg("counts", "equal_weight", "equal_weight", by = "month", lag = 12)
sm <- sento_measures(corpus, sentLex, ctr)
sm2 <- aggregate(sm, features = list(journals = c("wsj", "wapo")))  # average across the two newspapers
plot(sm2)
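As a quick sanity check of what remains after the aggregation, the dimensions of the final object can be inspected; a small sketch, assuming the get_dimensions() and nmeasures() accessors available in recent package versions:
get_dimensions(sm2)  # remaining lexicon, feature and time dimensions
nmeasures(sm2)       # number of sentiment-adjusted EPU series left after the aggregation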