Here is an overview of the most important anticipated developments, and known bugs or minor unfinished business. If you want to help out on some of these things, contact Samuel Borms, or simply open an issue and file a pull request on GitHub.

Extensions

  • Implement a sento_train() function to for instance generate a lexicon from a corpus.

  • Add topic modeling functionality into the add_features() function (or as part of the sento_train() function).

  • Expand the number of available models in the sento_model() function (e.g. constrained regression, and PCA).

  • Implement an optimization approach into the aggregate.sento_measures(..., do.global = TRUE) function to extract optimized weights across dimensions (make it possibly available through the sento_model() function); this includes allowing weights to be set in the aggregate.sento_measures() function instead of averaging by default.

  • Implement fast textual sentiment computation for lexicons with ngrams.

  • Implement a scale.sentiment() function.

  • Add a head.sento_measures() and a tail.sento_measures() function.

  • Implement a structure to support high-frequency intraday aggregation.

  • Make more lexicons available (e.g. in German and Spanish).

  • Give more control to the user to play with glmnet parameters in the sento_model() function.

  • Write a helper function to aggregate an attributions object into clusters.

  • Resolve inconsistency with data.frame input columns ("text(s)" & "(doc_)id") in the sentometrics, quanteda and tm corpus creators.

  • Prepare functional CRAN version of sentometrics.app package.

  • Find additional computational speed gains (especially after recent additions which introduced some overhead).

  • Add a "binary" option to get_hows()[["words"]] that turns the sentiment computation into an indicator-like calculation (value of 1 if a text has at least one lexicon word).

Tweaks and bugs

  • Optimize parallelization of iterative model runs (e.g. avoid unnecessary copying of objects across cores).

  • Add a delete_features() function as an intuitive counterpart to add_features().

  • Solve issue that column names of sentiment measures output do not deal well with special characters but still get through.

  • Handle data.frame and matrix input in sento_model(..., y, ...) function more consistently.

  • Add references to external textdata package in examples (e.g. for extra lexicons).

  • Be more flexible for the features in a sento_corpus object by also allowing values outside 0 and 1.

  • Make sure subsetting does not maintain a sentiment object when it is not supposed to be.

  • Remove all but one (not all) duplicate entries in the sento_lexicons() function.

  • Make sure you can also add the "language" identifier to a corpus with add_features().