quanteda is in development and will remain so until we declare a 1.0 version, at which time we will only add new functions, not change the names of existing ones. In the meantime, we suggest:
* using `tokenize(mytexts, what = "sentence")` instead of `tokenize(mytexts, "sentence")`, since the argument order is not stable; and
* using named formals rather than relying on current defaults, e.g. `tokenize(mytexts, removePunct = FALSE)`, since the default values are not stable.

All testing should be in tests/testthat/test_
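The suggestion above can be shown directly; a minimal sketch using the calls from the text (it assumes quanteda is installed, and `mytexts` is an illustrative character vector):

```r
library(quanteda)

mytexts <- c("This is a sentence. And another one.",
             "A second document, with punctuation!")

# fragile: relies on positional matching and on current default values
# toks <- tokenize(mytexts, "sentence")

# robust: every non-default argument is named
sents <- tokenize(mytexts, what = "sentence")
toks  <- tokenize(mytexts, removePunct = FALSE)
```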
For performance comparisons, we write up the results and document them in the vignette `performance_comparisons.Rmd`.
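A hedged sketch of the kind of timing comparison such a vignette records; the microbenchmark package and the specific fixed-vs-regex comparison here are illustrative assumptions, not taken from the vignette itself:

```r
# Sketch only: time fixed (binary) matching against a regex alternative.
# Assumes the microbenchmark and stringi packages are installed.
library(microbenchmark)
library(stringi)

words <- stri_rand_strings(10000, 5)   # 10,000 random 5-character tokens
stops <- words[1:100]                  # a hypothetical stoplist

microbenchmark(
  fixed = words[!(words %in% stops)],  # fixed binary matching
  regex = words[!stri_detect_regex(
    words, paste0("^(", paste(stops, collapse = "|"), ")$"))],
  times = 10
)
```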
Development and branches: we add new features through the `workingDev` branch. Before merging `workingDev` into `dev`, we make sure the build passes a full CRAN check.
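The pre-merge check can be run from within R; a sketch assuming the devtools package (running `R CMD check --as-cran` on the built source tarball from the shell works equally well):

```r
# Sketch only: run a full CRAN-style check on the package in the current
# directory before merging workingDev into dev.
library(devtools)
check(".", cran = TRUE)   # cran = TRUE adds the --as-cran flag
```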
* `encoding()` to detect encoding, and replace `iconv()` calls with `stringi::stri_encode()` in `corpus()`
* `tokenize()` based on that package
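A minimal sketch of the `iconv()`-to-stringi replacement described above; the latin1 sample string is illustrative:

```r
# Sketch only: re-encode a latin1 string to UTF-8 with stringi
# instead of iconv().
library(stringi)

x <- "caf\xe9"   # "café" in latin1 bytes

# old: iconv(x, from = "latin1", to = "UTF-8")
utf8 <- stri_encode(x, from = "latin1", to = "UTF-8")

# stringi can also guess the input encoding
stri_enc_detect(x)
```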
* `ntoken()`
* `ntype()` and `nfeature()`
* `phrasetotoken()`
* `ie2010Corpus` (and see if CRAN lets us get away with it)
* `language()`
* `tokenizedTexts` objects:
    * `dfm.tokenizedTexts`
    * `removeFeatures.tokenizedTexts`
    * `syllables.tokenizedTexts`
* `removeFeatures` now much faster, based on fixed binary matches and stringi character classes
* `readability()`
* `nsentence()`
* ngrams added as an option to `tokenize()`
* `lexdiv()` to make the API similar to `readability()` and to use data.table
* `segment()` to make use of the new tokenizer that segments on sentences
* make `bigrams`, `ngrams` punctuation sensitive in the same way that `collocations` is currently
* `collocations` code for bigrams and trigrams, and reduce the internal memory usage
* `corpus.VCorpus()` is fully working
* `dfm` documentation needs to group arguments into sections and describe how these correspond to the logical workflow
* `kwic` to use the new tokenizer, and to allow searches for multi-word regular expressions
* `settings()` and figure out how to add additional objects to a corpus, namely one or more:
* `similarity()`
* `wordstem()`, `stopwords()`, and `syllables()`
* `textmodel`: devise and document a consistent, logical, and easy-to-use-and-remember scheme for textmodels
* `convert()` needs substantial work
* `+` is defined
* resample functionality to enable resampling from different text units
* index (?) for pre-tokenizing and indexing a corpus

Please use the issue page on the GitHub repository, or contact kbenoit@lse.ac.uk directly.
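Two of the items listed above can be sketched as they stand in the current API; since the API is in flux (see the caveats at the top), argument names here may change before 1.0:

```r
# Sketch only: the ngrams option to tokenize(), and removeFeatures()
# applied to a tokenizedTexts object. Argument names reflect the
# development version and may change.
library(quanteda)

txt  <- "Development versions change quickly, so name your arguments."
toks <- tokenize(txt, removePunct = TRUE)           # a tokenizedTexts object
toks <- removeFeatures(toks, stopwords("english"))  # removeFeatures.tokenizedTexts
bi   <- tokenize(txt, ngrams = 2)                   # ngrams option to tokenize()
```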