A simple tokeniser (based on the Penn Treebank conventions) is included. Written sentences normally contain punctuation and contractions. A tokeniser takes such natural-language sentences and returns a string that represents the original in a slightly different, more regular form. Contractions are split (for example, "cannot" becomes "can not"), punctuation is separated from the words it is attached to (though abbreviations are usually left alone), and many other fiddly adjustments are made, all of which make it much easier to write algorithms that analyse the words of the string. Pass in a normal string, with one sentence per line. The return value is the original string in tokenised form. Because punctuation is split off, you can get at the individual words of the string by calling .split() on the result.
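To make the idea concrete, here is a minimal sketch in Python of the kind of transformation a Penn Treebank-style tokeniser performs. It is not this project's code or API: the function name tokenise, the ABBREVIATIONS list and the handful of rules are assumptions made for illustration, and a real tokeniser handles many more cases.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real lists are much longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc."}


def tokenise(sentence):
    """Return the sentence with punctuation and contractions split off."""
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS:
            # Leave known abbreviations (and their full stop) alone.
            tokens.append(chunk)
            continue
        # Put spaces around punctuation, except apostrophes.
        chunk = re.sub(r"([^\w\s'])", r" \1 ", chunk)
        # Split negative contractions: "don't" -> "do n't".
        chunk = re.sub(r"(\w)(n't)\b", r"\1 \2", chunk, flags=re.IGNORECASE)
        # Split clitics: "it's" -> "it 's", "we'll" -> "we 'll".
        chunk = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", chunk,
                       flags=re.IGNORECASE)
        # Split "cannot" -> "can not".
        chunk = re.sub(r"\b(can)(not)\b", r"\1 \2", chunk, flags=re.IGNORECASE)
        tokens.extend(chunk.split())
    return " ".join(tokens)


tokenised = tokenise("I cannot believe it's Mr. Smith's dog, can you?")
print(tokenised)          # I can not believe it 's Mr. Smith 's dog , can you ?
print(tokenised.split())  # the individual tokens
```

Because every token ends up separated by a single space, a plain .split() on the result is all that is needed to get the word list.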
An implementation of the Porter Stemming algorithm is included. A stemmer attempts to remove common suffixes from words, reducing them to something similar to the root morpheme (the 'stem'; e.g. "abatements", "abated" and "abate" all stem to "abat"). This implementation is based on the reference C code, which is particularly ugly but fast (the RB NLP version stems over 23,000 words in half a second on my machine). Stemming is particularly useful in information retrieval, as it allows a search to match words that share a common root even when the words themselves are different. Stemming also reduces index size by about a third. Pass in a single word in lowercase. The return value is the stemmed form of the word.
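The sketch below, in Python, shows why stemming helps retrieval. It implements only Step 1a of Porter's algorithm (the plural suffix rules), not the full stemmer, so it will not reproduce results such as "abat"; the names porter_step_1a and build_index are hypothetical and are not part of this project.

```python
from collections import defaultdict


def porter_step_1a(word):
    """Apply only Step 1a of the Porter algorithm to a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


def build_index(docs, stem):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


docs = {1: "one cat", 2: "two cats"}
index = build_index(docs, porter_step_1a)

# A query for "cats" is stemmed the same way, so it matches both documents.
print(sorted(index[porter_step_1a("cats")]))  # [1, 2]
```

Both "cat" and "cats" share one index entry, which is also where the saving in index size comes from.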
Stop words are words which are considered too common to have any real linguistic relevance, and so can be safely ignored (but only in certain situations, such as indexing). Stop word lists vary widely and are usually closely tied to a particular purpose, so this implementation allows you to import your own (though a simple list useful for many purposes is included as well).
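As a rough illustration in Python, stop word filtering with either a built-in list or an imported one might look like this. The names DEFAULT_STOP_WORDS, load_stop_words and remove_stop_words are made up for the example, and the default list here is far shorter than any list you would actually use.

```python
# Hypothetical, very short default list for illustration only.
DEFAULT_STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "it"}
)


def load_stop_words(lines):
    """Build a stop word set from an iterable of lines, one word per line."""
    return frozenset(line.strip().lower() for line in lines if line.strip())


def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat sat in the hat".split()
print(remove_stop_words(tokens))  # ['cat', 'sat', 'hat']

# Importing your own list: any iterable of lines works, including an open file.
custom = load_stop_words(["cat\n", "\n", "Hat\n"])
print(remove_stop_words(tokens, custom))  # ['The', 'sat', 'in', 'the']
```

Keeping the list as a set makes each lookup cheap, which matters when filtering every token of a large collection at indexing time.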
Collocations are multi-word expressions (such as "New York", "kick the bucket" etc.) whose words occur together significantly more often than chance would predict in a document corpus. They are useful in a number of NLP fields, and especially in IR: tokenising query input so that it includes collocations means that semi-automatic phrase matching is performed. For example, a user searching for "Star Wars" is unlikely to simply search for "Star"; indexing and searching using collocations helps increase the accuracy of the results returned. This implementation of collocation detection uses the t-test to perform the statistical analysis. In the future a likelihood ratio test will be used instead, as this works better over sparse data.
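One common way to apply the t-test to two-word collocations is sketched below in Python; it is an assumption about the general technique, not a description of this project's code. The score is t = (x - mu) / sqrt(s2 / N), where x is the observed bigram probability, mu is the probability expected if the two words were independent, s2 is approximated by x(1 - x), and N is the number of tokens. The names t_score and bigram_t_scores are hypothetical.

```python
import math
from collections import Counter


def t_score(count_w1, count_w2, count_bigram, n):
    """t statistic for a bigram, given word counts, bigram count and corpus size."""
    observed = count_bigram / n                   # sample mean x
    expected = (count_w1 / n) * (count_w2 / n)    # mu under independence
    variance = observed * (1 - observed)          # Bernoulli variance s2
    return (observed - expected) / math.sqrt(variance / n)


def bigram_t_scores(tokens):
    """Score every adjacent word pair in a token list."""
    n = len(tokens)  # token count, used as an approximation of the bigram count
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: t_score(unigrams[pair[0]], unigrams[pair[1]], count, n)
        for pair, count in bigrams.items()
    }


tokens = "new york is big and new york is old".split()
scores = bigram_t_scores(tokens)

# Print candidate collocations, highest score first.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A higher score means stronger evidence that the pair is not a chance co-occurrence; in practice a threshold taken from a t distribution table is applied, and pairs seen only once or twice are treated with suspicion because the estimate is unreliable on sparse data.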
What might be included in the future?
The next item intended to be added to the project is an implementation of the Text Tiling algorithm. Text Tiling is a topic-based document segmenter (an algorithm which splits a document into chunks according to the topic being discussed in each chunk) and works relatively well. After Text Tiling, a basic noun phrase chunker will be added, followed by an English POS (Part Of Speech) tagger. Further into the future, a PLSA topic segmenter (for which results of around 91% have been reported) may be implemented. PLSA, PLSI and ME implementations may also be added in a future release. The best way to make sure any of this happens is to email me, either offering help or just saying that it would be useful to you. The more demand I see for this (regardless of my own needs), the faster I'm likely to get around to doing it.