
tethne.model.corpus package

Submodules

tethne.model.corpus.mallet module

Classes and methods related to the MALLETModelManager.

class tethne.model.corpus.mallet.LDAModel(*args, **kwargs)[source]

Bases: tethne.model.Model

Generates an LDAModel from a Corpus using MALLET.

The Corpus should already contain at least one featureset, indicated by the feature parameter, such as wordcounts. You may specify two working directories: temppath should be a working directory that will contain intermediate files (e.g. documents, data files, metadata), while outpath will contain the final model and any plots generated during the modeling process. If temppath is not provided, a system temporary directory is created and used instead.

Tethne comes bundled with a recent version of MALLET. If you would rather use your own install, you can do so by providing the mallet_path parameter. This should point to the directory containing /bin/mallet.

topic_over_time Calculate the representation of topic k in the corpus over time.
Parameters:

D : Corpus

feature : str

Key from D.features containing wordcounts (or whatever you want to model with).

outpath : str

Path to output directory.

temppath : str

Path to temporary directory.

mallet_path : str

Path to MALLET install directory (contains bin/mallet).
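
As a rough illustration of how the path parameters above relate to one another, the following sketch resolves a missing temppath to a system temporary directory and derives the executable location from mallet_path. The helper name resolve_paths is hypothetical and is not part of Tethne; it only mirrors the behavior described in the text.

```python
import os
import tempfile


def resolve_paths(outpath, temppath=None, mallet_path=None):
    """Hypothetical helper (not a Tethne function) mirroring the defaults above."""
    if temppath is None:
        # No temppath given: fall back to a fresh system temporary directory.
        temppath = tempfile.mkdtemp()

    mallet_bin = None
    if mallet_path is not None:
        # mallet_path is the install directory; the executable lives in bin/.
        mallet_bin = os.path.join(mallet_path, 'bin', 'mallet')

    return outpath, temppath, mallet_bin
```

Calling resolve_paths('/path/to/out', mallet_path='/Applications/mallet-2.0.7') would leave outpath untouched, create a temporary directory, and point at bin/mallet inside the install directory.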

Examples

Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:

>>> from nltk.corpus import stopwords                 # 1. Get stoplist.
>>> stoplist = stopwords.words()

>>> from tethne.readers import dfr                    # 2. Build Corpus.
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist)

>>> def filt(s, C, DC):                               # 3. Filter wordcounts.
...     if C > 3 and DC > 1 and len(s) > 3:
...         return True
...     return False
>>> C.filter_features('wordcounts', 'wc_filtered', filt)

>>> from tethne.model import MALLETModelManager       # 4. Get Manager.
>>> outpath = '/path/to/my/working/directory'
>>> mallet = '/Applications/mallet-2.0.7'
>>> M = MALLETModelManager(C, 'wc_filtered', outpath, mallet_path=mallet)

>>> M.prep()                                          # 5. Prep model.

>>> model = M.build(Z=50, max_iter=300)               # 6. Build model.
>>> model                                             # (may take a while)
<tethne.model.corpus.ldamodel.LDAModel at 0x10bfac710>

A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath. For example:

[Figure: _images/ldamodel_LL.png]

Behind the scenes, the prep() procedure generates a plain-text corpus file at temppath, along with a metadata file. MALLET’s import-file procedure is then called, which translates the corpus into MALLET’s internal format (also stored at the temppath).
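
The preparation step described above can be sketched roughly as follows. This is not Tethne's implementation: the function names, the file names corpus.txt and corpus.mallet, and the placeholder label 'en' are assumptions for illustration, and the metadata file is omitted. The command uses MALLET's import-file options --input, --output and --keep-sequence, and is only built here, not executed.

```python
import os
import tempfile


def write_corpus_file(documents, temppath):
    """Write one document per line: id, label and text separated by tabs.

    `documents` maps a document id to a list of tokens. The label 'en'
    and the file name 'corpus.txt' are illustrative assumptions.
    """
    corpus_path = os.path.join(temppath, 'corpus.txt')
    with open(corpus_path, 'w') as f:
        for doc_id, tokens in documents.items():
            f.write('{0}\ten\t{1}\n'.format(doc_id, ' '.join(tokens)))
    return corpus_path


def import_file_command(mallet_bin, corpus_path, temppath):
    """Build (but do not run) a MALLET import-file command."""
    mallet_file = os.path.join(temppath, 'corpus.mallet')
    return [mallet_bin, 'import-file',
            '--input', corpus_path,
            '--output', mallet_file,
            '--keep-sequence']
```

The resulting list could be passed to subprocess.call to run the import once a MALLET install is available.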

The build() procedure then invokes MALLET’s train-topics procedure. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics).
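
For orientation, a train-topics invocation of the kind described above might be assembled like this. The helper name and the output file names are hypothetical, and Tethne may pass different or additional options; the flags shown (--input, --num-topics, --num-iterations, --output-doc-topics, --output-topic-keys) are standard MALLET train-topics options. The command is only constructed, not run.

```python
import os


def train_topics_command(mallet_bin, mallet_file, outpath, Z=50, max_iter=300):
    """Build (but do not run) a MALLET train-topics command.

    Z is the number of topics and max_iter the number of sampling
    iterations, following the build() example in the text. Output file
    names are illustrative assumptions.
    """
    return [mallet_bin, 'train-topics',
            '--input', mallet_file,
            '--num-topics', str(Z),
            '--num-iterations', str(max_iter),
            '--output-doc-topics', os.path.join(outpath, 'doc_topics.txt'),
            '--output-topic-keys', os.path.join(outpath, 'topic_keys.txt')]
```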

For a Corpus with a few thousand Papers, 300-500 iterations is often sufficient to achieve convergence for 20-100 topics.

Once the LDAModel is built, you can access its methods directly. See full method descriptions in LDAModel.

For more information about topic modeling with MALLET see this tutorial.

list_topic(k, Nwords=10)[source]

List the top Nwords words for topic k.

Examples

>>> model.list_topic(1, Nwords=5)
['opposed', 'terminates', 'trichinosis', 'cistus', 'acaule']

list_topics(Nwords=10)[source]

List the top Nwords words for each topic.
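
Conceptually, listing the top words for each topic amounts to ranking the vocabulary by its weight in that topic. The sketch below assumes a topic-word weight matrix stored as a list of lists and a parallel vocabulary list; these data structures and the function name top_words are assumptions, not the way LDAModel stores its results.

```python
def top_words(topic_word, vocabulary, Nwords=10):
    """Return {topic index: [Nwords highest-weighted words]}.

    topic_word[k][w] is the weight of word w in topic k, and
    vocabulary[w] is the corresponding string (assumed layout).
    """
    topics = {}
    for k, weights in enumerate(topic_word):
        # Rank word indices by descending weight within this topic.
        ranked = sorted(range(len(weights)),
                        key=lambda w: weights[w], reverse=True)
        topics[k] = [vocabulary[w] for w in ranked[:Nwords]]
    return topics
```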

load(**kwargs)[source]

mallet_path = '/Users/erickpeirson/Projects/tethne/tethne/bin/mallet-2.0.7'

prep()[source]

print_topics(Nwords=10)[source]

Print the top Nwords words for each topic.

run(**kwargs)[source]

Calls MALLET’s train-topics method.

topic_over_time(k, mode='counts', slice_kwargs={})[source]

Calculate the representation of topic k in the corpus over time.
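
One simple way to think about a topic's representation over time is to sum its per-document proportion within each time slice. The sketch below does this by year. It is an assumption-laden illustration: the function name, the dict-based inputs, and the choice to sum proportions are all hypothetical, and it does not reproduce the mode or slice_kwargs options of the actual method.

```python
from collections import defaultdict


def topic_weight_by_year(doc_topic, doc_year, k):
    """Sum the proportion of topic k over documents, grouped by year.

    doc_topic maps a document id to its list of topic proportions;
    doc_year maps a document id to its publication year (assumed inputs).
    Returns (sorted years, summed weights in the same order).
    """
    totals = defaultdict(float)
    for d, proportions in doc_topic.items():
        totals[doc_year[d]] += proportions[k]
    years = sorted(totals)
    return years, [totals[y] for y in years]
```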

topics_in(d, topn=5)[source]

List the top topn topics in document d.
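
Finding the top topics in a document reduces to sorting that document's topic proportions. A minimal sketch, assuming the proportions for document d are available as a plain list indexed by topic (the function name and input format are hypothetical, and the return format of the real method may differ):

```python
def top_topics(proportions, topn=5):
    """Return the topn (topic index, proportion) pairs, largest first."""
    ranked = sorted(enumerate(proportions),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]
```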

Module contents

Corpus models describe latent topics (dimensions) that explain the distribution of features (e.g. words) among documents in a Corpus.

Tethne presently represents two corpus models:

ldamodel.LDAModel
dtmmodel.DTMModel

Most model classes are subclasses of BaseModel. It is assumed that each model describes a set of items (e.g. Papers or authors), a set of dimensions that describe those items (e.g. topics), and a set of features that comprise those dimensions (e.g. words).