
tethne.model.managers.mallet module

Classes and methods related to the MALLETModelManager.

class tethne.model.managers.mallet.MALLETModelManager(D, feature='unigrams', outpath='/tmp/', temppath=None, mallet_path='./model/bin/mallet-2.0.7')[source]

Bases: tethne.model.managers.ModelManager

Generates an LDAModel from a Corpus using MALLET.

The Corpus should already contain at least one featureset, indicated by the feature parameter, such as wordcounts. You may specify two working directories: temppath is a working directory that will contain intermediate files (e.g. documents, data files, metadata), while outpath will contain the final model and any plots generated during the modeling process. If temppath is not provided, the manager generates and uses a system temporary directory.
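The temppath fallback described above can be sketched with the standard library. This is only an illustration: resolve_temppath is a hypothetical helper, not part of Tethne, and creating a user-supplied directory when it is missing is an assumption about reasonable behavior rather than something the documentation states.

```python
import os
import tempfile


def resolve_temppath(temppath=None):
    """Return a usable working directory (hypothetical helper, not Tethne API)."""
    if temppath is None:
        # No path given: create a fresh system temporary directory.
        return tempfile.mkdtemp()
    # Assumption: create the user-supplied directory if it does not exist yet.
    os.makedirs(temppath, exist_ok=True)
    return temppath
```

Calling resolve_temppath() with no argument returns a new, empty directory on each call, so intermediate files from separate runs do not collide.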

Tethne comes bundled with a recent version of MALLET. If you would rather use your own install, you can do so by providing the mallet_path parameter. This should point to the MALLET install directory, i.e. the directory that contains bin/mallet.
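Before passing a custom mallet_path, it can help to confirm that the directory really contains bin/mallet. The check below is a minimal sketch using only the standard library; mallet_executable is a hypothetical helper name and is not part of Tethne.

```python
import os


def mallet_executable(mallet_path):
    """Return the path to bin/mallet under mallet_path, or raise if it is absent.

    Hypothetical helper for illustration; not part of Tethne.
    """
    exe = os.path.join(mallet_path, 'bin', 'mallet')
    if not os.path.isfile(exe):
        raise IOError('No bin/mallet found under %s' % mallet_path)
    return exe
```

For the install used in the example in this section, mallet_executable('/Applications/mallet-2.0.7') would return the launcher path if MALLET is installed there and raise IOError otherwise.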

topic_over_time Representation of topic k over ‘date’ slice axis.
Parameters:

D : Corpus

feature : str

Key from D.features containing wordcounts (or whatever you want to model with).

outpath : str

Path to output directory.

temppath : str

Path to temporary directory.

mallet_path : str

Path to MALLET install directory (contains bin/mallet).

Examples

Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:

>>> from nltk.corpus import stopwords                 # 1. Get stoplist.
>>> stoplist = stopwords.words()

>>> from tethne.readers import dfr                    # 2. Build Corpus.
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist)

>>> def filt(s, C, DC):                               # 3. Filter wordcounts.
...     if C > 3 and DC > 1 and len(s) > 3:
...         return True
...     return False
>>> C.filter_features('wordcounts', 'wc_filtered', filt)

>>> from tethne.model import MALLETModelManager       # 4. Get Manager.
>>> outpath = '/path/to/my/working/directory'
>>> mallet = '/Applications/mallet-2.0.7'
>>> M = MALLETModelManager(C, 'wc_filtered', outpath, mallet_path=mallet)

>>> M.prep()                                          # 5. Prep model.

>>> model = M.build(Z=50, max_iter=300)               # 6. Build model.
>>> model                                             # (may take a while)
<tethne.model.corpus.ldamodel.LDAModel at 0x10bfac710>

A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath. For example:

[Figure: _images/ldamodel_LL.png]

Behind the scenes, the prep() procedure generates a plain-text corpus file at temppath, along with a metadata file. MALLET’s import-file procedure is then called, which translates the corpus into MALLET’s internal format (also stored at the temppath).
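As a rough illustration of the import step described above, the sketch below assembles a MALLET import-file command line. The --input, --output, and --keep-sequence options are standard MALLET options, but the exact arguments that prep() passes are not documented here, so the function name, file names, and option set are assumptions.

```python
import os


def import_file_command(mallet_path, corpus_file, output_file):
    """Build an argument list for MALLET's import-file procedure.

    Hypothetical helper; the options Tethne actually uses may differ.
    """
    return [
        os.path.join(mallet_path, 'bin', 'mallet'),
        'import-file',
        '--input', corpus_file,     # plain-text corpus, one document per line
        '--output', output_file,    # MALLET's internal binary format
        '--keep-sequence',          # preserve token order, needed for topic models
    ]
```

The resulting list can be handed to subprocess.call() to run the import.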

The build() procedure then invokes MALLET’s train-topics procedure. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics).
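The training step can be pictured the same way. The sketch below builds a train-topics command line from the Z and max_iter values used in the example in this section. The options shown are standard MALLET options; the helper name, the output file names, and the choice of options are assumptions, not a description of what build() actually runs.

```python
import os


def train_topics_command(mallet_path, input_file, outdir, Z=50, max_iter=300):
    """Build an argument list for MALLET's train-topics procedure.

    Hypothetical helper; output file names are illustrative only.
    """
    return [
        os.path.join(mallet_path, 'bin', 'mallet'),
        'train-topics',
        '--input', input_file,
        '--num-topics', str(Z),
        '--num-iterations', str(max_iter),
        '--output-doc-topics', os.path.join(outdir, 'doc_topics.txt'),
        '--output-topic-keys', os.path.join(outdir, 'topic_keys.txt'),
    ]
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.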

For a Corpus with a few thousand Papers, 300-500 iterations are often sufficient to reach convergence for 20-100 topics.

Once the LDAModel is built, you can access its methods directly. See full method descriptions in LDAModel.

For more information about topic modeling with MALLET see this tutorial.

topic_over_time(k, threshold=0.05, mode='documents', normed=True, plot=False, figargs={'figsize': (10, 10)})[source]

Representation of topic k over ‘date’ slice axis.

The Corpus used to initialize the MALLETModelManager must already have been sliced by ‘date’.

Parameters:

k : int

Topic index.

threshold : float

Minimum representation of k in a document.

mode : str

‘documents’ counts the number of documents that contain k; ‘proportions’ sums the representation of k in each document that contains it.

normed : bool

(default: True) Normalizes values by the number of documents in each slice.

plot : bool

(default: False) If True, generates a Matplotlib figure and saves it to the MALLETModelManager outpath.

figargs : dict

kwargs dict for matplotlib.pyplot.figure().

Returns:

keys : array

Keys into ‘date’ slice axis.

R : array

Representation of topic k over time.

Examples

>>> keys, R = M.topic_over_time(1, plot=True)

...should return keys (date) and R (% documents) for topic 1, and generate a plot like this one in your outpath.

[Figure: _images/topic_1_over_time.png]
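To make the threshold, mode, and normed parameters concrete, here is a minimal pure-Python sketch of the kind of calculation topic_over_time describes. It is not Tethne's implementation: the function name, the theta and dates inputs, and the use of >= for the threshold comparison are all assumptions made for illustration.

```python
from collections import defaultdict


def topic_representation(theta, dates, k, threshold=0.05,
                         mode='documents', normed=True):
    """Summarize topic k per date slice (illustrative sketch, not Tethne code).

    theta : list of per-document topic proportions (one list per document).
    dates : list with the 'date' slice key of each document.
    """
    totals = defaultdict(float)   # representation of k in each slice
    sizes = defaultdict(int)      # number of documents in each slice
    for row, date in zip(theta, dates):
        sizes[date] += 1
        if row[k] >= threshold:   # assumption: threshold is inclusive
            # 'documents' counts the document; any other mode sums proportions.
            totals[date] += 1.0 if mode == 'documents' else row[k]
    keys = sorted(sizes)
    if normed:
        R = [totals[d] / sizes[d] for d in keys]
    else:
        R = [totals[d] for d in keys]
    return keys, R


# Four documents, two topics, two date slices.
theta = [[0.9, 0.1], [0.2, 0.8], [0.97, 0.03], [0.5, 0.5]]
dates = [2000, 2000, 2001, 2001]
keys, R = topic_representation(theta, dates, k=1)
print(keys, R)   # [2000, 2001] [1.0, 0.5]
```

In this toy data, both documents from 2000 contain topic 1 above the threshold, while only one of the two documents from 2001 does, so the normalized document counts are 1.0 and 0.5.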