SciPy

tethne.model.managers.dtm module

Classes and methods related to the DTMModelManager.

class tethne.model.managers.dtm.DTMModelManager(D, feature='unigrams', outpath='/tmp', temppath=None, dtm_path='./bin/main')[source]

Bases: tethne.model.managers.ModelManager

Generates a DTMModel from a Corpus using Gerrish’s C++ implementation.

You should be sure to slice your Corpus by ‘date’ using the ‘time_period’ method (for details, see Corpus.slice()).

plot_topic_evolution Plot the probability of the top Nwords words in topic k over time.
topic_over_time Representation of topic k over ‘date’ slice axis.
Parameters:

D : Corpus

outpath : str

Path to output directory.

dtm_path : str

Path to MALLET install directory (contains bin/mallet).

Examples

Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:

>>> from nltk.corpus import stopwords                 #  1. Get stoplist.
>>> stoplist = stopwords.words()

>>> from tethne.readers import dfr                    #  2. Build Corpus.
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist)

>>> def filt(s, C, DC):                           # 3. Filter wordcounts.
...     if C > 3 and DC > 1 and len(s) > 3:
...         return True
...     return False
>>> C.filter_features('wordcounts', 'wc_filtered', filt)

>>> C.slice('date', 'time_period', window_size=5)     #  4. Slice Corpus.

>>> from tethne.model import DTMModelManager          #   5. Get Manager.
>>> outpath = '/path/to/my/working/directory'
>>> dtm = '/path/to/dtm/bin/main'
>>> M = DTMModelManager(C, 'wc_filtered', outpath, dtm_path=dtm)

>>> M.prep()                                          #    6. Prep model.

>>> model = M.build(Z=50)                             #   7. Build model.
>>> model                                             # (may take awhile)
<tethne.model.corpus.dtmmodel.DTMModel at 0x10bfac710>

A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath. For example:

_images/dtmmodel_LL.png

Behind the scenes, the prep() procedure generates data files at temppath describing your Corpus:

  • tethne-vocab.dat contains all of the words in the corpus, one per line.

  • tethne-mult.dat contains wordcounts for each document; words are represented by integer indices corresponding to line numbers in tethne-vocab.dat. Documents are ordered by publication date (earliest to latest).

  • tethne-seq.dat describes how documents are to be apportioned among time-periods. The first line is the number of time periods, and the subsequent lines specify the number of documents in each successive time-period.

  • tethne-meta.dat is a tab-delimted metadata file. Those records occur in the same order as in the documents in tethne-mult.dat. For example:

    id       date    atitle
    10.2307/2437162  1945    SOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA
    10.2307/4353229  1940    ENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS
    10.2307/4353158  1937    SOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS
    

The build() procedure then starts the DTM modeling algorithm. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics). Warning: this implementation of DTM is known to run into memory issues with large vocabularies. If a memory-leak does occur, try using a more restrictive filter to the featureset, using Corpus.filter_features().

Once the DTMModel is built, you can access its methods directly. See full method descriptions in DTMModel. Of special interest are:

DTMModel.list_topic_diachronic
DTMModel.print_topic_diachronic
DTMModel.topic_evolution Generate a plot that shows p(w|z) over time for the top Nwords terms.

To plot the evolution of a topic over time, use plot_topic_evolution().

>>> M.plot_topic_evolution(2, plot=True)

...should generate a plot at outpath called topic_2_evolution.png:

_images/topic_2_evolution.png
plot_topic_evolution(k, Nwords=5, plot=False, figargs={'figsize': (10, 10)})[source]

Plot the probability of the top Nwords words in topic k over time.

If plot is True, generates a plot image at outpath.

TODO: should return a Figure object.

Parameters:

k : int

Topic index.

Nwords : int

Number of words to include in plot.

plot : bool

(default: False) If True, generates a plot image at outpath.

figargs : dict

Keyword arguments to pass to matplotlib.pyplot.plot().

Returns:

keys : list

Start-date of each time-period.

t_series : list

Array of p(w|t) for Nwords for each time-period.

Examples

>>> M.plot_topic_evolution(2, plot=True)

...should generate a plot at outpath called topic_2_evolution.png:

_images/topic_2_evolution.png
topic_over_time(k, threshold=0.05, mode='documents', normed=True, plot=False, figargs={'figsize': (10, 10)})[source]

Representation of topic k over ‘date’ slice axis.

The Corpus used to initialize the DTMModelManager must have been already sliced by ‘date’.

Parameters:

k : int

Topic index.

threshold : float

Minimum representation of k in a document.

mode : str

‘documents’ counts the number documents that contain k; ‘proportions’ sums the representation of k in each document that contains it.

normed : bool

(default: True) Normalizes values by the number of documents in each slice.

plot : bool

(default: False) If True, generates a MatPlotLib figure and saves it to the MALLETModelManager outpath.

figargs : dict

kwargs dict for matplotlib.pyplot.figure().

Returns:

keys : array

Keys into ‘date’ slice axis.

R : array

Representation of topic k over time.

Examples

>>> keys, repr = M.topic_over_time(1, plot=True)

...should return keys (date) and repr (% documents) for topic 1, and generate a plot like this one in your outpath.

_images/topic_1_over_time.png