tethne.model.managers.dtm module¶
Classes and methods related to the DTMModelManager.
- class tethne.model.managers.dtm.DTMModelManager(D, feature='unigrams', outpath='/tmp', temppath=None, dtm_path='./bin/main')[source]¶
Bases: tethne.model.managers.ModelManager
Generates a DTMModel from a Corpus using Gerrish’s C++ implementation.
You should be sure to slice your Corpus by ‘date’ using the ‘time_period’ method (for details, see Corpus.slice()).
plot_topic_evolution Plot the probability of the top Nwords words in topic k over time. topic_over_time Representation of topic k over ‘date’ slice axis. Parameters: D : Corpus
outpath : str
Path to output directory.
dtm_path : str
Path to MALLET install directory (contains bin/mallet).
Examples
Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:
>>> from nltk.corpus import stopwords # 1. Get stoplist. >>> stoplist = stopwords.words() >>> from tethne.readers import dfr # 2. Build Corpus. >>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist) >>> def filt(s, C, DC): # 3. Filter wordcounts. ... if C > 3 and DC > 1 and len(s) > 3: ... return True ... return False >>> C.filter_features('wordcounts', 'wc_filtered', filt) >>> C.slice('date', 'time_period', window_size=5) # 4. Slice Corpus. >>> from tethne.model import DTMModelManager # 5. Get Manager. >>> outpath = '/path/to/my/working/directory' >>> dtm = '/path/to/dtm/bin/main' >>> M = DTMModelManager(C, 'wc_filtered', outpath, dtm_path=dtm) >>> M.prep() # 6. Prep model. >>> model = M.build(Z=50) # 7. Build model. >>> model # (may take awhile) <tethne.model.corpus.dtmmodel.DTMModel at 0x10bfac710>
A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath. For example:
Behind the scenes, the prep() procedure generates data files at temppath describing your Corpus:
tethne-vocab.dat contains all of the words in the corpus, one per line.
tethne-mult.dat contains wordcounts for each document; words are represented by integer indices corresponding to line numbers in tethne-vocab.dat. Documents are ordered by publication date (earliest to latest).
tethne-seq.dat describes how documents are to be apportioned among time-periods. The first line is the number of time periods, and the subsequent lines specify the number of documents in each successive time-period.
tethne-meta.dat is a tab-delimted metadata file. Those records occur in the same order as in the documents in tethne-mult.dat. For example:
id date atitle 10.2307/2437162 1945 SOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA 10.2307/4353229 1940 ENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS 10.2307/4353158 1937 SOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS
The build() procedure then starts the DTM modeling algorithm. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics). Warning: this implementation of DTM is known to run into memory issues with large vocabularies. If a memory-leak does occur, try using a more restrictive filter to the featureset, using Corpus.filter_features().
Once the DTMModel is built, you can access its methods directly. See full method descriptions in DTMModel. Of special interest are:
DTMModel.list_topic_diachronic DTMModel.print_topic_diachronic DTMModel.topic_evolution Generate a plot that shows p(w|z) over time for the top Nwords terms. To plot the evolution of a topic over time, use plot_topic_evolution().
>>> M.plot_topic_evolution(2, plot=True)
...should generate a plot at outpath called topic_2_evolution.png:
- plot_topic_evolution(k, Nwords=5, plot=False, figargs={'figsize': (10, 10)})[source]¶
Plot the probability of the top Nwords words in topic k over time.
If plot is True, generates a plot image at outpath.
TODO: should return a Figure object.
Parameters: k : int
Topic index.
Nwords : int
Number of words to include in plot.
plot : bool
(default: False) If True, generates a plot image at outpath.
figargs : dict
Keyword arguments to pass to matplotlib.pyplot.plot().
Returns: keys : list
Start-date of each time-period.
t_series : list
Array of p(w|t) for Nwords for each time-period.
Examples
>>> M.plot_topic_evolution(2, plot=True)
...should generate a plot at outpath called topic_2_evolution.png:
- topic_over_time(k, threshold=0.05, mode='documents', normed=True, plot=False, figargs={'figsize': (10, 10)})[source]¶
Representation of topic k over ‘date’ slice axis.
The Corpus used to initialize the DTMModelManager must have been already sliced by ‘date’.
Parameters: k : int
Topic index.
threshold : float
Minimum representation of k in a document.
mode : str
‘documents’ counts the number documents that contain k; ‘proportions’ sums the representation of k in each document that contains it.
normed : bool
(default: True) Normalizes values by the number of documents in each slice.
plot : bool
(default: False) If True, generates a MatPlotLib figure and saves it to the MALLETModelManager outpath.
figargs : dict
kwargs dict for matplotlib.pyplot.figure().
Returns: keys : array
Keys into ‘date’ slice axis.
R : array
Representation of topic k over time.
Examples
>>> keys, repr = M.topic_over_time(1, plot=True)
...should return keys (date) and repr (% documents) for topic 1, and generate a plot like this one in your outpath.