tethne.model.corpus package
Submodules
tethne.model.corpus.mallet module
Classes and methods related to the MALLETModelManager.
class tethne.model.corpus.mallet.LDAModel(*args, **kwargs)
Bases: tethne.model.Model
Generates an LDAModel from a Corpus using MALLET.
The Corpus should already contain at least one featureset, indicated by the feature parameter, such as wordcounts. You may specify two working directories: temppath should be a working directory that will contain intermediate files (e.g. documents, data files, metadata), while outpath will contain the final model and any plots generated during the modeling process. If temppath is not provided, a system temporary directory is generated and used.
Tethne comes bundled with a recent version of MALLET. If you would rather use your own install, you can do so by providing the mallet_path parameter. This should point to the directory containing /bin/mallet.
topic_over_time: Calculate the representation of topic k in the corpus over time.
Parameters:
D : Corpus
feature : str
Key from D.features containing wordcounts (or whatever you want to model with).
outpath : str
Path to output directory.
temppath : str
Path to temporary directory.
mallet_path : str
Path to MALLET install directory (contains bin/mallet).
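The working-directory behaviour described above can be sketched as follows. This is a minimal illustration and not Tethne's own code: `resolve_paths` is a hypothetical helper, and the only behaviour taken from the description is the fallback to a system temporary directory when temppath is omitted.

```python
import os
import tempfile


def resolve_paths(outpath, temppath=None):
    """Hypothetical helper: return (outpath, temppath) as existing directories."""
    if temppath is None:
        # No temppath given: fall back to a fresh system temporary directory.
        temppath = tempfile.mkdtemp()
    for path in (outpath, temppath):
        # Create the directory if it does not exist yet.
        os.makedirs(path, exist_ok=True)
    return outpath, temppath


# Usage: only outpath is supplied, so temppath is generated automatically.
out_dir = tempfile.mkdtemp()
outpath, temppath = resolve_paths(out_dir)
```

Keeping intermediate files in a separate temporary directory means outpath ends up holding only the final model and plots.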
Examples
Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:
>>> from nltk.corpus import stopwords               # 1. Get stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr                  # 2. Build Corpus.
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist)
>>> def filt(s, C, DC):                             # 3. Filter wordcounts.
...     if C > 3 and DC > 1 and len(s) > 3:
...         return True
...     return False
>>> C.filter_features('wordcounts', 'wc_filtered', filt)
>>> from tethne.model import MALLETModelManager     # 4. Get Manager.
>>> outpath = '/path/to/my/working/directory'
>>> mallet = '/Applications/mallet-2.0.7'
>>> M = MALLETModelManager(C, 'wc_filtered', outpath, mallet_path=mallet)
>>> M.prep()                                        # 5. Prep model.
>>> model = M.build(Z=50, max_iter=300)             # 6. Build model.
>>> model                                           # (may take awhile)
<tethne.model.corpus.ldamodel.LDAModel at 0x10bfac710>
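To see what the filter in step 3 does, here is a small standalone sketch on made-up data. It assumes that s is the feature itself (a word), C is its total count across the corpus, and DC is the number of documents containing it. The argument names suggest this reading, but the text above does not state it, so treat it as an assumption.

```python
from collections import Counter

# Toy per-document wordcounts (made-up data for illustration).
docs = [
    {"gene": 3, "cell": 1, "the": 9},
    {"gene": 2, "protein": 4, "the": 7},
    {"cell": 1, "ox": 5, "the": 8},
]


def filt(s, C, DC):
    """Same predicate as in the workflow: keep frequent, widespread, longer words."""
    return C > 3 and DC > 1 and len(s) > 3


total = Counter()      # assumed meaning of C: occurrences across all documents
doc_freq = Counter()   # assumed meaning of DC: number of documents with the word
for counts in docs:
    total.update(counts)          # adds the per-document counts
    doc_freq.update(counts.keys())  # adds 1 per document containing the word

kept = sorted(w for w in total if filt(w, total[w], doc_freq[w]))
print(kept)  # ['gene']
```

Here "the" fails the length test, "protein" appears in only one document, and "cell" occurs too rarely, so only "gene" survives.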
A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath. For example:
Behind the scenes, the prep() procedure generates a plain-text corpus file at temppath, along with a metadata file. MALLET’s import-file procedure is then called, which translates the corpus into MALLET’s internal format (also stored at the temppath).
The build() procedure then invokes MALLET’s train-topics procedure. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics).
For a Corpus with a few thousand Papers, 300 - 500 iterations is often sufficient to achieve convergence for 20-100 topics.
Once the LDAModel is built, you can access its methods directly. See full method descriptions in LDAModel.
For more information about topic modeling with MALLET see this tutorial.
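The two MALLET steps mentioned above correspond roughly to the command lines built below. This is a hedged sketch, not the exact invocation Tethne uses: the helper functions and all file names are hypothetical, and Tethne may pass additional or different options. The flags themselves are standard MALLET options. The commands are only constructed here, not executed.

```python
import os


def import_command(mallet_path, corpus_file, mallet_file):
    """Build the command line for MALLET's import-file step (not executed)."""
    return [
        os.path.join(mallet_path, "bin", "mallet"), "import-file",
        "--input", corpus_file,     # plain-text corpus, one document per line
        "--output", mallet_file,    # MALLET's internal binary format
        "--keep-sequence",          # topic training needs feature sequences
    ]


def train_command(mallet_path, mallet_file, outdir, Z=50, max_iter=300):
    """Build the command line for MALLET's train-topics step (not executed)."""
    return [
        os.path.join(mallet_path, "bin", "mallet"), "train-topics",
        "--input", mallet_file,
        "--num-topics", str(Z),
        "--num-iterations", str(max_iter),
        "--output-doc-topics", os.path.join(outdir, "doc_topics.txt"),
        "--output-topic-keys", os.path.join(outdir, "topic_keys.txt"),
    ]


# Placeholder paths, chosen only for illustration.
mallet = "/Applications/mallet-2.0.7"
cmd_import = import_command(mallet, "tmp/corpus.txt", "tmp/corpus.mallet")
cmd_train = train_command(mallet, "tmp/corpus.mallet", "out", Z=50, max_iter=300)
print(" ".join(cmd_import))
print(" ".join(cmd_train))
```

Passing either list to `subprocess.call` would run the corresponding step, provided MALLET is installed at the given path.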
list_topic(k, Nwords=10)
List the top Nwords words for topic k.
Examples
>>> model.list_topic(1, Nwords=5)
['opposed', 'terminates', 'trichinosis', 'cistus', 'acaule']
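Conceptually, listing the top words for a topic amounts to ranking the vocabulary by that topic's word weights. The sketch below shows the idea on made-up data. `top_words`, the vocabulary, and the weights are all hypothetical, and the way LDAModel stores its topic-word weights internally may differ.

```python
def top_words(weights, vocabulary, Nwords=10):
    """Return the Nwords highest-weighted words for a single topic."""
    # Rank vocabulary indices by descending weight.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:Nwords]]


vocabulary = ["gene", "cell", "protein", "species", "habitat"]
topic_weights = [0.05, 0.30, 0.40, 0.15, 0.10]  # made-up weights for one topic

print(top_words(topic_weights, vocabulary, Nwords=3))
# ['protein', 'cell', 'species']
```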
mallet_path = '/Users/erickpeirson/Projects/tethne/tethne/bin/mallet-2.0.7'
Module contents
Corpus models describe latent topics (dimensions) that explain the
distribution of features (e.g. words) among documents in a Corpus.
Tethne currently provides two corpus models:
ldamodel.LDAModel
dtmmodel.DTMModel
Most model classes are subclasses of BaseModel. It is assumed that
each model describes a set of items (e.g. Papers or authors), a set
of dimensions that describe those items (e.g. topics), and a set of features
that comprise those dimensions (e.g. words).
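The items/dimensions/features picture can be illustrated with two small lookup tables. This is a toy sketch with made-up names and numbers. It does not reflect how BaseModel or its subclasses actually store their data.

```python
items = ["paper_a", "paper_b"]           # e.g. Papers
dimensions = ["topic_0", "topic_1"]      # e.g. topics
features = ["gene", "cell", "species"]   # e.g. words

# Each item is described by a weight for every dimension.
item_dimension = {
    "paper_a": [0.9, 0.1],
    "paper_b": [0.2, 0.8],
}

# Each dimension is composed of a weight for every feature.
dimension_feature = {
    "topic_0": [0.6, 0.3, 0.1],
    "topic_1": [0.1, 0.2, 0.7],
}


def dominant_dimension(item):
    """Return the dimension with the largest weight for the given item."""
    weights = item_dimension[item]
    return dimensions[weights.index(max(weights))]


def dominant_feature(dimension):
    """Return the feature with the largest weight in the given dimension."""
    weights = dimension_feature[dimension]
    return features[weights.index(max(weights))]


for item in items:
    dim = dominant_dimension(item)
    print(item, dim, dominant_feature(dim))
```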

