tethne.model.corpus package
Submodules
tethne.model.corpus.mallet module
Classes and methods related to the MALLETModelManager.
class tethne.model.corpus.mallet.LDAModel(*args, **kwargs)
Bases: tethne.model.Model, tethne.model.corpus.LDAMixin
Generates an LDAModel from a Corpus using MALLET.
The Corpus should already contain at least one featureset, indicated by the feature parameter, such as wordcounts. You may specify two working directories: temppath should be a working directory that will contain intermediate files (e.g. documents, data files, metadata), while outpath will contain the final model and any plots generated during the modeling process. If temppath is not provided, a system temporary directory is generated and used.
Tethne comes bundled with a recent version of MALLET. If you would rather use your own install, you can do so by providing the mallet_path parameter. This should point to the directory containing /bin/mallet.
topic_over_time: Calculate the representation of topic k in the corpus over time.
Parameters:
D : Corpus
feature : str
Key from D.features containing wordcounts (or whatever you want to model with).
outpath : str
Path to output directory.
temppath : str
Path to temporary directory.
mallet_path : str
Path to MALLET install directory (contains bin/mallet).
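The topic_over_time summary above does not say how "representation" is measured. As a minimal sketch, assuming it means the mean proportion of topic k among the documents of each year, the calculation could look like the following. The function name mean_topic_share_by_year and the dict layouts of theta and years are hypothetical, chosen for illustration; they are not Tethne's API.

```python
from collections import defaultdict

def mean_topic_share_by_year(theta, years, k):
    """Mean proportion of topic k among the documents of each year.

    theta: dict mapping document id -> list of topic proportions (hypothetical layout).
    years: dict mapping document id -> publication year (hypothetical layout).
    """
    totals = defaultdict(float)   # summed proportion of topic k per year
    counts = defaultdict(int)     # number of documents per year
    for doc_id, proportions in theta.items():
        year = years[doc_id]
        totals[year] += proportions[k]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Toy data: three documents, two topics.
theta = {"a": [0.5, 0.5], "b": [0.25, 0.75], "c": [1.0, 0.0]}
years = {"a": 2001, "b": 2001, "c": 2002}
trend = mean_topic_share_by_year(theta, years, k=0)
```

Other definitions are equally plausible, for example counting the documents in which topic k exceeds some threshold.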
Examples
Starting with some JSTOR DfR data (with wordcounts), a typical workflow might look something like this:
>>> from nltk.corpus import stopwords                 # 1. Get stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr                    # 2. Build Corpus.
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', 'uni', stoplist)
>>> def filt(s, C, DC):                               # 3. Filter wordcounts.
...     if C > 3 and DC > 1 and len(s) > 3:
...         return True
...     return False
>>> C.filter_features('wordcounts', 'wc_filtered', filt)
>>> from tethne.model import MALLETModelManager       # 4. Get Manager.
>>> outpath = '/path/to/my/working/directory'
>>> mallet = '/Applications/mallet-2.0.7'
>>> M = MALLETModelManager(C, 'wc_filtered', outpath, mallet_path=mallet)
>>> M.prep()                                          # 5. Prep model.
>>> model = M.build(Z=50, max_iter=300)               # 6. Build model.
>>> model                                             # (may take a while)
<tethne.model.corpus.ldamodel.LDAModel at 0x10bfac710>
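The filter in step 3 can be tried out without Tethne. The sketch below assumes that s is the token, C its overall count in the corpus, and DC the number of documents containing it; that reading of the arguments is an assumption based on the example, and the toy documents are made up.

```python
from collections import Counter

def filt(s, C, DC):
    """Keep tokens longer than 3 characters that occur more than 3 times
    overall and in more than 1 document."""
    return C > 3 and DC > 1 and len(s) > 3

# Toy per-document word counts (hypothetical data).
documents = {
    "doc1": Counter({"topic": 3, "the": 10, "model": 1}),
    "doc2": Counter({"topic": 2, "the": 7, "corpus": 5}),
}

total = Counter()      # overall count of each token
doc_freq = Counter()   # number of documents containing each token
for counts in documents.values():
    total.update(counts)
    doc_freq.update(counts.keys())

# Apply the predicate to every token seen in the toy corpus.
kept = sorted(s for s in total if filt(s, total[s], doc_freq[s]))
```

Here only "topic" survives: "the" is too short, "model" is too rare, and "corpus" appears in a single document.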
A plot showing the log-likelihood/topic over modeling iterations should be generated in your outpath.
Behind the scenes, the prep() procedure generates a plain-text corpus file at temppath, along with a metadata file. MALLET’s import-file procedure is then called, which translates the corpus into MALLET’s internal format (also stored at the temppath).
The build() procedure then invokes MALLET’s train-topics procedure. This step may take a considerable amount of time, anywhere from a few minutes (small corpus, few topics) to a few hours (large corpus, many topics).
For a Corpus with a few thousand Papers, 300-500 iterations is often sufficient to achieve convergence for 20-100 topics.
Once the LDAModel is built, you can access its methods directly. See full method descriptions in LDAModel.
For more information about topic modeling with MALLET see this tutorial.
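To make the two MALLET steps concrete, the following sketch builds the command lines that prep() and build() might issue. The command-line options are standard MALLET options, but which ones Tethne actually passes is an assumption, as are the helper name mallet_commands and the file names corpus.txt, corpus.mallet, dt.dat, wt.dat and model.mallet.

```python
import os

def mallet_commands(mallet_path, temppath, outpath, Z=50, max_iter=300):
    """Build argument lists for MALLET's import-file and train-topics steps."""
    mallet_bin = os.path.join(mallet_path, "bin", "mallet")
    corpus_txt = os.path.join(temppath, "corpus.txt")        # hypothetical file name
    corpus_mallet = os.path.join(temppath, "corpus.mallet")  # hypothetical file name
    import_cmd = [
        mallet_bin, "import-file",
        "--input", corpus_txt,
        "--output", corpus_mallet,
        "--keep-sequence",   # topic modeling needs token sequences
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", corpus_mallet,
        "--num-topics", str(Z),
        "--num-iterations", str(max_iter),
        "--output-doc-topics", os.path.join(temppath, "dt.dat"),
        "--word-topic-counts-file", os.path.join(temppath, "wt.dat"),
        "--output-model", os.path.join(outpath, "model.mallet"),
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands(
    "/Applications/mallet-2.0.7", "/tmp/work", "/tmp/out")
```

The sketch only assembles the commands. To run them, each list could be passed to subprocess.call once the corpus file exists.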
mallet_path = '/Users/erickpeirson/Projects/tethne/tethne/model/corpus/../../bin/mallet-2.0.7'
tethne.model.corpus.mallet.mallet_to_phi_featureset(wt_path)
Generate a FeatureSet describing word-topic assignments from a MALLET word-topic output file.
Parameters:
wt_path : str
Full path to the word-topic data file created by MALLET.
Returns:
phi : FeatureSet
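As a rough illustration of what reading such a file involves, the sketch below parses lines in the layout written by MALLET's --word-topic-counts-file option: a word index, the word, then topic:count pairs. That this is the file Tethne reads is an assumption, and the helper name parse_word_topic_counts and the plain-dict result (rather than a FeatureSet) are hypothetical.

```python
from collections import defaultdict

def parse_word_topic_counts(lines):
    """Return {topic: {word: count}} from lines like '0 apple 2:5 0:1'."""
    phi = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue                  # skip blank or malformed lines
        word = fields[1]              # fields[0] is the word's index
        for pair in fields[2:]:
            topic, count = pair.split(":")
            phi[int(topic)][word] = int(count)
    return dict(phi)

# Two sample lines in the assumed format (made-up data).
sample = ["0 apple 2:5 0:1", "1 river 1:7"]
phi = parse_word_topic_counts(sample)
```

With a real file, the same function could be fed an open file handle, since it only iterates over lines.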
tethne.model.corpus.mallet.mallet_to_theta_featureset(dt_path)
Generate a FeatureSet describing document-topic assignments from a MALLET document-topic output file.
Parameters:
dt_path : str
Full path to the document-topic data file created by MALLET.
Returns:
theta : FeatureSet
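A comparable sketch for the document-topic file follows. It assumes the sparse layout produced by MALLET 2.0.7's --output-doc-topics option: a header line starting with #, then one line per document holding a document number, a name, and alternating topic and proportion values. Other MALLET versions write a different layout, so treat the format as an assumption. The helper name parse_doc_topics and the dict result are hypothetical, and document names containing whitespace are not handled.

```python
def parse_doc_topics(lines):
    """Return {document name: {topic: proportion}} from sparse doc-topic lines."""
    theta = {}
    for line in lines:
        if line.startswith("#"):
            continue                  # header line
        fields = line.split()
        if len(fields) < 4:
            continue                  # need at least one topic/proportion pair
        name, pairs = fields[1], fields[2:]
        theta[name] = {
            int(pairs[i]): float(pairs[i + 1])
            for i in range(0, len(pairs) - 1, 2)
        }
    return theta

# Sample lines in the assumed format (made-up data).
sample = [
    "#doc name topic proportion ...",
    "0 paperA 1 0.75 0 0.25",
    "1 paperB 0 0.9 1 0.1",
]
theta = parse_doc_topics(sample)
```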
tethne.model.corpus.gensim module
Module contents
Corpus models describe latent topics (dimensions) that explain the
distribution of features (e.g. words) among documents in a Corpus.
Tethne presently represents two corpus models:
ldamodel.LDAModel
dtmmodel.DTMModel
Most model classes are subclasses of BaseModel. It is assumed that
each model describes a set of items (e.g. Papers or authors), a set
of dimensions that describe those items (e.g. topics), and a set of features
that comprise those dimensions (e.g. words).
class tethne.model.corpus.LDAMixin
Bases: object
list_topic(k, Nwords=10)
List the top Nwords words for topic k.
Examples
>>> model.list_topic(1, Nwords=5)
[ 'opposed', 'terminates', 'trichinosis', 'cistus', 'acaule' ]
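The idea behind listing a topic's top words can be sketched independently of Tethne: rank the words of one topic by weight and keep the first few. The function name top_words and the dict layout of phi are hypothetical, and the weights are made up.

```python
def top_words(phi, k, n=10):
    """Return the n highest-weighted words for topic k.

    phi: dict mapping topic -> {word: weight} (hypothetical layout).
    """
    weights = phi[k]
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:n]

phi = {1: {"opposed": 9, "terminates": 7, "trichinosis": 4, "cistus": 2}}
words = top_words(phi, 1, n=3)
```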