tethne.model.corpus.dtmmodel module
Classes and methods related to the DTMModel.
- class tethne.model.corpus.dtmmodel.DTMModel(e_theta, phi, metadata, vocabulary)
Bases: tethne.model.basemodel.BaseModel
Represents a Dynamic Topic Model (DTM).
The DTM is similar to the LDA model (see LDAModel) except that each topic is permitted to evolve over time (i.e. probabilities associated with terms in the topic can change). For a complete description of the model see Blei & Lafferty 2006.
To generate a DTMModel from a Corpus, use the DTMModelManager, which relies on S. Gerrish’s C++ implementation of DTM. Alternatively, you can build the model externally (e.g. using the Gerrish DTM implementation directly) and then load the results with from_gerrish().
If you are using a different implementation of DTM, you can initialize a DTMModel directly by providing parameters and metadata.
- e_theta should describe the distribution of topics (rows) in documents (cols).
- phi should describe the topic (dimension 0) distributions over words (dimension 1) over time (dimension 2).
- metadata should map matrix indices for documents onto Paper IDs (or whatever you use to identify documents).
- vocabulary should map matrix indices for words onto word-strings.
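The four inputs described above can be sketched with plain nested lists standing in for matrix objects. The toy sizes, the numeric values, the vocabulary words, and the `check_shapes` helper are all assumptions made for illustration; none of them come from Tethne.

```python
# Toy sizes (assumed): Z topics, M documents, W words, T time slices.
Z, M, W, T = 2, 3, 4, 2

# e_theta[z][m]: weight of topic z in document m. Shape (Z, M).
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

# phi[z][w][t]: probability of word w in topic z at time slice t. Shape (Z, W, T).
phi = [
    [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]],
    [[0.1, 0.25], [0.1, 0.25], [0.4, 0.25], [0.4, 0.25]],
]

# Document index -> document identifier; word index -> word string.
metadata = {0: "10.2307/2437162", 1: "10.2307/4353229", 2: "10.2307/4353158"}
vocabulary = {0: "ecotype", 1: "transplant", 2: "taxonomy", 3: "phylogeny"}


def check_shapes(e_theta, phi, metadata, vocabulary):
    """Hypothetical helper: return (Z, M, W, T) or raise ValueError."""
    n_topics = len(e_theta)
    n_docs = len(e_theta[0])
    n_words = len(phi[0])
    n_times = len(phi[0][0])
    if len(phi) != n_topics:
        raise ValueError("e_theta and phi disagree on the number of topics")
    if any(len(row) != n_docs for row in e_theta):
        raise ValueError("e_theta rows have unequal lengths")
    for topic in phi:
        if len(topic) != n_words or any(len(w) != n_times for w in topic):
            raise ValueError("phi is not a regular (Z, W, T) array")
    if set(metadata) != set(range(n_docs)):
        raise ValueError("metadata keys must be the document indices 0..M-1")
    if set(vocabulary) != set(range(n_words)):
        raise ValueError("vocabulary keys must be the word indices 0..W-1")
    return n_topics, n_docs, n_words, n_times


shape = check_shapes(e_theta, phi, metadata, vocabulary)
print(shape)
```

Inputs shaped like these would then be passed as DTMModel(e_theta, phi, metadata, vocabulary), following the signature above. Whether plain lists are accepted in place of matrix objects is not verified here.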
list_topic: Yields the top Nwords for topic k.
list_topics: Yields the top Nwords for each topic.
topic_evolution: Generate a plot that shows p(w|z) over time for the top Nwords terms.
print_topic: Yields the top Nwords for topic k.
print_topics: Yields the top Nwords for each topic.
Parameters: e_theta : matrix-like
Distribution of topics (Z) in documents (M). Shape: (Z, M).
phi : matrix-like
Topic (Z) distribution over words (W), over time (T). Shape: (Z, W, T)
metadata : dict
Maps matrix indices onto document metadata.
vocabulary : dict
Maps W indices onto words.
- dimension(d, top=None, asmatrix=False, **kwargs)
Describes a dimension (e.g. a topic).
Subclass must provide _dimension_description(d) method.
Parameters: d : int
Dimension index.
Returns: description : list
A list of ( feature, weight ) tuples, e.g. ( word, probability ).
- dimension_items(d, threshold, **kwargs)
Describes a dimension in terms of the items that contain it.
Subclass must provide _dimension_items(d, threshold) method.
Parameters: d : int
Dimension index.
threshold : float
Minimum representation of d in item.
Returns: description : list
A list of ( item, weight ) tuples.
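As a rough illustration of the thresholding described above, the sketch below treats items as documents and reads weights from a (Z, M) matrix like e_theta. That mapping, the inclusive `>=` comparison, the descending sort, and the helper name `items_above_threshold` are assumptions, not Tethne's implementation.

```python
def items_above_threshold(e_theta, d, threshold):
    """Return (item index, weight) pairs where dimension d weighs at least threshold."""
    pairs = [(i, w) for i, w in enumerate(e_theta[d]) if w >= threshold]
    # Heaviest items first (an assumed ordering).
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy matrix: 2 topics (rows) by 3 documents (columns).
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

result = items_above_threshold(e_theta, 0, 0.5)
print(result)
```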
- dimension_relationship(d, e, **kwargs)
Describes the relationship between two dimensions.
Subclass must provide _dimension_relationship(d, e) method.
Parameters: d : int
Dimension index.
e : int
Dimension index.
Returns: relationship : list
A list of ( factor , weight ) tuples.
- item(i, top=None, **kwargs)
Describes an item in terms of dimensions and weights.
Subclass must provide _item_description(i) method.
Parameters: i : int
Index for an item.
top : int
(optional) Number of (highest-weight) dimensions to return.
Returns: description : list
A list of ( dimension , weight ) tuples.
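The sketch below shows one way such a description could be read off a (Z, M) topic-by-document matrix: take the column for item i and keep the `top` heaviest dimensions. The helper name `describe_item` and the use of e_theta as the weight source are assumptions for illustration only.

```python
def describe_item(e_theta, i, top=None):
    """Return (dimension, weight) pairs for item i, heaviest first."""
    pairs = [(z, row[i]) for z, row in enumerate(e_theta)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    # top=None returns every dimension; otherwise keep only the first `top`.
    return pairs if top is None else pairs[:top]


# Toy matrix: 2 topics (rows) by 3 documents (columns).
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

full = describe_item(e_theta, 0)
best = describe_item(e_theta, 1, top=1)
print(full, best)
```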
- item_relationship(i, j, **kwargs)
Describes the relationship between two items.
Subclass must provide _item_relationship(i, j) method.
Parameters: i : int
Item index.
j : int
Item index.
Returns: relationship : list
A list of ( dimension , weight ) tuples.
- list_topic(k, t, Nwords=10)
Yields the top Nwords words for topic k at time index t.
Parameters: k : int
A topic index.
t : int
A time index.
Nwords : int
Number of words to return.
Returns: as_list : list
List of words in topic.
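A minimal sketch of the lookup this method describes, assuming phi is indexed as phi[k][w][t] (shape (Z, W, T)) and vocabulary maps word indices to strings. The helper `top_words` and the toy data are hypothetical and do not reproduce Tethne's code.

```python
def top_words(phi, vocabulary, k, t, n_words=10):
    """Return the n_words most probable words for topic k at time index t."""
    weights = [(w, phi[k][w][t]) for w in range(len(phi[k]))]
    weights.sort(key=lambda pair: pair[1], reverse=True)
    return [vocabulary[w] for w, _ in weights[:n_words]]


# One toy topic over 4 words and 2 time slices.
phi = [
    [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]],
]
vocabulary = {0: "ecotype", 1: "transplant", 2: "taxonomy", 3: "phylogeny"}

early = top_words(phi, vocabulary, k=0, t=0, n_words=2)
late = top_words(phi, vocabulary, k=0, t=1, n_words=2)
print(early, late)
```

Because the word probabilities differ between time slices, the same topic index can return different top words for different values of t.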
- list_topics(t, Nwords=10)
Yields the top Nwords words for each topic at time index t.
Parameters: t : int
A time index.
Nwords : int
Number of words to return for each topic.
Returns: as_dict : dict
Keys are topic indices, values are lists of words.
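The dictionary described above could be assembled along the following lines, again assuming phi is indexed as phi[k][w][t]. The helper `top_words_by_topic` and the toy values are illustrative assumptions rather than Tethne's implementation.

```python
import heapq


def top_words_by_topic(phi, vocabulary, t, n_words=10):
    """Map each topic index to its n_words most probable words at time index t."""
    result = {}
    for k, topic in enumerate(phi):
        # Pick the word indices with the largest probability at time t.
        best = heapq.nlargest(n_words, range(len(topic)), key=lambda w: topic[w][t])
        result[k] = [vocabulary[w] for w in best]
    return result


# Two toy topics over 4 words and 2 time slices.
phi = [
    [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]],
    [[0.05, 0.1], [0.15, 0.2], [0.3, 0.3], [0.5, 0.4]],
]
vocabulary = {0: "ecotype", 1: "transplant", 2: "taxonomy", 3: "phylogeny"}

as_dict = top_words_by_topic(phi, vocabulary, t=0, n_words=2)
print(as_dict)
```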
- print_topic(k, t, Nwords=10)
Yields the top Nwords words for topic k at time index t, joined into a single string.
Parameters: k : int
A topic index.
t : int
A time index.
Nwords : int
Number of words to return.
Returns: as_string : str
Joined list of words in topic.
- class tethne.model.corpus.dtmmodel.GerrishLoader(target, metadata_path, vocabulary_path)
Bases: object
Helper class for parsing results from S. Gerrish’s C++ implementation of DTM.
Parameters: target : str
Path to lda-seq output directory.
metadata_path : str
Path to metadata file.
vocabulary_path : str
Path to vocabulary file.
- tethne.model.corpus.dtmmodel.from_gerrish(target, metadata, vocabulary, metadata_key='doi')
Generate a DTMModel from the output of S. Gerrish’s C++ DTM implementation.
The Gerrish DTM implementation generates a large number of data files contained in a directory called lda-seq. The target parameter should be the path to that directory.
metadata should be the path to a tab-delimited metadata file. Its records should occur in the same order as in the corpus data files used to generate the model. For example:
    id                 date    atitle
    10.2307/2437162    1945    SOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA
    10.2307/4353229    1940    ENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS
    10.2307/4353158    1937    SOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS
vocabulary should be the path to a file containing the words used to generate the model, one per line.
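To make the two file formats concrete, the sketch below parses a tab-delimited metadata table and a one-word-per-line vocabulary using only the standard library. It reads from in-memory strings rather than real paths. The helpers `read_metadata` and `read_vocabulary`, the choice of the `id` column as key, and the vocabulary words are assumptions for illustration; this is not how from_gerrish() itself is implemented.

```python
import csv
import io

metadata_text = (
    "id\tdate\tatitle\n"
    "10.2307/2437162\t1945\tSOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA\n"
    "10.2307/4353229\t1940\tENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS\n"
    "10.2307/4353158\t1937\tSOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS\n"
)
vocabulary_text = "ecotype\ntransplant\ntaxonomy\n"


def read_metadata(handle, key="id"):
    """Map row position (document index) onto the value in column `key`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {index: row[key] for index, row in enumerate(reader)}


def read_vocabulary(handle):
    """Map line position (word index) onto the word on that line."""
    words = [line.strip() for line in handle if line.strip()]
    return dict(enumerate(words))


metadata = read_metadata(io.StringIO(metadata_text))
vocabulary = read_vocabulary(io.StringIO(vocabulary_text))
print(metadata)
print(vocabulary)
```

Row order matters here for the same reason it matters in the real files: the position of each record is what ties it to a column of the document matrix.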
Parameters: target : str
Path to lda-seq output directory.
metadata : str
Path to metadata file.
vocabulary : str
Path to vocabulary file.
Returns: A DTMModel built from the lda-seq output.