tethne.model.corpus.dtmmodel module

Classes and methods related to the DTMModel.

class tethne.model.corpus.dtmmodel.DTMModel(e_theta, phi, metadata, vocabulary)

Bases: tethne.model.basemodel.BaseModel

Represents a Dynamic Topic Model (DTM).

The DTM is similar to the LDA model (see LDAModel) except that each topic is permitted to evolve over time (i.e. probabilities associated with terms in the topic can change). For a complete description of the model see Blei & Lafferty 2006.

To generate a DTMModel from a Corpus use the DTMModelManager, which relies on S. Gerrish’s C++ implementation of DTM. Alternatively, you can build the model externally (e.g. using the Gerrish DTM implementation directly), and then load the results with from_gerrish().

If you are using a different implementation of DTM, you can initialize a DTMModel directly by providing parameters and metadata.

  • e_theta should describe the distribution of topics (rows) in documents (cols).
  • phi should describe the topic (dimension 0) distributions over words (dimension 1) over time (dimension 2).
  • metadata should map matrix indices for documents onto Paper IDs (or whatever you use to identify documents).
  • vocabulary should map matrix indices for words onto word-strings.

Methods:

  • list_topic: Yields the top Nwords for topic k.
  • list_topics: Yields the top Nwords for each topic.
  • topic_evolution: Generate a plot that shows p(w|z) over time for the top Nwords terms.
  • print_topic: Yields the top Nwords for topic k.
  • print_topics: Yields the top Nwords for each topic.

Parameters:

e_theta : matrix-like

Distribution of topics (Z) in documents (M). Shape: (Z, M).

phi : matrix-like

Topic (Z) distribution over words (W), over time (T). Shape: (Z, W, T)

metadata : dict

Maps matrix indices onto document metadata.

vocabulary : dict

Maps W indices onto words.
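
The shapes described above can be illustrated with a small sketch. It uses plain nested lists and toy values; the sizes, the placeholder words, and the helper shape_of are assumptions made for illustration and are not part of Tethne. The document identifiers are borrowed from the metadata sample shown for from_gerrish().

```python
# Toy sizes (hypothetical): Z=2 topics, M=3 documents, W=4 words, T=2 periods.

# e_theta[z][m]: weight of topic z in document m, shape (Z, M).
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

# phi[z][w][t]: probability of word w in topic z during period t, shape (Z, W, T).
phi = [
    [[0.5, 0.2], [0.3, 0.3], [0.15, 0.1], [0.05, 0.4]],
    [[0.1, 0.4], [0.2, 0.3], [0.3, 0.2], [0.4, 0.1]],
]

# metadata: document (column) index -> document identifier.
metadata = {0: "10.2307/2437162", 1: "10.2307/4353229", 2: "10.2307/4353158"}

# vocabulary: word index -> word string (placeholder words).
vocabulary = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta"}


def shape_of(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


shape_e_theta = shape_of(e_theta)  # (Z, M)
shape_phi = shape_of(phi)          # (Z, W, T)

# The index maps should cover every document column and every word row.
assert set(metadata) == set(range(shape_e_theta[1]))
assert set(vocabulary) == set(range(shape_phi[1]))
```

With Tethne installed, matrices of these shapes would presumably be passed as DTMModel(e_theta, phi, metadata, vocabulary), following the constructor signature. Whether plain nested lists are accepted in place of array objects is not stated here.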

dimension(d, top=None, asmatrix=False, **kwargs)

Describes a dimension (e.g. a topic).

Subclass must provide _dimension_description(d) method.

Parameters:

d : int

Dimension index.

Returns:

description : list

A list of (feature, weight) tuples, e.g. (word, prob).

dimension_items(d, threshold, **kwargs)

Describes a dimension in terms of the items that contain it.

Subclass must provide _dimension_items(d, threshold) method.

Parameters:

d : int

Dimension index.

threshold : float

Minimum representation of d in item.

Returns:

description : list

A list of (item, weight) tuples.
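
One way to read the threshold parameter is as a filter over one row of the topic-by-document matrix. The sketch below is an assumption about the behavior, not the library's implementation: the function name, the inclusive comparison (>=), and the descending sort are all illustrative choices.

```python
def dimension_items_sketch(e_theta, metadata, d, threshold):
    """Return (item, weight) tuples for documents where dimension d is well represented.

    Hypothetical helper: e_theta is a (Z, M) nested list and metadata maps
    document indices onto identifiers.
    """
    hits = [
        (metadata[m], weight)
        for m, weight in enumerate(e_theta[d])
        if weight >= threshold  # inclusive cut-off is an assumption
    ]
    # Strongest representation first (ordering is also an assumption).
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]
metadata = {0: "doc-a", 1: "doc-b", 2: "doc-c"}

items = dimension_items_sketch(e_theta, metadata, d=0, threshold=0.5)
```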

dimension_relationship(d, e, **kwargs)

Describes the relationship between two dimensions.

Subclass must provide _dimension_relationship(d, e) method.

Parameters:

d : int

Dimension index.

e : int

Dimension index.

Returns:

relationship : list

A list of (factor, weight) tuples.

item(i, top=None, **kwargs)

Describes an item in terms of dimensions and weights.

Subclass must provide _item_description(i) method.

Parameters:

i : int

Index for an item.

top : int

(optional) Number of highest-weighted dimensions to return.

Returns:

description : list

A list of (dimension, weight) tuples.
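
For a topic model, describing an item amounts to reading one document column of the topic-by-document matrix. The following is a minimal sketch under that assumption; item_sketch is a hypothetical name, and sorting by descending weight is an illustrative choice rather than documented behavior.

```python
def item_sketch(e_theta, i, top=None):
    """Return (dimension, weight) tuples for item i, heaviest dimension first.

    Hypothetical helper: e_theta is a (Z, M) nested list, so column i holds
    the weight of every topic in document i.
    """
    weights = [(z, row[i]) for z, row in enumerate(e_theta)]
    weights.sort(key=lambda pair: pair[1], reverse=True)
    # top=None keeps every dimension; otherwise keep only the first `top`.
    return weights if top is None else weights[:top]


e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

all_dimensions = item_sketch(e_theta, i=1)
strongest = item_sketch(e_theta, i=1, top=1)
```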

item_relationship(i, j, **kwargs)

Describes the relationship between two items.

Subclass must provide _item_relationship(i, j) method.

Parameters:

i : int

Item index.

j : int

Item index.

Returns:

relationship : list

A list of (dimension, weight) tuples.

list_topic(k, t, Nwords=10)

Returns the top Nwords words for topic k at time index t.

Parameters:

k : int

A topic index.

t : int

A time index.

Nwords : int

Number of words to return.

Returns:

as_list : list

List of words in topic.
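
Selecting the top words for a topic at one time index can be sketched as a sort over one slice of phi. The helper top_words, the toy matrix, and the placeholder vocabulary are assumptions for illustration; this is not Tethne's implementation. The comma-separated join at the end is likewise only a guess at what a printed form might look like.

```python
def top_words(phi, vocabulary, k, t, Nwords=10):
    """Return the Nwords most probable words for topic k at time index t.

    Hypothetical helper: phi is a (Z, W, T) nested list and vocabulary maps
    word indices onto word strings.
    """
    # Pair every word index with its probability in topic k at time t.
    scored = [(w, probs[t]) for w, probs in enumerate(phi[k])]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [vocabulary[w] for w, _ in scored[:Nwords]]


phi = [
    [[0.5, 0.2], [0.3, 0.3], [0.15, 0.1], [0.05, 0.4]],
    [[0.1, 0.4], [0.2, 0.3], [0.3, 0.2], [0.4, 0.1]],
]
vocabulary = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta"}

early = top_words(phi, vocabulary, k=0, t=0, Nwords=2)
late = top_words(phi, vocabulary, k=0, t=1, Nwords=2)

# A joined form, assuming a comma separator.
early_as_string = ", ".join(early)
```

Comparing the two results shows the point of the time index: the same topic can rank different words highest in different periods.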

list_topic_diachronic(k, Nwords=10)

list_topics(t, Nwords=10)

Returns the top Nwords words for each topic at time index t.

Parameters:

t : int

A time index.

Nwords : int

Number of words to return for each topic.

Returns:

as_dict : dict

Keys are topic indices, values are lists of words.
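
The dictionary described above can be sketched by repeating the per-topic ranking for every topic at a fixed time index. The function name and toy data are hypothetical, and the code only illustrates the documented return shape.

```python
def list_topics_sketch(phi, vocabulary, t, Nwords=10):
    """Return {topic index: [top words]} for time index t.

    Hypothetical helper: phi is a (Z, W, T) nested list.
    """
    topics = {}
    for k, topic in enumerate(phi):
        # Rank word indices by their probability in this topic at time t.
        ranked = sorted(range(len(topic)), key=lambda w: topic[w][t], reverse=True)
        topics[k] = [vocabulary[w] for w in ranked[:Nwords]]
    return topics


phi = [
    [[0.5, 0.2], [0.3, 0.3], [0.15, 0.1], [0.05, 0.4]],
    [[0.1, 0.4], [0.2, 0.3], [0.3, 0.2], [0.4, 0.1]],
]
vocabulary = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta"}

as_dict = list_topics_sketch(phi, vocabulary, t=1, Nwords=2)
```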

print_topic(k, t, Nwords=10)

Returns the top Nwords words for topic k at time index t, joined into a single string.

Parameters:

k : int

A topic index.

t : int

A time index.

Nwords : int

Number of words to return.

Returns:

as_string : str

Joined list of words in topic.

print_topic_diachronic(k, Nwords=10)

print_topics(t, Nwords=10)

Returns the top Nwords words for each topic at time index t, as a string.

Parameters:

t : int

A time index.

Nwords : int

Number of words to return for each topic.

Returns:

as_string : str

Newline-delimited lists of words for each topic.

topic_evolution(k, Nwords=5)

Generate a plot that shows p(w|z) over time for the top Nwords terms.

Parameters:

k : int

A topic index.

Nwords : int

Number of words to return.

Returns:

keys : list

Start-date of each time-period.

t_series : list

Array of p(w|t) for Nwords for each time-period.
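
The data behind such a plot can be sketched as follows. Several details are assumptions rather than documented behavior: words are ranked by their mean probability across all periods, period_starts is a hypothetical list of period start dates, and the sketch returns the chosen words as an extra value so the series can be labeled.

```python
def topic_evolution_sketch(phi, vocabulary, k, period_starts, Nwords=5):
    """Return (keys, words, t_series) for topic k.

    Hypothetical helper: phi is a (Z, W, T) nested list. t_series[t] holds the
    probabilities of the selected words during period t.
    """
    T = len(period_starts)
    # Rank words by mean probability over time (an assumed selection rule).
    means = [(w, sum(probs) / T) for w, probs in enumerate(phi[k])]
    means.sort(key=lambda pair: pair[1], reverse=True)
    top = [w for w, _ in means[:Nwords]]

    words = [vocabulary[w] for w in top]
    t_series = [[phi[k][w][t] for w in top] for t in range(T)]
    return list(period_starts), words, t_series


phi = [
    [[0.5, 0.2], [0.3, 0.3], [0.15, 0.1], [0.05, 0.4]],
    [[0.1, 0.4], [0.2, 0.3], [0.3, 0.2], [0.4, 0.1]],
]
vocabulary = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta"}

keys, words, t_series = topic_evolution_sketch(
    phi, vocabulary, k=0, period_starts=[1930, 1940], Nwords=2
)
```

Each row of t_series could then be plotted against keys, one line per entry in words.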

class tethne.model.corpus.dtmmodel.GerrishLoader(target, metadata_path, vocabulary_path)

Bases: object

Helper class for parsing results from S. Gerrish’s C++ implementation of DTM.

Parameters:

target : str

Path to lda-seq output directory.

metadata_path : str

Path to metadata file.

vocabulary_path : str

Path to vocabulary file.

Returns:

DTMModel

load()

tethne.model.corpus.dtmmodel.from_gerrish(target, metadata, vocabulary, metadata_key='doi')

Generate a DTMModel from the output of S. Gerrish’s C++ DTM implementation.

The Gerrish DTM implementation generates a large number of data files contained in a directory called lda-seq. The target parameter should be the path to that directory.

metadata should be the path to a tab-delimited metadata file. Records should occur in the same order as in the corpus data files used to generate the model. For example:

id       date    atitle
10.2307/2437162  1945    SOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA
10.2307/4353229  1940    ENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS
10.2307/4353158  1937    SOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS

vocabulary should be the path to a file containing the words used to generate the model, one per line.
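
To make the two file formats concrete, here is a minimal sketch that parses them with the standard library. It does not use Tethne and is not how from_gerrish() is implemented; the helper names, the key argument, and the sample vocabulary words are assumptions. In-memory text stands in for the files so that the example is self-contained.

```python
import csv
import io


def read_metadata(handle, key="id"):
    """Map document order (0, 1, 2, ...) onto the value in column `key`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {i: row[key] for i, row in enumerate(reader)}


def read_vocabulary(handle):
    """Map line order (0, 1, 2, ...) onto the word found on that line."""
    return {i: line.strip() for i, line in enumerate(handle.read().splitlines())}


# Tab-delimited metadata, using the records from the example above.
metadata_text = (
    "id\tdate\tatitle\n"
    "10.2307/2437162\t1945\tSOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA\n"
    "10.2307/4353229\t1940\tENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS\n"
    "10.2307/4353158\t1937\tSOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS\n"
)

# One word per line (placeholder words).
vocabulary_text = "ecotype\ntransplant\ntaxonomy\n"

metadata = read_metadata(io.StringIO(metadata_text))
vocabulary = read_vocabulary(io.StringIO(vocabulary_text))
```

Because both maps are keyed by position, the order of records in the metadata file has to match the order of documents in the corpus files, as noted above.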

Parameters:

target : str

Path to lda-seq output directory.

metadata : str

Path to metadata file.

vocabulary : str

Path to vocabulary file.

Returns:

DTMModel