tethne.model.corpus.dtmmodel module

Classes and methods related to the DTMModel.

class tethne.model.corpus.dtmmodel.DTMModel(e_theta, phi, metadata, vocabulary)

Bases: tethne.model.basemodel.BaseModel

Represents a Dynamic Topic Model (DTM).

The DTM is similar to the LDA model (see LDAModel) except that each topic is permitted to evolve over time (i.e. probabilities associated with terms in the topic can change). For a complete description of the model see Blei & Lafferty 2006.

To generate a DTMModel from a Corpus use the DTMModelManager, which relies on S. Gerrish’s C++ implementation of DTM. Alternatively, you can build the model externally (e.g. using the Gerrish DTM implementation directly), and then load the results with from_gerrish().

If you are using a different implementation of DTM, you can initialize a DTMModel directly by providing parameters and metadata.

e_theta should describe the distribution of topics (rows) in documents (cols).
phi should describe the topic (dimension 0) distributions over words (dimension 1) over time (dimension 2).
metadata should map matrix indices for documents onto Paper IDs (or whatever you use to identify documents).
vocabulary should map matrix indices for words onto word-strings.
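
As a rough illustration of the four inputs described above, the sketch below builds toy data with plain nested lists standing in for matrix-like objects. The sizes, values, and document identifiers are made up, the `shape` helper is hypothetical, and the constructor call is left commented out because it assumes Tethne is installed (NumPy arrays would likely be the safer choice in practice).

```python
# Toy dimensions (illustrative only): Z topics, M documents, W words, T time slices.
Z, M, W, T = 2, 3, 4, 2

# e_theta[z][m]: representation of topic z in document m. Shape (Z, M).
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

# phi[z][w][t]: probability of word w in topic z at time slice t. Shape (Z, W, T).
phi = [
    [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]],
    [[0.1, 0.25], [0.1, 0.25], [0.4, 0.25], [0.4, 0.25]],
]

# metadata maps document (column) indices onto identifiers (made-up IDs here).
metadata = {0: "paper-a", 1: "paper-b", 2: "paper-c"}

# vocabulary maps word indices onto word-strings.
vocabulary = {0: "gene", 1: "species", 2: "ecotype", 3: "taxonomy"}


def shape(nested):
    """Return the shape of a regular nested list (hypothetical helper)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Sanity-check the shapes before handing the data to the model.
assert shape(e_theta) == (Z, M)
assert shape(phi) == (Z, W, T)

# Assumption: with Tethne installed, the model could then be created directly.
# from tethne.model.corpus.dtmmodel import DTMModel
# model = DTMModel(e_theta, phi, metadata, vocabulary)
```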

`list_topic`	Yields the top `Nwords` for topic `k`.
`list_topics`	Yields the top `Nwords` for each topic.
`topic_evolution`	Generate a plot that shows p(w|z) over time for the top `Nwords` terms.
`print_topic`	Yields the top `Nwords` for topic `k`.
`print_topics`	Yields the top `Nwords` for each topic.

Parameters:

e_theta : matrix-like

Distribution of topics (Z) in documents (M). Shape: (Z, M).

phi : matrix-like

Topic (Z) distribution over words (W), over time (T). Shape: (Z, W, T)

metadata : dict

Maps matrix indices onto document metadata.

vocabulary : dict

Maps W indices onto words.

dimension(d, top=None, asmatrix=False, **kwargs)

Describes a dimension (e.g. a topic).

Subclass must provide _dimension_description(d) method.

Parameters:

d : int

Dimension index.

Returns:

description : list

A list of ( feature, weight ) tuples (e.g. word, prob ).

dimension_items(d, threshold, **kwargs)

Describes a dimension in terms of the items that contain it.

Subclass must provide _dimension_items(d, threshold) method.

Parameters:

d : int

Dimension index.

threshold : float

Minimum representation of d in item.

Returns:

description : list

A list of ( item, weight ) tuples.
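
To make the idea of describing a dimension by the items that contain it more concrete, here is a minimal sketch. It assumes that the dimension is a topic, that items are documents, that weights come from a row of `e_theta`, and that the threshold is inclusive. The function name `items_for_dimension` is hypothetical and this is not Tethne's actual implementation.

```python
def items_for_dimension(e_theta, metadata, d, threshold):
    """List (item, weight) tuples for documents where topic d >= threshold.

    Hypothetical helper; assumes e_theta has shape (Z, M) and that
    metadata maps document indices onto identifiers.
    """
    row = e_theta[d]  # weights of topic d across all documents
    matches = [(metadata[m], w) for m, w in enumerate(row) if w >= threshold]
    # Strongest representation first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): two topics, three documents.
e_theta = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]
metadata = {0: "paper-a", 1: "paper-b", 2: "paper-c"}

description = items_for_dimension(e_theta, metadata, d=0, threshold=0.5)
```

Here `description` holds the two documents in which topic 0 has a weight of at least 0.5, ordered from strongest to weakest.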

dimension_relationship(d, e, **kwargs)

Describes the relationship between two dimensions.

Subclass must provide _dimension_relationship(d, e) method.

Parameters:

d : int

Dimension index.

e : int

Dimension index.

Returns:

relationship : list

A list of ( factor , weight ) tuples.

item(i, top=None, **kwargs)

Describes an item in terms of dimensions and weights.

Subclass must provide _item_description(i) method.

Parameters:

i : int

Index for an item.

top : int

(optional) Number of (highest-weight) dimensions to return.

Returns:

description : list

A list of ( dimension , weight ) tuples.

item_relationship(i, j, **kwargs)

Describes the relationship between two items.

Subclass must provide _item_relationship(i, j) method.

Parameters:

i : int

Item index.

j : int

Item index.

Returns:

relationship : list

A list of ( dimension , weight ) tuples.

list_topic(k, t, Nwords=10)

Returns the top Nwords for topic k at time index t.

Parameters:

k : int

A topic index.

t : int

A time index.

Nwords : int

Number of words to return.

Returns:

as_list : list

List of words in topic.
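
The selection described here can be sketched in a few lines, assuming `phi` is indexed as `phi[k][w][t]` (shape (Z, W, T)) and `vocabulary` maps word indices onto strings. The function name `top_words` and the toy data are hypothetical; this illustrates the idea rather than reproducing Tethne's code.

```python
def top_words(phi, vocabulary, k, t, Nwords=10):
    """Return the Nwords most probable words for topic k at time slice t."""
    # Pair every word index with p(w|z=k) at time slice t.
    weights = [(w, phi[k][w][t]) for w in range(len(phi[k]))]
    # Most probable words first.
    weights.sort(key=lambda pair: pair[1], reverse=True)
    return [vocabulary[w] for w, _ in weights[:Nwords]]


# Toy data (made up): one topic, four words, two time slices.
phi = [
    [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]],
]
vocabulary = {0: "gene", 1: "species", 2: "ecotype", 3: "taxonomy"}

early = top_words(phi, vocabulary, k=0, t=0, Nwords=2)
late = top_words(phi, vocabulary, k=0, t=1, Nwords=2)
```

Because the topic evolves, the same topic index yields different top words at different time indices, which is why `t` is a required argument.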

list_topic_diachronic(k, Nwords=10)

list_topics(t, Nwords=10)

Returns the top Nwords for each topic at time index t.

Parameters:

t : int

A time index.

Nwords : int

Number of words to return for each topic.

Returns:

as_dict : dict

Keys are topic indices, values are list of words.

print_topic(k, t, Nwords=10)

Returns the top Nwords for topic k at time index t, joined into a single string.

Parameters:

k : int

A topic index.

t : int

A time index.

Nwords : int

Number of words to return.

Returns:

as_string : str

Joined list of words in topic.

print_topic_diachronic(k, Nwords=10)

print_topics(t, Nwords=10)

Returns the top Nwords for each topic at time index t, as a newline-delimited string.

Parameters:

t : int

A time index.

Nwords : int

Number of words to return for each topic.

Returns:

as_string : str

Newline-delimited lists of words for each topic.

topic_evolution(k, Nwords=5)

Generate a plot that shows p(w|z) over time for the top Nwords terms.

Parameters:

k : int

A topic index.

Nwords : int

Number of words to return.

Returns:

keys : list

Start-date of each time-period.

t_series : list

Array of p(w|z) for each of the top Nwords terms, for each time-period.
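
The data behind such a plot can be sketched as follows. The documentation does not say how the top terms are chosen, so this sketch assumes they are ranked by mean probability across time slices. It also uses time indices as keys because no start dates are available here. The name `topic_series` is hypothetical, and no plotting is performed.

```python
def topic_series(phi, vocabulary, k, Nwords=5):
    """Return (keys, t_series) for the top Nwords terms of topic k.

    Hypothetical helper, not Tethne's implementation.
    Assumes phi is indexed as phi[k][w][t], with shape (Z, W, T).
    """
    n_times = len(phi[k][0])

    # Assumption: rank words by their mean probability over all time slices.
    ranked = sorted(
        range(len(phi[k])),
        key=lambda w: sum(phi[k][w]) / n_times,
        reverse=True,
    )

    # Assumption: time slices are identified by index, not by start date.
    keys = list(range(n_times))
    # One (word, series) pair per selected term; each series has one value per slice.
    t_series = [(vocabulary[w], list(phi[k][w])) for w in ranked[:Nwords]]
    return keys, t_series


# Toy data (made up): one topic, three words, three time slices.
phi = [[[0.6, 0.3, 0.1], [0.2, 0.3, 0.3], [0.2, 0.4, 0.6]]]
vocabulary = {0: "gene", 1: "species", 2: "ecotype"}

keys, t_series = topic_series(phi, vocabulary, k=0, Nwords=2)
```

Each series in `t_series` could then be drawn against `keys` with a plotting library of your choice.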

class tethne.model.corpus.dtmmodel.GerrishLoader(target, metadata_path, vocabulary_path)

Bases: object

Helper class for parsing results from S. Gerrish’s C++ implementation of DTM.

Parameters:

target : str

Path to lda-seq output directory.

metadata_path : str

Path to metadata file.

vocabulary_path : str

Path to vocabulary file.

Returns:

DTMModel

load()

tethne.model.corpus.dtmmodel.from_gerrish(target, metadata, vocabulary, metadata_key='doi')

Generate a DTMModel from the output of S. Gerrish’s C++ DTM implementation.

The Gerrish DTM implementation generates a large number of data files contained in a directory called lda-seq. The target parameter should be the path to that directory.

metadata should be the path to a tab-delimited metadata file. The records should occur in the same order as in the corpus data files used to generate the model. For example:

id       date    atitle
2307/2437162  1945    SOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA
2307/4353229  1940    ENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS
2307/4353158  1937    SOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS

vocabulary should be the path to a file containing the words used to generate the model, one per line.

Parameters:

target : str

Path to lda-seq output directory.

metadata : str

Path to metadata file.

vocabulary : str

Path to vocabulary file.

Returns:

DTMModel
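
To illustrate the two input file formats described above, the sketch below parses a tab-delimited metadata file and a one-word-per-line vocabulary file into index-keyed dicts. The helper names `read_metadata` and `read_vocabulary` are hypothetical, the vocabulary words are made up, and this is not how from_gerrish() reads the files internally. In-memory strings stand in for real files.

```python
import csv
import io


def read_metadata(handle):
    """Map row position onto a record dict, using the header row for keys."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {i: row for i, row in enumerate(reader)}


def read_vocabulary(handle):
    """Map line position onto a word, ignoring blank lines."""
    words = [line.strip() for line in handle if line.strip()]
    return dict(enumerate(words))


# In-memory stand-ins for the metadata and vocabulary files.
metadata_file = io.StringIO(
    "id\tdate\tatitle\n"
    "2307/2437162\t1945\tSOME ECOTYPIC RELATIONS OF DESCHAMPSIA CAESPITOSA\n"
    "2307/4353229\t1940\tENVIRONMENTAL INFLUENCE AND TRANSPLANT EXPERIMENTS\n"
    "2307/4353158\t1937\tSOME FUNDAMENTAL PROBLEMS OF TAXONOMY AND PHYLOGENETICS\n"
)
vocabulary_file = io.StringIO("ecotype\nspecies\ntaxonomy\n")

metadata = read_metadata(metadata_file)
vocabulary = read_vocabulary(vocabulary_file)
```

Because records are keyed by position, the order of rows in the metadata file has to match the order of documents in the corpus data files, as noted above.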
