tethne.model.corpus.ldamodel module

Classes and methods related to the LDAModel.

class tethne.model.corpus.ldamodel.LDAModel(theta, phi, metadata, vocabulary)[source]

Bases: tethne.model.basemodel.BaseModel

Represents a Latent Dirichlet Allocation (LDA) topic model.

In the LDA model, topics (dimensions) are probability distributions over words (features), and documents (items) are composed of mixtures of topics. For a complete description of the model, see Blei, Ng & Jordan (2003).

To generate an LDAModel from a Corpus using MALLET, use the MALLETModelManager. Additional managers for LDAModels will be added shortly.

You can also initialize an LDAModel directly by providing the following parameters:

  • theta describes the proportion of topics (cols) in each document (rows).
  • phi describes the topic (rows) distributions over words (cols).
  • metadata should map matrix indices for documents onto Paper IDs (or whatever you use to identify documents).
  • vocabulary should map matrix indices for words onto word-strings.

Both metadata and vocabulary are dict mappings keyed by matrix index.
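
As a minimal sketch of what these four inputs might look like, the snippet below builds a toy theta and phi as plain nested lists and checks that each row sums to 1. The values, the helper name rows_sum_to_one, and the commented-out constructor call at the end are illustrative assumptions; tethne itself may expect NumPy matrices rather than lists.

```python
import math

# Toy model: 2 documents, 2 topics, 3 words (all values are made up).
theta = [
    [0.7, 0.3],   # document 0: 70% topic 0, 30% topic 1
    [0.2, 0.8],   # document 1
]
phi = [
    [0.5, 0.4, 0.1],   # topic 0: distribution over the 3 words
    [0.1, 0.2, 0.7],   # topic 1
]
metadata = {0: '10.2307/1709733', 1: '10.2307/20000814'}   # row index -> Paper ID
vocabulary = {0: 'botanical', 1: 'classification', 2: 'geographic'}  # col index -> word


def rows_sum_to_one(matrix, tol=1e-9):
    """Hypothetical helper: True if every row is a probability distribution."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in matrix)


assert rows_sum_to_one(theta) and rows_sum_to_one(phi)
assert len(theta) == len(metadata)        # one metadata entry per document
assert len(phi[0]) == len(vocabulary)     # one vocabulary entry per word

# With tethne installed, the model would then be created roughly like this:
# model = LDAModel(theta, phi, metadata, vocabulary)
```

Checking the shapes and row sums before constructing the model catches the most common mistake, which is passing a transposed matrix.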

Finally, you can use from_mallet() to generate an LDAModel from MALLET output.

Methods:

  • list_topic: returns a list of the top Nwords for topic k.
  • list_topics: returns lists of the top Nwords for each topic.
  • print_topic: returns the top Nwords for topic k as a string.
  • print_topics: returns the top Nwords for each topic, as a string.

Parameters:

theta : matrix-like

Distribution of topics (cols) in documents (rows). Rows sum to 1.

phi : matrix-like

Distribution over words (cols) for topics (rows). Rows sum to 1.

metadata : dict

Maps matrix indices onto document metadata.

vocabulary : dict

Maps word (W) indices onto words.

dimension(d, top=None, asmatrix=False, **kwargs)

Describes a dimension (e.g. a topic).

Subclass must provide _dimension_description(d) method.

Parameters:

d : int

Dimension index.

Returns:

description : list

A list of (feature, weight) tuples, e.g. (word, prob).

dimension_items(d, threshold, **kwargs)

Describes a dimension in terms of the items that contain it.

Subclass must provide _dimension_items(d, threshold) method.

Parameters:

d : int

Dimension index.

threshold : float

Minimum representation of d in item.

Returns:

description : list

A list of (item, weight) tuples.
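
One plausible reading of threshold for an LDA model, sketched below: a document is included when its proportion of topic d in theta is at least threshold. The function name dimension_items_sketch, the toy theta, and the descending sort order are assumptions made for illustration, not tethne's implementation.

```python
def dimension_items_sketch(theta, d, threshold):
    """Return (item, weight) tuples for items whose weight on dimension d
    is at least `threshold`, highest weight first."""
    hits = [(i, row[d]) for i, row in enumerate(theta) if row[d] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy document-topic matrix: 3 documents (rows), 2 topics (cols).
theta = [
    [0.7, 0.3],
    [0.2, 0.8],
    [0.5, 0.5],
]

print(dimension_items_sketch(theta, d=1, threshold=0.4))   # [(1, 0.8), (2, 0.5)]
```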

dimension_relationship(d, e, **kwargs)

Describes the relationship between two dimensions.

Subclass must provide _dimension_relationship(d, e) method.

Parameters:

d : int

Dimension index.

e : int

Dimension index.

Returns:

relationship : list

A list of (factor, weight) tuples.

item(i, top=None, **kwargs)

Describes an item in terms of dimensions and weights.

Subclass must provide _item_description(i) method.

Parameters:

i : int

Index for an item.

top : int

(optional) Number of (highest-weight) dimensions to return.

Returns:

description : list

A list of (dimension, weight) tuples.

item_relationship(i, j, **kwargs)

Describes the relationship between two items.

Subclass must provide _item_relationship(i, j) method.

Parameters:

i : int

Item index.

j : int

Item index.

Returns:

relationship : list

A list of (dimension, weight) tuples.

list_topic(k, Nwords=10)[source]

Returns a list of the top Nwords for topic k.

Parameters:

k : int

A topic index.

Nwords : int

Number of words to return.

Returns:

as_list : list

List of words in topic.
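
A sketch of how such a list could be derived from phi and vocabulary: rank the word indices of row k by probability, then translate the top Nwords through the vocabulary mapping. The name top_words and the toy data are hypothetical, and the actual method may differ in tie-breaking and other details.

```python
def top_words(phi, vocabulary, k, Nwords=10):
    """Return the Nwords most probable words for topic k."""
    row = phi[k]
    # Word indices ordered from most to least probable in topic k.
    ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
    return [vocabulary[w] for w in ranked[:Nwords]]


phi = [
    [0.5, 0.4, 0.1],   # topic 0
    [0.1, 0.2, 0.7],   # topic 1
]
vocabulary = {0: 'botanical', 1: 'classification', 2: 'geographic'}

print(top_words(phi, vocabulary, k=1, Nwords=2))   # ['geographic', 'classification']
```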

Examples

>>> model.list_topic(1, Nwords=5)
['opposed', 'terminates', 'trichinosis', 'cistus', 'acaule']

list_topics(Nwords=10)[source]

Returns lists of the top Nwords for each topic.

Parameters:

Nwords : int

Number of words to return for each topic.

Returns:

as_dict : dict

Keys are topic indices, values are list of words.

print_topic(k, Nwords=10)[source]

Returns the top Nwords for topic k as a string.

Parameters:

k : int

A topic index.

Nwords : int

Number of words to return.

Returns:

as_string : str

Joined list of words in topic.

Examples

>>> model.print_topic(1, Nwords=5)
'opposed, terminates, trichinosis, cistus, acaule'

print_topics(Nwords=10)[source]

Returns the top Nwords for each topic, as a string.

Parameters:

Nwords : int

Number of words to return for each topic.

Returns:

as_string : str

Newline-delimited lists of words for each topic.
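
To illustrate the shape of that string, the sketch below joins each topic's words with ', ' (as in the print_topic example) and places one topic per line. The function name format_topics and the sample words are hypothetical. Whether the real output prefixes each line with its topic index is not stated here, so the sketch omits it.

```python
def format_topics(topics):
    """Join each topic's words with ', ' and put one topic per line.

    `topics` maps topic index -> list of words, i.e. the shape that
    list_topics is documented to return.
    """
    return '\n'.join(', '.join(topics[k]) for k in sorted(topics))


topics = {
    0: ['botanical', 'classification'],
    1: ['geographic', 'research'],
}
print(format_topics(topics))
```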

class tethne.model.corpus.ldamodel.MALLETLoader(top_doc, word_top, metapath)[source]

Bases: object

Used by from_mallet() to load MALLET output.

load()[source]

Load an LDAModel from MALLET output.

Returns:

self.model : LDAModel

tethne.model.corpus.ldamodel.from_mallet(top_doc, word_top, metadata)[source]

Generate an LDAModel from MALLET output.

MALLET’s LDA topic modeling algorithm produces multiple output files. See the MALLET documentation for details. When invoking MALLET’s train-topics procedure, you should have provided the --output-doc-topics and --word-topic-counts-file parameters; the top_doc and word_top parameters should be paths to those two files.
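
To give a rough idea of what loading the word-topic counts involves, the sketch below turns count lines into per-topic word distributions. It assumes each line has the form `<word index> <word> <topic>:<count> ...`, which is how MALLET commonly writes this file; verify this against the output of your MALLET version. The function name phi_from_word_topic_counts and the sample lines are hypothetical, and no smoothing is applied, so words never assigned to a topic get no entry for it.

```python
from collections import defaultdict


def phi_from_word_topic_counts(lines):
    """Build per-topic word distributions from word-topic count lines.

    Assumes each line looks like: "<word index> <word> <topic>:<count> ..."
    Returns (phi, vocabulary): phi maps topic -> {word index: probability},
    vocabulary maps word index -> word.
    """
    counts = defaultdict(dict)
    vocabulary = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue   # skip blank or malformed lines
        w = int(fields[0])
        vocabulary[w] = fields[1]
        for pair in fields[2:]:
            topic, count = pair.split(':')
            counts[int(topic)][w] = int(count)

    # Normalize each topic's counts so that they sum to 1.
    phi = {}
    for topic, row in counts.items():
        total = float(sum(row.values()))
        phi[topic] = {w: c / total for w, c in row.items()}
    return phi, vocabulary


sample = [
    "0 botanical 0:6 1:1",
    "1 classification 0:2",
    "2 geographic 1:3",
]
phi, vocabulary = phi_from_word_topic_counts(sample)
print(phi[0])   # {0: 0.75, 1: 0.25}
```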

You should also provide metadata, the path to a tab-separated file containing metadata about the documents used to build the model. The first column should be the ID used in the original corpus files. For example:

10.2307/1709733 1962     BOTANICAL CLASSIFICATION        SCIENCE
10.2307/20000814 1974    THE USE OF DIFFERENTIAL SYSTEMATICS IN GEOGRAPHIC RESEARCH      AREA

Parameters:

top_doc : string

Path to topic-document datafile generated with --output-doc-topics.

word_top : string

Path to word-topic datafile generated with --word-topic-counts-file.

metadata : string

Path to tab-separated metadata file with Paper keys.

Returns:

ldamodel : LDAModel
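
For reference, a tab-separated metadata file like the one described above can be read with the standard csv module, keyed by the ID in the first column. This is only a sketch: load_metadata is a hypothetical helper, and because the meaning of the remaining columns is not specified here, they are kept as an untyped list.

```python
import csv
import io


def load_metadata(handle):
    """Map the ID in the first column to the remaining columns of each row."""
    reader = csv.reader(handle, delimiter='\t')
    return {row[0]: row[1:] for row in reader if row}


# In practice `handle` would be open(path, newline=''); an in-memory
# file with the two example rows keeps the sketch self-contained.
sample = io.StringIO(
    "10.2307/1709733\t1962\tBOTANICAL CLASSIFICATION\tSCIENCE\n"
    "10.2307/20000814\t1974\tTHE USE OF DIFFERENTIAL SYSTEMATICS IN GEOGRAPHIC RESEARCH\tAREA\n"
)
metadata = load_metadata(sample)
print(metadata['10.2307/1709733'])   # ['1962', 'BOTANICAL CLASSIFICATION', 'SCIENCE']
```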