tethne.model.corpus.ldamodel module
Classes and methods related to the LDAModel.
- class tethne.model.corpus.ldamodel.LDAModel(theta, phi, metadata, vocabulary)
Bases: tethne.model.basemodel.BaseModel
Represents a Latent Dirichlet Allocation (LDA) topic model.
In the LDA model, topics (dimensions) are probability distributions over words (features), and documents (items) are composed of mixtures of topics. For a complete description of the model, see Blei, Ng & Jordan (2003).
To generate an LDAModel from a Corpus using MALLET, use the MALLETModelManager. Additional managers for LDAModels will be added shortly.
You can also initialize an LDAModel directly by providing the following parameters:
- theta describes the proportion of topics (cols) in each document (rows).
- phi describes the topic (rows) distributions over words (cols).
- metadata should map matrix indices for documents onto Paper IDs (or whatever you use to identify documents).
- vocabulary should map matrix indices for words onto word-strings.
Finally, you can use from_mallet() to generate an LDAModel from MALLET output.
Methods:
- list_topic: Yields a list of the top Nwords for topic k.
- list_topics: Yields lists of the top Nwords for each topic.
- print_topic: Yields the top Nwords for topic k as a string.
- print_topics: Yields the top Nwords for each topic, as a string.
Parameters: theta : matrix-like
Distribution of topics (cols) in documents (rows). Rows sum to 1.
phi : matrix-like
Distribution over words (cols) for topics (rows). Rows sum to 1.
metadata : dict
Maps matrix indices onto document metadata.
vocabulary : dict
Maps W indices onto words.
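As a rough illustration of the four parameters above, the following sketch builds toy inputs and checks them against the documented constraints (matching shapes, rows summing to 1). The toy values, the vocabulary words, and the helper names `rows_sum_to_one` and `check_lda_inputs` are hypothetical and not part of tethne. Plain lists are used only to keep the sketch self-contained; whether LDAModel accepts nested lists rather than array types is not confirmed here.

```python
import math

# Hypothetical toy data: 2 documents, 2 topics, 3 words.
theta = [
    [0.75, 0.25],  # document 0: proportions of topics 0 and 1
    [0.5, 0.5],    # document 1
]
phi = [
    [0.5, 0.25, 0.25],     # topic 0: distribution over words 0..2
    [0.125, 0.125, 0.75],  # topic 1
]
# Matrix row index -> document identifier (e.g. a Paper ID).
metadata = {0: '10.2307/1709733', 1: '10.2307/20000814'}
# Matrix column index in phi -> word string (illustrative words).
vocabulary = {0: 'botanical', 1: 'classification', 2: 'systematics'}


def rows_sum_to_one(matrix, tol=1e-9):
    """Return True if every row of a list-of-lists matrix sums to 1."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in matrix)


def check_lda_inputs(theta, phi, metadata, vocabulary):
    """Check shapes, index coverage, and normalization of the inputs."""
    n_docs, n_topics = len(theta), len(theta[0])
    n_words = len(phi[0])
    return (
        len(phi) == n_topics
        and all(len(row) == n_topics for row in theta)
        and all(len(row) == n_words for row in phi)
        and set(metadata) == set(range(n_docs))
        and set(vocabulary) == set(range(n_words))
        and rows_sum_to_one(theta)
        and rows_sum_to_one(phi)
    )


inputs_ok = check_lda_inputs(theta, phi, metadata, vocabulary)
```

With tethne installed, inputs that pass such a check would be handed to the constructor using the documented signature, `LDAModel(theta, phi, metadata, vocabulary)`.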
- dimension(d, top=None, asmatrix=False, **kwargs)
Describes a dimension (e.g. a topic).
Subclass must provide _dimension_description(d) method.
Parameters: d : int
Dimension index.
Returns: description : list
A list of ( feature, weight ) tuples (e.g. word, prob ).
- dimension_items(d, threshold, **kwargs)
Describes a dimension in terms of the items that contain it.
Subclass must provide _dimension_items(d, threshold) method.
Parameters: d : int
Dimension index.
threshold : float
Minimum representation of d in item.
Returns: description : list
A list of ( item, weight ) tuples.
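One plausible reading of dimension_items for an LDA model is sketched below: an item "contains" topic d when its entry in theta is at least threshold. This is an assumption about the behavior, not the library's implementation. The function name `dimension_items_sketch`, the inclusive comparison, the descending sort order, and the toy data are all hypothetical.

```python
def dimension_items_sketch(theta, metadata, d, threshold):
    """List (item, weight) tuples for items where topic d weighs >= threshold.

    theta is a list of rows (documents) over columns (topics); metadata maps
    row indices onto document identifiers.
    """
    matches = [
        (metadata[i], row[d])
        for i, row in enumerate(theta)
        if row[d] >= threshold  # inclusive threshold is an assumption
    ]
    # Strongest items first; the ordering is also an assumption.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 3 documents, 2 topics.
theta = [[0.75, 0.25], [0.5, 0.5], [0.125, 0.875]]
metadata = {0: 'doc-a', 1: 'doc-b', 2: 'doc-c'}
result = dimension_items_sketch(theta, metadata, d=1, threshold=0.5)
```

Here `result` holds the two documents in which topic 1 makes up at least half of the mixture.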
- dimension_relationship(d, e, **kwargs)
Describes the relationship between two dimensions.
Subclass must provide _dimension_relationship(d, e) method.
Parameters: d : int
Dimension index.
e : int
Dimension index.
Returns: relationship : list
A list of ( factor , weight ) tuples.
- item(i, top=None, **kwargs)
Describes an item in terms of dimensions and weights.
Subclass must provide _item_description(i) method.
Parameters: i : int
Index for an item.
top : int
(optional) Number of (highest-weight) dimensions to return.
Returns: description : list
A list of ( dimension , weight ) tuples.
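For an LDA model, describing an item amounts to reading one row of theta. The sketch below shows one way the (dimension, weight) tuples and the optional top cutoff could be produced. The name `item_sketch`, the descending ordering, and the toy row are assumptions for illustration, not tethne's code.

```python
def item_sketch(theta, i, top=None):
    """Return (dimension, weight) tuples for item i, heaviest first.

    If top is given, only the `top` highest-weight dimensions are kept.
    """
    pairs = sorted(enumerate(theta[i]), key=lambda pair: pair[1], reverse=True)
    return pairs if top is None else pairs[:top]


# Hypothetical toy data: 1 document, 3 topics.
theta = [[0.125, 0.625, 0.25]]
top_two = item_sketch(theta, 0, top=2)
everything = item_sketch(theta, 0)
```

Leaving top as None returns every dimension, which mirrors the default in the documented signature.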
- item_relationship(i, j, **kwargs)
Describes the relationship between two items.
Subclass must provide _item_relationship(i, j) method.
Parameters: i : int
Item index.
j : int
Item index.
Returns: relationship : list
A list of ( dimension , weight ) tuples.
- list_topic(k, Nwords=10)
Yields a list of the top Nwords for topic k.
Parameters: k : int
A topic index.
Nwords : int
Number of words to return.
Returns: as_list : list
List of words in topic.
Examples
>>> model.list_topic(1, Nwords=5)
[ 'opposed', 'terminates', 'trichinosis', 'cistus', 'acaule' ]
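The selection of "top" words can be sketched as ranking the columns of row k in phi by probability and translating the winning indices through vocabulary. This is an assumed reading of what list_topic does, not a copy of its implementation; `list_topic_sketch` and the toy values are hypothetical, with words borrowed from the example above.

```python
def list_topic_sketch(phi, vocabulary, k, Nwords=10):
    """Return the Nwords most probable words for topic k.

    phi is a list of rows (topics) over columns (words); vocabulary maps
    column indices onto word strings.
    """
    row = phi[k]
    # Rank word indices by probability, highest first.
    ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
    return [vocabulary[w] for w in ranked[:Nwords]]


# Hypothetical toy data: 2 topics, 3 words.
phi = [[0.5, 0.25, 0.25], [0.125, 0.25, 0.625]]
vocabulary = {0: 'opposed', 1: 'terminates', 2: 'cistus'}

top_words = list_topic_sketch(phi, vocabulary, 1, Nwords=2)
# Joining with ', ' gives the string form shown in the print_topic example.
as_string = ', '.join(top_words)
```

Requesting more words than the vocabulary holds simply returns all of them, since the slice stops at the end of the ranked list.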
- list_topics(Nwords=10)
Yields lists of the top Nwords for each topic.
Parameters: Nwords : int
Number of words to return for each topic.
Returns: as_dict : dict
Keys are topic indices, values are lists of words.
- print_topic(k, Nwords=10)
Yields the top Nwords for topic k as a string.
Parameters: k : int
A topic index.
Nwords : int
Number of words to return.
Returns: as_string : str
Joined list of words in topic.
Examples
>>> model.print_topic(1, Nwords=5)
'opposed, terminates, trichinosis, cistus, acaule'
- class tethne.model.corpus.ldamodel.MALLETLoader(top_doc, word_top, metapath)
Bases: object
Used by from_mallet() to load MALLET output.
- tethne.model.corpus.ldamodel.from_mallet(top_doc, word_top, metadata)
Generate an LDAModel from MALLET output.
MALLET’s LDA topic modeling algorithm produces multiple output files. See the MALLET documentation for details. When invoking MALLET’s train-topics procedure, you should have provided the --output-doc-topics and --word-topic-counts-file parameters; the top_doc and word_top parameters should be paths to those two files.
You should also provide metadata, the path to a tab-separated file containing metadata about the documents used to build the model. The first column should be the ID used in the original corpus files. For example:
10.2307/1709733 1962 BOTANICAL CLASSIFICATION SCIENCE
10.2307/20000814 1974 THE USE OF DIFFERENTIAL SYSTEMATICS IN GEOGRAPHIC RESEARCH AREA
Parameters: top_doc : string
Path to topic-document datafile generated with --output-doc-topics.
word_top : string
Path to word-topic datafile generated with --word-topic-counts-file.
metadata : string
Path to tab-separated metadata file with Paper keys.
Returns: ldamodel : LDAModel
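To make the metadata file format described above concrete, the sketch below reads a tab-separated file and keeps the first column as the document ID. It does not reproduce MALLETLoader. The helper `read_metadata_ids` is hypothetical, mapping row order onto matrix indices is an assumption, and the tab positions after the first column in the sample rows are a guess.

```python
import csv
import io

# Sample rows from the example above. Only the first column (the ID) is
# relied on; how the remaining text splits into columns is a guess.
SAMPLE = (
    "10.2307/1709733\t1962\tBOTANICAL CLASSIFICATION\tSCIENCE\n"
    "10.2307/20000814\t1974\t"
    "THE USE OF DIFFERENTIAL SYSTEMATICS IN GEOGRAPHIC RESEARCH\tAREA\n"
)


def read_metadata_ids(handle):
    """Map row position -> document ID (first column) of a TSV file.

    Assumes rows appear in the same order as the documents in the model.
    """
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    return {index: row[0] for index, row in enumerate(rows)}


# An in-memory file keeps the sketch self-contained; with a real file,
# pass an open file object instead.
ids = read_metadata_ids(io.StringIO(SAMPLE))
```

The path to such a file is what from_mallet expects as its metadata argument, alongside the two MALLET output paths.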