Tethne and Gensim
Gensim is a lovely package for topic modeling in Python. As of v0.8.1 there are a few ways to use Tethne and Gensim together for fun or profit.
Export a Gensim-friendly bag-of-words corpus
Suppose you want to topic model (in Gensim) abstracts from a Web of Science
collection. Here’s a fairly typical approach to generating a
StructuredFeatureSet from abstracts:
>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/my/data')
>>> from nltk.tokenize import word_tokenize
>>> corpus.index_feature('abstract', word_tokenize, structured=True)
At this point you might do some filtering or transformation (see Generating and Visualizing Topic Models with Tethne and MALLET).
Gensim’s gensim.models.ldamodel.LdaModel requires a “corpus” in
“bag of words” format. This is just a list of lists, in which each sub-list is
a sequence of (token id, count) tuples for a particular document (see the Gensim
documentation for more details). Gensim also needs a vocabulary (id2word),
which is just a dict that maps integer keys to string representations of the
words in your corpus. You can get both from your StructuredFeatureSet:
>>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()
You can pass these objects directly to LdaModel. For example:
>>> from gensim import models
>>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
...                                  num_topics=20, update_every=1,
...                                  chunksize=10000, passes=1)
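To make the bag-of-words format described above concrete, here is a minimal sketch that builds the two structures by hand in plain Python. It does not use Tethne or Gensim, and the toy documents and the names `toy_corpus` and `toy_id2word` are made up for illustration; the ids that `to_gensim_corpus()` actually assigns may differ.

```python
from collections import Counter

# Two toy documents, already tokenized (hypothetical data).
documents = [
    ["topic", "models", "find", "topic", "structure"],
    ["models", "need", "data"],
]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for tokens in documents:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Vocabulary in the shape Gensim expects: integer id -> token string.
toy_id2word = {i: token for token, i in token2id.items()}

# Bag-of-words corpus: one list of (token id, count) tuples per document.
toy_corpus = [
    sorted(Counter(token2id[t] for t in tokens).items())
    for tokens in documents
]

print(toy_corpus)
print(toy_id2word)
```

Each inner list describes one document, and every integer that appears in it can be looked up in the vocabulary dict.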
Export a raw text corpus
If you would rather start with the document structure described in this
tutorial (i.e. each document is a list of strings), you can request raw output
from to_gensim_corpus() instead. It will still return a (corpus, vocabulary)
tuple, except that vocabulary here is None, so you can discard it.
This can be useful if you are using Gensim’s phrase detection model. Note, however,
that this really only makes sense for StructuredFeatureSets, since plain
FeatureSets do not preserve order.
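The point about order can be illustrated with a small plain-Python sketch. It is not Gensim's phrase detection algorithm, only a toy bigram count over made-up tokens, showing what information an ordered token list carries that a bag of words does not.

```python
from collections import Counter

# One toy document as an ordered list of tokens (hypothetical data).
tokens = ["topic", "model", "fits", "topic", "model"]

# With order preserved, adjacent pairs can be counted. This is the kind of
# evidence a phrase detector needs.
bigrams = Counter(zip(tokens, tokens[1:]))

# A bag of words keeps only the counts; adjacency is gone.
bag = Counter(tokens)

# Any reordering of the tokens yields the same bag, so bigrams cannot be
# recovered from it.
reordered_bag = Counter(sorted(tokens))

print(bigrams.most_common(1))
print(bag == reordered_bag)
```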
Let Tethne talk to Gensim
Tethne also provides a wrapper for Gensim, GensimLDAModel, which has
a nearly identical API to the MALLET-backed LDAModel described in
Generating and Visualizing Topic Models with Tethne and MALLET.
>>> from tethne import GensimLDAModel
>>> model = GensimLDAModel(corpus, featureset_name='abstract')
>>> model.fit(Z=20)
Pretty much anything that you can do with LDAModel (e.g. building
topic-based graphs) you can also do with GensimLDAModel.
theta from a Gensim
If you have already fit your model with Gensim, and simply want to work with
the results as a FeatureSet (e.g. to create a graph), you can use
gensim_to_theta_featureset() and gensim_to_phi_featureset() to
load document-topic and topic-word assignments, respectively.
For example, suppose that you want to build a topic-cooccurrence graph. In the
code block below, ldamodel is a fitted Gensim LdaModel, and corpus is the
bag-of-words corpus that you used to create the model.
>>> from tethne import gensim_to_theta_featureset
>>> theta = gensim_to_theta_featureset(ldamodel, corpus)
>>> from tethne import feature_cooccurrence
>>> graph = feature_cooccurrence(theta, min_weight=0.05)
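As a rough intuition for what a topic-cooccurrence graph captures, the sketch below counts, for each pair of topics, the documents in which both topics reach a threshold weight. This is an illustration in plain Python, not Tethne's implementation: the data, the `topic_cooccurrence` helper, and the reading of the threshold as a per-document cutoff are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical document-topic weights: doc id -> [(topic id, weight), ...].
theta_like = {
    "doc1": [(0, 0.70), (1, 0.25), (2, 0.05)],
    "doc2": [(0, 0.50), (1, 0.48), (2, 0.02)],
    "doc3": [(1, 0.10), (2, 0.90)],
}

def topic_cooccurrence(doc_topics, threshold):
    """Count the documents in which two topics both reach ``threshold``."""
    edges = Counter()
    for weights in doc_topics.values():
        # Topics considered "present" in this document (assumed cutoff rule).
        present = sorted(t for t, w in weights if w >= threshold)
        # Every pair of present topics gets one more shared document.
        edges.update(combinations(present, 2))
    return dict(edges)

edges = topic_cooccurrence(theta_like, threshold=0.05)
print(edges)
```

Each key is a pair of topic ids and each value is the number of documents they share, which is the kind of weight an edge in such a graph would carry.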
If you instead wanted to work with topic-word assignments, you would use
gensim_to_phi_featureset(). This function does not require the corpus,
so it’s as simple as:
>>> from tethne import gensim_to_phi_featureset
>>> phi = gensim_to_phi_featureset(ldamodel)
And then you can build, say, a term-coassignment graph like so. You may want to use FeatureSet.norm here, keeping in mind that the resulting values are no longer the phi posterior probabilities but rather relative probabilities within a topic.
>>> from tethne import feature_cooccurrence
>>> graph = feature_cooccurrence(phi.norm, min_weight=0.01)
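The caveat about normalized values can be seen in a small plain-Python sketch. It assumes that normalization divides each weight by the sum of the weights kept for that topic, which is a guess at what FeatureSet.norm does rather than a statement of its implementation. The topic data and the `normalize` helper are hypothetical.

```python
# Hypothetical topic for which only the top three words were kept, so the
# weights do not sum to 1.
topic_words = [("network", 0.12), ("graph", 0.06), ("node", 0.02)]

def normalize(weighted):
    """Rescale weights so that they sum to 1 within the topic."""
    total = sum(w for _, w in weighted)
    return [(token, w / total) for token, w in weighted]

relative = normalize(topic_words)

# The ranking and the ratios between words are unchanged, but the numbers
# are no longer the original probabilities.
for (token, original), (_, rescaled) in zip(topic_words, relative):
    print(token, original, round(rescaled, 3))
```

A threshold such as min_weight therefore means something different depending on whether it is applied to the original or the normalized values.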