Need help? Have a feature request? Please check out the tethne-users group.

Tethne and Gensim

Gensim is a lovely package for topic modeling in Python. As of v0.8.1 there are a few ways to use Tethne and Gensim together for fun or profit.

Export a Gensim-friendly bag-of-words corpus

Both FeatureSet and StructuredFeatureSet now have methods called to_gensim_corpus() that can generate a bag-of-words representation usable in Gensim’s LDA and LSI (LSA) models.

Suppose you want to topic model (in Gensim) abstracts from a Web of Science collection. Here’s a fairly typical approach to generating a StructuredFeatureSet from abstracts:

>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/my/data')
>>> from nltk.tokenize import word_tokenize
>>> corpus.index_feature('abstract', word_tokenize, structured=True)

At this point you might do some filtering or transformation (see Generating and Visualizing Topic Models with Tethne and MALLET).

Gensim’s gensim.models.ldamodel.LdaModel requires a “corpus” in “bag of words” format. This is just a list of lists, in which each sub-list is a sequence of (token id, count) tuples for a particular document (see the Gensim documentation for more details). Gensim also needs a vocabulary (id2word), which is just a dict that maps those integer token ids to string representations of the words in your corpus.

Et voila:

>>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()
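To make the shape of these two objects concrete, here is a minimal pure-Python sketch that builds a bag-of-words corpus and an id2word mapping by hand. It is not how Tethne implements to_gensim_corpus(); the sample documents and the names word2id and bow_corpus are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for indexed abstracts.
documents = [
    ["topic", "models", "find", "topic", "structure"],
    ["models", "need", "a", "vocabulary"],
]

# Assign an integer id to each distinct token, in order of first appearance.
word2id = {}
for tokens in documents:
    for token in tokens:
        word2id.setdefault(token, len(word2id))

# id2word maps integer ids back to their string representations.
id2word = {idx: token for token, idx in word2id.items()}

# Each document becomes a sorted list of (token id, count) tuples.
bow_corpus = [
    sorted(Counter(word2id[token] for token in tokens).items())
    for tokens in documents
]

print(bow_corpus[0])
print(id2word[0])
```

The first document contains "topic" twice, so its entry for token id 0 carries a count of 2, while every other token in it has a count of 1.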

You can pass these objects directly to LdaModel. For example:

>>> from gensim import models
>>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
                                     num_topics=20, update_every=1,
                                     chunksize=10000, passes=1)

This works basically the same for both StructuredFeatureSet.to_gensim_corpus() and FeatureSet.to_gensim_corpus().

Export a raw text corpus

If you would rather start with the document structure described in this tutorial (i.e. each document is a list of strings), you can pass raw=True to to_gensim_corpus(). It will still return a (corpus, vocabulary) tuple, except that vocabulary here is None, so you can discard it.

This can be useful if you are using Gensim’s phrase detection model. Note, however, that this really only makes sense for StructuredFeatureSets, since (by definition) FeatureSets do not preserve order.
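The point about order can be illustrated with a small pure-Python sketch. It is not Gensim’s phrase detection algorithm; it only counts adjacent token pairs, which is the kind of information that phrase detection depends on and that a bag of words throws away. The sample corpus and the helper adjacent_pairs are hypothetical.

```python
from collections import Counter

# Hypothetical raw corpus: each document is an ordered list of token strings.
raw_corpus = [
    ["new", "york", "is", "large"],
    ["she", "moved", "to", "new", "york"],
]


def adjacent_pairs(tokens):
    """Return each pair of neighbouring tokens, in document order."""
    return list(zip(tokens, tokens[1:]))


# Count how often each adjacent pair occurs across the corpus.
pair_counts = Counter(
    pair for tokens in raw_corpus for pair in adjacent_pairs(tokens)
)

# A bag-of-words view of the first document keeps counts but not adjacency.
bag_of_words = Counter(raw_corpus[0])

print(pair_counts[("new", "york")])
print(bag_of_words)
```

The pair ("new", "york") can only be counted because token order is intact; the bag-of-words view records that both words occur but not that they occur next to each other.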

Let Tethne talk to Gensim

Tethne also provides a wrapper for Gensim, GensimLDAModel, which has a nearly identical API to the MALLET-backed LDAModel described in Generating and Visualizing Topic Models with Tethne and MALLET.

>>> from tethne import GensimLDAModel
>>> model = GensimLDAModel(corpus, featureset_name='abstract')
>>> model.fit(Z=20)

Pretty much anything that you can do with LDAModel (e.g. building topic-based graphs) you can also do with GensimLDAModel.

Load phi or theta from a Gensim LdaModel as FeatureSets

If you have already fit your model with Gensim, and simply want to work with the results as a FeatureSet (e.g. to create a graph), you can use gensim_to_theta_featureset() and gensim_to_phi_featureset() to load document-topic and topic-word assignments.

For example, suppose that you want to build a topic-cooccurrence graph. In the code block below, ldamodel is a gensim.models.ldamodel.LdaModel, and corpus is the bag-of-words corpus that you used to create the LdaModel.

>>> from tethne import gensim_to_theta_featureset
>>> theta = gensim_to_theta_featureset(ldamodel, corpus)
>>> from tethne import feature_cooccurrence
>>> graph = feature_cooccurrence(theta, min_weight=0.05)
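As a rough illustration of what a topic-cooccurrence graph captures, the sketch below counts, for every pair of topics, the number of documents in which both topics are present. It is a simplified stand-in, not Tethne’s implementation: the data, the function topic_cooccurrence, and in particular the rule of applying a threshold to per-document topic weights are assumptions, and feature_cooccurrence() may interpret min_weight differently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical document-topic weights: document id -> {topic id: weight}.
theta = {
    "doc1": {0: 0.70, 1: 0.28, 2: 0.02},
    "doc2": {0: 0.50, 1: 0.46, 2: 0.04},
    "doc3": {1: 0.10, 2: 0.90},
}


def topic_cooccurrence(theta, threshold):
    """Count documents in which both topics of a pair meet the threshold."""
    edges = Counter()
    for weights in theta.values():
        # Assumption: a topic is "present" if its weight reaches the threshold.
        present = sorted(t for t, w in weights.items() if w >= threshold)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


edges = topic_cooccurrence(theta, threshold=0.05)
print(dict(edges))
```

Each key in edges is a pair of topic ids that could serve as a graph edge, and the value is a candidate edge weight.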

If you instead wanted to work with topic-word assignments, you would use gensim_to_phi_featureset(). This function does not require corpus, so it’s as simple as:

>>> from tethne import gensim_to_phi_featureset
>>> phi = gensim_to_phi_featureset(ldamodel)

And then you can build, say, a term-coassignment graph like so. You may want to use FeatureSet.norm here, keeping in mind that the resulting values are no longer the phi posterior probabilities but rather relative probabilities within a topic.

>>> from tethne import feature_cooccurrence
>>> graph = feature_cooccurrence(phi.norm, min_weight=0.01)
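To see why normalized values are relative rather than posterior probabilities, consider the following pure-Python sketch. It assumes that normalization divides each weight by the sum of the weights retained for that topic; whether FeatureSet.norm works exactly this way is an assumption, and the data and the helper normalize are hypothetical.

```python
# Hypothetical topic-word weights: topic id -> {word: weight}.
# Only a few words per topic are kept, so the weights do not sum to 1.
phi = {
    0: {"gene": 0.06, "cell": 0.03, "protein": 0.01},
    1: {"market": 0.02, "price": 0.02},
}


def normalize(weights):
    """Rescale the weights in one topic so that they sum to 1."""
    total = sum(weights.values())
    return {word: value / total for word, value in weights.items()}


phi_norm = {topic: normalize(weights) for topic, weights in phi.items()}
print(phi_norm[1])
```

After rescaling, a word’s value reflects its share among the words retained for that topic, so values are comparable within a topic but are no longer the model’s original probabilities.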