.. _gensim-tutorial:

Tethne and Gensim
=================

Gensim is a lovely package for topic
modeling in Python. As of v0.8.1 there are a few ways to use Tethne and Gensim
together for fun or profit.
Export a Gensim-friendly bag-of-words corpus
--------------------------------------------
Both :class:`.FeatureSet` and :class:`.StructuredFeatureSet` now have methods
called ``to_gensim_corpus()`` that can generate a bag-of-words representation
usable in Gensim's LDA and LSI (LSA) models.
Suppose you want to topic model (in Gensim) abstracts from a Web of Science
collection. Here's a fairly typical approach to generating a
:class:`.StructuredFeatureSet` from abstracts:

.. code-block:: python

    >>> from tethne.readers.wos import read
    >>> corpus = read('/path/to/my/data')
    >>> from nltk.tokenize import word_tokenize
    >>> corpus.index_feature('abstract', word_tokenize, structured=True)

At this point you might do some filtering or transformation (see
:ref:`mallet-tutorial`\).

Gensim's :class:`gensim.models.ldamodel.LdaModel` requires a "corpus" in
"bag of words" format. This is just a list of lists, in which each sub-list is
a sequence of (token id, count) tuples for a particular document (see the
Gensim documentation for more details). Gensim also needs a vocabulary
(``id2word``), which is just a dict that maps integer keys to string
representations of the words in your corpus. Et voila:

.. code-block:: python

    >>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()

You can pass these objects directly to ``LdaModel``. For example:

.. code-block:: python

    >>> from gensim import models
    >>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
    ...                                  num_topics=20, update_every=1,
    ...                                  chunksize=10000, passes=1)

This works basically the same for both
:meth:`.StructuredFeatureSet.to_gensim_corpus` and
:meth:`.FeatureSet.to_gensim_corpus`\.
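
To make the bag-of-words format concrete, here is a minimal pure-Python sketch
that builds the same kind of structure from tokenized documents. It is not how
Tethne builds its output internally; the ``to_bag_of_words`` helper and the
example documents are hypothetical and exist only for illustration.

.. code-block:: python

    from collections import Counter

    def to_bag_of_words(documents):
        """Convert tokenized documents into (bow_corpus, id2word)."""
        word2id = {}
        bow_corpus = []
        for tokens in documents:
            for token in tokens:
                # Give each previously unseen token the next free integer id.
                word2id.setdefault(token, len(word2id))
            counts = Counter(word2id[token] for token in tokens)
            # Each document becomes a list of (token id, count) tuples.
            bow_corpus.append(sorted(counts.items()))
        # Invert the mapping so that integer ids point back to words.
        id2word = {index: token for token, index in word2id.items()}
        return bow_corpus, id2word

    # Made-up documents, already split into tokens.
    documents = [
        ['topic', 'models', 'find', 'topic', 'structure'],
        ['models', 'need', 'a', 'corpus'],
    ]
    bow_corpus, id2word = to_bag_of_words(documents)

Here ``bow_corpus[0]`` records that the word with id 0 (``'topic'``) occurs
twice in the first document, and ``id2word`` lets you translate ids back to
strings when inspecting topics.
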
Export a raw text corpus
------------------------
If you would rather start with the document structure described in the Gensim
tutorials (i.e. each document is a list of strings), you can pass ``raw=True``
to ``to_gensim_corpus()``. It will still return a (corpus, vocabulary) tuple,
except that ``vocabulary`` here is ``None``, so you can discard it.

.. code-block:: python

    >>> raw_corpus, _ = corpus.features['abstract'].to_gensim_corpus(raw=True)

This can be useful if you are using Gensim's phrase detection model
(``gensim.models.phrases``). Note, however, that this really only makes sense
for :class:`.StructuredFeatureSet`\s, since (by definition)
:class:`.FeatureSet`\s do not preserve order.
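
To see why token order matters for phrase detection, consider a minimal sketch
that counts adjacent token pairs in a raw corpus. This is plain Python rather
than Gensim's phrase model, and the ``count_bigrams`` helper and example
documents are hypothetical. The point is only that adjacent pairs can be
recovered from ordered token lists, but not from (token id, count) tuples.

.. code-block:: python

    from collections import Counter

    def count_bigrams(raw_corpus):
        """Count adjacent token pairs across documents given as token lists."""
        bigrams = Counter()
        for tokens in raw_corpus:
            # zip pairs each token with its successor, so order is essential.
            bigrams.update(zip(tokens, tokens[1:]))
        return bigrams

    # Made-up documents in "raw" form: each one is an ordered list of strings.
    raw_corpus = [
        ['latent', 'dirichlet', 'allocation', 'is', 'a', 'topic', 'model'],
        ['we', 'fit', 'a', 'topic', 'model', 'to', 'abstracts'],
    ]
    bigrams = count_bigrams(raw_corpus)
    # Pairs seen at least twice are candidate phrases in this toy example.
    frequent = [pair for pair, count in bigrams.items() if count >= 2]
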
Let Tethne talk to Gensim
-------------------------
Tethne also provides a wrapper for Gensim, :class:`.GensimLDAModel`\, which has
a nearly identical API to the MALLET-backed :class:`.LDAModel` described in
:ref:`mallet-tutorial`\.

.. code-block:: python

    >>> from tethne import GensimLDAModel
    >>> model = GensimLDAModel(corpus, featureset_name='abstract')
    >>> model.fit(Z=20)

Pretty much anything that you can do with :class:`.LDAModel` (e.g. building
topic-based graphs) you can also do with :class:`.GensimLDAModel`\.
Load ``phi`` or ``theta`` from a Gensim ``LdaModel`` as ``FeatureSet``\s
------------------------------------------------------------------------
If you have already fit your model with Gensim, and simply want to work with
the results as a :class:`.FeatureSet` (e.g. to create a graph), you can use
:func:`.gensim_to_theta_featureset` and :func:`.gensim_to_phi_featureset` to
load document-topic and topic-word assignments.
For example, suppose that you want to build a topic-cooccurrence graph. In the
code block below, ``ldamodel`` is a :class:`gensim.models.ldamodel.LdaModel`\,
and ``corpus`` is the bag-of-words corpus that you used to create the
``LdaModel``.

.. code-block:: python

    >>> from tethne import gensim_to_theta_featureset
    >>> theta = gensim_to_theta_featureset(ldamodel, corpus)
    >>> from tethne import feature_cooccurrence
    >>> graph = feature_cooccurrence(theta, min_weight=0.05)

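
The general idea behind a topic-cooccurrence graph can be sketched in plain
Python: two topics are linked whenever both are prominent in the same document.
This is not Tethne's implementation. The ``topic_cooccurrence`` helper, the
``doc_topics`` dict, and the ``threshold`` parameter are hypothetical, and
``threshold`` may not correspond to what ``min_weight`` does in
:func:`.feature_cooccurrence`.

.. code-block:: python

    from collections import Counter
    from itertools import combinations

    def topic_cooccurrence(doc_topics, threshold):
        """Count topic pairs that are both prominent in the same document.

        ``doc_topics`` maps a document id to a list of (topic, proportion)
        tuples.
        """
        edges = Counter()
        for doc_id, topic_weights in doc_topics.items():
            # Keep only topics whose proportion reaches the threshold.
            prominent = sorted(t for t, w in topic_weights if w >= threshold)
            # Every pair of prominent topics co-occurs once in this document.
            edges.update(combinations(prominent, 2))
        return edges

    # Made-up document-topic proportions for three documents.
    doc_topics = {
        'doc1': [(0, 0.6), (1, 0.3), (2, 0.1)],
        'doc2': [(0, 0.5), (1, 0.45), (2, 0.05)],
        'doc3': [(1, 0.2), (2, 0.8)],
    }
    edges = topic_cooccurrence(doc_topics, threshold=0.2)

Each key in ``edges`` is a pair of topics and each value is the number of
documents in which both topics were prominent, which is the kind of weight you
would attach to an edge in the graph.
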
If you instead wanted to work with topic-word assignments, you would use
:func:`.gensim_to_phi_featureset`\. This function does not require ``corpus``,
so it's as simple as:

.. code-block:: python

    >>> from tethne import gensim_to_phi_featureset
    >>> phi = gensim_to_phi_featureset(ldamodel)

And then you can build, say, a term-coassignment graph like so (you may want to
use :attr:`.FeatureSet.norm` here, keeping in mind that the resulting values
are no longer the phi posterior probabilities but rather relative probabilities
within a topic):

.. code-block:: python

    >>> from tethne import feature_cooccurrence
    >>> graph = feature_cooccurrence(phi.norm, min_weight=0.01)

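
The difference between posterior probabilities and relative probabilities
within a topic can be illustrated with a small sketch. It assumes that
normalization means dividing each weight by the sum of the weights in the same
topic, and that only a subset of terms has been kept for the topic; both are
assumptions made for illustration, not a description of what
``FeatureSet.norm`` does internally. The ``normalize`` helper and the example
values are hypothetical.

.. code-block:: python

    def normalize(weights):
        """Rescale (term, weight) pairs so that the weights sum to 1."""
        total = sum(weight for _, weight in weights)
        if total == 0:
            # Nothing to rescale; return the pairs unchanged.
            return list(weights)
        return [(term, weight / total) for term, weight in weights]

    # Made-up weights for the terms kept in one topic. They sum to 0.08,
    # because the remaining probability mass belongs to terms not shown.
    topic = [('network', 0.04), ('graph', 0.03), ('citation', 0.01)]
    relative = normalize(topic)

After normalization the weights sum to 1, so they describe how the kept terms
compare to one another rather than their original probabilities.
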