.. _gensim-tutorial:

Tethne and Gensim
=================

`Gensim `_ is a lovely package for topic modeling in Python. As of v0.8.1 there
are a few ways to use Tethne and Gensim together for fun or profit.

Export a Gensim-friendly bag-of-words corpus
--------------------------------------------

Both :class:`.FeatureSet` and :class:`.StructuredFeatureSet` now have methods
called ``to_gensim_corpus()`` that can generate a bag-of-words representation
usable in Gensim's LDA and LSI (LSA) models.

Suppose you want to topic model (in Gensim) abstracts from a Web of Science
collection. Here's a fairly typical approach to generating a
:class:`.StructuredFeatureSet` from abstracts:

.. code-block:: python

   >>> from tethne.readers.wos import read
   >>> corpus = read('/path/to/my/data')
   >>> from nltk.tokenize import word_tokenize
   >>> corpus.index_feature('abstract', word_tokenize, structured=True)

At this point you might do some filtering or transformation (see
:ref:`mallet-tutorial`\).

Gensim's :class:`gensim.models.ldamodel.LdaModel` requires a "corpus" in
"bag of words" format. This is just a list of lists, in which each sub-list is
a sequence of (token id, count) tuples for a particular document (see
`the Gensim documentation `_ for more details). Gensim also needs a vocabulary
(``id2word``), which is just a dict that maps integer keys to string
representations of the words in your corpus. Et voila:

.. code-block:: python

   >>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()

You can pass these objects directly to ``LdaModel``. For example:

.. code-block:: python

   >>> from gensim import models
   >>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
   ...                                  num_topics=20, update_every=1,
   ...                                  chunksize=10000, passes=1)

This works basically the same for both
:meth:`.StructuredFeatureSet.to_gensim_corpus` and
:meth:`.FeatureSet.to_gensim_corpus`\.

Export a raw text corpus
------------------------

If you would rather start with the document structure described in
`this tutorial `_ (i.e.
each document is a list of strings), you can pass ``raw=True`` to
``to_gensim_corpus()``. It still returns a (corpus, vocabulary) tuple, but
``vocabulary`` is ``None`` in this case, so you can discard it.

.. code-block:: python

   >>> raw_corpus, _ = corpus.features['abstract'].to_gensim_corpus(raw=True)

This can be useful if you are using Gensim's `phrase detection `_ model. Note,
however, that this really only makes sense for
:class:`.StructuredFeatureSet`\s, since (by definition) :class:`.FeatureSet`\s
do not preserve order.

Let Tethne talk to Gensim
-------------------------

Tethne also provides a wrapper for Gensim, :class:`.GensimLDAModel`\, which has
a nearly identical API to the MALLET-backed :class:`.LDAModel` described in
:ref:`mallet-tutorial`\.

.. code-block:: python

   >>> from tethne import GensimLDAModel
   >>> model = GensimLDAModel(corpus, featureset_name='abstract')
   >>> model.fit(Z=20)

Pretty much anything that you can do with :class:`.LDAModel` (e.g. building
topic-based graphs) you can also do with :class:`.GensimLDAModel`\.

Load ``phi`` or ``theta`` from a Gensim ``LdaModel`` as ``FeatureSet``\s
------------------------------------------------------------------------

If you have already fit your model with Gensim, and simply want to work with
the results as a :class:`.FeatureSet` (e.g. to create a graph), you can use
:func:`.gensim_to_theta_featureset` and :func:`.gensim_to_phi_featureset` to
load document-topic and topic-word assignments, respectively.

For example, suppose that you want to build a topic-cooccurrence graph. In the
code block below, ``ldamodel`` is a :class:`gensim.models.ldamodel.LdaModel`\,
and ``corpus`` is the bag-of-words corpus that you used to create the
``LdaModel``.

.. code-block:: python

   >>> from tethne import gensim_to_theta_featureset
   >>> theta = gensim_to_theta_featureset(ldamodel, corpus)
   >>> from tethne import feature_cooccurrence
   >>> graph = feature_cooccurrence(theta, min_weight=0.05)

If you instead want to work with topic-word assignments, use
:func:`.gensim_to_phi_featureset`\. This function does not require ``corpus``,
so it's as simple as:

.. code-block:: python

   >>> from tethne import gensim_to_phi_featureset
   >>> phi = gensim_to_phi_featureset(ldamodel)

You can then build, say, a term-coassignment graph. You may want to use
:attr:`.FeatureSet.norm` here, keeping in mind that the resulting values are no
longer the phi posterior probabilities but rather relative probabilities within
a topic.

.. code-block:: python

   >>> from tethne import feature_cooccurrence
   >>> graph = feature_cooccurrence(phi.norm, min_weight=0.01)
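
As an aside, the two corpus shapes discussed in the export sections (a raw corpus of token lists, and a bag-of-words corpus with its ``id2word`` vocabulary) can be sketched in plain Python. This is only an illustration of the data structures, not how Tethne or Gensim builds them: ``to_bag_of_words`` is a hypothetical helper, the two toy documents are made up, and the particular integer ids it assigns are an arbitrary choice of this sketch.

```python
from collections import Counter


def to_bag_of_words(documents):
    """Convert tokenized documents into a (corpus, id2word) pair.

    Hypothetical helper for illustration only; it is not part of
    Tethne or Gensim.
    """
    word2id = {}
    corpus = []
    for tokens in documents:
        counts = Counter(tokens)
        # Give each previously unseen token the next free integer id.
        for token in counts:
            if token not in word2id:
                word2id[token] = len(word2id)
        # One sorted list of (token id, count) tuples per document.
        corpus.append(sorted((word2id[t], n) for t, n in counts.items()))
    # The vocabulary maps in the id -> word direction.
    id2word = {i: w for w, i in word2id.items()}
    return corpus, id2word


# A "raw" corpus: each document is an ordered list of strings.
raw_corpus = [
    ['topic', 'models', 'find', 'topics'],
    ['graph', 'models', 'link', 'models'],
]

bow_corpus, id2word = to_bag_of_words(raw_corpus)
```

Word order survives in ``raw_corpus`` but not in ``bow_corpus``, which only records how often each token id occurs in a document. That is why the raw export only makes sense for feature sets that preserve order.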