Quickstart
==========

Load some data
--------------

Assuming that you have a JSTOR DfR dataset (in XML format) containing some wordcount
data unzipped at ``/path/to/my/dataset``, create a :class:`.Corpus` with:

.. code-block:: python

   >>> from tethne.readers import dfr
   >>> C = dfr.read_corpus('/path/to/my/dataset', 'uni')
   
Or if you're working with data from the Web of Science, try:

.. code-block:: python

   >>> from tethne.readers import wos
   >>> C = dfr.read_corpus('/path/to/my/wosdata.txt')
   
Index your :class:`.Corpus` by publication date and journal using the 
:func:`tethne.classes.corpus.Corpus.slice` method.

.. code-block:: python

   >>> C.slice('date', method='time_period', window_size=5)
   >>> C.slice('jtitle')
   
Now use :func:`tethne.classes.corpus.Corpus.plot_distribution` to see how your 
:class:`.Paper`\s are distributed over time...

.. code-block:: python

   >>> C.plot_distribution('date')
   
.. figure:: _static/images/corpus_plot_distribution.png
   :width: 400
   :align: center

...or by both time and journal:
   
.. code-block:: python

   >>> C.plot_distribution('date', 'jtitle')   

.. figure:: _static/images/corpus_plot_distribution_2d.png
   :width: 600
   :align: center
   
Simple networks simply
----------------------
   
Network-building methods are in :mod:`.networks`\. You can create a coauthorship
network like this:

.. code-block:: python

   >>> from tethne.networks import authors
   >>> coauthors = authors.coauthors(C)
   
To introduce a temporal component, slice your :class:`.Corpus` and then create a :class:`.GraphCollection` (``cumulative=True`` means that the coauthorship network
will grow over time without losing old connections):

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5, cumulative=True)	# 5-year bins.
   >>> from tethne import GraphCollection
   >>> G = GraphCollection().build(C, 'date', 'authors', 'coauthors')
   
If you're using WoS data (with citations), you can also build citation-based graphs (see
:mod:`.networks.papers`\). Here's a static co-citation graph from a :class:`.Corpus`:

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5) 	# No need for `cumulative` here.
   >> from tethne.networks import papers
   >>> cocitation = papers.cocitation(C.all_papers(), threshold=2, topn=300)

``threshold=2`` means that papers must be co-cited twice, and ``topn=300`` means that
only the top 300 most cited papers will be included.

To see a time-variant co-citation network, build a :class:`.GraphCollection` just as
before:

.. code-block:: python

   >>> G = GraphCollection().build(C, 'date', 'papers', 'cocitation', threshold=2, topn=300)

Visualize your networks
-----------------------

You can export a graph for visualization in `Cytoscape <http://cytoscape.org>`_ using
:mod:`.writers`\:

.. code-block:: python

   >>> from tethne.writers import graph
   >>> graph.to_graphml(coauthors, '/path/to/my/graph.graphml')
   
To visualize a :class:`.GraphCollection` as a dynamic graph in Cytoscape, export it using
:func:`.writers.collection.to_dxgmml`\:

.. code-block:: python

   >>> from tethne.writers import collection
   >>> collection.to_dxgmml(G, '/path/to/my/dynamicNetwork.xgmml')
   
Working with Words
------------------

Suppose you loaded up a :class:`.Corpus` from some DfR datasets, using:

.. code-block:: python

   >>> from tethne.readers import dfr
   >>> C = dfr.corpus_from_dir('/path/to/my/dataset', 'uni')
   
Now you have some ``'unigrams'`` in ``C.features``. There are surely plenty of junk words
in there. You can apply a stoplist when you load the :class:`.Corpus`, by passing it to
``exclude``:

.. code-block:: python

   >>> from nltk.corpus import stopwords
   >>> stoplist = stopwords.words()
   >>> from tethne.readers import dfr
   >>> C = dfr.corpus_from_dir('/path/to/my/dataset', 'uni', exclude=stoplist)
   
If you have some recent WoS data with abstracts, you can get a featureset from abstract
terms, too:

.. code-block:: python

   >>> from tethne.readers import wos
   >>> C = dfr.read_corpus('/path/to/my/wosdata.txt')
   >>> C.abstract_to_features()	# Automatically applies a stoplist.
   
Filter the words in the :class:`.Corpus` further, using :func:`.Corpus.filter_features`\.
Maybe you only want words that occur more than three times overall, occur in more than one 
document, and are at least four characters in length. 

.. code-block:: python

   >>> def filt(s, C, DC):
   ...     if C > 3 and DC > 1 and len(s) > 3:
   ...         return True
   ...     return False
   >>> C.filter_features('unigrams', 'wordcounts_filtered', filt)

You can see how the word ``four`` is distributed across your :class:`.Corpus` using 
:func:`.Corpus.plot_distribution`\:

.. code-block:: python

   >>> C.slice('date', method='time_period', window_size=5)        
   >>> C.slice('jtitle')
   >>> fkwargs = {
   ...     'featureset': 'wordcounts_filtered',
   ...     'feature': 'four',
   ...     'mode': 'counts',
   ...     'normed': True,
   ...     }
   >>> fig = C.plot_distribution('date', 'jtitle', mode='features',
				 fkwargs=fkwargs, interpolation='none')
   >>> fig.savefig('/path/to/dist.png')
   
.. figure:: _static/images/testdist.png
   :width: 600
   :align: center   

Models Based on Words
---------------------

Topic models are pretty popular. You can create a LDA topic model with MALLET
using a :class:`.MALLETModelManager`\. First, get the manager:

.. code-block:: python

   >>> from tethne.model import MALLETModelManager
   >>> outpath = '/path/to/my/working/directory'
   >>> mallet = '/Applications/mallet-2.0.7'	# Path to MALLET install directory.
   >>> M = MALLETModelManager(C, 'wordcounts_filtered', outpath, mallet_path=mallet)
   
Now ``prep`` and ``build``. Here's an example for 50 topics:

.. code-block:: python

   >>> M.prep()
   >>> model = M.build(Z=50, max_iter=300)  # May take a while.
   
Here are the top 5 words in topic 1:

.. code-block:: python

   >>> model.print_topic(1, Nwords=5)
   'opposed, terminates, trichinosis, cistus, acaule'   
   
To view the representation of topic 1 over the slices in the :class:`.Corpus`\...

.. code-block:: python

   >>> keys, repr = M.topic_over_time(1, plot=True)

...which should return ``keys`` (date) and ``repr`` (% documents) for topic 1,
and generate a plot like this one in your ``outpath``.

.. figure:: _static/images/topic_1_over_time.png
   :width: 400
   :align: center

Combining corpus models and social models
-----------------------------------------

Social models (see :mod:`.model.social`\) represent the dynamics of social influence
in terms of behavior adoption. Writing about a topic is a behavior.

Suppose you have a coauthorship :class:`.GraphCollection`\...

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5, cumulative=True)
   >>> from tethne import GraphCollection         
   >>> G = GraphCollection().build(C, 'date', 'authors', 'coauthors')
   
...and the :class:`.LDAModel` (``model``) from the last section. 

You can use the :class:`.TAPModelManager` to generate a :class:`.TAPModel`\, which
represents a Topical Affinity Propagation social influence model.

.. code-block:: python

   >>> from tethne.model import TAPModelManager
   >>> T = TAPModelManager(C, G, model)
   >>> T.build()		# This may take a while.

You can get the social influence :class:`.GraphCollection` for topic 1:
   
.. code-block:: python

   >>> IG = T.graph_collection(0)

Since you used ``cumulative=True`` when creating the coauthors :class:`.GraphCollection`\,
the latest graph in the social influence GraphCollection is the most interesting:

.. code-block:: python

   >>> ig = IG[sorted(IG.graphs.keys())[-1]]
   
Write it to GraphML:

.. code-block:: python

   >>> from tethne.writers import graph
   >>> graph.to_graphml(ig, '/path/to/my/graph.graphml')

And then visualize it in `Cytoscape <http://www.cytoscape.org>`_ or `Gephi
<http://gephi.org>`_.

.. figure:: _static/images/tap_topic0.png
   :width: 600
   :align: center