Quickstart
==========

Load some data
--------------

Assuming that you have a JSTOR DfR dataset (in XML format) containing some
wordcount data unzipped at ``/path/to/my/dataset``, create a :class:`.Corpus`
with:

.. code-block:: python

   >>> from tethne.readers import dfr
   >>> C = dfr.read_corpus('/path/to/my/dataset', 'uni')

Or if you're working with data from the Web of Science, try:

.. code-block:: python

   >>> from tethne.readers import wos
   >>> C = wos.read_corpus('/path/to/my/wosdata.txt')

Index your :class:`.Corpus` by publication date and journal using the
:func:`tethne.classes.corpus.Corpus.slice` method.

.. code-block:: python

   >>> C.slice('date', method='time_period', window_size=5)
   >>> C.slice('jtitle')

Now use :func:`tethne.classes.corpus.Corpus.plot_distribution` to see how your
:class:`.Paper`\s are distributed over time...

.. code-block:: python

   >>> C.plot_distribution('date')

.. figure:: _static/images/corpus_plot_distribution.png
   :width: 400
   :align: center

...or by both time and journal:

.. code-block:: python

   >>> C.plot_distribution('date', 'jtitle')

.. figure:: _static/images/corpus_plot_distribution_2d.png
   :width: 600
   :align: center

Simple networks simply
----------------------

Network-building methods are in :mod:`.networks`\. You can create a
coauthorship network like this:

.. code-block:: python

   >>> from tethne.networks import authors
   >>> coauthors = authors.coauthors(C)

To introduce a temporal component, slice your :class:`.Corpus` and then create
a :class:`.GraphCollection` (``cumulative=True`` means that the coauthorship
network will grow over time without losing old connections):

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5, cumulative=True)  # 5-year bins.
   >>> from tethne import GraphCollection
   >>> G = GraphCollection().build(C, 'date', 'authors', 'coauthors')

If you're using WoS data (with citations), you can also build citation-based
graphs (see :mod:`.networks.papers`\).
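
For orientation: two papers are co-cited when a third paper cites both of them. The following standalone sketch counts co-citations on toy data using only the standard library. It illustrates the idea and is not Tethne's implementation; ``cocitation_counts``, ``min_count``, and the toy ``citations`` mapping are hypothetical names introduced here.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations, min_count=1):
    """Count how often each pair of references is cited together.

    `citations` maps a citing paper to the list of references it cites.
    Hypothetical helper for illustration; not part of Tethne.
    """
    counts = Counter()
    for refs in citations.values():
        # Sort and deduplicate so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    # Keep only pairs that are co-cited at least `min_count` times.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy data: three citing papers and the references each one cites.
citations = {
    'paper_1': ['A', 'B', 'C'],
    'paper_2': ['A', 'B'],
    'paper_3': ['B', 'C'],
}
edges = cocitation_counts(citations, min_count=2)
```

Each surviving pair would become an edge in a co-citation graph, with the count available as an edge weight.
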
Here's a static co-citation graph from a :class:`.Corpus`:

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5)  # No need for `cumulative` here.
   >>> from tethne.networks import papers
   >>> cocitation = papers.cocitation(C.all_papers(), threshold=2, topn=300)

``threshold=2`` means that papers must be co-cited at least twice, and
``topn=300`` means that only the 300 most-cited papers will be included.

To see a time-variant co-citation network, build a :class:`.GraphCollection`
just as before:

.. code-block:: python

   >>> G = GraphCollection().build(C, 'date', 'papers', 'cocitation', threshold=2, topn=300)

Visualize your networks
-----------------------

You can export a graph for visualization in Cytoscape using :mod:`.writers`\:

.. code-block:: python

   >>> from tethne.writers import graph
   >>> graph.to_graphml(coauthors, '/path/to/my/graph.graphml')

To visualize a :class:`.GraphCollection` as a dynamic graph in Cytoscape,
export it using :func:`.writers.collection.to_dxgmml`\:

.. code-block:: python

   >>> from tethne.writers import collection
   >>> collection.to_dxgmml(G, '/path/to/my/dynamicNetwork.xgmml')

Working with Words
------------------

Suppose you loaded up a :class:`.Corpus` from some DfR datasets, using:

.. code-block:: python

   >>> from tethne.readers import dfr
   >>> C = dfr.corpus_from_dir('/path/to/my/dataset', 'uni')

Now you have some ``'unigrams'`` in ``C.features``. There are surely plenty of
junk words in there. You can apply a stoplist when you load the
:class:`.Corpus` by passing it to ``exclude``:

.. code-block:: python

   >>> from nltk.corpus import stopwords
   >>> stoplist = stopwords.words()
   >>> from tethne.readers import dfr
   >>> C = dfr.corpus_from_dir('/path/to/my/dataset', 'uni', exclude=stoplist)

If you have some recent WoS data with abstracts, you can get a featureset from
abstract terms, too:

.. code-block:: python

   >>> from tethne.readers import wos
   >>> C = wos.read_corpus('/path/to/my/wosdata.txt')
   >>> C.abstract_to_features()  # Automatically applies a stoplist.

Filter the words in the :class:`.Corpus` further, using
:func:`.Corpus.filter_features`\. Maybe you only want words that occur more
than three times overall, occur in more than one document, and are at least
four characters in length. In the filter function below, ``s`` is the word,
``C`` is its overall count, and ``DC`` is the number of documents in which it
occurs (inside ``filt``, ``C`` is the count, not the corpus).

.. code-block:: python

   >>> def filt(s, C, DC):
   ...     if C > 3 and DC > 1 and len(s) > 3:
   ...         return True
   ...     return False
   >>> C.filter_features('unigrams', 'wordcounts_filtered', filt)

You can see how the word ``four`` is distributed across your :class:`.Corpus`
using :func:`.Corpus.plot_distribution`\:

.. code-block:: python

   >>> C.slice('date', method='time_period', window_size=5)
   >>> C.slice('jtitle')
   >>> fkwargs = {
   ...     'featureset': 'wordcounts_filtered',
   ...     'feature': 'four',
   ...     'mode': 'counts',
   ...     'normed': True,
   ...     }
   >>> fig = C.plot_distribution('date', 'jtitle', mode='features',
   ...                           fkwargs=fkwargs, interpolation='none')
   >>> fig.savefig('/path/to/dist.png')

.. figure:: _static/images/testdist.png
   :width: 600
   :align: center

Models Based on Words
---------------------

Topic models are pretty popular. You can create an LDA topic model with MALLET
using a :class:`.MALLETModelManager`\. First, get the manager:

.. code-block:: python

   >>> from tethne.model import MALLETModelManager
   >>> outpath = '/path/to/my/working/directory'
   >>> mallet = '/Applications/mallet-2.0.7'  # Path to MALLET install directory.
   >>> M = MALLETModelManager(C, 'wordcounts_filtered', outpath, mallet_path=mallet)

Now ``prep`` and ``build``. Here's an example for 50 topics:

.. code-block:: python

   >>> M.prep()
   >>> model = M.build(Z=50, max_iter=300)  # May take a while.

Here are the top 5 words in topic 1:

.. code-block:: python

   >>> model.print_topic(1, Nwords=5)
   'opposed, terminates, trichinosis, cistus, acaule'

To view the representation of topic 1 over the slices in the
:class:`.Corpus`\...

.. code-block:: python

   >>> keys, repr = M.topic_over_time(1, plot=True)

...which should return ``keys`` (date) and ``repr`` (% documents) for topic 1,
and generate a plot like this one in your ``outpath``.

.. figure:: _static/images/topic_1_over_time.png
   :width: 400
   :align: center

Combining corpus models and social models
-----------------------------------------

Social models (see :mod:`.model.social`\) represent the dynamics of social
influence in terms of behavior adoption. Writing about a topic is a behavior.
Suppose you have a coauthorship :class:`.GraphCollection`\...

.. code-block:: python

   >>> C.slice('date', 'time_period', window_size=5, cumulative=True)
   >>> from tethne import GraphCollection
   >>> G = GraphCollection().build(C, 'date', 'authors', 'coauthors')

...and the :class:`.LDAModel` (``model``) from the last section. You can use
the :class:`.TAPModelManager` to generate a :class:`.TAPModel`\, which
represents a Topical Affinity Propagation social influence model.

.. code-block:: python

   >>> from tethne.model import TAPModelManager
   >>> T = TAPModelManager(C, G, model)
   >>> T.build()  # This may take a while.

You can get the social influence :class:`.GraphCollection` for topic 0:

.. code-block:: python

   >>> IG = T.graph_collection(0)

Since you used ``cumulative=True`` when creating the coauthors
:class:`.GraphCollection`\, the latest graph in the social influence
:class:`.GraphCollection` is the most interesting:

.. code-block:: python

   >>> ig = IG[sorted(IG.graphs.keys())[-1]]

Write it to GraphML:

.. code-block:: python

   >>> from tethne.writers import graph
   >>> graph.to_graphml(ig, '/path/to/my/graph.graphml')

And then visualize it in Cytoscape or Gephi.

.. figure:: _static/images/tap_topic0.png
   :width: 600
   :align: center
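
Before opening an exported file in Cytoscape or Gephi, it can help to check that it contains the nodes and edges you expect. The sketch below uses only the Python standard library and assumes the file declares the standard GraphML namespace (an assumption about the writer's output, so check your own file). ``graphml_summary`` is a hypothetical helper, not part of Tethne, and the sample document is toy data.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace, in ElementTree's "{uri}tag" notation.
GRAPHML_NS = '{http://graphml.graphdrawing.org/xmlns}'


def graphml_summary(source):
    """Return (node_count, edge_count) for a GraphML path or file object."""
    root = ET.parse(source).getroot()
    nodes = root.findall('.//' + GRAPHML_NS + 'node')
    edges = root.findall('.//' + GRAPHML_NS + 'edge')
    return len(nodes), len(edges)


# Toy GraphML document standing in for an exported graph.
sample = b"""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="a"/>
    <node id="b"/>
    <node id="c"/>
    <edge source="a" target="b"/>
    <edge source="b" target="c"/>
  </graph>
</graphml>"""

summary = graphml_summary(io.BytesIO(sample))
```

To check a real export, pass the path you wrote to, for example ``graphml_summary('/path/to/my/graph.graphml')``.
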