.. _working-with-corpora:

Working with Corpora
====================

Building a :class:`.Corpus` is the starting-point for working with bibliographic data
in Tethne. :class:`.Corpus` objects encapsulate and index :class:`.Paper`\s and
:ref:`corpora-featuresets`\, and mechanisms for analyzing data diachronically.

The :class:`.Corpus` class lives in :mod:`tethne.classes`\, but can be imported directly
from :mod:`tethne`\:

.. code-block:: python

   >>> from tethne import Corpus

Creating Corpora
----------------

Minimally, a :class:`.Corpus` requires a set of :class:`.Paper`\s. 

.. code-block:: python

   >>> from tethne.readers import wos
   >>> papers = wos.read('/path/to/wosdata.txt')    # Load some data.

   >>> MyCorpus = Corpus(papers)
   
Indexing
````````
   
By default, :class:`.Paper`\s and their cited references are indexed by 'ayjid', an
identifier generated from the first-author, publication date, and journal name of each
entry. 

You can use alternate indexing fields for :class:`.Paper`\s and their cited references:

.. code-block:: python

   >>> MyCorpus = Corpus(papers, index_by='wosid', index_citation_by='ayjid')
      
These are (usually) your options for index fields for :class:`.Papers` (``index_by``):

==============  =============
Source          Fields
==============  =============
Web of Science  wosid, ayjid
Scopus          eid, ayjid
JSTOR DfR		doi, ayjid
==============  =============

It should be obvious that ``ayjid`` is a good option if you plan to integrate data from
multiple datasources. ``ayjid`` is your best option for ``index_citations_by``, unless
you're confident that all cited references include alternate identifiers (this is rare).

By default, a :class:`.Corpus` calls its own :func:`Corpus.index` method on instantiation.
This results in a few useful attributes:

=============       ======================================================================
Attribute           Type/Description
=============       ======================================================================
papers              A dictionary mapping :class:`.Paper` IDs onto :class:`.Paper` 
                    instances.
authors             A dictionary mapping author names onto lists of :class:`.Paper` IDs.
citations           A dictionary mapping citation IDs onto cited references (themselves
                    :class:`.Paper` instances), if data available.
papers_citing       A dictionary mapping citation IDs onto lists of citing 
                    :class:`.Paper`\s (by ID) in the dataset, if data available.
=============       ======================================================================

If the :class:`.Paper`\s in the :class:`.Corpus` contain cited references, then a 
featureset called ``citations`` will also be created.

Directly from data
``````````````````

All of the modules in :mod:`.readers` should include methods to generate a
:class:`.Corpus` directly from data:

* :func:`.dfr.read_corpus`
* :func:`.scopus.read_corpus`
* :func:`.wos.read_corpus`

.. _corpora-featuresets:

Featuresets
-----------

In Tethne, a feature is a scalar property of one or more document in a :class:`.Corpus`\.
The most straightforward example of a feature is a word, which can occur some number of
times ( >= 0 ) in a document.

A featureset is a set of data structures that describe the distribution of features 
over :class:`.Paper`\s in a corpus. For example, a :class:`.Corpus` might contain a
featureset describing the distribution of words or citations over its :class:`.Paper`\s.

In Tethne v0.6.0-beta, featuresets are simply dictionaries contained in the 
:class:`.Corpus`\' ``features`` attribute. Each featureset should contain the following
keys and values:

+-------------------+--------------------------------------------------------------------+
| Key               | Value Type/Description                                             |
+===================+====================================================================+
| ``index``         | Dictionary mapping integer IDs onto string representations of      |
|                   | features. For wordcounts, think of this as a vocabulary.           |
+-------------------+--------------------------------------------------------------------+
| ``features``      | Dictionary mapping :class:`.Paper` IDs onto sparse feature vectors |
|                   | (e.g. wordcounts). These vectors are lists of ( feature index,     |
|                   | value ) tuples. See :ref:`sparse-feature-vectors`\.                |
+-------------------+--------------------------------------------------------------------+
| ``counts``        | Dictionary mapping feature indices (in ``index``) onto the sum of  |
|                   | values from ``features``. For wordcounts, for example, this is the |
|                   | total number of times that a word occurs in the :class:`.Corpus`\. |
+-------------------+--------------------------------------------------------------------+
| ``documentCounts``| Dictionary mapping feature indices (in ``index``) onto the number  |
|                   | of :class:`.Paper`\s in which the feature occurs (e.g. the number  |
|                   | of documents containing that word).                                |
+-------------------+--------------------------------------------------------------------+
| ``papers``        | Dictionary mapping feature indices onto sparse vectors over        |
|                   | :class:`.Paper` IDs. Similar to :ref:`sparse-feature-vectors`\,    |
|                   | except that column indices are :class:`.Paper` IDs instead of      |
|                   | feature indices.                                                   |
+-------------------+--------------------------------------------------------------------+

.. _sparse-feature-vector:

Sparse feature vector
`````````````````````

The number of features associated with a :class:`.Paper` is potentially much smaller than
the total number of features in a given featureset. For example, a set of documents may
have a vocabulary of tens of thousands of words, only a few hundred of which appear in
any one document alone. Thus ff we were to represent a featureset as a matrix of papers 
and the words that they contain, most of the cells of that matrix would contain zeros. 

To save space on disk and in memory, Tethne represents featuresets sparsely, meaning that
only non-zero values are stored. In Tethne, a sparse feature vector is a list of 
(feature,value) tuples.

Consider the document: ``the cow jumped over the moon``. We can represent this document
in terms of a wordcount vector:

.. code-block:: python

   [ ('the', 2), ('cow', 1), ('jumped', 1), ('over', 1), ('moon', 1) ]
   
Of course, features are usually indexed by integer IDs. The mapping between integer IDs
and feature strings are stored in the featureset's ``'index'`` dictionary, which might
look something like:

.. code-block:: python

   >>> MyCorpus.features['wordcounts']['index']
   {0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'}

So the wordcount vector for this paper would look like:

.. code-block:: python

   [ (0, 2), (1, 1), (2, 1), (3, 1), (4, 1) ]


Generating and modifying featuresets
````````````````````````````````````
The following methods in :class:`.Corpus` can be used to generate and modify featuresets:

.. currentmodule:: tethne.classes.corpus

.. autosummary::
   :nosignatures:
   
   Corpus.abstract_to_features
   Corpus.add_features
   Corpus.apply_stoplist
   Corpus.feature_counts
   Corpus.filter_features
   Corpus.transform
   
Analyzing featuresets
`````````````````````

The following methods are useful for inspecting the distribution of features across a
:class:`.Corpus`\:

.. currentmodule:: tethne.classes.corpus

.. autosummary::
   :nosignatures: 
                      
   Corpus.feature_distribution  
   Corpus.plot_distribution
   
The :mod:`.analyze.features` module provides methods for calculating similarity or
distance between :class:`.Paper`\s based on features.