.. _working-with-corpora: Working with Corpora ==================== Building a :class:`.Corpus` is the starting-point for working with bibliographic data in Tethne. :class:`.Corpus` objects encapsulate and index :class:`.Paper`\s and :ref:`corpora-featuresets`\, and mechanisms for analyzing data diachronically. The :class:`.Corpus` class lives in :mod:`tethne.classes`\, but can be imported directly from :mod:`tethne`\: .. code-block:: python >>> from tethne import Corpus Creating Corpora ---------------- Minimally, a :class:`.Corpus` requires a set of :class:`.Paper`\s. .. code-block:: python >>> from tethne.readers import wos >>> papers = wos.read('/path/to/wosdata.txt') # Load some data. >>> MyCorpus = Corpus(papers) Indexing ```````` By default, :class:`.Paper`\s and their cited references are indexed by 'ayjid', an identifier generated from the first-author, publication date, and journal name of each entry. You can use alternate indexing fields for :class:`.Paper`\s and their cited references: .. code-block:: python >>> MyCorpus = Corpus(papers, index_by='wosid', index_citation_by='ayjid') These are (usually) your options for index fields for :class:`.Papers` (``index_by``): ============== ============= Source Fields ============== ============= Web of Science wosid, ayjid Scopus eid, ayjid JSTOR DfR doi, ayjid ============== ============= It should be obvious that ``ayjid`` is a good option if you plan to integrate data from multiple datasources. ``ayjid`` is your best option for ``index_citations_by``, unless you're confident that all cited references include alternate identifiers (this is rare). By default, a :class:`.Corpus` calls its own :func:`Corpus.index` method on instantiation. This results in a few useful attributes: ============= ====================================================================== Attribute Type/Description ============= ====================================================================== papers A dictionary mapping :class:`.Paper` IDs onto :class:`.Paper` instances. authors A dictionary mapping author names onto lists of :class:`.Paper` IDs. citations A dictionary mapping citation IDs onto cited references (themselves :class:`.Paper` instances), if data available. papers_citing A dictionary mapping citation IDs onto lists of citing :class:`.Paper`\s (by ID) in the dataset, if data available. ============= ====================================================================== If the :class:`.Paper`\s in the :class:`.Corpus` contain cited references, then a featureset called ``citations`` will also be created. Directly from data `````````````````` All of the modules in :mod:`.readers` should include methods to generate a :class:`.Corpus` directly from data: * :func:`.dfr.read_corpus` * :func:`.scopus.read_corpus` * :func:`.wos.read_corpus` .. _corpora-featuresets: Featuresets ----------- In Tethne, a feature is a scalar property of one or more document in a :class:`.Corpus`\. The most straightforward example of a feature is a word, which can occur some number of times ( >= 0 ) in a document. A featureset is a set of data structures that describe the distribution of features over :class:`.Paper`\s in a corpus. For example, a :class:`.Corpus` might contain a featureset describing the distribution of words or citations over its :class:`.Paper`\s. In Tethne v0.6.0-beta, featuresets are simply dictionaries contained in the :class:`.Corpus`\' ``features`` attribute. Each featureset should contain the following keys and values: +-------------------+--------------------------------------------------------------------+ | Key | Value Type/Description | +===================+====================================================================+ | ``index`` | Dictionary mapping integer IDs onto string representations of | | | features. For wordcounts, think of this as a vocabulary. | +-------------------+--------------------------------------------------------------------+ | ``features`` | Dictionary mapping :class:`.Paper` IDs onto sparse feature vectors | | | (e.g. wordcounts). These vectors are lists of ( feature index, | | | value ) tuples. See :ref:`sparse-feature-vectors`\. | +-------------------+--------------------------------------------------------------------+ | ``counts`` | Dictionary mapping feature indices (in ``index``) onto the sum of | | | values from ``features``. For wordcounts, for example, this is the | | | total number of times that a word occurs in the :class:`.Corpus`\. | +-------------------+--------------------------------------------------------------------+ | ``documentCounts``| Dictionary mapping feature indices (in ``index``) onto the number | | | of :class:`.Paper`\s in which the feature occurs (e.g. the number | | | of documents containing that word). | +-------------------+--------------------------------------------------------------------+ | ``papers`` | Dictionary mapping feature indices onto sparse vectors over | | | :class:`.Paper` IDs. Similar to :ref:`sparse-feature-vectors`\, | | | except that column indices are :class:`.Paper` IDs instead of | | | feature indices. | +-------------------+--------------------------------------------------------------------+ .. _sparse-feature-vector: Sparse feature vector ````````````````````` The number of features associated with a :class:`.Paper` is potentially much smaller than the total number of features in a given featureset. For example, a set of documents may have a vocabulary of tens of thousands of words, only a few hundred of which appear in any one document alone. Thus ff we were to represent a featureset as a matrix of papers and the words that they contain, most of the cells of that matrix would contain zeros. To save space on disk and in memory, Tethne represents featuresets sparsely, meaning that only non-zero values are stored. In Tethne, a sparse feature vector is a list of (feature,value) tuples. Consider the document: ``the cow jumped over the moon``. We can represent this document in terms of a wordcount vector: .. code-block:: python [ ('the', 2), ('cow', 1), ('jumped', 1), ('over', 1), ('moon', 1) ] Of course, features are usually indexed by integer IDs. The mapping between integer IDs and feature strings are stored in the featureset's ``'index'`` dictionary, which might look something like: .. code-block:: python >>> MyCorpus.features['wordcounts']['index'] {0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'} So the wordcount vector for this paper would look like: .. code-block:: python [ (0, 2), (1, 1), (2, 1), (3, 1), (4, 1) ] Generating and modifying featuresets ```````````````````````````````````` The following methods in :class:`.Corpus` can be used to generate and modify featuresets: .. currentmodule:: tethne.classes.corpus .. autosummary:: :nosignatures: Corpus.abstract_to_features Corpus.add_features Corpus.apply_stoplist Corpus.feature_counts Corpus.filter_features Corpus.transform Analyzing featuresets ````````````````````` The following methods are useful for inspecting the distribution of features across a :class:`.Corpus`\: .. currentmodule:: tethne.classes.corpus .. autosummary:: :nosignatures: Corpus.feature_distribution Corpus.plot_distribution The :mod:`.analyze.features` module provides methods for calculating similarity or distance between :class:`.Paper`\s based on features.