Working with Corpora

Building a Corpus is the starting point for working with bibliographic data in Tethne. Corpus objects encapsulate and index Papers and featuresets, and provide mechanisms for analyzing data diachronically.

The Corpus class lives in tethne.classes, but can be imported directly from tethne:

>>> from tethne import Corpus

Creating Corpora

Minimally, a Corpus requires a set of Papers.

>>> from tethne.readers import wos
>>> papers = wos.read('/path/to/wosdata.txt')    # Load some data.

>>> MyCorpus = Corpus(papers)

Indexing

By default, Papers and their cited references are indexed by ‘ayjid’, an identifier generated from the first author, publication date, and journal name of each entry.

You can use alternate indexing fields for Papers and their cited references:

>>> MyCorpus = Corpus(papers, index_by='wosid', index_citation_by='ayjid')

These are (usually) your options for index fields for Papers (index_by):

Source           Fields
Web of Science   wosid, ayjid
Scopus           eid, ayjid
JSTOR DfR        doi, ayjid

Because ayjid is available for every source, it is a good option if you plan to integrate data from multiple data sources. ayjid is also your best option for index_citation_by, unless you’re confident that all cited references include alternate identifiers (this is rare).

By default, a Corpus calls its own Corpus.index() method on instantiation. This results in a few useful attributes:

Attribute       Type/Description
papers          A dictionary mapping Paper IDs onto Paper instances.
authors         A dictionary mapping author names onto lists of Paper IDs.
citations       A dictionary mapping citation IDs onto cited references (themselves Paper instances), if data available.
papers_citing   A dictionary mapping citation IDs onto lists of citing Papers (by ID) in the dataset, if data available.

If the Papers in the Corpus contain cited references, then a featureset called citations will also be created.
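
To make the shape of these index attributes concrete, here is a minimal sketch in plain Python. It does not use Tethne or reproduce Corpus.index(); the records, their keys, and the IDs are hypothetical stand-ins for Paper instances.

```python
from collections import defaultdict

# Hypothetical stand-ins for Paper instances: plain dicts with assumed keys.
records = [
    {"id": "P1", "authors": ["SMITH J", "DOE A"], "citations": ["C1", "C2"]},
    {"id": "P2", "authors": ["DOE A"], "citations": ["C2"]},
]

papers = {}                        # Paper ID -> record
authors = defaultdict(list)        # author name -> list of Paper IDs
papers_citing = defaultdict(list)  # citation ID -> list of citing Paper IDs

for record in records:
    papers[record["id"]] = record
    for name in record["authors"]:
        authors[name].append(record["id"])
    for cited_id in record["citations"]:
        papers_citing[cited_id].append(record["id"])

print(dict(authors))
print(dict(papers_citing))
```

Each index is an ordinary dictionary, so looking up the Papers by an author or the Papers citing a given reference is a single key access.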

Directly from data

All of the modules in readers should include methods to generate a Corpus directly from data.

Featuresets

In Tethne, a feature is a scalar property of one or more documents in a Corpus. The most straightforward example of a feature is a word, which can occur some number of times (>= 0) in a document.

A featureset is a set of data structures that describe the distribution of features over Papers in a corpus. For example, a Corpus might contain a featureset describing the distribution of words or citations over its Papers.

In Tethne v0.6.0-beta, featuresets are simply dictionaries contained in the Corpus.features attribute. Each featureset should contain the following keys and values:

Key              Value Type/Description
index            Dictionary mapping integer IDs onto string representations of features. For wordcounts, think of this as a vocabulary.
features         Dictionary mapping Paper IDs onto sparse feature vectors (e.g. wordcounts). These vectors are lists of (feature index, value) tuples. See the section Sparse feature vector.
counts           Dictionary mapping feature indices (in index) onto the sum of values from features. For wordcounts, for example, this is the total number of times that a word occurs in the Corpus.
documentCounts   Dictionary mapping feature indices (in index) onto the number of Papers in which the feature occurs (e.g. the number of documents containing that word).
papers           Dictionary mapping feature indices onto sparse vectors over Paper IDs. Similar to sparse feature vectors, except that column indices are Paper IDs instead of feature indices.
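
As an illustration of how these five keys relate to one another, the following sketch builds a wordcount-style featureset from two tokenized documents using only the standard library. It is not Tethne's own code; the Paper IDs, the documents, and the helper name `lookup` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical tokenized documents, keyed by made-up Paper IDs.
docs = {
    "P1": ["the", "cow", "jumped", "over", "the", "moon"],
    "P2": ["the", "moon", "rose"],
}

index = {}                   # feature index -> feature string
lookup = {}                  # feature string -> feature index (helper only)
features = {}                # Paper ID -> [(feature index, value), ...]
counts = Counter()           # feature index -> sum of values over all Papers
document_counts = Counter()  # feature index -> number of Papers with the feature
papers = defaultdict(list)   # feature index -> [(Paper ID, value), ...]

for paper_id, tokens in docs.items():
    vector = []
    for token, n in Counter(tokens).items():
        if token not in lookup:          # first time we see this word
            lookup[token] = len(index)
            index[lookup[token]] = token
        i = lookup[token]
        vector.append((i, n))
        counts[i] += n
        document_counts[i] += 1
        papers[i].append((paper_id, n))
    features[paper_id] = vector

featureset = {
    "index": index,
    "features": features,
    "counts": dict(counts),
    "documentCounts": dict(document_counts),
    "papers": dict(papers),
}
print(featureset["features"]["P2"])
```

Note that features and papers hold the same values viewed from two directions: the first is keyed by Paper, the second by feature.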

Sparse feature vector

The number of features associated with a Paper is potentially much smaller than the total number of features in a given featureset. For example, a set of documents may have a vocabulary of tens of thousands of words, only a few hundred of which appear in any one document. Thus if we were to represent a featureset as a matrix of papers and the words that they contain, most of the cells of that matrix would contain zeros.

To save space on disk and in memory, Tethne represents featuresets sparsely, meaning that only non-zero values are stored. In Tethne, a sparse feature vector is a list of (feature,value) tuples.

Consider the document: the cow jumped over the moon. We can represent this document in terms of a wordcount vector:

[ ('the', 2), ('cow', 1), ('jumped', 1), ('over', 1), ('moon', 1) ]

Of course, features are usually indexed by integer IDs. The mapping between integer IDs and feature strings is stored in the featureset’s 'index' dictionary, which might look something like:

>>> MyCorpus.features['wordcounts']['index']
{0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'}

So the wordcount vector for this paper would look like:

[ (0, 2), (1, 1), (2, 1), (3, 1), (4, 1) ]
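
The relationship between the integer-indexed vector, the string form, and a dense row of the full matrix can be sketched as follows. The functions `decode` and `to_dense` are hypothetical helpers written for this example, not part of Tethne, and the last two vocabulary entries are invented to show where the zeros come from.

```python
# Vocabulary from the example above, plus two invented words that do not
# occur in the document.
index = {0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon',
         5: 'sun', 6: 'dog'}
vector = [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]


def decode(vector, index):
    """Replace feature indices with their string representations."""
    return [(index[i], value) for i, value in vector]


def to_dense(vector, size):
    """Expand a sparse vector into a list with one cell per feature."""
    dense = [0] * size
    for i, value in vector:
        dense[i] = value
    return dense


print(decode(vector, index))
print(to_dense(vector, len(index)))
```

The dense form needs one cell per vocabulary entry, whereas the sparse form needs one tuple per feature that actually occurs in the document.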

Generating and modifying featuresets

The following methods in Corpus can be used to generate and modify featuresets:

Corpus.abstract_to_features   Generates a unigram (wordcount) featureset from the abstracts of all Papers in the Corpus (if available).
Corpus.add_features           Add a new featureset to the Corpus.
Corpus.apply_stoplist         Apply stoplist to the featureset fold, resulting in featureset fnew.
Corpus.feature_counts         Get the frequency of a feature in a particular slice of axis.
Corpus.filter_features        Create a new featureset by applying a filter to an existing featureset.
Corpus.transform              Transform values in featureset fold, creating a new featureset fnew.
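
The general idea behind deriving a new featureset from an old one, as when applying a stoplist, can be sketched on plain sparse vectors. This example does not call Corpus.apply_stoplist, whose signature is not shown here; `drop_stopwords` is a hypothetical helper, and the Paper ID and stoplist are made up.

```python
def drop_stopwords(features, index, stoplist):
    """Return new sparse vectors without features whose string is in stoplist.

    The input dictionaries are left unchanged, mirroring the idea of
    producing a new featureset from an old one.
    """
    stop_ids = {i for i, term in index.items() if term in stoplist}
    return {
        paper_id: [(i, value) for i, value in vector if i not in stop_ids]
        for paper_id, vector in features.items()
    }


index = {0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'}
features = {'P1': [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]}

filtered = drop_stopwords(features, index, {'the', 'over'})
print(filtered)
```

A complete featureset would also need its counts, documentCounts, and papers entries recomputed from the filtered vectors.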

Analyzing featuresets

The following methods are useful for inspecting the distribution of features across a Corpus:

Corpus.feature_distribution   Get the distribution of a feature over one or two slice axes.
Corpus.plot_distribution      Plot distribution of papers or features along slice axes, using Matplotlib.

The analyze.features module provides methods for calculating similarity or distance between Papers based on features.
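
As one example of what a feature-based similarity measure looks like, the sketch below computes cosine similarity directly on two sparse feature vectors. It is a generic illustration, not the analyze.features API; the function name and the sample vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors of (index, value) tuples."""
    da, db = dict(a), dict(b)
    # Only features present in both vectors contribute to the dot product.
    dot = sum(value * db[i] for i, value in da.items() if i in db)
    norm_a = math.sqrt(sum(value * value for value in da.values()))
    norm_b = math.sqrt(sum(value * value for value in db.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


paper_1 = [(0, 2), (1, 1), (4, 1)]  # hypothetical wordcount vectors
paper_2 = [(0, 1), (4, 1), (5, 1)]

print(cosine_similarity(paper_1, paper_2))
```

Because only shared feature indices enter the dot product, the cost depends on the number of non-zero entries rather than on the size of the vocabulary.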