Working with Corpora¶
Building a Corpus is the starting point for working with bibliographic data in Tethne. Corpus objects encapsulate and index Papers and featuresets, and provide mechanisms for analyzing data diachronically.
The Corpus class lives in tethne.classes, but can be imported directly from tethne:
>>> from tethne import Corpus
Creating Corpora¶
Minimally, a Corpus requires a set of Papers.
>>> from tethne.readers import wos
>>> papers = wos.read('/path/to/wosdata.txt') # Load some data.
>>> MyCorpus = Corpus(papers)
Indexing¶
By default, Papers and their cited references are indexed by ‘ayjid’, an identifier generated from the first author's name, the publication year, and the journal name of each entry.
You can use alternate indexing fields for Papers and their cited references:
>>> MyCorpus = Corpus(papers, index_by='wosid', index_citation_by='ayjid')
Which index fields are available for Papers (index_by) depends on the data source:
Source | Fields |
---|---|
Web of Science | wosid, ayjid |
Scopus | eid, ayjid |
JSTOR DfR | doi, ayjid |
ayjid is a good option if you plan to integrate data from multiple data sources. It is also your best option for index_citation_by, unless you’re confident that all of your cited references include alternate identifiers (this is rare).
By default, a Corpus calls its own Corpus.index() method on instantiation. This results in a few useful attributes:
Attribute | Type/Description |
---|---|
papers | A dictionary mapping Paper IDs onto Paper instances. |
authors | A dictionary mapping author names onto lists of Paper IDs. |
citations | A dictionary mapping citation IDs onto cited references (themselves Paper instances), if citation data are available. |
papers_citing | A dictionary mapping citation IDs onto lists of citing Papers (by ID) in the dataset, if citation data are available. |
If the Papers in the Corpus contain cited references, then a featureset called citations will also be created.
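For example, once the Corpus has been indexed you can look up Papers, authors, and citing papers directly. The key strings below are hypothetical; actual keys depend on your data and on your index_by choice:
>>> MyCorpus.papers['WOS:000187105800005']            # a single Paper, keyed by its index field
>>> MyCorpus.authors['DARWIN C']                      # list of Paper IDs for this author
>>> MyCorpus.papers_citing['DARWIN_C_1859_ORIGIN']    # IDs of Papers that cite this reference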
Featuresets¶
In Tethne, a feature is a scalar property of one or more documents in a Corpus. The most straightforward example of a feature is a word, which can occur some number of times (>= 0) in a document.
A featureset is a set of data structures that describe the distribution of features over Papers in a corpus. For example, a Corpus might contain a featureset describing the distribution of words or citations over its Papers.
In Tethne v0.6.0-beta, featuresets are simply dictionaries contained in the Corpus's features attribute. Each featureset should contain the following keys and values:
Key | Value Type/Description |
---|---|
index | Dictionary mapping integer IDs onto string representations of features. For wordcounts, think of this as a vocabulary. |
features | Dictionary mapping Paper IDs onto sparse feature vectors (e.g. wordcounts). These vectors are lists of (feature index, value) tuples. See Sparse feature vector, below. |
counts | Dictionary mapping feature indices (in index) onto the sum of values from features. For wordcounts, for example, this is the total number of times that a word occurs in the Corpus. |
documentCounts | Dictionary mapping feature indices (in index) onto the number of Papers in which the feature occurs (e.g. the number of documents containing that word). |
papers | Dictionary mapping feature indices onto sparse vectors over Paper IDs. Similar to the sparse feature vectors in features, except that column indices are Paper IDs instead of feature indices. |
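For example, assuming a featureset called 'wordcounts' exists (as in the example below), you can inspect corpus-wide and per-document frequencies like this:
>>> wordcounts = MyCorpus.features['wordcounts']
>>> wordcounts['counts'][0]             # total occurrences of feature 0 across the Corpus
>>> wordcounts['documentCounts'][0]     # number of Papers in which feature 0 occurs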
Sparse feature vector¶
The number of features associated with a Paper is potentially much smaller than the total number of features in a given featureset. For example, a set of documents may have a vocabulary of tens of thousands of words, only a few hundred of which appear in any one document. Thus, if we were to represent a featureset as a matrix of papers and the words that they contain, most of the cells of that matrix would contain zeros.
To save space on disk and in memory, Tethne represents featuresets sparsely, meaning that only non-zero values are stored. In Tethne, a sparse feature vector is a list of (feature, value) tuples.
Consider the document: the cow jumped over the moon. We can represent this document in terms of a wordcount vector:
[ ('the', 2), ('cow', 1), ('jumped', 1), ('over', 1), ('moon', 1) ]
Of course, features are usually indexed by integer IDs. The mapping between integer IDs and feature strings is stored in the featureset’s 'index' dictionary, which might look something like this:
>>> MyCorpus.features['wordcounts']['index']
{0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'}
So the wordcount vector for this paper would look like:
[ (0, 2), (1, 1), (2, 1), (3, 1), (4, 1) ]
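If you need a dense representation (e.g. to hand the data to NumPy or scikit-learn), you can expand a sparse vector against the featureset’s index yourself. This is just a sketch, not part of Tethne's API:
>>> index = MyCorpus.features['wordcounts']['index']
>>> sparse = [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]
>>> dense = [0] * len(index)            # one cell per feature in the vocabulary
>>> for feature, value in sparse:
...     dense[feature] = value
>>> dense
[2, 1, 1, 1, 1]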
Generating and modifying featuresets¶
The following methods in Corpus can be used to generate and modify featuresets:
Method | Description |
---|---|
Corpus.abstract_to_features | Generate a unigram (wordcount) featureset from the abstracts of all Papers in the Corpus (if available). |
Corpus.add_features | Add a new featureset to the Corpus. |
Corpus.apply_stoplist | Apply a stoplist to the featureset fold, resulting in a new featureset fnew. |
Corpus.feature_counts | Get the frequency of a feature in a particular slice of an axis. |
Corpus.filter_features | Create a new featureset by applying a filter to an existing featureset. |
Corpus.transform | Transform the values in featureset fold, creating a new featureset fnew. |
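For example, a typical workflow might generate wordcounts from Paper abstracts and then remove common stopwords. The snippet below is only a sketch based on the descriptions above: the name of the generated featureset ('abstractTerms'), the exact call signatures, and the use of NLTK's stoplist are all assumptions, so check the API documentation for your version.
>>> from nltk.corpus import stopwords              # assumes NLTK is installed
>>> stoplist = stopwords.words('english')
>>> MyCorpus.abstract_to_features()                # wordcounts from abstracts; featureset name assumed below
>>> MyCorpus.apply_stoplist('abstractTerms', 'abstractTerms_stop', stoplist)   # fold, fnew, stoplist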
Analyzing featuresets¶
The following methods are useful for inspecting the distribution of features across a Corpus:
Method | Description |
---|---|
Corpus.feature_distribution | Get the distribution of a feature over one or two slice axes. |
Corpus.plot_distribution | Plot the distribution of papers or features along slice axes, using matplotlib. |
The analyze.features module provides methods for calculating similarity or distance between Papers based on features.
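As a rough illustration of what this involves, here is a hand-rolled cosine similarity over two Papers' sparse wordcount vectors. The Paper IDs below are hypothetical, and this sketch is independent of the implementations in analyze.features:
>>> import math
>>> def cosine_similarity(sparse_a, sparse_b):
...     """Cosine similarity between two sparse (feature, value) vectors."""
...     a, b = dict(sparse_a), dict(sparse_b)
...     dot = sum(v * b.get(f, 0) for f, v in a.items())
...     norm_a = math.sqrt(sum(v ** 2 for v in a.values()))
...     norm_b = math.sqrt(sum(v ** 2 for v in b.values()))
...     return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
...
>>> vectors = MyCorpus.features['wordcounts']['features']
>>> cosine_similarity(vectors['PAPER_A'], vectors['PAPER_B'])    # Paper IDs are hypothetical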