Working with Corpora¶
Building a Corpus is the starting point for working with bibliographic data in Tethne. Corpus objects encapsulate and index Papers and featuresets, and provide mechanisms for analyzing data diachronically.
The Corpus class lives in tethne.classes, but can be imported directly from tethne:
>>> from tethne import Corpus
Creating Corpora¶
Minimally, a Corpus requires a set of Papers.
>>> from tethne.readers import wos
>>> papers = wos.read('/path/to/wosdata.txt') # Load some data.
>>> MyCorpus = Corpus(papers)
Indexing¶
By default, Papers and their cited references are indexed by ‘ayjid’, an identifier generated from the first author's name, the publication year, and the journal name of each entry.
You can use alternate indexing fields for Papers and their cited references:
>>> MyCorpus = Corpus(papers, index_by='wosid', index_citation_by='ayjid')
Which index fields are available for Papers (index_by) depends on the data source:
Source | Fields |
---|---|
Web of Science | wosid, ayjid |
Scopus | eid, ayjid |
JSTOR DfR | doi, ayjid |
ayjid is a good option if you plan to integrate data from multiple data sources. It is also your best option for index_citation_by, unless you’re confident that all of your cited references include alternate identifiers (this is rare).
By default, a Corpus calls its own Corpus.index() method on instantiation. This results in a few useful attributes:
Attribute | Type/Description |
---|---|
papers | A dictionary mapping Paper IDs onto Paper instances. |
authors | A dictionary mapping author names onto lists of Paper IDs. |
citations | A dictionary mapping citation IDs onto cited references (themselves Paper instances), if citation data are available. |
papers_citing | A dictionary mapping citation IDs onto lists of citing Papers (by ID) in the dataset, if citation data are available. |
If the Papers in the Corpus contain cited references, then a featureset called citations will also be created.
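For example, once the Corpus has been indexed you can look up Papers, authors, and citing papers directly. The key strings below are hypothetical; actual keys depend on your data and on your index_by choice:
>>> MyCorpus.papers['WOS:000187105800005']            # a single Paper, keyed by its index field
>>> MyCorpus.authors['DARWIN C']                      # list of Paper IDs for this author
>>> MyCorpus.papers_citing['DARWIN_C_1859_ORIGIN']    # IDs of Papers that cite this reference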
Featuresets¶
In Tethne, a feature is a scalar property of one or more documents in a Corpus. The most straightforward example of a feature is a word, which can occur some number of times (>= 0) in a document.
A featureset is a set of data structures that describe the distribution of features over Papers in a corpus. For example, a Corpus might contain a featureset describing the distribution of words or citations over its Papers.
In Tethne v0.6.0-beta, featuresets are simply dictionaries contained in the Corpus's features attribute. Each featureset should contain the following keys and values:
Key | Value Type/Description |
---|---|
index | Dictionary mapping integer IDs onto string representations of features. For wordcounts, think of this as a vocabulary. |
features | Dictionary mapping Paper IDs onto sparse feature vectors (e.g. wordcounts). These vectors are lists of (feature index, value) tuples. See Sparse feature vector, below. |
counts | Dictionary mapping feature indices (in index) onto the sum of values from features. For wordcounts, for example, this is the total number of times that a word occurs in the Corpus. |
documentCounts | Dictionary mapping feature indices (in index) onto the number of Papers in which the feature occurs (e.g. the number of documents containing that word). |
papers | Dictionary mapping feature indices onto sparse vectors over Paper IDs. Similar to the sparse feature vectors in features, except that column indices are Paper IDs instead of feature indices. |
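For example, assuming a featureset called 'wordcounts' exists (as in the example below), you can inspect corpus-wide and per-document frequencies like this:
>>> wordcounts = MyCorpus.features['wordcounts']
>>> wordcounts['counts'][0]             # total occurrences of feature 0 across the Corpus
>>> wordcounts['documentCounts'][0]     # number of Papers in which feature 0 occurs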
Sparse feature vector¶
The number of features associated with a Paper is potentially much smaller than the total number of features in a given featureset. For example, a set of documents may have a vocabulary of tens of thousands of words, only a few hundred of which appear in any one document. Thus, if we were to represent a featureset as a matrix of papers and the words that they contain, most of the cells of that matrix would contain zeros.
To save space on disk and in memory, Tethne represents featuresets sparsely, meaning that only non-zero values are stored. In Tethne, a sparse feature vector is a list of (feature, value) tuples.
Consider the document: the cow jumped over the moon. We can represent this document in terms of a wordcount vector:
[ ('the', 2), ('cow', 1), ('jumped', 1), ('over', 1), ('moon', 1) ]
Of course, features are usually indexed by integer IDs. The mapping between integer IDs and feature strings is stored in the featureset’s 'index' dictionary, which might look something like this:
>>> MyCorpus.features['wordcounts']['index']
{0: 'the', 1: 'cow', 2: 'jumped', 3: 'over', 4: 'moon'}
So the wordcount vector for this paper would look like:
[ (0, 2), (1, 1), (2, 1), (3, 1), (4, 1) ]
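If you need a dense representation (e.g. to hand the data to NumPy or scikit-learn), you can expand a sparse vector against the featureset’s index yourself. This is just a sketch, not part of Tethne's API:
>>> index = MyCorpus.features['wordcounts']['index']
>>> sparse = [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]
>>> dense = [0] * len(index)            # one cell per feature in the vocabulary
>>> for feature, value in sparse:
...     dense[feature] = value
>>> dense
[2, 1, 1, 1, 1]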
Generating and modifying featuresets¶
The following methods in Corpus can be used to generate and modify featuresets:
Method | Description |
---|---|
Corpus.abstract_to_features | Generate a unigram (wordcount) featureset from the abstracts of all Papers in the Corpus (if available). |
Corpus.add_features | Add a new featureset to the Corpus. |
Corpus.apply_stoplist | Apply a stoplist to the featureset fold, resulting in a new featureset fnew. |
Corpus.feature_counts | Get the frequency of a feature in a particular slice of an axis. |
Corpus.filter_features | Create a new featureset by applying a filter to an existing featureset. |
Corpus.transform | Transform the values in featureset fold, creating a new featureset fnew. |
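For example, a typical workflow might generate wordcounts from Paper abstracts and then remove common stopwords. The snippet below is only a sketch based on the descriptions above: the name of the generated featureset ('abstractTerms'), the exact call signatures, and the use of NLTK's stoplist are all assumptions, so check the API documentation for your version.
>>> from nltk.corpus import stopwords              # assumes NLTK is installed
>>> stoplist = stopwords.words('english')
>>> MyCorpus.abstract_to_features()                # wordcounts from abstracts; featureset name assumed below
>>> MyCorpus.apply_stoplist('abstractTerms', 'abstractTerms_stop', stoplist)   # fold, fnew, stoplist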
Analyzing featuresets¶
The following methods are useful for inspecting the distribution of features across a Corpus:
Method | Description |
---|---|
Corpus.feature_distribution | Get the distribution of a feature over one or two slice axes. |
Corpus.plot_distribution | Plot the distribution of papers or features along slice axes, using matplotlib. |
The analyze.features module provides methods for calculating similarity or distance between Papers based on features.
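As a rough illustration of what this involves, here is a hand-rolled cosine similarity over two Papers' sparse wordcount vectors. The Paper IDs below are hypothetical, and this sketch is independent of the implementations in analyze.features:
>>> import math
>>> def cosine_similarity(sparse_a, sparse_b):
...     """Cosine similarity between two sparse (feature, value) vectors."""
...     a, b = dict(sparse_a), dict(sparse_b)
...     dot = sum(v * b.get(f, 0) for f, v in a.items())
...     norm_a = math.sqrt(sum(v ** 2 for v in a.values()))
...     norm_b = math.sqrt(sum(v ** 2 for v in b.values()))
...     return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
...
>>> vectors = MyCorpus.features['wordcounts']['features']
>>> cosine_similarity(vectors['PAPER_A'], vectors['PAPER_B'])    # Paper IDs are hypothetical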