tethne.readers.dfr module

Methods for parsing JSTOR Data-for-Research datasets.

read(datapath) Yields Paper objects from a JSTOR DfR package.
ngrams(datapath[, N, ignore_hash, mode]) Yields N-grams from a JSTOR DfR dataset.
read_corpus(path[, features, exclude]) Generate a Corpus from a single DfR dataset.
from_dir(path) Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.
ngrams_from_dir(path[, N, ignore_hash, mode]) Load ngrams from a directory of JSTOR DfR datasets.
corpus_from_dir(path[, features, exclude]) Generate a Corpus from a directory containing multiple DfR datasets.
class tethne.readers.dfr.GramGenerator(path, elem, values=False, keys=False, ignore_hash=True)[source]

Bases: object

Yields N-gram data from an on-disk dataset, to make loading large datasets more memory-friendly.

Reusable, in the sense that items(), iteritems(), keys(), and values() all return new GramGenerator instances with the same path. This allows a GramGenerator to sneakily pass as an ngrams dict in most practical situations.

items()[source]

Returns a GramGenerator that produces (key, value) tuples.

iteritems()[source]

Returns a GramGenerator that produces (key, value) tuples.

keys()[source]

Returns a GramGenerator that produces only keys.

next()[source]

Returns the next item from the generator.

values()[source]

Returns a GramGenerator that produces only values.
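
Because items(), iteritems(), keys(), and values() each return a fresh GramGenerator over the same path, the same data can be traversed more than once. A minimal sketch, assuming a placeholder dataset path and the (Ngram, frequency) value structure documented for ngrams() below:

>>> from tethne.readers import dfr
>>> grams = dfr.ngrams('/path/to/DfR', N='uni', mode='light')  # GramGenerator
>>> for doi, gramlist in grams.iteritems():  # fresh generator for this call
...     first = (doi, gramlist[0])           # first (Ngram, frequency) tuple
...     break
>>> keys = list(grams.keys())                # safe to traverse again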

tethne.readers.dfr.corpus_from_dir(path, features=None, exclude=None, **kwargs)[source]

Generate a Corpus from a directory containing multiple DfR datasets.

If features is provided (see below), will also load ngrams.

Parameters:

path : string

Path to directory containing DfR dataset directories.

features : list

List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from dataset.

exclude : list

Stoplist for feature-grams.

**kwargs

Use this to pass kwargs to ngrams().

Returns:

Corpus

Examples

>>> from nltk.corpus import stopwords    # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist)
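
Keyword arguments are forwarded to ngrams(); for example, to keep N-grams containing the hash '#' character (excluded by default):

>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist,
...                         ignore_hash=False)
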
tethne.readers.dfr.from_dir(path)[source]

Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.

Parameters:

path : string

Path to directory containing DfR dataset directories.

Returns:

papers : list

A list of Paper objects.

Raises:

IOError

Invalid path.

Examples

>>> from tethne.readers import dfr
>>> papers = dfr.from_dir("/Path/to/datadir")

tethne.readers.dfr.ngrams(datapath, N='uni', ignore_hash=True, mode='heavy')[source]

Yields N-grams from a JSTOR DfR dataset.

Parameters:

datapath : string

Path to unzipped JSTOR DfR folder containing N-grams (e.g. ‘bigrams’).

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

mode : str

If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.

Returns:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> from tethne.readers import dfr
>>> trigrams = dfr.ngrams("/Path/to/DfR", N='tri')
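
In the default 'heavy' mode the result is an ordinary dict keyed by DOI, so the documented (Ngram, frequency) lists can be consumed directly; continuing the example above:

>>> for doi, grams in trigrams.items():  # grams: list of (Ngram, frequency)
...     counts = dict(grams)             # frequency lookup by gram
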
tethne.readers.dfr.ngrams_from_dir(path, N='uni', ignore_hash=True, mode='heavy')[source]

Load ngrams from a directory of JSTOR DfR datasets.

Parameters:

path : string

Path to directory containing DfR dataset directories.

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

mode : str

If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.

Returns:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> from tethne.readers import dfr
>>> ngrams = dfr.ngrams_from_dir("/Path/to/datadir", 'uni')
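
With mode='light' the same call returns a GramGenerator instead, which avoids holding every dataset in memory at once; a sketch with a placeholder path:

>>> light = dfr.ngrams_from_dir("/Path/to/datadir", 'uni', mode='light')
>>> for doi, grams in light.iteritems():  # one document at a time
...     pass
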
tethne.readers.dfr.read(datapath)[source]

Yields Paper objects from a JSTOR DfR package.

Each Paper is tagged with an accession id for this read/conversion.

Parameters:

datapath : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

Returns:

papers : list

A list of Paper objects.

Examples

>>> from tethne.readers import dfr
>>> papers = dfr.read("/Path/to/DfR")
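
read() and ngrams() operate on the same unzipped dataset folder, and ngrams() keys its results by DOI, so metadata and N-gram counts for one package can be loaded side by side; a minimal sketch:

>>> papers = dfr.read("/Path/to/DfR")
>>> unigrams = dfr.ngrams("/Path/to/DfR", N='uni')
>>> len(papers), len(unigrams)  # a list of Paper objects; a dict keyed by DOI
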
tethne.readers.dfr.read_corpus(path, features=None, exclude=None, **kwargs)[source]

Generate a Corpus from a single DfR dataset.

If features is provided (see below), will also load ngrams.

Parameters:

path : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

features : list

List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from dataset.

exclude : list

Stoplist for feature-grams.

**kwargs

Use this to pass kwargs to ngrams().

Returns:

Corpus

Examples

>>> from nltk.corpus import stopwords    # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> MyCorpus = dfr.read_corpus("/Path/to/DfR", ['uni'], stoplist)
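
Keyword arguments are likewise forwarded to ngrams(). Because a GramGenerator is designed to pass as an ngrams dict, mode='light' may also work here to reduce memory use on large datasets; a hedged sketch:

>>> MyCorpus = dfr.read_corpus("/Path/to/DfR", ['uni', 'bi'], stoplist,
...                            mode='light')
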
tethne.readers.dfr.tokenize(ngrams, min_tf=2, min_df=2, min_len=3, apply_stoplist=False)[source]

Builds a vocabulary, and replaces words with vocab indices.

Parameters:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

min_tf : int

Minimum term frequency; grams occurring fewer than min_tf times in total are excluded.

min_df : int

Minimum document frequency; grams appearing in fewer than min_df documents are excluded.

min_len : int

Minimum gram length, in characters; shorter grams are excluded.

apply_stoplist : bool

If True, will exclude all N-grams that contain words in the NLTK stoplist.

Returns:

t_ngrams : dict

Tokenized ngrams, as doi:{i:count}.

vocab : dict

Vocabulary as i:term.

token_tf : Counter

Term counts for corpus, as i:count.
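
Examples

A minimal end-to-end sketch, assuming the three documented values are returned together as a tuple:

>>> from tethne.readers import dfr
>>> unigrams = dfr.ngrams("/Path/to/DfR", N='uni')
>>> t_ngrams, vocab, token_tf = dfr.tokenize(unigrams, apply_stoplist=True)
>>> len(vocab)  # number of distinct terms retained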