tethne.readers.dfr module

Methods for parsing JSTOR Data-for-Research datasets.

read(datapath) Yields Paper objects from a JSTOR DfR package.
ngrams(datapath[, N, ignore_hash, mode]) Yields N-grams from a JSTOR DfR dataset.
read_corpus(path[, features, exclude]) Generate a Corpus from a single DfR dataset.
from_dir(path) Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.
ngrams_from_dir(path[, N, ignore_hash, mode]) Load ngrams from a directory of JSTOR DfR datasets.
corpus_from_dir(path[, features, exclude]) Generate a Corpus from a directory containing multiple DfR datasets.
class tethne.readers.dfr.GramGenerator(path, elem, values=False, keys=False, ignore_hash=True)[source]

Bases: object

Yields N-gram data from an on-disk dataset, to make loading large datasets more memory-friendly.

Reusable, in the sense that items(), iteritems(), keys(), and values() all return new GramGenerator instances with the same path. This allows a GramGenerator to sneakily pass as an ngrams dict in most practical situations.

items()[source]

Returns a GramGenerator that produces (key, value) tuples.

iteritems()[source]

Returns a GramGenerator that produces (key, value) tuples.

keys()[source]

Returns a GramGenerator that produces only keys.

next()[source]

Returns the next item from the generator.

values()[source]

Returns a GramGenerator that produces only values.
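
Because items(), iteritems(), keys(), and values() each return a fresh GramGenerator over the same path, the same data can be traversed more than once. A minimal sketch, assuming a placeholder dataset path and the (Ngram, frequency) value structure documented for ngrams() below:

>>> from tethne.readers import dfr
>>> grams = dfr.ngrams('/path/to/DfR', N='uni', mode='light')  # GramGenerator
>>> for doi, gramlist in grams.iteritems():  # fresh generator for this call
...     first = (doi, gramlist[0])           # first (Ngram, frequency) tuple
...     break
>>> keys = list(grams.keys())                # safe to traverse again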

tethne.readers.dfr.corpus_from_dir(path, features=None, exclude=None, **kwargs)[source]

Generate a Corpus from a directory containing multiple DfR datasets.

If features is provided (see below), will also load ngrams.

Parameters:

path : string

Path to directory containing DfR dataset directories.

features : list

List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from dataset.

exclude : list

Stoplist for feature-grams.

**kwargs

Use this to pass kwargs to ngrams().

Returns:

Corpus

Examples

>>> from nltk.corpus import stopwords    # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist)
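
Keyword arguments are forwarded to ngrams(); for example, to keep N-grams containing the hash '#' character (excluded by default):

>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist,
...                         ignore_hash=False)
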
tethne.readers.dfr.from_dir(path)[source]

Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.

Parameters:

path : string

Path to directory containing DfR dataset directories.

Returns:

papers : list

A list of Paper objects.

Raises:

IOError

Invalid path.

Examples

>>> from tethne.readers import dfr
>>> papers = dfr.from_dir("/Path/to/datadir")

tethne.readers.dfr.ngrams(datapath, N='uni', ignore_hash=True, mode='heavy')[source]

Yields N-grams from a JSTOR DfR dataset.

Parameters:

datapath : string

Path to unzipped JSTOR DfR folder containing N-grams (e.g. ‘bigrams’).

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

mode : str

If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.

Returns:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> from tethne.readers import dfr
>>> trigrams = dfr.ngrams("/Path/to/DfR", N='tri')
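
In the default 'heavy' mode the result is an ordinary dict keyed by DOI, so the documented (Ngram, frequency) lists can be consumed directly; continuing the example above:

>>> for doi, grams in trigrams.items():  # grams: list of (Ngram, frequency)
...     counts = dict(grams)             # frequency lookup by gram
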
tethne.readers.dfr.ngrams_from_dir(path, N='uni', ignore_hash=True, mode='heavy')[source]

Load ngrams from a directory of JSTOR DfR datasets.

Parameters:

path : string

Path to directory containing DfR dataset directories.

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

mode : str

If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.

Returns:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> from tethne.readers import dfr
>>> ngrams = dfr.ngrams_from_dir("/Path/to/datadir", 'uni')
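
With mode='light' the same call returns a GramGenerator instead, which avoids holding every dataset in memory at once; a sketch with a placeholder path:

>>> light = dfr.ngrams_from_dir("/Path/to/datadir", 'uni', mode='light')
>>> for doi, grams in light.iteritems():  # one document at a time
...     pass
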
tethne.readers.dfr.read(datapath)[source]

Yields Paper objects from a JSTOR DfR package.

Each Paper is tagged with an accession id for this read/conversion.

Parameters:

datapath : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

Returns:

papers : list

A list of Paper objects.

Examples

>>> from tethne.readers import dfr
>>> papers = dfr.read("/Path/to/DfR")
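
read() and ngrams() operate on the same unzipped dataset folder, and ngrams() keys its results by DOI, so metadata and N-gram counts for one package can be loaded side by side; a minimal sketch:

>>> papers = dfr.read("/Path/to/DfR")
>>> unigrams = dfr.ngrams("/Path/to/DfR", N='uni')
>>> len(papers), len(unigrams)  # a list of Paper objects; a dict keyed by DOI
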
tethne.readers.dfr.read_corpus(path, features=None, exclude=None, **kwargs)[source]

Generate a Corpus from a single DfR dataset.

If features is provided (see below), will also load ngrams.

Parameters:

path : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

features : list

List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from dataset.

exclude : list

Stoplist for feature-grams.

**kwargs

Use this to pass kwargs to ngrams().

Returns:

Corpus

Examples

>>> from nltk.corpus import stopwords    # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> MyCorpus = dfr.read_corpus("/Path/to/DfR", ['uni'], stoplist)
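
Keyword arguments are likewise forwarded to ngrams(). Because a GramGenerator is designed to pass as an ngrams dict, mode='light' may also work here to reduce memory use on large datasets; a hedged sketch:

>>> MyCorpus = dfr.read_corpus("/Path/to/DfR", ['uni', 'bi'], stoplist,
...                            mode='light')
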
tethne.readers.dfr.tokenize(ngrams, min_tf=2, min_df=2, min_len=3, apply_stoplist=False)[source]

Builds a vocabulary, and replaces words with vocab indices.

Parameters:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

min_tf : int

Minimum term frequency; grams occurring fewer than min_tf times in total are excluded.

min_df : int

Minimum document frequency; grams appearing in fewer than min_df documents are excluded.

min_len : int

Minimum gram length, in characters; shorter grams are excluded.

apply_stoplist : bool

If True, will exclude all N-grams that contain words in the NLTK stoplist.

Returns:

t_ngrams : dict

Tokenized ngrams, as doi:{i:count}.

vocab : dict

Vocabulary as i:term.

token_tf : Counter

Term counts for corpus, as i:count.
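
Examples

A minimal end-to-end sketch, assuming the three documented values are returned together as a tuple:

>>> from tethne.readers import dfr
>>> unigrams = dfr.ngrams("/Path/to/DfR", N='uni')
>>> t_ngrams, vocab, token_tf = dfr.tokenize(unigrams, apply_stoplist=True)
>>> len(vocab)  # number of distinct terms retained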