tethne.readers.dfr module
Methods for parsing JSTOR Data-for-Research datasets.
read(datapath) | Yields Paper objects from a JSTOR DfR package.
ngrams(datapath[, N, ignore_hash, mode]) | Yields N-grams from a JSTOR DfR dataset.
read_corpus(path[, features, exclude]) | Generate a Corpus from a single DfR dataset.
from_dir(path) | Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.
ngrams_from_dir(path[, N, ignore_hash, mode]) | Load N-grams from a directory of JSTOR DfR datasets.
corpus_from_dir(path[, features, exclude]) | Generate a Corpus from a directory containing multiple DfR datasets.
- class tethne.readers.dfr.GramGenerator(path, elem, values=False, keys=False, ignore_hash=True)
Bases: object
Yields N-gram data from an on-disk dataset, to make loading big datasets a bit more memory-friendly.
Reusable, in the sense that items(), iteritems(), keys(), and values() all return new GramGenerator instances with the same path. This allows a GramGenerator to sneakily pass as an ngrams dict in most practical situations; see the usage sketch after the method list below.
- items()
Returns a GramGenerator that produces (key, value) tuples.
- iteritems()
Returns a GramGenerator that produces (key, value) tuples.
- keys()
Returns a GramGenerator that produces only keys.
- values()
Returns a GramGenerator that produces only values.
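Because each of these methods returns a fresh GramGenerator over the same path, an instance obtained from ngrams() in 'light' mode can be traversed repeatedly, much like a dict. A minimal sketch, assuming an unzipped DfR dataset at an illustrative path:
>>> from tethne.readers import dfr
>>> grams = dfr.ngrams("/Path/to/DfR", N='uni', mode='light')  # GramGenerator
>>> n_papers = sum(1 for doi in grams.keys())     # keys() yields paper DOIs
>>> for doi, counts in grams.iteritems():         # a fresh generator each call
...     top = max(counts, key=lambda g: g[1]) if counts else None  # most frequent unigram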
- tethne.readers.dfr.corpus_from_dir(path, features=None, exclude=None, **kwargs)
Generate a Corpus from a directory containing multiple DfR datasets.
If features is provided (see below), the corresponding N-grams will also be loaded.
Parameters: path : string
Path to directory containing DfR dataset directories.
features : list
List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from the dataset.
exclude : list
Stoplist for feature-grams.
**kwargs
Use this to pass kwargs to ngrams().
Returns: corpus : Corpus
A Corpus built from the DfR datasets in path.
Examples
>>> from nltk.corpus import stopwords   # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist)
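Since **kwargs are passed through to ngrams(), its options (e.g. ignore_hash, mode) can be set from here as well; a minimal sketch, continuing the example above with illustrative values:
>>> C = dfr.corpus_from_dir('/path/to/DfR/datasets', ['uni'], stoplist,
...                         ignore_hash=False)   # forwarded to ngrams()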
- tethne.readers.dfr.from_dir(path)
Convenience function for generating a list of Paper objects from a directory of JSTOR DfR datasets.
Parameters: path : string
Path to directory containing DfR dataset directories.
Returns: papers : list
A list of Paper objects.
Raises: IOError
Invalid path.
Examples
>>> from tethne.readers import dfr
>>> papers = dfr.from_dir("/Path/to/datadir")
- tethne.readers.dfr.ngrams(datapath, N='uni', ignore_hash=True, mode='heavy')
Yields N-grams from a JSTOR DfR dataset.
Parameters: datapath : string
Path to unzipped JSTOR DfR folder containing N-grams (e.g. ‘bigrams’).
N : string
‘uni’, ‘bi’, ‘tri’, or ‘quad’
ignore_hash : bool
If True, will exclude all N-grams that contain the hash ‘#’ character.
mode : str
If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.
Returns: ngrams : dict
Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.
Examples
>>> from tethne.readers import dfr
>>> trigrams = dfr.ngrams("/Path/to/DfR", N='tri')
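In the default 'heavy' mode the result is a plain dict keyed by DOI, so it can be filtered like any other dict; a minimal sketch, with an illustrative frequency threshold:
>>> trigrams = dfr.ngrams("/Path/to/DfR", N='tri')   # dict: DOI -> [(gram, freq)]
>>> frequent = dict((doi, [(g, f) for g, f in grams if f > 1])
...                 for doi, grams in trigrams.items())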
- tethne.readers.dfr.ngrams_from_dir(path, N='uni', ignore_hash=True, mode='heavy')
Load N-grams from a directory of JSTOR DfR datasets.
Parameters: path : string
Path to directory containing DfR dataset directories.
N : string
‘uni’, ‘bi’, ‘tri’, or ‘quad’
ignore_hash : bool
If True, will exclude all N-grams that contain the hash ‘#’ character.
mode : str
If ‘heavy’ (default), loads all data into memory and returns a dict. If ‘light’, returns a (somewhat) reusable GramGenerator. See GramGenerator for usage.
Returns: ngrams : dict
Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.
Examples
>>> from tethne.readers import dfr
>>> ngrams = dfr.ngrams_from_dir("/Path/to/datadir", 'uni')
- tethne.readers.dfr.read(datapath)
Yields Paper objects from a JSTOR DfR package.
Each Paper is tagged with an accession id for this read/conversion.
Parameters: datapath : string
Path to the unzipped JSTOR DfR folder containing a citations.XML file.
Returns: papers : list
A list of Paper objects.
Examples
>>> from tethne.readers import dfr
>>> papers = dfr.read("/Path/to/DfR")
- tethne.readers.dfr.read_corpus(path, features=None, exclude=None, **kwargs)
Generate a Corpus from a single DfR dataset.
If features is provided (see below), the corresponding N-grams will also be loaded.
Parameters: path : string
Path to the unzipped JSTOR DfR folder containing a citations.XML file.
features : list
List of feature-grams (e.g. ‘uni’, ‘bi’, ‘tri’) to load from the dataset.
exclude : list
Stoplist for feature-grams.
**kwargs
Use this to pass kwargs to ngrams().
Returns: corpus : Corpus
A Corpus built from the DfR dataset.
Examples
>>> from nltk.corpus import stopwords   # Get a stoplist.
>>> stoplist = stopwords.words()
>>> from tethne.readers import dfr
>>> MyCorpus = dfr.read_corpus("/Path/to/DfR", ['uni'], stoplist)
- tethne.readers.dfr.tokenize(ngrams, min_tf=2, min_df=2, min_len=3, apply_stoplist=False)
Builds a vocabulary and replaces words with vocabulary indices.
Parameters: ngrams : dict
Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.
min_tf : int
Minimum term frequency across the corpus; rarer N-grams are excluded.
min_df : int
Minimum document frequency; N-grams occurring in fewer papers are excluded.
min_len : int
Minimum character length for an N-gram.
apply_stoplist : bool
If True, will exclude all N-grams that contain words in the NLTK stoplist.
Returns: t_ngrams : dict
Tokenized ngrams, as doi:{i:count}.
vocab : dict
Vocabulary as i:term.
token_tf : Counter
Term counts for corpus, as i:count.
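A minimal end-to-end sketch chaining ngrams() into tokenize(); the path and thresholds are illustrative, and the three return values are assumed to come back as a tuple in the documented order:
>>> from tethne.readers import dfr
>>> unigrams = dfr.ngrams("/Path/to/DfR", N='uni')
>>> t_ngrams, vocab, token_tf = dfr.tokenize(unigrams, min_tf=2,
...                                          apply_stoplist=True)
>>> terms = [vocab[i] for i in token_tf]   # map indices back to terms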