
tethne.readers package

Submodules

tethne.readers.base module

class tethne.readers.base.BaseParser(path, **kwargs)[source]

Bases: object

Base class for all data parsers. Do not instantiate directly.

new_entry()[source]

Prepare a new data entry.

postprocess_entry()[source]
set_value(tag, value)[source]
class tethne.readers.base.FTParser(*args, **kwargs)[source]

Bases: tethne.readers.base.IterParser

Base parser for field-tagged data files.

end_tag = 'ED'

Signals the end of a data entry.

is_end(tag)[source]
is_eof(tag)[source]
is_start(tag)[source]
next()[source]

Get the next line of data.

Returns:

tag : str

data :

The field value for this line.

open()[source]

Open the data file.

start_tag = 'ST'

Signals the start of a data entry.
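The tag-delimited record structure that FTParser handles can be illustrated with a minimal standalone sketch. This is not Tethne's implementation (the function name and record representation are invented for illustration), but it shows how start and end tags delimit entries in a field-tagged file:

```python
def parse_field_tagged(lines, start_tag='ST', end_tag='ED'):
    """Parse field-tagged records delimited by start/end tags.

    Each line is expected to look like 'TAG value'; lines between a
    start tag and an end tag belong to a single record.
    """
    entries, current = [], None
    for line in lines:
        tag, _, value = line.partition(' ')
        if tag == start_tag:
            current = {start_tag: value}   # begin a new entry
        elif tag == end_tag:
            if current is not None:
                entries.append(current)    # entry complete
            current = None
        elif current is not None:
            current.setdefault(tag, []).append(value)
    return entries
```

Subclasses such as WoSParser override start_tag and end_tag ('PT' and 'ER') to match a specific file format.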

class tethne.readers.base.IterParser(*args, **kwargs)[source]

Bases: tethne.readers.base.BaseParser

concat_fields = []

Multi-line fields here should be concatenated, rather than represented as lists.

entry_class

Model for data entry.

alias of dobject

handle(tag, data)[source]

Process a single line of data, and store the result.

Parameters:

tag : str

data :

The field value to be processed and stored.

parse(parse_only=None)[source]
start()[source]

Find the first data entry and prepare to parse.

tags = {}
class tethne.readers.base.RDFParser(path, **kwargs)[source]

Bases: tethne.readers.base.BaseParser

concat_fields = []
entry_elements = ['Document']
handle(tag, data)[source]
meta_elements = []
next()[source]
open()[source]
parse()[source]
class tethne.readers.base.XMLParser(*args, **kwargs)[source]

Bases: tethne.readers.base.IterParser

entry_class

alias of dobject

entry_element = 'article'
is_end(tag)[source]
is_eof(tag)[source]
is_start(tag)[source]
new_entry()[source]

Prepare a new data entry.

next(child)[source]
open()[source]
parse(parse_only=None)[source]
start()[source]
class tethne.readers.base.dobject[source]

Bases: object

tethne.readers.dfr module

Methods for parsing JSTOR Data-for-Research datasets.

class tethne.readers.dfr.DfRParser(*args, **kwargs)[source]

Bases: tethne.readers.base.XMLParser

entry_class

alias of Paper

handle_author(value)[source]
handle_journaltitle(value)[source]
handle_pubdate(value)[source]
handle_title(value)[source]
handle_unicode(value)[source]
open()[source]
postprocess_authors_full(entry)[source]
tags = {'journaltitle': 'journal', 'type': 'documentType', 'pubdate': 'date', 'author': 'authors_full'}
class tethne.readers.dfr.GramGenerator(path, elem, values=False, keys=False, ignore_hash=True)[source]

Bases: object

Yields N-gram data from on-disk dataset, to make loading big datasets a bit more memory-friendly.

Reusable, in the sense that items(), keys(), and values() all return new GramGenerator instances with the same path. This allows a GramGenerator to sneakily pass as an ngrams dict in most practical situations.

items()[source]

Returns a GramGenerator that produces key,value tuples.

keys()[source]

Returns a GramGenerator that produces only keys.

next()[source]
values()[source]

Returns a GramGenerator that produces only values.
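The "reusable generator" pattern described above can be sketched in a few lines. ReusableGrams is a hypothetical stand-in for GramGenerator, with an in-memory list in place of the on-disk dataset:

```python
class ReusableGrams:
    """Sketch of the GramGenerator pattern: each view method returns a
    fresh instance over the same source, so one object can stand in
    for a dict in most iteration contexts without exhausting itself."""

    def __init__(self, data, mode='items'):
        self._data = data    # stands in for the on-disk dataset path
        self._mode = mode

    def items(self):
        return ReusableGrams(self._data, 'items')

    def keys(self):
        return ReusableGrams(self._data, 'keys')

    def values(self):
        return ReusableGrams(self._data, 'values')

    def __iter__(self):
        for key, value in self._data:
            if self._mode == 'keys':
                yield key
            elif self._mode == 'values':
                yield value
            else:
                yield key, value
```

Because each view method returns a fresh instance, iterating one view never exhausts the others.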

tethne.readers.dfr.ngrams(path, elem, ignore_hash=True)[source]

Yields N-grams from a JSTOR DfR dataset.

Parameters:

path : string

Path to unzipped JSTOR DfR folder containing N-grams.

elem : string

Name of subdirectory containing N-grams. (e.g. ‘bigrams’).

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

Returns:

ngrams : FeatureSet

tethne.readers.dfr.read(path, corpus=True, index_by='doi', load_ngrams=True, parse_only=None, corpus_class=<class 'tethne.classes.corpus.Corpus'>, **kwargs)[source]

Yields Paper s from JSTOR DfR package.

Each Paper is tagged with an accession id for this read/conversion.

Parameters:

path : string

Path to an unzipped JSTOR DfR folder containing a citations.xml file.

Returns:

papers : list

A list of Paper objects.

Examples

>>> from tethne.readers import dfr
>>> papers = dfr.read("/Path/to/DfR")
tethne.readers.dfr.streaming_read(path, corpus=True, index_by='doi', parse_only=None, **kwargs)[source]
tethne.readers.dfr.tokenize(ngrams, min_tf=2, min_df=2, min_len=3, apply_stoplist=False)[source]

Builds a vocabulary and replaces words with vocabulary indices.

Parameters:

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

apply_stoplist : bool

If True, will exclude all N-grams that contain words in the NLTK stoplist.

Returns:

t_ngrams : dict

Tokenized ngrams, as doi:{i:count}.

vocab : dict

Vocabulary as i:term.

token_tf : Counter

Term counts for corpus, as i:count.
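The core transformation that tokenize describes can be sketched without the frequency filters or the stoplist. simple_tokenize is a hypothetical, simplified version, not Tethne's implementation:

```python
from collections import Counter

def simple_tokenize(ngrams):
    """Build a vocabulary and replace N-grams with integer indices.

    `ngrams` maps DOI -> [(ngram, frequency), ...]. Returns tokenized
    ngrams as doi -> {index: count}, the vocabulary as index -> term,
    and corpus-wide term counts as a Counter of index -> count.
    """
    index, vocab = {}, {}
    token_tf = Counter()
    t_ngrams = {}
    for doi, grams in ngrams.items():
        t_ngrams[doi] = {}
        for gram, freq in grams:
            if gram not in index:            # first sighting: assign index
                index[gram] = len(index)
                vocab[index[gram]] = gram
            i = index[gram]
            t_ngrams[doi][i] = t_ngrams[doi].get(i, 0) + freq
            token_tf[i] += freq              # corpus-wide tally
    return t_ngrams, vocab, token_tf
```

The real tokenize additionally drops N-grams below the min_tf, min_df, and min_len thresholds and, optionally, those containing stoplisted words.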

tethne.readers.dspace module

Reader for the DuraSpace API.

This module is not yet implemented. Earlier versions of Tethne had a crude method for reading data from an old REST API. The current version of DuraSpace has a very different API, and we ought to write a much better client.

Stay tuned!

tethne.readers.scopus module

Reader for Scopus bibliographic database.

Earlier versions of Tethne had limited support for Scopus. As of 0.8, we’ve moved away from Scopus due to (a) lack of access, and (b) lack of time. If you’re interested in using Scopus data in Tethne, please consider contributing to the project.

tethne.readers.wos module

Parser for Web of Science field-tagged bibliographic data.

Tethne parsers Web of Science field-tagged data into a set of Papers, which are then encapsulated in a Corpus. The WoSParser can be instantiated directly, or you can simply use read() to parse a single file or a directory containing several data files.

>>> from tethne.readers import wos
>>> corpus = wos.read("/path/to/some/wos/data")
>>> corpus
<tethne.classes.corpus.Corpus object at 0x10057c2d0>
class tethne.readers.wos.WoSParser(*args, **kwargs)[source]

Bases: tethne.readers.base.FTParser

Parser for Web of Science field-tagged data.

>>> from tethne.readers.wos import WoSParser
>>> parser = WoSParser("/path/to/download.txt")
>>> papers = parser.read()
concat_fields = ['abstract', 'keywords', 'funding', 'title', 'references', 'journal']

Fields that span multiple lines and should be concatenated into a single value.

end_tag = 'ER'

Field-tag used to mark the end of a record.

entry_class

The class that should be used to represent a single bibliographic record. This can be changed to support more sophisticated data models.

alias of Paper

handle_AF(value)[source]
handle_AU(value)[source]
handle_CR(value)[source]

Parses cited references.

handle_PY(value)[source]

WoS publication years are cast to integers.

handle_TI(value)[source]

Convert article titles to Title Case.

handle_VL(value)[source]

Volume should be a unicode string, even if it looks like an integer.

parse_author(value)[source]

Attempts to split an author name into last and first parts.
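WoS author fields generally look like "Lastname, Initials" or "Lastname, Firstname". A minimal sketch of such a split (split_author is a hypothetical helper, not Tethne's parse_author):

```python
def split_author(value):
    """Split a 'Last, First' style author string into (last, first).

    Falls back to treating the whole string as the last name when no
    comma is present.
    """
    last, _, first = value.partition(',')
    return last.strip(), first.strip()
```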

postprocess_WC(entry)[source]

Parse WC keywords.

Subject keywords are usually semicolon-delimited.

postprocess_authorKeywords(entry)[source]

Parse author keywords.

Author keywords are usually semicolon-delimited.

postprocess_authors_full(entry)[source]

If only a single author was found, ensure that authors_full is nonetheless a list.

postprocess_authors_init(entry)[source]

If only a single author was found, ensure that authors_init is nonetheless a list.

postprocess_citedReferences(entry)[source]

If only a single cited reference was found, ensure that citedReferences is nonetheless a list.

postprocess_funding(entry)[source]

Separates funding agency from grant numbers.

postprocess_keywordsPlus(entry)[source]

Parse WoS “Keyword Plus” keywords.

Keyword Plus keywords are usually semicolon-delimited.

postprocess_subject(entry)[source]

Parse subject keywords.

Subject keywords are usually semicolon-delimited.
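Each of the keyword postprocessors above splits a semicolon-delimited string into a list. A minimal sketch of that operation (split_keywords is a hypothetical helper, not Tethne's code):

```python
def split_keywords(raw):
    """Split a semicolon-delimited keyword string into a clean list,
    dropping surrounding whitespace and empty entries."""
    return [kw.strip() for kw in raw.split(';') if kw.strip()]
```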

start_tag = 'PT'

Field-tag used to mark the start of a record.

tags = {'EM': 'emailAddress', 'CL': 'conferenceLocation', 'AB': 'abstract', 'FU': 'funding', 'AF': 'authors_full', 'ED': 'editors', 'IS': 'issue', 'DE': 'keywordsPlus', 'VL': 'volume', 'CY': 'conferenceDate', 'AU': 'authors_init', 'HO': 'conferenceHost', 'BS': 'bookSeriesSubtitle', 'UT': 'wosid', 'CR': 'citedReferences', 'DT': 'documentType', 'SP': 'conferenceSponsors', 'BN': 'ISSN', 'ID': 'authorKeywords', 'CT': 'conferenceTitle', 'PU': 'publisher', 'PI': 'publisherCity', 'RP': 'reprintAddress', 'PA': 'publisherAddress', 'LA': 'language', 'TC': 'timesCited', 'PY': 'date', 'EP': 'pageEnd', 'DI': 'doi', 'SO': 'journal', 'SN': 'ISSN', 'TI': 'title', 'SC': 'subject', 'BP': 'pageStart', 'C1': 'authorAddress', 'NR': 'citationCount', 'CA': 'groupAuthors', 'SE': 'bookSeriesTitle', 'JI': 'isoSource'}

Maps field-tags onto field names.

tethne.readers.wos.corpus_from_dir(path, **kwargs)[source]
tethne.readers.wos.from_dir(path, corpus=True, **kwargs)[source]
tethne.readers.wos.read(path, corpus=True, index_by='wosid', streaming=False, parse_only=None, corpus_class=<class 'tethne.classes.corpus.Corpus'>, **kwargs)[source]

Parse one or more WoS field-tagged data files.

Parameters:

path : str

Path to WoS field-tagged data. Can be a path directly to a single data file, or to a directory containing several data files.

corpus : bool

If True (default), returns a Corpus. If False, will return only a list of Papers.

Returns:

Corpus or Paper

Examples

>>> from tethne.readers import wos
>>> corpus = wos.read("/path/to/some/wos/data")
>>> corpus
<tethne.classes.corpus.Corpus object at 0x10057c2d0>
tethne.readers.wos.read_corpus(path, **kwargs)[source]

Danger

read_corpus is deprecated in v0.8; use read() instead.

tethne.readers.wos.streaming_read(path, corpus=True, index_by='wosid', parse_only=None, **kwargs)[source]

tethne.readers.zotero module

class tethne.readers.zotero.ZoteroParser(path, **kwargs)[source]

Bases: tethne.readers.base.RDFParser

Reads Zotero RDF files.

entry_class

alias of Paper

entry_elements = ['bib:Illustration', 'bib:Recording', 'bib:Legislation', 'bib:Document', 'bib:BookSection', 'bib:Book', 'bib:Data', 'bib:Letter', 'bib:Report', 'bib:Article', 'bib:Manuscript', 'bib:Image', 'bib:ConferenceProceedings', 'bib:Thesis']
handle_abstract(value)[source]

Abstract handler.

Parameters:

value

Returns:

abstract.toPython()

RDF literals are cast to their corresponding Python data types.

handle_author(value)[source]
handle_authors_full(value)[source]
handle_date(value)[source]

Attempts to coerce the date to ISO 8601.
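Date coercion of this kind can be sketched by trying a few common formats in order. to_iso8601 is a hypothetical illustration, not ZoteroParser's actual logic:

```python
from datetime import datetime

def to_iso8601(value):
    """Try a few common date formats and return an ISO 8601 string.

    Returns the original value unchanged when no format matches.
    """
    for fmt in ('%Y-%m-%d', '%B %d, %Y', '%B %Y', '%Y'):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value
```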

handle_documentType(value)[source]
Parameters:

value

Returns:

value.toPython()

RDF literals are cast to their corresponding Python data types.

handle_identifier(value)[source]
handle_isPartOf(value)[source]

rdf:link rdf:resource points to the resource described by a record.

handle_pages(value)[source]
handle_title(value)[source]

Title handler.

Parameters:

value

Returns:

title.toPython()
meta_elements = [('date', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/date')), ('identifier', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/identifier')), ('abstract', rdflib.term.URIRef(u'http://purl.org/dc/terms/abstract')), ('authors_full', rdflib.term.URIRef(u'http://purl.org/net/biblio#authors')), ('link', rdflib.term.URIRef(u'http://purl.org/rss/1.0/modules/link/link')), ('title', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/title')), ('isPartOf', rdflib.term.URIRef(u'http://purl.org/dc/terms/isPartOf')), ('pages', rdflib.term.URIRef(u'http://purl.org/net/biblio#pages')), ('documentType', rdflib.term.URIRef(u'http://www.zotero.org/namespaces/export#itemType'))]
open()[source]

Fixes RDF validation issues. Zotero incorrectly uses rdf:resource as a child element for Attribute; rdf:resource should instead be used as an attribute of link:link.

Attempt to load full-text content from resource.

postprocess_pages(entry)[source]
tags = {'isPartOf': 'journal'}
tethne.readers.zotero.extract_pdf(fpath)[source]

Extracts structured text content from a PDF at fpath.

Parameters:

fpath : str

Path to the PDF.

Returns:

StructuredFeature

A StructuredFeature that contains page and sentence contexts.

tethne.readers.zotero.extract_text(fpath)[source]

Extracts structured text content from a plain-text file at fpath.

Parameters:

fpath : str

Path to the text file.

Returns:

StructuredFeature

A StructuredFeature that contains sentence context.

tethne.readers.zotero.read(path, corpus=True, index_by='uri', follow_links=False, **kwargs)[source]

Read bibliographic data from Zotero RDF.

Parameters:

path : str

Path to the output directory created by Zotero. Expected to contain a file called [directory_name].rdf.

corpus : bool

(default: True) If True, returns a Corpus. Otherwise, returns a list of Papers.

index_by : str

(default: 'uri') Paper attribute name to use as the primary indexing field. If the field is missing on a Paper, a unique identifier will be generated based on the title and author names.

follow_links : bool

If True, attempts to load full-text content from attached files (e.g. PDFs with embedded text). Default: False.

kwargs : kwargs

Passed to the Corpus constructor.

Returns:

corpus : Corpus

Examples

Assuming that the Zotero collection was exported to the directory /my/working/dir with the name myCollection, a subdirectory should have been created at /my/working/dir/myCollection, and an RDF file should exist at /my/working/dir/myCollection/myCollection.rdf.

>>> from tethne.readers.zotero import read
>>> myCorpus = read('/my/working/dir/myCollection')
>>> myCorpus
<tethne.classes.corpus.Corpus object at 0x10047e350>

Module contents

Methods for parsing bibliographic datasets.

merge(corpus_1, corpus_2[, match_by, ...]) Combines two Corpus instances.
dfr Methods for parsing JSTOR Data-for-Research datasets.
wos Parser for Web of Science field-tagged bibliographic data.
zotero Reads Zotero RDF files.
scopus Reader for Scopus bibliographic database.

Each module in tethne.readers provides a read function that yields a Corpus instance.

exception tethne.readers.DataError(value)[source]

Bases: exceptions.Exception

tethne.readers.merge(corpus_1, corpus_2, match_by=['ayjid'], match_threshold=1.0, index_by='ayjid')[source]

Combines two Corpus instances.

The default behavior is to match Papers using the fields in match_by. If several fields are specified, match_threshold can be used to control how well two Papers must match to be combined.

Alternatively, match_by can be a callable object that accepts two Paper instances, and returns bool. This allows for more complex evaluations.

Where two matched Papers have values for the same field, values from the Paper instance in corpus_1 will always be preferred.

Parameters:

corpus_1 : Corpus

Values from this Corpus will always be preferred in cases of conflict.

corpus_2 : Corpus

match_by : list or callable

Either a list of fields used to evaluate whether or not two Papers should be combined, OR a callable that accepts two Paper instances and returns bool.

match_threshold : float

If match_by is a list containing more than one field, specifies the proportion of fields that must match for two Paper instances to be combined.

index_by : str

The field to use as the primary indexing field in the new Corpus. Default is ayjid, since this is virtually always available.

Returns:

combined : Corpus

Examples

>>> from tethne.readers import wos, dfr, merge
>>> wos_corpus = wos.read("/Path/to/data1.txt")
>>> dfr_corpus = dfr.read("/Path/to/DfR")
>>> corpus = merge(wos_corpus, dfr_corpus)
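The field-matching rule described above (a callable, or the proportion of matching fields compared against match_threshold) can be sketched as follows. should_merge is a hypothetical helper that operates on plain dicts rather than Paper instances:

```python
def should_merge(paper_1, paper_2, match_by, match_threshold=1.0):
    """Decide whether two paper records match.

    `match_by` is either a callable taking both records and returning
    bool, or a list of field names; in the list case, the proportion
    of fields with equal values must meet `match_threshold`.
    """
    if callable(match_by):
        return match_by(paper_1, paper_2)
    matched = sum(1 for field in match_by
                  if paper_1.get(field) is not None
                  and paper_1.get(field) == paper_2.get(field))
    return matched / len(match_by) >= match_threshold
```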