tethne.readers package

Submodules

tethne.readers.base module
class tethne.readers.base.BaseParser(path, **kwargs)
    Bases: object

    Base class for all data parsers. Do not instantiate directly.
class tethne.readers.base.FTParser(*args, **kwargs)
    Bases: tethne.readers.base.IterParser

    Base parser for field-tagged data files.

    end_tag = 'ED'
        Signals the end of a data entry.

    start_tag = 'ST'
        Signals the start of a data entry.
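Field-tagged data consists of lines beginning with a short tag, with one record delimited by a start tag and an end tag ('ST'/'ED' are FTParser's defaults; WoSParser overrides them with 'PT'/'ER'). The following is an illustrative sketch of that parsing loop, not Tethne's implementation:

```python
# A minimal sketch of field-tagged parsing (illustrative, not Tethne's code).
def parse_field_tagged(lines, start_tag='ST', end_tag='ED'):
    """Group field-tagged lines into records (one dict per record).

    Each line looks like 'TG value'; an indented line (no tag) continues
    the previous field, which is why every field is stored as a list.
    """
    records, current, last_tag = [], None, None
    for line in lines:
        tag, _, value = line.partition(' ')
        if tag == start_tag:                       # open a new record
            current, last_tag = {start_tag: [value.strip()]}, start_tag
        elif current is None:                      # ignore text between records
            continue
        elif tag == end_tag:                       # close the current record
            records.append(current)
            current = None
        elif tag.strip():                          # a new field
            current.setdefault(tag, []).append(value.strip())
            last_tag = tag
        else:                                      # continuation line
            current[last_tag].append(value.strip())
    return records

raw = ["ST J",
       "AU Smith, J.",
       "TI A study of",
       "   field-tagged data",
       "ED"]
records = parse_field_tagged(raw)
```

Multi-line fields come back as lists here; in Tethne, fields named in concat_fields are instead concatenated into a single value.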
class tethne.readers.base.IterParser(*args, **kwargs)
    Bases: tethne.readers.base.BaseParser

    concat_fields = []
        Multi-line fields listed here are concatenated, rather than represented as lists.

    handle(tag, data)
        Process a single line of data, and store the result.

        Parameters:
            tag : str
            data
class tethne.readers.base.RDFParser(path, **kwargs)
    Bases: tethne.readers.base.BaseParser

    concat_fields = []

    entry_elements = ['Document']

    meta_elements = []
class tethne.readers.base.XMLParser(*args, **kwargs)
    Bases: tethne.readers.base.IterParser

    entry_element = 'article'
tethne.readers.dfr module
Methods for parsing JSTOR Data-for-Research datasets.
class tethne.readers.dfr.DfRParser(*args, **kwargs)
    Bases: tethne.readers.base.XMLParser

    entry_class
        alias of Paper
class tethne.readers.dfr.GramGenerator(path, elem, values=False, keys=False, ignore_hash=True)
    Bases: object

    Yields N-gram data from an on-disk dataset, to make loading big datasets a bit more memory-friendly.

    Reusable, in the sense that items(), keys(), and values() all return new GramGenerator instances with the same path. This allows a GramGenerator to sneakily pass as an ngrams dict in most practical situations.

    items()
        Returns a GramGenerator that produces (key, value) tuples.

    keys()
        Returns a GramGenerator that produces only keys.

    values()
        Returns a GramGenerator that produces only values.
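The "reusable" trick described above can be sketched as follows. This is a simplified in-memory stand-in, not Tethne's code: each view method returns a fresh instance over the same underlying data, so iteration can be restarted as often as needed.

```python
class ReusableGrams:
    """Sketch of a generator-like object that can masquerade as a dict.

    Each call to items()/keys()/values() returns a *new* instance over the
    same underlying data, so the data can be iterated more than once.
    The `data` list stands in for the on-disk dataset at GramGenerator's path.
    """
    def __init__(self, data, mode='items'):
        self._data = data
        self._mode = mode

    def __iter__(self):
        for key, value in self._data:
            if self._mode == 'keys':
                yield key
            elif self._mode == 'values':
                yield value
            else:
                yield key, value

    def items(self):
        return ReusableGrams(self._data, 'items')

    def keys(self):
        return ReusableGrams(self._data, 'keys')

    def values(self):
        return ReusableGrams(self._data, 'values')

grams = ReusableGrams([('doi/1', [('word', 3)]), ('doi/2', [('gram', 1)])])
```

Unlike a plain generator, `list(grams.keys())` here yields the same result every time it is called, which is what lets such an object pass as an ngrams dict.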
tethne.readers.dfr.ngrams(path, elem, ignore_hash=True)
    Yields N-grams from a JSTOR DfR dataset.

    Parameters:
        path : string
            Path to the unzipped JSTOR DfR folder containing N-grams.
        elem : string
            Name of the subdirectory containing N-grams (e.g. 'bigrams').
        ignore_hash : bool
            If True, excludes all N-grams that contain the hash '#' character.

    Returns:
        ngrams : FeatureSet
tethne.readers.dfr.read(path, corpus=True, index_by='doi', load_ngrams=True, parse_only=None, corpus_class=<class 'tethne.classes.corpus.Corpus'>, **kwargs)
    Yields Papers from a JSTOR DfR package.

    Each Paper is tagged with an accession id for this read/conversion.

    Parameters:
        path : string
            Path to the unzipped JSTOR DfR folder containing a citations.xml file.

    Returns:
        papers : list
            A list of Paper objects.

    Examples

    >>> from tethne.readers import dfr
    >>> papers = dfr.read("/Path/to/DfR")
tethne.readers.dfr.streaming_read(path, corpus=True, index_by='doi', parse_only=None, **kwargs)
tethne.readers.dfr.tokenize(ngrams, min_tf=2, min_df=2, min_len=3, apply_stoplist=False)
    Builds a vocabulary, and replaces words with vocabulary indices.

    Parameters:
        ngrams : dict
            Keys are paper DOIs; values are lists of (N-gram, frequency) tuples.
        apply_stoplist : bool
            If True, excludes all N-grams that contain words in the NLTK stoplist.

    Returns:
        t_ngrams : dict
            Tokenized ngrams, as doi:{i:count}.
        vocab : dict
            Vocabulary as i:term.
        token_tf : Counter
            Term counts for the corpus, as i:count.
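A sketch of this kind of tokenization, following the signature and return shapes above (illustrative only, not Tethne's implementation; the apply_stoplist step is omitted since it depends on NLTK):

```python
from collections import Counter

def tokenize(ngrams, min_tf=2, min_df=2, min_len=3):
    """Build a vocabulary and replace terms with integer indices.

    ngrams: dict mapping DOI -> list of (term, frequency) tuples.
    A term is kept only if its total corpus frequency >= min_tf, it
    appears in >= min_df documents, and len(term) >= min_len.
    """
    token_tf, token_df = Counter(), Counter()
    for doi, grams in ngrams.items():
        for term, freq in grams:
            token_tf[term] += freq     # corpus-wide term frequency
            token_df[term] += 1        # document frequency

    keep = [t for t in sorted(token_tf)
            if token_tf[t] >= min_tf and token_df[t] >= min_df
            and len(t) >= min_len]
    vocab = {i: term for i, term in enumerate(keep)}     # i -> term
    index = {term: i for i, term in vocab.items()}       # term -> i

    # Replace surviving terms with vocabulary indices: doi -> {i: count}.
    t_ngrams = {doi: {index[t]: f for t, f in grams if t in index}
                for doi, grams in ngrams.items()}
    return t_ngrams, vocab, Counter({index[t]: token_tf[t] for t in keep})

data = {'doi/1': [('evolution', 3), ('of', 10)],
        'doi/2': [('evolution', 2), ('rare', 1)]}
t_ngrams, vocab, token_tf = tokenize(data)
```

In this toy input, 'of' fails min_df (one document), 'rare' fails min_tf, and only 'evolution' survives into the vocabulary.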
tethne.readers.dspace module
Reader for the DuraSpace API.
This module is not yet implemented. Earlier versions of Tethne had a crude method for reading data from an old REST API. The current version of DuraSpace has a very different API, and we ought to write a much better client.
Stay tuned!
tethne.readers.scopus module
Reader for Scopus bibliographic database.
Earlier versions of Tethne had limited support for Scopus. As of 0.8, we’ve moved away from Scopus due to (a) lack of access, and (b) lack of time. If you’re interested in using Scopus data in Tethne, please consider contributing to the project.
tethne.readers.wos module

Parser for Web of Science field-tagged bibliographic data.

Tethne parses Web of Science field-tagged data into a set of Papers, which are then encapsulated in a Corpus. The WoSParser can be instantiated directly, or you can simply use read() to parse a single file or a directory containing several data files.
>>> from tethne.readers import wos
>>> corpus = wos.read("/path/to/some/wos/data")
>>> corpus
<tethne.classes.corpus.Corpus object at 0x10057c2d0>
class tethne.readers.wos.WoSParser(*args, **kwargs)
    Bases: tethne.readers.base.FTParser

    Parser for Web of Science field-tagged data.

    >>> from tethne.readers.wos import WoSParser
    >>> parser = WoSParser("/path/to/download.txt")
    >>> papers = parser.read()

    concat_fields = ['abstract', 'keywords', 'funding', 'title', 'references', 'journal']
        Fields that span multiple lines and should be concatenated into a single value.

    end_tag = 'ER'
        Field-tag used to mark the end of a record.

    entry_class
        The class that should be used to represent a single bibliographic record. This can be changed to support more sophisticated data models.
        alias of Paper
    parse_author(value)
        Attempts to split an author name into last and first parts.

    postprocess_authorKeywords(entry)
        Parse author keywords. Author keywords are usually semicolon-delimited.

    postprocess_authors_full(entry)
        If only a single author was found, ensure that authors_full is nonetheless a list.

    postprocess_authors_init(entry)
        If only a single author was found, ensure that authors_init is nonetheless a list.
    postprocess_citedReferences(entry)
        If only a single cited reference was found, ensure that citedReferences is nonetheless a list.

    postprocess_keywordsPlus(entry)
        Parse WoS "Keyword Plus" keywords. Keyword Plus keywords are usually semicolon-delimited.

    postprocess_subject(entry)
        Parse subject keywords. Subject keywords are usually semicolon-delimited.

    start_tag = 'PT'
        Field-tag used to mark the start of a record.
    tags
        Maps field-tags onto field names.
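The postprocessing steps above follow two simple patterns: semicolon-delimited fields are split into lists of keywords, and singleton values are coerced into lists. A minimal sketch of those two patterns (illustrative, not Tethne's implementation):

```python
def split_keywords(value):
    """Split a semicolon-delimited keyword string into a clean list."""
    return [kw.strip() for kw in value.split(';') if kw.strip()]

def as_list(value):
    """Ensure a field holds a list, even if only a single value was found."""
    return value if isinstance(value, list) else [value]

keywords = split_keywords('GENETICS; EVOLUTION ; ')
authors = as_list('Smith, J.')
```

Coercing singletons to lists keeps downstream code simple: consumers can always iterate over authors_full, citedReferences, etc., without checking the type first.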
tethne.readers.wos.read(path, corpus=True, index_by='wosid', streaming=False, parse_only=None, corpus_class=<class 'tethne.classes.corpus.Corpus'>, **kwargs)
    Parse one or more WoS field-tagged data files.

    Parameters:
        path : str
            Path to WoS field-tagged data. Can be a path directly to a single data file, or to a directory containing several data files.
        corpus : bool

    Returns:

    Examples

    >>> from tethne.readers import wos
    >>> corpus = wos.read("/path/to/some/wos/data")
    >>> corpus
    <tethne.classes.corpus.Corpus object at 0x10057c2d0>
tethne.readers.zotero module
class tethne.readers.zotero.ZoteroParser(path, **kwargs)
    Bases: tethne.readers.base.RDFParser

    Reads Zotero RDF files.

    entry_class
        alias of Paper

    entry_elements = ['bib:Illustration', 'bib:Recording', 'bib:Legislation', 'bib:Document', 'bib:BookSection', 'bib:Book', 'bib:Data', 'bib:Letter', 'bib:Report', 'bib:Article', 'bib:Manuscript', 'bib:Image', 'bib:ConferenceProceedings', 'bib:Thesis']
    handle_abstract(value)
        Abstract handler.

        Parameters:
            value
        Returns:
            abstract.toPython()

        RDF literals are cast to their corresponding Python data types.

    handle_documentType(value)
        Parameters:
            value
        Returns:
            value.toPython()

        RDF literals are cast to their corresponding Python data types.
    meta_elements = [('date', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/date')), ('identifier', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/identifier')), ('abstract', rdflib.term.URIRef(u'http://purl.org/dc/terms/abstract')), ('authors_full', rdflib.term.URIRef(u'http://purl.org/net/biblio#authors')), ('link', rdflib.term.URIRef(u'http://purl.org/rss/1.0/modules/link/link')), ('title', rdflib.term.URIRef(u'http://purl.org/dc/elements/1.1/title')), ('isPartOf', rdflib.term.URIRef(u'http://purl.org/dc/terms/isPartOf')), ('pages', rdflib.term.URIRef(u'http://purl.org/net/biblio#pages')), ('documentType', rdflib.term.URIRef(u'http://www.zotero.org/namespaces/export#itemType'))]

    open()
        Fixes RDF validation issues. Zotero incorrectly uses rdf:resource as a child element of Attribute; rdf:resource should instead be used as an attribute of link:link.
tethne.readers.zotero.extract_pdf(fpath)
    Extracts structured text content from a PDF at fpath.

    Parameters:
        fpath : str
            Path to the PDF.

    Returns:
        A StructuredFeature that contains page and sentence contexts.
tethne.readers.zotero.extract_text(fpath)
    Extracts structured text content from a plain-text file at fpath.

    Parameters:
        fpath : str
            Path to the text file.

    Returns:
        A StructuredFeature that contains sentence context.
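"Sentence context" here means the text is stored with enough structure to recover sentence boundaries. StructuredFeature is a Tethne class; the sketch below is only a hypothetical stand-in illustrating the idea, representing a document as a flat token list plus the offsets at which each sentence begins:

```python
import re

def sentence_contexts(text):
    """Split plain text into a flat token list plus sentence-start offsets."""
    tokens, starts = [], []
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        if not sentence:
            continue
        starts.append(len(tokens))          # index where this sentence begins
        tokens.extend(re.findall(r"\w+", sentence))
    return tokens, starts

tokens, starts = sentence_contexts("Hello world. This is a test.")
```

With this layout, sentence i spans `tokens[starts[i]:starts[i+1]]`, which is the kind of lookup that contexts make cheap.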
tethne.readers.zotero.read(path, corpus=True, index_by='uri', follow_links=False, **kwargs)
    Read bibliographic data from Zotero RDF.

    Parameters:
        path : str
            Path to the output directory created by Zotero. Expected to contain a file called [directory_name].rdf.
        corpus : bool
        index_by : str
        follow_links : bool
            If True, attempts to load full-text content from attached files (e.g. PDFs with embedded text). Default: False.
        kwargs : kwargs
            Passed to the Corpus constructor.

    Returns:
        corpus : Corpus

    Examples

    Assuming that the Zotero collection was exported to the directory /my/working/dir with the name myCollection, a subdirectory should have been created at /my/working/dir/myCollection, and an RDF file should exist at /my/working/dir/myCollection/myCollection.rdf.

    >>> from tethne.readers.zotero import read
    >>> myCorpus = read('/my/working/dir/myCollection')
    >>> myCorpus
    <tethne.classes.corpus.Corpus object at 0x10047e350>
Module contents

Methods for parsing bibliographic datasets.

merge(corpus_1, corpus_2[, match_by, ...])    Combines two Corpus instances.
dfr                                           Methods for parsing JSTOR Data-for-Research datasets.
wos                                           Parser for Web of Science field-tagged bibliographic data.
zotero                                        Reads Zotero RDF files.
scopus                                        Reader for Scopus bibliographic database.

Each module in tethne.readers provides a read function that yields a Corpus instance.
tethne.readers.merge(corpus_1, corpus_2, match_by=['ayjid'], match_threshold=1.0, index_by='ayjid')
    Combines two Corpus instances.

    The default behavior is to match Papers using the fields in match_by. If several fields are specified, match_threshold can be used to control how well two Papers must match to be combined.

    Alternatively, match_by can be a callable object that accepts two Paper instances, and returns bool. This allows for more complex evaluations.

    Where two matched Papers have values for the same field, values from the Paper instance in corpus_1 will always be preferred.

    Parameters:
        corpus_1 : Corpus
            Values from this Corpus will always be preferred in cases of conflict.
        corpus_2 : Corpus
        match_by : list or callable
        match_threshold : float
            If match_by is a list containing more than one field, specifies the proportion of fields that must match for two Paper instances to be combined.
        index_by : str
            The field to use as the primary indexing field in the new Corpus. Default is 'ayjid', since this is virtually always available.

    Returns:
        combined : Corpus

    Examples

    >>> from tethne.readers import wos, dfr, merge
    >>> wos_corpus = wos.read("/Path/to/data1.txt")
    >>> dfr_corpus = dfr.read("/Path/to/DfR")
    >>> corpus = merge(wos_corpus, dfr_corpus)
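The matching rule above can be sketched as: count how many of the match_by fields agree, and treat two papers as the same when the matching proportion reaches match_threshold. This is an illustrative reconstruction with papers as plain dicts, not Tethne's implementation:

```python
def papers_match(paper_1, paper_2, match_by=['ayjid'], match_threshold=1.0):
    """Return True if enough of the match_by fields agree.

    Papers are represented here as plain dicts. As in merge(), match_by
    may instead be a callable that accepts two papers and returns bool.
    """
    if callable(match_by):
        return match_by(paper_1, paper_2)
    # Count fields that are present in paper_1 and equal in paper_2.
    matches = sum(1 for field in match_by
                  if paper_1.get(field) is not None
                  and paper_1.get(field) == paper_2.get(field))
    return matches / len(match_by) >= match_threshold

p1 = {'ayjid': 'SMITH J 2001 NATURE', 'doi': '10.1/abc'}
p2 = {'ayjid': 'SMITH J 2001 NATURE', 'doi': None}
```

With the defaults, p1 and p2 match on ayjid alone; adding 'doi' to match_by makes the default threshold of 1.0 reject the pair, while lowering match_threshold to 0.5 accepts it again.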