Parsing Bibliographic Metadata
Note
For instructions on acquiring bibliographic data from several sources, see Getting Bibliographic Metadata.
Tethne provides several parsing modules, located in tethne.readers. The
recommended pattern for parsing data is to import the parsing module
corresponding to your data type, and use its read function to parse your
data. For example:
>>> from tethne.readers import wos
>>> myCorpus = wos.read('/path/to/my/data.txt')
By default, read will return a Corpus object.
>>> myCorpus
<tethne.classes.corpus.Corpus object at 0x1046aa7d0>
A Corpus is a collection of Papers that can be indexed in
a variety of ways. A Corpus behaves like a list of Papers:
>>> len(myCorpus) # How many Papers do I have?
500
>>> myCorpus[0] # Returns the first Paper.
<tethne.classes.paper.Paper at 0x10bcde290>
>>> myCorpus[-1] # Returns the last Paper.
<tethne.classes.paper.Paper at 0x103911f50>
Depending on which module you use, read will make assumptions about which
field to use as the primary index for the Papers in your dataset.
The default for Web of Science data, for example, is 'wosid' (the value of
the UT field-tag).
>>> myCorpus.index_by
'wosid'
If you’d prefer to index by a different field, you can pass the index_by
parameter.
>>> myOtherCorpus = wos.read('/path/to/my/data.txt', index_by='doi')
>>> myOtherCorpus.index_by
'doi'
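To make the idea of a primary index concrete, here is a minimal sketch that does not use Tethne at all. Plain dictionaries stand in for Papers, the identifier values are made up, and build_index is a hypothetical helper, so treat it as an illustration of what indexing by a field means rather than a description of how Corpus works internally.

```python
# Illustrative only: plain dicts stand in for Paper objects, and the
# identifier values below are invented for this example.
records = [
    {'wosid': 'WOS:000000000000001', 'doi': '10.1000/example.1', 'title': 'First'},
    {'wosid': 'WOS:000000000000002', 'doi': '10.1000/example.2', 'title': 'Second'},
]


def build_index(records, index_by):
    """Map the value of the ``index_by`` field to the record that holds it."""
    index = {}
    for record in records:
        key = record.get(index_by)
        if key is None:
            # A record that lacks the chosen field cannot be looked up by it.
            continue
        index[key] = record
    return index


by_wosid = build_index(records, 'wosid')
by_doi = build_index(records, 'doi')
```

The same records can be reached through either index; only the lookup key changes.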
New in 0.8: Streaming
With large collections of metadata, even just tens of thousands of records,
memory consumption can grow quickly. In Tethne 0.8, you can “stream”
your corpus by passing streaming=True to read() (WoS and DfR only).
Rather than hold all of your metadata in memory, Tethne will cache the metadata
on disk (look for a hidden folder called .tethne in your current working
directory), and then load only those pieces of metadata that you need later on.
This costs some performance if you iterate over all of your records, but may be a suitable trade-off if memory is limited.
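The trade-off between memory and speed can be illustrated with a small disk-backed container. This is only a sketch of the general technique using the standard library: DiskBackedRecords is a hypothetical class, the pickle-per-record layout is an assumption, and none of it reflects how Tethne actually structures its .tethne cache.

```python
import os
import pickle
import tempfile


class DiskBackedRecords(object):
    """Keep each record in its own pickle file; load one only on request."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.count = 0

    def _path(self, position):
        return os.path.join(self.cache_dir, '%d.pickle' % position)

    def append(self, record):
        # Write the record to disk immediately instead of holding it in memory.
        with open(self._path(self.count), 'wb') as handle:
            pickle.dump(record, handle)
        self.count += 1

    def __len__(self):
        return self.count

    def __getitem__(self, position):
        if position < 0:
            position += self.count
        if not 0 <= position < self.count:
            raise IndexError(position)
        # Every access reads from disk, which is the performance cost.
        with open(self._path(position), 'rb') as handle:
            return pickle.load(handle)


cache_dir = tempfile.mkdtemp(prefix='records_cache_')
store = DiskBackedRecords(cache_dir)
for year in (1965, 1970, 1982):
    store.append({'date': year})
```

Only one record is in memory at a time, but iterating over the whole store opens one file per record.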
New in 0.8: parse_only
An alternative (or complementary) approach to streaming is to parse only those
specific fields that you need for your analysis. You can now pass a list of
field names to read() (WoS and DfR only) using the parameter parse_only,
and Tethne will parse only those fields (plus the indexing field). For example:
>>> from tethne.readers import dfr
>>> corpus = dfr.read('/path/to/data', parse_only=['title', 'date'])
>>> corpus[0].__dict__
{'date': 1965, 'doi': '10.2307/4108217', 'title': 'Plant Mutations'}
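The effect of parse_only can be pictured as a filter applied to each record that always keeps the indexing field. The sketch below is an assumption about the behaviour described above, not Tethne's implementation: keep_fields is a hypothetical helper, and the 'journal' value is invented to show a field being dropped.

```python
def keep_fields(raw_record, parse_only=None, index_by='doi'):
    """Return a copy of ``raw_record`` restricted to the requested fields.

    The indexing field is always kept, so the record can still be looked up.
    """
    if parse_only is None:
        # No restriction requested: keep everything.
        return dict(raw_record)
    wanted = set(parse_only)
    wanted.add(index_by)
    return {name: value for name, value in raw_record.items() if name in wanted}


raw = {
    'doi': '10.2307/4108217',
    'title': 'Plant Mutations',
    'date': 1965,
    'journal': 'Hypothetical Journal',  # invented field, dropped by the filter
}
slim = keep_fields(raw, parse_only=['title', 'date'])
```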
Module-specific details
The following sections describe specific behaviors of each of the parsing modules.
Web of Science
To parse a Web of Science field-tagged file, or a collection of field-tagged
files, use the tethne.readers.wos.read() function.
To parse a single file, provide the path to that data file. For example:
>>> from tethne.readers import wos
>>> corpus = wos.read('/path/to/my/data.txt')
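If you have not looked inside a field-tagged file before, the sketch below shows roughly what a parser has to deal with: two-letter tags at the start of a line, indented continuation lines, ER closing a record, and EF closing the file. It is a simplified illustration with made-up sample data; parse_field_tagged is a hypothetical function, and the real wos reader handles many more details.

```python
def parse_field_tagged(lines):
    """Collect the values for each two-letter tag, one dict per record."""
    records, current, tag = [], {}, None
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank separator lines
        head, value = line[:2], line[3:]
        if head == 'ER':        # end of record
            records.append(current)
            current, tag = {}, None
        elif head == 'EF':      # end of file
            break
        elif head in ('FN', 'VR'):
            continue            # file-level header lines, not record data
        elif head.strip():      # a new field tag such as AU, TI, or UT
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag is not None:   # indented continuation of the previous tag
            current[tag].append(line.strip())
    return records


sample = """FN Example Header
VR 1.0
PT J
AU Doe, J
   Roe, R
TI A hypothetical title
UT WOS:000000000000001
ER

EF
""".splitlines()

records = parse_field_tagged(sample)
```

The UT value collected here corresponds to the field that the wos reader exposes as 'wosid'.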
Parsing Several WoS Files
Often you’ll be working with datasets composed of multiple data files, since the Web of Science database only allows you to download 500 records at a time. You can use the read function to load all of the Papers from a directory containing multiple data files.
The read function detects that your path is a directory rather than a data file,
and looks inside that directory for WoS data files.
>>> corpus = wos.read('/Path/to/my/wos/data/dir')
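Distinguishing a directory from a single file is straightforward with the standard library. The sketch below shows one way it could be done; find_data_files is a hypothetical helper, and filtering on a '.txt' extension is an assumption made for the example, not necessarily how Tethne recognises WoS files.

```python
import os
import tempfile


def find_data_files(path, extension='.txt'):
    """Return the data files at ``path``: its contents if it is a directory,
    otherwise the path itself."""
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.endswith(extension)
        )
    return [path]


# Build a throwaway directory with two data files and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name in ('part1.txt', 'part2.txt', 'notes.md'):
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write('')

found = find_data_files(demo_dir)
```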
JSTOR Data for Research
The DfR parsing module is tethne.readers.dfr.
>>> from tethne.readers import dfr
The DfR reader works just like the WoS reader. To load a single dataset, provide
the path to the folder created when you unzipped your dataset download (it
should contain a file called citations.xml).
>>> corpus = dfr.read('/path/to/my/dfr', features=['uni'])
Whereas Corpora generated from WoS datasets are indexed by wosid by default,
Corpora generated from DfR datasets are indexed by doi.
>>> corpus.indexed_papers.keys()[0:10] # The first 10 DOIs.
['10.2307/2418718',
'10.2307/2258178',
'10.2307/3241549',
'10.2307/2416998',
'10.2307/20000814',
'10.2307/2428935',
'10.2307/2418714',
'10.2307/1729159',
'10.2307/2407516',
'10.2307/2816048']
But unlike WoS datasets, DfR datasets can contain wordcounts and N-grams in
addition to bibliographic data. read will find these extra data about your
bibliographic records, and load them as tethne.classes.feature.FeatureSet
instances.
>>> corpus.features
{'authors': <tethne.classes.feature.FeatureSet at 0x100434e90>,
'citations': <tethne.classes.feature.FeatureSet at 0x10041b990>,
'wordcounts': <tethne.classes.feature.FeatureSet at 0x107688750>}
Parsing Several DfR Files
Just like the WoS parser, the DfR read function can load several datasets
at once. Instead of providing a path to a single dataset, provide a path to a
directory containing several datasets. read will look for DfR datasets, and
load them all into a single Corpus.
>>> corpus = dfr.read('/path/to/many/datasets')
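Since each unzipped DfR dataset contains a citations.xml file, one plausible way to locate datasets under a parent directory is to search for that file. The sketch below does this with os.walk; find_dfr_datasets is a hypothetical helper, and whether Tethne searches in this way is an assumption.

```python
import os
import tempfile


def find_dfr_datasets(root):
    """Return every directory under ``root`` that contains citations.xml."""
    return sorted(
        dirpath
        for dirpath, _dirnames, filenames in os.walk(root)
        if 'citations.xml' in filenames
    )


# Build a throwaway tree: two dataset folders and one unrelated folder.
root = tempfile.mkdtemp()
for name in ('dataset_a', 'dataset_b'):
    os.mkdir(os.path.join(root, name))
    with open(os.path.join(root, name, 'citations.xml'), 'w') as handle:
        handle.write('<citations/>')
os.mkdir(os.path.join(root, 'unrelated'))

datasets = find_dfr_datasets(root)
```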
Zotero RDF
Note
In previous versions, zotero.read() required the path to the
directory created by Zotero on export. As of 0.8, the preferred
approach is to pass the full path to the RDF document. The old
behavior should also still work.
The Zotero parsing module is tethne.readers.zotero.
>>> from tethne.readers import zotero
The Zotero reader works just like the WoS and DfR readers. To load a single dataset, provide the path to the RDF file.
>>> corpus = zotero.read('/path/to/my/rdf/export/export.rdf')
Since RDF relies on Uniform Resource Identifiers (URIs), the default
indexing field for Zotero datasets is uri.
>>> corpus.indexed_papers.items()[0:5] # The first 5 (URI, Paper) pairs.
[('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3527233/',
<tethne.classes.paper.Paper at 0x10976dcd0>),
('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1513266/',
<tethne.classes.paper.Paper at 0x109dbf050>),
('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211313/',
<tethne.classes.paper.Paper at 0x109712bd0>),
('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2886068/',
<tethne.classes.paper.Paper at 0x1095dc9d0>),
('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1914331/',
<tethne.classes.paper.Paper at 0x1095dc5d0>)]
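To see where these URIs come from, it can help to look at the rdf:about attribute that RDF/XML attaches to each described resource. The sketch below reads those attributes with the standard library from a tiny made-up document. The sample content and element names are assumptions for illustration, and the zotero reader itself does considerably more than this.

```python
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# A made-up RDF/XML fragment; a real Zotero export is much richer.
sample_rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bib="http://purl.org/net/biblio#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <bib:Article rdf:about="http://example.org/articles/1">
    <dc:title>A hypothetical article</dc:title>
  </bib:Article>
  <bib:Article rdf:about="http://example.org/articles/2">
    <dc:title>Another hypothetical article</dc:title>
  </bib:Article>
</rdf:RDF>"""

root = ET.fromstring(sample_rdf)

# ElementTree spells namespaced attributes as '{namespace}name'.
about = '{%s}about' % RDF_NS
uris = [child.get(about) for child in root if child.get(about)]
```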