tethne.classes package
Submodules
tethne.classes.corpus module
A Corpus is a container for Papers.

class tethne.classes.corpus.Corpus(papers=[], index_by=None, index_fields=['authors', 'citations', 'ayjid', 'date'], index_features=['authors', 'citations'], **kwargs)
Bases: object
A Corpus represents a collection of Paper instances.

distribution | Calculates the number of papers in each slice, as defined by slice_kwargs. |
feature_distribution | Calculates the distribution of a feature across slices of the corpus. |
features | Contains FeatureSets for a Corpus instance. |
index | Indexes the Papers in this Corpus instance by the attribute attr. |
index_by | Specifies the field in Papers that should be used as the primary indexing field for a Corpus instance. |
index_feature | Creates a new FeatureSet from the attribute feature_name in each Paper. |
indexed_papers | The primary index for Papers in a Corpus instance. |
indices | Contains field indices for the Papers in a Corpus instance. |
papers | A list of all Papers in the Corpus. |
select | Retrieves a subset of Papers based on selection criteria. |
slice | Returns a generator that yields (key, subcorpus) tuples for sequential time windows. |
subcorpus | Generates a new Corpus using the criteria in selector. |
top_features | Retrieves the top topn most numerous features in the corpus. |

Corpus objects are generated by the bibliographic readers in the tethne.readers module.

>>> from tethne.readers.wos import read
>>> read('/path/to/data')
<tethne.classes.corpus.Corpus object at 0x10278ea10>
You can also build a Corpus from a list of Papers.

>>> papers = however_you_generate_papers()   # <- list of Papers.
>>> corpus = Corpus(papers)

All of the Papers in the Corpus will be indexed. You can control which field is used for indexing by passing the index_by keyword argument to one of the read methods or to the Corpus constructor.

>>> corpus = Corpus(papers, index_by='doi')
>>> corpus.indexed_papers.keys()
['doi/123', 'doi/456', ..., 'doi/789']

The WoS read method uses the wosid field by default, and the DfR read method uses doi. The Zotero read method tries to use whatever it can find. If the selected index_by field is not set or not available, a unique key will be generated using the title and author names.

By default, Corpus will also index the fields listed in the default value of index_fields (authors, citations, ayjid, and date). To control which fields are indexed, pass the index_fields argument, or call Corpus.index() directly.

>>> corpus = Corpus(papers, index_fields=['authors', 'date'])
>>> corpus.indices.keys()
['authors', 'date']
Similarly, Corpus will index features. By default, authors and citations will be indexed as features (i.e. made available for network-building methods). To control which fields are indexed as features, pass the index_features argument, or call Corpus.index_feature().

>>> corpus = Corpus(papers, index_features=['unigrams'])
>>> corpus.features.keys()
['unigrams']
There are a variety of ways to select Papers from the corpus.

>>> corpus = Corpus(papers)
>>> corpus[0]    # Integer indices yield a single Paper.
<tethne.classes.paper.Paper object at 0x103037c10>

>>> corpus[range(0,5)]  # A list of indices will yield a list of Papers.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', 1995)]  # You can select based on indexed fields.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus['citations', ('DOLE RJ 1952 CELL')]     # All papers with this citation!
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', range(1993, 1995))] # Multiple values are supported, too.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

If you prefer to retrieve a Corpus rather than simply a list of Paper instances (e.g. to build networks), use Corpus.subcorpus(). subcorpus accepts selector arguments just like Corpus.__getitem__().

>>> corpus = Corpus(papers)
>>> subcorpus = corpus.subcorpus(('date', 1995))
>>> subcorpus
<tethne.classes.corpus.Corpus object at 0x10278ea10>

distribution(**slice_kwargs)
Calculates the number of papers in each slice, as defined by slice_kwargs.

Parameters:
    slice_kwargs : kwargs
        Keyword arguments to be passed to Corpus.slice().
Returns:
    list

Examples
>>> corpus.distribution(step_size=1, window_size=1)
[5, 5]

feature_distribution(featureset_name, feature, mode='counts', **slice_kwargs)
Calculates the distribution of a feature across slices of the corpus.

Parameters:
    featureset_name : str
        Name of a FeatureSet in the Corpus.
    feature : str
        Name of the specific feature of interest. E.g. if featureset_name='citations', then feature could be something like 'DOLE RJ 1965 CELL'.
    mode : str
        (default: 'counts') If set to 'counts', values will be the sum of all count values for the feature in each slice. If set to 'documentCounts', values will be the number of papers in which the feature occurs in each slice.
    slice_kwargs : kwargs
        Keyword arguments to be passed to Corpus.slice().
Returns:
    list

Examples
>>> corpus.feature_distribution(featureset_name='citations',
...                             feature='DOLE RJ 1965 CELL',
...                             step_size=1, window_size=1)
[2, 15, 25, 1]
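The difference between the two modes can be illustrated without Tethne. The following is a minimal sketch, not the library's implementation: the `slices` data and the helper `feature_distribution_sketch` are hypothetical, and each paper is reduced to a plain dict of feature counts.

```python
# Hypothetical data: each slice (keyed by year) holds one dict per paper,
# mapping a feature to its count in that paper.
slices = {
    2005: [{'DOLE RJ 1965 CELL': 2}, {'DOLE RJ 1965 CELL': 1}, {}],
    2006: [{'DOLE RJ 1965 CELL': 3}, {'WANG T 2006 NATURE': 1}],
}


def feature_distribution_sketch(slices, feature, mode='counts'):
    """Return one value per slice, in chronological order (illustration only)."""
    values = []
    for year in sorted(slices):
        papers = slices[year]
        if mode == 'counts':
            # Sum of the feature's count values over all papers in the slice.
            values.append(sum(p.get(feature, 0) for p in papers))
        elif mode == 'documentCounts':
            # Number of papers in the slice in which the feature occurs.
            values.append(sum(1 for p in papers if feature in p))
        else:
            raise ValueError("mode must be 'counts' or 'documentCounts'")
    return values


counts = feature_distribution_sketch(slices, 'DOLE RJ 1965 CELL')
doc_counts = feature_distribution_sketch(slices, 'DOLE RJ 1965 CELL',
                                         mode='documentCounts')
print(counts)      # [3, 3]
print(doc_counts)  # [2, 1]
```

In 2005 the feature occurs three times in total but in only two papers, which is why the two modes diverge.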

features = {}
Contains FeatureSets for a Corpus instance.

New FeatureSets can be created from attributes of Paper using index_feature().

index(attr)
Indexes the Papers in this Corpus instance by the attribute attr.

New indices are added to indices.

Parameters:
    attr : str
        The name of a Paper attribute.

index_by = None
Specifies the field in Papers that should be used as the primary indexing field for a Corpus instance.

index_class
alias of dict

index_feature(feature_name, tokenize=<function <lambda>>, structured=False)
Creates a new FeatureSet from the attribute feature_name in each Paper.

New FeatureSets are added to features.

Parameters:
    feature_name : str
        The name of a Paper attribute.

index_kwargs = {}

index_paper_by_feature(paper, feature_name, tokenize=<function <lambda>>, structured=False)

indexed_papers = {}
The primary index for Papers in a Corpus instance. Keys are based on index_by, and values are Paper instances.

indices = {}
Contains field indices for the Papers in a Corpus instance.

The 'citations' index, for example, allows us to look up all of the Papers that contain a particular bibliographic reference:

>>> for citation, papers in corpus.indices['citations'].items()[7:10]:
...     print 'The following Papers cite {0}: \n\n\t{1} \n'.format(citation, '\n\t'.join(papers))
The following Papers cite WHITFIELD J 2006 NATURE:

    WOS:000252758800011
    WOS:000253464000004

The following Papers cite WANG T 2006 GLOBAL CHANGE BIOL:

    WOS:000282225000019
    WOS:000281546800001
    WOS:000251903200006
    WOS:000292901400010
    WOS:000288656800015
    WOS:000318353300001
    WOS:000296710600017
    WOS:000255552100006
    WOS:000272153800012

The following Papers cite LINKOSALO T 2009 AGR FOREST METEOROL:

    WOS:000298398700003

Notice that the values above are not Papers themselves, but identifiers. These are the same identifiers used in the primary index, so we can use them to look up Papers:

>>> papers = corpus.indices['citations']['CARLSON SM 2004 EVOL ECOL RES']  # Who cited Carlson 2004?
>>> print papers
['WOS:000311994600006', 'WOS:000304903100014', 'WOS:000248812000005']
>>> for paper in papers:
...     print corpus.indexed_papers[paper]
<tethne.classes.paper.Paper object at 0x112d1fe10>
<tethne.classes.paper.Paper object at 0x1121e8310>
<tethne.classes.paper.Paper object at 0x1144ad390>

You can create new indices using index().
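The relationship between the primary index and a field index can be sketched with plain dictionaries. This is an illustration under stated assumptions, not Tethne's code: the `records` below are hypothetical stand-ins for Paper instances, and `citation_index` plays the role of `corpus.indices['citations']`.

```python
from collections import defaultdict

# Hypothetical records standing in for Paper instances.
records = [
    {'wosid': 'WOS:001',
     'citations': ['CARLSON SM 2004 EVOL ECOL RES',
                   'WANG T 2006 GLOBAL CHANGE BIOL']},
    {'wosid': 'WOS:002', 'citations': ['CARLSON SM 2004 EVOL ECOL RES']},
    {'wosid': 'WOS:003', 'citations': []},
]

# Primary index: identifier -> record.
indexed_papers = {r['wosid']: r for r in records}

# Field index: field value -> list of identifiers (not the records themselves).
citation_index = defaultdict(list)
for r in records:
    for citation in r['citations']:
        citation_index[citation].append(r['wosid'])

# Two-step lookup: field value -> identifiers -> records.
citing_ids = citation_index['CARLSON SM 2004 EVOL ECOL RES']
citing = [indexed_papers[i] for i in citing_ids]
print(citing_ids)  # ['WOS:001', 'WOS:002']
```

Storing identifiers rather than objects in the field index means each record is held once, in the primary index, however many field indices refer to it.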

select(selector, index_only=False)
Retrieves a subset of Papers based on selection criteria.

There are a variety of ways to select Papers.

>>> corpus = Corpus(papers)
>>> corpus[0]    # Integer indices yield a single Paper.
<tethne.classes.paper.Paper object at 0x103037c10>

>>> corpus[range(0,5)]  # A list of indices yields a list of Papers.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', 1995)]  # Select based on indexed fields.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus['citations', ('DOLE RJ 1952 CELL')]     # Citing papers!
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', range(1993, 1995))] # Multiple values are OK.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

If you prefer to retrieve a Corpus rather than simply a list of Paper instances (e.g. to build networks), use Corpus.subcorpus().

Parameters:
    selector : object
        See method description.
Returns:
    list
        A list of Papers.

slice(window_size=1, step_size=1, cumulative=False, count_only=False, subcorpus=True, feature_name=None)
Returns a generator that yields (key, subcorpus) tuples for sequential time windows.

Two common slicing patterns are the “sliding time-window” and the “time-period” patterns. Whereas time-period slicing divides the corpus into subcorpora by sequential non-overlapping time periods, subcorpora generated by time-window slicing can overlap.

Sliding time-window: set step_size=1, and window_size to the desired value.
Time-period: step_size and window_size should have the same value.

The value of key is always the first year in the slice.

Parameters:
    window_size : int
        (default: 1) Size of the time window, in years.
    step_size : int
        (default: 1) Number of years to advance the window at each step.
Returns:
    generator

Examples
>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/data')
>>> for key, subcorpus in corpus.slice():
...     print key, len(subcorpus)
2005 5
2006 5
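The two slicing patterns differ only in how step_size relates to window_size. The sketch below enumerates the years covered by each window; the helper `window_keys` is hypothetical, and it assumes that a window is kept only when it fits entirely inside the date range, which may not match how Tethne treats the final, partial window.

```python
def window_keys(start, end, window_size=1, step_size=1):
    """Yield (key, years) per window; key is the first year in the window.

    Illustration only: assumes windows must fit entirely within
    [start, end], which may differ from Tethne's behavior.
    """
    year = start
    while year + window_size - 1 <= end:
        yield year, list(range(year, year + window_size))
        year += step_size


# Sliding time-window: step_size=1, so consecutive windows overlap.
sliding = list(window_keys(2005, 2008, window_size=2, step_size=1))
# Time-period: step_size == window_size, so windows do not overlap.
periods = list(window_keys(2005, 2008, window_size=2, step_size=2))

print(sliding)  # [(2005, [2005, 2006]), (2006, [2006, 2007]), (2007, [2007, 2008])]
print(periods)  # [(2005, [2005, 2006]), (2007, [2007, 2008])]
```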

subcorpus(selector)
Generates a new Corpus using the criteria in selector.

Accepts selector arguments just like Corpus.select().

>>> corpus = Corpus(papers)
>>> subcorpus = corpus.subcorpus(('date', 1995))
>>> subcorpus
<tethne.classes.corpus.Corpus object at 0x10278ea10>

top_features(featureset_name, topn=20, by='counts', perslice=False, slice_kwargs={})
Retrieves the top topn most numerous features in the corpus.

Parameters:
    featureset_name : str
        Name of a FeatureSet in the Corpus.
    topn : int
        (default: 20) Number of features to return.
    by : str
        (default: 'counts') If 'counts', uses the sum of feature count values to rank features. If 'documentCounts', uses the number of papers in which features occur.
    perslice : bool
        (default: False) If True, retrieves the top topn features in each slice.
    slice_kwargs : kwargs
        If perslice=True, these keyword arguments are passed to Corpus.slice().

tethne.classes.feature module
Classes in this module provide structures for additional data about Papers.

class tethne.classes.feature.BaseFeatureSet(features={})
Bases: object

N_documents

N_features

top(topn, by='counts')
Get the top topn features in the FeatureSet.

Parameters:
    topn : int
        Number of features to return.
    by : str
        (default: 'counts') How features should be sorted. Must be 'counts' or 'documentCounts'.
Returns:
    list

unique
The set of unique elements in this FeatureSet.

class tethne.classes.feature.Feature(data)
Bases: list

A Feature instance is a sparse vector of features over a given concept (usually a Paper).

For example, a Feature might represent word counts for a single Paper.

A Feature may be initialized from a list of (feature, value) tuples, or from a list of raw feature tokens.

>>> myFeature = Feature([('the', 2), ('pine', 1), ('trapezoid', 5)])

To get the set of unique features in this Feature, use Feature.unique:

>>> myFeature.unique
set(['the', 'pine', 'trapezoid'])

Normalized feature values (scaled so that all values sum to 1.0) can be accessed using Feature.norm.

>>> myFeature.norm
[('the', 0.25), ('pine', 0.125), ('trapezoid', 0.625)]
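How raw tokens relate to (feature, value) tuples, and how normalization works, can be sketched with the standard library. This is an illustration, not the Feature class itself: it assumes that raw tokens are simply counted, and the token list is hypothetical.

```python
from collections import Counter

tokens = ['the', 'pine', 'the', 'trapezoid']  # hypothetical raw tokens

# Collapse raw tokens into (feature, value) pairs, sorted for a stable order.
pairs = sorted(Counter(tokens).items())

# Normalize so that all values sum to 1.
total = float(sum(value for _, value in pairs))
norm = [(feature, value / total) for feature, value in pairs]

print(pairs)  # [('pine', 1), ('the', 2), ('trapezoid', 1)]
print(norm)   # [('pine', 0.25), ('the', 0.5), ('trapezoid', 0.25)]
```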

norm

class tethne.classes.feature.FeatureSet(features=None)
Bases: tethne.classes.feature.BaseFeatureSet

A FeatureSet organizes multiple Feature instances.

to_gensim_corpus(raw=False)
Yield a bag-of-words corpus compatible with the Gensim package.

Returns a (corpus, index) tuple (see below).

Parameters:
    context : str
        If provided, each “document” in the Gensim corpus will be a chunk of type context.
    raw : bool
        (default: False) If True, documents will be sequences of tokens rather than sequences of (term, count) “bag-of-words” tuples.
Returns:
    list
        A list of lists of (id, count) tuples. Each sub-list represents a single context item (e.g. a document, or a paragraph). This is the “bag of words” representation used in Gensim.
    dict
        Maps integer IDs to words.

Examples
>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/my/data')
>>> from nltk.tokenize import word_tokenize
>>> corpus.index_feature('abstract', word_tokenize)
>>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()
>>> from gensim import corpora, models
>>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
...                                  num_topics=5, update_every=1,
...                                  chunksize=100, passes=1)

transform(func)
Apply a transformation to tokens in this FeatureSet.

Parameters:
    func : callable
        Should take four parameters: token, value in document (e.g. count), value in FeatureSet (e.g. overall count), and document count (i.e. number of documents in which the token occurs). Should return a new numeric (int or float) value, or None. If the returned value is 0 or None, the token will be excluded.
Returns:
    FeatureSet

Examples
Apply a tf*idf transformation.

>>> from math import log
>>> words = corpus.features['words']
>>> def tfidf(f, c, C, DC):
...     tf = float(c)
...     idf = log(float(len(words.features))/float(DC))
...     return tf*idf
>>> corpus.features['words_tfidf'] = words.transform(tfidf)
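The contract of the callable can be tried out without Tethne. The sketch below is a stand-in, not FeatureSet.transform itself: `features` is a hypothetical dict of per-paper token counts, and `transform_sketch` is a hypothetical helper that applies the same four-argument callable and drops tokens whose new value is 0 or None.

```python
from math import log

# Hypothetical feature set: paper ID -> {token: count in that paper}.
features = {
    'p1': {'pine': 2, 'the': 5},
    'p2': {'the': 3},
    'p3': {'the': 1, 'trapezoid': 4},
}

# Overall count and document count for each token.
counts, document_counts = {}, {}
for doc in features.values():
    for token, c in doc.items():
        counts[token] = counts.get(token, 0) + c
        document_counts[token] = document_counts.get(token, 0) + 1


def tfidf(token, c, C, DC):
    """Same four-argument shape as the callable described above."""
    return float(c) * log(float(len(features)) / float(DC))


def transform_sketch(features, func):
    """Apply func to every token; drop tokens whose new value is 0 or None."""
    result = {}
    for paper_id, doc in features.items():
        new_doc = {}
        for token, c in doc.items():
            value = func(token, c, counts[token], document_counts[token])
            if value:  # excludes both 0 and None
                new_doc[token] = value
        result[paper_id] = new_doc
    return result


transformed = transform_sketch(features, tfidf)
# 'the' occurs in every paper, so its idf is log(1) = 0 and it is dropped.
print(sorted(transformed['p1']))  # ['pine']
```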

class tethne.classes.feature.StructuredFeature(tokens, contexts=None, reference=None)
Bases: list

A StructuredFeature represents the contents of a document as an array of tokens, divisible into a set of nested contexts.

The canonical use-case is to represent a document as a set of words divided into sentences, paragraphs, and (perhaps) pages.

Parameters:
    tokens : list
        An ordered list of tokens.
    contexts : list
        A list of (name, indices) 2-tuples, where name is string-like and indices is an iterable of int token indices.
    reference : tuple
        A (feature, map) 2-tuple, where feature is a StructuredFeature and map is a dict mapping token indices in this StructuredFeature to token indices in feature.

add_context(name, indices, level=None)
Add a new context level to the hierarchy.

By default, new contexts are added to the lowest level of the hierarchy. To insert the context elsewhere in the hierarchy, use the level argument. For example, level=0 would insert the context at the highest level of the hierarchy.

Parameters:
    name : str
    indices : list
        Token indices at which each chunk in the context begins.
    level : int
        Level in the hierarchy at which to insert the context. By default, inserts the context at the lowest level of the hierarchy.

context_chunk(context, j)
Retrieve the tokens in the j-th chunk of context context.

Parameters:
    context : str
        Context name.
    j : int
        Index of a context chunk.
Returns:
    chunk : list
        List of tokens in the selected chunk.
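Because each context is described by the token indices at which its chunks begin, retrieving a chunk amounts to slicing the token list. The following is a minimal sketch of that idea, assuming a plain list of tokens and a dict of contexts; `context_chunk_sketch` is a hypothetical helper, not the StructuredFeature method.

```python
tokens = ['the', 'pine', 'fell', 'it', 'was', 'old', 'so', 'it', 'goes']

# Each context maps a name to the token indices at which its chunks begin.
contexts = {
    'paragraph': [0, 6],
    'sentence': [0, 3, 6],
}


def context_chunk_sketch(tokens, contexts, context, j):
    """Return the tokens in the j-th chunk of the named context (illustration)."""
    starts = contexts[context]
    begin = starts[j]
    # A chunk runs up to the start of the next chunk, or to the end of the tokens.
    end = starts[j + 1] if j + 1 < len(starts) else len(tokens)
    return tokens[begin:end]


second_sentence = context_chunk_sketch(tokens, contexts, 'sentence', 1)
last_paragraph = context_chunk_sketch(tokens, contexts, 'paragraph', 1)
print(second_sentence)  # ['it', 'was', 'old']
print(last_paragraph)   # ['so', 'it', 'goes']
```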

class tethne.classes.feature.StructuredFeatureSet(features={})
Bases: tethne.classes.feature.BaseFeatureSet

A StructuredFeatureSet organizes several StructuredFeature instances.

context_chunks(context=None)
Retrieves all tokens, divided into the chunks in context context.

If context is not found in a feature, then the feature will be treated as a single chunk.

Parameters:
    context : str
        Context name.
Returns:
    papers : list
        2-tuples of (paper ID, chunk indices).
    chunks : list
        Each item in chunks is a list of tokens.

to_gensim_corpus(context=None, raw=False)
Yield a bag-of-words corpus compatible with the Gensim package.

Returns a (corpus, index) tuple (see below).

Parameters:
    context : str
        If provided, each “document” in the Gensim corpus will be a chunk of type context.
    raw : bool
        (default: False) If True, documents will be sequences of tokens rather than sequences of (term, count) “bag-of-words” tuples.
Returns:
    list
        A list of lists of (id, count) tuples. Each sub-list represents a single context item (e.g. a document, or a paragraph). This is the “bag of words” representation used in Gensim.
    dict
        Maps integer IDs to words.

Examples
>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/my/data')
>>> from nltk.tokenize import word_tokenize
>>> corpus.index_feature('abstract', word_tokenize, structured=True)
>>> gensim_corpus, id2word = corpus.features['abstract'].to_gensim_corpus()
>>> from gensim import corpora, models
>>> model = models.ldamodel.LdaModel(corpus=gensim_corpus, id2word=id2word,
...                                  num_topics=5, update_every=1,
...                                  chunksize=100, passes=1)

tethne.classes.feature.argsort(l)

tethne.classes.graphcollection module
A GraphCollection is a set of graphs generated from a Corpus or model.

class tethne.classes.graphcollection.GraphCollection(corpus=None, method=None, slice_kwargs={}, method_kwargs={}, directed=False)
Bases: dict

A GraphCollection is an indexed set of networkx.Graphs.

When you add a networkx.Graph, the nodes are indexed and relabeled.

>>> from tethne import GraphCollection
>>> import networkx as nx
>>> G = GraphCollection()
>>> g = nx.Graph()
>>> g.add_node('A', yes='no')
>>> g.add_edge('A', 'B', c='d')
>>> G['graph1'] = g     # You can also use G.add('graph1', g)
>>> G.graph1.nodes(data=True)
[(0, {}), (1, {'yes': 'no'})]
>>> G.node_index, G.node_lookup
({0: 'B', 1: 'A', -1: None}, {'A': 1, None: -1, 'B': 0})
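The point of a shared node index is that the same node receives the same integer ID in every graph of the collection. The sketch below shows that idea with edge lists and dictionaries; `index_nodes` is a hypothetical helper, and it assigns IDs in first-seen order, which need not match the order Tethne uses (in the example above, 'B' received 0 and 'A' received 1).

```python
def index_nodes(edges, node_lookup):
    """Relabel node labels in `edges` with integer IDs, extending node_lookup.

    Illustration only; the order in which Tethne assigns IDs may differ.
    """
    relabeled = []
    for source, target in edges:
        for label in (source, target):
            if label not in node_lookup:
                node_lookup[label] = len(node_lookup)  # next unused integer
        relabeled.append((node_lookup[source], node_lookup[target]))
    return relabeled


node_lookup = {}  # label -> integer ID, shared by all graphs
graph1 = index_nodes([('A', 'B')], node_lookup)
graph2 = index_nodes([('B', 'C'), ('A', 'C')], node_lookup)
node_index = {i: label for label, i in node_lookup.items()}  # integer ID -> label

print(graph1)      # [(0, 1)]
print(graph2)      # [(1, 2), (0, 2)]
print(node_index)  # {0: 'A', 1: 'B', 2: 'C'}
```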
To build a GraphCollection from a Corpus, pass it and a method to the constructor, or use GraphCollection.build().

>>> corpus = read(datapath)
>>> G = GraphCollection(corpus, coauthors)
>>> G.build(corpus, authors)

add(name, graph)
Index and add a networkx.Graph to the GraphCollection.

Parameters:
    name : hashable
        Unique name used to identify the graph.
    graph : networkx.Graph
Raises:
    ValueError
        If name has already been used in this GraphCollection.

analyze(method_name, mapper=<built-in function map>, invert=False, **kwargs)
Apply a method from NetworkX to each of the graphs in the GraphCollection.

Parameters:
    method_name : str or iterable
        Must be the name of a method accessible directly from the networkx namespace. If an iterable, should be the complete dot-path to the method, e.g. nx.connected.is_connected would be written as ['connected', 'is_connected'].
    mapper : func
        A mapping function. By default, uses Python’s builtin map function. MUST return results in order.
    results_by : str
        (default: ‘graph’) By default, the top-level keys in the results are graph names. If results_by=’node’, node labels are used as top-level keys.
    kwargs : kwargs
        Any additional kwargs are passed to the NetworkX method.
Returns:
    dict

Examples
>>> G.analyze('betweenness_centrality')
{'test': {0: 1.0, 1: 0.0, 2: 0.0}, 'test2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}}
>>> G.analyze('betweenness_centrality', results_by='node')
{0: {'test': 1.0, 'test2': 0.0}, 1: {'test': 0.0, 'test2': 0.0}, 2: {'test': 0.0, 'test2': 0.0}, 3: {'test2': 0.0}}
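The two result layouts in the examples above hold the same values with the nesting swapped. A minimal sketch of that re-keying follows, using the values from the first example; `invert_results` is a hypothetical helper, not part of GraphCollection.

```python
def invert_results(by_graph):
    """Turn {graph: {node: value}} into {node: {graph: value}}."""
    by_node = {}
    for graph_name, node_values in by_graph.items():
        for node, value in node_values.items():
            by_node.setdefault(node, {})[graph_name] = value
    return by_node


by_graph = {'test': {0: 1.0, 1: 0.0, 2: 0.0},
            'test2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}}
by_node = invert_results(by_graph)

print(by_node[0])  # {'test': 1.0, 'test2': 0.0}
print(by_node[3])  # {'test2': 0.0}
```

Node 3 appears only in 'test2', so its entry has a single key; a node-keyed layout does not pad missing graphs.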

build(corpus, method, slice_kwargs={}, method_kwargs={})
Generate a set of networkx.Graphs using method on the slices in corpus.

Parameters:
    corpus : Corpus
    method : str or func
        If str, looks for method in the tethne namespace.
    slice_kwargs : dict
        Keyword arguments to pass to the slice method of corpus.
    method_kwargs : dict
        Keyword arguments to pass to method along with corpus.

collapse(weight_attr='weight')
Returns a networkx.Graph or networkx.DiGraph in which the edges between each pair of nodes are collapsed into a single weighted edge.

edge_history(source, target, attribute)
Returns a dictionary of attribute values for each Graph in the GraphCollection for a single edge.

Parameters:
    source : str
        Identifier for source node.
    target : str
        Identifier for target node.
    attribute : str
        The attribute of interest; e.g. ‘betweenness_centrality’
Returns:
    history : dict

edges(data=False, native=True)
Returns a list of all edges in the GraphCollection.

Parameters:
    data : bool
        (default: False) If True, returns a list of 3-tuples containing source and target node labels, and attributes.
Returns:
    edges : list

index(name, graph)
Index any new nodes in graph, and relabel the nodes in graph using the index.

Parameters:
    name : hashable
        Unique name used to identify the graph.
    graph : networkx.Graph
Returns:
    indexed_graph : networkx.Graph

node_history(node, attribute)
Returns a dictionary of attribute values for each networkx.Graph in the GraphCollection for a single node.

Parameters:
    node : str
        The node of interest.
    attribute : str
        The attribute of interest; e.g. ‘betweenness_centrality’
Returns:
    history : dict

nodes(data=False, native=True)
Returns a list of all nodes in the GraphCollection.

Parameters:
    data : bool
        (default: False) If True, returns a list of 2-tuples containing node labels and attributes.
Returns:
    nodes : list

order(piecewise=False)
Returns the total number of nodes in the GraphCollection.

size(piecewise=False)
Returns the total number of edges in the GraphCollection.

union(weight_attr='_weight')
Returns the union of all graphs in this GraphCollection.

The number of graphs in which an edge exists between each node pair u and v is stored in the edge attribute given by weight_attr (default: _weight).

Parameters:
    weight_attr : str
        (default: ‘_weight’) Name of the edge attribute used to store the number of graphs in which an edge exists between node pairs.
Returns:
    graph : networkx.Graph
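The weight stored by union is a count of graphs, not a sum of edge weights. A minimal sketch of that counting follows, using plain edge lists in place of networkx.Graph objects; the `graphs` data is hypothetical, and the graphs are assumed to be undirected.

```python
from collections import Counter

# Hypothetical graphs, each given as a list of undirected edges.
graphs = {
    2005: [('A', 'B'), ('B', 'C')],
    2006: [('B', 'A'), ('C', 'D')],
}

# Count the number of graphs in which each node pair is connected.
# frozenset makes ('A', 'B') and ('B', 'A') the same undirected pair;
# set() guards against counting a pair twice within one graph.
weights = Counter()
for edges in graphs.values():
    weights.update(set(frozenset(edge) for edge in edges))

union_edges = {tuple(sorted(pair)): {'_weight': n}
               for pair, n in weights.items()}

print(union_edges[('A', 'B')])  # {'_weight': 2}
print(union_edges[('C', 'D')])  # {'_weight': 1}
```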

tethne.classes.paper module
A Paper represents a single bibliographic record.

class tethne.classes.paper.Paper
Bases: object

Tethne’s representation of a bibliographic record.

Fields can be set using dict-like assignment, and accessed as attributes.

>>> myPaper = Paper()
>>> myPaper['date'] = 1965
>>> myPaper.date
1965

ayjid
Fuzzy WoS-style identifier, composed of the first author’s name (LAST I), publication date, and journal.

Returns:
    ayjid : str

citations

tethne.classes.streaming module

class tethne.classes.streaming.StreamingCorpus(*args, **kwargs)
Bases: tethne.classes.corpus.Corpus

Provides memory-friendly access to large collections of metadata.

index_class
alias of StreamingIndex

papers

Module contents
The classes package provides the fundamental classes for working with bibliographic data in Tethne.

paper | A Paper represents a single bibliographic record. |
corpus | A Corpus is a container for Papers. |
feature | Classes in this module provide structures for additional data about Papers. |
graphcollection | A GraphCollection is a set of graphs generated from a Corpus or model. |