
tethne.classes package

Submodules

tethne.classes.corpus module

A Corpus is a container for Papers.

class tethne.classes.corpus.Corpus(papers=[], index_by=None, index_fields=['authors', 'citations', 'ayjid', 'date'], index_features=['authors', 'citations'], **kwargs)[source]

Bases: object

A Corpus represents a collection of Paper instances.

distribution – Calculates the number of papers in each slice, as defined by slice_kwargs.
feature_distribution – Calculates the distribution of a feature across slices of the corpus.
features – Contains FeatureSets for a Corpus instance.
index – Indexes the Papers in this Corpus instance by the attribute attr.
index_by – Specifies the field in Papers that should be used as the primary indexing field for a Corpus instance.
index_feature – Creates a new FeatureSet from the attribute feature_name in each Paper.
indexed_papers – The primary index for Papers in a Corpus instance.
indices – Contains field indices for the Papers in a Corpus instance.
papers – A list of all Papers in the Corpus.
select – Retrieves a subset of Papers based on selection criteria.
slice – Returns a generator that yields (key, subcorpus) tuples for sequential time windows.
subcorpus – Generates a new Corpus using the criteria in selector.
top_features – Retrieves the top topn most numerous features in the corpus.

Corpus objects are generated by the bibliographic readers in the tethne.readers module.

>>> from tethne.readers.wos import read
>>> read('/path/to/data')
<tethne.classes.corpus.Corpus object at 0x10278ea10>

You can also build a Corpus from a list of Papers.

>>> papers = however_you_generate_papers()   # <- list of Papers.
>>> corpus = Corpus(papers)

All of the Papers in the Corpus will be indexed. You can control which field is used for indexing by passing the index_by keyword argument to one of the read methods or to the Corpus constructor.

>>> corpus = Corpus(papers, index_by='doi')
>>> corpus.indexed_papers.keys()
['doi/123', 'doi/456', ..., 'doi/789']

The WoS read method uses the wosid field by default, and the DfR read method uses doi. The Zotero read method tries to use whatever it can find. If the selected index_by field is not set or not available, a unique key will be generated using the title and author names.

By default, Corpus will also index the authors and citations fields. To control which fields are indexed, pass the index_fields argument, or call Corpus.index() directly.

>>> corpus = Corpus(papers, index_fields=['authors', 'date'])
>>> corpus.indices.keys()
['authors', 'date']

Similarly, Corpus will index features. By default, authors and citations will be indexed as features (i.e. available for network-building methods). To control which fields are indexed as features, pass the index_features argument, or call Corpus.index_features().

>>> corpus = Corpus(papers, index_features=['unigrams'])
>>> corpus.features.keys()
['unigrams']

There are a variety of ways to select Papers from the corpus.

>>> corpus = Corpus(papers)
>>> corpus[0]    # Integer indices yield a single Paper.
<tethne.classes.paper.Paper object at 0x103037c10>

>>> corpus[range(0,5)]  # A list of indices will yield a list of Papers.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', 1995)]  # You can select based on indexed fields.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus['citations', ('DOLE RJ 1952 CELL')]   # All papers with this citation!
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', range(1993, 1995))]  # Multiple values are supported, too.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

If you prefer to retrieve a Corpus rather than simply a list of Paper instances (e.g. to build networks), use Corpus.subcorpus(). subcorpus accepts selector arguments just like Corpus.__getitem__().

>>> corpus = Corpus(papers)
>>> subcorpus = corpus.subcorpus(('date', 1995))
>>> subcorpus
<tethne.classes.corpus.Corpus object at 0x10278ea10>
add_papers(papers)[source]
distribution(**slice_kwargs)[source]

Calculates the number of papers in each slice, as defined by slice_kwargs.

Parameters:

slice_kwargs : kwargs

Keyword arguments to be passed to Corpus.slice().

Returns:

list

Examples

>>> corpus.distribution(step_size=1, window_size=1)
[5, 5]
feature_distribution(featureset_name, feature, mode='counts', **slice_kwargs)[source]

Calculates the distribution of a feature across slices of the corpus.

Parameters:

featureset_name : str

Name of a FeatureSet in the Corpus.

feature : str

Name of the specific feature of interest. E.g. if featureset_name='citations', then feature could be something like 'DOLE RJ 1965 CELL'.

mode : str

(default: 'counts') If set to 'counts', values will be the sum of all count values for the feature in each slice. If set to 'documentCounts', values will be the number of papers in which the feature occurs in each slice.

slice_kwargs : kwargs

Keyword arguments to be passed to Corpus.slice().

Returns:

list

Examples

>>> corpus.feature_distribution(featureset_name='citations',
...                             feature='DOLE RJ 1965 CELL',
...                             step_size=1, window_size=1)
[2, 15, 25, 1]
features = {}

Contains FeatureSets for a Corpus instance.

New FeatureSets can be created from attributes of Paper using index_feature().

index(attr)[source]

Indexes the Papers in this Corpus instance by the attribute attr.

New indices are added to indices.

Parameters:

attr : str

The name of a Paper attribute.
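
Examples

A minimal sketch, assuming the Papers in this Corpus carry a 'journal' field (any Paper attribute can be used):

>>> corpus.index('journal')
>>> 'journal' in corpus.indices
True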

index_by = None

Specifies the field in Papers that should be used as the primary indexing field for a Corpus instance.

index_class

alias of dict

index_feature(feature_name, tokenize=<function <lambda>>, structured=False)[source]

Creates a new FeatureSet from the attribute feature_name in each Paper.

New FeatureSets are added to features.

Parameters:

feature_name : str

The name of a Paper attribute.
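
Examples

A hypothetical sketch, assuming each Paper has an 'abstract' field containing raw text; the tokenizer shown here is illustrative:

>>> corpus.index_feature('abstract', tokenize=lambda text: text.split())
>>> 'abstract' in corpus.features
True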

index_kwargs = {}
index_paper_by_attr(paper, attr)[source]
index_paper_by_feature(paper, feature_name, tokenize=<function <lambda>>, structured=False)[source]
indexed_papers = {}

The primary index for Papers in a Corpus instance. Keys are based on index_by, and values are Paper instances.

indices = {}

Contains field indices for the Papers in a Corpus instance.

The 'citations' index, for example, allows us to look up all of the Papers that contain a particular bibliographic reference:

>>> for citation, papers in corpus.indices['citations'].items()[7:10]:
...     print 'The following Papers cite {0} \n\n\t{1} \n'.format(citation, '\n\t'.join(papers))
The following Papers cite WHITFIELD J 2006 NATURE:
     WOS:000252758800011
     WOS:000253464000004
The following Papers cite WANG T 2006 GLOBAL CHANGE BIOL:
     WOS:000282225000019
     WOS:000281546800001
     WOS:000251903200006
     WOS:000292901400010
     WOS:000288656800015
     WOS:000318353300001
     WOS:000296710600017
     WOS:000255552100006
     WOS:000272153800012
The following Papers cite LINKOSALO T 2009 AGR FOREST METEOROL:
     WOS:000298398700003

Notice that the values above are not Papers themselves, but identifiers. These are the same identifiers used in the primary index, so we can use them to look up Papers:

>>> papers = corpus.indices['citations']['CARLSON SM 2004 EVOL ECOL RES']  # Who cited Carlson 2004?
>>> print papers
>>> for paper in papers:
...     print corpus.indexed_papers[paper]
['WOS:000311994600006', 'WOS:000304903100014', 'WOS:000248812000005']
<tethne.classes.paper.Paper object at 0x112d1fe10>
<tethne.classes.paper.Paper object at 0x1121e8310>
<tethne.classes.paper.Paper object at 0x1144ad390>

You can create new indices using index().

papers

A list of all Papers in the Corpus.

select(selector, index_only=False)[source]

Retrieves a subset of Papers based on selection criteria.

There are a variety of ways to select Papers.

>>> corpus = Corpus(papers)
>>> corpus[0]    # Integer indices yield a single Paper.
<tethne.classes.paper.Paper object at 0x103037c10>

>>> corpus[range(0,5)]  # A list of indices yields a list of Papers.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', 1995)]  # Select based on indexed fields.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus['citations', ('DOLE RJ 1952 CELL')]   # Citing papers!
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

>>> corpus[('date', range(1993, 1995))] # Multiple values are OK.
[<tethne.classes.paper.Paper object at 0x103037c10>,
 <tethne.classes.paper.Paper object at 0x10301c890>,
 ...
 <tethne.classes.paper.Paper object at 0x10302f5d0>]

If you prefer to retrieve a Corpus rather than simply a list of Paper instances (e.g. to build networks), use Corpus.subcorpus().

Parameters:

selector : object

See method description.

Returns:

list

A list of Papers.

slice(window_size=1, step_size=1, cumulative=False, count_only=False, subcorpus=True, feature_name=None)[source]

Returns a generator that yields (key, subcorpus) tuples for sequential time windows.

Two common slicing patterns are the “sliding time-window” and the “time-period” patterns. Whereas time-period slicing divides the corpus into subcorpora by sequential non-overlapping time periods, subcorpora generated by time-window slicing can overlap.

Figure (_images/timeline.timeslice.png): Time-period slicing, with a window-size of 4 years.

Figure (_images/timeline.timewindow.png): Time-window slicing, with a window-size of 4 years and a step-size of 1 year.

Sliding time-window – Set step_size=1, and window_size to the desired value.

Time-period – step_size and window_size should have the same value.

The value of key is always the first year in the slice.

Parameters:

window_size : int

(default: 1) Size of the time window, in years.

step_size : int

(default: 1) Number of years to advance window at each step.

Returns:

generator

Examples

>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/data')
>>> for key, subcorpus in corpus.slice():
...     print key, len(subcorpus)
2005 5
2006 5
subcorpus(selector)[source]

Generates a new Corpus using the criteria in selector.

Accepts selector arguments just like Corpus.select().

>>> corpus = Corpus(papers)
>>> subcorpus = corpus.subcorpus(('date', 1995))
>>> subcorpus
<tethne.classes.corpus.Corpus object at 0x10278ea10>
subfeatures(selector, featureset_name)[source]
top_features(featureset_name, topn=20, by='counts', perslice=False, slice_kwargs={})[source]

Retrieves the top topn most numerous features in the corpus.

Parameters:

featureset_name : str

Name of a FeatureSet in the Corpus.

topn : int

(default: 20) Number of features to return.

by : str

(default: 'counts') If 'counts', uses the sum of feature count values to rank features. If 'documentCounts', uses the number of papers in which features occur.

perslice : bool

(default: False) If True, retrieves the top topn features in each slice.

slice_kwargs : kwargs

If perslice=True, these keyword arguments are passed to Corpus.slice().
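
Examples

A hypothetical example; the citation strings and values shown below are illustrative only:

>>> corpus.top_features('citations', topn=2)
[('DOLE RJ 1965 CELL', 25), ('WHITFIELD J 2006 NATURE', 18)]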

tethne.classes.feature module

Classes in this module provide structures for additional data about Papers.

class tethne.classes.feature.BaseFeatureSet(features={})[source]

Bases: object

N_documents
N_features
add(paper_id, feature)[source]
count(elem)[source]
documentCount(elem)[source]
items()[source]
iteritems()[source]
papers_containing(elem)[source]
top(topn, by='counts')[source]

Get the top topn features in the FeatureSet.

Parameters:

topn : int

Number of features to return.

by : str

(default: 'counts') How features should be sorted. Must be 'counts' or 'documentCounts'.

Returns:

list

unique

The set of unique elements in this FeatureSet.

class tethne.classes.feature.Feature(data)[source]

Bases: list

A Feature instance is a sparse vector of features over a given concept (usually a Paper).

For example, a Feature might represent word counts for a single Paper.

A Feature may be initialized from a list of (feature, value) tuples...

>>> myFeature = Feature([('the', 2), ('pine', 1), ('trapezoid', 5)])

...or by passing a list of raw feature tokens:
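
For instance (assuming, as the surrounding examples suggest, that repeated tokens are counted automatically), the following would be equivalent to the Feature above:

>>> myFeature = Feature(['the', 'the', 'pine', 'trapezoid', 'trapezoid',
...                      'trapezoid', 'trapezoid', 'trapezoid'])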

To get the set of unique features in this Feature, use Feature.unique:

>>> myFeature.unique
set(['the', 'pine', 'trapezoid'])

Normalized feature values (so that all values sum to 1.0) can be accessed using Feature.norm.

>>> myFeature.norm
[('the', 0.25), ('pine', 0.125), ('trapezoid', 0.625)]
extend(data)[source]
norm
top(topn=10)[source]

Get a list of the top topn features in this Feature.

Parameters:

topn : int

Returns:

list

Examples

>>> myFeature = Feature([('the', 2), ('pine', 1), ('trapezoid', 5)])
>>> myFeature.top(1)
[('trapezoid', 5)]
unique

The set of unique elements in this Feature.

value(element)[source]
class tethne.classes.feature.FeatureSet(features=None)[source]

Bases: tethne.classes.feature.BaseFeatureSet

A FeatureSet organizes multiple Feature instances.
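
A minimal sketch of constructing a FeatureSet directly; the paper identifiers 'p1' and 'p2' are hypothetical:

>>> from tethne.classes.feature import Feature, FeatureSet
>>> fs = FeatureSet({'p1': Feature([('the', 2), ('pine', 1)]),
...                  'p2': Feature([('pine', 3)])})
>>> fs.documentCount('pine')   # Number of papers in which 'pine' occurs.
2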

as_matrix()[source]
as_vector(p, norm=False)[source]
transform(func)[source]

Apply a transformation to tokens in this FeatureSet.

Parameters:

func : callable

Should take four parameters: token, value in document (e.g. count), value in FeatureSet (e.g. overall count), and document count (i.e. the number of documents in which the token occurs). Should return a new numeric (int or float) value, or None. If the returned value is 0 or None, the token will be excluded.

Returns:

FeatureSet

Examples

Apply a tf*idf transformation.

>>> from math import log
>>> words = corpus.features['words']
>>> def tfidf(f, c, C, DC):
...     tf = float(c)
...     idf = log(float(len(words.features)) / float(DC))
...     return tf * idf
>>> corpus.features['words_tfidf'] = words.transform(tfidf)
translate(func)[source]
class tethne.classes.feature.StructuredFeature(tokens, contexts=None, reference=None)[source]

Bases: list

A StructuredFeature represents the contents of a document as an array of tokens, divisible into a set of nested contexts.

The canonical use-case is to represent a document as a set of words divided into sentences, paragraphs, and (perhaps) pages.

Parameters:

tokens : list

An ordered list of tokens.

contexts : list

A list of (name, indices) 2-tuples, where name is string-like and indices is an iterable of int token indices.

reference : tuple

A (feature, map) 2-tuple, where feature is a StructuredFeature and map is a dict mapping token indices in this StructuredFeature to token indices in feature.
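
A minimal sketch of constructing a StructuredFeature with a single 'sentence' context; the tokens and context name here are purely illustrative:

>>> from tethne.classes.feature import StructuredFeature
>>> tokens = ['the', 'pine', 'stood', 'tall', 'the', 'end']
>>> contexts = [('sentence', [0, 4])]   # Two sentences, starting at tokens 0 and 4.
>>> sf = StructuredFeature(tokens, contexts=contexts)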

add_context(name, indices, level=None)[source]

Add a new context level to the hierarchy.

By default, new contexts are added to the lowest level of the hierarchy. To insert the context elsewhere in the hierarchy, use the level argument. For example, level=0 would insert the context at the highest level of the hierarchy.

Parameters:

name : str

indices : list

Token indices at which each chunk in the context begins.

level : int

Level in the hierarchy at which to insert the context. By default, inserts the context at the lowest level of the hierarchy.
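
Continuing the sketch above, a hypothetical 'paragraph' context could be inserted at the top of the hierarchy like so:

>>> sf.add_context('paragraph', [0], level=0)   # One paragraph spanning all tokens.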

context_chunk(context, j)[source]

Retrieve the tokens in the j-th chunk of context context.

Parameters:

context : str

Context name.

j : int

Index of a context chunk.

Returns:

chunk : list

List of tokens in the selected chunk.
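
Continuing the sketch above (and assuming each chunk runs from its start index to the start of the next chunk), the second sentence would be retrieved with:

>>> sf.context_chunk('sentence', 1)
['the', 'end']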

context_chunks(context)[source]

Retrieves all tokens, divided into the chunks in context context.

Parameters:

context : str

Context name.

Returns:

chunks : list

Each item in chunks is a list of tokens.
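
Continuing the same sketch, under the same assumption about chunk boundaries:

>>> sf.context_chunks('sentence')
[['the', 'pine', 'stood', 'tall'], ['the', 'end']]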

unique

The set of unique elements in this Feature.

class tethne.classes.feature.StructuredFeatureSet(features={})[source]

Bases: tethne.classes.feature.BaseFeatureSet

A StructuredFeatureSet organizes several StructuredFeature instances.

context_chunks(context)[source]

Retrieves all tokens, divided into the chunks in context context.

If context is not found in a feature, then the feature will be treated as a single chunk.

Parameters:

context : str

Context name.

Returns:

papers : list

2-tuples of (paper ID, chunk indices).

chunks : list

Each item in chunks is a list of tokens.
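
A hedged sketch, reusing the StructuredFeature sf from above and assuming the two documented return values come back as a (papers, chunks) tuple:

>>> from tethne.classes.feature import StructuredFeatureSet
>>> sfs = StructuredFeatureSet({'p1': sf})
>>> papers, chunks = sfs.context_chunks('sentence')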

transform(func)[source]
tethne.classes.feature.argsort(l)
tethne.classes.feature.feature(f)[source]

Decorator for properties that should be represented as Features.

tethne.classes.graphcollection module

A GraphCollection is a set of graphs generated from a Corpus or model.

class tethne.classes.graphcollection.GraphCollection(corpus=None, method=None, slice_kwargs={}, method_kwargs={}, directed=False)[source]

Bases: dict

A GraphCollection is an indexed set of networkx.Graphs.

When you add a networkx.Graph, the nodes are indexed and relabeled.

>>> from tethne import GraphCollection
>>> import networkx as nx

>>> G = GraphCollection()
>>> g = nx.Graph()
>>> g.add_node('A', yes='no')
>>> g.add_edge('A', 'B', c='d')

>>> G['graph1'] = g    # You can also use G.add('graph1', g)

>>> G.graph1.nodes(data=True)
[(0, {}), (1, {'yes': 'no'})]

>>> G.node_index, G.node_lookup
({0: 'B', 1: 'A', -1: None}, {'A': 1, None: -1, 'B': 0})

To build a GraphCollection from a Corpus, pass it and a method to the constructor, or use GraphCollection.build().

>>> corpus = read(datapath)
>>> G = GraphCollection(corpus, coauthors)

>>> G.build(corpus, coauthors)
add(name, graph)[source]

Index and add a networkx.Graph to the GraphCollection.

Parameters:

name : hashable

Unique name used to identify the graph.

graph : networkx.Graph

Raises:

ValueError

If name has already been used in this GraphCollection.

analyze(method_name, mapper=<built-in function map>, invert=False, **kwargs)[source]

Apply a method from NetworkX to each of the graphs in the GraphCollection.

Parameters:

method_name : str or iterable

Must be the name of a method accessible directly from the networkx namespace. If an iterable, should be the complete dot-path to the method, e.g. nx.connected.is_connected would be written as ['connected', 'is_connected'].

mapper : func

A mapping function. By default, uses Python's built-in map function. MUST return results in order.

results_by : str

(default: 'graph') By default, the top-level keys in the results are graph names. If results_by='node', node labels are used as top-level keys.

kwargs : kwargs

Any additional kwargs are passed to the NetworkX method.

Returns:

dict

Examples

>>> G.analyze('betweenness_centrality')
{'test': {0: 1.0, 1: 0.0, 2: 0.0},
 'test2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}}
>>> G.analyze('betweenness_centrality', results_by='node')
{0: {'test': 1.0, 'test2': 0.0},
 1: {'test': 0.0, 'test2': 0.0},
 2: {'test': 0.0, 'test2': 0.0},
 3: {'test2': 0.0}}
build(corpus, method, slice_kwargs={}, method_kwargs={})[source]

Generate a set of networkx.Graphs using method on the slices in corpus.

Parameters:

corpus : Corpus

method : str or func

If str, looks for method in the tethne namespace.

slice_kwargs : dict

Keyword arguments to pass to the slice() method of corpus.

method_kwargs : dict

Keyword arguments to pass to method along with corpus.

collapse(weight_attr='weight')[source]

Returns a networkx.Graph or networkx.DiGraph in which the edges between each pair of nodes are collapsed into a single weighted edge.

edge_distribution()[source]
edge_history(source, target, attribute)[source]

Returns a dictionary of attribute values for each Graph in the GraphCollection for a single edge.

Parameters:

source : str

Identifier for source node.

target : str

Identifier for target node.

attribute : str

The attribute of interest; e.g. ‘betweenness_centrality’

Returns:

history : dict

edges(data=False, native=True)[source]

Returns a list of all edges in the GraphCollection.

Parameters:

data : bool

(default: False) If True, returns a list of 3-tuples containing source and target node labels, and attributes.

Returns:

edges : list

index(name, graph)[source]

Index any new nodes in graph, and relabel the nodes in graph using the index.

Parameters:

name : hashable

Unique name used to identify the graph.

graph : networkx.Graph

Returns:

indexed_graph : networkx.Graph

node_distribution()[source]
node_history(node, attribute)[source]

Returns a dictionary of attribute values for each networkx.Graph in the GraphCollection for a single node.

Parameters:

node : str

The node of interest.

attribute : str

The attribute of interest; e.g. ‘betweenness_centrality’

Returns:

history : dict

nodes(data=False, native=True)[source]

Returns a list of all nodes in the GraphCollection.

Parameters:

data : bool

(default: False) If True, returns a list of 2-tuples containing node labels and attributes.

Returns:

nodes : list

order(piecewise=False)[source]

Returns the total number of nodes in the GraphCollection.

size(piecewise=False)[source]

Returns the total number of edges in the GraphCollection.

union(weight_attr='_weight')[source]

Returns the union of all graphs in this GraphCollection.

The number of graphs in which an edge exists between each node pair u and v is stored in the edge attribute given by weight_attr (default: _weight).

Parameters:

weight_attr : str

(default: ‘_weight’) Name of the edge attribute used to store the number of graphs in which an edge exists between node pairs.

Returns:

graph : networkx.Graph
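
A brief sketch of inspecting the stored weights (output omitted; node labels are the integer indices assigned by the GraphCollection):

>>> g = G.union()
>>> for u, v, data in g.edges(data=True):
...     print u, v, data['_weight']   # Number of graphs in which edge (u, v) appears.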

tethne.classes.paper module

A Paper represents a single bibliographic record.

class tethne.classes.paper.Paper[source]

Bases: object

Tethne’s representation of a bibliographic record.

Fields can be set using dict-like assignment, and accessed as attributes.

>>> myPaper = Paper()
>>> myPaper['date'] = 1965
>>> myPaper.date
1965
authors
ayjid

Fuzzy WoS-style identifier, composed of the first author's name (LAST I), the publication date, and the journal.

Returns:

ayjid : str
citations

tethne.classes.streaming module

class tethne.classes.streaming.StreamingCorpus(*args, **kwargs)[source]

Bases: tethne.classes.corpus.Corpus

Provides memory-friendly access to large collections of metadata.
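
A hedged sketch: since StreamingCorpus subclasses Corpus, it is assumed here that it can be instantiated just like a Corpus (the papers list is hypothetical):

>>> from tethne.classes.streaming import StreamingCorpus
>>> corpus = StreamingCorpus(papers, index_by='doi')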

index_class

alias of StreamingIndex

papers
class tethne.classes.streaming.StreamingIndex(name='index', base_path='.', serializer=cPickle)[source]

Bases: object

items()[source]
iteritems()[source]
keys()[source]
update(data)[source]
values()[source]

Module contents

The classes package provides the fundamental classes for working with bibliographic data in Tethne.

paper – A Paper represents a single bibliographic record.
corpus – A Corpus is a container for Papers.
feature – Classes in this module provide structures for additional data about Papers.
graphcollection – A GraphCollection is a set of graphs generated from a Corpus or model.