tethne.analyze.corpus module

Methods for analyzing Corpus objects.

burstness Estimate burstness profile for the topn features (or flist) in feature.
feature_burstness Estimate burstness profile for a feature over the 'date' axis.
plot_burstness Generate a figure depicting burstness profiles for feature.
plot_sigma Plot sigma values for the topn most influential nodes.
sigma Calculate sigma (from Chen 2009) for all of the nodes in a GraphCollection.
tethne.analyze.corpus.burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, **kwargs)

Estimate burstness profile for the topn features (or flist) in feature.

Uses the popular burstness automaton model introduced by Kleinberg (2002).

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

k : int

(default: 5) Number of burst states.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of features. If provided, topn and perslice are ignored.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, states themselves are returned.

kwargs : kwargs

Parameters for burstness automaton HMM.

Returns:

B : dict

Keys are features, values are tuples of ( dates, burstness )
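The topn parameter accepts either a count or a fraction. A minimal sketch of one way to read that rule, assuming features are ranked by total occurrence count; the helper select_top and its rounding behavior are hypothetical and are not taken from tethne:

```python
def select_top(counts, topn=20):
    """Pick the top-occurring features from a {feature: count} mapping.

    An int is read as a number of features, a float in [0, 1] as a
    fraction of all features (assumed interpretation of `topn`).
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    if isinstance(topn, float):
        if not 0.0 <= topn <= 1.0:
            raise ValueError("a float topn must lie between 0 and 1")
        n = int(round(topn * len(ranked)))
    else:
        n = topn
    return ranked[:n]


# Toy occurrence counts, invented for illustration.
counts = {'process': 5, 'method': 3, 'model': 9, 'theory': 1}
print(select_top(counts, 2))     # two most frequent features
print(select_top(counts, 0.5))   # top half of the features
```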

Examples

>>> from tethne.analyze.corpus import burstness
>>> B = burstness(corpus, 'abstractTerms', flist=['process', 'method'])
>>> B['process']
([1990, 1991, 1992, 1993], [0., 0.4, 0.6, 0.])
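
The returned dictionary can be inspected with ordinary Python. A minimal sketch, using the sample value shown in the example as a stand-in for a real burstness() result; the helper name peak_burst is hypothetical and not part of tethne:

```python
def peak_burst(profile):
    """Return (date, burstness) for the burstiest period in a profile."""
    dates, values = profile
    # Pair each date with its burstness value and keep the largest one.
    return max(zip(dates, values), key=lambda pair: pair[1])


# Stand-in for the output of burstness(); values copied from the example.
B = {'process': ([1990, 1991, 1992, 1993], [0., 0.4, 0.6, 0.])}

for term, profile in B.items():
    date, value = peak_burst(profile)
    print(term, date, value)
```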
tethne.analyze.corpus.feature_burstness(corpus, feature, findex, k=5, normalize=True, **kwargs)

Estimate burstness profile for a feature over the 'date' axis.

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

findex : int

Index of feature in corpus.

k : int

(default: 5) Number of burst states.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, states themselves are returned.

kwargs : kwargs

Parameters for burstness automaton HMM.
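
To make the roles of k, s, and gamma concrete, the following is a rough sketch of the gap-based burst automaton described by Kleinberg (2002), solved with a Viterbi-style search. It is not tethne's implementation: the function name burst_states and its input (a list of gaps between successive occurrences of a feature) are assumptions made for illustration.

```python
import math


def burst_states(gaps, k=5, s=1.1, gamma=1.0):
    """Return the lowest-cost sequence of burst states for a list of gaps."""
    n = len(gaps)
    if n == 0 or sum(gaps) <= 0:
        raise ValueError("need at least one positive gap")
    mean_gap = sum(gaps) / n
    # State i emits gaps at rate s**i / mean_gap, so higher states
    # expect shorter gaps between occurrences.
    rates = [s ** i / mean_gap for i in range(k)]

    def emit(i, x):
        # Negative log of the exponential density rate * exp(-rate * x).
        return rates[i] * x - math.log(rates[i])

    def move(i, j):
        # Moving up costs gamma * ln(n) per state; moving down is free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # The automaton starts in state 0 before the first gap.
    cost = [move(0, j) + emit(j, gaps[0]) for j in range(k)]
    back = []
    for x in gaps[1:]:
        prev, new_cost = [], []
        for j in range(k):
            best = min(range(k), key=lambda i: cost[i] + move(i, j))
            prev.append(best)
            new_cost.append(cost[best] + move(best, j) + emit(j, x))
        back.append(prev)
        cost = new_cost

    # Trace the cheapest path backwards from the best final state.
    state = min(range(k), key=lambda j: cost[j])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


# Long gaps, then a run of short gaps (a burst), then long gaps again.
gaps = [10.0] * 10 + [0.1] * 10 + [10.0] * 10
k = 5
states = burst_states(gaps, k=k, s=2.0, gamma=1.0)
# Equivalent of normalize=True: express states relative to k - 1.
normalized = [q / (k - 1) for q in states]
print(normalized)
```

Lower values of s bring the state rates closer together, so smaller changes in gap length can trigger a higher state; larger values of gamma make upward transitions more expensive.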

tethne.analyze.corpus.plot_burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, fig=None, **kwargs)

Generate a figure depicting burstness profiles for feature.

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

k : int

(default: 5) Number of burst states.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of features. If provided, topn and perslice are ignored.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, states themselves are returned.

fig : matplotlib.figure.Figure

(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.

kwargs : kwargs

Parameters for burstness automaton HMM.

Returns:

fig : matplotlib.figure.Figure

Examples

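>>> import os
>>> from tethne.analyze.corpus import plot_burstness
>>> fig = plot_burstness(corpus, 'citations', topn=2, perslice=True)
>>> # savefig() does not expand '~', so resolve the home directory first.
>>> fig.savefig(os.path.expanduser('~/burstness.png'))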

Years prior to the first occurrence of each feature are grayed out. Periods in which the feature was bursty are depicted by colored blocks, the opacity of which indicates burstness intensity.

_images/burstness.png
tethne.analyze.corpus.plot_sigma(G, corpus, feature, topn=20, sort_by='max', perslice=False, flist=None, fig=None, **kwargs)

Plot sigma values for the topn most influential nodes.

Parameters:

G : GraphCollection

corpus : Corpus

feature : str

Name of a featureset in corpus.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

sort_by : str

(default: ‘max’) Criterion for selecting topn nodes.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of nodes. If provided, topn and perslice are ignored.

fig : matplotlib.figure.Figure

(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.

Returns:

fig : matplotlib.figure.Figure

G : GraphCollection

A co-citation graph collection, updated with sigma node attributes.

Examples

Assuming that you have a Corpus (corpus) sliced by 'date' and a co-citation GraphCollection (G)...

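>>> import os
>>> from tethne.analyze.corpus import plot_sigma
>>> fig, G = plot_sigma(G, corpus, 'citations', topn=5, perslice=True)
>>> # savefig() does not expand '~', so resolve the home directory first.
>>> fig.savefig(os.path.expanduser('~/sigma_plot.png'))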

In this figure, the top 5 most sigma-influential nodes in each slice are shown. Red bands indicate periods in which each paper was influential; opacity indicates the intensity of sigma (normalized by the highest value in the plot). The period prior to the first instance of each node is grayed out.

_images/sigma_plot.png
tethne.analyze.corpus.sigma(G, corpus, feature, **kwargs)

Calculate sigma (from Chen 2009) for all of the nodes in a GraphCollection.

You can set parameters for burstness estimation using kwargs:

Parameter   Description
s           Scaling parameter (> 1.) that controls graininess of burst detection. Lower values make the model more sensitive. Defaults to 1.1.
gamma       Parameter that controls the ‘cost’ of higher burst states. Defaults to 1.0.
k           Number of burst states. Defaults to 5.

Parameters:

G : GraphCollection

corpus : Corpus

feature : str

Name of a featureset in corpus.

Returns:

G : GraphCollection

A graph collection updated with sigma node attributes.
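
For orientation, Chen (2009) defines sigma as (centrality + 1) raised to the power of burstness, so a node scores highly only when it is both structurally central and bursty. The sketch below illustrates that formula alone. Whether tethne computes the node attribute in exactly this way is an assumption, and sigma_score is a hypothetical name:

```python
def sigma_score(centrality, burstness):
    """Sigma as (centrality + 1) ** burstness, following Chen (2009)."""
    return (centrality + 1.0) ** burstness


# A node with no burst has sigma 1.0 regardless of its centrality,
# and a node with zero centrality has sigma 1.0 regardless of its burst.
print(sigma_score(0.5, 0.0))
print(sigma_score(0.0, 0.8))
# Only nodes that are both central and bursty exceed 1.0.
print(sigma_score(1.0, 1.0))
```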

Examples

Assuming that you have a Corpus generated from WoS data that has been sliced by date.

>>> # Generate a co-citation graph collection.
>>> from tethne import GraphCollection
>>> kwargs = { 'threshold':2, 'topn':100 }
>>> G = GraphCollection()
>>> G.build(corpus, 'date', 'papers', 'cocitation', method_kwargs=kwargs)

>>> # Calculate sigma. This may take several minutes, depending on the
>>> #  size of your co-citation graph collection.
>>> from tethne.analyze.corpus import sigma
>>> G = sigma(G, corpus, 'citations')

>>> # Visualize...
>>> from tethne.writers import collection
>>> collection.to_dxgmml(G, '~/cocitation.xgmml')

In the visualization below, node and label sizes are mapped to sigma, and border width is mapped to citations.

_images/cocitation_sigma2.png