tethne.analyze.corpus module

Methods for analyzing Corpus objects.

`burstness`	Estimate burstness profile for the `topn` features (or `flist`) in `feature`.
`feature_burstness`	Estimate burstness profile for a feature over the `'date'` axis.
`plot_burstness`	Generate a figure depicting burstness profiles for `feature`.
`plot_sigma`	Plot sigma values for the `topn` most influential nodes.
`sigma`	Calculate sigma (from Chen 2009) for all of the nodes in a `GraphCollection`.

tethne.analyze.corpus.burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, **kwargs)

Estimate burstness profile for the topn features (or flist) in feature.

Uses the burstness automaton model introduced by Kleinberg (2002).

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

k : int

(default: 5) Number of burst states.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of features. If provided, topn and perslice are ignored.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, the states themselves are returned.

kwargs : kwargs

Parameters for burstness automaton HMM.

Returns:

B : dict

Keys are features, values are tuples of ( dates, burstness )

Examples

>>> from tethne.analyze.corpus import burstness
>>> B = burstness(corpus, 'abstractTerms', flist=['process', 'method'])
>>> B['process']
([1990, 1991, 1992, 1993], [0., 0.4, 0.6, 0.])
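
The returned dict can be post-processed with plain Python. The following is a minimal sketch, not part of tethne: `peak_burst` is a hypothetical helper, and `B` is filled with the sample values shown above rather than with real output.

```python
def peak_burst(profile):
    """Return (date, burstness) for the most intense point of one profile.

    `profile` is a (dates, burstness) tuple, as described under Returns.
    This helper is hypothetical and is not provided by tethne.
    """
    dates, values = profile
    # Index of the largest burstness value (first one wins on ties).
    best = max(range(len(values)), key=values.__getitem__)
    return dates[best], values[best]


# Sample data copied from the example above, standing in for real output.
B = {'process': ([1990, 1991, 1992, 1993], [0., 0.4, 0.6, 0.])}

# Map each feature to its peak (date, burstness) pair.
peaks = {name: peak_burst(profile) for name, profile in B.items()}
```

Here `peaks['process']` holds the year and value of the strongest burst for that feature.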

tethne.analyze.corpus.feature_burstness(corpus, feature, findex, k=5, normalize=True, **kwargs)

Estimate burstness profile for a feature over the 'date' axis.

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

findex : int

Index of feature in corpus.

k : int

(default: 5) Number of burst states.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, the states themselves are returned.

kwargs : kwargs

Parameters for burstness automaton HMM.

tethne.analyze.corpus.plot_burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, fig=None, **kwargs)

Generate a figure depicting burstness profiles for feature.

Parameters:

corpus : Corpus

feature : str

Name of featureset in corpus. E.g. 'citations'.

k : int

(default: 5) Number of burst states.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of features. If provided, topn and perslice are ignored.

normalize : bool

(default: True) If True, burstness is expressed relative to the highest possible state (k-1). Otherwise, the states themselves are returned.

fig : matplotlib.figure.Figure

(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.

kwargs : kwargs

Parameters for burstness automaton HMM.

Returns:

fig : matplotlib.figure.Figure

Examples

>>> import os
>>> from tethne.analyze.corpus import plot_burstness
>>> fig = plot_burstness(corpus, 'citations', topn=2, perslice=True)
>>> fig.savefig(os.path.expanduser('~/burstness.png'))  # savefig() does not expand '~'

Years prior to the first occurrence of each feature are grayed out. Periods in which the feature was bursty are depicted by colored blocks, the opacity of which indicates burstness intensity.

tethne.analyze.corpus.plot_sigma(G, corpus, feature, topn=20, sort_by='max', perslice=False, flist=None, fig=None, **kwargs)

Plot sigma values for the topn most influential nodes.

Parameters:

G : GraphCollection

corpus : Corpus

feature : str

Name of a featureset in corpus.

topn : int or float {0.-1.}

(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.

sort_by : str

(default: ‘max’) Criterion for selecting topn nodes.

perslice : bool

(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.

flist : list

List of nodes. If provided, topn and perslice are ignored.

fig : matplotlib.figure.Figure

(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.

Returns:

fig : matplotlib.figure.Figure

G : GraphCollection

A co-citation graph collection, updated with sigma node attributes.

Examples

Assuming that you have a Corpus (corpus) sliced by 'date' and a co-citation GraphCollection (G)...

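>>> import os
>>> from tethne.analyze.corpus import plot_sigma
>>> # `feature` is required; 'citations' is used here, as in the sigma() example.
>>> fig, G = plot_sigma(G, corpus, 'citations', topn=5, perslice=True)
>>> fig.savefig(os.path.expanduser('~/sigma_plot.png'))  # savefig() does not expand '~'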

In this figure, the top 5 most sigma-influential nodes in each slice are shown. Red bands indicate periods in which each paper was influential; opacity indicates the intensity of sigma (normalized by the highest value in the plot). The period prior to the first instance of each node is grayed out.

tethne.analyze.corpus.sigma(G, corpus, feature, **kwargs)

Calculate sigma (from Chen 2009) for all of the nodes in a GraphCollection.
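
As a rough illustration of the quantity involved: sigma in Chen et al. (2009) is usually stated as (centrality + 1) raised to the power of burstness, where centrality is a node's betweenness centrality. The sketch below assumes that form. `sigma_value` is a hypothetical name, and this page does not confirm that tethne computes sigma in exactly this way.

```python
def sigma_value(centrality, burstness):
    """Sigma for one node, assuming sigma = (centrality + 1) ** burstness.

    Hypothetical helper for illustration only. Both inputs are assumed to
    be non-negative floats (e.g. betweenness centrality and a burstness
    value for the same node and time slice).
    """
    return (centrality + 1.0) ** burstness


# A node with no burst gets sigma == 1 regardless of centrality, so only
# nodes that are both central and bursty stand out.
quiet_hub = sigma_value(centrality=0.8, burstness=0.0)
bursty_hub = sigma_value(centrality=1.0, burstness=1.0)
```

Under this form, burstness acts as an exponent, so a node needs both structural importance and a burst of attention to score well above 1.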

You can set parameters for burstness estimation using kwargs:

Parameter	Description
s	Scaling parameter (> 1.0) that controls the graininess of burst detection. Lower values make the model more sensitive. Defaults to 1.1.
gamma	Parameter that controls the ‘cost’ of higher burst states. Defaults to 1.0.
k	Number of burst states. Defaults to 5.
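
To show how `s`, `gamma`, and `k` interact, here is a minimal sketch of a Kleinberg-style burst automaton over inter-arrival gaps, solved with a Viterbi pass. It follows the formulation in Kleinberg (2002) as commonly described and is not tethne's implementation. `burst_states` and the sample `gaps` are hypothetical.

```python
import math


def burst_states(gaps, k=5, s=1.1, gamma=1.0):
    """Cheapest state sequence for a list of positive inter-arrival gaps.

    Hypothetical helper: state i emits events at rate (n / T) * s**i, and
    moving up one state costs gamma * ln(n); moving down is free.
    """
    n = len(gaps)
    total = float(sum(gaps))
    # Event rates grow geometrically with the state index.
    rates = [(n / total) * s ** i for i in range(k)]

    def emit(i, gap):
        # Negative log-density of an exponential gap at rate rates[i].
        return rates[i] * gap - math.log(rates[i])

    def move(i, j):
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # Viterbi pass; the automaton is assumed to start in state 0.
    cost = [move(0, j) + emit(j, gaps[0]) for j in range(k)]
    back = []
    for gap in gaps[1:]:
        prev = [min(range(k), key=lambda i: cost[i] + move(i, j))
                for j in range(k)]
        cost = [cost[prev[j]] + move(prev[j], j) + emit(j, gap)
                for j in range(k)]
        back.append(prev)

    # Trace the cheapest path backwards from the best final state.
    state = min(range(k), key=lambda j: cost[j])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


# Five slow events, ten rapid ones, then five slow ones (made-up data).
gaps = [10] * 5 + [1] * 10 + [10] * 5
states = burst_states(gaps, k=5, s=2.0, gamma=1.0)
```

In this sketch, a larger `gamma` makes entering a higher state more expensive, so fewer bursts are reported. A smaller `s` places the state rates closer together.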

Parameters:

G : GraphCollection

corpus : Corpus

feature : str

Name of a featureset in corpus.

Returns:

G : GraphCollection

A graph collection updated with sigma node attributes.

Examples

Assuming that you have a Corpus generated from Web of Science (WoS) data that has been sliced by date:

>>> # Generate a co-citation graph collection.
>>> from tethne import GraphCollection
>>> kwargs = { 'threshold':2, 'topn':100 }
>>> G = GraphCollection()
>>> G.build(corpus, 'date', 'papers', 'cocitation', method_kwargs=kwargs)

>>> # Calculate sigma. This may take several minutes, depending on the
>>> #  size of your co-citation graph collection.
>>> from tethne.analyze.corpus import sigma
>>> G = sigma(G, corpus, 'citations')

>>> # Visualize...
>>> from tethne.writers import collection
>>> collection.to_dxgmml(G, '~/cocitation.xgmml')

In the resulting visualization, node and label sizes are mapped to sigma, and border width is mapped to citations.
