tethne.analyze.corpus module¶
Methods for analyzing Corpus objects.
burstness | Estimate burstness profile for the topn features (or flist) in feature. |
feature_burstness | Estimate burstness profile for a feature over the 'date' axis. |
plot_burstness | Generate a figure depicting burstness profiles for feature. |
plot_sigma | Plot sigma values for the topn most influential nodes. |
sigma | Calculate sigma (from Chen 2009) for all of the nodes in a GraphCollection. |
- tethne.analyze.corpus.burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, **kwargs)[source]¶
Estimate burstness profile for the topn features (or flist) in feature.
Uses the popular burstness automaton model inroduced by Kleinberg (2002).
Parameters: corpus : Corpus
feature : str
Name of featureset in corpus. E.g. 'citations'.
k : int
(default: 5) Number of burst states.
topn : int or float {0.-1.}
(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.
perslice : bool
(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.
flist : list
List of features. If provided, topn and perslice are ignored.
normalize : bool
(default: True) If True, burstness is expressed relative to the hightest possible state (k-1). Otherwise, states themselves are returned.
kwargs : kwargs
Parameters for burstness automaton HMM.
Returns: B : dict
Keys are features, values are tuples of ( dates, burstness )
Examples
>>> from tethne.analyze.corpus import burstness >>> B = burstness(corpus, 'abstractTerms', flist=['process', 'method'] >>> B['process'] ([1990, 1991, 1992, 1993], [0., 0.4, 0.6, 0.])
- tethne.analyze.corpus.feature_burstness(corpus, feature, findex, k=5, normalize=True, **kwargs)[source]¶
Estimate burstness profile for a feature over the 'date' axis.
Parameters: corpus : Corpus
feature : str
Name of featureset in corpus. E.g. 'citations'.
findex : int
Index of feature in corpus.
k : int
(default: 5) Number of burst states.
normalize : bool
(default: True) If True, burstness is expressed relative to the hightest possible state (k-1). Otherwise, states themselves are returned.
kwargs : kwargs
Parameters for burstness automaton HMM.
- tethne.analyze.corpus.plot_burstness(corpus, feature, k=5, topn=20, perslice=False, flist=None, normalize=True, fig=None, **kwargs)[source]¶
Generate a figure depicting burstness profiles for feature.
Parameters: corpus : Corpus
feature : str
Name of featureset in corpus. E.g. 'citations'.
k : int
(default: 5) Number of burst states.
topn : int or float {0.-1.}
(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.
perslice : bool
(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.
flist : list
List of features. If provided, topn and perslice are ignored.
normalize : bool
(default: True) If True, burstness is expressed relative to the hightest possible state (k-1). Otherwise, states themselves are returned.
fig : matplotlib.figure.Figure
(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.
kwargs : kwargs
Parameters for burstness automaton HMM.
Returns: fig : matplotlib.figure.Figure
Examples
>>> from tethne.analyze.corpus import burstness >>> fig = plot_burstness(corpus, 'citations', topn=2, perslice=True) >>> fig.savefig('~/burstness.png')
Years prior to the first occurrence of each feature are grayed out. Periods in which the feature was bursty are depicted by colored blocks, the opacity of which indicates burstness intensity.
- tethne.analyze.corpus.plot_sigma(G, corpus, feature, topn=20, sort_by='max', perslice=False, flist=None, fig=None, **kwargs)[source]¶
Plot sigma values for the topn most influential nodes.
Parameters: G : GraphCollection
corpus : Corpus
feature : str
Name of a featureset in corpus.
topn : int or float {0.-1.}
(default: 20) Number (int) or percentage (float) of top-occurring features to return. If flist is provided, this parameter is ignored.
sort_by : str
(default: ‘max’) Criterion for selecting topn nodes.
perslice : bool
(default: False) If True, loads topn features per slice. Otherwise, loads topn features overall. If flist is provided, this parameter is ignored.
flist : list
List of nodes. If provided, topn and perslice are ignored.
fig : matplotlib.figure.Figure
(default: None) You may provide a Figure instance if you wish. Otherwise, a new figure is generated.
Returns: fig : matplotlib.figure.Figure
G : GraphCollection
A co-citation graph collection, updated with sigma node attributes.
Examples
Assuming that you have a Corpus (G) sliced by 'date' and a co-citation GraphCollection (corpus)...
>>> from tethne.analyze.cocitation import plot_sigma >>> fig,G = plot_sigma(G, corpus, topn=5, perslice=True) >>> fig.savefig('~/sigma_plot.png')
In this figure, the top 5 most sigma-influential nodes in each slice are shown. Red bands indicate periods in which each paper was influential; opacity indicates the intensity of sigma (normalized by the highest value in the plot). The period prior to the first instance of each node is grayed out.
- tethne.analyze.corpus.sigma(G, corpus, feature, **kwargs)[source]¶
Calculate sigma (from Chen 2009) for all of the nodes in a GraphCollection.
You can set parameters for burstness estimation using kwargs:
Parameter Description s Scaling parameter ( > 1.)that controls graininess of burst detection. Lower values make the model more sensitive. Defaults to 1.1. gamma Parameter that controls the ‘cost’ of higher burst states. Defaults to 1.0. k Number of burst states. Defaults to 5. Parameters: G : GraphCollection
corpus : Corpus
feature : str
Name of a featureset in corpus.
Returns: G : GraphCollection
A graph collection updated with sigma node attributes.
Examples
Assuming that you have a Corpus generated from WoS data that has been sliced by date.
>>> # Generate a co-citation graph collection. >>> from tethne import GraphCollection >>> kwargs = { 'threshold':2, 'topn':100 } >>> G = GraphCollection() >>> G.build(corpus, 'date', 'papers', 'cocitation', method_kwargs=kwargs) >>> # Calculate sigma. This may take several minutes, depending on the >>> # size of your co-citaiton graph collection. >>> from tethne.analyze.corpus import sigma >>> G = sigma(G, corpus, 'citations') >>> # Visualize... >>> from tethne.writers import collection >>> collection.to_dxgmml(G, '~/cocitation.xgmml')
In the visualization below, node and label sizes are mapped to sigma, and border width is mapped to citations.