tethne.persistence.hdf5.corpus module
- class tethne.persistence.hdf5.corpus.HDF5Corpus(papers, features=None, index_by='wosid', index_citation_by='ayjid', exclude=set([]), filt=None, datapath=None, index=True)
Bases: tethne.classes.corpus.Corpus
Provides HDF5 persistence for Corpus.
The HDF5Corpus uses a variety of tables and arrays to store data. The structure of a typical HDF5 repository for an instance of this class is:
/
arrays/
- authors: VLArray (String), see vlarray_dict
Maps author indices in authors_index onto the IDs of papers that they authored. Padded with an empty 0th entry.
- authors_index: EArray (String), see vlarray_dict
Maps author indices used in authors to string representations of author names (LAST F). Padded with an empty 0th entry.
- papers_citing: VLArray (String), see vlarray_dict
Each row corresponds to a paper, and contains a set of IDs for the papers that cite that paper. Row indices correspond to the entries in papers_citing_index. Padded with an empty 0th entry.
- papers_citing_index: EArray (String), see vlarray_dict
Maps paper indices used in papers_citing to string paper IDs. Padded with an empty 0th entry.
axes/
Each slice axis is represented by a VLArray ([slice axis]) and an EArray ([slice axis]_keys).
- [slice axis] (e.g. date): VLArray (String)
Each row is a slice, containing a variable-length array of paper IDs.
- [slice axis]_keys (e.g. date_keys): EArray (Int32 or String)
Maps row indices in [slice axis] onto slice names/keys.
citations/
- papers_table: Table, see papers_table
Contains metadata about cited references. These are usually not the same papers as those described in papers/.
features/
This group contains data for featuresets. Each featureset has its own subgroup, as described below.
[featureset name]/
- counts: Array
Overall frequency for features across the whole Corpus.
- documentCounts: Array
Number of papers in which each feature occurs.
- index: Array
Maps indices in counts and documentCounts onto string representations of each feature.
- features/
Contains sparse frequency vectors over features for documents. Each row in the arrays below corresponds to a single document. Each row of indices holds the feature indices for that document, and the matching row of values holds the corresponding frequencies. indices_keys and values_keys should be identical; both map the rows in indices and values onto paper IDs.
Thus a sparse frequency vector over features for a document d can be reconstructed as freq[d,:] = [ (I[d,0],V[d,0]) ... (I[d,N-1],V[d,N-1]) ], where I is the variable-length array indices, V is the variable-length array values, and N is the length of the slice I[d,:].
- indices: VLArray
- indices_keys: EArray
- values: VLArray
- values_keys: EArray
- papers/
Contains sparse frequency vectors over documents for features. Same structure as in features/, above, except that rows correspond to features and indices contain variable-length arrays of paper IDs.
papers/
- papers_table: Table, see papers_table
Contains metadata about the papers in this Corpus.
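The reconstruction of sparse frequency vectors described under [featureset name]/features/ can be sketched in plain Python. This is only an illustration: ordinary lists stand in for the VLArray and EArray nodes, the data are invented, and the helper names (sparse_vector, vectors_by_paper, named_vector) are hypothetical and not part of Tethne.

```python
# Invented stand-in data: plain lists play the role of the HDF5 array nodes.
index = ["cell", "gene", "protein"]      # feature index -> feature string
indices_keys = ["paper-A", "paper-B"]    # row number -> paper ID
values_keys = ["paper-A", "paper-B"]     # should be identical to indices_keys
indices = [[0, 2], [1]]                  # I: feature indices, one row per document
values = [[3, 1], [5]]                   # V: frequencies, one row per document


def sparse_vector(d):
    """Pair each feature index in row d with its frequency in row d."""
    return list(zip(indices[d], values[d]))


def vectors_by_paper():
    """Map each paper ID onto its sparse (feature index, frequency) vector."""
    if indices_keys != values_keys:
        raise ValueError("indices_keys and values_keys differ")
    return {key: sparse_vector(d) for d, key in enumerate(indices_keys)}


def named_vector(d):
    """Like sparse_vector(d), but with feature strings looked up in index."""
    return [(index[i], v) for i, v in sparse_vector(d)]
```

The same pairing applies to the papers/ subgroup, with rows corresponding to features and the paired values being paper IDs.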
Since some data types (e.g. list, tuple) are not supported in PyTables/HDF5, we make use of cPickle serialization. For example, sparse feature vectors (lists of tuples) are pickled for storage in a StringCol.
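As a minimal sketch of that round trip, the following pickles a list of tuples to a byte string and restores it. It uses pickle, the Python 3 counterpart of cPickle; the choice of protocol 0 (the text-based protocol) is an assumption made for illustration, not necessarily what HDF5Corpus does.

```python
import pickle

# An invented sparse feature vector: (feature index, frequency) pairs.
vector = [(0, 3), (2, 1)]

# Serialize to a byte string that could be written to a string column.
# Protocol 0 is an assumption here; any protocol round-trips the same way.
blob = pickle.dumps(vector, protocol=0)

# Deserialize; the list and its tuples come back with their original types.
restored = pickle.loads(blob)
```

Because a StringCol has a fixed item size, the pickled bytes must fit within that size to be stored intact.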
- abstract_to_features(remove_stopwords=True)
See Corpus.abstract_to_features().
Parameters: remove_stopwords : bool
(default: True) If True, passes tokenizer the NLTK stoplist.
- filter_features(fold, fnew, filt)
See Corpus.filter_features().
Parameters: fold : str
Key into features for existing featureset.
fnew : str
Key into features for resulting featureset.
filt : method
Filter function to apply to the featureset. Should take a feature dict as its sole parameter.
- tethne.persistence.hdf5.corpus.from_hdf5(HD_or_path)
Load or transform an HDF5Corpus into a Corpus.
If HD_or_path is a string, will attempt to load the HDF5Corpus from that path.
Parameters: HD_or_path : str or HDF5Corpus
If str, must be a path to an HDF5Corpus HDF5 repository.
Returns: D : Corpus
Examples
>>> C = from_hdf5('/path/to/my/archive/MyH5Corpus.h5')
- tethne.persistence.hdf5.corpus.to_hdf5(obj, datapath=None)
Transforms a Corpus into an HDF5Corpus.
Use this function to store your Corpus, e.g. to archive data associated with your study or project.
Parameters: obj : Corpus
The Corpus to transform.
datapath : str
If provided, will create the new HDF5Corpus at that location.
Returns: HD : HDF5Corpus
Examples
>>> HC = to_hdf5(C, datapath='/path/to/my/archive')