
tethne.persistence.hdf5.corpus module

class tethne.persistence.hdf5.corpus.HDF5Corpus(papers, features=None, index_by='wosid', index_citation_by='ayjid', exclude=set([]), filt=None, datapath=None, index=True)[source]

Bases: tethne.classes.corpus.Corpus

Provides HDF5 persistence for Corpus.
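
For illustration, instantiation mirrors the signature above; papers (a list of parsed paper records, e.g. produced by one of the tethne readers) and the archive path are placeholders, not values defined in this module:

>>> from tethne.persistence.hdf5.corpus import HDF5Corpus
>>> HC = HDF5Corpus(papers, index_by='wosid',
...                 datapath='/path/to/my/archive')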

The HDF5Corpus uses a variety of tables and arrays to store data. The structure of a typical HDF5 repository for an instance of this class is:

  • /

    • arrays/

      • authors: VLArray (String), vlarray_dict

        Maps author indices in authors_index onto the IDs of papers that they authored. Padded with an empty 0th entry.

      • authors_index: EArray (String), see vlarray_dict

        Maps author indices used in authors to string representations of author names (LAST F). Padded with an empty 0th entry.

      • papers_citing: VLArray (String), vlarray_dict

        Each row corresponds to a paper, and contains a set of IDs for the papers that cite that paper. Row indices correspond to the entries in papers_citing_index. Padded with an empty 0th entry.

      • papers_citing_index: EArray (String), see vlarray_dict

        Maps paper indices used in papers_citing to string paper IDs. Padded with an empty 0th entry.

    • axes/

      Each slice axis is represented by a VLArray ([slice axis]) and an EArray ([slice axis]_keys).

      • [slice axis] (e.g. date): VLArray (String)

        Each row is a slice, containing a variable-length array of paper IDs.

      • [slice axis]_keys (e.g. date_keys): EArray (Int32 or String)

        Maps row indices in [slice axis] onto slice names/keys.

    • citations/

      • papers_table: Table, see papers_table

        Contains metadata about cited references. These are usually not the same papers as those described in papers/.

    • features/

      This group contains data for featuresets. Each featureset has its own subgroup, as described below.

      • [featureset name]/

        • counts: Array

          Overall frequency for features across the whole Corpus.

        • documentCounts: Array

          Number of papers in which each feature occurs.

        • index: Array

          Maps indices in counts and documentCounts onto string representations of each feature.

        • features/

          Contains sparse frequency vectors over features for documents. Each row in the arrays below corresponds to a single document. Each row of indices holds the feature indices present in that document, and the corresponding row of values holds their frequencies. indices_keys and values_keys should be identical, and map the rows of indices and values onto paper IDs.

          Thus the sparse frequency vector over features for document d can be reconstructed as freq[d,:] = [(I[d,0], V[d,0]), ..., (I[d,N-1], V[d,N-1])], where I is the variable-length array indices, V is the variable-length array values, and N is the length of the slice I[d,:]. A sketch of this reconstruction with PyTables appears after this listing.

          • indices: VLArray
          • indices_keys: EArray
          • values: VLArray
          • values_keys: EArray
        • papers/

          Contains sparse frequency vectors over documents for features. Same structure as in features/, above, except that rows correspond to features and indices contain variable-length arrays of paper IDs.

    • papers/

      • papers_table: Table, see papers_table

        Contains metadata about the papers in this Corpus.
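
The following is a minimal, illustrative PyTables sketch (not part of the Tethne API) of reconstructing a document's sparse frequency vector from the structure described above. The file path and the featureset name 'unigrams' are hypothetical, and the sketch assumes the groups sit directly under the file root as in the listing:

>>> import tables                                      # PyTables
>>> h5 = tables.open_file('/path/to/my/archive/MyH5Corpus.h5', mode='r')
>>> fs = h5.get_node('/features/unigrams/features')    # featureset subgroup
>>> d = 1                                              # row 0 may be a padding entry
>>> paper_id = fs.indices_keys[d]                      # EArray: row index -> paper ID
>>> freq_d = list(zip(fs.indices[d], fs.values[d]))    # [(feature index, frequency), ...]
>>> h5.close()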

Since some data types (e.g. list, tuple) are not supported in PyTables/HDF5, we make use of cPickle serialization. For example, sparse feature vectors (lists of tuples) are pickled for storage in a StringCol.
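
For illustration only, the round trip looks like this (plain pickle shown; cPickle is Python 2's C implementation of the same protocol):

>>> import pickle                           # the library itself uses cPickle (Python 2)
>>> vec = [(0, 3), (7, 1), (42, 2)]         # sparse vector: (feature index, frequency) pairs
>>> stored = pickle.dumps(vec)              # byte string, storable in a StringCol
>>> pickle.loads(stored) == vec
True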

abstract_to_features(remove_stopwords=True)[source]

See Corpus.abstract_to_features().

Parameters:

remove_stopwords : bool

(default: True) If True, passes the NLTK stoplist to the tokenizer.
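
For illustration (HC as in the instantiation sketch above), the default call featurizes abstracts with the NLTK stoplist applied:

>>> HC.abstract_to_features()
>>> HC.abstract_to_features(remove_stopwords=False)   # keep stopwords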

filter_features(fold, fnew, filt)[source]

See Corpus.filter_features().

Parameters:

fold : str

Key into features for existing featureset.

fnew : str

Key into features for the resulting featureset.

filt : method

Filter function to apply to the featureset. Should take a feature dict as its sole parameter.

tethne.persistence.hdf5.corpus.from_hdf5(HD_or_path)[source]

Load or transform an HDF5Corpus into a Corpus.

If HD_or_path is a string, attempts to load the HDF5Corpus from that path.

Parameters:

HD_or_path : str or HDF5Corpus

If str, must be a path to an HDF5Corpus HDF5 repository.

Returns:

D : Corpus

Examples

>>> C = from_hdf5('/path/to/my/archive/MyH5Corpus.h5')

tethne.persistence.hdf5.corpus.to_hdf5(obj, datapath=None)[source]

Transforms a Corpus into an HDF5Corpus.

Use this function to store your Corpus, e.g. to archive data associated with your study or project.

Parameters:

obj : Corpus

The Corpus to be stored.

datapath : str

If provided, will create the new HDF5Corpus at that location.

Returns:

HD : HDF5Corpus

Examples

>>> HC = to_hdf5(C, datapath='/path/to/my/archive')