tethne.persistence.hdf5.corpus module

class tethne.persistence.hdf5.corpus.HDF5Corpus(papers, features=None, index_by='wosid', index_citation_by='ayjid', exclude=set([]), filt=None, datapath=None, index=True)

Bases: tethne.classes.corpus.Corpus

Provides HDF5 persistence for Corpus.

The HDF5Corpus uses a variety of tables and arrays to store data. The structure of a typical HDF5 repository for an instance of this class is:

/
- arrays/
  - authors: VLArray (String), see vlarray_dict
    
    Maps author indices in authors_index onto the IDs of papers that they authored. Padded with an empty 0th entry.
  - authors_index: EArray (String), see vlarray_dict
    
    Maps author indices used in authors to string representations of author names (LAST F). Padded with an empty 0th entry.
  - papers_citing: VLArray (String), see vlarray_dict
    
    Each row corresponds to a paper, and contains a set of IDs for the papers that cite that paper. Row indices correspond to the entries in papers_citing_index. Padded with an empty 0th entry.
  - papers_citing_index: EArray (String), see vlarray_dict
    
    Maps paper indices used in papers_citing to string paper IDs. Padded with an empty 0th entry.
- axes/
  
  Each slice axis is represented by a VLArray ([slice axis]) and an EArray ([slice axis]_keys).
  - [slice axis] (e.g. date): VLArray (String)
    
    Each row is a slice, containing a variable-length array of paper IDs.
  - [slice axis]_keys (e.g. date_keys): EArray (Int32 or String)
    
    Maps row indices in [slice axis] onto slice names/keys.
- citations/
  - papers_table: Table, see papers_table
    
    Contains metadata about cited references. These are usually not the same papers as those described in papers/.
- features/
  
  This group contains data for featuresets. Each featureset has its own subgroup, as described below.
  - [featureset name]/
    - counts: Array
      
      Overall frequency for features across the whole Corpus.
    - documentCounts: Array
      
      Number of papers in which each feature occurs.
    - index: Array
      
      Maps indices in counts and documentCounts onto string representations of each feature.
    - features/
      
      Contains sparse frequency vectors over features for documents. Each row in the arrays below corresponds to a single document. The rows of indices hold the feature indices for each document, and the rows of values hold the corresponding frequencies. indices_keys and values_keys should be identical, and map the rows in indices and values onto paper IDs.
      
      Thus a sparse frequency vector over features for a document d can be reconstructed as freq[d,:] = [ (I[d,0],V[d,0]) ... (I[d,N-1],V[d,N-1]) ], where I is the variable-length array indices, V is the variable-length array values, and N is the length of the row I[d,:].
      - indices: VLArray
      - indices_keys: EArray
      - values: VLArray
      - values_keys: EArray
    - papers/
      
      Contains sparse frequency vectors over documents for features. Same structure as in features/, above, except that rows correspond to features and indices contain variable-length arrays of paper IDs.
- papers/
  - papers_table: Table, see papers_table
    
    Contains metadata about the papers in this Corpus.
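
The reconstruction of a sparse frequency vector described under features/ can be sketched as follows. This is only an illustration: plain Python lists stand in for the indices and values VLArrays, the paper IDs are made up, and the helper name sparse_vector is hypothetical rather than part of tethne.

```python
def sparse_vector(indices, values, d):
    """Pair the feature indices in row d with their frequencies."""
    row_i, row_v = indices[d], values[d]
    if len(row_i) != len(row_v):
        raise ValueError("indices and values rows differ in length")
    return list(zip(row_i, row_v))

# Hypothetical data: each inner list is one variable-length row.
indices = [[0, 3, 7], [2, 3]]    # feature indices per document
values = [[2, 1, 5], [4, 1]]     # matching frequencies per document
indices_keys = ["WOS:0001", "WOS:0002"]   # made-up paper IDs
values_keys = list(indices_keys)          # expected to be identical

# Map each paper ID onto its reconstructed sparse vector.
by_paper = {
    key: sparse_vector(indices, values, d)
    for d, key in enumerate(indices_keys)
}
```

Here by_paper["WOS:0001"] holds the (feature index, frequency) pairs for the first row. With a real repository, the rows would be read from the VLArrays instead of from lists.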

Since some data types (e.g. list, tuple) are not supported in PyTables/HDF5, we make use of cPickle serialization. For example, sparse feature vectors (lists of tuples) are pickled for storage in a StringCol.
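
As a rough illustration of that serialization step, the sketch below pickles a list of tuples into a byte string and restores it. It assumes Python 3, where the standard pickle module takes the place of cPickle. The example vector is made up, and how HDF5Corpus actually lays out its pickled strings is not shown here.

```python
import pickle

# Hypothetical sparse feature vector: (feature index, frequency) pairs.
vector = [(0, 2), (3, 1), (7, 5)]

# Serialize to a byte string, the kind of value a string column can hold.
blob = pickle.dumps(vector, protocol=2)

# Deserialize to recover the original list of tuples.
restored = pickle.loads(blob)
```

A PyTables StringCol has a fixed item size, so the length of the pickled string (len(blob)) matters when choosing the column width.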

abstract_to_features(remove_stopwords=True)

See Corpus.abstract_to_features().

Parameters:

remove_stopwords : bool

(default: True) If True, passes the NLTK stoplist to the tokenizer.

filter_features(fold, fnew, filt)

See Corpus.filter_features().

Parameters:

fold : str

Key into features for existing featureset.

fnew : str

Key into features for the resulting featureset.

filt : method

Filter function to apply to the featureset. Should take a feature dict as its sole parameter.

tethne.persistence.hdf5.corpus.from_hdf5(HD_or_path)

Load or transform an HDF5Corpus into a Corpus.

If HD_or_path is a string, the function attempts to load the HDF5Corpus from that path.

Parameters:

HD_or_path : str or HDF5Corpus

If str, must be a path to an HDF5Corpus HDF5 repository.

Returns:

D : Corpus

Examples

>>> C = from_hdf5('/path/to/my/archive/MyH5Corpus.h5')

tethne.persistence.hdf5.corpus.to_hdf5(obj, datapath=None)

Transforms a Corpus into an HDF5Corpus.

Use this method to store your Corpus, e.g. to archive data associated with your study or project.

Parameters:

obj : Corpus

The Corpus to transform.

datapath : str

If provided, will create the new HDF5Corpus at that location.

Returns:

HD : HDF5Corpus

Examples

>>> HC = to_hdf5(C, datapath='/path/to/my/archive')
