tethne.writers.corpora module
- tethne.writers.corpora.to_documents(target, ngrams, metadata=None, vocab=None)
Parameters: target : str
Target path prefix for the output files; e.g. ‘./mycorpus’ will result in ‘./mycorpus_docs.txt’ and ‘./mycorpus_meta.csv’.
ngrams : dict
Keys are paper identifiers; values are lists of (ngram, frequency) tuples. If vocab is provided, each ngram is assumed to be an index into vocab.
metadata : tuple
(keys, dict): keys is a list of metadata keys, and dict maps each paper identifier to a dict of that paper's metadata values, i.e. ( [ str ], { str(p) : dict } ).
Raises: IOError
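The input shapes described above can be sketched as follows. This is a minimal illustration, not taken from the Tethne source: the paper identifiers, terms, and metadata values are made up, `output_paths` and `write_documents` are hypothetical helpers, and treating vocab as a plain list is an assumption (the parameter list only says that each ngram indexes into it).

```python
# Sketch only: inputs shaped the way the parameter list describes them.
# All identifiers, terms, and metadata values are invented for illustration.

# Assumption: any object indexable by integer works as `vocab`.
vocab = ["evolution", "selection", "gene"]

# Keys are paper identifiers; values are lists of (ngram, frequency) tuples.
# Because `vocab` is supplied, each ngram here is an index into `vocab`.
ngrams = {
    "10.1234/paper-a": [(0, 3), (1, 1)],
    "10.1234/paper-b": [(2, 5)],
}

# (keys, dict): a list of metadata keys, plus one metadata dict per paper.
metadata = (
    ["date", "atitle"],
    {
        "10.1234/paper-a": {"date": 1999, "atitle": "A made-up title"},
        "10.1234/paper-b": {"date": 2003, "atitle": "Another made-up title"},
    },
)


def output_paths(target):
    """Hypothetical helper: the two file names documented for `target`."""
    return (target + "_docs.txt", target + "_meta.csv")


def write_documents(target):
    """Hypothetical wrapper around to_documents; requires Tethne."""
    from tethne.writers import corpora  # imported lazily on purpose

    try:
        corpora.to_documents(target, ngrams, metadata=metadata, vocab=vocab)
    except IOError as err:
        # The documentation lists IOError as the failure mode.
        print("Could not write corpus files: %s" % err)
        return None
    return output_paths(target)
```

Calling `write_documents('./mycorpus')` with Tethne installed would, per the description above, be expected to produce ‘./mycorpus_docs.txt’ and ‘./mycorpus_meta.csv’.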
- tethne.writers.corpora.to_dtm_input(target, D, feature='unigrams', fields=['date', 'atitle'])
Parameters: target : str
Target path prefix for the output files; e.g. ‘./mycorpus’ will result in ‘./mycorpus-mult.dat’, ‘./mycorpus-seq.dat’, ‘./mycorpus-vocab.dat’, and ‘./mycorpus-meta.dat’.
D : Corpus
Contains Paper objects generated from the same DfR dataset as t_ngrams, indexed by doi and sliced by date.
feature : str
(default: ‘unigrams’) Features in Corpus to use for modeling.
fields : list
(optional) Fields in Paper to include in the metadata file.
Returns: None : If the files are written successfully.
Raises: IOError
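A call to to_dtm_input might be wrapped as in the sketch below. It assumes a Corpus object has already been built elsewhere (constructing one is outside the scope of this module's documentation); `dtm_output_paths` and `write_dtm_input` are hypothetical helpers that only restate the file names and the IOError behaviour listed above.

```python
def dtm_output_paths(target):
    """Hypothetical helper: the four file names documented for `target`."""
    suffixes = ("-mult.dat", "-seq.dat", "-vocab.dat", "-meta.dat")
    return [target + suffix for suffix in suffixes]


def write_dtm_input(corpus, target, feature="unigrams",
                    fields=("date", "atitle")):
    """Hypothetical wrapper around to_dtm_input; requires Tethne.

    `corpus` is assumed to be an existing Corpus, passed as the `D` argument.
    """
    from tethne.writers import corpora  # imported lazily on purpose

    try:
        corpora.to_dtm_input(target, corpus, feature=feature,
                             fields=list(fields))
    except IOError as err:
        # The documentation lists IOError as the failure mode.
        print("Could not write DTM input files: %s" % err)
        return None
    return dtm_output_paths(target)
```

The wrapper takes `fields` as a tuple and converts it to a list at call time, which avoids sharing a mutable default between calls; the defaults mirror those in the signature above.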