Data
Dataset
- class supar.utils.data.Dataset(transform: Transform, data: str | Iterable, cache: bool = False, binarize: bool = False, bin: str | None = None, max_len: int | None = None, **kwargs)
Dataset that is compatible with torch.utils.data.Dataset, serving as a wrapper for manipulating all data fields with the operating behaviours defined in Transform. The data fields of all the instantiated sentences can be accessed as attributes of the dataset.
- Parameters:
- transform (Transform) – An instance of Transform or one of its subclasses. The instance holds a series of loading and processing behaviours specific to the data format.
- data (Union[str, Iterable]) – A filename or a list of instances that will be passed into transform.load().
- cache (bool) – If True, tries to use the previously cached binarized data for fast loading. In this way, sentences are loaded on the fly according to the metadata. If False, all sentences are loaded directly into memory. Default: False.
- binarize (bool) – If True, binarizes the dataset when building it. Only works if cache=True. Default: False.
- bin (str) – Path to binarized files, required if cache=True. Default: None.
- max_len (int) – Sentences exceeding this length will be discarded. Default: None.
- kwargs (Dict) – Together with data, kwargs will be passed into transform.load() to control the loading behaviour.
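The way data, kwargs, transform.load() and max_len fit together can be illustrated with a small stand-in that does not import supar. ToySentence, ToyTransform, ToyDataset, the lower keyword and the whitespace-tokenized input are all hypothetical names invented for this sketch; the real class may behave differently in detail (for example, in how it exposes fields or counts sentence length).

```python
from typing import Iterable, Iterator, List, Optional, Union


class ToySentence:
    """Hypothetical sentence holding a single `words` field."""

    def __init__(self, words: List[str]) -> None:
        self.words = words

    def __len__(self) -> int:
        return len(self.words)


class ToyTransform:
    """Hypothetical transform: load() turns raw input into sentences."""

    def load(self, data: Union[str, Iterable[str]], lower: bool = False) -> Iterator[ToySentence]:
        # A str is treated as a filename, anything else as an iterable of raw lines.
        if isinstance(data, str):
            with open(data, encoding="utf-8") as f:
                lines = f.read().splitlines()
        else:
            lines = list(data)
        for line in lines:
            words = line.lower().split() if lower else line.split()
            if words:
                yield ToySentence(words)


class ToyDataset:
    """Toy wrapper: forwards data and kwargs to transform.load(), then filters by max_len."""

    def __init__(self, transform, data, max_len: Optional[int] = None, **kwargs) -> None:
        self.transform = transform
        self.sentences = [
            s for s in transform.load(data, **kwargs)
            if max_len is None or len(s) <= max_len
        ]

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, index: int) -> ToySentence:
        return self.sentences[index]

    def __getattr__(self, name: str):
        # Only called when normal lookup fails: collect the field from every sentence.
        if name.startswith("__") or name == "sentences":
            raise AttributeError(name)
        return [getattr(s, name) for s in self.sentences]


raw = ["The cat sat", "A much longer sentence that exceeds the limit", "Dogs bark"]
dataset = ToyDataset(ToyTransform(), raw, max_len=3, lower=True)
print(len(dataset))      # 2
print(dataset.words)     # [['the', 'cat', 'sat'], ['dogs', 'bark']]
```

Here lower=True travels through kwargs into load(), and the eight-word sentence is dropped because it exceeds max_len=3.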
- sentences
A list of sentences loaded from the data. Each sentence includes fields obeying the data format defined in transform. If cache=True, each is a pointer to the sentence stored in the cache file.
- Type: List[Sentence]
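The idea of a sentence being "a pointer to the sentence stored in the cache file" can be sketched as follows. This section does not describe the actual cache format, so the pickle-based layout, the SentencePointer class and the binarize helper below are assumptions made purely for illustration, not the library's implementation.

```python
import os
import pickle
import tempfile
from typing import Any, List


class SentencePointer:
    """Hypothetical pointer: remembers where a pickled sentence lives in a file."""

    def __init__(self, path: str, offset: int) -> None:
        self.path = path
        self.offset = offset

    def load(self) -> Any:
        # Seek to the recorded byte offset and unpickle exactly one object.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return pickle.load(f)


def binarize(sentences: List[Any], path: str) -> List[SentencePointer]:
    """Write each sentence to `path` and return one pointer per sentence."""
    pointers = []
    with open(path, "wb") as f:
        for sentence in sentences:
            # Record the offset before writing so the pointer marks the start.
            pointers.append(SentencePointer(path, f.tell()))
            pickle.dump(sentence, f)
    return pointers


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
pointers = binarize([["the", "cat", "sat"], ["dogs", "bark"]], path)
print(pointers[1].load())  # ['dogs', 'bark']
```

Only the lightweight pointers stay in memory; each sentence is read from disk when it is requested, which is the trade-off that on-the-fly loading makes against keeping everything in memory.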