Data

Dataset

class supar.utils.data.Dataset(transform: Transform, data: str | Iterable, cache: bool = False, binarize: bool = False, bin: str | None = None, max_len: int | None = None, **kwargs)

A dataset compatible with torch.utils.data.Dataset. It wraps all data fields and manipulates them using the behaviours defined in Transform. Each data field of the instantiated sentences can be accessed as an attribute of the dataset.

Parameters:
  • transform (Transform) – An instance of Transform or its derivations. The instance holds a series of loading and processing behaviours with regard to the specific data format.

  • data (Union[str, Iterable]) – A filename or a list of instances that will be passed into transform.load().

  • cache (bool) – If True, tries to reuse previously cached binarized data for fast loading; sentences are then loaded on the fly according to the metadata. If False, all sentences are loaded directly into memory. Default: False.

  • binarize (bool) – If True, binarizes the dataset when it is built. Only takes effect if cache=True. Default: False.

  • bin (Optional[str]) – Path to the binarized files. Required if cache=True. Default: None.

  • max_len (Optional[int]) – Sentences exceeding this length are discarded. Default: None.

  • kwargs (Dict) – Together with data, kwargs will be passed into transform.load() to control the loading behaviour.
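The interplay of transform, data, max_len and kwargs described above can be illustrated with a small stand-in. This is a minimal sketch in plain Python, not supar's implementation: ToySentence, ToyTransform, ToyDataset and the lower keyword are hypothetical names invented for the illustration, and whether a sentence of exactly max_len tokens is kept is an assumption. It shows the map-style protocol (__len__ and __getitem__) that torch.utils.data.Dataset relies on, the forwarding of kwargs to load(), and field access as a dataset attribute.

```python
from typing import Iterable, Iterator, List, Optional


class ToySentence:
    """Hypothetical stand-in for a sentence holding named data fields."""

    def __init__(self, words: List[str], tags: List[str]) -> None:
        self.words = words
        self.tags = tags

    def __len__(self) -> int:
        return len(self.words)


class ToyTransform:
    """Hypothetical stand-in for a Transform; it only knows how to load data."""

    def load(self, data: Iterable, lower: bool = False) -> Iterator[ToySentence]:
        # `lower` plays the role of an extra keyword forwarded through **kwargs.
        for words, tags in data:
            if lower:
                words = [w.lower() for w in words]
            yield ToySentence(list(words), list(tags))


class ToyDataset:
    """Map-style dataset sketch: defines __len__ and __getitem__."""

    def __init__(self, transform, data, max_len: Optional[int] = None, **kwargs):
        self.transform = transform
        # data and kwargs are passed on to transform.load().
        loaded = transform.load(data, **kwargs)
        # Assumption: a sentence of exactly max_len tokens is kept.
        self.sentences = [s for s in loaded if max_len is None or len(s) <= max_len]

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, index: int) -> ToySentence:
        return self.sentences[index]

    def __getattr__(self, name: str):
        # Reached only when normal lookup fails: gather the field from every sentence.
        if name in ("transform", "sentences"):  # avoid recursion before __init__ finishes
            raise AttributeError(name)
        return [getattr(s, name) for s in self.sentences]


raw = [
    (["The", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    (["A", "very", "long", "sentence", "here"], ["DET", "ADV", "ADJ", "NOUN", "ADV"]),
]
dataset = ToyDataset(ToyTransform(), raw, max_len=3, lower=True)
```

Here the five-token sentence is dropped by max_len=3, and dataset.words collects the words field of every remaining sentence.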

transform

An instance of Transform.

Type:

Transform

sentences

A list of sentences loaded from the data. Each sentence includes fields that follow the data format defined in transform. If cache=True, each entry is a pointer to the sentence stored in the cache file.

Type:

List[Sentence]
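The idea of keeping pointers rather than full sentences when cache=True can be sketched as follows. This is only an illustration of the general technique under stated assumptions: the on-disk format used by supar is not described here, so the sketch uses pickle and byte offsets, and write_cache, read_sentence and toy.bin are hypothetical names.

```python
import os
import pickle
import tempfile
from typing import Any, List


def write_cache(path: str, sentences: List[Any]) -> List[int]:
    """Pickle each sentence into one file and return the byte offset of each."""
    offsets = []
    with open(path, "wb") as f:
        for sentence in sentences:
            offsets.append(f.tell())  # position where this sentence starts
            pickle.dump(sentence, f)
    return offsets


def read_sentence(path: str, offset: int) -> Any:
    """Load a single sentence on the fly by seeking to its offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy.bin")
    # The offsets act as lightweight pointers (metadata) kept in memory.
    pointers = write_cache(cache_path, [["The", "cat"], ["It", "sleeps", "."]])
    # Only the requested sentence is read back from disk.
    second = read_sentence(cache_path, pointers[1])
```

Only the list of offsets stays in memory, so the cost of holding the dataset grows with the number of sentences rather than with their total size.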