class Dataset(transform: Transform, data: str | Iterable, cache: bool = False, binarize: bool = False, bin: str | None = None, max_len: int | None = None, **kwargs)

Dataset that serves as a wrapper for manipulating all data fields with the operating behaviours defined in Transform. The data fields of all instantiated sentences can be accessed as attributes of the dataset.

  • transform (Transform) – An instance of Transform or one of its subclasses. The instance holds a series of loading and processing behaviours for a specific data format.

  • data (Union[str, Iterable]) – A filename or a list of instances that will be passed into transform.load().

  • cache (bool) – If True, tries to use previously cached binarized data for fast loading; sentences are then loaded on the fly according to the metadata. If False, all sentences are loaded directly into memory. Default: False.

  • binarize (bool) – If True, binarizes the dataset when it is built. Only takes effect if cache=True. Default: False.

  • bin (Optional[str]) – Path to the binarized files; required if cache=True. Default: None.

  • max_len (Optional[int]) – Sentences exceeding this length will be discarded. Default: None.

  • kwargs (Dict) – Together with data, kwargs will be passed into transform.load() to control the loading behaviour.
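
The parameters above can be illustrated with a minimal, in-memory sketch. This is not the library's implementation: `ToyTransform`, `ToyDataset`, the `lower` keyword, and the `sentences` attribute name are all hypothetical stand-ins, and only the cache=False path is modelled. It shows data and kwargs being forwarded together to transform.load(), and max_len discarding sentences that exceed it.

```python
from typing import Iterable, Iterator, List, Optional, Union


class ToyTransform:
    """Hypothetical stand-in for a Transform: knows how to load one data format."""

    def load(self, data: Union[str, Iterable], lower: bool = False) -> Iterator[List[str]]:
        # Accept either a filename (one whitespace-tokenized sentence per line)
        # or an in-memory iterable of token lists.
        if isinstance(data, str):
            with open(data, encoding="utf-8") as f:
                lines = [line.split() for line in f if line.strip()]
        else:
            lines = [list(tokens) for tokens in data]
        for tokens in lines:
            yield [t.lower() for t in tokens] if lower else tokens


class ToyDataset:
    """Hypothetical sketch of the documented wrapper (cache=False path only)."""

    def __init__(self, transform, data, max_len: Optional[int] = None, **kwargs):
        self.transform = transform
        # `data` and `kwargs` are forwarded together to transform.load().
        sentences = list(transform.load(data, **kwargs))
        # Sentences exceeding max_len are discarded.
        if max_len is not None:
            sentences = [s for s in sentences if len(s) <= max_len]
        self.sentences = sentences

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, index: int) -> List[str]:
        return self.sentences[index]


raw = [["The", "cat", "sat"], ["A", "much", "longer", "sentence", "than", "allowed"]]
dataset = ToyDataset(ToyTransform(), raw, max_len=4, lower=True)
print(len(dataset), dataset[0])
```

Here `lower=True` is not consumed by the dataset itself; it travels through kwargs to the loader, which is the division of labour the parameter list describes.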


An instance of Transform.




A list of sentences loaded from the data. Each sentence includes fields obeying the data format defined in transform. If cache=True, each entry is a pointer to the sentence stored in the cache file.
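
One way to picture "a pointer to the sentence stored in the cache file" is a list of (offset, size) records that are resolved only on access. The sketch below is an assumption made for illustration, not the library's actual binary format: `binarize`, `LazySentences`, and the use of pickle are hypothetical.

```python
import os
import pickle
import tempfile
from typing import Any, List, Tuple


def binarize(sentences: List[Any], path: str) -> List[Tuple[int, int]]:
    """Write each sentence as a pickled record; return (offset, size) metadata."""
    meta = []
    with open(path, "wb") as f:
        for sentence in sentences:
            blob = pickle.dumps(sentence)
            meta.append((f.tell(), len(blob)))
            f.write(blob)
    return meta


class LazySentences:
    """Hypothetical list-like view whose entries are pointers into a cache file."""

    def __init__(self, path: str, meta: List[Tuple[int, int]]):
        self.path = path
        self.meta = meta

    def __len__(self) -> int:
        return len(self.meta)

    def __getitem__(self, index: int) -> Any:
        offset, size = self.meta[index]
        # Only the requested record is read from disk.
        with open(self.path, "rb") as f:
            f.seek(offset)
            return pickle.loads(f.read(size))


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy.bin")
    meta = binarize([["The", "cat", "sat"], ["Dogs", "bark"]], cache_path)
    lazy = LazySentences(cache_path, meta)
    second = lazy[1]
    count = len(lazy)

print(count, second)
```

Keeping only the metadata in memory is what makes on-the-fly loading cheap: memory use grows with the number of sentences, not with their content.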