Field

RawField

class supar.utils.field.RawField(name: str, fn: Callable | None = None)

Defines a general datatype.

A RawField object makes no assumptions about the underlying datatype; it only holds parameters describing how the data should be processed.

Parameters:
  • name (str) – The name of the field.

  • fn (function) – The function used for preprocessing the examples. Default: None.
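A minimal usage sketch; the str.strip preprocessing below is illustrative, not a default:

>>> from supar.utils.field import RawField
>>> # fn is applied to each example during preprocessing
>>> texts = RawField('texts', fn=str.strip)
>>> texts.preprocess('  A raw sentence.  ')
'A raw sentence.'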

Field

class supar.utils.field.Field(name: str, pad: str | None = None, unk: str | None = None, bos: str | None = None, eos: str | None = None, lower: bool = False, use_vocab: bool = True, tokenize: Callable | None = None, fn: Callable | None = None)

Defines a datatype together with instructions for converting to Tensor. Field models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method.

Parameters:
  • name (str) – The name of the field.

  • pad (str) – The string token used as padding. Default: None.

  • unk (str) – The string token used to represent out-of-vocabulary (OOV) words. Default: None.

  • bos (str) – A token prepended to every example using this field, or None for no bos token. Default: None.

  • eos (str) – A token appended to every example using this field, or None for no eos token. Default: None.

  • lower (bool) – Whether to lowercase the text in this field. Default: False.

  • use_vocab (bool) – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.

  • tokenize (function) – The function used to tokenize strings using this field into sequential examples. Default: None.

  • fn (function) – The function used for preprocessing the examples. Default: None.
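For illustration, a word-level field might be declared as follows; the token strings here are arbitrary choices, not defaults:

>>> from supar.utils.field import Field
>>> WORD = Field('words', pad='<pad>', unk='<unk>', bos='<bos>', lower=True)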

preprocess(data: str | Iterable) → Iterable

Loads a single example and tokenizes it if necessary. The sequence is first passed to fn if available; if tokenize is not None, it is then tokenized; finally, it is lowercased if lower is True.

Parameters:

data (Union[str, Iterable]) – The data to be preprocessed.

Returns:

The preprocessed sequence.
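A sketch of the pipeline, assuming a field with whitespace tokenization and lowercasing:

>>> field = Field('words', lower=True, tokenize=str.split)
>>> # fn (None here) -> tokenize -> lowercase
>>> field.preprocess('The DOG barks')
['the', 'dog', 'barks']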

build(data: Dataset | Iterable[Dataset], min_freq: int = 1, embed: Embedding | None = None, norm: Callable | None = None) → Field

Constructs a Vocab object for this field from one or more datasets. If the vocabulary already exists, this method has no effect.

Parameters:
  • data (Union[Dataset, Iterable[Dataset]]) – One or more Dataset objects. One of their attributes should be named after this field.

  • min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.

  • embed (Embedding) – An Embedding object whose words will be added to the vocabulary. Default: None.

  • norm (Callable) – A callable used for normalizing the embedding weights. Default: None.
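A sketch of a typical call; train below is assumed to be a Dataset whose examples carry a words attribute:

>>> WORD = Field('words', pad='<pad>', unk='<unk>')
>>> WORD.build(train, min_freq=2)  # tokens seen fewer than twice are dropped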

transform(sequences: Iterable[List[str]]) → Iterable[Tensor]

Turns a list of sequences that use this field into tensors.

Each sequence is first preprocessed and then numericalized if needed.

Parameters:

sequences (Iterable[List[str]]) – A list of sequences.

Returns:

A list of tensors transformed from the input sequences.
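Assuming WORD has been built as above, transform lazily yields one index tensor per input sequence:

>>> tensors = list(WORD.transform([['The', 'dog', 'barks'], ['It', 'runs']]))
>>> len(tensors)  # one tensor per sequence
2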

compose(batch: Iterable[Tensor]) → Tensor

Composes a batch of sequences into a padded tensor.

Parameters:

batch (Iterable[Tensor]) – A list of tensors.

Returns:

A padded tensor moved to the proper device.
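A sketch of composing variable-length sequences; shorter ones are assumed to be padded up to the batch maximum:

>>> import torch
>>> batch = [torch.tensor([2, 5, 7]), torch.tensor([3, 4])]
>>> padded = WORD.compose(batch)  # shape: [2, 3], the short row padded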

SubwordField

class supar.utils.field.SubwordField(*args, **kwargs)

A field that conducts tokenization and numericalization over each token rather than over the whole sequence.

This is customized for models requiring character/subword-level inputs, e.g., CharLSTM and BERT.

Parameters:

fix_len (int) – A fixed length that all subword pieces are padded to; pieces exceeding this length are truncated. To save memory, the effective length is the smaller of fix_len and the maximum subword-piece length in the batch.

Examples

>>> from supar.utils.tokenizer import TransformerTokenizer
>>> tokenizer = TransformerTokenizer('bert-base-cased')
>>> field = SubwordField('bert',
                         pad=tokenizer.pad,
                         unk=tokenizer.unk,
                         bos=tokenizer.bos,
                         eos=tokenizer.eos,
                         fix_len=20,
                         tokenize=tokenizer)
>>> field.vocab = tokenizer.vocab  # no need to re-build the vocab
>>> next(field.transform([['This', 'field', 'performs', 'token-level', 'tokenization']]))
tensor([[  101,     0,     0],
        [ 1188,     0,     0],
        [ 1768,     0,     0],
        [10383,     0,     0],
        [22559,   118,  1634],
        [22559,  2734,     0],
        [  102,     0,     0]])
build(data: Dataset | Iterable[Dataset], min_freq: int = 1, embed: Embedding | None = None, norm: Callable | None = None) → SubwordField

Constructs a Vocab object for this field from one or more datasets. If the vocabulary already exists, this method has no effect.

Parameters:
  • data (Union[Dataset, Iterable[Dataset]]) – One or more Dataset objects. One of their attributes should be named after this field.

  • min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.

  • embed (Embedding) – An Embedding object whose words will be added to the vocabulary. Default: None.

  • norm (Callable) – A callable used for normalizing the embedding weights. Default: None.
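When no pretrained vocabulary is available (unlike the tokenizer example above), the vocab can be built from data; a sketch assuming train is a Dataset with a chars attribute:

>>> CHAR = SubwordField('chars', pad='<pad>', unk='<unk>', fix_len=20, tokenize=list)
>>> CHAR.build(train, min_freq=2)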

transform(sequences: Iterable[List[str]]) → Iterable[Tensor]

Turns a list of sequences that use this field into tensors.

Each sequence is first preprocessed and then numericalized if needed.

Parameters:

sequences (Iterable[List[str]]) – A list of sequences.

Returns:

A list of tensors transformed from the input sequences.

ChartField

class supar.utils.field.ChartField(name: str, pad: str | None = None, unk: str | None = None, bos: str | None = None, eos: str | None = None, lower: bool = False, use_vocab: bool = True, tokenize: Callable | None = None, fn: Callable | None = None)

Field dealing with chart inputs.

Examples
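The field used below is assumed to have been declared and built beforehand, along these lines (a hypothetical setup):

>>> field = ChartField('charts')
>>> field.build(train)  # train is a Dataset with a 'charts' attribute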

>>> chart = [[    None,    'NP',    None,    None,    'S*',     'S'],
             [    None,    None,   'VP*',    None,    'VP',    None],
             [    None,    None,    None,   'VP*', 'S::VP',    None],
             [    None,    None,    None,    None,    'NP',    None],
             [    None,    None,    None,    None,    None,    'S*'],
             [    None,    None,    None,    None,    None,    None]]
>>> next(field.transform([chart]))
tensor([[ -1,  37,  -1,  -1, 107,  79],
        [ -1,  -1, 120,  -1, 112,  -1],
        [ -1,  -1,  -1, 120,  86,  -1],
        [ -1,  -1,  -1,  -1,  37,  -1],
        [ -1,  -1,  -1,  -1,  -1, 107],
        [ -1,  -1,  -1,  -1,  -1,  -1]])
build(data: Dataset | Iterable[Dataset], min_freq: int = 1) → ChartField

Constructs a Vocab object for this field from one or more datasets. If the vocabulary already exists, this method has no effect.

Parameters:
  • data (Union[Dataset, Iterable[Dataset]]) – One or more Dataset objects. One of their attributes should be named after this field.

  • min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.

transform(charts: Iterable[List[List]]) → Iterable[Tensor]

Turns a list of charts that use this field into tensors.

Each chart is first preprocessed and then numericalized if needed.

Parameters:

charts (Iterable[List[List]]) – A list of charts.

Returns:

A list of tensors transformed from the input charts.