VI

VIConstituencyParser

class supar.models.const.vi.VIConstituencyParser(*args, **kwargs)

The implementation of Constituency Parser using variational inference.

MODEL

alias of VIConstituencyModel

train(train, dev, test, epochs: int = 1000, patience: int = 100, batch_size: int = 5000, update_steps: int = 1, buckets: int = 32, workers: int = 0, amp: bool = False, cache: bool = False, delete: Set = {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}, equal: Dict = {'ADVP': 'PRT'}, verbose: bool = True, **kwargs)
Parameters:
  • train/dev/test (Union[str, Iterable]) – Filenames of the train/dev/test datasets.

  • epochs (int) – The maximum number of training epochs. Default: 1000.

  • patience (int) – The number of consecutive epochs without improvement after which training stops early. Default: 100.

  • batch_size (int) – The number of tokens in each batch. Default: 5000.

  • update_steps (int) – Gradient accumulation steps. Default: 1.

  • buckets (int) – The number of buckets that sentences are assigned to. Default: 32.

  • workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0.

  • clip (float) – Clips the gradients of the parameters at the specified value. It does not appear in the explicit signature, so it is presumably passed via **kwargs. Default: 5.0.

  • amp (bool) – Specifies whether to use automatic mixed precision. Default: False.

  • cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.

  • delete (Set) – A set of labels that are ignored during evaluation. Default: {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}.

  • equal (Dict) – Pairs of labels that are treated as equivalent during evaluation. Default: {'ADVP': 'PRT'}.

  • verbose (bool) – If True, increases the output verbosity. Default: True.
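
The interaction between epochs and patience can be illustrated with a minimal sketch. It assumes that one development score is produced per epoch and that higher is better; the function name train_with_patience and the score list are hypothetical and are not part of SuPar.

```python
def train_with_patience(dev_scores, epochs=1000, patience=100):
    """Simulate early stopping over a sequence of per-epoch dev scores.

    Returns (best_epoch, best_score, epochs_run). This is an illustrative
    sketch, not the library's training loop.
    """
    best_score, best_epoch = float("-inf"), 0
    epochs_run = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if epoch > epochs:
            break  # hard limit on the number of epochs
        epochs_run = epoch
        if score > best_score:
            # A strict improvement resets the patience counter.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs.
            break
    return best_epoch, best_score, epochs_run


scores = [0.80, 0.85, 0.84, 0.83, 0.82, 0.90]
best_epoch, best_score, epochs_run = train_with_patience(scores, patience=3)
```

With patience=3 the loop stops before the last score is seen, so a small patience can miss a late improvement; the default of 100 is far more permissive.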

evaluate(data: str | Iterable, batch_size: int = 5000, buckets: int = 8, workers: int = 0, amp: bool = False, cache: bool = False, delete: Set = {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}, equal: Dict = {'ADVP': 'PRT'}, verbose: bool = True, **kwargs)
Parameters:
  • data (Union[str, Iterable]) – The data for evaluation. Both a filename and a list of instances are allowed.

  • batch_size (int) – The number of tokens in each batch. Default: 5000.

  • buckets (int) – The number of buckets that sentences are assigned to. Default: 8.

  • workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0.

  • amp (bool) – Specifies whether to use automatic mixed precision. Default: False.

  • cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.

  • delete (Set) – A set of labels that are ignored during evaluation. Default: {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}.

  • equal (Dict) – Pairs of labels that are treated as equivalent during evaluation. Default: {'ADVP': 'PRT'}.

  • verbose (bool) – If True, increases the output verbosity. Default: True.

Returns:

The evaluation results.
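
The delete and equal arguments in the signature above can be pictured with a small labeled-bracket scorer. The sketch assumes that labels in delete are dropped before scoring and that each key in equal is rewritten to its value; the helpers normalize and labeled_f1 are hypothetical names, and the metric computed by the library may differ in detail (for example, this set-based version collapses duplicate spans).

```python
DELETE = {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}
EQUAL = {'ADVP': 'PRT'}


def normalize(spans, delete=DELETE, equal=EQUAL):
    """Drop spans whose label is in `delete`; map labels through `equal`."""
    return {(i, j, equal.get(label, label))
            for i, j, label in spans if label not in delete}


def labeled_f1(pred, gold, delete=DELETE, equal=EQUAL):
    """Return (precision, recall, F1) over normalized labeled spans."""
    pred = normalize(pred, delete, equal)
    gold = normalize(gold, delete, equal)
    matched = len(pred & gold)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = [(0, 5, 'TOP'), (0, 5, 'S'), (0, 2, 'NP'), (2, 5, 'VP'), (3, 4, 'PRT')]
pred = [(0, 5, 'S'), (0, 2, 'NP'), (2, 5, 'VP'), (3, 4, 'ADVP'), (4, 5, 'NP')]
precision, recall, f1 = labeled_f1(pred, gold)
```

In this toy case the TOP span is ignored and the predicted ADVP counts as a match for the gold PRT, so only the spurious (4, 5, 'NP') span costs precision.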

predict(data: str | Iterable, pred: str | None = None, lang: str | None = None, prob: bool = False, batch_size: int = 5000, buckets: int = 8, workers: int = 0, amp: bool = False, cache: bool = False, verbose: bool = True, **kwargs)
Parameters:
  • data (Union[str, Iterable]) – The data for prediction: either a filename or a list of instances. If the filename ends with .txt, the parser makes predictions line by line from plain text.

  • pred (str) – If specified, the predicted results will be saved to the file. Default: None.

  • lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.

  • prob (bool) – If True, outputs the probabilities. Default: False.

  • batch_size (int) – The number of tokens in each batch. Default: 5000.

  • buckets (int) – The number of buckets that sentences are assigned to. Default: 8.

  • workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0.

  • amp (bool) – Specifies whether to use automatic mixed precision. Default: False.

  • cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.

  • verbose (bool) – If True, increases the output verbosity. Default: True.

Returns:

A Dataset object containing all predictions if cache=False, otherwise None.
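
Because batch_size counts tokens rather than sentences, the number of sentences per batch varies with sentence length. The following is a simplified sketch of token-budget batching, assuming sentences are sorted by length and packed greedily; batch_by_tokens is a hypothetical helper, and the library's own bucketing (controlled by buckets) is not reproduced here.

```python
def batch_by_tokens(sentences, batch_size=5000):
    """Group sentence indices into batches of at most `batch_size` tokens.

    Sentences are visited in order of increasing length so that each batch
    holds sentences of similar length. A sentence longer than the budget
    still gets a batch of its own.
    """
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches, current, n_tokens = [], [], 0
    for i in order:
        length = len(sentences[i])
        if current and n_tokens + length > batch_size:
            # Adding this sentence would exceed the budget: start a new batch.
            batches.append(current)
            current, n_tokens = [], 0
        current.append(i)
        n_tokens += length
    if current:
        batches.append(current)
    return batches


sentences = [
    ["She", "enjoys", "tennis"],
    ["The", "cat", "sat", "down", "."],
    ["Hello", "world"],
    ["Time", "flies", "very", "fast"],
]
batches = batch_by_tokens(sentences, batch_size=6)
```

Each batch is a list of indices into the original input, so predictions can be restored to the input order afterwards.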

VIConstituencyModel

class supar.models.const.vi.VIConstituencyModel(n_words, n_labels, n_tags=None, n_chars=None, encoder='lstm', feat=['char'], n_embed=100, n_pretrained=100, n_feat_embed=100, n_char_embed=50, n_char_hidden=100, char_pad_index=0, elmo='original_5b', elmo_bos_eos=(True, True), bert=None, n_bert_layers=4, mix_dropout=0.0, bert_pooling='mean', bert_pad_index=0, finetune=False, n_plm_embed=0, embed_dropout=0.33, n_encoder_hidden=800, n_encoder_layers=3, encoder_dropout=0.33, n_span_mlp=500, n_pair_mlp=100, n_label_mlp=100, mlp_dropout=0.33, inference='mfvi', max_iter=3, interpolation=0.1, pad_index=0, unk_index=1, **kwargs)

The implementation of Constituency Parser using variational inference.

Parameters:
  • n_words (int) – The size of the word vocabulary.

  • n_labels (int) – The number of labels in the treebank.

  • n_tags (int) – The number of POS tags, required if POS tag embeddings are used. Default: None.

  • n_chars (int) – The number of characters, required if character-level representations are used. Default: None.

  • encoder (str) – Encoder to use. 'lstm': BiLSTM encoder. 'bert': BERT-like pretrained language model (for finetuning), e.g., 'bert-base-cased'. Default: 'lstm'.

  • feat (List[str]) – Additional features to use, required if encoder='lstm'. 'tag': POS tag embeddings. 'char': Character-level representations extracted by CharLSTM. 'bert': BERT representations, other pretrained language models like RoBERTa are also feasible. Default: ['char'].

  • n_embed (int) – The size of word embeddings. Default: 100.

  • n_pretrained (int) – The size of pretrained word embeddings. Default: 100.

  • n_feat_embed (int) – The size of feature representations. Default: 100.

  • n_char_embed (int) – The size of character embeddings serving as inputs of CharLSTM, required if using CharLSTM. Default: 50.

  • n_char_hidden (int) – The size of hidden states of CharLSTM, required if using CharLSTM. Default: 100.

  • char_pad_index (int) – The index of the padding token in the character vocabulary, required if using CharLSTM. Default: 0.

  • elmo (str) – Name of the pretrained ELMo registered in ELMoEmbedding.OPTION. Default: 'original_5b'.

  • elmo_bos_eos (Tuple[bool]) – A tuple of two boolean values indicating whether to keep the start/end boundaries of ELMo outputs. Default: (True, True).

  • bert (str) – Specifies which kind of language model to use, e.g., 'bert-base-cased'. This is required if encoder='bert' or using BERT features. The full list can be found in transformers. Default: None.

  • n_bert_layers (int) – Specifies how many last layers to use, required if encoder='bert' or using BERT features. The final outputs would be weighted sum of the hidden states of these layers. Default: 4.

  • mix_dropout (float) – The dropout ratio of BERT layers, required if encoder='bert' or using BERT features. Default: .0.

  • bert_pooling (str) – The pooling method used to obtain token embeddings from subtokens. 'first': take the first subtoken. 'last': take the last subtoken. 'mean': take the mean over all subtokens. Default: 'mean'.

  • bert_pad_index (int) – The index of the padding token in BERT vocabulary, required if encoder='bert' or using BERT features. Default: 0.

  • finetune (bool) – If False, freezes all parameters of the pretrained layers; only relevant when pretrained layers are used. Default: False.

  • n_plm_embed (int) – The size of PLM embeddings. If 0, uses the size of the pretrained embedding model. Default: 0.

  • embed_dropout (float) – The dropout ratio of input embeddings. Default: .33.

  • n_encoder_hidden (int) – The size of encoder hidden states. Default: 800.

  • n_encoder_layers (int) – The number of encoder layers. Default: 3.

  • encoder_dropout (float) – The dropout ratio of encoder layer. Default: .33.

  • n_span_mlp (int) – Span MLP size. Default: 500.

  • n_pair_mlp (int) – Binary factor MLP size. Default: 100.

  • n_label_mlp (int) – Label MLP size. Default: 100.

  • mlp_dropout (float) – The dropout ratio of MLP layers. Default: .33.

  • inference (str) – The approximate inference method, e.g., 'mfvi' for mean field variational inference. Default: 'mfvi'.

  • max_iter (int) – The maximum number of inference iterations. Default: 3.

  • interpolation (float) – Constant used to balance the label loss against the span loss. Default: .1.

  • pad_index (int) – The index of the padding token in the word vocabulary. Default: 0.

  • unk_index (int) – The index of the unknown token in the word vocabulary. Default: 1.
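
The three bert_pooling options listed above can be sketched in plain Python. The example assumes each word maps to a contiguous range of subtoken vectors; pool_subtokens and the (start, end) span format are hypothetical and only illustrate the pooling rule, not the library's tensor implementation.

```python
def pool_subtokens(vectors, spans, pooling="mean"):
    """Pool subtoken vectors into one vector per word.

    vectors: list of subtoken vectors (lists of floats).
    spans:   one (start, end) pair per word, end exclusive.
    """
    pooled = []
    for start, end in spans:
        pieces = vectors[start:end]
        if pooling == "first":
            pooled.append(list(pieces[0]))
        elif pooling == "last":
            pooled.append(list(pieces[-1]))
        elif pooling == "mean":
            # Average each dimension over the word's subtokens.
            pooled.append([sum(dim) / len(pieces) for dim in zip(*pieces)])
        else:
            raise ValueError(f"unknown pooling method: {pooling!r}")
    return pooled


# Two words: the first is split into two subtokens, the second into one.
subtokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_spans = [(0, 2), (2, 3)]
word_vectors = pool_subtokens(subtokens, word_spans, pooling="mean")
```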

forward(words, feats)
Parameters:
  • words (LongTensor) – [batch_size, seq_len]. Word indices.

  • feats (List[LongTensor]) – A list of feat indices. The size is either [batch_size, seq_len, fix_len] if feat is 'char' or 'bert', or [batch_size, seq_len] otherwise.

Returns:

Scores of all possible constituents ([batch_size, seq_len, seq_len]), second-order triples ([batch_size, seq_len, seq_len, seq_len]) and all possible labels on each constituent ([batch_size, seq_len, seq_len, n_labels]).

Return type:

Tensor, Tensor, Tensor

loss(s_span, s_pair, s_label, charts, mask)
Parameters:
  • s_span (Tensor) – [batch_size, seq_len, seq_len]. Scores of all constituents.

  • s_pair (Tensor) – [batch_size, seq_len, seq_len, seq_len]. Scores of second-order triples.

  • s_label (Tensor) – [batch_size, seq_len, seq_len, n_labels]. Scores of all constituent labels.

  • charts (LongTensor) – [batch_size, seq_len, seq_len]. The tensor of gold-standard labels. Positions without labels are filled with -1.

  • mask (BoolTensor) – [batch_size, seq_len, seq_len]. The mask for covering the unpadded tokens in each chart.

Returns:

The training loss and marginals of shape [batch_size, seq_len, seq_len].

Return type:

Tensor, Tensor
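
How span and pair scores might combine into marginals, and how interpolation might weight the two loss terms, can be sketched for a single sentence. This is a generic mean-field sketch under stated assumptions: each span (i, j) is a binary variable, s_pair[i][j][k] scores span (i, j) together with a neighbouring span (i, k), and interpolation is the weight on the label loss. The library's indexing, masking and weighting may differ; mean_field and interpolate are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(s_span, s_pair, max_iter=3):
    """Approximate span marginals with mean-field updates.

    s_span: [n][n] first-order span scores.
    s_pair: [n][n][n] second-order scores (indexing is an assumption).
    """
    n = len(s_span)
    logits = [row[:] for row in s_span]
    for _ in range(max_iter):
        # Current beliefs about every span.
        q = [[sigmoid(x) for x in row] for row in logits]
        # Each span's logit is its own score plus the expected pair scores.
        logits = [
            [s_span[i][j] + sum(q[i][k] * s_pair[i][j][k]
                                for k in range(n) if k not in (i, j))
             for j in range(n)]
            for i in range(n)
        ]
    return [[sigmoid(x) for x in row] for row in logits]


def interpolate(span_loss, label_loss, interpolation=0.1):
    """Weighted sum of the two losses (the weighting direction is assumed)."""
    return interpolation * label_loss + (1 - interpolation) * span_loss


n = 3
s_span = [[0.0] * n for _ in range(n)]
s_pair = [[[0.0] * n for _ in range(n)] for _ in range(n)]
s_pair[0][1][2] = 2.0  # span (0, 1) is encouraged when span (0, 2) is likely
marginals = mean_field(s_span, s_pair, max_iter=3)
```

When every pair score is zero the marginals reduce to the sigmoid of the span scores, which is a convenient sanity check for the update rule.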

decode(s_span, s_label, mask)
Parameters:
  • s_span (Tensor) – [batch_size, seq_len, seq_len]. Scores of all constituents.

  • s_label (Tensor) – [batch_size, seq_len, seq_len, n_labels]. Scores of all constituent labels.

  • mask (BoolTensor) – [batch_size, seq_len, seq_len]. The mask for covering the unpadded tokens in each chart.

Returns:

Sequences of factorized labeled trees.

Return type:

List[List[Tuple]]
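
One way to turn span and label scores into a list of labeled spans is CKY decoding followed by a per-span label argmax. The sketch below handles a single sentence and assumes that s_span[i][j] scores the span between fenceposts i and j and that the best tree maximizes the sum of its span scores. Whether decode works exactly this way is an assumption; cky and label_tree are hypothetical names.

```python
def cky(s_span, n):
    """Return the spans (i, j) of the best binary tree over n words."""
    best = [[0.0] * (n + 1) for _ in range(n + 1)]
    split = [[None] * (n + 1) for _ in range(n + 1)]
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            if width == 1:
                best[i][j] = s_span[i][j]
                continue
            # Choose the split point with the best pair of sub-trees.
            k_best = max(range(i + 1, j), key=lambda k: best[i][k] + best[k][j])
            split[i][j] = k_best
            best[i][j] = s_span[i][j] + best[i][k_best] + best[k_best][j]

    def backtrack(i, j):
        if j == i + 1:
            return [(i, j)]
        k = split[i][j]
        return [(i, j)] + backtrack(i, k) + backtrack(k, j)

    return backtrack(0, n)


def label_tree(s_span, s_label, labels, n):
    """Attach the highest-scoring label to every span of the best tree."""
    tree = []
    for i, j in cky(s_span, n):
        scores = s_label[i][j]
        best_label = max(range(len(labels)), key=lambda idx: scores[idx])
        tree.append((i, j, labels[best_label]))
    return tree


n = 3
labels = ['NP', 'VP', 'S']
s_span = [[0.0] * (n + 1) for _ in range(n + 1)]
s_span[0][2], s_span[1][3] = 1.0, -1.0
s_label = [[[0.0] * len(labels) for _ in range(n + 1)] for _ in range(n + 1)]
s_label[0][3] = [0.0, 0.0, 1.0]
s_label[0][2] = [0.0, 1.0, 0.0]
tree = label_tree(s_span, s_label, labels, n)
```

Each sentence yields a list of (i, j, label) tuples, which matches the List[List[Tuple]] return type once the per-sentence lists are collected over a batch.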