AttachJuxtapose
AttachJuxtaposeConstituencyParser
- class supar.models.const.aj.AttachJuxtaposeConstituencyParser(*args, **kwargs)
- The implementation of the AttachJuxtapose constituency parser (Yang & Deng, 2020).
- MODEL
- alias of AttachJuxtaposeConstituencyModel
 - train(train: str | Iterable, dev: str | Iterable, test: str | Iterable, epochs: int = 1000, patience: int = 100, batch_size: int = 5000, update_steps: int = 1, buckets: int = 32, workers: int = 0, amp: bool = False, cache: bool = False, beam_size: int = 1, delete: Set = {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}, equal: Dict = {'ADVP': 'PRT'}, verbose: bool = True, **kwargs)
- Parameters:
- train/dev/test (Union[str, Iterable]) – Filenames of the train/dev/test datasets. 
- epochs (int) – The number of training iterations. Default: 1000.
- patience (int) – The number of consecutive iterations without improvement after which training is stopped early. Default: 100.
- batch_size (int) – The number of tokens in each batch. Default: 5000. 
- update_steps (int) – Gradient accumulation steps. Default: 1. 
- buckets (int) – The number of buckets that sentences are assigned to. Default: 32. 
- workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0. 
- clip (float) – Clips gradient of an iterable of parameters at specified value. Default: 5.0. 
- amp (bool) – Specifies whether to use automatic mixed precision. Default: False.
- cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.
- verbose (bool) – If True, increases the output verbosity. Default: True.
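The interaction between epochs and patience can be illustrated with a minimal sketch. This is not SuPar's training loop: `train_with_patience` and `evaluate_epoch` are hypothetical names, and the dev scores are toy values chosen for the example.

```python
from typing import Callable, Tuple


def train_with_patience(evaluate_epoch: Callable[[int], float],
                        epochs: int = 1000,
                        patience: int = 100) -> Tuple[int, float]:
    """Run at most `epochs` epochs and stop once `patience` consecutive
    epochs have passed without a new best dev score."""
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, epochs + 1):
        # Hypothetical callback: trains one epoch and returns the dev metric.
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Toy dev scores that peak at epoch 3 and never improve afterwards.
scores = {1: 0.80, 2: 0.85, 3: 0.90}
best = train_with_patience(lambda e: scores.get(e, 0.88), epochs=1000, patience=5)
print(best)  # (3, 0.9)
```

With patience=5, the loop above stops at epoch 8, five epochs after the last improvement, instead of running all 1000 epochs.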
 
 
 - evaluate(data: str | Iterable, batch_size: int = 5000, buckets: int = 8, workers: int = 0, amp: bool = False, cache: bool = False, beam_size: int = 1, delete: Set = {'', '!', "''", ',', '-NONE-', '.', ':', '?', 'S1', 'TOP', '``'}, equal: Dict = {'ADVP': 'PRT'}, verbose: bool = True, **kwargs)
- Parameters:
- data (Union[str, Iterable]) – The data for evaluation. Both a filename and a list of instances are allowed. 
- batch_size (int) – The number of tokens in each batch. Default: 5000. 
- buckets (int) – The number of buckets that sentences are assigned to. Default: 8. 
- workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0. 
- amp (bool) – Specifies whether to use automatic mixed precision. Default: False.
- cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.
- verbose (bool) – If True, increases the output verbosity. Default: True.
 
- Returns:
- The evaluation results. 
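Because batch_size counts tokens rather than sentences, the number of sentences per batch varies with sentence length. The sketch below illustrates the idea with a simple greedy grouping; it is not SuPar's sampler (the way the library assigns sentences to buckets is not reproduced here), and `batch_by_tokens` is a hypothetical helper.

```python
from typing import List


def batch_by_tokens(lengths: List[int], batch_size: int = 5000) -> List[List[int]]:
    """Group sentence indices so that each batch holds at most `batch_size`
    tokens (a single sentence longer than the budget still gets its own batch)."""
    # Sorting by length keeps similarly sized sentences together, which is
    # the purpose bucketing serves: less padding per batch.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, n_tokens = [], [], 0
    for i in order:
        if current and n_tokens + lengths[i] > batch_size:
            batches.append(current)
            current, n_tokens = [], 0
        current.append(i)
        n_tokens += lengths[i]
    if current:
        batches.append(current)
    return batches


lengths = [12, 30, 7, 25, 18]  # toy sentence lengths in tokens
batches = batch_by_tokens(lengths, batch_size=40)
print(batches)  # [[2, 0, 4], [3], [1]]
```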
 
 - predict(data: str | Iterable, pred: str | None = None, lang: str | None = None, prob: bool = False, batch_size: int = 5000, buckets: int = 8, workers: int = 0, amp: bool = False, cache: bool = False, beam_size: int = 1, verbose: bool = True, **kwargs)
- Parameters:
- data (Union[str, Iterable]) – The data for prediction. Either a filename or a list of instances. If the filename ends with .txt, the parser will seek to make predictions line by line from plain texts.
- pred (str) – If specified, the predicted results will be saved to the file. Default: None.
- lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.
- prob (bool) – If True, outputs the probabilities. Default: False.
- batch_size (int) – The number of tokens in each batch. Default: 5000. 
- buckets (int) – The number of buckets that sentences are assigned to. Default: 8. 
- workers (int) – The number of subprocesses used for data loading. 0 means only the main process. Default: 0. 
- amp (bool) – Specifies whether to use automatic mixed precision. Default: False.
- cache (bool) – If True, caches the data first, suggested for huge files (e.g., > 1M sentences). Default: False.
- verbose (bool) – If True, increases the output verbosity. Default: True.
 
- Returns:
- A Dataset object containing all predictions if cache=False, otherwise None.
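A call to predict might look like the sketch below. Several things here are assumptions rather than part of this reference: `parse_tokens` and `model_path` are hypothetical names, the parser is restored with the `load` classmethod that SuPar parsers provide (not documented in this section), and the returned Dataset is assumed to support indexing.

```python
def parse_tokens(model_path, tokens):
    """Load a saved parser and return the prediction for one pre-tokenized sentence."""
    # Imported lazily so this module can be loaded without SuPar installed.
    from supar.models.const.aj import AttachJuxtaposeConstituencyParser

    # Assumption: `load` restores a parser from a saved model path.
    parser = AttachJuxtaposeConstituencyParser.load(model_path)
    # lang=None because the input is already tokenized; with cache=False
    # (the default) a Dataset holding the predictions is returned.
    dataset = parser.predict([tokens], lang=None, prob=False, verbose=False)
    # Assumption: the Dataset can be indexed to get a single sentence.
    return dataset[0]
```

To parse raw text instead, pass a string or a .txt filename as data and set lang (e.g., lang='en') so that the text is tokenized first.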
 
 - classmethod build(path, min_freq=2, fix_len=20, **kwargs)
- Build a brand-new Parser, including initialization of all data fields and model parameters.
- Parameters:
- path (str) – The path of the model to be saved.
- min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 2.
- fix_len (int) – The max length of all subword pieces. The excess part of each piece will be truncated. Required if using CharLSTM/BERT. Default: 20. 
- kwargs (Dict) – A dict holding the unconsumed arguments. 
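The effect of min_freq and fix_len can be sketched in plain Python. This is an illustration under stated assumptions, not SuPar's field or vocabulary code: `build_vocab` and `truncate_pieces` are hypothetical helpers, and the `<pad>` and `<unk>` tokens are placeholders placed at indices 0 and 1.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(sentences: Iterable[List[str]], min_freq: int = 2) -> List[str]:
    """Keep only tokens that occur at least `min_freq` times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Hypothetical special tokens occupying indices 0 and 1.
    return ["<pad>", "<unk>"] + sorted(t for t, c in counts.items() if c >= min_freq)


def truncate_pieces(pieces: List[str], fix_len: int = 20) -> List[str]:
    """Keep at most `fix_len` characters or subword pieces of one token."""
    return pieces[:fix_len]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
vocab = build_vocab(corpus, min_freq=2)
print(vocab)  # ['<pad>', '<unk>', 'cat', 'sat', 'the']
print(truncate_pieces(list("internationalization"), fix_len=5))  # ['i', 'n', 't', 'e', 'r']
```

Tokens below the threshold ("dog" and "a" here) would be mapped to the unknown token at lookup time.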
 
 
 
AttachJuxtaposeConstituencyModel
- class supar.models.const.aj.AttachJuxtaposeConstituencyModel(n_words, n_labels, n_tags=None, n_chars=None, encoder='lstm', feat=['char'], n_embed=100, n_pretrained=100, n_feat_embed=100, n_char_embed=50, n_char_hidden=100, char_pad_index=0, elmo='original_5b', elmo_bos_eos=(True, True), bert=None, n_bert_layers=4, mix_dropout=0.0, bert_pooling='mean', bert_pad_index=0, finetune=False, n_plm_embed=0, embed_dropout=0.33, n_encoder_hidden=800, n_encoder_layers=3, encoder_dropout=0.33, n_gnn_layers=3, gnn_dropout=0.33, pad_index=0, unk_index=1, **kwargs)
- The implementation of the AttachJuxtapose constituency parser (Yang & Deng, 2020).
- Parameters:
- n_words (int) – The size of the word vocabulary. 
- n_labels (int) – The number of labels in the treebank. 
- n_tags (int) – The number of POS tags, required if POS tag embeddings are used. Default: None.
- n_chars (int) – The number of characters, required if character-level representations are used. Default: None.
- encoder (str) – Encoder to use. 'lstm': BiLSTM encoder. 'bert': BERT-like pretrained language model (for finetuning), e.g., 'bert-base-cased'. Default: 'lstm'.
- feat (List[str]) – Additional features to use, required if encoder='lstm'. 'tag': POS tag embeddings. 'char': Character-level representations extracted by CharLSTM. 'bert': BERT representations; other pretrained language models like RoBERTa are also feasible. Default: ['char'].
- n_embed (int) – The size of word embeddings. Default: 100. 
- n_pretrained (int) – The size of pretrained word embeddings. Default: 100. 
- n_feat_embed (int) – The size of feature representations. Default: 100. 
- n_char_embed (int) – The size of character embeddings serving as inputs of CharLSTM, required if using CharLSTM. Default: 50. 
- n_char_hidden (int) – The size of hidden states of CharLSTM, required if using CharLSTM. Default: 100. 
- char_pad_index (int) – The index of the padding token in the character vocabulary, required if using CharLSTM. Default: 0. 
- elmo (str) – Name of the pretrained ELMo registered in ELMoEmbedding.OPTION. Default: 'original_5b'.
- elmo_bos_eos (Tuple[bool]) – A tuple of two boolean values indicating whether to keep start/end boundaries of ELMo outputs. Default: (True, True).
- bert (str) – Specifies which kind of language model to use, e.g., 'bert-base-cased'. This is required if encoder='bert' or using BERT features. The full list can be found in transformers. Default: None.
- n_bert_layers (int) – Specifies how many last layers to use, required if encoder='bert' or using BERT features. The final outputs are a weighted sum of the hidden states of these layers. Default: 4.
- mix_dropout (float) – The dropout ratio of BERT layers, required if encoder='bert' or using BERT features. Default: 0.0.
- bert_pooling (str) – How subtoken representations are pooled into token embeddings. first: take the first subtoken. last: take the last subtoken. mean: take the mean over all subtokens. Default: mean.
- bert_pad_index (int) – The index of the padding token in BERT vocabulary, required if encoder='bert' or using BERT features. Default: 0.
- finetune (bool) – If False, freezes all parameters, required if using pretrained layers. Default: False.
- n_plm_embed (int) – The size of PLM embeddings. If 0, uses the size of the pretrained embedding model. Default: 0. 
- embed_dropout (float) – The dropout ratio of input embeddings. Default: 0.33.
- n_encoder_hidden (int) – The size of encoder hidden states. Default: 800.
- n_encoder_layers (int) – The number of encoder layers. Default: 3.
- encoder_dropout (float) – The dropout ratio of encoder layers. Default: 0.33.
- n_gnn_layers (int) – The number of GNN layers. Default: 3.
- gnn_dropout (float) – The dropout ratio of GNN layers. Default: 0.33.
- pad_index (int) – The index of the padding token in the word vocabulary. Default: 0. 
- unk_index (int) – The index of the unknown token in the word vocabulary. Default: 1. 
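The three bert_pooling options listed above can be sketched on plain Python lists. This only illustrates the pooling arithmetic for a single token; it is not SuPar's tensor implementation, and `pool_subtokens` is a hypothetical helper.

```python
from typing import List

Vector = List[float]


def pool_subtokens(subtokens: List[Vector], bert_pooling: str = "mean") -> Vector:
    """Reduce the subtoken vectors of one token to a single token vector."""
    if bert_pooling == "first":
        return subtokens[0]
    if bert_pooling == "last":
        return subtokens[-1]
    if bert_pooling == "mean":
        n = len(subtokens)
        # zip(*subtokens) iterates over dimensions, so each dimension is averaged.
        return [sum(dim) / n for dim in zip(*subtokens)]
    raise ValueError(f"unknown pooling method: {bert_pooling}")


# Toy example: one token split into three subtokens with 2-dimensional vectors.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
print(pool_subtokens(pieces, "first"))  # [1.0, 2.0]
print(pool_subtokens(pieces, "last"))   # [5.0, 9.0]
print(pool_subtokens(pieces, "mean"))   # [3.0, 5.0]
```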
 
 - forward(words: LongTensor, feats: List[LongTensor] | None = None) → Tensor
- Parameters:
- words (LongTensor) – [batch_size, seq_len]. Word indices.
- feats (List[LongTensor]) – A list of feat indices. The size is either [batch_size, seq_len, fix_len] if feat is 'char' or 'bert', or [batch_size, seq_len] otherwise. Default: None.
 
- Returns:
- Contextualized output hidden states of shape [batch_size, seq_len, n_model] of the input.
- Return type:
- Tensor
 
 - loss(x: Tensor, nodes: LongTensor, parents: LongTensor, news: LongTensor, mask: BoolTensor) → Tensor
- Parameters:
- x (Tensor) – [batch_size, seq_len, n_model]. Contextualized output hidden states.
- nodes (LongTensor) – [batch_size, seq_len]. The target node positions on rightmost chains.
- parents (LongTensor) – [batch_size, seq_len]. The parent node labels of terminals.
- news (LongTensor) – [batch_size, seq_len]. The parent node labels of juxtaposed targets and terminals.
- mask (BoolTensor) – [batch_size, seq_len]. The mask for covering the unpadded tokens in each chart.
 
- Returns:
- The training loss. 
- Return type:
- Tensor