Embedding#

Embedding#

class supar.utils.embed.Embedding(path: str, unk: str | None = None, skip_first: bool = False, cache: bool = True, sep: str = ' ', **kwargs)[source]#

Defines a container object for holding pretrained embeddings. This object is callable and behaves like torch.nn.Embedding. For huge files, this object supports lazy loading, seeking to retrieve vectors from the disk on the fly if necessary.

Short names for the currently available pretrained embeddings are registered in supar.utils.embed.PRETRAINED, e.g. 'glove-6b-100', 'fasttext-en', 'giga-100', 'tencent-100', and 'tencent-100-large'; see the examples below.
Parameters:
  • path (str) – Path to the embedding file or short name registered in supar.utils.embed.PRETRAINED.

  • unk (Optional[str]) – The string token used to represent OOV tokens. Default: None.

  • skip_first (bool) – If True, skips the first line of the embedding file. Default: False.

  • cache (bool) – If True, instead of loading the entire embedding table into memory, seeks to retrieve vectors from the disk on the fly when called. Default: True.

  • sep (str) – Separator used by the embedding file. Default: ' '.

Examples

>>> import torch
>>> import torch.nn as nn
>>> from supar.utils.embed import Embedding
>>> glove = Embedding.load('glove-6b-100')
>>> glove
GloVeEmbedding(n_tokens=400000, dim=100, unk=unk, cache=True)
>>> fasttext = Embedding.load('fasttext-en')
>>> fasttext
FasttextEmbedding(n_tokens=2000000, dim=300, skip_first=True, cache=True)
>>> giga = Embedding.load('giga-100')
>>> giga
GigaEmbedding(n_tokens=372846, dim=100, cache=True)
>>> indices = torch.tensor([glove.vocab[i.lower()] for i in ['She', 'enjoys', 'playing', 'tennis', '.']])
>>> indices
tensor([  67, 8371,  697, 2140,    2])
>>> glove(indices).shape
torch.Size([5, 100])
>>> glove(indices).equal(nn.Embedding.from_pretrained(glove.vectors)(indices))
True
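
Besides the registered short names, an Embedding can also be built directly from a local file via the constructor above. The sketch below is illustrative only: the file path and the '<unk>' token are hypothetical, and the file is assumed to be in word2vec text format (a header line, then one token followed by its vector components per line, separated by sep).

>>> from supar.utils.embed import Embedding
>>> # 'data/custom.vec' is a hypothetical local file in word2vec text format;
>>> # skip_first=True skips its header line and cache=False reads the whole
>>> # table into memory instead of seeking into the file on demand.
>>> custom = Embedding('data/custom.vec', unk='<unk>', skip_first=True, cache=False)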

GloVeEmbedding#

class supar.utils.embed.GloVeEmbedding(src: str = '6B', dim: int = 100, reload=False, *args, **kwargs)[source]#

GloVe: Global Vectors for Word Representation. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Parameters:
  • src (str) – Size of the source data for training. Default: 6B.

  • dim (int) – Which dimension of the embeddings to use. Default: 100.

  • reload (bool) – If True, forces a fresh download. Default: False.

Examples

>>> from supar.utils.embed import Embedding
>>> Embedding.load('glove-6b-100')
GloVeEmbedding(n_tokens=400000, dim=100, unk=unk, cache=True)
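
The linear substructures mentioned above can be probed with the vocab mapping and callable interface from the Embedding examples. The following sketch (not part of the library) scores the classic king - man + woman analogy against queen by cosine similarity; it assumes all four lowercase tokens are in the GloVe vocabulary, which they are for the 6B corpus.

>>> import torch
>>> from supar.utils.embed import Embedding
>>> glove = Embedding.load('glove-6b-100')
>>> # Look up and fetch the four word vectors in one call, then unpack them.
>>> king, man, woman, queen = glove(torch.tensor([glove.vocab[w] for w in ('king', 'man', 'woman', 'queen')]))
>>> analogy = king - man + woman
>>> # Scalar tensor in [-1, 1]; expected to be close to 1 for a good analogy.
>>> sim = torch.cosine_similarity(analogy, queen, dim=0)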

FasttextEmbedding#

class supar.utils.embed.FasttextEmbedding(lang: str = 'en', reload=False, *args, **kwargs)[source]#

fastText word embeddings for 157 languages, trained with CBOW in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples.

Parameters:
  • lang (str) – Language code. Default: en.

  • reload (bool) – If True, forces a fresh download. Default: False.

Examples

>>> from supar.utils.embed import Embedding
>>> Embedding.load('fasttext-en')
FasttextEmbedding(n_tokens=2000000, dim=300, skip_first=True, cache=True)
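
Other languages are selected through the lang parameter of the constructor. The sketch below assumes 'fr' is one of the 157 supported language codes (it is for fastText) and that the corresponding file is downloaded on first use.

>>> from supar.utils.embed import FasttextEmbedding
>>> # 'fr' selects the French fastText vectors; other language codes work the same way.
>>> fasttext_fr = FasttextEmbedding(lang='fr')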

GigaEmbedding#

class supar.utils.embed.GigaEmbedding(reload=False, *args, **kwargs)[source]#

Giga word embeddings for Chinese, trained on the Chinese Gigaword Third Edition with word2vec, used by Zhang et al. (2020a) and Zhang et al. (2020b).

Parameters:
  • reload (bool) – If True, forces a fresh download. Default: False.

Examples

>>> from supar.utils.embed import Embedding
>>> Embedding.load('giga-100')
GigaEmbedding(n_tokens=372846, dim=100, cache=True)

TencentEmbedding#

class supar.utils.embed.TencentEmbedding(dim: int = 100, large: bool = False, reload=False, *args, **kwargs)[source]#

Tencent word embeddings. The embeddings are trained with Directional Skip-Gram on large-scale text collected from news, webpages, and novels. 100-dimensional and 200-dimensional embeddings for over 12 million Chinese words are provided.

Parameters:
  • dim (int) – Which dimension of the embeddings to use. Currently 100 and 200 are available. Default: 100.

  • large (bool) – If True, uses the large version with a vocabulary of 12,287,933 words; otherwise uses the version with 2,000,000 words. Default: False.

  • reload (bool) – If True, forces a fresh download. Default: False.

Examples

>>> from supar.utils.embed import Embedding
>>> Embedding.load('tencent-100')
TencentEmbedding(n_tokens=2000000, dim=100, skip_first=True, cache=True)
>>> Embedding.load('tencent-100-large')
TencentEmbedding(n_tokens=12287933, dim=100, skip_first=True, cache=True)
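
The 200-dimensional variant can be requested through the dim parameter of the constructor; a minimal sketch, assuming the corresponding file is downloaded on first use:

>>> from supar.utils.embed import TencentEmbedding
>>> # dim=200 selects the 200-dimensional vectors; large=False keeps the 2,000,000-word vocab.
>>> tencent_200 = TencentEmbedding(dim=200, large=False)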