Embedding#
- class supar.utils.embed.Embedding(path: str, unk: str | None = None, skip_first: bool = False, cache: bool = True, sep: str = ' ', **kwargs)[source]#
Defines a container object for holding pretrained embeddings. This object is callable and behaves like torch.nn.Embedding. For huge files, this object supports lazy loading, seeking to retrieve vectors from the disk on the fly if necessary.
- Parameters:
path (str) – Path to the embedding file or short name registered in supar.utils.embed.PRETRAINED.
unk (Optional[str]) – The string token used to represent OOV tokens. Default: None.
skip_first (bool) – If True, skips the first line of the embedding file. Default: False.
cache (bool) – If True, instead of loading entire embeddings into memory, seeks to load vectors from the disk once called. Default: True.
sep (str) – Separator used by the embedding file. Default: ' '.
Examples
>>> import torch
>>> import torch.nn as nn
>>> from supar.utils.embed import Embedding
>>> glove = Embedding.load('glove-6b-100')
>>> glove
GloVeEmbedding(n_tokens=400000, dim=100, unk=unk, cache=True)
>>> fasttext = Embedding.load('fasttext-en')
>>> fasttext
FasttextEmbedding(n_tokens=2000000, dim=300, skip_first=True, cache=True)
>>> giga = Embedding.load('giga-100')
>>> giga
GigaEmbedding(n_tokens=372846, dim=100, cache=True)
>>> indices = torch.tensor([glove.vocab[i.lower()] for i in ['She', 'enjoys', 'playing', 'tennis', '.']])
>>> indices
tensor([ 67, 8371, 697, 2140, 2])
>>> glove(indices).shape
torch.Size([5, 100])
>>> glove(indices).equal(nn.Embedding.from_pretrained(glove.vectors)(indices))
True
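Besides the registered short names above, the constructor documented here can also be pointed at a local embedding file. The sketch below is illustrative only: data/custom.vec, its header line (hence skip_first=True), and its '<unk>' row are hypothetical placeholders, not resources shipped with the library.
>>> import torch
>>> from supar.utils.embed import Embedding
>>> # 'data/custom.vec' is a hypothetical whitespace-separated file whose first line is a header
>>> custom = Embedding('data/custom.vec', unk='<unk>', skip_first=True, cache=False)
>>> # look up indices for tokens assumed to be present in the file; OOV handling is left to the caller
>>> indices = torch.tensor([custom.vocab[w] for w in ['some', 'words']])
>>> vectors = custom(indices)  # a tensor of shape [2, dim], where dim is read from the file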
GloVeEmbedding#
- class supar.utils.embed.GloVeEmbedding(src: str = '6B', dim: int = 100, reload=False, *args, **kwargs)[source]#
GloVe: Global Vectors for Word Representation. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
- Parameters:
src (str) – Size of the source data used for training. Default: '6B'.
dim (int) – Which dimension of the embeddings to use. Default: 100.
reload (bool) – If True, forces a fresh download. Default: False.
Examples
>>> from supar.utils.embed import Embedding
>>> Embedding.load('glove-6b-100')
GloVeEmbedding(n_tokens=400000, dim=100, unk=unk, cache=True)
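A common use of the loaded vectors is to initialize a model's embedding layer. The sketch below relies only on the vectors attribute shown in the Embedding example above; freeze=False makes the copied weights trainable.
>>> import torch.nn as nn
>>> from supar.utils.embed import Embedding
>>> glove = Embedding.load('glove-6b-100')
>>> embed = nn.Embedding.from_pretrained(glove.vectors, freeze=False)  # trainable copy of the GloVe weights
>>> embed.num_embeddings, embed.embedding_dim
(400000, 100)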
FasttextEmbedding#
- class supar.utils.embed.FasttextEmbedding(lang: str = 'en', reload=False, *args, **kwargs)[source]#
Fasttext word embeddings for 157 languages, trained using CBOW, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
- Parameters:
lang (str) – Language code. Default: 'en'.
reload (bool) – If True, forces a fresh download. Default: False.
Examples
>>> from supar.utils.embed import Embedding
>>> Embedding.load('fasttext-en')
FasttextEmbedding(n_tokens=2000000, dim=300, skip_first=True, cache=True)
GigaEmbedding#
- class supar.utils.embed.GigaEmbedding(reload=False, *args, **kwargs)[source]#
Giga word embeddings for Chinese, trained on the Chinese Gigaword Third Edition corpus using word2vec; used by Zhang et al. (2020a) and Zhang et al. (2020b).
- Parameters:
reload (bool) – If True, forces a fresh download. Default: False.
Examples
>>> from supar.utils.embed import Embedding
>>> Embedding.load('giga-100')
GigaEmbedding(n_tokens=372846, dim=100, cache=True)
TencentEmbedding#
- class supar.utils.embed.TencentEmbedding(dim: int = 100, large: bool = False, reload=False, *args, **kwargs)[source]#
Tencent word embeddings. The embeddings are trained with Directional Skip-Gram on large-scale text collected from news, webpages, and novels. 100-dimensional and 200-dimensional embeddings for over 12 million Chinese words are provided.
- Parameters:
dim (int) – Which dimension of the embeddings to use. Default: 100.
large (bool) – If True, uses the large version covering over 12 million words. Default: False.
reload (bool) – If True, forces a fresh download. Default: False.
Examples
>>> from supar.utils.embed import Embedding
>>> Embedding.load('tencent-100')
TencentEmbedding(n_tokens=2000000, dim=100, skip_first=True, cache=True)
>>> Embedding.load('tencent-100-large')
TencentEmbedding(n_tokens=12287933, dim=100, skip_first=True, cache=True)