vocab
- class hanlp.common.transform.Vocab(idx_to_token: Optional[List[str]] = None, token_to_idx: Optional[Dict] = None, mutable=True, pad_token='<pad>', unk_token='<unk>')
Vocabulary base class which converts tokens to indices and vice versa.
- Parameters
idx_to_token – id to token mapping.
token_to_idx – token to id mapping.
mutable – True to allow adding new tokens, False to map OOV to unk.
pad_token – The token representing padding.
unk_token – The token representing out-of-vocabulary (OOV) tokens.
- __getitem__(key: Union[str, int, List]) -> Union[int, str, List]
Get the index/indices associated with a token or a list of tokens, or vice versa.
- Parameters
key – str for token(s) and int for index/indices.
- Returns
Associated indices or tokens.
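The dispatch on key type described above can be sketched without the library. The helper below is a hypothetical stand-in that works on a plain dict; the name lookup and the reverse-mapping approach are assumptions for illustration, not HanLP's implementation.

```python
from typing import Dict, List, Union

Key = Union[str, int, List[str], List[int]]


def lookup(token_to_idx: Dict[str, int], key: Key):
    """Hypothetical stand-in for Vocab.__getitem__: dispatch on the key type."""
    idx_to_token = {idx: token for token, idx in token_to_idx.items()}
    if isinstance(key, str):   # token -> index
        return token_to_idx[key]
    if isinstance(key, int):   # index -> token
        return idx_to_token[key]
    # A list of tokens or a list of indices maps element by element.
    return [lookup(token_to_idx, k) for k in key]


mapping = {"<pad>": 0, "<unk>": 1, "cat": 2, "dog": 3}
print(lookup(mapping, "cat"))           # 2
print(lookup(mapping, 3))               # dog
print(lookup(mapping, ["cat", "dog"]))  # [2, 3]
print(lookup(mapping, [0, 1]))          # ['<pad>', '<unk>']
```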
- add(token: str) -> int
Tries to add a token to the vocab and returns its id. If the token already exists, its id is returned and the vocab is not updated. If the vocab is locked, an assertion failure occurs.
- Parameters
token – A new or existing token.
- Returns
Its associated id.
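A minimal sketch of the add contract on a plain dict follows. The function add_token is hypothetical, and assigning the next free integer as the new id is an assumption made for illustration.

```python
from typing import Dict


def add_token(token_to_idx: Dict[str, int], token: str, mutable: bool = True) -> int:
    """Hypothetical stand-in for Vocab.add on a plain dict."""
    # A locked (immutable) vocab rejects any call to add.
    assert mutable, "cannot add to a locked vocab"
    if token not in token_to_idx:
        # Assumption: a new token takes the next free id.
        token_to_idx[token] = len(token_to_idx)
    return token_to_idx[token]


vocab = {"<pad>": 0, "<unk>": 1}
first = add_token(vocab, "cat")   # new token, the vocab grows
second = add_token(vocab, "cat")  # existing token, the vocab is unchanged
print(first, second, len(vocab))  # 2 2 3
```

Calling add_token(vocab, "dog", mutable=False) raises AssertionError, mirroring the locked case.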
- copy_from(item: dict)
Copy properties from a dict so that the vocab can be deserialized from JSON.
- Parameters
item – A dict holding token_to_idx.
- Returns
Itself.
- property first_token
The first token in this vocab.
- get_idx(token: str) -> int
Get the idx of a token. If the token is not in the vocab, it is added when the vocab is mutable (not locked); otherwise the id of UNK is returned.
- Parameters
token – A token.
- Returns
The id of that token.
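The sketch below illustrates the two lookup modes, following the constructor's description of mutable (True allows adding new tokens, False maps OOV to unk). It is a hypothetical plain-dict stand-in, not the library's code, and the next-free-id rule is an assumption.

```python
from typing import Dict


def get_idx(token_to_idx: Dict[str, int], token: str, mutable: bool,
            unk_token: str = "<unk>") -> int:
    """Hypothetical stand-in for Vocab.get_idx on a plain dict."""
    if token in token_to_idx:
        return token_to_idx[token]
    if mutable:
        # Unlocked: the unseen token is registered with the next free id.
        token_to_idx[token] = len(token_to_idx)
        return token_to_idx[token]
    # Locked: fall back to the id of the unknown token.
    return token_to_idx[unk_token]


vocab = {"<pad>": 0, "<unk>": 1}
print(get_idx(vocab, "cat", mutable=True))   # 2, "cat" is added
print(get_idx(vocab, "dog", mutable=False))  # 1, the id of "<unk>"
print("dog" in vocab)                        # False
```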
- get_token(idx: int) -> str
Get the token using its index.
- Parameters
idx – The index of a token.
- Returns
The token at that index.
- property locked
True indicates this vocab is locked.
- merge(other)
Merge this vocab with another vocab in place.
- Parameters
other (Vocab) – Another vocab.
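The documentation does not say how ids are reconciled during a merge. One plausible reading, shown here as an assumption, is that tokens missing from this vocab are appended with fresh ids while existing tokens keep theirs. The helper merge_into is hypothetical.

```python
from typing import Dict


def merge_into(target: Dict[str, int], other: Dict[str, int]) -> None:
    """Hypothetical in-place merge: tokens new to `target` get fresh ids."""
    for token in other:
        if token not in target:
            # Assumption: ids from `other` are ignored, the next free id is used.
            target[token] = len(target)


ours = {"<pad>": 0, "<unk>": 1, "cat": 2}
theirs = {"<pad>": 0, "<unk>": 1, "dog": 2, "cat": 3}
merge_into(ours, theirs)
print(ours)  # {'<pad>': 0, '<unk>': 1, 'cat': 2, 'dog': 3}
```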
- property pad_idx
The index of the PAD token.
- property safe_pad_token: str
Get the pad token safely. It always returns a token: the pad token, or the first token if pad is not present in the vocab.
- property safe_pad_token_idx: int
Get the idx of the pad token safely. It always returns an index: that of the pad token, or of the first token if pad is not present in the vocab.
- property safe_unk_token: str
Get the unk token safely. It always returns a token: the unk token, or the first token if unk is not present in the vocab.
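The fallback shared by the safe_* properties can be sketched as follows. The helper safe_token is hypothetical, and treating "first token" as the token with the smallest id is an assumption.

```python
from typing import Dict, Optional


def safe_token(token_to_idx: Dict[str, int], special: Optional[str]) -> str:
    """Hypothetical fallback: use `special` if present, else the first token."""
    if special is not None and special in token_to_idx:
        return special
    # Assumption: "first token" means the token with the smallest id.
    return min(token_to_idx, key=token_to_idx.get)


labels = {"B-PER": 0, "I-PER": 1, "O": 2}   # a label vocab without <pad>/<unk>
words = {"<pad>": 0, "<unk>": 1, "cat": 2}

print(safe_token(labels, "<pad>"))          # B-PER
print(safe_token(words, "<pad>"))           # <pad>
print(labels[safe_token(labels, "<pad>")])  # 0
```

A fallback of this kind is mainly useful for label vocabularies, which often carry no padding or unknown entry.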
- set_unk_as_safe_unk()
Set self.unk_token = self.safe_unk_token. It’s useful when the dev/test set contains OOV labels.
- summary(verbose=True) -> str
Get or print a summary of this vocab.
- Parameters
verbose – True to print the summary to stdout.
- Returns
Summary in text form.
- to_dict() -> dict
Convert this vocab to a dict so that it can be serialized to JSON.
- Returns
A dict.
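The intended pairing of to_dict and copy_from is a JSON round trip. The sketch below uses a hypothetical TinyVocab class that stores only token_to_idx; the real dict may carry more fields, so the exact layout here is an assumption.

```python
import json
from typing import Dict, Optional


class TinyVocab:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(self, token_to_idx: Optional[Dict[str, int]] = None):
        self.token_to_idx = dict(token_to_idx or {})

    def to_dict(self) -> dict:
        # Only JSON-friendly types, so json.dumps works directly.
        return {"token_to_idx": self.token_to_idx}

    def copy_from(self, item: dict) -> "TinyVocab":
        self.token_to_idx = dict(item["token_to_idx"])
        return self  # returning itself allows chaining


original = TinyVocab({"<pad>": 0, "<unk>": 1, "cat": 2})
text = json.dumps(original.to_dict())
restored = TinyVocab().copy_from(json.loads(text))
print(restored.token_to_idx == original.token_to_idx)  # True
```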
- property tokens
A set of all tokens in this vocab.
- property unk_idx
The index of the UNK token.