vocab

class hanlp.common.transform.Vocab(idx_to_token: List[str] = None, token_to_idx: Dict = None, mutable=True, pad_token='<pad>', unk_token='<unk>')

Vocabulary base class which converts tokens to indices and vice versa.

Parameters
  • idx_to_token – id to token mapping.

  • token_to_idx – token to id mapping.

  • mutable – True to allow adding new tokens, False to map OOV to unk.

  • pad_token – The token representing padding.

  • unk_token – The token representing OOV.
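The two mappings are inverses of each other. A minimal sketch with plain Python containers, not HanLP's implementation: the helper name `invert`, the contiguous ids starting at 0, and the choice of ids 0 and 1 for the special tokens are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def invert(token_to_idx: Dict[str, int]) -> List[Optional[str]]:
    """Build the id -> token list from a token -> id dict.

    Assumes ids are contiguous and start at 0.
    """
    idx_to_token: List[Optional[str]] = [None] * len(token_to_idx)
    for token, idx in token_to_idx.items():
        idx_to_token[idx] = token
    return idx_to_token


# Assumption: the special tokens occupy the first two ids.
token_to_idx = {'<pad>': 0, '<unk>': 1, 'hello': 2, 'world': 3}
idx_to_token = invert(token_to_idx)
print(idx_to_token)  # ['<pad>', '<unk>', 'hello', 'world']
```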

__getitem__(key: Union[str, int, List]) → Union[int, str, List]

Get the index/indices associated with a token or a list of tokens or vice versa.

Parameters

key – str for token(s) and int for index/indices.

Returns: Associated indices or tokens.
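The type-based dispatch described here can be sketched as a standalone function. This is an illustration under assumptions, not HanLP's code: `lookup` is a hypothetical helper, and unknown tokens simply raise `KeyError` here, whereas the real class may fall back to the unk token.

```python
from typing import Dict, List, Union

Key = Union[str, int, List[Union[str, int]]]


def lookup(token_to_idx: Dict[str, int], idx_to_token: List[str], key: Key):
    """Map a token to its index, an index to its token, or a list elementwise."""
    if isinstance(key, str):
        return token_to_idx[key]      # token -> index
    if isinstance(key, int):
        return idx_to_token[key]      # index -> token
    if isinstance(key, list):
        return [lookup(token_to_idx, idx_to_token, k) for k in key]
    raise TypeError(f'unsupported key type: {type(key).__name__}')


token_to_idx = {'<pad>': 0, '<unk>': 1, 'cat': 2}
idx_to_token = ['<pad>', '<unk>', 'cat']

print(lookup(token_to_idx, idx_to_token, 'cat'))             # 2
print(lookup(token_to_idx, idx_to_token, 2))                 # cat
print(lookup(token_to_idx, idx_to_token, ['cat', '<unk>']))  # [2, 1]
```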

add(token: str) → int

Tries to add a token to the vocab and returns its id. If the token is already present, its existing id is returned and the vocab is not updated. If the vocab is locked, an assertion failure occurs.

Parameters

token – A new or existing token.

Returns

Its associated id.
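A minimal sketch of these semantics on a plain dict. It is not HanLP's implementation: the standalone `add` function and its `locked` flag are hypothetical, and it assumes that the lock check happens before the lookup and that a new token receives the next free id.

```python
from typing import Dict


def add(token_to_idx: Dict[str, int], token: str, locked: bool = False) -> int:
    """Add `token` if it is new and return its id."""
    # Assumption: the lock is checked first, even for existing tokens.
    assert not locked, 'cannot add to a locked vocab'
    idx = token_to_idx.get(token)
    if idx is None:
        idx = len(token_to_idx)  # next free id
        token_to_idx[token] = idx
    return idx


vocab = {'<pad>': 0, '<unk>': 1}
first = add(vocab, 'cat')   # new token, gets id 2
second = add(vocab, 'cat')  # existing token, same id, vocab unchanged
print(first, second, len(vocab))  # 2 2 3
```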

copy_from(item: dict)

Copy properties from a dict so that the vocab can be JSON de-serialized.

Parameters

item – A dict holding token_to_idx

Returns

Itself.
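The de-serialization pattern can be sketched as follows. `VocabState` is a hypothetical stand-in rather than the HanLP class, and copying every key of the dict onto the object as an attribute is an assumption about how such a method might work.

```python
import json


class VocabState:
    """Hypothetical stand-in for a vocab; not the HanLP class."""

    def copy_from(self, item: dict):
        # Assumption: every key in the dict becomes an attribute.
        for key, value in item.items():
            setattr(self, key, value)
        return self  # returning itself allows chaining


saved = json.dumps({'token_to_idx': {'<pad>': 0, '<unk>': 1, 'cat': 2}})
restored = VocabState().copy_from(json.loads(saved))
print(restored.token_to_idx['cat'])  # 2
```

Because the method returns the object itself, construction and loading fit on one line.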

property first_token

The first token in this vocab.

get_idx(token: str) → int

Get the idx of a token. If it’s not there, it will be added to the vocab when the vocab is unlocked (mutable); otherwise the id of UNK will be returned.

Parameters

token – A token.

Returns

The id of that token.
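Following the description of the `mutable` constructor parameter (True allows adding new tokens, False maps OOV to unk), the lookup can be sketched on a plain dict. The standalone function below is a hypothetical illustration, not HanLP's code, and it assumes a new token receives the next free id.

```python
from typing import Dict, Optional


def get_idx(token_to_idx: Dict[str, int], token: str, mutable: bool,
            unk_token: str = '<unk>') -> Optional[int]:
    """Return the id of `token`, growing the vocab only while it is mutable."""
    idx = token_to_idx.get(token)
    if idx is None:
        if mutable:
            idx = len(token_to_idx)  # unseen token gets the next free id
            token_to_idx[token] = idx
        else:
            idx = token_to_idx.get(unk_token)  # OOV falls back to unk
    return idx


vocab = {'<pad>': 0, '<unk>': 1}
cat_idx = get_idx(vocab, 'cat', mutable=True)   # added while mutable
dog_idx = get_idx(vocab, 'dog', mutable=False)  # mapped to unk once locked
print(cat_idx, dog_idx)  # 2 1
```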

get_token(idx: int) → str

Get the token using its index.

Parameters

idx – The index to a token.

Returns

The token at that index.

lock()

Lock this vocab up so that it won’t accept new tokens.

Returns

Itself.

property locked

True indicates this vocab is locked.

lower()

Convert all tokens to lower case.

Returns

Itself.

merge(other)

Merge this vocab with another vocab in place.

Parameters

other (Vocab) – Another vocab.
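One plausible reading of an in-place merge, sketched on plain dicts. The details are assumptions rather than documented behavior: tokens already in the target keep their ids, and tokens found only in the other vocab are appended in that vocab's order. The function name `merge_into` is hypothetical.

```python
from typing import Dict


def merge_into(target: Dict[str, int], other: Dict[str, int]) -> None:
    """Add every token of `other` to `target`, modifying `target` in place."""
    for token in other:
        if token not in target:
            target[token] = len(target)  # append with the next free id


a = {'<pad>': 0, 'cat': 1}
b = {'<pad>': 0, 'dog': 1, 'cat': 2}
merge_into(a, b)
print(a)  # {'<pad>': 0, 'cat': 1, 'dog': 2}
```

Note that ids from `other` are not preserved in this sketch; only the token set is merged.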

property pad_idx

The index of PAD token.

property safe_pad_token

Get the pad token safely. It always returns a token: the pad token, or the first token if the pad token is not present in the vocab.

property safe_pad_token_idx

Get the idx of the pad token safely. It always returns an index: that of the pad token, or that of the first token if the pad token is not present in the vocab.

property safe_unk_token

Get the unk token safely. It always returns a token: the unk token, or the first token if the unk token is not present in the vocab.

set_unk_as_safe_unk()

Set self.unk_token = self.safe_unk_token. It’s useful when the dev/test set contains OOV labels.
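The fallback shared by the safe_* properties can be sketched as a single helper. This is an illustration only: `safe_token` is a hypothetical name, and "first token" is taken to mean the first key in insertion order, which is an assumption.

```python
from typing import Dict


def safe_token(token_to_idx: Dict[str, int], special: str) -> str:
    """Return `special` if the vocab contains it, else the first token."""
    if special in token_to_idx:
        return special
    # Assumption: the first token is the first key in insertion order.
    # An empty vocab would raise StopIteration here.
    return next(iter(token_to_idx))


# A label vocab built without pad/unk tokens, e.g. for tagging.
labels = {'B': 0, 'I': 1, 'O': 2}
fallback = safe_token(labels, '<unk>')
print(fallback, labels[fallback])  # B 0
```

With such a fallback, an OOV label in the dev/test set maps to a valid id instead of failing, which is the situation set_unk_as_safe_unk() addresses.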

summary(verbose=True) → str

Get or print a summary of this vocab.

Parameters

verbose – True to print the summary to stdout.

Returns

Summary in text form.

to_dict() → dict

Convert this vocab to a dict so that it can be JSON serialized.

Returns

A dict.

property tokens

A set of all tokens in this vocab.

property unk_idx

The index of UNK token.

unlock()

Unlock this vocab so that new tokens can be added in.

Returns

Itself.

update(tokens: Iterable[str]) → None

Update the vocab with these tokens by adding them to the vocab one by one.

Parameters

tokens (Iterable[str]) – An iterable of tokens.
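A brief sketch of adding tokens one by one from any iterable, again on a plain dict rather than the HanLP class. The function name `update_vocab` is hypothetical, and the assumption is that repeated tokens keep the id they received on first sight.

```python
from typing import Dict, Iterable


def update_vocab(token_to_idx: Dict[str, int], tokens: Iterable[str]) -> None:
    """Add each token in turn; tokens already present are left untouched."""
    for token in tokens:
        token_to_idx.setdefault(token, len(token_to_idx))


vocab = {'<pad>': 0, '<unk>': 1}
# Any iterable works, including a generator over a tokenized sentence.
update_vocab(vocab, (w for w in 'the cat saw the dog'.split()))
print(vocab)
# {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3, 'saw': 4, 'dog': 5}
```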