tok
Tokenization.
- class hanlp.components.mtl.tasks.tok.tag_tok.TaggingTokenization(trn: Optional[str] = None, dev: Optional[str] = None, tst: Optional[str] = None, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, dependencies: Optional[str] = None, scalar_mix: Optional[hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder] = None, use_raw_hidden_states=False, lr=0.001, separate_optimizer=False, cls_is_bos=True, sep_is_eos=True, delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, tagging_scheme='BMES', crf=False, token_key='token', dict_force: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]]] = None, dict_combine: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]]] = None, **kwargs)
Tokenization which casts a chunking problem into a tagging problem. This task has to create batches of tokens containing both [CLS] and [SEP], since it is usually the first task and later tasks might need them.
- Parameters
trn – Path to training set.
dev – Path to dev set.
tst – Path to test set.
sampler_builder – A builder which builds a sampler.
dependencies – Its dependencies on other tasks.
scalar_mix – A builder which builds a ScalarMixWithDropout object.
use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.
lr – Learning rate for this task.
separate_optimizer – Use customized separate optimizer for this task.
cls_is_bos – True to treat the first token as BOS.
sep_is_eos – True to treat the last token as EOS.
delimiter – Delimiter used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.
tagging_scheme – Either BMES or BI.
crf – True to enable CRF (Lafferty et al. 2001).
token_key – The key to tokens in dataset. This should always be set to token in MTL.
**kwargs – Not used.
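To illustrate how a chunking problem can be cast into a tagging problem under the two values of tagging_scheme, here is a minimal sketch in plain Python. The helper names tokens_to_tags and tags_to_tokens are hypothetical and are not part of HanLP. The sketch tags individual characters, which is an assumption made for readability; the actual task may operate on different units.

```python
from typing import List


def tokens_to_tags(tokens: List[str], scheme: str = "BMES") -> List[str]:
    """Label every character of every token with a position tag."""
    tags = []
    for token in tokens:
        if not token:
            continue  # empty tokens have no characters to tag
        if scheme == "BMES":
            if len(token) == 1:
                tags.append("S")  # single-character token
            else:
                # Begin, zero or more Middle, End
                tags.extend(["B"] + ["M"] * (len(token) - 2) + ["E"])
        elif scheme == "BI":
            # Begin, then Inside for every remaining character
            tags.extend(["B"] + ["I"] * (len(token) - 1))
        else:
            raise ValueError(f"Unknown tagging scheme: {scheme}")
    return tags


def tags_to_tokens(text: str, tags: List[str]) -> List[str]:
    """Rebuild tokens from per-character tags; B and S open a new token."""
    tokens = []
    for char, tag in zip(text, tags):
        if tag in ("B", "S") or not tokens:
            tokens.append(char)
        else:
            tokens[-1] += char
    return tokens


tokens = ["商品", "和", "服务行业"]
bmes = tokens_to_tags(tokens, "BMES")
bi = tokens_to_tags(tokens, "BI")
print(bmes)  # ['B', 'E', 'S', 'B', 'M', 'M', 'E']
print(bi)    # ['B', 'I', 'B', 'B', 'I', 'I', 'I']
print(tags_to_tokens("".join(tokens), bmes))  # ['商品', '和', '服务行业']
```

Both schemes decode to the same tokens here; BMES simply gives the tagger more positional labels to predict.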
- build_criterion(model=None, **kwargs)
Implement this method to build criterion (loss function).
- Parameters
**kwargs – The subclass decides the method signature.
- build_dataloader(data, transform: Optional[hanlp.common.transform.TransformList] = None, training=False, device=None, logger: Optional[logging.Logger] = None, cache=False, gradient_accumulation=1, **kwargs) torch.utils.data.dataloader.DataLoader
Build a dataloader for training or evaluation.
- Parameters
data – Either a path or a list of samples.
transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].
training – Whether this method is called on training set.
device – The device dataloader is intended to work with.
logger – Logger for printing message indicating progress.
cache – Whether the dataloader should be cached.
gradient_accumulation – Gradient accumulation to be passed to sampler builder.
**kwargs – Additional experimental arguments.
- build_metric(**kwargs)
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
- build_model(encoder_size, training=True, **kwargs) torch.nn.modules.module.Module
Build model.
- Parameters
training – True if called during training.
**kwargs – **self.config.
- build_samples(inputs, cls_is_bos=False, sep_is_eos=False)
Build samples for this task. Called when this task is the first task. Default behaviour is to take inputs as list of tokens and put these tokens into a dict per sample.
- Parameters
inputs – Inputs from users, usually a list of lists of tokens.
cls_is_bos – Insert BOS to the head of each sentence.
sep_is_eos – Append EOS to the tail of each sentence.
- Returns
List of samples.
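The default behaviour described above can be sketched roughly as follows. This is an illustration only, not HanLP's code: the function name build_samples_sketch, the token_key argument and the BOS/EOS placeholder strings are assumptions chosen for the example.

```python
from typing import Dict, List

# Placeholder marker strings; the real BOS/EOS symbols are an assumption here.
BOS = "[CLS]"
EOS = "[SEP]"


def build_samples_sketch(
    inputs: List[List[str]],
    cls_is_bos: bool = False,
    sep_is_eos: bool = False,
    token_key: str = "token",
) -> List[Dict[str, List[str]]]:
    """Wrap each list of tokens in a dict, optionally adding BOS and EOS."""
    samples = []
    for tokens in inputs:
        tokens = list(tokens)  # copy so the caller's list is not modified
        if cls_is_bos:
            tokens.insert(0, BOS)
        if sep_is_eos:
            tokens.append(EOS)
        samples.append({token_key: tokens})
    return samples


samples = build_samples_sketch([["商品", "和", "服务"]], cls_is_bos=True, sep_is_eos=True)
print(samples)  # [{'token': ['[CLS]', '商品', '和', '服务', '[SEP]']}]
```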
- build_tokenizer(tokenizer: hanlp.transform.transformer_tokenizer.TransformerSequenceTokenizer)
Build a transformer tokenizer for this task.
- Parameters
tokenizer – A tokenizer which is shared but can be adjusted to provide per-task settings.
- Returns
A TransformerSequenceTokenizer.
- property dict_combine: hanlp_trie.dictionary.DictInterface
The low-priority dictionary, which performs longest-prefix-matching on model predictions and combines them.
Examples
>>> tok.dict_combine = {'和服', '服务行业'}
>>> tok("商品和服务行业")  # '和服' is not in the original results ['商品', '和', '服务']. '服务', '行业' are combined to '服务行业'
['商品', '和', '服务行业']
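One way to picture the combining step is the plain-Python sketch below. It is not HanLP's implementation, and the helper combine_with_dict is hypothetical. The sketch assumes the model's own prediction for the sentence is ['商品', '和', '服务', '行业'], and merges runs of adjacent tokens whose concatenation is a dictionary entry, preferring the longest run at each position.

```python
from typing import List, Set


def combine_with_dict(tokens: List[str], dictionary: Set[str]) -> List[str]:
    """Merge adjacent predicted tokens that together form a dictionary entry."""
    merged = []
    i = 0
    while i < len(tokens):
        best_end = i + 1  # default: keep the token unchanged
        candidate = tokens[i]
        for j in range(i + 1, len(tokens)):
            candidate += tokens[j]
            if candidate in dictionary:
                best_end = j + 1  # remember the longest match so far
        merged.append("".join(tokens[i:best_end]))
        i = best_end
    return merged


predicted = ["商品", "和", "服务", "行业"]  # assumed model output
combined = combine_with_dict(predicted, {"和服", "服务行业"})
print(combined)  # ['商品', '和', '服务行业']
```

Because matching only happens on boundaries the model already predicted, '和服' can never be produced in this sketch: it would require cutting the predicted token '服务' in half.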
- property dict_force: hanlp_trie.dictionary.DictInterface
The high-priority dictionary, which performs longest-prefix-matching on inputs to split them into two subsets:
spans containing no keywords, which are then fed into the tokenizer for further tokenization.
keywords, which will be output without further tokenization.
Caution
Longest-prefix-matching NEVER guarantees the presence of any keywords. Abuse of dict_force can lead to low quality results. For more details, refer to this book.
Examples
>>> tok.dict_force = {'和服', '服务行业'}  # Force '和服' and '服务行业' by longest-prefix-matching
>>> tok("商品和服务行业")
['商品', '和服', '务行业']
>>> tok.dict_force = {'和服务': ['和', '服务']}  # Force '和服务' to be tokenized as ['和', '服务']
>>> tok("商品和服务行业")
['商品', '和', '服务', '行业']
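The first example shows why a keyword is not guaranteed to appear in the output. The following sketch reproduces the splitting step in plain Python under the assumption that the input is scanned left to right and that, at each offset, the longest keyword starting there wins. The function split_by_keywords is hypothetical and does not use hanlp_trie.

```python
from typing import Iterable, List, Tuple


def split_by_keywords(text: str, keywords: Iterable[str]) -> List[Tuple[str, bool]]:
    """Split text into (piece, is_keyword) pairs by longest-prefix-matching."""
    # Try longer keywords first so the longest prefix wins at each offset.
    by_length = sorted(set(keywords), key=len, reverse=True)
    pieces = []
    free_start = 0  # start of the current span containing no keywords
    i = 0
    while i < len(text):
        match = next((k for k in by_length if k and text.startswith(k, i)), None)
        if match is None:
            i += 1
            continue
        if free_start < i:
            pieces.append((text[free_start:i], False))  # left for the tokenizer
        pieces.append((match, True))  # output without further tokenization
        i += len(match)
        free_start = i
    if free_start < len(text):
        pieces.append((text[free_start:], False))
    return pieces


pieces = split_by_keywords("商品和服务行业", {"和服", "服务行业"})
print(pieces)  # [('商品', False), ('和服', True), ('务行业', False)]
```

Once '和服' has been matched at offset 2, the scan resumes at '务', so '服务行业' never gets a chance to match even though it is in the dictionary.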
- input_is_flat(data) bool
Check whether the data is flat (meaning that it’s only a single sample, not even batched).
- Returns
True to indicate the input data is flat.
- Return type
bool
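As a rough sketch of what such a check could look like for a tokenizer, assume that a single sample is one raw string and that a batch is a list of strings. Both the function name input_is_flat_sketch and that assumption are illustrative, not taken from HanLP's source.

```python
def input_is_flat_sketch(data) -> bool:
    """Return True when data is a single raw sentence rather than a batch."""
    # Assumption: one sample is a str, and a batch is a list of str.
    return isinstance(data, str)


print(input_is_flat_sketch("商品和服务行业"))      # True
print(input_is_flat_sketch(["商品和服务行业"]))    # False
```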
- transform_batch(batch: Dict[str, Any], results: Optional[Dict[str, Any]] = None, cls_is_bos=False, sep_is_eos=False) Dict[str, Any]
This method is overridden to honor the zero-indexed tokens used in the custom dict. Although for a tokenizer cls_is_bos = sep_is_eos = True, its tokens don't contain [CLS] or [SEP]. This behaviour is adopted from the early versions and is better kept to avoid migration efforts.
- Parameters
batch – A batch of samples.
results – Predicted results from other tasks which might be useful for this task to utilize. Say a dep task uses both token and pos as features, then it will need both tok and pos results to make a batch.
cls_is_bos – First token in this batch is BOS.
sep_is_eos – Last token in this batch is EOS.
- Returns
A batch.