Tokenizers - SpacyTokenizer
A custom spaCy tokenizer that keeps the special tokens (tags) produced by BasicProcessor intact.
Out of the box, the spaCy tokenizer splits such tags into separate tokens, for example “<date>” becomes “< date >”. This function changes that behavior so that each tag stays a single token.
- param nlp: A Language object from spaCy
- type nlp: spacy.language.Language
- returns: A spaCy tokenizer
- rtype: spacy.tokenizer.Tokenizer
Example usage:
from MordinezNLP.tokenizers import spacy_tokenizer
import spacy
from spacy.language import Language

# Load the small English pipeline
nlp: Language = spacy.load("en_core_web_sm")
# Swap in the tokenizer that keeps tags such as <date> as single tokens
nlp.tokenizer = spacy_tokenizer(nlp)
test_doc = nlp('Hello today is <date>, tomorrow it will be <number> degrees of Celsius.')
for token in test_doc:
    print(token)
# output
# Hello
# today
# is
# <date>
# ,
# tomorrow
# it
# will
# be
# <number>
# degrees
# of
# Celsius
# .
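The difference between splitting a tag and keeping it whole can be illustrated without spaCy. The sketch below is not how spacy_tokenizer is implemented and does not reproduce spaCy's tokenization rules; it is a standard-library approximation, and the names NAIVE_PATTERN, TAG_AWARE_PATTERN, naive_tokenize and tag_aware_tokenize are hypothetical, introduced only for this illustration. It assumes tags have the form of a word wrapped in angle brackets, such as <date> or <number>.

```python
import re
from typing import List

# Hypothetical naive pattern: every punctuation character becomes its own
# token, so "<date>" falls apart into "<", "date", ">".
NAIVE_PATTERN = re.compile(r"\w+|[^\w\s]")

# Hypothetical tag-aware pattern: a whole "<tag>" is tried first, and only
# then words and single punctuation characters.
TAG_AWARE_PATTERN = re.compile(r"<\w+>|\w+|[^\w\s]")


def naive_tokenize(text: str) -> List[str]:
    """Split text into words and single punctuation characters."""
    return NAIVE_PATTERN.findall(text)


def tag_aware_tokenize(text: str) -> List[str]:
    """Same as naive_tokenize, but keep "<tag>" as a single token."""
    return TAG_AWARE_PATTERN.findall(text)


sample = "today is <date>, tomorrow"
print(naive_tokenize(sample))
# ['today', 'is', '<', 'date', '>', ',', 'tomorrow']
print(tag_aware_tokenize(sample))
# ['today', 'is', '<date>', ',', 'tomorrow']
```

Because regular-expression alternation tries the alternatives from left to right, placing the tag alternative first is what keeps the angle brackets attached to the tag name, while the trailing comma is still split off.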