Pipelines - PartOfSpeech

class MordinezNLP.pipelines.PartOfSpeech.PartOfSpeech(nlp: spacy.language.Language, language: str = 'en')

The aim of this class is to tag each token (coming from MordinezNLP processors) with its POS tag.

process(texts: List[str], tokenizer_threads: int = 8, tokenizer_batch_size: int = 50, pos_batch_size: int = 3000, pos_replacement_list: Optional[Dict[str, str]] = None, token_replacement_list: Optional[Dict[str, str]] = None, return_docs: bool = False, return_string_tokens: bool = False) → Union[Generator[Tuple[List[Union[spacy.tokens.token.Token, str]], List[str]], None, None], Generator[Tuple[List[List[Union[spacy.tokens.token.Token, str]]], List[List[str]]], None, None]]

Main processing function. The first step is to tokenize the list of input texts into sentences and then into tokens. The tokenized input is then passed to StanzaNLP.

The List[str] object which comes as an input is a list of docs to process. Each item in the list is a document (SpaCy pipeline logic). You can choose whether the output keeps the documents[sentences[tokens]] structure or is flattened to sentences[tokens] (removing the documents layer).

Sometimes you may want to force the POS tagger to assign a specific POS tag to a given token, or to substitute one POS tag for another. For such cases you can use pos_replacement_list and token_replacement_list. You can import sample token and POS replacement lists from MordinezNLP.utils.pos_replacement_list and MordinezNLP.utils.token_replacement_list, or build your own, as in the sketch below.
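A minimal sketch of custom replacement lists (the dict contents here are hypothetical, not the shipped sample lists; each key is replaced by its value):

pos_replacement = {
    'PROPN': 'NOUN',  # hypothetical: tag proper nouns as plain nouns
}
token_replacement = {
    '<number>': 'number',  # hypothetical: replace a special token with a plain word
}

# assuming a pos_tagger and docs_to_tag built as in the example at the end of this section
pos_output = pos_tagger.process(
    docs_to_tag,
    pos_replacement_list=pos_replacement,
    token_replacement_list=token_replacement,
)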

If you want to access SpaCy's special attributes on each token, pass False as the value of the return_string_tokens argument.

Each token parsed by the SpaCy tokenizer will be converted to its normalized form. For example, each n’t will be replaced by not.
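A minimal illustration of this normalization, using plain spaCy and independent of MordinezNLP:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I don't know")
# norm_ holds the normalized form; "n't" is normalized to "not"
print([token.norm_ for token in doc])  # ['i', 'do', 'not', 'know']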

Parameters
  • texts (List[str]) – input texts; each item in the list is a document (SpaCy pipeline logic)

  • tokenizer_threads (int) – number of threads to use for SpaCy tokenization

  • tokenizer_batch_size (int) – batch size for the SpaCy tokenizer

  • pos_batch_size (int) – batch size for the Stanza POS tagger (if enabled)

  • pos_replacement_list (Optional[Dict[str, str]]) – if not None, the function replaces each POS tag that appears as a key in the dict with its value

  • token_replacement_list (Optional[Dict[str, str]]) – if not None, the function replaces each token that appears as a key in the dict with its value

  • return_docs (bool) – if True, the function keeps the “documents” layer in the output

  • return_string_tokens (bool) – the function can return tokens as SpaCy Token objects (if you need access to token data such as norm_) or as plain strings; if True, string tokens are returned

Returns

Union[Generator[Tuple[List[Union[Token, str]], List[str]], None, None], Generator[Tuple[List[List[Union[Token, str]]], List[List[str]]], None, None]]: a list of docs (if return_docs is set) with lists of sentences, each containing a list of tokens and their POS tags.

Example usage:

from MordinezNLP.pipelines import PartOfSpeech
from MordinezNLP.tokenizers import spacy_tokenizer
from spacy.language import Language
import spacy

nlp: Language = spacy.load("en_core_web_sm")
nlp.tokenizer = spacy_tokenizer(nlp)

docs_to_tag = [
    'Hello today is <date>, tomorrow it will be <number> degrees of celsius.'
]

pos_tagger = PartOfSpeech(
    nlp,
    'en'
)

pos_output = pos_tagger.process(
    docs_to_tag,
    tokenizer_threads=4,
    tokenizer_batch_size=30,
    return_docs=True
)
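
Since process returns a generator, the output is consumed by iterating over it. A minimal sketch of reading the result above (return_docs=True, so each yielded item corresponds to one input document; shapes follow the Returns section):

for doc_tokens, doc_pos_tags in pos_output:
    # doc_tokens: list of sentences, each a list of SpaCy Token objects
    # (pass return_string_tokens=True to get plain strings instead)
    # doc_pos_tags: list of sentences, each a list of POS tag strings
    for sentence_tokens, sentence_tags in zip(doc_tokens, doc_pos_tags):
        print(list(zip(sentence_tokens, sentence_tags)))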