Processors - Basic

class MordinezNLP.processors.Basic.BasicProcessor(language: str = 'en')

The aim of the class is to clean up dirty texts so that they can be used in NLP.

get_special_tokens() → List[str]

Returns all of the special tokens used by the process function. This can be needed when training a SentencePiece tokenizer.

Returns

all of the special tokens used in the process function

Return type

List[str]
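For example, the returned tokens can be registered as user-defined symbols when training a SentencePiece tokenizer. A minimal sketch (the corpus path and the training options are placeholders, not part of MordinezNLP):

import sentencepiece as spm

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
special_tokens = bp.get_special_tokens()  # e.g. ['<url>', '<email>', ...]

# hypothetical corpus file and illustrative training options
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tokenizer',
    vocab_size=8000,
    user_defined_symbols=special_tokens,
)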

static load_language_days(language: str) → List[str]

Returns language-specific names of the days of the week.

Parameters

language (str) – a language in which to return the names of the days of the week

Returns

a list of day names in the specified language

Return type

List[str]

static load_language_months(language: str) → List[str]

Returns language-specific names of the months.

Parameters

language (str) – a language in which to return the month names

Returns

a list of months in the specified language

Return type

List[str]

static load_numerals(language: str) → List[str]

Builds language-specific numerals. Currently, numerals from 1 to 99 are supported.

Parameters

language (str) – a language in which the function will return numerals

Returns

a list of numerals in the specified language

Return type

List[str]

static load_ordinals(language: str) → List[str]

Builds language-specific ordinals. Currently, ordinals from 1 to 99 are supported.

Parameters

language (str) – a language in which the function will return ordinals

Returns

a list of ordinals in the specified language

Return type

List[str]
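These loaders are static, so they can be called without instantiating the class. A short sketch:

from MordinezNLP.processors import BasicProcessor

days = BasicProcessor.load_language_days('en')      # names of the days of the week
months = BasicProcessor.load_language_months('en')  # names of the months
numerals = BasicProcessor.load_numerals('en')       # numerals from 1 to 99
ordinals = BasicProcessor.load_ordinals('en')       # ordinals from 1 to 99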

pos_tag_data(post_processed_data: List[str], replace_with_number: str, tokenizer_threads: int, tokenizer_batch_size: int, pos_batch_size: int) → List[str]

A helper function that post-processes number tags and replaces the corresponding tokens with a special token. It also uses SpaCy tokenization to return the “normal” form of tokens.

Long story short: this function will parse the input “There wasn’t six apples” to “There was not <number> apples”.

Parameters
  • post_processed_data (List[str]) – a list of post-processed texts

  • replace_with_number (str) – a special token to replace numbers with

  • tokenizer_threads (int) – How many threads to use for tokenization

  • tokenizer_batch_size (int) – Batch size for tokenization

  • pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!

Returns

post-processed texts

Return type

List[str]
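Although this is a helper for the main pipeline, it can be called directly. A minimal sketch (the thread count and batch sizes are illustrative only):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

tagged = bp.pos_tag_data(
    post_processed_data=["There wasn't six apples"],
    replace_with_number='<number>',
    tokenizer_threads=4,
    tokenizer_batch_size=60,
    pos_batch_size=700,  # keep this small when CUDA is available
)
# expected to resemble: ["There was not <number> apples"]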

process(text_to_process: Union[str, List[str]], pre_rules: List[Callable] = [], post_rules: List[Callable] = [], language: str = 'en', fix_unicode: bool = True, lower: bool = False, no_line_breaks: bool = False, no_urls: bool = True, no_emails: bool = True, no_phone_numbers: bool = True, no_numbers: bool = True, no_digits: bool = False, no_currency_symbols: bool = True, no_punct: bool = False, no_math: bool = True, no_dates: bool = True, no_multiple_chars: bool = True, no_lists: bool = True, no_brackets: bool = True, replace_with_url: str = '<url>', replace_with_email: str = '<email>', replace_with_phone_number: str = '<phone>', replace_with_number: str = '<number>', replace_with_digit: str = '0', replace_with_currency_symbol: str = '<currency>', replace_with_date: str = '<date>', replace_with_bracket: str = '<bracket>', replace_more: str = '<more>', replace_less: str = '<less>', use_pos_tagging: bool = True, list_processing_threads: int = 8, tokenizer_threads: int = 8, tokenizer_batch_size: int = 60, pos_batch_size: int = 7000) → Union[str, List[str]]

Main text processing function. It mainly uses regexes to find specified patterns in texts and replaces them with a defined custom token, or fixes parts that are not valuable for humans and machines.

The function also enables users to set pre_rules and post_rules. You can use those lists of Callables to add pre- and post-processing rules. A good use case is processing CommonCrawl Reddit data, where each page has the same schema (headers, navigation bars etc.). In such a case you can use pre_rules to filter them out and then pass the text into the process function pipeline. Also feel free to add post_rules to match other cases which are not handled here.
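For example, pre_rules and post_rules could look like this (the boilerplate string is a made-up placeholder):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

# pre rule: strip a known page header before the main pipeline runs
pre_rules = [lambda t: t.replace('Press J to jump to the feed.', '')]

# post rule: collapse leftover runs of whitespace afterwards
post_rules = [lambda t: ' '.join(t.split())]

output = bp.process(
    "Press J to jump to the feed. Some   real content.",
    pre_rules=pre_rules,
    post_rules=post_rules,
    language='en',
)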

Depending on its parameters, the function can replace specified types of data.

Currently supported entities:
  • dates,

  • brackets,

  • simple math strings,

  • phone numbers,

  • emails,

  • urls,

  • numbers and digits,

  • multiple characters in single words.

Dates

Examples of dates matched in strings for English (a usage sketch follows this list):
  • 1.02.2030

  • 1st of December 3990

  • first of DecEmber 1233

  • first december 2020

  • early 20s

  • 01.03.4223

  • 11-33-3222

  • 2020s

  • Friday 23 October

  • late 90s

  • in 20s
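A sketch of date replacement (the exact output is an assumption based on the defaults):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

out = bp.process("Friday 23 October was a great day", language='en')
# with no_dates=True (the default) the result should resemble:
# '<date> was a great day'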

Brackets

Examples of brackets matched in strings for English (a usage sketch follows this list):
  • [tryrty]

  • (other text)
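A sketch of bracket replacement (the exact output is an assumption based on the no_brackets and replace_with_bracket defaults):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

out = bp.process("some text [tryrty] and (other text)", language='en')
# with no_brackets=True (the default) the brackets should be replaced
# with the <bracket> token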

Simple math strings

Examples of simple math strings for English:
  • 2 > 3

  • 4<6

  • 4>=4

  • 5<= 4

If you decide to use no_math=False, then such cases will be processed by other functions. It means that one function will remove the math operator (<, >, <=, >=) and another will replace the numbers with a special token.
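A sketch of how math strings are handled with the defaults (the exact output is an assumption):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

out = bp.process("2 > 3 and 4<6", language='en')
# with no_math=True (the default), '>' and '<' should end up as the
# <more> and <less> tokens and the numbers as <number>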

Multiple characters in single words

The table below shows strings before and after using the multiple-characters-in-a-single-word processing function:

Before               After
‘EEEEEEEEEEEE’       ‘’
‘supeeeeeer’         ‘super’
‘EEEE<number>!’      ‘’
‘suppppprrrrrpper’   ‘suprpper’

Processing multiple characters is extremely useful when processing CommonCrawl Reddit data.

Lists replacement

Lists in a text with a leading “-” or “>” for each item can be parsed into simpler, more understandable text. For example, the list:

My_list:
- item 1
- item 2,
-item 3

Will be parsed to:

My_list: item 1, item 2, item 3.

Use the no_lists argument to enable this feature.
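A sketch of list parsing based on the example above:

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

text = "My_list:\n- item 1\n- item 2,\n-item 3"

out = bp.process(text, language='en', no_lists=True)
# expected to resemble: 'My_list: item 1, item 2, item 3.'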

Supported languages

Fully supported languages    Partially supported languages
English                      German

Be careful

Please don’t change the special tokens passed to the function, because doing so can cause strings to be processed incorrectly. This will be fixed in future releases (TODO: fix the regexes in __init__ for special tokens).

Parameters
  • text_to_process (Union[str, List[str]]) – An input text or a list of texts (for multiprocess processing) to be processed by the function

  • pre_rules (List[Callable]) – A list of lambdas that are applied before main preprocessing rules

  • post_rules (List[Callable]) – A list of lambdas that are applied after pre_rules and function processing rules

  • language (str) – the language of the input text

  • fix_unicode (bool) – replace all non-unicode characters with unicode ones

  • lower (bool) – lowercase all characters

  • no_line_breaks (bool) – fully strip line breaks as opposed to only normalizing them

  • no_urls (bool) – replace all URLs with a special token

  • no_emails (bool) – replace all email addresses with a special token

  • no_phone_numbers (bool) – replace all phone numbers with a special token

  • no_numbers (bool) – replace all numbers with a special token

  • no_digits (bool) – replace all digits with a special token

  • no_currency_symbols (bool) – replace all currency symbols with a special token

  • no_punct (bool) – remove punctuations

  • no_math (bool) – remove >= <= in math strings

  • no_dates (bool) – replace date strings in the input text with a special token, e.g. ‘early 80s’ -> ‘<date>’

  • no_lists (bool) – replace all lists found in texts (see Lists replacement above)

  • no_brackets (bool) – replace brackets: ‘[‘, ‘]’, ‘(‘, ‘)’

  • no_multiple_chars (bool) – reduce repeated characters in a word to single ones, e.g. ‘supeeeeeer’ -> ‘super’

  • replace_with_url (str) – a special token used to replace urls

  • replace_with_email (str) – a special token used to replace emails

  • replace_with_phone_number (str) – a special token used to replace phone numbers

  • replace_with_number (str) – a special token used to replace numbers

  • replace_with_digit (str) – a special token used to replace digits

  • replace_with_currency_symbol (str) – a special token used to replace currency symbol

  • replace_with_date (str) – a special token used to replace dates

  • replace_with_bracket (str) – a special token used to replace brackets

  • replace_more (str) – a special token used to replace the more ‘>’ and more-or-equal ‘>=’ symbols in math texts

  • replace_less (str) – a special token used to replace the less ‘<’ and less-or-equal ‘<=’ symbols in math texts

  • use_pos_tagging (bool) – if True, the function will use StanzaNLP & SpaCy for POS tagging and token normalization

  • list_processing_threads (int) – How many threads to use to process a List[str] given as the input to this function

  • tokenizer_threads (int) – How many threads to use during tokenization, this value is passed to the SpaCy pipeline.

  • tokenizer_batch_size (int) – Batch size for tokenization, passed to the SpaCy pipeline

  • pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!

Returns

Post-processed text

Return type

Union[str, List[str]]

process_multiple_characters(text_to_process: str) → str

Detects repeated characters in a word and replaces them with a single one.

Before               After
‘EEEEEEEEEEEE!’      ‘’
‘supeeeeeer’         ‘super’
‘EEEE<number>!’      ‘’
‘suppppprrrrrpper’   ‘suprpper’

Parameters

text_to_process (str) – An input text to process

Returns

Text with duplicated characters in each word removed

Return type

str
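A quick sketch of calling this method directly:

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
print(bp.process_multiple_characters('supeeeeeer'))  # -> 'super'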

Example usage:

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()

# process a single string; the cleaned text is returned
post_process = bp.process("this is my text to process by a function", language='en')
print(post_process)
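The process function also accepts a list of texts, which is processed using multiple threads. Continuing the example above (the token placement in the output is an assumption):

texts = [
    "Contact me at some.name@example.com",
    "The meeting is on Friday 23 October",
]
post_process = bp.process(texts, language='en', list_processing_threads=4)
# returns a List[str] with emails and dates replaced by the
# <email> and <date> tokens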