Processors - Basic¶
-
class
MordinezNLP.processors.Basic.
BasicProcessor
(language: str = 'en')¶ The aim of this class is to clean up dirty, real-world texts and make them usable for NLP.
-
get_special_tokens
() → List[str]¶ Returns all of the special tokens used by the process function. This can be needed when training a SentencePiece tokenizer.
- Returns
all of the special tokens used in the process function
- Return type
List[str]
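For illustration, the returned tokens are typically registered with a tokenizer so they are never split into sub-word pieces. The list below is a hypothetical reproduction of the defaults used by process; in practice the list should always come from get_special_tokens():

```python
# Hypothetical token list mirroring the default replace_with_* values of
# process(); in real code, obtain it via BasicProcessor().get_special_tokens().
special_tokens = ['<url>', '<email>', '<phone>', '<number>',
                  '<currency>', '<date>', '<bracket>', '<more>', '<less>']

# When training a SentencePiece model these would typically be passed as
# user_defined_symbols so each token stays a single piece, e.g.:
# spm.SentencePieceTrainer.train(..., user_defined_symbols=special_tokens)
```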
-
static
load_language_days
(language: str) → List[str]¶ Returns language-specific names of the days of the week.
- Parameters
language (str) – the language in which to return the day names
- Returns
a list of day names in specified language
- Return type
List[str]
-
static
load_language_months
(language: str) → List[str]¶ Returns language-specific names of the months.
- Parameters
language (str) – the language in which to return the month names
- Returns
a list of months in specified language
- Return type
List[str]
-
static
load_numerals
(language: str) → List[str]¶ Builds language-specific numerals. Numerals from 1 to 99 are currently supported.
- Parameters
language (str) – the language in which to return the numerals
- Returns
a list of numerals in specified language
- Return type
List[str]
-
static
load_ordinals
(language: str) → List[str]¶ Builds language-specific ordinals. Ordinals from 1 to 99 are currently supported.
- Parameters
language (str) – the language in which to return the ordinals
- Returns
a list of ordinals in specified language
- Return type
List[str]
-
pos_tag_data
(post_processed_data: List[str], replace_with_number: str, tokenizer_threads: int, tokenizer_batch_size: int, pos_batch_size: int) → List[str]¶ A helper function to post-process number tags and replace the corresponding tokens with a special token. It also uses SpaCy tokenization to return the "normal" form of tokens.
Long story short: this function will parse the input "There wasn't six apples" into "There was not <number> apples".
- Parameters
post_processed_data (List[str]) – a list of post-processed texts
replace_with_number (str) – a special token to replace numbers with
tokenizer_threads (int) – How many threads to use for tokenization
tokenizer_batch_size (int) – Batch size for tokenization
pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!
- Returns
postprocessed texts
- Return type
List[str]
-
process
(text_to_process: Union[str, List[str]], pre_rules: List[Callable] = [], post_rules: List[Callable] = [], language: str = 'en', fix_unicode: bool = True, lower: bool = False, no_line_breaks: bool = False, no_urls: bool = True, no_emails: bool = True, no_phone_numbers: bool = True, no_numbers: bool = True, no_digits: bool = False, no_currency_symbols: bool = True, no_punct: bool = False, no_math: bool = True, no_dates: bool = True, no_multiple_chars: bool = True, no_lists: bool = True, no_brackets: bool = True, replace_with_url: str = '<url>', replace_with_email: str = '<email>', replace_with_phone_number: str = '<phone>', replace_with_number: str = '<number>', replace_with_digit: str = '0', replace_with_currency_symbol: str = '<currency>', replace_with_date: str = '<date>', replace_with_bracket: str = '<bracket>', replace_more: str = '<more>', replace_less: str = '<less>', use_pos_tagging: bool = True, list_processing_threads: int = 8, tokenizer_threads: int = 8, tokenizer_batch_size: int = 60, pos_batch_size: int = 7000) → Union[str, List[str]]¶ Main text processing function. It mainly uses regexes to find specified patterns in texts and replace them by a defined custom token or fixes parts that are not valuable for humans and machines.
The function also lets users set pre_rules and post_rules. You can use these lists of Callables to add pre- and post-processing rules. A good use case is processing CommonCrawl reddit data, where each page has the same schema (headers, navigation bars, etc.). In such a case you can use pre_rules to filter those parts out and then pass the text into the process function pipeline. Also feel free to add post_rules to match other cases which are not handled here.
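The rule mechanism can be sketched as plain callables applied in order. This is a minimal illustration of the idea, not the library's internal implementation, and the 'NAVBAR' marker below is a made-up example:

```python
from typing import Callable, List

def apply_rules(text: str, rules: List[Callable[[str], str]]) -> str:
    # Each rule receives the output of the previous one, in list order.
    for rule in rules:
        text = rule(text)
    return text

# Hypothetical pre_rules for CommonCrawl reddit pages: drop a repeated
# navigation-bar marker, then strip surrounding whitespace.
pre_rules = [
    lambda t: t.replace('NAVBAR', ''),
    lambda t: t.strip(),
]

print(apply_rules('  NAVBAR hello world  ', pre_rules))  # hello world
```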
Depending on its parameters, the function can replace specific types of data.
- Currently supported entities:
dates,
brackets,
simple math strings,
phone numbers,
emails,
urls,
numbers and digits,
multiple characters in single words.
Dates
- Examples of dates matching in strings for english:
1.02.2030
1st of December 3990
first of DecEmber 1233
first december 2020
early 20s
01.03.4223
11-33-3222
2020s
Friday 23 October
late 90s
in 20s
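As a rough illustration, the decade-style entries from this list ('early 20s', 'late 90s', '2020s') can be matched with a simple regex. This is a hypothetical sketch; the library's actual date patterns are much broader and also cover forms like '1.02.2030' or 'Friday 23 October':

```python
import re

# Deliberately narrow sketch: matches only decade-style dates such as
# 'early 20s', 'late 90s' and '2020s'.
DECADE_RE = re.compile(r'\b(?:early\s+|late\s+|in\s+)?\d{2,4}s\b', re.IGNORECASE)

def replace_decades(text: str, token: str = '<date>') -> str:
    return DECADE_RE.sub(token, text)

print(replace_decades('born in the late 90s'))  # born in the <date>
```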
Brackets
- Examples of brackets matching in strings for english:
[tryrty]
(other text)
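A minimal sketch of this behavior for non-nested brackets, assuming the default '<bracket>' token (the library's actual implementation may differ):

```python
import re

def replace_brackets(text: str, token: str = '<bracket>') -> str:
    # Replace non-nested [...] and (...) spans, contents included, with a token.
    return re.sub(r'\[[^\]]*\]|\([^)]*\)', token, text)

print(replace_brackets('some text (other text) here'))  # some text <bracket> here
```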
Simple math strings
- Examples of simple math strings for english:
2 > 3
4<6
4>=4
5<= 4
If you set no_math=False, such cases will instead be processed by other functions. This means that one function will remove the math operators (<, >, <=, >=) and another will replace the numbers with a special token.
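The combined effect on these math strings can be sketched as follows. This is a simplified illustration assuming the default <number>/<more>/<less> tokens, not the library's code:

```python
import re

def replace_math(text: str, number_token: str = '<number>',
                 more_token: str = '<more>', less_token: str = '<less>') -> str:
    # Replace operators in a single pass, so the inserted tokens (which
    # themselves contain '<' and '>') are never re-matched.
    def op(match):
        return ' %s ' % (more_token if match.group(0)[0] == '>' else less_token)
    text = re.sub(r'[<>]=?', op, text)
    # Then replace the numbers and normalize the whitespace.
    text = re.sub(r'\d+', number_token, text)
    return re.sub(r'\s+', ' ', text).strip()

print(replace_math('4>=4'))   # <number> <more> <number>
print(replace_math('5<= 4'))  # <number> <less> <number>
```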
Multiple characters in single words
The table below shows strings before and after applying the multiple-character processing function:
Before                After
'EEEEEEEEEEEE'        ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'
Processing multiple characters is extremely useful in processing CommonCrawl reddit data.
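The simple cases in the table can be sketched with two rules: drop words made of a single repeated character, and collapse runs of three or more identical characters. This is a hypothetical sketch; the library's rules also handle mixed cases such as 'EEEE<number>!':

```python
import re

def reduce_repeats(word: str) -> str:
    # Drop words that are just one character repeated, e.g. 'EEEEEEEEEEEE'.
    if len(word) > 2 and len(set(word)) == 1:
        return ''
    # Collapse any run of 3+ identical characters into a single character.
    return re.sub(r'(.)\1{2,}', r'\1', word)

print(reduce_repeats('supeeeeeer'))        # super
print(reduce_repeats('suppppprrrrrpper'))  # suprpper
```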
Lists replacement
Lists in a text with a leading "-" or ">" for each item can be parsed into simpler, more understandable text. For example, the list:
My_list: - item 1 - item 2, -item 3
Will be parsed to:
My_list: item 1, item 2, item 3.
Use the no_lists argument to enable this feature.
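A rough sketch of this transformation (hypothetical; it only handles the simple, single-line case shown above):

```python
import re

def flatten_list(text: str) -> str:
    # Split on list markers ('-' or '>') and any comma directly before them,
    # then rejoin the items with commas and end with a full stop.
    head, *items = re.split(r'\s*,?\s*[->]\s*', text)
    items = [item.strip(' ,') for item in items if item.strip(' ,')]
    return head.rstrip() + ' ' + ', '.join(items) + '.'

print(flatten_list('My_list: - item 1 - item 2, -item 3'))
# My_list: item 1, item 2, item 3.
```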
Supported languages
Fully supported languages      Partially supported languages
English                        German
Be careful
Please don't replace the special tokens used by the function, because doing so can cause strings to be processed incorrectly. This will be fixed in future releases.
- Parameters
text_to_process (Union[str, List[str]]) – An input text or a list of texts (for multiprocess processing) to be processed by the function
pre_rules (List[Callable]) – A list of lambdas that are applied before main preprocessing rules
post_rules (List[Callable]) – A list of lambdas that are applied after pre_rules and function processing rules
language (str) – The input text language
fix_unicode (bool) – fix non-unicode characters by converting them to unicode
lower (bool) – lowercase all characters
no_line_breaks (bool) – fully strip line breaks as opposed to only normalizing them
no_urls (bool) – replace all URLs with a special token
no_emails (bool) – replace all email addresses with a special token
no_phone_numbers (bool) – replace all phone numbers with a special token
no_numbers (bool) – replace all numbers with a special token
no_digits (bool) – replace all digits with a special token
no_currency_symbols (bool) – replace all currency symbols with a special token
no_punct (bool) – remove punctuations
no_math (bool) – replace math operators (>, >=, <, <=) in math strings with special tokens
no_dates (bool) – replace date strings in the input text, e.g. 'early 80s' -> '<date>'
no_lists (bool) – parse text lists into plain text
no_brackets (bool) – replace brackets: '[', ']', '(', ')'
no_multiple_chars (bool) – reduce repeated characters in words to single ones, e.g. 'supeeeeeer' -> 'super'
replace_with_url (str) – a special token used to replace urls
replace_with_email (str) – a special token used to replace emails
replace_with_phone_number (str) – a special token used to replace phone numbers
replace_with_number (str) – a special token used to replace numbers
replace_with_digit (str) – a special token used to replace digits
replace_with_currency_symbol (str) – a special token used to replace currency symbol
replace_with_date (str) – a special token used to replace dates
replace_with_bracket (str) – a special token used to replace brackets
replace_more (str) – a special token used to replace more ‘>’ and more or equal ‘>=’ symbols in math texts
replace_less (str) – a special token used to replace less ‘<’ and less or equal ‘<=’ symbols in math texts
use_pos_tagging (bool) – if True, the function will use StanzaNLP & SpaCy for POS tagging and token normalization
list_processing_threads (int) – How many threads to use to process a List[str] input
tokenizer_threads (int) – How many threads to use during tokenization; this value is passed to the SpaCy pipeline
tokenizer_batch_size (int) – Batch size for tokenization
pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!
- Returns
Post-processed text
- Return type
Union[str, List[str]]
-
process_multiple_characters
(text_to_process: str) → str¶ Detects repeated characters in a word and replaces them with a single one.
Before                After
'EEEEEEEEEEEE!'       ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'
- Parameters
text_to_process (str) – An input text to process
- Returns
Text with duplicated characters removed from each word
- Return type
str
-
Example usage:
from MordinezNLP.processors import BasicProcessor
bp = BasicProcessor()
post_process = bp.process("this is my text to process by a function", language='en')
print(post_process)