Processors - Basic¶
-
class
MordinezNLP.processors.Basic.
BasicProcessor
(language: str = 'en')¶ The aim of this class is to clean up dirty, real-world texts and make them usable for NLP.
-
get_special_tokens
() → List[str]¶ Returns all of the special tokens used by the process function. This can be needed when training a SentencePiece tokenizer.
- Returns
all of the special tokens used in the process function
- Return type
List[str]
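For illustration, the returned tokens are typically registered with a tokenizer so they are never split into sub-word pieces. The list below is a hypothetical reproduction of the defaults used by process; in practice the list should always come from get_special_tokens():

```python
# Hypothetical token list mirroring the default replace_with_* values of
# process(); in real code, obtain it via BasicProcessor().get_special_tokens().
special_tokens = ['<url>', '<email>', '<phone>', '<number>',
                  '<currency>', '<date>', '<bracket>', '<more>', '<less>']

# When training a SentencePiece model these would typically be passed as
# user_defined_symbols so each token stays a single piece, e.g.:
# spm.SentencePieceTrainer.train(..., user_defined_symbols=special_tokens)
```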
-
static
load_language_days
(language: str) → List[str]¶ Returns language-specific names of the days of the week.
- Parameters
language (str) – the language in which to return the day names
- Returns
a list of day names in specified language
- Return type
List[str]
-
static
load_language_months
(language: str) → List[str]¶ Returns language-specific names of the months.
- Parameters
language (str) – the language in which to return the month names
- Returns
a list of months in specified language
- Return type
List[str]
-
static
load_numerals
(language: str) → List[str]¶ Builds language-specific numerals. Numerals from 1 to 99 are currently supported.
- Parameters
language (str) – the language in which to return the numerals
- Returns
a list of numerals in specified language
- Return type
List[str]
-
static
load_ordinals
(language: str) → List[str]¶ Builds language-specific ordinals. Ordinals from 1 to 99 are currently supported.
- Parameters
language (str) – the language in which to return the ordinals
- Returns
a list of ordinals in specified language
- Return type
List[str]
-
pos_tag_data
(post_processed_data: List[str], replace_with_number: str, tokenizer_threads: int, tokenizer_batch_size: int, pos_batch_size: int) → List[str]¶ A helper function to post-process number tags and replace the corresponding tokens with a special token. It also uses SpaCy tokenization to return the "normal" form of tokens.
Long story short: this function will parse the input "There wasn't six apples" into "There was not <number> apples".
- Parameters
post_processed_data (List[str]) – a list of post-processed texts
replace_with_number (str) – a special token to replace numbers with
tokenizer_threads (int) – How many threads to use for tokenization
tokenizer_batch_size (int) – Batch size for tokenization
pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!
- Returns
postprocessed texts
- Return type
List[str]
-
process
(text_to_process: Union[str, List[str]], pre_rules: List[Callable] = [], post_rules: List[Callable] = [], language: str = 'en', fix_unicode: bool = True, lower: bool = False, no_line_breaks: bool = False, no_urls: bool = True, no_emails: bool = True, no_phone_numbers: bool = True, no_numbers: bool = True, no_digits: bool = False, no_currency_symbols: bool = True, no_punct: bool = False, no_math: bool = True, no_dates: bool = True, no_multiple_chars: bool = True, no_lists: bool = True, no_brackets: bool = True, replace_with_url: str = '<url>', replace_with_email: str = '<email>', replace_with_phone_number: str = '<phone>', replace_with_number: str = '<number>', replace_with_digit: str = '0', replace_with_currency_symbol: str = '<currency>', replace_with_date: str = '<date>', replace_with_bracket: str = '<bracket>', replace_more: str = '<more>', replace_less: str = '<less>', use_pos_tagging: bool = True, list_processing_threads: int = 8, tokenizer_threads: int = 8, tokenizer_batch_size: int = 60, pos_batch_size: int = 7000) → Union[str, List[str]]¶ Main text processing function. It mainly uses regexes to find specified patterns in texts and replace them by a defined custom token or fixes parts that are not valuable for humans and machines.
The function also lets users set pre_rules and post_rules. You can use these lists of Callables to add pre- and post-processing rules. A good use case is processing CommonCrawl reddit data, where each page has the same schema (headers, navigation bars, etc.). In such a case you can use pre_rules to filter those parts out and then pass the text into the process function pipeline. Also feel free to add post_rules to match other cases which are not handled here.
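The rule mechanism can be sketched as plain callables applied in order. This is a minimal illustration of the idea, not the library's internal implementation, and the 'NAVBAR' marker below is a made-up example:

```python
from typing import Callable, List

def apply_rules(text: str, rules: List[Callable[[str], str]]) -> str:
    # Each rule receives the output of the previous one, in list order.
    for rule in rules:
        text = rule(text)
    return text

# Hypothetical pre_rules for CommonCrawl reddit pages: drop a repeated
# navigation-bar marker, then strip surrounding whitespace.
pre_rules = [
    lambda t: t.replace('NAVBAR', ''),
    lambda t: t.strip(),
]

print(apply_rules('  NAVBAR hello world  ', pre_rules))  # hello world
```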
Depending on its parameters, the function can replace specific types of data.
- Currently supported entities:
dates,
brackets,
simple math strings,
phone numbers,
emails,
urls,
numbers and digits,
multiple characters in single words.
Dates
- Examples of dates matching in strings for english:
1.02.2030
1st of December 3990
first of DecEmber 1233
first december 2020
early 20s
01.03.4223
11-33-3222
2020s
Friday 23 October
late 90s
in 20s
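As a rough illustration, the decade-style entries from this list ('early 20s', 'late 90s', '2020s') can be matched with a simple regex. This is a hypothetical sketch; the library's actual date patterns are much broader and also cover forms like '1.02.2030' or 'Friday 23 October':

```python
import re

# Deliberately narrow sketch: matches only decade-style dates such as
# 'early 20s', 'late 90s' and '2020s'.
DECADE_RE = re.compile(r'\b(?:early\s+|late\s+|in\s+)?\d{2,4}s\b', re.IGNORECASE)

def replace_decades(text: str, token: str = '<date>') -> str:
    return DECADE_RE.sub(token, text)

print(replace_decades('born in the late 90s'))  # born in the <date>
```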
Brackets
- Examples of brackets matching in strings for english:
[tryrty]
(other text)
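A minimal sketch of this behavior for non-nested brackets, assuming the default '<bracket>' token (the library's actual implementation may differ):

```python
import re

def replace_brackets(text: str, token: str = '<bracket>') -> str:
    # Replace non-nested [...] and (...) spans, contents included, with a token.
    return re.sub(r'\[[^\]]*\]|\([^)]*\)', token, text)

print(replace_brackets('some text (other text) here'))  # some text <bracket> here
```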
Simple math strings
- Examples of simple math strings for english:
2 > 3
4<6
4>=4
5<= 4
If you set no_math=False, such cases will instead be processed by other functions. This means that one function will remove the math operators (<, >, <=, >=) and another will replace the numbers with a special token.
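The combined effect on these math strings can be sketched as follows. This is a simplified illustration assuming the default <number>/<more>/<less> tokens, not the library's code:

```python
import re

def replace_math(text: str, number_token: str = '<number>',
                 more_token: str = '<more>', less_token: str = '<less>') -> str:
    # Replace operators in a single pass, so the inserted tokens (which
    # themselves contain '<' and '>') are never re-matched.
    def op(match):
        return ' %s ' % (more_token if match.group(0)[0] == '>' else less_token)
    text = re.sub(r'[<>]=?', op, text)
    # Then replace the numbers and normalize the whitespace.
    text = re.sub(r'\d+', number_token, text)
    return re.sub(r'\s+', ' ', text).strip()

print(replace_math('4>=4'))   # <number> <more> <number>
print(replace_math('5<= 4'))  # <number> <less> <number>
```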
Multiple characters in single words
The table below shows strings before and after applying the multiple-character processing function:
Before                After
'EEEEEEEEEEEE'        ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'
Processing multiple characters is extremely useful in processing CommonCrawl reddit data.
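The simple cases in the table can be sketched with two rules: drop words made of a single repeated character, and collapse runs of three or more identical characters. This is a hypothetical sketch; the library's rules also handle mixed cases such as 'EEEE<number>!':

```python
import re

def reduce_repeats(word: str) -> str:
    # Drop words that are just one character repeated, e.g. 'EEEEEEEEEEEE'.
    if len(word) > 2 and len(set(word)) == 1:
        return ''
    # Collapse any run of 3+ identical characters into a single character.
    return re.sub(r'(.)\1{2,}', r'\1', word)

print(reduce_repeats('supeeeeeer'))        # super
print(reduce_repeats('suppppprrrrrpper'))  # suprpper
```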
Lists replacement
Lists in a text with a leading "-" or ">" for each item can be parsed into simpler, more understandable text. For example, the list:
My_list: - item 1 - item 2, -item 3
Will be parsed to:
My_list: item 1, item 2, item 3.
Use the no_lists argument to enable this feature.
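A rough sketch of this transformation (hypothetical; it only handles the simple, single-line case shown above):

```python
import re

def flatten_list(text: str) -> str:
    # Split on list markers ('-' or '>') and any comma directly before them,
    # then rejoin the items with commas and end with a full stop.
    head, *items = re.split(r'\s*,?\s*[->]\s*', text)
    items = [item.strip(' ,') for item in items if item.strip(' ,')]
    return head.rstrip() + ' ' + ', '.join(items) + '.'

print(flatten_list('My_list: - item 1 - item 2, -item 3'))
# My_list: item 1, item 2, item 3.
```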
Supported languages
Fully supported languages      Partially supported languages
English                        German
Be careful
Please don't replace the special tokens used by the function, because doing so can cause strings to be processed incorrectly. This will be fixed in future releases.
- Parameters
text_to_process (Union[str, List[str]]) – An input text or a list of texts (for multiprocess processing) to be processed by the function
pre_rules (List[Callable]) – A list of lambdas that are applied before main preprocessing rules
post_rules (List[Callable]) – A list of lambdas that are applied after pre_rules and function processing rules
language (str) – The input text language
fix_unicode (bool) – fix non-unicode characters by converting them to unicode
lower (bool) – lowercase all characters
no_line_breaks (bool) – fully strip line breaks as opposed to only normalizing them
no_urls (bool) – replace all URLs with a special token
no_emails (bool) – replace all email addresses with a special token
no_phone_numbers (bool) – replace all phone numbers with a special token
no_numbers (bool) – replace all numbers with a special token
no_digits (bool) – replace all digits with a special token
no_currency_symbols (bool) – replace all currency symbols with a special token
no_punct (bool) – remove punctuations
no_math (bool) – replace math operators (>, >=, <, <=) in math strings with special tokens
no_dates (bool) – replace date strings in the input text, e.g. 'early 80s' -> '<date>'
no_lists (bool) – parse text lists into plain text
no_brackets (bool) – replace brackets: '[', ']', '(', ')'
no_multiple_chars (bool) – reduce repeated characters in words to single ones, e.g. 'supeeeeeer' -> 'super'
replace_with_url (str) – a special token used to replace urls
replace_with_email (str) – a special token used to replace emails
replace_with_phone_number (str) – a special token used to replace phone numbers
replace_with_number (str) – a special token used to replace numbers
replace_with_digit (str) – a special token used to replace digits
replace_with_currency_symbol (str) – a special token used to replace currency symbol
replace_with_date (str) – a special token used to replace dates
replace_with_bracket (str) – a special token used to replace brackets
replace_more (str) – a special token used to replace more ‘>’ and more or equal ‘>=’ symbols in math texts
replace_less (str) – a special token used to replace less ‘<’ and less or equal ‘<=’ symbols in math texts
use_pos_tagging (bool) – if True, the function will use StanzaNLP & SpaCy for POS tagging and token normalization
list_processing_threads (int) – How many threads to use to process a List[str] input
tokenizer_threads (int) – How many threads to use during tokenization; this value is passed to the SpaCy pipeline
tokenizer_batch_size (int) – Batch size for tokenization
pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available on your system!
- Returns
Post-processed text
- Return type
Union[str, List[str]]
-
process_multiple_characters
(text_to_process: str) → str¶ Detects repeated characters in a word and replaces them with a single one.
Before                After
'EEEEEEEEEEEE!'       ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'
- Parameters
text_to_process (str) – An input text to process
- Returns
Text with duplicated characters removed from each word
- Return type
str
-
Example usage:
from MordinezNLP.processors import BasicProcessor
bp = BasicProcessor()
post_process = bp.process("this is my text to process by a function", language='en')
print(post_process)