Welcome to MordinezNLP’s documentation!

Downloaders - Basic

class MordinezNLP.downloaders.Basic.BasicDownloader

Class that helps to download multiple files from a list of provided links using multithreading.

static download_to_bytes_io(url: str, temp_path: str, use_memory: bool, custom_headers: dict = {}, streamable: bool = False, sleep_time: float = 0, max_retries: int = 10) → Union[_io.BytesIO, pathlib.Path]

Defines how to download a single URL. It is used by the download_urls function for multithreaded downloading.

The function makes a GET request to the specified URL. If an exception occurs, it retries the download until it succeeds or until it reaches max_retries unsuccessful attempts.

Parameters
  • max_retries (int) – How many retries function will make until it marks a file undownloadable

  • sleep_time (int) – A sleep time in seconds that is used to prevent sites from detecting downloading as a DDoS attack

  • streamable (bool) – Sets a request’s stream parameter. More info https://2.python-requests.org/en/v2.8.1/user/advanced/#body-content-workflow

  • custom_headers (dict) – Custom headers used in each request

  • url (str) – valid HTTP/HTTPS URL

  • temp_path (str) – A temporary path where downloaded files will be saved (used only when downloading to disk instead of memory)

  • use_memory (bool) – When set to True the script will download all data to memory. Otherwise it will save downloaded data as temporary files on disk.

Returns

the downloaded file as a BytesIO object (when use_memory is True) or a Path to a temporary file on disk (when use_memory is False)

Return type

Union[io.BytesIO, pathlib.Path]
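
Example usage (a minimal sketch; in practice download_urls is usually preferred, and the temporary path below is only a placeholder):

from MordinezNLP.downloaders import BasicDownloader

# in-memory download of a single file; the URL is taken from the examples below
license_bytes = BasicDownloader.download_to_bytes_io(
    "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/LICENSE",
    "./temp",  # placeholder temporary path, only relevant for on-disk downloads
    use_memory=True,
)
print(license_bytes.read()[:100])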

static download_urls(urls: List[str], file_type_handler: Callable, threads: int = 8, sleep_time: float = 0, custom_headers: Iterable = repeat({}), streamable: Iterable = repeat(False), max_retries: int = 10, use_memory: bool = True, temp_dir: Optional[pathlib.Path] = None) → list

Allows the user to download files from a list of provided URLs. Each file is downloaded as BytesIO using the specified number of threads, and then file_type_handler is used to convert the file from BytesIO to the desired format. Each file type should have its own file_type_handler.

sleep_time is used to prevent sites from detecting the downloads as a DDoS attack. Before downloading a file, the thread sleeps for the specified amount of time.

For each thread (and for each single URL) the function uses the download_to_bytes_io function.

Using use_memory, the user can decide whether files will be downloaded to memory or saved as temporary files on disk.

Parameters
  • urls (List[str]) – List of URLs of files to download

  • file_type_handler (Callable) – Function used to convert downloaded file to a specified format

  • threads (int) – Number of threads to download files

  • sleep_time (int) – Time used to prevent file downloads from being detected as DDoS attack

  • max_retries (int) – Refer to download_to_bytes_io function documentation.

  • streamable (Iterable) – Refer to the download_to_bytes_io function documentation.

  • custom_headers (Iterable) – Refer to the download_to_bytes_io function documentation.

  • use_memory (bool) – The downloader can download files to memory or to binary files on disk. Use the in-memory download when You know that You will download a small amount of data; otherwise set this value to False.

  • temp_dir (Path) – Path to directory where temporary files will be saved.

Returns

A list of downloaded and processed by file_type_handler function files.

Return type

list

Example usage for TXT files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import text_data_processor

downloaded_elements = BasicDownloader.download_urls(
    [
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/requirements.txt",
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/LICENSE"
    ],
    lambda x: text_data_processor(x),
)

print(downloaded_elements)  # <- will display a list where each element is the content of a downloaded file

Example usage for PDF files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import pdf_data_processor

downloaded_pdfs = BasicDownloader.download_urls(
    [
        "https://docs.whirlpool.eu/_doc/19514904100_PL.pdf",
        "https://mpm.pl/docs/_instrukcje/WA-6040S_instrukcja.pdf",
    ],
    lambda x: pdf_data_processor(x)
)

print(downloaded_pdfs)  # <- will display a list where each element is the content of a downloaded file
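
Example usage for a disk-based download (a minimal sketch, assuming that with use_memory=False temporary files are written to temp_dir; the directory below is only a placeholder):

from pathlib import Path

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import text_data_processor

downloaded_elements = BasicDownloader.download_urls(
    [
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/requirements.txt",
    ],
    text_data_processor,
    use_memory=False,
    temp_dir=Path("./tmp_downloads"),  # placeholder directory for temporary files
)
print(downloaded_elements)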

Downloaders - CommonCrawl

class MordinezNLP.downloaders.CommonCrawlDownloader.CommonCrawlDownloader(links_to_search: List[str], index_name: str = 'CC-MAIN-2020-24', base_index_url: str = 'http://index.commoncrawl.org', search_for_mime: str = 'text/html', search_for_language: str = 'eng', threads: int = 8)

Class used to download Common Crawl data using the multithreaded BasicDownloader.

download(save_to: str, base_url: str = 'https://commoncrawl.s3.amazonaws.com', sleep_time: float = 0)

Main function used to download CC data using the multithreaded BasicDownloader.

Parameters
  • save_to (str) – path to a folder where the data will be downloaded. Each file is an HTML document downloaded from CC.

  • base_url (str) – base CC URL for example: https://commoncrawl.s3.amazonaws.com

  • sleep_time (int) – A sleep time in seconds that is used to prevent sites from detecting downloading as a DDoS attack

Example usage:

from MordinezNLP.downloaders import CommonCrawlDownloader

ccd = CommonCrawlDownloader(
    [
        "reddit.com/r/space/*",
        "reddit.com/r/spacex/*",
    ]
)
ccd.download('./test_data')
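
The index name, base index URL and thread count can also be set explicitly (a minimal sketch based on the constructor and download signatures above; the values shown are just the documented defaults):

from MordinezNLP.downloaders import CommonCrawlDownloader

ccd = CommonCrawlDownloader(
    ["reddit.com/r/space/*"],
    index_name='CC-MAIN-2020-24',
    base_index_url='http://index.commoncrawl.org',
    threads=8,
)
ccd.download('./test_data', sleep_time=1)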

Downloaders - Elastic Search

class MordinezNLP.downloaders.ElasticSearchDownloader.ElasticSearchDownloader(ip: str, port: int, api_key_1: Optional[str] = None, api_key_2: Optional[str] = None, timeout: int = 100)

Class used to download Elasticsearch data from a specified index using multithreading.

get_all_available_indexes() → List[str]

Get all available elastic search indexes.

Returns

a list of strings where each element is an Elasticsearch index name

Return type

List[str]
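
Example usage (a minimal sketch; the connection details below are placeholders):

from MordinezNLP.downloaders import ElasticSearchDownloader

es = ElasticSearchDownloader(
    ip='127.0.0.1',  # placeholder Elasticsearch host
    port=9200,
    timeout=10
)
print(es.get_all_available_indexes())  # <- list of index names as strings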

scroll_data(index_name: str, query: dict, processing_function: Callable[[dict], Any], threads: int = 6, scroll: str = '2m', scroll_size: int = 100) → List[any]

A function that scrolls through Elasticsearch data from an index and returns the data processed with processing_function using multiple threads. It returns a list of the type returned by processing_function.

Parameters
  • index_name (str) – An index name to scroll/download the data

  • query (dict) – An elastic search query

  • processing_function (Callable[[dict], Any]) – A function that processes single item from elastic search index

  • threads (int) – A number of threads to run processing on

  • scroll (str) – A scroll value

  • scroll_size (int) – The number of items fetched in each scroll request

Returns

Returns a list of processed items with types according to processing_function, or an empty list if the index doesn't exist.

Return type

List[any]

Example usage:

from MordinezNLP.downloaders import ElasticSearchDownloader

es = ElasticSearchDownloader(
    ip='',
    port=9200,
    timeout=10
)

body = {}  # <- use your own elastic search query

# Your own processing function for a single element
def processing_func(data: dict) -> str:
    return data['my_key']['my_next_key'].replace("\r\n", "\n")


# Scroll the data
downloaded_elastic_search_data = es.scroll_data(
    'my_index_name',
    body,
    processing_func,
    threads=8
)

print(len(downloaded_elastic_search_data))

Downloaders - processors

MordinezNLP.downloaders.Processors.gzip_to_text_data_processor(data: _io.BytesIO) → str

Function can be used together with downloaders to treat BytesIO as GZIP data and unpack it to str.

Parameters

data (BytesIO) – input data which comes from a downloader class/function

Returns

parsed input

Return type

str
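
Example usage together with BasicDownloader (a minimal sketch; the URL is only a placeholder for any gzip-compressed text file):

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import gzip_to_text_data_processor

downloaded_texts = BasicDownloader.download_urls(
    [
        "https://example.com/corpus.txt.gz",  # placeholder URL of a gzip-compressed text file
    ],
    gzip_to_text_data_processor,
)
print(downloaded_texts)  # <- list with the unpacked text content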

MordinezNLP.downloaders.Processors.pdf_data_processor(data: _io.BytesIO) → str

Function can be used together with downloaders to convert BytesIO from PDF files to str.

Parameters

data (BytesIO) – input data which comes from a downloader class/function

Returns

parsed input; more information about parsing PDFs can be found in the MordinezNLP.parsers.process_pdf method

Return type

str

MordinezNLP.downloaders.Processors.text_data_processor(data: _io.BytesIO) → str

Function can be used together with downloaders to convert BytesIO from text data to str.

Parameters

data (BytesIO) – input data which comes from a downloader class/function

Returns

parsed input

Return type

str
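
The processor can also be called directly on any BytesIO object (a minimal sketch):

from io import BytesIO
from MordinezNLP.downloaders.Processors import text_data_processor

print(text_data_processor(BytesIO(b"hello world")))  # -> 'hello world'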

Parsers - PDF parser

MordinezNLP.parsers.process_pdf.process_pdf(pdf_input: _io.BytesIO) → List[str]

A function that reads strings from PDF docs wrapped in a BytesIO object. It extracts the whole text and removes text that occurs in tables, because tables mostly contain messy data for NLP tasks.

The function is divided into two parts. It removes tokens that match the table text exactly and occur the same number of times in the text and in the tables. The first part uses a list of tokens, the second uses tokens joined with a space.

Parameters

pdf_input (BytesIO) – A PDF as a BytesIO object

Returns

Parsed text without texts found in tables

Return type

List[str]

Example usage for PDF files:

from io import BytesIO
from MordinezNLP.parsers import process_pdf

with open("my_pdf_doc.pdf", "rb") as f:
       pdf = BytesIO(f.read())
   output = process_pdf(pdf)
   print(output)

Parsers - HTML parser

MordinezNLP.parsers.HTML_Parser.HTML_Parser(html_doc: str, separator: str = ' ') → str

Function which removes non-valuable text and tags from HTML docs. It is based on the research at https://rushter.com/blog/python-fast-html-parser/

IMPORTANT: You must be 100% sure that the text You want to process is an HTML doc. Otherwise some parts of the source text can be deleted because they are misinterpreted as tags.

Parameters
  • separator (str) – Separator used to join HTML nodes in the selectolax package

  • html_doc (str) – a HTML doc

Returns

A string which is the valuable text parsed from the HTML doc.

Return type

str

Example usage for HTML files:

from MordinezNLP.parsers import HTML_Parser

with open("my_html_file.html", "r") as f:
    html_content = HTML_Parser(f.read())
    print(html_content)

Processors - Basic

class MordinezNLP.processors.Basic.BasicProcessor(language: str = 'en')

The aim of the class is to make dirty texts usable for NLP.

get_special_tokens() → List[str]

Returns all of the special tokens used by the process function. This can be needed when training a SentencePiece tokenizer.

Returns

all of the special tokens used in process function

Return type

List[str]
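
Example usage (a minimal sketch):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
print(bp.get_special_tokens())  # <- special tokens such as '<date>' or '<number>' used by process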

static load_language_days(language: str) → List[str]

Returns language-specific names of the days of the week.

Parameters

language (str) – a language in which to return the names of the days of the week

Returns

a list of day names in specified language

Return type

List[str]

static load_language_months(language: str) → List[str]

Function returns language specific names of months

Parameters

language (str) – a language in which to return the names

Returns

a list of months in specified language

Return type

List[str]

static load_numerals(language: str) → List[str]

Build language specific numerals. Currently supported numerals are from 1 to 99.

Parameters

language (str) – a language in which function will return numerals

Returns

a list of numerals in specified language

Return type

List[str]

static load_ordinals(language: str) → List[str]

Build a language specific ordinals. Currently supported ordinals from 1 to 99

Parameters

language (str) – a language in which function will return ordinals

Returns

a list of ordinals in specified language

Return type

List[str]
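
Example usage of the static helpers above (a minimal sketch):

from MordinezNLP.processors import BasicProcessor

print(BasicProcessor.load_language_days('en'))     # <- names of the days of the week
print(BasicProcessor.load_language_months('en'))   # <- names of the months
print(BasicProcessor.load_numerals('en'))          # <- numerals from 1 to 99
print(BasicProcessor.load_ordinals('en'))          # <- ordinals from 1 to 99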

pos_tag_data(post_processed_data: List[str], replace_with_number: str, tokenizer_threads: int, tokenizer_batch_size: int, pos_batch_size: int) → List[str]

A helper function to postprocess number tags and replace the corresponding tokens with a special token. It also uses SpaCy tokenization to return the "normal" form of tokens.

Long story short: This function will parse input “There wasn’t six apples” to “There was not <number> apples”.

Parameters
  • post_processed_data (List[str]) – a list of postprocessed texts

  • replace_with_number (str) – a special token to replace numbers with

  • tokenizer_threads (int) – How many threads to use for tokenization

  • tokenizer_batch_size (int) – Batch size for tokenization

  • pos_batch_size (int) – POS tagging batch size; be careful when CUDA is available in Your system!

Returns

postprocessed texts

Return type

List[str]
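
Example usage (a minimal sketch based on the description above; the batch sizes are only illustrative):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
tagged = bp.pos_tag_data(
    ["There wasn't six apples"],
    replace_with_number="<number>",
    tokenizer_threads=4,
    tokenizer_batch_size=50,
    pos_batch_size=3000,
)
print(tagged)  # <- roughly: ["There was not <number> apples"]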

process(text_to_process: Union[str, List[str]], pre_rules: List[Callable] = [], post_rules: List[Callable] = [], language: str = 'en', fix_unicode: bool = True, lower: bool = False, no_line_breaks: bool = False, no_urls: bool = True, no_emails: bool = True, no_phone_numbers: bool = True, no_numbers: bool = True, no_digits: bool = False, no_currency_symbols: bool = True, no_punct: bool = False, no_math: bool = True, no_dates: bool = True, no_multiple_chars: bool = True, no_lists: bool = True, no_brackets: bool = True, replace_with_url: str = '<url>', replace_with_email: str = '<email>', replace_with_phone_number: str = '<phone>', replace_with_number: str = '<number>', replace_with_digit: str = '0', replace_with_currency_symbol: str = '<currency>', replace_with_date: str = '<date>', replace_with_bracket: str = '<bracket>', replace_more: str = '<more>', replace_less: str = '<less>', use_pos_tagging: bool = True, list_processing_threads: int = 8, tokenizer_threads: int = 8, tokenizer_batch_size: int = 60, pos_batch_size: int = 7000) → Union[str, List[str]]

Main text processing function. It mainly uses regexes to find specified patterns in texts and replace them with a defined custom token, or fixes parts that are not valuable for humans and machines.

Function also enables users to set pre_rules and post_rules. You can use those lists of Callables to add pre and post processing rules. A good use case is processing CommonCrawl reddit data, where each page has the same schema (headers, navigation bars etc.). In such case You can use pre_rules to filter them and then pass such text into the process function pipeline. Also feel free to add post_rules to match other cases which are not used here.

Depending on the parameters, the function can replace specified types of data.

Currently supported entities:
  • dates,

  • brackets,

  • simple math strings,

  • phone numbers,

  • emails,

  • urls,

  • numbers and digits

  • multiple characters in single words

Dates

Examples of dates matching in strings for english:
  • 1.02.2030

  • 1st of December 3990

  • first of DecEmber 1233

  • first december 2020

  • early 20s

  • 01.03.4223

  • 11-33-3222

  • 2020s

  • Friday 23 October

  • late 90s

  • in 20s

Brackets

Examples of brackets matching in strings for english:
  • [tryrty]

  • (other text)

Simple math strings

Examples of simple math strings for english:
  • 2 > 3

  • 4<6

  • 4>=4

  • 5<= 4

If You decide to use no_math=False, then such cases will be processed by other functions. It means that one function will remove the math operators (<, >, <=, >=) and another will replace the numbers with a special token.

Multiple characters in single words

The table below shows strings before and after using the multiple-characters-in-a-single-word processing function:

Before                After
'EEEEEEEEEEEE'        ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'

Processing multiple characters is extremely useful in processing CommonCrawl reddit data.

Lists replacement

Lists in a text with a leading "-" or ">" for each item can be parsed into simpler and more understandable text. For example, the list:

My_list:
- item 1
- item 2,
-item 3

Will be parsed to:

My_list: item 1, item 2, item 3.

Use no_lists argument to enable this feature.
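
A minimal sketch of this behaviour (the exact spacing and punctuation of the output may differ):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
text = "My_list:\n- item 1\n- item 2,\n-item 3"
print(bp.process(text, language='en', no_lists=True))
# expected, roughly: "My_list: item 1, item 2, item 3."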

Supported languages

Fully supported languages    Partially supported languages
English                      German

Be careful

Please don't replace the special tokens used by the function, because it can cause strings to be processed incorrectly. This will be fixed in future releases.

Parameters
  • text_to_process (Union[str, List[str]]) – An input text or a list of texts (for multiprocess processing) to process by a function

  • pre_rules (List[Callable]) – A list of lambdas that are applied before main preprocessing rules

  • post_rules (List[Callable]) – A list of lambdas that are applied after pre_rules and function processing rules

  • language (str) – The input text language

  • fix_unicode (bool) – fix broken unicode characters

  • lower (bool) – lowercase all characters

  • no_line_breaks (bool) – fully strip line breaks as opposed to only normalizing them

  • no_urls (bool) – replace all URLs with a special token

  • no_emails (bool) – replace all email addresses with a special token

  • no_phone_numbers (bool) – replace all phone numbers with a special token

  • no_numbers (bool) – replace all numbers with a special token

  • no_digits (bool) – replace all digits with a special token

  • no_currency_symbols (bool) – replace all currency symbols with a special token

  • no_punct (bool) – remove punctuations

  • no_math (bool) – remove >= <= in math strings

  • no_dates (bool) – replace date strings in the input text, e.g. 'early 80s' -> '<date>'

  • no_lists (bool) – replace all lists in texts (see the Lists replacement section above)

  • no_brackets (bool) – replace brackets: ‘[‘, ‘]’, ‘(‘, ‘)’

  • no_multiple_chars (bool) – reduce multiple repeated characters in a word to a single one, e.g. 'supeeeeeer' -> 'super'

  • replace_with_url (str) – a special token used to replace urls

  • replace_with_email (str) – a special token used to replace emails

  • replace_with_phone_number (str) – a special token used to replace phone numbers

  • replace_with_number (str) – a special token used to replace numbers

  • replace_with_digit (str) – a special token used to replace digits

  • replace_with_currency_symbol (str) – a special token used to replace currency symbol

  • replace_with_date (str) – a special token used to replace dates

  • replace_with_bracket (str) – a special token used to replace brackets

  • replace_more (str) – a special token used to replace more ‘>’ and more or equal ‘>=’ symbols in math texts

  • replace_less (str) – a special token used to replace less ‘<’ and less or equal ‘<=’ symbols in math texts

  • use_pos_tagging (bool) – if True function will use StanzaNLP & SpaCy for POS tagging and token normalization

  • list_processing_threads (int) – How many threads You want to use to process the List[str] which is an input to this function.

  • tokenizer_threads (int) – How many threads to use during tokenization, this value is passed to the SpaCy pipeline.

  • tokenizer_batch_size (int) – Batch size for tokenization

  • pos_batch_size (int) – Batch size for POS tagging

Returns

Post-processed text

Return type

Union[str, List[str]]

process_multiple_characters(text_to_process: str) → str

Function can detect multiplied characters in a word and replace them with a single one.

Before                After
'EEEEEEEEEEEE!'       ''
'supeeeeeer'          'super'
'EEEE<number>!'       ''
'suppppprrrrrpper'    'suprpper'

Parameters

text_to_process (str) – An input text to process

Returns

Text with removed duplicated characters in each word

Return type

str
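
Example usage (a minimal sketch based on the table above):

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
print(bp.process_multiple_characters("supeeeeeer"))        # -> 'super'
print(bp.process_multiple_characters("suppppprrrrrpper"))  # -> 'suprpper'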

Example usage:

from MordinezNLP.processors import BasicProcessor

bp = BasicProcessor()
post_process = bp.process("this is my text to process by a funcion", language='en')
print(post_process)

Tokenizers - SpacyTokenizer

A custom SpaCy tokenizer ready for tokenizing the special tokens which come from BasicProcessor.

Out of the box, the SpaCy tokenizer will split special tokens (tags), for example "<date>" into "< date >", so this function changes that behavior.

Parameters

nlp (spacy.language.Language) – A Language object from SpaCy

Returns

A SpaCy tokenizer

Return type

spacy.tokenizer.Tokenizer

Example usage:

from MordinezNLP.tokenizers import spacy_tokenizer
from spacy.language import Language
import spacy

nlp: Language = spacy.load("en_core_web_sm")
nlp.tokenizer = spacy_tokenizer(nlp)

test_doc = nlp('Hello today is <date>, tomorrow it will be <number> degrees of celcius.')

for token in test_doc:
    print(token)

# output
# Hello
# today
# is
# <date>
# ,
# tomorrow
# it
# will
# be
# <number>
# degrees
# of
# celcius
# .

Pipelines - PartOfSpeech

class MordinezNLP.pipelines.PartOfSpeech.PartOfSpeech(nlp: spacy.language.Language, language: str = 'en')

The aim of the class is to tag each token (which comes from MordinezNLP processors) with its POS tag.

process(texts: List[str], tokenizer_threads: int = 8, tokenizer_batch_size: int = 50, pos_batch_size: int = 3000, pos_replacement_list: Optional[Dict[str, str]] = None, token_replacement_list: Optional[Dict[str, str]] = None, return_docs: bool = False, return_string_tokens: bool = False) → Union[Generator[Tuple[List[Union[spacy.tokens.token.Token, str]], List[str]], None, None], Generator[Tuple[List[List[Union[spacy.tokens.token.Token, str]]], List[List[str]]], None, None]]

Main processing function. The first step is to tokenize a list of input texts into sentences and then into tokens. Then such input goes to StanzaNLP.

For this function, the List[str] object which comes as an input is a list of docs to process. Each item in the list is a document (SpaCy logic in pipelines). You can specify whether You want the returned texts in the structure documents[sentences[tokens]] or sentences[tokens] (removing the documents layer).

Sometimes You want to force the POS tagger to assign a specific POS tag to a specified token, or to replace one POS tag with another. For such cases You can use pos_replacement_list and token_replacement_list. You can import sample token and POS replacement lists from MordinezNLP.utils.pos_replacement_list and MordinezNLP.utils.token_replacement_list.

If You want to use the special SpaCy attributes of each token, please pass False as the value of the return_string_tokens argument.

Each token parsed by the SpaCy tokenizer will be converted to its normal version. For example, each n't will be replaced by not.

Parameters
  • texts (List[str]) – an input texts, each item in a list is a document (SpaCy logic in pipelines)

  • tokenizer_threads (int) – How many threads You want to use in SpaCy tokenization

  • tokenizer_batch_size (int) – Batch size for SpaCy tokenizer

  • pos_batch_size (int) – Batch size for the Stanza POS tagger (if enabled)

  • pos_replacement_list (Union[Dict[str, str], None]) – If not None, the function will replace each POS tag with the value set in the value field of the dict. Each key is a POS tag to be replaced by its value.

  • token_replacement_list – If not None, the function will replace each token with the value set in the value field of the dict. Each key is a token, which will be replaced by its value.

  • return_docs (bool) – If True function will keep a “documents” layer on output.

  • return_string_tokens (bool) – The function can return tokens as SpaCy Token objects (if You need to access token data such as norm) or as string objects. If True, it returns string tokens.

Returns

Union[Generator[Tuple[List[Union[Token, str]], List[str]], None, None], Generator[Tuple[List[List[Union[Token, str]]], List[List[str]]], None, None]]: a list of docs (if return_docs is set) with a list of sentences with a list of tokens and their POS tags.

Example usage:

from MordinezNLP.pipelines import PartOfSpeech
from MordinezNLP.tokenizers import spacy_tokenizer
from spacy.language import Language
import spacy

nlp: Language = spacy.load("en_core_web_sm")
nlp.tokenizer = spacy_tokenizer(nlp)

docs_to_tag = [
    'Hello today is <date>, tomorrow it will be <number> degrees of celcius.'
]

pos_tagger = PartOfSpeech(
    nlp,
    'en'
)

pos_output = pos_tagger.process(
    docs_to_tag,
    4,
    30,
    return_docs=True
)
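
The result is a generator yielding tuples of tokens and their POS tags (a minimal sketch of consuming it; the exact nesting depends on return_docs as described above):

for tokens, pos_tags in pos_output:
    # tokens and pos_tags follow the documents[sentences[tokens]] structure when return_docs=True
    print(tokens)
    print(pos_tags)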

Utils

MordinezNLP.utils.ngram_iterator.ngram_iterator(string: str, ngram_len: int = 3) → list

Returns an iterator that yields the ngrams of the given string. Each element has length ngram_len, and each subsequent element moves one letter forward in the context. For example, for the string "hello" and ngram_len set to 3 it will output ["hel", "ell", "llo"].

Parameters
  • string (str) – string to iterate on

  • ngram_len (int) – length of each ngram

Returns

a list of ngrams, each ngram_len characters long, built from the input string

Return type

list

Example usage:

from MordinezNLP.utils import ngram_iterator

print(list(ngram_iterator("<hello>", 3))) # <- will print ['<he', 'hel', 'ell', 'llo', 'lo>']

MordinezNLP.utils.random_string.random_string(length: int = 64, choices_list: List[str] = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789') → str

Generate a random string which contains characters from the choices_list argument.

Parameters
  • length (int) – length of generated string

  • choices_list (List[str]) – List of characters from which random string should be generated

Returns

Randomly generated string

Return type

str

Example usage:

from MordinezNLP.utils import random_string
import string

rs = random_string(32)
print(rs)

rs = random_string(10, string.digits)
print(rs)
