Downloaders - Basic

class MordinezNLP.downloaders.Basic.BasicDownloader

Class that helps to download multiple files from a list of provided links using multithreading.

static download_to_bytes_io(url: str, temp_path: str, use_memory: bool, custom_headers: dict = {}, streamable: bool = False, sleep_time: float = 0, max_retries: int = 10) → Union[_io.BytesIO, pathlib.Path]

Function defines how to download a single URL. It is used by the download_urls function to enable multithreaded downloading.

Function makes a GET request to the specified URL. If an exception occurs, the function retries the download until it succeeds or until it reaches max_retries unsuccessful attempts.
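The retry policy described above can be sketched as follows. This is a minimal illustration, not the library's actual code: download_with_retries and fetch are hypothetical names, and the real logic lives inside download_to_bytes_io.

```python
import time


def download_with_retries(fetch, max_retries=10, sleep_time=0.0):
    """Hypothetical sketch of the retry policy: call ``fetch`` until it
    succeeds or until ``max_retries`` attempts have failed."""
    for _attempt in range(max_retries):
        # Throttle requests so the target site does not flag them as a DDoS.
        time.sleep(sleep_time)
        try:
            return fetch()
        except Exception:
            continue  # retry on any download error
    raise RuntimeError("file marked as undownloadable")
```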

Parameters
  • max_retries (int) – The number of retries the function makes before marking a file as undownloadable

  • sleep_time (float) – A sleep time in seconds used to prevent sites from flagging the downloads as a DDoS attack

  • streamable (bool) – Sets the request's stream parameter. More info: https://2.python-requests.org/en/v2.8.1/user/advanced/#body-content-workflow

  • custom_headers (dict) – Custom headers used in each request

  • url (str) – A valid HTTP/HTTPS URL

  • temp_path (str) – A temporary path where downloaded files are saved (used only when use_memory is set to False)

  • use_memory (bool) – When set to True, the script downloads all data into memory. Otherwise, downloaded data is saved as temporary files on disk.

Returns

the downloaded file as an io.BytesIO object (when use_memory is True) or a pathlib.Path to a temporary file on disk (when use_memory is False)

Return type

Union[io.BytesIO, pathlib.Path]

static download_urls(urls: List[str], file_type_handler: Callable, threads: int = 8, sleep_time: float = 0, custom_headers: Iterable = repeat({}), streamable: Iterable = repeat(False), max_retries: int = 10, use_memory: bool = True, temp_dir: Optional[pathlib.Path] = None) → list

Function allows the user to download files from a list of provided URLs. Each file is downloaded as BytesIO using the specified number of threads, and then file_type_handler is used to convert the file from BytesIO to the target format. Each file type should have its own file_type_handler.
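A file_type_handler is simply a callable that takes the downloaded BytesIO and returns the processed result. MordinezNLP ships ready-made processors (such as text_data_processor, used in the examples below); a minimal sketch of a hypothetical custom handler could look like this:

```python
from io import BytesIO


def my_text_handler(data: BytesIO) -> str:
    # Hypothetical handler (not part of MordinezNLP): decode the raw
    # downloaded bytes as UTF-8 text and return the resulting string.
    return data.read().decode("utf-8")
```

Any callable with this shape can be passed as file_type_handler to download_urls.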

sleep_time is used to prevent sites from flagging the downloads as a DDoS attack. Before downloading a file, the thread sleeps for the specified amount of time.

Each thread (and each single URL) uses the download_to_bytes_io function.

Using use_memory, the user can decide whether files are downloaded into memory or saved as temporary files on disk.

Parameters
  • urls (List[str]) – List of URLs of files to download

  • file_type_handler (Callable) – Function used to convert downloaded file to a specified format

  • threads (int) – Number of threads used to download the files

  • sleep_time (float) – Time used to prevent file downloads from being detected as a DDoS attack

  • max_retries (int) – Refer to download_to_bytes_io function documentation.

  • streamable (Iterable) – Refer to the download_to_bytes_io function documentation.

  • custom_headers (Iterable) – Refer to the download_to_bytes_io function documentation.

  • use_memory (bool) – The downloader can download files into memory or into binary files on disk. Use in-memory downloading when you know you will download a small amount of data; otherwise set this value to False.

  • temp_dir (Path) – Path to a directory where temporary files will be saved.

Returns

A list of downloaded files, each processed by the file_type_handler function.

Return type

list
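Note that custom_headers and streamable are iterables consumed alongside urls (they default to repeat({}) and repeat(False), so the same value applies to every request). A sketch of supplying per-URL headers, assuming the iterables are iterated in step with urls; the URLs and header values here are illustrative only:

```python
from itertools import repeat

urls = [
    "https://example.com/a.txt",
    "https://example.com/b.txt",
]

# One headers dict per URL; a plain list works because it is consumed
# in step with `urls` (assumption based on the signature above).
custom_headers = [
    {"User-Agent": "MordinezNLP-example"},
    {"User-Agent": "MordinezNLP-example", "Accept": "text/plain"},
]

# When every request shares the same value, repeat() avoids building a list.
streamable = repeat(False)

# Each request ends up paired with its own headers and stream flag.
per_request = list(zip(urls, custom_headers, streamable))
```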

Example usage for TXT files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import text_data_processor

downloaded_elements = BasicDownloader.download_urls(
    [
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/requirements.txt",
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/LICENSE"
    ],
    lambda x: text_data_processor(x),
)

print(downloaded_elements) # <- displays a list where each element is the content of a downloaded file

Example usage for PDF files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import pdf_data_processor

downloaded_pdfs = BasicDownloader.download_urls(
    [
        "https://docs.whirlpool.eu/_doc/19514904100_PL.pdf",
        "https://mpm.pl/docs/_instrukcje/WA-6040S_instrukcja.pdf",
    ],
    lambda x: pdf_data_processor(x)
)

print(downloaded_pdfs) # <- displays a list where each element is the content of a downloaded file