Downloaders - Basic

class MordinezNLP.downloaders.Basic.BasicDownloader

Class that helps to download multiple files from a list of provided links using multithreading.

static download_to_bytes_io(url: str, temp_path: str, use_memory: bool, custom_headers: dict = {}, streamable: bool = False, sleep_time: float = 0, max_retries: int = 10) → Union[_io.BytesIO, pathlib.Path]

Function defines how to download a single URL. It is used by the download_urls function to enable multithreaded downloading.

Function makes a GET request to the specified URL. If an exception occurs, the function retries the download until it succeeds or until it reaches max_retries unsuccessful attempts.
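The retry policy described above can be sketched as follows. This is a minimal illustration, not the library's actual code: download_with_retries and fetch are hypothetical names, and the real logic lives inside download_to_bytes_io.

```python
import time


def download_with_retries(fetch, max_retries=10, sleep_time=0.0):
    """Hypothetical sketch of the retry policy: call ``fetch`` until it
    succeeds or until ``max_retries`` attempts have failed."""
    for _attempt in range(max_retries):
        # Throttle requests so the target site does not flag them as a DDoS.
        time.sleep(sleep_time)
        try:
            return fetch()
        except Exception:
            continue  # retry on any download error
    raise RuntimeError("file marked as undownloadable")
```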

Parameters
  • max_retries (int) – The number of retries the function makes before marking a file as undownloadable

  • sleep_time (float) – A sleep time in seconds used to prevent sites from flagging the downloads as a DDoS attack

  • streamable (bool) – Sets the request's stream parameter. More info: https://2.python-requests.org/en/v2.8.1/user/advanced/#body-content-workflow

  • custom_headers (dict) – Custom headers used in each request

  • url (str) – A valid HTTP/HTTPS URL

  • temp_path (str) – A temporary path where downloaded files are saved (used only when use_memory is set to False)

  • use_memory (bool) – When set to True, the script downloads all data into memory. Otherwise, downloaded data is saved as temporary files on disk.

Returns

the downloaded file as an io.BytesIO object (when use_memory is True) or a pathlib.Path to a temporary file on disk (when use_memory is False)

Return type

Union[io.BytesIO, pathlib.Path]

static download_urls(urls: List[str], file_type_handler: Callable, threads: int = 8, sleep_time: float = 0, custom_headers: Iterable = repeat({}), streamable: Iterable = repeat(False), max_retries: int = 10, use_memory: bool = True, temp_dir: Optional[pathlib.Path] = None) → list

Function allows the user to download files from a list of provided URLs. Each file is downloaded as BytesIO using the specified number of threads, and then file_type_handler is used to convert the file from BytesIO to the target format. Each file type should have its own file_type_handler.
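A file_type_handler is simply a callable that takes the downloaded BytesIO and returns the processed result. MordinezNLP ships ready-made processors (such as text_data_processor, used in the examples below); a minimal sketch of a hypothetical custom handler could look like this:

```python
from io import BytesIO


def my_text_handler(data: BytesIO) -> str:
    # Hypothetical handler (not part of MordinezNLP): decode the raw
    # downloaded bytes as UTF-8 text and return the resulting string.
    return data.read().decode("utf-8")
```

Any callable with this shape can be passed as file_type_handler to download_urls.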

sleep_time is used to prevent sites from flagging the downloads as a DDoS attack. Before downloading a file, the thread sleeps for the specified amount of time.

Each thread (and each single URL) uses the download_to_bytes_io function.

Using use_memory, the user can decide whether files are downloaded into memory or saved as temporary files on disk.

Parameters
  • urls (List[str]) – List of URLs of files to download

  • file_type_handler (Callable) – Function used to convert downloaded file to a specified format

  • threads (int) – Number of threads used to download the files

  • sleep_time (float) – Time used to prevent file downloads from being detected as a DDoS attack

  • max_retries (int) – Refer to download_to_bytes_io function documentation.

  • streamable (Iterable) – Refer to the download_to_bytes_io function documentation.

  • custom_headers (Iterable) – Refer to the download_to_bytes_io function documentation.

  • use_memory (bool) – The downloader can download files into memory or into binary files on disk. Use in-memory downloading when you know you will download a small amount of data; otherwise set this value to False.

  • temp_dir (Path) – Path to a directory where temporary files will be saved.

Returns

A list of downloaded files, each processed by the file_type_handler function.

Return type

list
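Note that custom_headers and streamable are iterables consumed alongside urls (they default to repeat({}) and repeat(False), so the same value applies to every request). A sketch of supplying per-URL headers, assuming the iterables are iterated in step with urls; the URLs and header values here are illustrative only:

```python
from itertools import repeat

urls = [
    "https://example.com/a.txt",
    "https://example.com/b.txt",
]

# One headers dict per URL; a plain list works because it is consumed
# in step with `urls` (assumption based on the signature above).
custom_headers = [
    {"User-Agent": "MordinezNLP-example"},
    {"User-Agent": "MordinezNLP-example", "Accept": "text/plain"},
]

# When every request shares the same value, repeat() avoids building a list.
streamable = repeat(False)

# Each request ends up paired with its own headers and stream flag.
per_request = list(zip(urls, custom_headers, streamable))
```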

Example usage for TXT files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import text_data_processor

downloaded_elements = BasicDownloader.download_urls(
    [
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/requirements.txt",
        "https://raw.githubusercontent.com/BMarcin/MordinezNLP/main/LICENSE"
    ],
    lambda x: text_data_processor(x),
)

print(downloaded_elements) # <- displays a list where each element is the content of a downloaded file

Example usage for PDF files:

from MordinezNLP.downloaders import BasicDownloader
from MordinezNLP.downloaders.Processors import pdf_data_processor

downloaded_pdfs = BasicDownloader.download_urls(
    [
        "https://docs.whirlpool.eu/_doc/19514904100_PL.pdf",
        "https://mpm.pl/docs/_instrukcje/WA-6040S_instrukcja.pdf",
    ],
    lambda x: pdf_data_processor(x)
)

print(downloaded_pdfs) # <- displays a list where each element is the content of a downloaded file