Downloaders - CommonCrawl

class MordinezNLP.downloaders.CommonCrawlDownloader.CommonCrawlDownloader(links_to_search: List[str], index_name: str = 'CC-MAIN-2020-24', base_index_url: str = 'http://index.commoncrawl.org', search_for_mime: str = 'text/html', search_for_language: str = 'eng', threads: int = 8)

Downloads Common Crawl (CC) data using the basic multithreaded downloader.
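The constructor defaults suggest that each pattern in links_to_search is looked up in the Common Crawl URL index. The following is a minimal sketch of what such a lookup URL could look like, assuming the public CDX-style index endpoint with its `url` and `output` query parameters. The helper name `build_index_query` is hypothetical and is not part of MordinezNLP; how the library builds its requests internally is not documented here.

```python
from urllib.parse import urlencode


def build_index_query(
    link_pattern: str,
    index_name: str = "CC-MAIN-2020-24",
    base_index_url: str = "http://index.commoncrawl.org",
) -> str:
    """Hypothetical helper: build a Common Crawl index lookup URL for one pattern."""
    # Assumption: the index endpoint lives at "<base_index_url>/<index_name>-index".
    # urlencode percent-encodes the pattern so it is safe inside a query string.
    query = urlencode({"url": link_pattern, "output": "json"})
    return f"{base_index_url}/{index_name}-index?{query}"


# Build the lookup URL for one of the patterns used in the example usage.
index_url = build_index_query("reddit.com/r/space/*")
print(index_url)
```

The defaults mirror the constructor signature above, so changing index_name would point the lookup at a different crawl snapshot.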

download(save_to: str, base_url: str = 'https://commoncrawl.s3.amazonaws.com', sleep_time: float = 0)

Main method for downloading Common Crawl (CC) data using the multithreaded base downloader.

Parameters
  • save_to (str) – path to the folder where the data will be saved. Each file is an HTML document downloaded from CC.

  • base_url (str) – base CC URL, for example: https://commoncrawl.s3.amazonaws.com

  • sleep_time (float) – pause, in seconds, between downloads, used so that the target server does not treat the requests as a DDoS attack
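To illustrate what a sleep_time pause means in practice, here is a minimal single-threaded sketch that waits between consecutive fetches. It is not the library's implementation (which is multithreaded); the helper `fetch_all_with_pause` and its `fetch` callback are hypothetical names introduced only for this illustration.

```python
import time
from typing import Callable, Iterable, List


def fetch_all_with_pause(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    sleep_time: float = 0.0,
) -> List[bytes]:
    """Hypothetical helper: call fetch(url) for each URL, pausing after each call."""
    results: List[bytes] = []
    for url in urls:
        results.append(fetch(url))
        if sleep_time > 0:
            # Throttle so that requests are spaced out in time.
            time.sleep(sleep_time)
    return results


# Stand-in fetch function so the sketch runs without network access.
pages = fetch_all_with_pause(["a", "b"], lambda u: u.encode(), sleep_time=0.01)
```

With the default of 0 no pause is inserted, which matches the default value in the download signature.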

Example usage:

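from MordinezNLP.downloaders import CommonCrawlDownloader

# URL patterns to look up in the Common Crawl index; a trailing * matches any path suffix
ccd = CommonCrawlDownloader(
    [
        "reddit.com/r/space/*",
        "reddit.com/r/spacex/*",
    ]
)
# Save each downloaded HTML document into ./test_data
ccd.download('./test_data')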