Downloaders - CommonCrawl
class MordinezNLP.downloaders.CommonCrawlDownloader.CommonCrawlDownloader(links_to_search: List[str], index_name: str = 'CC-MAIN-2020-24', base_index_url: str = 'http://index.commoncrawl.org', search_for_mime: str = 'text/html', search_for_language: str = 'eng', threads: int = 8)
Class used to download Common Crawl data using the Basic multithreaded downloader.
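The constructor arguments map naturally onto a query against the public Common Crawl index server. The sketch below is an illustration only, not the library's internal code: the helper name build_index_query is hypothetical, and the `url`, `output` and `filter` query parameters (with the `mime` and `languages` fields) are assumptions based on the public CDX index API. It only builds the URL and makes no network request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_index_query(
    url_pattern: str,
    index_name: str = "CC-MAIN-2020-24",
    base_index_url: str = "http://index.commoncrawl.org",
    search_for_mime: str = "text/html",
    search_for_language: str = "eng",
) -> str:
    """Hypothetical helper: build an index query URL for one link pattern."""
    params = [
        ("url", url_pattern),            # pattern such as "reddit.com/r/space/*"
        ("output", "json"),              # assumed: one JSON record per line
        ("filter", f"mime:{search_for_mime}"),           # assumed field name
        ("filter", f"languages:{search_for_language}"),  # assumed field name
    ]
    return f"{base_index_url}/{index_name}-index?{urlencode(params)}"


query = build_index_query("reddit.com/r/space/*")
# Decode the query string again to inspect what would be sent.
decoded = parse_qs(urlsplit(query).query)
print(decoded["url"], decoded["filter"])
```

Each entry of links_to_search would presumably produce one such query, and the threads argument would control how many run in parallel.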
download(save_to: str, base_url: str = 'https://commoncrawl.s3.amazonaws.com', sleep_time: float = 0)
Main function used to download CC data using the multithreaded Base Downloader.
Parameters
save_to (str) – path to the folder where the data will be downloaded. Each file is an HTML document downloaded from CC.
base_url (str) – base CC URL, for example https://commoncrawl.s3.amazonaws.com
sleep_time (float) – sleep time in seconds, used to prevent sites from detecting the downloads as a DDoS attack
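For context on base_url: a Common Crawl index record identifies a document by an archive file name plus a byte offset and length, and the record is usually fetched with an HTTP Range request against the base URL. The sketch below shows that arithmetic under those assumptions; build_record_request and the sample file name are hypothetical, and this is not necessarily how download() is implemented.

```python
from typing import Dict, Tuple


def build_record_request(
    filename: str,
    offset: str,
    length: str,
    base_url: str = "https://commoncrawl.s3.amazonaws.com",
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: URL and Range header for one archived record."""
    start = int(offset)           # offset and length may arrive as strings
    end = start + int(length) - 1  # HTTP byte ranges are inclusive
    return f"{base_url}/{filename}", {"Range": f"bytes={start}-{end}"}


# Made-up record values, for illustration only.
url, headers = build_record_request("crawl-data/example.warc.gz", "100", "50")
print(url, headers)
```

A caller would pass the URL and headers to an HTTP client and pause for sleep_time seconds between requests.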
Example usage:
from MordinezNLP.downloaders import CommonCrawlDownloader

ccd = CommonCrawlDownloader(
    [
        "reddit.com/r/space/*",
        "reddit.com/r/spacex/*",
    ]
)
ccd.download('./test_data')