

Working on GPU-accelerated data science libraries at NVIDIA, I think about accelerating code through parallelism and concurrency pretty frequently. You might even say I think about it all the time.
In light of that, I recently took a look at some of my old web scraping code across various projects and realized I could have gotten results much faster if I had just made a small change and used Python’s built-in concurrent.futures library. I wasn’t as well versed in concurrency and asynchronous programming back in 2016, so this didn’t even enter my mind.

In this post, I’ll use concurrent.futures to make a simple web scraping task 20x faster on my 2015 MacBook Air. I’ll briefly touch on how multithreading is possible here and why it’s better than multiprocessing, but won’t go into detail. This is really just about highlighting how you can do faster web scraping with almost no changes.

Let’s say you wanted to download the HTML for a bunch of stories submitted to Hacker News. I’ll walk through a quick example below.

First, we need to get the URLs of all the posts. Since there are 30 per page, we only need a few pages to demonstrate the power of multithreading. requests and BeautifulSoup make extracting the URLs easy. Let’s also make sure to sleep for a bit between calls, to be nice to the Hacker News server. Even though we’re only making 10 requests, it’s good to be nice.
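A minimal sketch of that step might look like the following. The `span.titleline > a` selector assumes Hacker News’s current markup, the 0.25-second pause is an arbitrary choice of “a bit,” and the filter that skips relative links (e.g., Ask HN posts) is my own addition; adjust as needed.

```python
import time

import requests
from bs4 import BeautifulSoup

story_urls = []
for page in range(1, 11):  # 10 front pages, 30 stories each
    resp = requests.get(f"https://news.ycombinator.com/news?p={page}")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each story title is a link inside <span class="titleline">
    # (this selector assumes Hacker News's current markup).
    for link in soup.select("span.titleline > a"):
        href = link["href"]
        if href.startswith("http"):  # skip relative links like Ask HN posts
            story_urls.append(href)
    time.sleep(0.25)  # pause briefly between calls to be nice to the server

print(f"Collected {len(story_urls)} story URLs")
```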
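With the URLs in hand, a straightforward sequential scraper just downloads one story at a time. The sketch below is illustrative rather than the original code: the `download_url` helper, the `stories/` output directory, and the crude filename scheme are all assumptions for the demo.

```python
import os
import time

import requests


def download_url(url, dest_dir="stories"):
    """Download one story's HTML and write it to disk (illustrative helper)."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return  # skip stories that fail to download
    # Derive a crude filename from the URL; good enough for a demo.
    fname = url.replace("/", "_").replace(":", "_") + ".html"
    with open(os.path.join(dest_dir, fname), "wb") as f:
        f.write(resp.content)


os.makedirs("stories", exist_ok=True)
start = time.time()
for url in story_urls:  # collected in the previous snippet
    download_url(url)
print(f"{time.time() - start} seconds to download {len(story_urls)} stories")
```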
Running it on the full set printed:

319.86593675613403 seconds to download 289 stories

As expected, this scales pretty poorly. On the full 289 files, this scraper took 319.86 seconds. At this point, we’re definitely screwed if we need to scale up and we don’t change our approach.
