A Minimalist End-to-End Scrapy Tutorial (Part V)

Systematic Web Scraping for Beginners

Harry Wang
4 min read · Apr 13, 2020
Photo by Sarah Sosiak on Unsplash

Part I, Part II, Part III, Part IV, Part V

In this last part of the tutorial series, I am going to briefly cover a very important topic: dynamic scraping. The example code can be found in the following repo:

In real web scraping projects, you often cannot crawl websites directly using the approach presented in the first four parts of this tutorial, for various reasons: the web page may be generated dynamically (as in this tutorial's example, where more content is loaded when the user scrolls to the bottom of the current page), you may need to log in first via a dynamic login form, and so on. In such situations, one option is to use Selenium (https://www.selenium.dev/) to simulate real user actions by controlling the browser and collect the data that way.

The webpage we are going to crawl is https://dribbble.com/designers, which is an infinite scroll page: more content appears each time you scroll to the bottom of the page. Selenium enables us to control a browser using code, and we use Chrome in this example. Also, make sure you install Selenium and Scrapy as shown in the requirements.txt file.

First, you need to install Chrome on the machine where you are going to run the scraping code and download the Chrome driver file from https://chromedriver.chromium.org/downloads for Selenium. Make sure the driver version matches the installed Chrome version (check it via Menu → Chrome → About Google Chrome).

You have to replace the Chrome Driver file in the repo with the correct version for the code to work!!
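
If you want to quickly verify that the driver and your Chrome installation match before running the spider, a minimal sketch like the following opens the page and prints its title (Selenium 3 style, as used at the time of writing; the ./chromedriver path is an assumption about where you placed the binary in the repo):

from selenium import webdriver

# the path below is an assumption: point it at wherever you saved the driver binary
driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://dribbble.com/designers')
print(driver.title)  # a version mismatch raises an error when the driver starts
driver.quit()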

Given the code is quite simple, I won’t go into the details and only explain the key ideas. In the spider file:

  • I first use last_height = driver.execute_script("return document.body.scrollHeight") to get the current height of the page
  • then I use driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") to scroll to the bottom of the page and grab the currently loaded page content
  • pause 5 seconds and repeat, collecting more page content each time, until I cannot scroll any further or hit the predefined maximum number of scrolls (10 in this case); see the sketch after this list
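
Putting those three steps together, the loop looks roughly like this (a sketch only: driver is the Chrome webdriver created by the spider, and the names MAX_SCROLLS and SCROLL_PAUSE are mine, not taken from the repo):

from time import sleep

MAX_SCROLLS = 10   # predefined maximum number of scrolls
SCROLL_PAUSE = 5   # seconds to wait for new content to load

last_height = driver.execute_script("return document.body.scrollHeight")
for _ in range(MAX_SCROLLS):
    # scroll to the bottom so the next batch of content is loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(SCROLL_PAUSE)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page did not grow, so there is nothing left to load
    last_height = new_height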

I also included a small example to show how you can automatically find the search box, enter “New York”, and click the search button (pause 1 second between actions):

from time import sleep

search_location = driver.find_element_by_css_selector('#location-selectized')
search_location.send_keys('New York')
sleep(1)
search_button = driver.find_element_by_css_selector('input[type="submit"]')
search_button.click()
sleep(5)

Now, when you run scrapy crawl dribbble, an instance of Chrome starts and you can watch it scroll to the bottom of the page and perform the search actions I just mentioned, all fully automated :). The extracted data is logged to the console.

In the repo, I also included code that shows how you can use a proxy service, ProxyMesh, to rotate your IP addresses and avoid being banned by the website. You should never crawl any website aggressively; doing so is essentially a form of denial-of-service (DoS) attack.

For ProxyMesh, you need to sign up for an account, after which you get a proxy server address such as http://harrywang:mypassword@us-wa.proxymesh.com:31280. Set the local http_proxy environment variable: export http_proxy=http://harrywang:mypassword@us-wa.proxymesh.com:31280, then activate the HttpProxyMiddleware by uncommenting the following part in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
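
In other words, with the proxy address above, a typical run from the terminal looks like this (the credentials are placeholders from the example):

# set the proxy that Scrapy's HttpProxyMiddleware will pick up
export http_proxy=http://harrywang:mypassword@us-wa.proxymesh.com:31280

# then run the spider as usual
scrapy crawl dribbble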

For ProxyMesh to work with Selenium, do the following two steps:

  1. Add the IP of the machine running the scraping code to ProxyMesh for IP authentication.
  2. Uncomment the following two lines in the dribbble_spider.py file:
# PROXY = "us-wa.proxymesh.com:31280"
# chrome_options.add_argument('--proxy-server=%s' % PROXY)
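
For context, once uncommented, these lines feed the proxy into the Chrome options that the spider passes to the webdriver; the surrounding setup looks roughly like this (a sketch, not the exact repo code, and the executable_path is an assumption about where the driver binary lives):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

PROXY = "us-wa.proxymesh.com:31280"

chrome_options = Options()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

driver = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options)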

That’s it! Thanks for reading!

Part I, Part II, Part III, Part IV, Part V
