Human-Machine Knowledge Transfer

Photo by Franck V. on Unsplash

OpenAI debuted its largest language model, GPT-3, with 175 billion parameters back in May and began opening its API to selected users last week. You can check out the MIT Technology Review article and a blog post by Arram Sabeti to see many interesting generated texts (lyrics, stories, conversations, user manuals, and even guitar tabs and JSX code).

The MIT article points out:

“Exactly what’s going on inside GPT-3 isn’t clear. But what it seems to be good at is synthesizing text it has found elsewhere on the internet, making…”
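For those with beta access, the API itself is a simple text-in, text-out interface. Below is a minimal sketch of what a call looks like with the openai Python client; the engine name and fields are assumptions based on the early beta documentation, not something verified against a live key:

```python
# A hedged sketch of an early GPT-3 API call; requires beta access
# and the openai package. Engine name and fields are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder key issued to beta users

response = openai.Completion.create(
    engine="davinci",            # assumed engine name from the beta docs
    prompt="Write a short poem about the ocean:",
    max_tokens=64,
)
print(response.choices[0].text)
```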


Systematic Web Scraping for Beginners

Photo by Sarah Sosiak on Unsplash

Part I, Part II, Part III, Part IV, Part V

In this last part of the tutorial series, I am going to briefly cover a very important topic: dynamic scraping. The example code can be found in the following repo:

In real web scraping projects, you often cannot crawl websites directly using the approach presented in the first four parts of this tutorial, for various reasons. For example, the web page may be generated dynamically (as in this tutorial’s example, where new content is loaded when a user scrolls to the bottom of the current page), you…
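One common trick for such infinite-scroll pages is to skip browser automation entirely: find the JSON endpoint the scroll event calls and request it directly. Here is a minimal sketch, assuming the scroll demo on quotes.toscrape.com is backed by an /api/quotes endpoint; check your browser’s network tab to confirm the exact URL and response shape for any site you target:

```python
# A minimal sketch: scrape an infinite-scroll page by calling the JSON
# API behind it directly. The endpoint and response fields are assumed
# from inspecting the network tab; verify them for your target site.
import requests

page = 1
while True:
    resp = requests.get(f"http://quotes.toscrape.com/api/quotes?page={page}")
    data = resp.json()
    for quote in data["quotes"]:
        print(quote["author"]["name"], ":", quote["text"][:50])
    if not data.get("has_next"):
        break
    page += 1
```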


Photo by Kelli McClintock on Unsplash

I often use and study open-source code from the Internet and run into the following situation:

I find a nice code repo on GitHub, clone it to my local computer, and then have to spend a lot of time just trying to get the code up and running, because I have to figure out things like the Python version, the required packages with specific versions, Jupyter configurations, etc.

In this short tutorial, I show how I organize a self-contained Python project so that it can be up and running with minimal effort. …
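As a taste of the idea, one small piece of a self-contained repo is pinning the interpreter version. Here is a minimal sketch; the .python-version file is a pyenv convention, and its presence is an assumption about the project layout:

```python
# A minimal sketch: fail fast if the interpreter does not match the
# version pinned in .python-version (a pyenv convention; the file
# name is an assumed project-layout detail).
import sys
from pathlib import Path

pinned = Path(".python-version").read_text().strip()   # e.g. "3.8.5"
running = ".".join(str(v) for v in sys.version_info[:3])
if not running.startswith(pinned):
    sys.exit(f"This project expects Python {pinned}, but found {running}")
```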


Photo by Hal Gatewood on Unsplash

GitHub Pages allows you to host static markdown/web pages (HTML and JS) for free. You only need a few steps to get a website up and running.

I will show you how to build such a website using a course website as an example: https://harrywang.github.io/misy331/.
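Since the hosted pages are plain HTML/JS, you can also sanity-check them locally with nothing but the Python standard library. This is a convenience sketch, not part of the GitHub Pages setup itself:

```python
# A convenience sketch: serve the repo's static files locally to
# preview them before pushing (standard library only; port is arbitrary).
import http.server
import socketserver

PORT = 8000
handler = http.server.SimpleHTTPRequestHandler
with socketserver.TCPServer(("", PORT), handler) as httpd:
    print(f"Previewing at http://localhost:{PORT} (Ctrl+C to stop)")
    httpd.serve_forever()
```

Run it from the repo root and open the printed URL in a browser.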

Step 1. Create an empty repo with basic information (you can leave .gitignore and the license at their default empty settings):


Systematic Web Scraping for Beginners

Photo by Paweł Czerwiński on Unsplash

Part I, Part II, Part III, Part IV, Part V

In the previous three parts, you developed a spider that extracts quote information from http://quotes.toscrape.com and stores the data in a local SQLite database. In this part, I will show you how to deploy the spider to the cloud.

First, let’s see how you can deploy to https://scrapinghub.com, the commercial service run by the team behind the open-source Scrapy framework.

Create a free account and a new project:
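Once the project exists and the spider is deployed, you can also start jobs programmatically. Here is a hedged sketch using the python-scrapinghub client; the API key and project id are placeholders, and I am assuming the spider is named quotes as in the earlier parts:

```python
# A hedged sketch: schedule a run of the deployed spider via the
# python-scrapinghub client. API key, project id, and spider name
# are placeholders/assumptions.
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(123456)   # your Scrapinghub project id
job = project.jobs.run("quotes")       # assumed spider name
print(job.key)                         # job identifier you can track in the UI
```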


Systematic Web Scraping for Beginners

Photo by Sarah Dorweiler on Unsplash

Part I, Part II, Part III, Part IV, Part V

In Part II, you extracted all the required data from the website and stored it in Items. In Part III, I will introduce Item Pipelines to save the extracted data into a database using an ORM (SQLAlchemy) and to handle the duplicate data issue.

Each item returned by the spider is sent sequentially to the Item Pipelines (if any) for additional processing, such as saving items to the database, validating data, removing duplicates, etc. Item pipelines are defined as classes in the pipelines.py file.
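To make the pattern concrete, here is a minimal sketch of a pipeline that drops duplicates and saves items with SQLAlchemy. The model and field names are illustrative placeholders, not the tutorial’s exact code:

```python
# A minimal sketch of an Item Pipeline: drop duplicate quotes, then
# save each item with SQLAlchemy. Model and field names are illustrative.
from scrapy.exceptions import DropItem
from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Quote(Base):                      # illustrative ORM model
    __tablename__ = "quote"
    id = Column(Integer, primary_key=True)
    content = Column(Text, unique=True)

class SaveQuotesPipeline:
    def open_spider(self, spider):
        engine = create_engine("sqlite:///quotes.db")
        Base.metadata.create_all(engine)
        self.session = sessionmaker(bind=engine)()
        self.seen = set()

    def process_item(self, item, spider):
        # duplicate handling: skip items whose text we have already stored
        if item["quote_content"] in self.seen:
            raise DropItem("Duplicate quote")
        self.seen.add(item["quote_content"])
        self.session.add(Quote(content=item["quote_content"]))
        self.session.commit()
        return item

    def close_spider(self, spider):
        self.session.close()
```

Remember that a pipeline only runs after it is enabled in ITEM_PIPELINES in settings.py.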


Systematic Web Scraping for Beginners

Photo by Igor Son on Unsplash

Part I, Part II, Part III, Part IV, Part V

In Part I, you learned how to set up a Scrapy project and write a basic spider that extracts data from web pages by following page navigation links. However, the extracted data were merely printed to the console. In Part II, I will introduce the concepts of Item and ItemLoader and explain why you should use them to store the extracted data.
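As a preview of where this part is heading, here is a minimal sketch of the Item/ItemLoader pattern; the field names and CSS selectors are illustrative, not the tutorial’s exact code:

```python
# A minimal sketch: define an Item and populate it with an ItemLoader
# inside a spider callback. Field and selector names are illustrative.
import scrapy
from scrapy.loader import ItemLoader

class QuoteItem(scrapy.Item):
    quote_content = scrapy.Field()
    author_name = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css("quote_content", "span.text::text")
            loader.add_css("author_name", "small.author::text")
            yield loader.load_item()
```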

Let’s first look at Scrapy Architecture:


Systematic Web Scraping for Beginners

Photo by Paweł Czerwiński on Unsplash

Part I, Part II, Part III, Part IV, Part V

Web scraping is an important skill for data scientists. I have developed a number of ad hoc web scraping projects using Python, BeautifulSoup, and Scrapy in the past few years and read a few books and tons of online tutorials along the way. …


Start a Deep Learning AMI — make sure to choose a GPU one!
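If you prefer doing this from code rather than the console, here is a hedged boto3 sketch. The AMI id is a placeholder (look up the current Deep Learning AMI id for your region), and note that GPU instance types incur charges:

```python
# A hedged sketch: launch a GPU instance from a Deep Learning AMI with
# boto3. The AMI id and key pair are placeholders; GPU instances cost money.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Deep Learning AMI id
    InstanceType="p2.xlarge",          # a GPU instance type
    KeyName="my-key-pair",             # assumed existing EC2 key pair
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```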


(Originally written on May 17, 2016)

Photo by Andy Brunner on Unsplash

Manually setting up a few systems (Hadoop, Spark, HBase, and Hive) for big data analytics greatly helped me understand some key big data concepts. Check out the full tutorial: big data 101 cookbook.

Harry Wang
