Using Python and Common-Crawl to find products from

Common-Crawl (CC) is an Awesome free and open source collection of crawled data from the world wide web spanning back many years. Common-Crawl does what Google and Bing does but allows for anyone to access their information and analysis the information for free and to use the data commercially for free. The data sets recorded are now tipping many petabytes and are stored on AWS S3 for free courteous of Amazon.

Today we will be investing this available information by using Python and a couple plugins to analyze the stored raw HTML code.
The data stored is in a compressed format due to many pure HTML pages, hence finding specific pages can be difficult. Common-Crawl have provided a useful API ( which Apps may access and use to find all pages with a specific domain name. When visiting the API page, you may notice how there is a long list of entries which start with “/CC-MAIN-…”. This long list is the snapshots or datasets that Common-Crawl find each month, this means that you may actually go back many months or years to find and extract information. For this example, we will use the latest dataset CC-MAIN-2017-39-index.

Below is the API call we will use in our application. The first %s is the data set number so eg. “2017-39” and the second is the %s is the desired domain to search.

The python code accepts a domain name to search and returns all the URLs which belong to the domain name from the common crawl dataset. Once the URLs have been downloaded and stored in a list, we can move on to downloading the compressed page and performing the analysis.

The python code above uses the inbuilt python library “requests” to download a compressed page from the dataset saved on Amazons AWS S3. The downloaded page is then extracted using the “Gzip” library and the raw HTML data is returned as the response. Once the page is downloaded we can use our custom function with the help of the python plugin “Beautifulsoup” to find specific data residing in the HTML code.

Now, there is a bit to explain here.

The function above accepts two inputs, the HTML code and the URL for the page. The page first initialises the BeautifulSoup library to a variable called parser. The function will then check if the page that it is currently inspecting is definitely a products page, as the common crawl API will return pages and URLs which are mixed with menus, corporate info, deals pages, etc. This is done using the function below.

The function above using Beautifulsoup to find certain HTML elements for example divs, spans, and in this case bold statements <b> and checks the content within the bold statement. If the statement contains a string called “ASIN:” then we can be assured that there is a high chance that the page is definitely a product and the function returns the Asin and a boolean True.

If the page is recognised as a product then the extract_product function will create a Product object to store the information. The class is a good way to store and manage the information as there is a specific model and not randomly stored in a JSON format.

The product class is not too difficult as it contains setters and getters as well as specific functions which helps generate a JSON object and prints the information to the terminal.

Once the product object is created and complete with information then the object is added to a buffer which a secondary multithreaded function handles the save feature of the function to an AWS DynamoDB database. Below is the save thread class which handles the connection and processing.

Please note when creating a DynamoDB table, enter a string to the table index. In this case we use “uid”.

Finally we can wrap the whole python application with the main function

Here we see the URLs found, save thread created and started, as well as two product finder classes which handle half the URL list each. To prevent the main function finishing and closing before the multithreaded classes, we enable idlers which prevent the main function closing before the URL lists have emptied and the save product buffer is emptied as well.

Please find my GitHub page to see the full version of the code:

Thank you for the read and as all ways, Stay Awesome!

Also, note the application works best on Unix based machines, for example, Linux and Mac.
I’m currently using an Apple MacBook Pro to run this code.

Pseudocode Break Down

1. Search domain – Common-crawl API

2. Add URLs to list

3. Start saving thread

4. Slip up URLs into two threads

5. Loop through list of URLs, Download page and confirm if product

* Full HTML Archive copy from Common-crawl

6. If Page is a product, extract details and create new product Object

7. Add object to save buffer, which adds it to DynamoDB

8. Loop until list completed

Tags from the story
, ,
Written By

Engineering Student who lives in Australia. Interests include Technology from Computers to Web Applications.

1 Comment

Leave a Reply

Your email address will not be published. Required fields are marked *