Thursday, January 24, 2013

Scraping away at the JigSaw

Introduction

For a concept idea that I am working on, I needed some realistic sample people data. The data fields are name, address, email, phone, occupation, etc. I know that this data could easily be fabricated; I tried that out and got entries like this:

Jules abdoullah Desai,
123 Farber Lakes Dr,
Monroeville, Texas

That address is not convincing and that person may not exist. I wanted to get the sample data from a source that gives some distribution of people across locations (city, state, etc.) and real, valid addresses (or at least close enough). The following example looks more convincing to me:

Ibrahim Khan
1 Lilly Corporate Ctr
Indianapolis, IN 46285-0001
United States
This blog post details the data source that provided me with this type of data and how I built a program to extract information from it. By the end of this post, you will have some idea of:
  • The data source
  • The method used to get the data
  • Web scraping
  • The program and libraries that helped in extracting the information

Data source

I searched for a data source (web site) that could give me such a dataset. The first stop was www.whitepages.com, where you can get hold of a list of people by searching on a last name. The search produced results like the one below:

[Screenshot: whitepages.com search result]

As you can see, the distribution of people by locality is there, but there are too many duplicates, and the address is not given in full. Hence I was not satisfied with this.

On further searching and digging I finally came to www.jigsaw.com. Jigsaw's original intention is to host a repository of business lead contacts and provide a medium where people can exchange this information. My intention was different, but it turned out to be ideal. To start off, a simple search by the name "Ibrahim khan" returns the following result:

[Screenshot: Jigsaw search result for "Ibrahim khan"]

From the search result I am able to get the company, title, and name. When I click on the name itself, I get a few more details, as below:

[Screenshot: Jigsaw contact detail page]

Though the actual phone number and email address are hidden (you need to pay for them), the other information like name, job title, organization, and address seems realistic and valid. Hence this is a perfect data source for the data I need. I can live with auto-generating the email address (like firstname.lastname@somerandom.org) and phone number for these contacts.

NOTE: Jigsaw had a REST-based service to retrieve contact information, but at the time of this experimentation it was closed and I was not able to get an account to access it.

Strategy for data collection

Ideally, we would have a list of names in a database and could query for each name. Unfortunately, Jigsaw does not provide this. A workaround is to search on a last name (e.g. Carpenter), then iterate over the list of names (each with a URL) in the search result and retrieve the necessary information (name, title, organization, address, etc.). The only downside is that you have to come up with a list of common last names to use for the search.

You can get the list of common last names from the US Census, such as the 2000 census Genealogy Data: Frequently Occurring Surnames. The amount of data collected from each last-name search is good enough, at least for my case.
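For illustration, here is a minimal sketch of pulling the most frequent surnames out of that census list. It assumes the list has been downloaded and saved as a CSV with a "name" column sorted by rank; the file name and column name are placeholders, so adjust them to whatever you actually have.

    import csv

    def load_top_surnames(path="surnames.csv", limit=100):
        """Return the 'limit' most frequent surnames from the census CSV.
        Assumes the rows are already sorted by rank and that there is a
        'name' column (placeholder assumptions about the file layout)."""
        surnames = []
        with open(path, "r") as f:
            for row in csv.DictReader(f):
                surnames.append(row["name"].lower())
                if len(surnames) >= limit:
                    break
        return surnames

    # e.g. last_names = load_top_surnames(limit=50)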

For the data collection process, I am going to make two passes over JigSaw.com, based on the list of last names described above:
  1. First pass: for each last name, do a search and store the resulting link for each contact.
  2. Second pass: for each URL obtained above, do an HTTP request and extract the required set of data.
Also note that your program must log into Jigsaw and obtain an HTTPS session in order to retrieve all the data from the search result.

Tool

Having found the ideal data source, the next step is to find a way to extract the data from it. A well-known, popular methodology is to use a web crawler/site scraper.

Since the time I first learned about this methodology, it has come a long way. Now you can easily create site scrapers with scraping software (e.g. automationanywhere) or employ a SaaS (e.g. grepsr or mozenda) that provides sophisticated extraction, scheduling for scraping at intervals, and so on.

I could simply have used the above options (it would have saved a lot of time), but I wanted to get my hands dirty a little and understand what is involved in scraping and how it can be achieved. Hence I chose to program a little. I did not want to start from scratch, so I looked for an existing, well-known library that provided:
  • website crawling
  • a framework for crawling
  • the ability to discover links and crawl further
  • the ability to program in a language like Python, Java, etc.
  • handling of websites with HTTPS sessions
There are many frameworks and libraries that can achieve the above; for my exploration I chose Scrapy, as it had more references in forums (like StackOverflow) and Google search results. I also liked that Scrapy can store the output as CSV, JSON, or XML, based on command-line arguments.

In order to do web site scraping, you have to understand the web page's HTML source: inspect it and get to know the locations/XPaths of the elements you want. I recommend using the developer tools of Firefox or Chrome for this. I will not be explaining this process.

Scrapy and dynamically rendered web pages


Before jumping right into the code, I have to explain how to handle JavaScript-rendered web pages. I am covering this first, as I learned the hard way how these pages get rendered and how to use Scrapy on such dynamic content.

This section exists because the search result and many other pages on Jigsaw are dynamically rendered by JavaScript. Unless the web page is rendered, you will not get the list of contact names, URLs, etc.
Crawling with Scrapy is simple as long as the web pages are pure HTML. However, if the page content, sections, or links that you are interested in are rendered by JavaScript, it becomes difficult. The hard way is for your scraper to look at the whole HTML content and work out how the data could be extracted. Another way is to use a JavaScript rendering component/library that first reads the HTML and produces rendered output; again, this is hard.
Upon exploration I came across a workaround: scraping with Selenium. The Selenium WebDriver interacts with a browser (of your choice) and loads up the URL. Once the page has been rendered by the browser, we can use WebDriver classes to get the rendered web page. The following uses the Selenium WebDriver component. The logic of loading a URL and extracting from it (sketched in code right after this list) is:
  1. Open the URL using the Selenium WebDriver.
  2. Selenium opens a browser instance (Firefox in my case) and loads the URL.
  3. Once the URL is loaded, we wait a configurable amount of time for the browser to render the page.
  4. Once rendered, using classes from the Selenium WebDriver, we look for the content (sections, links, etc.) that is of interest to us and proceed further.
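A minimal sketch of that logic (the helper name and the wait time are my own, not the exact code from the project):

    import time
    from selenium import webdriver

    def get_rendered_page(url, wait_seconds=5):
        """Load a URL in a Selenium-driven Firefox, wait for the JavaScript
        to run, and return the rendered HTML of the page."""
        driver = webdriver.Firefox()      # step 2: open a browser instance (Firefox)
        try:
            driver.get(url)               # steps 1-2: load the URL in that browser
            time.sleep(wait_seconds)      # step 3: wait for the page to render
            return driver.page_source     # step 4: the rendered DOM, not the raw HTML
        finally:
            driver.quit()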

Code

The following sections delve into how I used Scrapy to crawl, scrape, and extract the data. A Scrapy scraper has to be developed in Python, so I am hoping that you know Python. As of today, I am not yet an expert in Python, but I know enough to develop simple solutions.

The code was developed in a Windows-based environment. If you want to use this code, I leave it to you to research how to install Python on Windows and how to install libraries like libxml, Scrapy, Selenium, etc.

The code is checked into the GitHub project Jigsaw Scrapper.git (will be updating this shortly).

Extracted data

As mentioned in the strategy, the code makes two passes/runs at Jigsaw in order to extract the data of interest. In the first pass, we extract the following:
  • Contact Name
  • URL for this contact
This is stored in a text file ("ContactsToURL.csv"). In the second pass, we retrieve the details of each contact and store them in a text file. The contact details that are extracted are:
  • Contact Name (First, Middle & Last)
  • Contact Id
  • Title
  • Company or Organization
  • Address
  • Phone (if present)
  • Added by
  • Last updated.
The above data structures are defined in the "Items.py" file, as required by Scrapy.
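For reference, a sketch of what those item definitions could look like (the class and field names here are my illustration, not necessarily the exact ones used in the repository):

    from scrapy.item import Item, Field

    class ContactURLItem(Item):
        # First pass: contact name plus the URL of the contact's detail page.
        name = Field()
        url = Field()

    class ContactDetailItem(Item):
        # Second pass: the full contact record.
        contact_id = Field()
        first_name = Field()
        middle_name = Field()
        last_name = Field()
        title = Field()
        company = Field()
        address = Field()
        phone = Field()
        added_by = Field()
        last_updated = Field()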
The output format is set on the Scrapy command line. For the first pass the data was stored in CSV format, and for the second pass I stored the output in JSON format.
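For example, the first pass would be run with something along the lines of "scrapy crawl contactURLs -o ContactsToURL.csv -t csv" and the second with "scrapy crawl contactDetails -o contacts.json -t json"; the spider names here are placeholders for whatever "name" attribute the spider classes declare, and -t is the export-format flag used by Scrapy versions of that era.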

Code logic

First pass : Searching and capturing the contact urls

As mentioned earlier, in the first pass the scraper does a search on a last name, iterates over the result set, and stores the contact name and contact URL.

This is done by the class in "SearchAndRetrieveContactsURL.py". The class extends Scrapy's CrawlSpider class, which is meant for crawling.

The list of last names is initialized in an array (though it could be read from a file as well). start_urls is empty, as we are going to override the default loading procedure.

We override the "start_requests" method to handle the logon process. We also instantiate the Selenium WebDriver and load up the logon URL. This ensures the Selenium-instantiated browser is loaded and logged in.

If the login is successful, the "after_login" method is called. This checks that the login was indeed a success and redirects to the contact search page.

Once on the search page, a search is done using a last name (initialized at the class level), and once we get the search result we parse it. This is done by the method "getLastNameSearchURL".

"parse_searchResult" then iterates over the result, capturing the contact name and URL for each contact, and if there is a next_page link, it crawls to the next page in a recursive fashion. The parsing of the search result page is done on the web page as rendered in the browser that was instantiated by Selenium.


On finishing one last name, the next last name is retrieved, and the searching and fetching process continues until all the names have been completed.
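Putting the above together, here is a trimmed-down sketch of what such a spider can look like. The login URL, form fields, credentials, and XPaths are placeholders (the real values come from inspecting Jigsaw's pages with the developer tools), the import paths follow the pre-1.0 Scrapy layout of that time, and only the first last name is searched; it is not the exact code from the repository.

    import time
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import FormRequest, Request
    from scrapy.item import Item, Field
    from selenium import webdriver

    class ContactURLItem(Item):
        # Same shape as the first-pass item in Items.py.
        name = Field()
        url = Field()

    class SearchAndRetrieveContactsURL(CrawlSpider):
        name = "contactURLs"
        last_names = ["carpenter", "smith", "khan"]   # could also be read from a file
        start_urls = []                               # default loading is overridden below

        def start_requests(self):
            # Start a Selenium-driven browser that will render the search pages;
            # it should be logged in here as well so it carries a valid session.
            self.driver = webdriver.Firefox()
            self.driver.get("https://www.jigsaw.com/login")            # placeholder URL
            # ...fill in and submit the login form through self.driver...
            # Log Scrapy's own HTTP session in too.
            return [FormRequest("https://www.jigsaw.com/login",        # placeholder URL
                                formdata={"username": "me@example.com",    # placeholder
                                          "password": "secret"},           # placeholder
                                callback=self.after_login)]

        def after_login(self, response):
            if "login" in response.url:                 # placeholder success check
                self.log("Login failed")
                return
            # Redirect to the contact-search page for the first last name.
            yield Request(self.getLastNameSearchURL(self.last_names[0]),
                          callback=self.parse_searchResult)

        def getLastNameSearchURL(self, last_name):
            return "https://www.jigsaw.com/searchContact?lastname=" + last_name   # placeholder

        def parse_searchResult(self, response):
            # The result table is built by JavaScript, so render the page in the
            # Selenium browser and read the contacts from the rendered DOM.
            self.driver.get(response.url)
            time.sleep(5)
            for link in self.driver.find_elements_by_xpath("//a[@class='contact']"):  # placeholder
                item = ContactURLItem()
                item["name"] = link.text
                item["url"] = link.get_attribute("href")
                yield item
            # Follow the next-page link, if any, in a recursive fashion.
            next_page = self.driver.find_elements_by_xpath("//a[@class='next']")      # placeholder
            if next_page:
                yield Request(next_page[0].get_attribute("href"),
                              callback=self.parse_searchResult)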

Second Pass: Fetching contact data

Given the contact URL, I found that it is not mandatory to log on to JigSaw: a simple URL load gets you the page, and it is already rendered, so there is no need to worry about dynamic rendering.

This is done by the scraper class in "RetrieveContactDetails.py". The scraper loads the text file containing the contact URLs and, for each URL, does an HTTP GET, extracts the data content, and moves on to the next one.

The class extends Scrapy's "BaseSpider" class, as crawling is not needed.
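A corresponding sketch for the second pass, again with placeholder XPaths, the pre-1.0 Scrapy import paths, and only a few of the output fields shown:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector

    class ContactDetailItem(Item):
        # Trimmed-down version of the second-pass item in Items.py.
        name = Field()
        title = Field()
        company = Field()
        address = Field()

    class RetrieveContactDetails(BaseSpider):
        name = "contactDetails"

        def start_requests(self):
            # Read the contact URLs captured in the first pass.
            with open("ContactsToURL.csv") as f:
                for line in f:
                    url = line.strip().split(",")[-1]   # assumes each row ends with the URL
                    if url.startswith("http"):          # also skips the CSV header line
                        yield Request(url, callback=self.parse)

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            item = ContactDetailItem()
            # XPaths are placeholders; the real ones come from inspecting the
            # contact page with the browser's developer tools.
            item["name"] = hxs.select("//h1/text()").extract()
            item["title"] = hxs.select("//span[@class='title']/text()").extract()
            item["company"] = hxs.select("//span[@class='company']/text()").extract()
            item["address"] = hxs.select("//div[@class='address']/text()").extract()
            return item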

Retrospective

As mentioned earlier, there are much simpler solutions for web site scraping using off-the-shelf software and SaaS providers. If it's a one-time setup, I would get a piece of software; if scraping needs to be done on a regular basis, I would recommend exploring a SaaS-provider-based solution, as it relieves you of the cost of hardware, skill set, etc.

There are also various books on site scraping for programmers to start off with, such as Web bots, Spiders, and Screen Scrapers.

Before employing any of the above, it is better to understand what you want to scrape, the copyright issues, the authentication mechanism, etc.

Last words


Till next time, "Vanakkam"

4 comments:

  1. Where do you define JigsawSearchContactResultItem?

  2. Good explanation, thank you. But I find that exploring the index (of companies or contacts) is better and more straightforward than doing a search by last name.
    If you go through the companies directory, like for example this one here (https://connect.data.com/directory/company/g), you will access more contacts that are already organized in tables, rather than accessing each contact page.
